Community Articles
Find and share helpful community-sourced technical articles
Labels (2)
Cloudera Employee


Machine learning and supervised learning at scale is a common task that can be addressed with Apache Hive by leveraging Hivemall features and user-defined functions (UDFs) for HiveQL, which is strongly optimized for machine learning (ML) and data science. For example, logistic regression models, one of the basic tenants of ML, can also be easily built with Hivemall. The following example should explain the basic usage of Hivemall with examples of supervised learning of a simple regressor.

Setup the environment

-> HDP 3.0.1 Sandbox - Hivemall 0.5.2 see Machine Learning with Apache Hive SQL using Hivemall

Explore the Training Data

Database: foodmart - Table: sales_fact _dec_1998 and customer

Create Table customer_training as
select s1.customer_id,,  c1.gender, c1.member_card,c1.num_cars_owned, count(s1.product_id) as purchase_cnt
from sales_fact_dec_1998 s1
 join customer c1
WHERE s1.customer_id = c1.customer_id
group by s1.customer_id,,  c1.gender, c1.member_card, c1.num_cars_owned;
customer_id country gendermember_cardpurchase_cnt
14 USA F Bronze 3
34 USA F Silver 14
39 USA F Bronze 6
48 Mexico M Bronze 8
55 USA M Normal 7
57 Mexico M Golden 8
74 USA M Bronze 6
76 USA F Normal 10
80 USA F Bronze 7
84 Canada M Bronze 1
87 Canada F Bronze 7
93 USA M Golden 8
98 Canada F Bronze 2

Customer Feature Representation

Create the feature vector for the customer that relevant i.e. country, gender, member_card and num_cars_owned. Hivemall can have quantitive (mumeric) features like num_cars_owned or categorical (char/strings) as gender or country.

create table customer_features (customer_id int, features array<string>, purchase_cnt int);
insert overwrite table customer_features 
   hivemall.quantitative_features( array("num_cars_owned"),num_cars_owned ) ,
   hivemall.categorical_features( array("country", "gender","member_card"), country, gender, member_card )
	 ) as features,
  group by customer_id,
          hivemall.quantitative_features( array("num_cars_owned"),num_cars_owned ) ,
	 	  hivemall.categorical_features( array("country", "gender","member_card"), country, gender, member_card )

The customer_features should look like the following:

customer_idfeatures purchase_cnt
6 ["num_cars_owned:4.0","country#USA","gender#F","member_card#Bronze"] 4
10 ["num_cars_owned:5.0","country#USA","gender#M","member_card#Golden"] 5
12 ["num_cars_owned:2.0","country#USA","gender#F","member_card#Bronze"] 2
14 ["num_cars_owned:3.0","country#USA","gender#F","member_card#Bronze"] 3
24 ["num_cars_owned:3.0","country#Mexico","gender#F","member_card#Bronze"] 3
34 ["num_cars_owned:14.0","country#USA","gender#F","member_card#Silver"] 14
36 ["num_cars_owned:6.0","country#USA","gender#M","member_card#Normal"] 6
39 ["num_cars_owned:6.0","country#USA","gender#F","member_card#Bronze"] 6
42 ["num_cars_owned:7.0","country#Mexico","gender#F","member_card#Golden"] 7
48 ["num_cars_owned:8.0","country#Mexico","gender#M","member_card#Bronze"] 8
52 ["num_cars_owned:5.0","country#Mexico","gender#F","member_card#Silver"] 5
55 ["num_cars_owned:7.0","country#USA","gender#M","member_card#Normal"] 7


Now that we have the customers with a defined set of features, the next step is to create the regressor. This requires us to specify an appropriate -loss_function. As this is squared for train_regressor, this means this query builds a simple linear regressor with the squared loss.

create table if not exists customer_regressor as
    features, -- feature vector
    purchase_cnt, -- target value
    '-loss_function squared -optimizer AdaGrad -regularization l2' -- hyper-parameters
  ) as (feature, weight)

Table: Customer_Regression table

country#Canada 0.036762696
country#Mexico -0.07290672
country#USA 0.009568159
gender#F -0.035828743
gender#M -0.052134056
member_card#Bronze -0.015939986
member_card#Golden 0.0020267756
member_card#Normal -0.04439145
member_card#Silver -0.059466142
num_cars_owned 1.0083996


Now we can compute the prediction for the number of purchases of a new customer.

create table if not exists customer_new as
select 1 as id, array("num_cars_owned:3.0","country#USA","gender#F","member_card#Silver") as features
with features_exploded as (
    hivemall.extract_feature( fv ) as feature,
    hivemall.extract_weight( fv ) as value
  from customer_new t1 LATERAL VIEW explode(features) t2 as fv
  sum(p1.weight * t1.value) as predicted_num_purchases
  features_exploded t1
  LEFT OUTER JOIN customer_regressor p1 ON (t1.feature = p1.feature)
group by

Don't have an account?
Version history
Revision #:
1 of 1
Last update:
‎01-10-2019 12:00 PM
Updated by:
Top Kudoed Authors