Machine learning and supervised learning at scale is a common task that can be addressed with Apache Hive by leveraging Hivemall features and user-defined functions (UDFs) for HiveQL, which is strongly optimized for machine learning (ML) and data science. For example, logistic regression models, one of the basic tenants of ML, can also be easily built with Hivemall. The following example should explain the basic usage of Hivemall with examples of supervised learning of a simple regressor.
Database: foodmart - Table: sales_fact _dec_1998 and customer
Create Table customer_training as
select s1.customer_id, c1.country, c1.gender, c1.member_card,c1.num_cars_owned, count(s1.product_id) as purchase_cnt
from sales_fact_dec_1998 s1
join customer c1
WHERE s1.customer_id = c1.customer_id
group by s1.customer_id, c1.country, c1.gender, c1.member_card, c1.num_cars_owned;
customer_id
country
gender
member_card
purchase_cnt
14
USA
F
Bronze
3
34
USA
F
Silver
14
39
USA
F
Bronze
6
48
Mexico
M
Bronze
8
55
USA
M
Normal
7
57
Mexico
M
Golden
8
74
USA
M
Bronze
6
76
USA
F
Normal
10
80
USA
F
Bronze
7
84
Canada
M
Bronze
1
87
Canada
F
Bronze
7
93
USA
M
Golden
8
98
Canada
F
Bronze
2
Customer Feature Representation
Create the feature vector for the customer that relevant i.e. country, gender, member_card and num_cars_owned. Hivemall can have quantitive (mumeric) features like num_cars_owned or categorical (char/strings) as gender or country.
create table customer_features (customer_id int, features array<string>, purchase_cnt int);
Now that we have the customers with a defined set of features, the next step is to create the regressor. This requires us to specify an appropriate -loss_function. As this is squared for train_regressor, this means this query builds a simple linear regressor with the squared loss.
create table if not exists customer_regressor as
select
hivemall.train_regressor(
features, -- feature vector
purchase_cnt, -- target value
'-loss_function squared -optimizer AdaGrad -regularization l2' -- hyper-parameters
) as (feature, weight)
from
customer_features
;
Table: Customer_Regression table
feature
weight
country#Canada
0.036762696
country#Mexico
-0.07290672
country#USA
0.009568159
gender#F
-0.035828743
gender#M
-0.052134056
member_card#Bronze
-0.015939986
member_card#Golden
0.0020267756
member_card#Normal
-0.04439145
member_card#Silver
-0.059466142
num_cars_owned
1.0083996
Prediction
Now we can compute the prediction for the number of purchases of a new customer.
create table if not exists customer_new as
select 1 as id, array("num_cars_owned:3.0","country#USA","gender#F","member_card#Silver") as features
;
with features_exploded as (
select
id,
hivemall.extract_feature( fv ) as feature,
hivemall.extract_weight( fv ) as value
from customer_new t1 LATERAL VIEW explode(features) t2 as fv
)
select
t1.id,
sum(p1.weight * t1.value) as predicted_num_purchases
from
features_exploded t1
LEFT OUTER JOIN customer_regressor p1 ON (t1.feature = p1.feature)
group by
t1.id
;