Created on 01-10-2019 12:00 PM
Machine learning and supervised learning at scale is a common task that can be addressed with Apache Hive by leveraging Hivemall features and user-defined functions (UDFs) for HiveQL, which is strongly optimized for machine learning (ML) and data science. For example, logistic regression models, one of the basic tenants of ML, can also be easily built with Hivemall. The following example should explain the basic usage of Hivemall with examples of supervised learning of a simple regressor.
-> HDP 3.0.1 Sandbox - Hivemall 0.5.2 see Machine Learning with Apache Hive SQL using Hivemall
Create Table customer_training as select s1.customer_id, c1.country, c1.gender, c1.member_card,c1.num_cars_owned, count(s1.product_id) as purchase_cnt from sales_fact_dec_1998 s1 join customer c1 WHERE s1.customer_id = c1.customer_id group by s1.customer_id, c1.country, c1.gender, c1.member_card, c1.num_cars_owned;
customer_id | country | gender | member_card | purchase_cnt |
14 | USA | F | Bronze | 3 |
34 | USA | F | Silver | 14 |
39 | USA | F | Bronze | 6 |
48 | Mexico | M | Bronze | 8 |
55 | USA | M | Normal | 7 |
57 | Mexico | M | Golden | 8 |
74 | USA | M | Bronze | 6 |
76 | USA | F | Normal | 10 |
80 | USA | F | Bronze | 7 |
84 | Canada | M | Bronze | 1 |
87 | Canada | F | Bronze | 7 |
93 | USA | M | Golden | 8 |
98 | Canada | F | Bronze | 2 |
Create the feature vector for the customer that relevant i.e. country, gender, member_card and num_cars_owned. Hivemall can have quantitive (mumeric) features like num_cars_owned or categorical (char/strings) as gender or country.
create table customer_features (customer_id int, features array<string>, purchase_cnt int);
insert overwrite table customer_features select customer_id, hivemall.array_concat( hivemall.quantitative_features( array("num_cars_owned"),num_cars_owned ) , hivemall.categorical_features( array("country", "gender","member_card"), country, gender, member_card ) ) as features, sum(purchase_cnt) from customer_training group by customer_id, hivemall.array_concat( hivemall.quantitative_features( array("num_cars_owned"),num_cars_owned ) , hivemall.categorical_features( array("country", "gender","member_card"), country, gender, member_card ) ) ;
The customer_features should look like the following:
customer_id | features | purchase_cnt |
6 | ["num_cars_owned:4.0","country#USA","gender#F","member_card#Bronze"] | 4 |
10 | ["num_cars_owned:5.0","country#USA","gender#M","member_card#Golden"] | 5 |
12 | ["num_cars_owned:2.0","country#USA","gender#F","member_card#Bronze"] | 2 |
14 | ["num_cars_owned:3.0","country#USA","gender#F","member_card#Bronze"] | 3 |
24 | ["num_cars_owned:3.0","country#Mexico","gender#F","member_card#Bronze"] | 3 |
34 | ["num_cars_owned:14.0","country#USA","gender#F","member_card#Silver"] | 14 |
36 | ["num_cars_owned:6.0","country#USA","gender#M","member_card#Normal"] | 6 |
39 | ["num_cars_owned:6.0","country#USA","gender#F","member_card#Bronze"] | 6 |
42 | ["num_cars_owned:7.0","country#Mexico","gender#F","member_card#Golden"] | 7 |
48 | ["num_cars_owned:8.0","country#Mexico","gender#M","member_card#Bronze"] | 8 |
52 | ["num_cars_owned:5.0","country#Mexico","gender#F","member_card#Silver"] | 5 |
55 | ["num_cars_owned:7.0","country#USA","gender#M","member_card#Normal"] | 7 |
Now that we have the customers with a defined set of features, the next step is to create the regressor. This requires us to specify an appropriate -loss_function. As this is squared for train_regressor, this means this query builds a simple linear regressor with the squared loss.
create table if not exists customer_regressor as select hivemall.train_regressor( features, -- feature vector purchase_cnt, -- target value '-loss_function squared -optimizer AdaGrad -regularization l2' -- hyper-parameters ) as (feature, weight) from customer_features ;
Table: Customer_Regression table
feature | weight |
country#Canada | 0.036762696 |
country#Mexico | -0.07290672 |
country#USA | 0.009568159 |
gender#F | -0.035828743 |
gender#M | -0.052134056 |
member_card#Bronze | -0.015939986 |
member_card#Golden | 0.0020267756 |
member_card#Normal | -0.04439145 |
member_card#Silver | -0.059466142 |
num_cars_owned | 1.0083996 |
Now we can compute the prediction for the number of purchases of a new customer.
create table if not exists customer_new as select 1 as id, array("num_cars_owned:3.0","country#USA","gender#F","member_card#Silver") as features ; with features_exploded as ( select id, hivemall.extract_feature( fv ) as feature, hivemall.extract_weight( fv ) as value from customer_new t1 LATERAL VIEW explode(features) t2 as fv ) select t1.id, sum(p1.weight * t1.value) as predicted_num_purchases from features_exploded t1 LEFT OUTER JOIN customer_regressor p1 ON (t1.feature = p1.feature) group by t1.id ;
PREDICTED_NUM_PURCHASES 2.9394720904529095