Hivemall: Machine Learning on Hive, Pig and Spark SQL

Install Hivemall: https://github.com/myui/hivemall/wiki/Installation

Pick the latest release: https://github.com/myui/hivemall/releases

# Set up your environment in $HOME/.hiverc
add jar /home/myui/tmp/hivemall-core-xxx-with-dependencies.jar;
source /home/myui/tmp/define-all.hive;

# Create a directory in HDFS for the JAR
hadoop fs -mkdir -p /apps/hivemall
hdfs dfs -chmod -R 777 /apps/hivemall
cp hivemall-core-0.4.2-rc.2-with-dependencies.jar hivemall-with-dependencies.jar
hdfs dfs -put hivemall-with-dependencies.jar /apps/hivemall/
hdfs dfs -put hivemall-with-dependencies.jar /apps/hive/warehouse/
hdfs dfs -put hivemall-core-0.4.2-rc.2-with-dependencies.jar /apps/hivemall

# Verify the installation from Hive
show functions "hivemall.*";
+-----------------------------------------+--+
|                tab_name                 |
+-----------------------------------------+--+
| hivemall.add_bias                       |
| hivemall.add_feature_index              |
| hivemall.amplify                        |
| hivemall.angular_distance               |
| hivemall.angular_similarity             |
| hivemall.argmin_kld                     |
| hivemall.array_avg                      |
...
| hivemall.x_rank                         |
| hivemall.zscore                         |
+-----------------------------------------+--+
149 rows selected (0.054 seconds)

Once installed, the hivemall database contains functions (149 in this install) for general-purpose processing as well as machine learning, all callable from SQL.

One example is Base91 encoding, applied here to a deflate-compressed string:

select hivemall.base91(hivemall.deflate('aaaaaaaaaaaaaaaabbbbccc'));
+----------------------+--+
|         _c0          |
+----------------------+--+
| AA+=kaIM|WTt!+wbGAA  |
+----------------------+--+ 

A more useful example: I ran tokenize on the messages in a Hive table where I store tweets.

select hivemall.tokenize(tweets.msg) from tweets limit 10;

| ["water","pipe","break","#TEST","#TEST","#WATERMAINBREAK","FakeMockTown","NJ","https","//t","co/hLYaJnvAdH"]                                                                        |
| ["RT","@CNNNewsource","Main","water","pipe","break","causes","flooding","sinkhole","swallows","car","in","Hoboken","NJ","NE-009MO","https","//t","co/SDALHbs7kx"]                   |
| ["RT","@PaaSDev","#TEST","water","pipe","break","#TEST","Water","Main","Break","in","Fakeville","NJ","https","//t","co/ekbNXK1VgI"]                                                 |
| ["Water","break","on","a","mountain","run","tonight","#saopaulo","#correr","#run","sdfdf","https","//t","co/dvND6BkXl4"]                                                  |
| ["RT","@PaaSDev","water","pipe","break","#TEST","#TEST","#WATERMAINBREAK","FakeMockTown","NJ","https","//t","co/hLYaJnvAdH"]                                                        |
| ["Route","33","In","Wilton","Closed","Due","To","Water","Main","Break","https","//t","co/UQMksljRUm","https","//t","co/HRhin2QyOk"]                                                 |
| ["water","pipe","break","nj","#TEST","#TEST","#WATERMAINBREAK","https","//t","co/kvYNTG7wHf"]                                                                                       |
| ["water","pipe","break","nj","#TEST","test","https","//t","co/zjgjSaNvUz"]                                                                                                          |
| ["#TEST","#watermainbreak","water","main","break","pipe","test","nj","https","//t","co/qZEdnhlgYG"]                                                                                 |
| ["Customers","of","Langley","Water","and","Sewer","District","under","boil","water","advisory","-","Aiken","Standard","https","//t","co/yh3COaC70M","https","//t","co/LPRHBrtaTA"]  |
10 rows selected (4.848 seconds) 
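
Note that the URLs in the output are broken apart: a link such as https://t.co/hLYaJnvAdH comes back as "https", "//t" and "co/hLYaJnvAdH". This suggests that tokenize splits on punctuation such as ":" and "." as well as on whitespace, while leaving "#", "@", "/" and "-" alone. The Python sketch below mimics that behavior so the tokens can be predicted before running the query. The delimiter set is an assumption inferred from the rows above, not taken from Hivemall's source, and the Python tokenize function is only a hypothetical stand-in for the UDF.

```python
# Assumed delimiter set, inferred from the sample output above
# (NOT copied from Hivemall's source): whitespace plus common punctuation.
DELIMITERS = " \t\n\r\f.,?!:;()<>[]\"'"

# Map every delimiter character to a plain space.
_TO_SPACE = str.maketrans({ch: " " for ch in DELIMITERS})


def tokenize(text):
    """Approximate hivemall.tokenize: split on delimiters, drop empty tokens."""
    return text.translate(_TO_SPACE).split()


if __name__ == "__main__":
    # ':' and '.' are delimiters, so the URL falls into three pieces,
    # while '#', '/' and '-' stay inside their tokens.
    print(tokenize("water pipe break #TEST https://t.co/hLYaJnvAdH"))
```

If URLs matter for the analysis, one option is to extract them from the message before tokenizing, since the pieces cannot be reliably joined afterwards.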

For more usage examples, see https://github.com/myui/hivemall/wiki/webspam-dataset. I will be using Hivemall in future projects, and I expect to include it in a NiFi workflow for NLP processing and other machine learning operations. The project has just joined Apache.

Last update: 11-22-2016 11:00 PM