Created on 11-08-2017 04:51 PM - edited 08-17-2019 10:20 AM
H2O is an open-source machine learning platform for data scientists. Sparkling Water allows users to combine the fast, scalable machine learning algorithms of H2O with the capabilities of Spark.
In this tutorial, I will walk you through the steps required to set up H2O Sparkling Water (specifically PySparkling Water) along with Zeppelin in order to execute your machine learning scripts.
Here are a few points to note before I get started:
1.) There is a known issue when running Sparkling Water within Zeppelin. This issue is documented in this Jira (AttributeError: 'Logger' object has no attribute 'isatty'). To bypass this issue, I use Zeppelin combined with Livy Server to execute the Sparkling Water jobs. If you are not familiar with Apache Livy, it is a service that enables easy interaction with a Spark cluster over a REST interface.
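For readers new to Livy, the interaction boils down to REST calls like the one sketched below. The hostname is a placeholder (8998 is Livy's default port), and the command is printed rather than executed:

```shell
# Livy endpoint -- placeholder hostname; 8998 is the default Livy port
LIVY_URL="http://livy-host.example.com:8998"

# Request body for POST /sessions: start a new PySpark session
PAYLOAD='{"kind": "pyspark"}'

# Dry run: print the request; drop 'echo' to actually send it
echo curl -X POST -H "Content-Type: application/json" \
  -d "$PAYLOAD" "$LIVY_URL/sessions"
```

Zeppelin's Livy interpreter makes calls like this for you behind the scenes, which is why no manual REST work is needed in the steps that follow.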
2.) Testing was performed within the following environment:
CentOS Linux release 7.2.1511 (Core)
Python 2.7.5
Now, let's walk through the steps:
Step 1: Download Sparkling Water from the H2O release page, or log in to your Spark client node and run:
wget http://h2o-release.s3.amazonaws.com/sparkling-water/rel-2.1/16/sparkling-water-2.1.16.zip
Step 2: Unzip and move the PySparkling Water dependency to HDFS:
# Unzip Sparkling Water
unzip sparkling-water-2.1.16.zip

# Move the .zip dependency to a location within HDFS
# (make sure that this location is accessible from Zeppelin/Livy)
hadoop fs -put sparkling-water-2.1.16/py/build/dist/h2o_pysparkling_2.1-2.1.16.zip /tmp/.
Step 3: Ensure that the required Python libraries are installed on each datanode:
pip install tabulate
pip install future
pip install requests
pip install six
pip install numpy
pip install colorama
pip install --upgrade requests
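To push the same installs to every datanode in one pass, a small loop over the hosts works. The hostnames below are placeholders, and the loop only prints the ssh commands; remove `echo` to actually run them (assuming passwordless ssh to each node):

```shell
# Placeholder hostnames -- replace with your actual datanodes
DATANODES="datanode1 datanode2 datanode3"
PKGS="tabulate future requests six numpy colorama"

# Dry run: prints one install command per host; drop 'echo' to execute
for host in $DATANODES; do
  echo "ssh $host sudo pip install $PKGS"
done
```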
Step 4: Edit the HDFS configs by adding the following two parameters to custom core-site:
hadoop.proxyuser.livy.groups=*
hadoop.proxyuser.livy.hosts=*
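In Ambari these go under HDFS > Configs > Custom core-site; if you edit core-site.xml by hand instead, the same settings take the standard Hadoop property form (a sketch using the property names from the step above):

```xml
<!-- Allow the livy user to impersonate the submitting user (proxyuser) -->
<property>
  <name>hadoop.proxyuser.livy.groups</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.livy.hosts</name>
  <value>*</value>
</property>
```

Restart HDFS after changing these so the proxyuser settings take effect.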
Step 5: Add a new directory to HDFS for "admin" or the user(s) issuing Sparkling Water code:
hadoop fs -mkdir /user/admin
hadoop fs -chown admin:hdfs /user/admin
Step 6: Within Zeppelin, edit the Livy Interpreter and add a new parameter called livy.spark.submit.pyFiles (the value of this parameter should be your HDFS path to the PySparkling Water .zip file):
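The interpreter property ends up looking like the sketch below, assuming the .zip was uploaded to /tmp as in Step 2 (you can also use the fully qualified hdfs://namenode-host:8020/tmp/... form):

```properties
livy.spark.submit.pyFiles = hdfs:///tmp/h2o_pysparkling_2.1-2.1.16.zip
```

Remember to restart the Livy interpreter after saving so new sessions pick up the dependency.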
Step 7: Within Zeppelin, import libraries, initialize the H2OContext, then run your PySparkling Water Scripts:
%livy2.pyspark
from pysparkling import *

# Start (or attach to) the H2O cluster inside the Spark application
hc = H2OContext.getOrCreate(spark)

import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.deeplearning import H2ODeepLearningEstimator
from pyspark.sql.types import *
from pyspark.sql.functions import *

# Import data from HDFS into an H2OFrame and preview it
loans = h2o.import_file(path="hdfs://dzaratsian0.field.hortonworks.com:8020/tmp/loan_200k.csv", header=0)
loans.head()

# Convert the H2OFrame into a Spark DataFrame
df_loans = hc.as_spark_frame(loans)
df_loans.show(10)
Created on 11-14-2017 09:44 PM - edited 08-17-2019 10:20 AM
Below are the custom properties that go hand in hand with H2O Sparkling Water. Use these properties to modify the number of H2O cluster nodes, memory, cores, etc.
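As a hedged sketch, these are commonly used Spark and Sparkling Water properties for sizing the cluster; the values shown are illustrative only, and the spark.ext.h2o.* names come from the Sparkling Water configuration reference:

```properties
# Number and size of executors -- each executor hosts one H2O node
spark.executor.instances = 4
spark.executor.memory = 8g
spark.executor.cores = 4

# Threads used by each H2O node (-1 = use all available cores)
spark.ext.h2o.nthreads = -1

# Name of the H2O cloud (useful when several clusters share a site)
spark.ext.h2o.cloud.name = sparkling-water-demo

# Dynamic allocation is not recommended with Sparkling Water,
# since H2O nodes cannot be added or removed after the cloud forms
spark.dynamicAllocation.enabled = false
```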