
H2O is an open source machine learning platform for data scientists. Sparkling Water lets users combine the fast, scalable machine learning algorithms of H2O with the capabilities of Spark.

In this tutorial, I will walk you through the steps required to set up H2O Sparkling Water (specifically PySparkling Water) along with Zeppelin in order to execute your machine learning scripts.

Here are a few points to note before I get started:

1.) There is a known issue when running Sparkling Water within Zeppelin. This issue is documented in this Jira (AttributeError: 'Logger' object has no attribute 'isatty'). To bypass this issue, I use Zeppelin combined with Livy Server to execute the Sparkling Water jobs. If you are not familiar with Apache Livy, it is a service that enables easy interaction with a Spark cluster over a REST interface.

2.) Testing was performed within the following environment:

Hortonworks HDP 2.6.2

CentOS Linux release 7.2.1511 (Core)

Python 2.7.5

Spark 2.1.1

Zeppelin 0.7.2
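
To make the REST interface mentioned in note 1 a little more concrete, here is a minimal sketch of the request that opens a PySpark session in Livy. It only builds the request and does not send it. The host name and the .zip path are hypothetical placeholders, 8999 is assumed to be the Livy port, and the sketch is written for Python 3 on a client machine rather than the Python 2.7.5 used on the cluster.

```python
import json
import urllib.request


def build_livy_session_request(livy_url, py_files):
    """Build (but do not send) a POST /sessions request for a PySpark session."""
    payload = {"kind": "pyspark", "pyFiles": list(py_files)}
    return urllib.request.Request(
        url=livy_url.rstrip("/") + "/sessions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host and file path, for illustration only.
request = build_livy_session_request(
    "http://livy-host.example.com:8999",
    ["hdfs:///tmp/pysparkling-placeholder.zip"],
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

In this setup Zeppelin issues calls of this kind on your behalf, so you would not normally send the request by hand.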

Now, let's walk through the steps:

Step 1: Download Sparkling Water from here, or log in to your Spark client node and run:


Step 2: Unzip and move the PySparkling Water dependency to HDFS:

# Unzip Sparkling Water

# Move the .zip dependency to a location within HDFS (make sure that this location is accessible from Zeppelin/Livy)
hadoop fs -put sparkling-water-2.1.16/py/build/dist/ /tmp/.

Step 3: Ensure that the required Python libraries are installed on each datanode:

pip install tabulate
pip install future
pip install requests
pip install six
pip install numpy
pip install colorama
pip install --upgrade requests
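
One way to confirm that the installs succeeded on a given node is to try importing each library. The snippet below is a small sketch of that check. The helper name find_missing_modules is my own, and the module names are assumed to match the pip package names listed above.

```python
import importlib

# Module names assumed to match the pip packages installed in Step 3.
REQUIRED_MODULES = ["tabulate", "future", "requests", "six", "numpy", "colorama"]


def find_missing_modules(module_names):
    """Return the module names that cannot be imported on this node."""
    missing = []
    for name in module_names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


missing = find_missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules: " + ", ".join(missing))
else:
    print("All required modules are importable.")
```

Run it with the same Python interpreter that Spark uses on each datanode, because a library installed for a different interpreter will not be visible to PySpark.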

Step 4: Edit the HDFS configs by adding the two parameters to custom core-site:



Step 5: Add a new directory to HDFS for "admin" or the user(s) issuing Sparkling Water code:

hadoop fs -mkdir /user/admin
hadoop fs -chown admin:hdfs /user/admin

Step 6: Within Zeppelin, edit the Livy Interpreter and add a new parameter called livy.spark.submit.pyFiles (the value of this parameter should be your HDFS path to the PySparkling Water .zip file):
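
The parameter name follows Zeppelin's convention for the Livy interpreter: a Spark setting such as spark.submit.pyFiles is entered with a livy. prefix. The sketch below illustrates that naming convention only. The function livy_to_spark_conf is a hypothetical helper rather than Zeppelin's actual code, and the host and .zip path are placeholders.

```python
def livy_to_spark_conf(interpreter_properties):
    """Map 'livy.spark.*' interpreter properties to their 'spark.*' names."""
    prefix = "livy."
    return {
        key[len(prefix):]: value
        for key, value in interpreter_properties.items()
        if key.startswith("livy.spark.") and value
    }


# Hypothetical interpreter settings, for illustration only.
properties = {
    "zeppelin.livy.url": "http://livy-host.example.com:8999",
    "livy.spark.submit.pyFiles": "hdfs:///tmp/pysparkling-placeholder.zip",
}
spark_conf = livy_to_spark_conf(properties)
print(spark_conf)
```

Properties without the livy.spark. prefix, such as zeppelin.livy.url, configure the interpreter itself and are left out of the resulting Spark configuration.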


Step 7: Within Zeppelin, import libraries, initialize the H2OContext, then run your PySparkling Water Scripts:


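# Import PySparkling and start (or attach to) the H2O cluster on top of Spark
from pysparkling import *
hc = H2OContext.getOrCreate(spark)

# Import H2O and the estimators used for modeling
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.deeplearning import H2ODeepLearningEstimator
from pyspark.sql.types import *
from pyspark.sql.functions import *

# Load the data from HDFS into an H2OFrame (header=0 lets H2O guess whether the first row is a header)
loans = h2o.import_file(path="hdfs://", header=0)

# Convert the H2OFrame to a Spark DataFrame
df_loans = hc.as_spark_frame(loans)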
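
The script above imports H2OGradientBoostingEstimator but does not use it yet. As a possible next step, here is a hedged sketch of training a GBM on the loaded frame. The helper train_gbm and the response column name "bad_loan" are assumptions for illustration and do not come from the original data set.

```python
def train_gbm(frame, response, estimator_cls=None, **params):
    """Train a model on every column except the response and return it."""
    if estimator_cls is None:
        # Imported lazily so the helper can be defined without a running H2O cluster.
        from h2o.estimators.gbm import H2OGradientBoostingEstimator
        estimator_cls = H2OGradientBoostingEstimator
    # Use all remaining columns as predictors.
    predictors = [name for name in frame.columns if name != response]
    model = estimator_cls(**params)
    model.train(x=predictors, y=response, training_frame=frame)
    return model


# Example call inside Zeppelin, after the script above has run
# ("bad_loan" is a hypothetical column name):
# model = train_gbm(loans, "bad_loan", ntrees=50, max_depth=5)
```

For a classification problem, the response column would usually need to be converted to a categorical first, for example with loans["bad_loan"] = loans["bad_loan"].asfactor().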





Below are the custom properties that go hand in hand with H2O Sparkling Water. Use these properties to adjust the H2O cluster's node count, memory, cores, and so on.