Created on 11-08-2017 04:51 PM - edited 08-17-2019 10:20 AM
H2O is an open source deep learning technology for data scientists. Sparkling Water allows users to combine the fast, scalable machine learning algorithms of H2O with the capabilities of Spark.
In this tutorial, I will walk you through the steps required to set up H2O Sparkling Water (specifically PySparkling Water) along with Zeppelin so that you can execute your machine learning scripts.
Here are a few points to note before I get started:
1.) There is a known issue when running Sparkling Water within Zeppelin, documented in a Jira ticket (AttributeError: 'Logger' object has no attribute 'isatty'). To work around this issue, I use Zeppelin combined with Livy Server to execute the Sparkling Water jobs. If you are not familiar with Apache Livy, it is a service that enables interaction with a Spark cluster over a REST interface.
2.) Testing was performed within the following environment:
Step 2: Unzip and move the PySparkling Water dependency to HDFS:
# Unzip Sparkling Water
# Move the .zip dependency to a location within HDFS (make sure that this location is accessible from Zeppelin/Livy)
hadoop fs -put sparkling-water-2.1.16/py/build/dist/h2o_pysparkling_2.1-2.1.16.zip /tmp/.
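To illustrate why the HDFS location has to be reachable from Livy, here is a minimal sketch of how the uploaded .zip could be referenced when a PySpark session is created through Livy's REST API (POST /sessions with the pyFiles field). The Livy host and port below are placeholders, not values from this setup, and the request is only built, not sent. When working through Zeppelin, the Livy interpreter makes this call for you based on its own configuration.

```python
import json
import urllib.request

# Assumption: placeholder Livy endpoint; replace host and port with your cluster's values.
LIVY_URL = "http://livy-host:8998"

# HDFS location of the dependency uploaded with `hadoop fs -put` above.
PYSPARKLING_ZIP = "hdfs:///tmp/h2o_pysparkling_2.1-2.1.16.zip"


def build_session_request(livy_url, py_files):
    """Build (but do not send) a POST /sessions request for a PySpark session."""
    payload = {"kind": "pyspark", "pyFiles": list(py_files)}
    return urllib.request.Request(
        url=livy_url.rstrip("/") + "/sessions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_session_request(LIVY_URL, [PYSPARKLING_ZIP])
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# To actually create the session, pass the request to urllib.request.urlopen(request).
```

If Livy cannot read the path listed in pyFiles, the session will not be able to import pysparkling, which is why the upload location matters.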
Step 3: Ensure that the required Python libraries are installed on each datanode: