Member since 07-18-2016
94 Posts
94 Kudos Received
20 Solutions
My Accepted Solutions
Views | Posted |
---|---|
2575 | 08-11-2017 06:04 PM |
2437 | 08-02-2017 11:22 PM |
9733 | 07-10-2017 03:36 PM |
17961 | 03-17-2017 01:27 AM |
14808 | 02-24-2017 05:35 PM |
11-08-2017
04:51 PM
5 Kudos
H2O is an open source deep learning technology for data scientists. Sparkling Water allows users to combine the fast, scalable machine learning algorithms of H2O with the capabilities of Spark. In this tutorial, I will walk you through the steps required to set up H2O Sparkling Water (specifically PySparkling Water) along with Zeppelin in order to execute your machine learning scripts.

Here are a few points to note before I get started:

1.) There is a known issue when running Sparkling Water within Zeppelin. This issue is documented in this Jira (AttributeError: 'Logger' object has no attribute 'isatty'). To bypass this issue, I use Zeppelin combined with Livy Server to execute the Sparkling Water jobs. If you are not familiar with Apache Livy, it is a service that enables easy interaction with a Spark cluster over a REST interface.

2.) Testing was performed within the following environment:
Hortonworks HDP 2.6.2
CentOS Linux release 7.2.1511 (Core)
Python 2.7.5
Spark 2.1.1
Zeppelin 0.7.2

Now, let's walk through the steps:

Step 1: Download Sparkling Water from here, or log in to your Spark client node and run:
wget http://h2o-release.s3.amazonaws.com/sparkling-water/rel-2.1/16/sparkling-water-2.1.16.zip

Step 2: Unzip and move the PySparkling Water dependency to HDFS:
# Unzip Sparkling Water
unzip sparkling-water-2.1.16.zip
# Move the .zip dependency to a location within HDFS (make sure that this location is accessible from Zeppelin/Livy)
hadoop fs -put sparkling-water-2.1.16/py/build/dist/h2o_pysparkling_2.1-2.1.16.zip /tmp/.

Step 3: Ensure that required python libraries are installed on each datanode:
pip install tabulate
pip install future
pip install requests
pip install six
pip install numpy
pip install colorama
pip install --upgrade requests

Step 4: Edit the HDFS configs by adding the following two parameters to custom core-site:
hadoop.proxyuser.livy.groups=*
hadoop.proxyuser.livy.hosts=*

Step 5: Add a new directory to HDFS for "admin" or the user(s) issuing Sparkling Water code:
hadoop fs -mkdir /user/admin
hadoop fs -chown admin:hdfs /user/admin
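
Before moving on to the Zeppelin configuration, it can help to confirm that the Step 3 libraries are actually importable on a given node. The following is a minimal sketch and not part of the original setup: the helper name `find_missing_modules` is hypothetical, and it assumes each pip package name matches its import name, which holds for the six packages listed in Step 3. It sticks to calls that exist in both Python 2.7 and Python 3.

```python
import importlib

# Import names of the packages installed in Step 3.
REQUIRED_MODULES = ["tabulate", "future", "requests", "six", "numpy", "colorama"]


def find_missing_modules(module_names):
    """Return the names in module_names that cannot be imported."""
    missing = []
    for name in module_names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing on this node: " + ", ".join(missing))
    else:
        print("All required libraries are importable.")
```

Running this with the same Python that Spark uses on each datanode should show quickly whether any pip install was skipped.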
Step 6: Within Zeppelin, edit the Livy Interpreter and add a new parameter called livy.spark.submit.pyFiles. The value of this parameter should be your HDFS path to the PySparkling Water .zip file (in this walkthrough, the /tmp location used in Step 2).

Step 7: Within Zeppelin, import libraries, initialize the H2OContext, then run your PySparkling Water scripts:
%livy2.pyspark
from pysparkling import *
hc = H2OContext.getOrCreate(spark)
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.deeplearning import H2ODeepLearningEstimator
from pyspark.sql.types import *
from pyspark.sql.functions import *
loans = h2o.import_file(path="hdfs://dzaratsian0.field.hortonworks.com:8020/tmp/loan_200k.csv", header=0)
loans.head()
df_loans = hc.as_spark_frame(loans)
df_loans.show(10)

References:
Hortonworks HDP 2.6.2
H2O Downloads
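
For readers who want to see what the Livy side of this setup looks like outside Zeppelin: Livy's REST API accepts a `pyFiles` list when a session is created via POST /sessions. The sketch below only builds that request body and does not contact a server. The helper name `build_livy_session_request`, the `hdfs:///tmp/...` path form, and the `admin` proxy user are assumptions for illustration, based on Steps 2 and 5; this is not a description of how Zeppelin itself is implemented.

```python
import json


def build_livy_session_request(py_files, proxy_user=None):
    """Build the JSON body for creating a PySpark session through Livy."""
    body = {"kind": "pyspark", "pyFiles": list(py_files)}
    if proxy_user is not None:
        # Only meaningful if the livy proxyuser settings from Step 4 are in place.
        body["proxyUser"] = proxy_user
    return json.dumps(body, sort_keys=True)


if __name__ == "__main__":
    # Assumed HDFS location of the PySparkling .zip uploaded in Step 2.
    payload = build_livy_session_request(
        ["hdfs:///tmp/h2o_pysparkling_2.1-2.1.16.zip"], proxy_user="admin"
    )
    print(payload)
```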
08-23-2017
01:25 PM
Hi @Lukas Müller
You need to import the required function:
from pyspark.sql.functions import struct
You could also just import everything from the module:
from pyspark.sql.functions import *
08-22-2017
01:47 PM
@Lukas Müller
This should work for you:
from pyspark.sql.types import *
from pyspark.sql.functions import udf, struct
# Create your UDF object (which accepts your python function called "my_udf")
udf_object = udf(my_udf, ArrayType(StringType()))
# Apply the UDF to your Dataframe (called "df")
new_df = df.withColumn("new_column", udf_object(struct([df[x] for x in df.columns])))

If you want to make this better, replace "ArrayType(StringType())" with a schema such as:
schema = ArrayType(StructType([
    StructField("mychar", StringType(), False),
    StructField("myint", IntegerType(), False)
]))

Hope this helps!
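
The post above does not show `my_udf` itself. As a purely hypothetical illustration, a function compatible with `ArrayType(StringType())` could take the struct of all columns (which arrives in Python as a Row and can be iterated like a tuple) and return a list of strings. The body below is an assumption, written in plain Python so it can be tried without a Spark session.

```python
def my_udf(row):
    """Hypothetical UDF body: turn every non-null column value into a string."""
    return [str(value) for value in row if value is not None]


if __name__ == "__main__":
    # A plain tuple stands in for the Row that Spark would pass in.
    print(my_udf(("abc", 42, None)))
```

With the StructType schema variant, the function would instead need to return a list of (mychar, myint) pairs, for example [("abc", 42)].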
08-11-2017
06:28 PM
@Lukas Müller Great, happy you got it working!
08-11-2017
06:04 PM
1 Kudo
Hi @Lukas Müller, does this work for you?
./bin/spark-submit --jars /path/to/file1.jar,/path/to/file2.jar --packages com.databricks:spark-csv_2.x:x.x.x pyspark_code.py
What version of Spark are you using? If you are using Spark 2.0+, you should not need to specify the spark-csv package, since CSV support is built into Spark as of 2.0 (just as an FYI for you). Please let me know if this helps.
08-04-2017
01:09 PM
Thanks @Hugo Felix, I'll refer to your other posts to tackle the additional issues. I wonder if there's an issue in the way you are storing the Twitter data within Hive. Here's an older post that details the SerDe and Hive queries necessary to read the Twitter data on an older version of HDP. You may find this helpful: https://hortonworks.com/blog/howto-use-hive-to-sqlize-your-own-tweets-part-two-loading-hive-sql-queries/
08-03-2017
12:28 AM
@Hugo Felix It could solve the problem because SerDes are built-in and updated along with Hive updates. It works fine for me in recent versions of HDP, so that is why I wanted to mention it. I saw that you opened another question specifically for the Hive SerDe issue. Your original question "How to make hive queries include scala and python functions" is answered as part of my posts, so when you get a chance, could you please accept the best answer? There are a lot of responses in this thread, so that may help someone else out.

I do have one other thought to debug your JSON SerDe error. It could be that the way that you stored JSON within Hive is incorrect. If that is the case, then when you try to execute a Python UDF against that Hive record, it isn't able to find the right structure. If you execute a "select *" against your Hive table, how does the output look?
08-02-2017
11:45 PM
@Gordon Banker Open up the Zeppelin UI, then click on "Interpreter" within the dropdown menu in the upper right-hand corner. From there, you can scroll down to the Spark interpreter (or do a search for "python") and you will see a field called "zeppelin.pyspark.python". You can change this value to point to your alternative python location (i.e. change python to something like /path/to/new/bin/python). Let me know if that helps.
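
To confirm that the new setting took effect, something like the following could be run in a Zeppelin paragraph bound to the PySpark interpreter. This is a small sketch: the helper name `describe_python` is hypothetical, and it only reports the interpreter of the process that runs the paragraph, not the Python used on the executors.

```python
import sys


def describe_python():
    """Return the path and version of the Python interpreter running this code."""
    return {
        "executable": sys.executable,
        "version": "%d.%d.%d" % tuple(sys.version_info[:3]),
    }


if __name__ == "__main__":
    info = describe_python()
    print(info["executable"])
    print(info["version"])
```

If the printed path is still the old interpreter, restarting the interpreter from the same Zeppelin page is usually the first thing to try.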
08-02-2017
11:22 PM
@chandramouli muthukumaran You'll want to install sklearn (pip install -U scikit-learn) and spark-sklearn on all datanodes of the cluster, as well as other relevant python packages such as numpy, scipy, etc. I'd also recommend using YARN as the resource manager, so you are on the right path there. Hope this helps!
08-01-2017
02:42 AM
@Hugo Felix Thanks for the update! Hortonworks is currently on HDP 2.6, so if you have the option, it sounds like it would be beneficial to upgrade. Also as a quick reference, the most recent Hortonworks Sandbox can be downloaded from here: https://hortonworks.com/downloads/#sandbox Here's a link to the documentation as well: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.1/index.html