Member since: 07-18-2016
Posts: 94
Kudos Received: 94
Solutions: 20
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1430 | 08-11-2017 06:04 PM
 | 1178 | 08-02-2017 11:22 PM
 | 3033 | 07-10-2017 03:36 PM
 | 12944 | 03-17-2017 01:27 AM
 | 11399 | 02-24-2017 05:35 PM
04-12-2018
03:28 PM
Repo Info: GitHub Repo URL: https://github.com/zaratsian/Spark/blob/master/pysparkling_water_loan_prediction.py (GitHub account: zaratsian, repo: Spark, file: pysparkling_water_loan_prediction.py)
Tags: ambari-extensions, h2o, pyspark, solutions, Spark
11-08-2017
04:51 PM
5 Kudos
H2O is an open source machine learning and deep learning platform for data scientists. Sparkling Water allows users to combine the fast, scalable machine learning algorithms of H2O with the capabilities of Spark. In this tutorial, I will walk you through the steps required to set up H2O Sparkling Water (specifically PySparkling Water) along with Zeppelin in order to execute your machine learning scripts.

Here are a few points to note before I get started:

1.) There is a known issue when running Sparkling Water within Zeppelin, documented in this Jira (AttributeError: 'Logger' object has no attribute 'isatty'). To bypass this issue, I use Zeppelin combined with Livy Server to execute the Sparkling Water jobs. If you are not familiar with Apache Livy, it is a service that enables easy interaction with a Spark cluster over a REST interface.
2.) Testing was performed within the following environment: Hortonworks HDP 2.6.2, CentOS Linux release 7.2.1511 (Core), Python 2.7.5, Spark 2.1.1, Zeppelin 0.7.2.

Now, let's walk through the steps:

Step 1: Download Sparkling Water from here, or log in to your Spark client node and run:

wget http://h2o-release.s3.amazonaws.com/sparkling-water/rel-2.1/16/sparkling-water-2.1.16.zip

Step 2: Unzip and move the PySparkling Water dependency to HDFS:

# Unzip Sparkling Water
unzip sparkling-water-2.1.16.zip
# Move the .zip dependency to a location within HDFS (make sure that this location is accessible from Zeppelin/Livy)
hadoop fs -put sparkling-water-2.1.16/py/build/dist/h2o_pysparkling_2.1-2.1.16.zip /tmp/.

Step 3: Ensure that the required Python libraries are installed on each datanode:

pip install tabulate
pip install future
pip install requests
pip install six
pip install numpy
pip install colorama
pip install --upgrade requests

Step 4: Edit the HDFS configs by adding these two parameters to the custom core-site:

hadoop.proxyuser.livy.groups=*
hadoop.proxyuser.livy.hosts=*

Step 5: Add a new HDFS directory for "admin" (or whichever user(s) will be issuing Sparkling Water code):

hadoop fs -mkdir /user/admin
hadoop fs -chown admin:hdfs /user/admin
Step 6: Within Zeppelin, edit the Livy interpreter and add a new parameter called livy.spark.submit.pyFiles (the value of this parameter should be the HDFS path to the PySparkling Water .zip file).

Step 7: Within Zeppelin, import the libraries, initialize the H2OContext, then run your PySparkling Water scripts:

%livy2.pyspark
from pysparkling import *
hc = H2OContext.getOrCreate(spark)
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.deeplearning import H2ODeepLearningEstimator
from pyspark.sql.types import *
from pyspark.sql.functions import *
loans = h2o.import_file(path="hdfs://dzaratsian0.field.hortonworks.com:8020/tmp/loan_200k.csv", header=0)
loans.head()
df_loans = hc.as_spark_frame(loans)
df_loans.show(10)

References: Hortonworks HDP 2.6.2, H2O Downloads
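As a quick follow-on to Step 7 (not part of the original setup), here is a minimal sketch that trains one of the imported estimators on the H2O frame. The column names "bad_loan" and "loan_amount" are placeholders; substitute the columns from your own loan dataset:

# Train a GBM on the H2O frame (column names below are placeholders)
loans["bad_loan"] = loans["bad_loan"].asfactor()            # treat the label as categorical
train, test = loans.split_frame(ratios=[0.8], seed=1234)    # 80/20 train/test split

gbm = H2OGradientBoostingEstimator(ntrees=50, max_depth=5, seed=1234)
gbm.train(x=["loan_amount"], y="bad_loan", training_frame=train, validation_frame=test)

print(gbm.auc(valid=True))   # validation AUC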
Tags: Data Science & Advanced Analytics, FAQ, h2o, How-To/Tutorial, livy, Spark, zeppelin
08-23-2017
01:25 PM
Hi @Lukas Müller You need to import the required function:

from pyspark.sql.functions import struct

You could also import everything from the module:

from pyspark.sql.functions import *
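As a quick illustration of what struct does (the columns here are made up), it packs several columns into a single column that can then be passed to a UDF:

from pyspark.sql import SparkSession
from pyspark.sql.functions import struct

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

# Combine the two columns into one struct column
df.withColumn("combined", struct(df["id"], df["label"])).show()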
08-22-2017
01:47 PM
@Lukas Müller
This should work for you:

from pyspark.sql.types import *
from pyspark.sql.functions import udf
# Create your UDF object (which accepts your python function called "my_udf")
udf_object = udf(my_udf, ArrayType(StringType()))
# Apply the UDF to your Dataframe (called "df")
new_df = df.withColumn("new_column", udf_object(struct([df[x] for x in df.columns])))

If you want to make this better, replace "ArrayType(StringType())" with an explicit schema such as:

schema = ArrayType(StructType([
StructField("mychar", StringType(), False),
StructField("myint", IntegerType(), False)
]))

A complete end-to-end sketch is included below. Hope this helps!
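For completeness, here is a minimal end-to-end sketch of the same pattern (my_udf and the columns here are hypothetical placeholders, not from your code):

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, struct
from pyspark.sql.types import ArrayType, StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "foo"), (2, "bar")], ["id", "text"])

# Placeholder python function: receives an entire Row and returns a list of strings
def my_udf(row):
    return [str(row.id), row.text.upper()]

udf_object = udf(my_udf, ArrayType(StringType()))

# Pass all columns to the UDF as a single struct column
new_df = df.withColumn("new_column", udf_object(struct([df[x] for x in df.columns])))
new_df.show(truncate=False)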
08-16-2017
05:11 PM
@Félicien Catherin
Did you modify the zeppelin-env.sh at all? That command you are running should work fine, so I'm wondering if something was modified within your configs as part of the install. I am running HDP 2.6 and was able to run a simple test (as you did above). It worked as expected. I've included the commands that I ran as well as the parameter settings within my Zeppelin interpreter. Hopefully this is helpful as a comparison for you.
08-11-2017
06:28 PM
@Lukas Müller Great, happy you got it working!
08-11-2017
06:04 PM
1 Kudo
Hi @Lukas Müller, does this work for you?

./bin/spark-submit --jars /path/to/file1.jar,/path/to/file2.jar --packages com.databricks:spark-csv_2.x:x.x.x pyspark_code.py

What version of Spark are you using? If you are using Spark 2.0+, you should not need to specify the spark-csv package at all, since CSV support is built in (just as an FYI). Please let me know if this helps.
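For reference, on Spark 2.0+ the built-in CSV reader covers the same use case without any extra packages (the path below is a placeholder):

# Spark 2.0+: CSV support is built in, no spark-csv package required
df = spark.read.option("header", "true").option("inferSchema", "true").csv("/path/to/data.csv")
df.show(10)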
08-04-2017
01:09 PM
Thanks @Hugo Felix, I'll refer to your other posts to tackle the additional issues. I wonder if there's an issue in the way you are storing the Twitter data within Hive. Here's an older post, but it details the serde and Hive queries necessary to read Twitter data on an older version of HDP. You may find this helpful: https://hortonworks.com/blog/howto-use-hive-to-sqlize-your-own-tweets-part-two-loading-hive-sql-queries/
08-03-2017
12:28 AM
@Hugo Felix It could solve the problem, because the serdes are built-in and updated along with Hive updates. It works fine for me in recent versions of HDP, which is why I wanted to mention it. I saw that you opened another question specifically for the Hive serde issue. Your original question, "How to make hive queries include scala and python functions", is answered as part of my posts, so when you get a chance, could you please accept the best answer? There are a lot of responses in this thread, so that may help someone else out. I do have one other thought for debugging your JSON serde error: it could be that the way the JSON was stored within Hive is incorrect. If that is the case, then when you try to execute a python UDF against that Hive record, it isn't able to find the right structure. If you execute a "select *" against your Hive table, how does the output look?
08-02-2017
11:45 PM
@Gordon Banker Open up the Zeppelin UI, then click on "Interpreter" within the dropdown menu in the upper right-hand corner. From there, you can scroll down to the Spark interpreter (or do a search for "python") and you will see a field called "zeppelin.pyspark.python". You can change this value to point to your alternative python location (i.e. change python to something like /path/to/new/bin/python). Let me know if that helps.
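One quick way to verify the change after restarting the interpreter is to check which Python executable the %pyspark paragraph is actually using (a simple sanity check, not from the original question):

%pyspark
import sys
print(sys.executable)   # should point to your new /path/to/new/bin/python
print(sys.version)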
08-02-2017
11:22 PM
@chandramouli muthukumaran You'll want to install sklearn (pip install -U scikit-learn) and spark-sklearn on all datanodes of the cluster, as well as other relevant python packages such as numpy, scipy, etc. I'd also recommend using YARN as the resource manager, so you are on the right path there. Hope this helps!
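As a rough sketch of what spark-sklearn usage looks like once everything is installed (based on the spark-sklearn project; verify the exact API against the version you install):

from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from spark_sklearn import GridSearchCV   # distributes the parameter search across the cluster

digits = datasets.load_digits()
param_grid = {"max_depth": [3, None], "n_estimators": [10, 50]}

# sc is the SparkContext; each parameter combination is trained as a Spark task
gs = GridSearchCV(sc, RandomForestClassifier(), param_grid)
gs.fit(digits.data, digits.target)
print(gs.best_params_)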
08-01-2017
02:42 AM
@Hugo Felix Thanks for the update! Hortonworks is currently on HDP 2.6, so if you have the option, it sounds like it would be beneficial to upgrade. Also as a quick reference, the most recent Hortonworks Sandbox can be downloaded from here: https://hortonworks.com/downloads/#sandbox Here's a link to the documentation as well: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.1/index.html
07-24-2017
12:21 PM
@Hugo Felix Nice! Glad to see that you got it working. If you upgrade to Spark 2.x, then you should not need to add the serde (just something to keep in mind). If you're all set, can you please mark the thread as accepted? Thanks!
07-20-2017
04:13 PM
1 Kudo
@Hugo Felix Here's a test that should help determine whether your syntax is off or your environment is misconfigured. First, create a test Hive table and populate it with data:

CREATE TABLE IF NOT EXISTS testtable
(id string, text string)
STORED AS ORC;
INSERT INTO TABLE testtable VALUES
('1111', 'The service was great, and the agent was very helpful'),
('2222', 'I enjoyed the event but the food was terrible'),
('3333', 'Unhappy with the organization of the event');
Then create a file called "my_py_udf.py" (as shown below). It can be placed anywhere, but in my example I placed it at /tmp/my_py_udf.py.

import sys
for line in sys.stdin:
    id, text = line.replace('\n',' ').split('\t')
    positive = set(["love", "good", "great", "happy", "cool", "best", "awesome", "nice", "helpful", "enjoyed"])
    negative = set(["hate", "bad", "stupid", "terrible", "unhappy"])
    words = text.split()
    word_count = len(words)
    positive_matches = [1 for word in words if word in positive]
    negative_matches = [-1 for word in words if word in negative]
    st = sum(positive_matches) + sum(negative_matches)
    if st > 0:
        print '\t'.join([text, 'positive', str(word_count)])
    elif st < 0:
        print '\t'.join([text, 'negative', str(word_count)])
    else:
        print '\t'.join([text, 'neutral', str(word_count)])
Then, from within Hive, execute the following commands:

ADD FILE /tmp/my_py_udf.py;
SELECT
TRANSFORM (id, text)
USING 'python my_py_udf.py'
AS (text, sentiment, word_count)
FROM testtable;
Your resulting output should look like this:

The service was great, and the agent was very helpful    positive    10
I enjoyed the event but the food was terrible neutral 9
Unhappy with the organization of the event neutral 7
07-17-2017
08:02 PM
@Hugo Felix Yeah, that was the next test I had in mind for you. Thanks for sharing the results. Can you share the environment you are working in? Are you using Spark 1.6.x or Spark 2.x? Also, what version of HDP are you using?
07-17-2017
04:21 PM
@Hugo Felix What happens if you replace

created_at, user.screen_name, text = line.split('\t')

with

created_at, screen_name, text = line.split('\t')

I do not believe Python will be able to handle user.screen_name as an assignment target in the context you are writing it. In my example, the function accepts a tab-delimited line; when you perform the split('\t'), it parses the line out into X number of variables. The names of the assigned variables (such as created_at, screen_name, text) are arbitrary (you could name them x, y, z if you wanted, but you would then have to make sure the rest of the python script used the x, y, z variable names). A tiny standalone illustration is below. Give that a try and let me know if it helps. Thanks.
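For instance (the line content here is made up):

# The names on the left of the assignment are arbitrary identifiers;
# they simply receive the fields produced by split('\t') in order.
line = "Mon Jul 17 2017\tsome_user\tsome tweet text"
created_at, screen_name, text = line.split('\t')
print(screen_name)   # -> some_user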
07-12-2017
08:01 PM
2 Kudos
@Hugo Felix It might be an out of memory issue, but it could also be a variable/column mismatch. Can you share your python function and the Hive query so that I can review?
07-10-2017
03:36 PM
3 Kudos
@Hugo Felix One option is to implement these functions as a Hive UDF (written in Python). For example, your new Python function (my_py_udf.py) would look something like this:

import sys
for line in sys.stdin:
    createdAt, screenName, text = line.replace('\n',' ').split('\t')
    positive = set(["love", "good", "great", "happy", "cool", "best", "awesome", "nice"])
    negative = set(["hate", "bad", "stupid"])
    words = text.split()
    word_count = len(words)
    positive_matches = [1 for word in words if word in positive]
    negative_matches = [-1 for word in words if word in negative]
    st = sum(positive_matches) + sum(negative_matches)
    if st > 0:
        print '\t'.join([text, 'positive', str(word_count)])
    elif st < 0:
        print '\t'.join([text, 'negative', str(word_count)])
    else:
        print '\t'.join([text, 'neutral', str(word_count)])
NOTE: This function combines both of your previous functions into one (since you can calculate word count and sentiment in a single pass). To call this UDF within Hive, run Hive code similar to this:

ADD FILE /home/hive/my_py_udf.py;
SELECT
TRANSFORM (createdAt, screenName, text)
USING 'python my_py_udf.py'
AS text,
sentiment,
word_count
FROM tweets;
Hope this helps!
05-11-2017
01:04 PM
@Amit Panda You can reference these docs for more information regarding Zeppelin security with Active Directory: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.0/bk_zeppelin-component-guide/content/config-secure-prod-ad.html. Have you seen these docs? You do not want to hardcode the username/password in the shiro config file. Rather, you will want to add the authentication settings and optionally use a self-signed certificate. Let me know if you'd like more info - hope this helps!
03-28-2017
01:39 AM
7 Kudos
In this article, I'll show how to analyze a real-time data stream using Spark Structured Streaming. I wanted to provide a quick Structured Streaming example that shows an end-to-end flow from source (Twitter), through Kafka, and then data processing using Spark. To accomplish this, I used Apache NiFi (part of Hortonworks HDF) to capture the Twitter data and send it to Apache Kafka.
From here, Spark was used to consume each Twitter payload (as JSON), parse, and analyze the data in real-time.
Before I jump into the technical details, it's good to understand some of the business value of this process. There are many practical applications based on this technology stack as well as the code that I describe below. Here are a few of the business applications that can be implemented using this technology:

Security: Real-time log analysis allows organizations to extract IP addresses, ports, services and events. This data can then be aggregated based on a time window, analyzed, and monitored for abnormal activity. Spark Structured Streaming also supports real-time joins with static data, further enriching the logs by incorporating external data such as location, detailed user information, and historical data.

Sensors & IoT: When working with sensors, out-of-order data is a challenge. The order of readings is critical for identifying patterns and behavior within an environment. One of the goals of Spark Structured Streaming is to maintain order using "watermarking", which enables the engine to automatically track the current event time within the data and attempt to clean up old state accordingly.

Web Analytics: Structured Streaming can be used to route, process, and aggregate clickstream data from your website. This analysis can feed external systems in order to proactively send notifications to users, launch a web form, or trigger an action in a 3rd-party system.

Call Center: Identify trends related to call volume, response times, emerging topics, at-risk customers, and cross-sell opportunities. Spark is capable of processing both structured and unstructured call records to address these business needs.

Social Media: Analyze social feeds in real-time to detect influencers, trending topics, abnormal volume, or other indicators. All of this can be monitored and aggregated within a defined time window. The example below goes into more detail on this specific use case.
Here are the technical details associated with this application:

Step 1: Connect to Twitter and stream the data to Kafka

A simple NiFi flow was used to capture the Twitter data using the GetTwitter NiFi processor, which consumes data from the Twitter Streaming API. Using PutKafka, I was able to push the JSON payload to a Kafka topic called "dztopic1". Below is a screenshot that shows this NiFi flow.

Step 2: Use Spark to read the Kafka Stream

Prerequisite: Before you launch Spark, make sure that you have included the required artifact/dependency as described here: spark-sql-kafka-0-10_2.11. If you want to add this via the PySpark command line, you can run something like this:

./bin/pyspark --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0

Spark Structured Streaming subscribes to our Kafka topic using the code shown below:

# Consume Kafka topic
events = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "dztopic1") \
    .load()
# Cast the JSON payload as a String
events = events.selectExpr("CAST(value AS STRING)")
Step 3: Define Python UDFs

I created two Python functions. The first, parse_json, parses the Twitter JSON payload and extracts each field of interest. The second, convert_twitter_date, converts the Twitter created_at timestamp into a PySpark timestamp, which is used for windowing. I like using Python UDFs, but note that there are other ways to parse JSON and convert the timestamp field.

Function: parse_json

import json   # required by parse_json

def parse_json(df):
    twitterid = str(json.loads(df[0])['id'])
    created_at = str(json.loads(df[0])['created_at'])
    tweet = str(json.loads(df[0])['text'])
    screen_name = str(json.loads(df[0])['user']['screen_name'])
    return [twitterid, created_at, tweet, screen_name]
Function: convert_twitter_date

import datetime   # required by convert_twitter_date

def convert_twitter_date(timestamp_str):
    output_ts = datetime.datetime.strptime(timestamp_str.replace('+0000 ',''), '%a %b %d %H:%M:%S %Y')
    return output_ts
Step 4: Parse JSON within Spark

Once we have the JSON string, I used the two Python UDFs to parse each payload, convert the timestamp, and output the relevant dataframe columns (created_at, screen_name, tweet, and created_at_ts).

json_schema = StructType([
StructField("twitterid", StringType(), True),
StructField("created_at", StringType(), True),
StructField("tweet", StringType(), True),
StructField("screen_name", StringType(), True)
])
udf_parse_json = udf(parse_json , json_schema)
udf_convert_twitter_date = udf(convert_twitter_date, TimestampType())
jsonoutput = events.withColumn("parsed_field", udf_parse_json(struct([events[x] for x in events.columns]))) \
    .where(col("parsed_field").isNotNull()) \
    .withColumn("created_at", col("parsed_field.created_at")) \
    .withColumn("screen_name", col("parsed_field.screen_name")) \
    .withColumn("tweet", col("parsed_field.tweet")) \
    .withColumn("created_at_ts", udf_convert_twitter_date(col("parsed_field.created_at")))
Step 5: Spark Windowing

I used Spark's window operation to capture trending screen_names, aggregating within 1-minute windows with a slide duration of 15 seconds.

# http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.window
# pyspark.sql.functions.window(timeColumn, windowDuration, slideDuration=None, startTime=None)
windowedCounts = jsonoutput.groupBy(
window(jsonoutput.created_at_ts, "1 minutes", "15 seconds"),
jsonoutput.screen_name
).count()
Step 6: Start Spark Structured Streaming

Start the Spark Structured Streaming queries. In this case, I am launching two queries: one that contains the results from our time window (query_window) and another (query_json) that contains the parsed JSON records. Both tables are stored in memory. Typically, you would use a different Spark sink, such as writing the results back to Kafka or persisting to HDFS, but for this example (and for debugging) I am writing the results to two in-memory tables.

query_window = windowedCounts.writeStream \
    .outputMode("complete") \
    .format("memory") \
    .queryName("myTable_window") \
    .start()
query_json = jsonoutput.writeStream \
    .outputMode("append") \
    .format("memory") \
    .queryName("myTable_json") \
    .start()
Step 7: Interactively Query the Structured Streaming Tables

Each in-memory table can be interactively queried for its current state using standard SparkSQL syntax, as shown below.

Query myTable_json: Output 15 records of the in-memory table

spark.sql("select created_at, screen_name, tweet from myTable_json limit 15").show(15,False)
Query myTable_json: Output the top 15 Twitter authors

spark.sql("select screen_name, count(*) as count from myTable_json group by screen_name order by count desc limit 15").show(15,False)
Query myTable_window: Output the window aggregations

spark.sql("select * from myTable_window limit 15").show(15,False)
References:
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
https://hortonworks.com/apache/spark/

Environment: Apache Spark 2.1.0 (PySpark with Python 2.7.5), Apache NiFi 1.1.0, Apache Kafka 2.10-0.8.2.1
Tags: Data Science & Advanced Analytics, How-To/Tutorial, Kafka, NiFi, pyspark, Spark, spark-streaming
03-17-2017
01:27 AM
2 Kudos
Hi @Sean Byrne I also had a similar question, but it's common within distributed systems to see many "part" file outputs. This is because you will typically have many partitions, across multiple nodes, writing to the same output directory (so interference is reduced). However, you can run a Spark job against this directory in order to create one single CSV file. Here's the code:

# Use PySpark to read in all "part" files
allfiles = spark.read.option("header","false").csv("/destination_path/part-*.csv")
# Output as CSV file
allfiles.coalesce(1).write.format("csv").option("header", "false").save("/destination_path/single_csv_file/")
Another option would be to use format("memory"), and then you could execute periodic in-memory queries against the Spark stream. These queries could save the in-memory table to a single CSV (or other format); a rough sketch of that is below. If I come across any way to output to a single CSV directly from Structured Streaming, I will be sure to post it. Hope this is helpful.
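Here is a minimal sketch of that second option (the in-memory table name myTable_json is just a placeholder):

# Periodically snapshot the in-memory streaming table to a single CSV file
snapshot = spark.sql("select * from myTable_json")
snapshot.coalesce(1) \
    .write.mode("overwrite") \
    .format("csv").option("header", "true") \
    .save("/destination_path/single_csv_file/")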
03-08-2017
05:46 PM
2 Kudos
@X Long I do not believe it was removed in Spark 2.1.0. Here's the documentation for Broadcast Variables (for Scala, Java, and Python): http://spark.apache.org/docs/2.1.0/programming-guide.html#broadcast-variables

You may also need to set the spark.sql.autoBroadcastJoinThreshold parameter if you are running into errors. This parameter sets the max size (in bytes) for a table that will be broadcast to all worker nodes when performing a join. A tiny broadcast-variable example is below. If you are running into an error, can you please post that as well. Thanks!
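For reference, creating and reading a broadcast variable in PySpark looks like this:

# Broadcast a small lookup structure to every executor once
lookup = sc.broadcast({"NC": "North Carolina", "NY": "New York"})

rdd = sc.parallelize(["NC", "NY", "NC"])
print(rdd.map(lambda abbrev: lookup.value[abbrev]).collect())
# ['North Carolina', 'New York', 'North Carolina']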
03-08-2017
03:24 PM
1 Kudo
@Jayadeep Jayaraman Yes, Pandas will execute on the nodes within your cluster, but you need to make sure that pandas (and any other libraries) are installed on the nodes. To accomplish this, Hortonworks and Anaconda have partnered to create Cluster Management Packs: https://www.continuum.io/blog/developer-blog/self-service-open-data-science-custom-anaconda-management-packs-hortonworks-hdp This is the preferred way to manage and ship Python (and R) packages within your HDP cluster. If you want to create a Python virtual environment and ship it to the nodes of your cluster, then here's a good article: https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html You need to zip up the relevant Python environment (or packages) and ship it when starting your Spark session, such as:

./bin/spark-submit --master yarn-cluster --archives your_python_env_or_packages.zip pyspark_project.py
03-06-2017
01:19 PM
@Dinesh Das Where is the table name coming from? In your case, is the user able to specify this (either manually or through code) when the job is submitted? If so, you can have the user specify the table name (and any other parameters) as a command-line argument, such as:

./bin/spark-submit --class classname project.jar your_table_name

Then reference it within your Scala code to read in your argument(s):

val my_table = args(0)
02-27-2017
01:11 PM
2 Kudos
@Aditya Mamidala Here's a working example of foreachPartition that I've used as part of a project. This is part of a Spark Streaming process, where "event" is a DStream, and each stream is written to HBase via Phoenix (JDBC). I have a structure similar to what you tried in your code, where I first use foreachRDD and then foreachPartition.

event.map(x => x._2).foreachRDD { rdd =>
  rdd.foreachPartition { rddpartition =>
    val thinUrl = "jdbc:phoenix:phoenix.dev:2181:/hbase"
    val conn = DriverManager.getConnection(thinUrl)
    rddpartition.foreach { record =>
      conn.createStatement().execute("UPSERT INTO myTable VALUES (" + record._1 + ")")
    }
    conn.commit()
  }
}
The full project is located here.
02-24-2017
05:35 PM
1 Kudo
I like @Bernhard Walter's PySpark solution! Here's another way to do it using Scala:

import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("/tmp/test_1.csv","/tmp/test_2.csv","/tmp/test_3.csv")
df.show()
02-24-2017
05:27 PM
1 Kudo
@Maher Hattabi You should be able to directly read in multiple files as part of the sqlContext.read statement, as shown below:

import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("/tmp/test_1.csv","/tmp/test_2.csv","/tmp/test_3.csv")
df.show()
If you are using Spark 2.0 or newer, this is the preferred syntax (using the SparkSession, available as spark):

val df = spark.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("/tmp/test_1.csv","/tmp/test_2.csv","/tmp/test_3.csv")
df.show()
Please let me know if this helps.
02-21-2017
08:53 PM
2 Kudos
@Dinesh Chitlangia Based on your background, I'd recommend that you go with Scala. Spark is written in Scala, and it will give you full access to all the latest APIs, features, etc. You should be able to pick up Scala quickly and can also incorporate Java into your Scala code (if you'd like). In case you didn't know, you can also use the Spark Java API: http://spark.apache.org/docs/latest/api/java/index.html I personally like Python and would like to convert you, 🙂 but in this situation I really don't see any advantage for you to go that route given your history with Java. From a performance perspective, this is a great presentation to view: http://www.slideshare.net/databricks/2015-0616-spark-summit. Slide 16 shows performance comparisons; the takeaway is that you should use DataFrames (or preferably the new Datasets) instead of RDDs, regardless of which language you go with.
02-21-2017
05:48 PM
7 Kudos
What kind of text or unstructured data are you collecting within your company? Are you able to fully utilize this data to enhance predictive models, build reports/visualizations, and detect emerging trends found within the text?
Hadoop enables distributed, low-cost storage for this growing amount of unstructured data. In this post, I'll show one way to analyze unstructured data using Apache Spark. Spark is advantageous for text analytics because it provides a platform for scalable, distributed computing.
When it comes to text analytics, you have a few options for analyzing text. I like to categorize these techniques like this:
Text Mining (i.e. Text clustering, data-driven topics)
Categorization (i.e. Tagging unstructured data into categories and sub-categories; hierarchies; taxonomies)
Entity Extraction (i.e. Extracting patterns such as phrases, addresses, product codes, phone numbers, etc.)
Sentiment Analysis (i.e. Tagging positive, negative, or neutral with varying levels of sentiment)
Deep Linguistics (i.e Semantics. Understanding causality, purpose, time, etc.)
Which technique you use typically depends on the business use case and the question(s) you are trying to answer. It's also common to combine these techniques.
This post will focus on text mining in order to uncover data-driven text topics.
For this example, I chose to analyze customer reviews of their airline experience. The goal of this analysis is to use statistics to find data-driven topics across a collection of customer airlines reviews. Here's the process I took:
1. Load the Airline Data from HDFS
rawdata = spark.read.load("hdfs://sandbox.hortonworks.com:8020/tmp/airlines.csv", format="csv", header=True)
# Show rawdata (as DataFrame)
rawdata.show(10)
2. Pre-processing and Text Cleanup
I believe this is the most important step in any text analytics process. Here I am converting each customer review into a list of words, while removing all stopwords (a list of commonly used words that we want removed from our analysis). I have also removed any special characters and punctuation, and lowercased the words so that everything is uniform when I create my term frequency matrix. As part of the text pre-processing phase, you may also choose to incorporate stemming (which groups all forms of a word, such as grouping "take", "takes", "took", "taking"). Depending on your use case, you could also include part-of-speech tagging, which will identify nouns, verbs, adjectives, and more. These POS tags can be used for filtering and to identify advanced linguistic relationships.
import re
from pyspark.sql.functions import udf, struct
from pyspark.sql.types import ArrayType, StringType

def cleanup_text(record):
    text = record[8]
    uid = record[9]
    words = text.split()
    # Default list of Stopwords
    stopwords_core = ['a', u'about', u'above', u'after', u'again', u'against', u'all', u'am', u'an', u'and', u'any', u'are', u'arent', u'as', u'at',
        u'be', u'because', u'been', u'before', u'being', u'below', u'between', u'both', u'but', u'by',
        u'can', 'cant', 'come', u'could', 'couldnt',
        u'd', u'did', u'didn', u'do', u'does', u'doesnt', u'doing', u'dont', u'down', u'during',
        u'each',
        u'few', 'finally', u'for', u'from', u'further',
        u'had', u'hadnt', u'has', u'hasnt', u'have', u'havent', u'having', u'he', u'her', u'here', u'hers', u'herself', u'him', u'himself', u'his', u'how',
        u'i', u'if', u'in', u'into', u'is', u'isnt', u'it', u'its', u'itself',
        u'just',
        u'll',
        u'm', u'me', u'might', u'more', u'most', u'must', u'my', u'myself',
        u'no', u'nor', u'not', u'now',
        u'o', u'of', u'off', u'on', u'once', u'only', u'or', u'other', u'our', u'ours', u'ourselves', u'out', u'over', u'own',
        u'r', u're',
        u's', 'said', u'same', u'she', u'should', u'shouldnt', u'so', u'some', u'such',
        u't', u'than', u'that', 'thats', u'the', u'their', u'theirs', u'them', u'themselves', u'then', u'there', u'these', u'they', u'this', u'those', u'through', u'to', u'too',
        u'under', u'until', u'up',
        u'very',
        u'was', u'wasnt', u'we', u'were', u'werent', u'what', u'when', u'where', u'which', u'while', u'who', u'whom', u'why', u'will', u'with', u'wont', u'would',
        u'y', u'you', u'your', u'yours', u'yourself', u'yourselves']
    # Custom List of Stopwords - Add your own here
    stopwords_custom = ['']
    stopwords = stopwords_core + stopwords_custom
    stopwords = [word.lower() for word in stopwords]
    text_out = [re.sub('[^a-zA-Z0-9]','',word) for word in words]                                   # Remove special characters
    text_out = [word.lower() for word in text_out if len(word)>2 and word.lower() not in stopwords] # Remove stopwords and words under X length
    return text_out
udf_cleantext = udf(cleanup_text , ArrayType(StringType()))
clean_text = rawdata.withColumn("words", udf_cleantext(struct([rawdata[x] for x in rawdata.columns])))
This will create an array of strings (words), which can be used within our Term-Frequency calculation.
3. Generate a TF-IDF (Term Frequency Inverse Document Frequency) Matrix
In this step, I calculate the TF-IDF. Our goal here is to put a weight on each word in order to assess its significance. This algorithm down-weights words that appear very frequently. For example, if the word "airline" appeared in every customer review, then it has little power in differentiating one review from another. Whereas the words "mechanical" and "failure" (as an example) may only be seen in a small subset of customer reviews, and therefore be more important in identifying a topic of interest.
from pyspark.ml.feature import HashingTF, CountVectorizer, IDF

# Term Frequency Vectorization - Option 1 (HashingTF):
#hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
#featurizedData = hashingTF.transform(clean_text)
# Term Frequency Vectorization - Option 2 (CountVectorizer) :
cv = CountVectorizer(inputCol="words", outputCol="rawFeatures", vocabSize = 1000)
cvmodel = cv.fit(clean_text)
featurizedData = cvmodel.transform(clean_text)
vocab = cvmodel.vocabulary
vocab_broadcast = sc.broadcast(vocab)
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData) # TFIDF
The TF-IDF algorithm produces a "feature" variable, which contains a vector set of term weights corresponding to each word within the associated customer review.
4. Use LDA to Cluster the TF-IDF Matrix
LDA (Latent Dirichlet Allocation) is a topic model that infers topics from a collection of unstructured data. The output of this algorithm is k number of topics, which correspond to the most significant and distinct topics across your set of customer reviews.
from pyspark.ml.clustering import LDA

# Generate 25 Data-Driven Topics:
# "em" = expectation-maximization
lda = LDA(k=25, seed=123, optimizer="em", featuresCol="features")
ldamodel = lda.fit(rescaledData)
ldamodel.isDistributed()
ldamodel.vocabSize()
ldatopics = ldamodel.describeTopics()
# Show the top 25 Topics
ldatopics.show(25)
Based on my airline data, this code will produce the following output:
Keep in mind that these topics are data-driven (generated using statistical techniques) and do not require external datasets, dictionaries, or training data to produce these results. Because of the statistical nature of this process, some business understanding and inference needs to be applied to these topics in order to create a clear description for business users. I've relabeled a few of these as an example:
Topic 0: Seating Concerns (more legroom, exit aisle, extra room)
Topic 1: Vegas Trips (including upgrades)
Topic 7: Carryon Luggage, Overhead Space Concerns
Topic 12: Denver Delays, Mechanical Issues
So how can you improve these topics?
Spend time on the text pre-processing phase
Remove stopwords
Incorporate stemming
Consider using part-of-speech tags
Play with k, the number of topics within the LDA algorithm. A small number of topics may not give you the granularity you need. Alternatively, too many topics could introduce redundancy or very similar topics.
5. What can you do with these results?
Look for trends in the Data-Driven Topics: You may start to notice an increase or decrease in a specific topic. For example, if you notice a growing trend for one of your topics (i.e. mechanical/maintenance related issues out of JFK Airport), then you may need to investigate further as to why. Text analytics can also help to uncover root causes for these mechanical issues (if you can get ahold of technician notes or maintenance logs). Or what if you notice a data-driven topic that discusses rude flight attendants or airline cleanliness; this may indicate an issue with a flight crew. Below, I have included a screenshot showing the trends for Topic 12 (Denver, Mechanical/Maintenance issues). You can see spikes for the various airlines, which indicate when customers were complaining about mechanical or maintenance related issues:
Reporting & Visualization
As another quick example, below I am showing how you can use Apache Zeppelin to create a time series for each topic, along with a dropdown box so that an airline (or another attribute of interest) can be selected.
Enhance Predictive Models with Text
Lastly, and likely most important, these text topics can be used to enhance your predictive models. Since each topic can be its own variable, you are now able to use these within your predictive model. As an example, if you add these new variables to your model, you may find that departure location, airline, and a topic related to mechanical issues are the most predictive in whether or not a customer gives you a high or low rating.
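For example, here is a rough sketch (an illustration I'm adding here, assuming the pyspark.ml API used above) of turning the topic mixtures into model features:

from pyspark.ml.feature import VectorAssembler

# ldamodel.transform() adds a "topicDistribution" vector column:
# one weight per topic for each customer review
topics_df = ldamodel.transform(rescaledData)

# Combine the topic mixture with any other predictors you already have,
# then feed the assembled features to a classifier or regressor
assembler = VectorAssembler(inputCols=["topicDistribution"], outputCol="model_features")
model_input = assembler.transform(topics_df)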
Thanks for reading and please let me know if you have any feedback! I'm also working on additional examples to illustrate categorization, sentiment analytics, and how to use this data to enhance predictive models.
References:
https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.clustering.LDA
https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.feature.HashingTF
https://spark.apache.org/docs/2.0.2/api/python/pyspark.ml.html#pyspark.ml.feature.CountVectorizer
https://github.com/zaratsian/PySpark/blob/master/text_analytics_datadriven_topics.json
Tags: analytics, Data Science & Advanced Analytics, How-To/Tutorial, pyspark, Spark, text-processing
02-14-2017
05:30 PM
@Hary Tadi You should make a web request to a URL similar to this:

https://maps.googleapis.com/maps/api/place/nearbysearch/json?location=35.7796,-78.6382&radius=500&type=restaurant&keyword=cruise&key=your_API_key_goes_here

Note: Make sure you insert your own Google Places API key. This uses the Google Places API (I'm targeting restaurants in the Raleigh, NC area). Here's what the JSON response will look like:

{
"html_attributions" : [],
"results" : [
{
"geometry" : {
"location" : {
"lat" : 35.7781669,
"lng" : -78.640547
},
"viewport" : {
"northeast" : {
"lat" : 35.7783907,
"lng" : -78.6405374
},
"southwest" : {
"lat" : 35.7780923,
"lng" : -78.64055020000001
}
}
},
"icon" : "https://maps.gstatic.com/mapfiles/place_api/icons/restaurant-71.png",
"id" : "0b6af460338b6684ebdfb2e9ab166228096a18e1",
"name" : "Death and Taxes",
"opening_hours" : {
"exceptional_date" : [],
"open_now" : false,
"weekday_text" : []
},
"photos" : [
{
"height" : 1519,
"html_attributions" : [
"\u003ca href=\"https://maps.google.com/maps/contrib/112671686149584704404/photos\"\u003eDeath and Taxes\u003c/a\u003e"
],
"photo_reference" : "CoQBdwAAAMCFWj4H8N2nWccPO9wp43h0ZY1MZPtMZbJH_LECDD_tZ2IzLHuFfux3wnd_cFjvgoe_OsUZFz60jcCXLUisaJ5NppdbNMMKGrKJA3fB3PnwaXWzYMc0Vu8w8vL756x3SPSpeHuwon0El6U4Qo913cj_Y7z2s6ZxjH-Ioa20Ojq4EhCw2ZnBG15V6qdIDS7Zb1P5GhS06Ty9AYU7rADFtglQ1EIqpsyZBg",
"width" : 1519
}
],
"place_id" : "ChIJEyImA25frIkRsIhJGErQbQc",
"rating" : 4.3,
"reference" : "CmRRAAAAzfKCDeADWguDAhYAQ9AvF4w2h8emOk71ejJBwaCDleUApKet0RYvvp_izUYNedZGPAQiJTTjtPNJp5l6WP_kfl3nHd0e9XfNnQ-XvdUlw4G_khQ4cjaQCs2GnlpOwj6AEhCJNKr1l3sfdq53PGQ5CTH5GhTDxjJLF1NQ7Xk8B2jVO-uZZn-U3A",
"scope" : "GOOGLE",
"types" : [ "restaurant", "food", "point_of_interest", "establishment" ],
"vicinity" : "105 West Hargett Street, Raleigh"
},
{
"geometry" : {
"location" : {
"lat" : 35.7780598,
"lng" : -78.63677629999999
},
"viewport" : {
"northeast" : {
"lat" : 35.7782437,
"lng" : -78.63676565
},
"southwest" : {
"lat" : 35.7779985,
"lng" : -78.63677985
}
}
},
"icon" : "https://maps.gstatic.com/mapfiles/place_api/icons/restaurant-71.png",
"id" : "50f2ded33ca34d9e0960e8a38c020ada2274b981",
"name" : "Caffe Luna",
"opening_hours" : {
"exceptional_date" : [],
"open_now" : true,
"weekday_text" : []
},
"photos" : [
{
"height" : 2988,
"html_attributions" : [
"\u003ca href=\"https://maps.google.com/maps/contrib/117733338662226365021/photos\"\u003eHannah Cho\u003c/a\u003e"
],
"photo_reference" : "CoQBdwAAAFqcPUfCtPeFyDjoujMX2rbs2BDKh50PMl5urofk1Bj5xRWGvJa9farsbLchr4kFGhuVf-m0CtJkGiJDuavBPPP_KRcms_ly532O2w-8ifwTbi3nZZSt9XPtMyfDittQe10MjoioDEbSNLULzBhPf7ENv5KjWyUTzJwA9teH_3U6EhDFlbSCjW4QGPWd0yLs0cl0GhQfKi5lUuiG_ALM50BR1lAdSUsAqA",
"width" : 5312
}
],
"place_id" : "ChIJi42jeW1frIkR_kp3cwgWp0g",
"price_level" : 2,
"rating" : 4.3,
"reference" : "CmRRAAAAbta6x-fEFyPpkq3E9l-UrlDuVmgSrafwL99qrt5Vz6fPhDI9HIRyfW1QHUoy9TQk_1_raVJhpE0SGU118UolB_0emPuTB-mx-DY4S7yaY7July-0og5BmbpLk0XK_YnAEhDCa6YrbvyYmpnzDG2PYpAKGhQTJ5ls-c7IJu6niyz7x9EMW5JsAA",
"scope" : "GOOGLE",
"types" : [ "restaurant", "food", "point_of_interest", "establishment" ],
"vicinity" : "136 East Hargett Street, Raleigh"
},
{
"geometry" : {
"location" : {
"lat" : 35.7783858,
"lng" : -78.6380794
},
"viewport" : {
"northeast" : {
"lat" : 35.77839010000001,
"lng" : -78.63800795
},
"southwest" : {
"lat" : 35.77837290000001,
"lng" : -78.63829375
}
}
},
"icon" : "https://maps.gstatic.com/mapfiles/place_api/icons/restaurant-71.png",
"id" : "42e66c9ac59d357fefb739dcf600ad6da6cfa674",
"name" : "Sitti",
"opening_hours" : {
"exceptional_date" : [],
"open_now" : true,
"weekday_text" : []
},
"photos" : [
{
"height" : 4008,
"html_attributions" : [
"\u003ca href=\"https://maps.google.com/maps/contrib/113871826378884549564/photos\"\u003eRCL Rekman\u003c/a\u003e"
],
"photo_reference" : "CoQBdwAAADneighCJvgiyqO42i_SiktzOehS5ONG-7vKuzMqcetSQDAYTOr1P0ky6A1itMs2NyGE7FSrx4dGv4nSqW0q5LCwMjlhLRassFZ8iZDAmzDiW3QDFNrcaQLZK9UzkgzxQbTD3WTpCYWNNPwmirSUQhJYXM_3wUHW-AgKH2UerhizEhAhdFeCNsYrSdiErAZ3llQFGhSXS6H0smm0FOLhSGpC_QKOYZ2keg",
"width" : 5344
}
],
"place_id" : "ChIJ19sFjm1frIkRfg9-b_kFExU",
"price_level" : 2,
"rating" : 4.4,
"reference" : "CmRRAAAALD24ysCM_17vFWcKgBEGHUf3vdC0wsnSgpDlNEiuvMg-ElyEXTuGdi_cwdxwLGWrsc2aMUgeSsWpeuuAC3ID-JB7QxpEiLilyeVHnXXLQsAXDCmoznzr_mQQsHI_bCmBEhDW-4Q7XecJsjfHcFg7dFU-GhQmGfy2blho8-skejER2E5nD8IzPQ",
"scope" : "GOOGLE",
"types" : [ "bar", "restaurant", "food", "point_of_interest", "establishment" ],
"vicinity" : "137 South Wilmington Street, Raleigh"
},

Once you have the results, you'll need to use a tool to parse the JSON. I suggest that you use Hortonworks DataFlow (specifically Apache NiFi) to make the web request as well as to parse the JSON. Then, you can write the data to HDFS, Hive, or wherever you'd like. More info on the Google Places API can be found here: https://developers.google.com/places/web-service/search

Here are a few NiFi examples to get you started: https://cwiki.apache.org/confluence/display/NIFI/Example+Dataflow+Templates. I'd suggest looking at Pull_from_Twitter_Garden_Hose.xml and InvokeHttp_And_Route_Original_On_Status.xml. Hope this helps!
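If you want to prototype the request outside of NiFi first, here is a small Python sketch (the API key is a placeholder; NiFi's InvokeHTTP and EvaluateJsonPath processors cover the same ground within a flow):

import requests

params = {
    "location": "35.7796,-78.6382",   # Raleigh, NC
    "radius": 500,
    "type": "restaurant",
    "key": "your_API_key_goes_here"
}
resp = requests.get("https://maps.googleapis.com/maps/api/place/nearbysearch/json", params=params)

# Print a few fields from each result
for place in resp.json().get("results", []):
    print("{} | {} | {}".format(place["name"], place.get("rating"), place["geometry"]["location"]))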