
Python IDE for HDP Spark cluster


New Contributor

Has anyone ever used a Python IDE with a Spark cluster? Is there any way to install a Python IDE like Eclipse or Spyder on a local Windows machine and submit Spark jobs to a remote cluster via PySpark? I can see that Spyder ships with Anaconda, but the Hadoop nodes where Anaconda is installed don't have GUI tools, so it's not possible to see the Spyder UI that is launched on the remote Linux edge node.

What is the best way to go about this?

1 ACCEPTED SOLUTION


Re: Python IDE for HDP Spark cluster

Expert Contributor

@tuxnet

Sure, you can use any IDE with PySpark.

Here are short instructions for Eclipse and PyDev (an example of the environment settings follows the list):

- set the HADOOP_HOME variable referencing the location of winutils.exe

- set the SPARK_HOME variable referencing your local Spark folder

- set SPARK_CONF_DIR to the folder where you have the actual cluster config copied (spark-defaults.conf and log4j.properties)

- add %SPARK_HOME%/python/lib/pyspark.zip and %SPARK_HOME%/python/lib/py4j-xx.x.zip to the PYTHONPATH of the interpreter
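
For example, on a Windows machine the environment for the run configuration might look like the following. This is a minimal sketch; all paths are placeholders, and the exact py4j version depends on your Spark release:

:: HADOOP_HOME must contain bin\winutils.exe
set HADOOP_HOME=C:\hadoop
:: local Spark distribution matching the cluster's major version
set SPARK_HOME=C:\spark
:: folder holding spark-defaults.conf and log4j.properties copied from the cluster
set SPARK_CONF_DIR=C:\cluster-conf
:: make the PySpark libraries importable
set PYTHONPATH=%SPARK_HOME%\python\lib\pyspark.zip;%SPARK_HOME%\python\lib\py4j-xx.x.zip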

For testing purposes I'm adding code like:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("spark://my-cluster-master-node:7077").getOrCreate()

but with a proper configuration file in SPARK_CONF_DIR it should work with just SparkSession.builder.getOrCreate().

Alternatively, you can set up your run configurations to use spark-submit directly.
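
For instance, a run configuration could invoke something like the line below (the script name is a placeholder, and --master yarn assumes the copied cluster config makes the ResourceManager reachable):

spark-submit --master yarn --deploy-mode client my_job.py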

Hope it helps

8 REPLIES

Re: Python IDE for HDP Spark cluster

@tuxnet

I don't know if you have a specific need to use an IDE only, but have you given Zeppelin a try? The zeppelin-server runs on your cluster itself and you can access it via the browser. You can submit your Spark jobs through either the %livy or %spark interpreter. %livy also provides additional features such as session timeouts, impersonation, etc.
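
For example, a Zeppelin note paragraph submitted through Livy might look like this (the %livy.pyspark interpreter name is an assumption and depends on your Zeppelin configuration):

%livy.pyspark
# runs on the cluster via the Livy session behind the interpreter; sc is preinitialized
sc.parallelize(range(100)).count()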

Re: Python IDE for HDP Spark cluster

New Contributor
@Kshitij Badani

thanks for the reply. Forgot to mention, I am using Zeppelin and Jupyter right now. But, an IDE is more featureful and best suited in scenarios like module building. I have seen people using Spyder, pyCharm, Eclipse etc locally, but was looking to see if they could be integrated with remote multi-node Hadoop cluster.

Re: Python IDE for HDP Spark cluster

@tuxnet

I see. One of the ways you could try is submitting Spark jobs remotely via Livy. It does not require you to have a Spark client on your local machine. You need to have the Livy server installed and configured properly on your cluster, and then you can submit your jobs via its REST API.

https://github.com/cloudera/livy#post-sessions
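
As an illustration, here is a minimal sketch of that REST flow in Python. The host name is a placeholder (8998 is Livy's default port), and the submitted code is just a trivial expression:

import json
import time
import requests

livy = "http://livy-host:8998"  # placeholder host; 8998 is Livy's default port
headers = {"Content-Type": "application/json"}

# start a PySpark session (POST /sessions)
r = requests.post(livy + "/sessions", data=json.dumps({"kind": "pyspark"}), headers=headers)
session = livy + r.headers["Location"]  # e.g. http://livy-host:8998/sessions/0

# wait until the session is ready to accept statements
while requests.get(session, headers=headers).json()["state"] != "idle":
    time.sleep(2)

# submit a statement (POST /sessions/{id}/statements) and poll for its result
r = requests.post(session + "/statements", data=json.dumps({"code": "1 + 1"}), headers=headers)
statement = livy + r.headers["Location"]
while True:
    result = requests.get(statement, headers=headers).json()
    if result["state"] == "available":
        print(result["output"])
        break
    time.sleep(1)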

Re: Python IDE for HDP Spark cluster

New Contributor

Livy is a nice option, just that we would have to make curl calls to the API outside the script(?). But something like what @Michael M suggested sounds more interesting.


Re: Python IDE for HDP Spark cluster

New Contributor
@Michael M

That's cool! So, does this setup need Spark version > 2? Also, what would be the master IP and port if using Spark on YARN? I am not a dev, so please excuse me if these sound stupid :D

Re: Python IDE for HDP Spark cluster

Expert Contributor
@tuxnet

It should work with Spark 1.6 as well. You can check the master URL in the spark-defaults.conf file on your cluster. If you set up the SPARK_CONF_DIR variable and copy the spark-defaults config from your cluster into it, there is no need to specify the master explicitly.
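
For reference, the relevant entries in a cluster's spark-defaults.conf might look like the sketch below (values are illustrative). On YARN there is no explicit master IP and port: the master is simply "yarn" (or yarn-client/yarn-cluster on Spark 1.6), and Spark locates the ResourceManager through the Hadoop configuration files:

# illustrative spark-defaults.conf entries (Spark 2.x syntax)
spark.master               yarn
spark.submit.deployMode    client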

Re: Python IDE for HDP Spark cluster

New Contributor

Looks like this is a better approach. I got some clear info from http://theckang.com/2015/remote-spark-jobs-on-yarn/ that matches your solution.

Thanks much!