Created 04-03-2017 02:30 PM
Has anyone used a Python IDE with a Spark cluster? Is there a way to install a Python IDE like Eclipse or Spyder on a local Windows machine and submit Spark jobs to a remote cluster via PySpark? I can see that Spyder ships with Anaconda, but the Hadoop nodes where Anaconda is installed don't have GUI tools, so it's not possible to see the Spyder UI that gets started on the remote Linux edge node.
What is the best way to go about this?
Created 04-03-2017 09:35 PM
Sure, you can use any IDE with PySpark.
Here are short instructions for Eclipse and PyDev:
- set the HADOOP_HOME variable to the location of winutils.exe
- set the SPARK_HOME variable to your local Spark folder
- set SPARK_CONF_DIR to the folder where you have copied the actual cluster config (spark-defaults and log4j)
- add %SPARK_HOME%/python/lib/pyspark.zip and %SPARK_HOME%/python/lib/py4j-xx.x.zip to the PYTHONPATH of the interpreter
For testing purposes I'm adding code like:
spark = SparkSession.builder.master("spark://my-cluster-master-node:7077").getOrCreate()
but with a proper configuration file in SPARK_CONF_DIR it should work with just SparkSession.builder.getOrCreate()
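If it's useful, here is roughly the same setup done in code instead of in the IDE's environment settings. This is only a sketch; all the Windows paths are placeholders for your own locations:

import glob
import os
import sys

# Placeholder paths -- point these at your own winutils / Spark / config copies
os.environ["HADOOP_HOME"] = r"C:\hadoop"                 # contains bin\winutils.exe
os.environ["SPARK_HOME"] = r"C:\spark"                   # local Spark folder
os.environ["SPARK_CONF_DIR"] = r"C:\spark\cluster-conf"  # copied spark-defaults / log4j

# Same effect as adding the two zips to the interpreter's PYTHONPATH
lib = os.path.join(os.environ["SPARK_HOME"], "python", "lib")
sys.path[:0] = [os.path.join(lib, "pyspark.zip"),
                glob.glob(os.path.join(lib, "py4j-*.zip"))[0]]

from pyspark.sql import SparkSession

# With spark-defaults.conf in SPARK_CONF_DIR no explicit .master() is needed
spark = SparkSession.builder.appName("ide-test").getOrCreate()
print(spark.range(5).count())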
Alternatively, you can set up your run configurations to use spark-submit directly.
Hope it helps
Created 04-03-2017 06:47 PM
@tuxnet
I don't know if you specifically need an IDE, but have you given Zeppelin a try? The zeppelin-server runs on your cluster itself and you can access it via a browser. You can submit your Spark jobs through either the %livy or %spark interpreter. %livy also provides additional features such as session timeouts, impersonation, etc.
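For example, a Zeppelin paragraph going through the Livy PySpark interpreter could look like the sketch below; the exact interpreter name (e.g. %livy.pyspark) depends on your Zeppelin version and interpreter settings:

%livy.pyspark
# sc is pre-created by the interpreter; the code runs on the cluster via Livy
rdd = sc.parallelize(range(100))
print(rdd.sum())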
Created 04-03-2017 07:00 PM
thanks for the reply. Forgot to mention, I am using Zeppelin and Jupyter right now. But, an IDE is more featureful and best suited in scenarios like module building. I have seen people using Spyder, pyCharm, Eclipse etc locally, but was looking to see if they could be integrated with remote multi-node Hadoop cluster.
Created 04-03-2017 07:08 PM
I see. One way you could try is to submit Spark jobs remotely via Livy. It does not require you to have a Spark client on your local machine. You need the Livy server installed and configured properly on your cluster, and then you can submit your jobs via its REST API.
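For illustration, a rough sketch of submitting a PySpark script as a Livy batch from Python with the requests library; the Livy URL and the HDFS path are placeholders (Livy usually listens on port 8998):

import json
import requests

# Placeholder endpoint -- Livy typically listens on port 8998 of the node it runs on
livy_url = "http://livy-host.example.com:8998/batches"

# The script must be reachable from the cluster, e.g. already uploaded to HDFS
payload = {"file": "hdfs:///user/myuser/my_spark_job.py"}

resp = requests.post(livy_url, data=json.dumps(payload),
                     headers={"Content-Type": "application/json"})
resp.raise_for_status()
batch = resp.json()
print("batch id:", batch["id"], "state:", batch["state"])

# Check the batch state later with GET /batches/{id}
print(requests.get("{0}/{1}".format(livy_url, batch["id"])).json()["state"])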
Created 04-05-2017 08:18 PM
Livy is a nice option, it's just that we would have to make curl calls to the API from outside the script(?). But something like what @Michael M suggests sounds more interesting.
Created 04-05-2017 08:22 PM
That's cool! So, does this setup need Spark version > 2? Also, what would be the master IP and port if using Spark on YARN? I am not a dev, so please excuse me if these sound stupid 😄
Created 04-05-2017 08:29 PM
It should work with Spark 1.6 as well. You can check the master URL in the spark-defaults.conf file on your cluster. If you set the SPARK_CONF_DIR variable and copy the spark-defaults config from your cluster into it, there is no need to specify the master explicitly.
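For the YARN question: with Spark on YARN there is no master host:port to point at; the master is simply "yarn" and the ResourceManager address is read from the Hadoop client configs. A rough sketch (Spark 2.x API; the HADOOP_CONF_DIR path is a placeholder for a local copy of your cluster's configs):

import os
from pyspark.sql import SparkSession

# Placeholder: a local copy of the cluster's core-site.xml / yarn-site.xml / hdfs-site.xml
os.environ["HADOOP_CONF_DIR"] = r"C:\hadoop-conf"

# On YARN the master is just "yarn" (client deploy mode from the IDE);
# the ResourceManager address comes from yarn-site.xml, not from a host:port here
spark = (SparkSession.builder
         .master("yarn")
         .appName("ide-yarn-test")
         .getOrCreate())
print(spark.range(10).count())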
Created 04-10-2017 11:53 AM
Looks like this is the better approach. I got some clear info from http://theckang.com/2015/remote-spark-jobs-on-yarn/ that matches your solution.
Thanks much!