Created 04-03-2017 02:30 PM
Has anyone used a Python IDE with a Spark cluster? Is there a way to install a Python IDE like Eclipse or Spyder on a local Windows machine and submit Spark jobs to a remote cluster via PySpark? I can see that Spyder ships with Anaconda, but the Hadoop nodes where Anaconda is installed don't have GUI tools, so it's not possible to see the Spyder UI that gets started on the remote Linux edge node.
What is the best way to go about this?
Created 04-03-2017 09:35 PM
Sure, you can use any IDE with PySpark.
Here are short instructions for Eclipse and PyDev:
- set the HADOOP_HOME variable to the location of winutils.exe
- set the SPARK_HOME variable to your local Spark folder
- set SPARK_CONF_DIR to the folder where you have copied the actual cluster config (spark-defaults and log4j)
- add %SPARK_HOME%/python/lib/pyspark.zip and %SPARK_HOME%/python/lib/py4j-xx.x.zip to the PYTHONPATH of the interpreter
For testing purposes I'm adding code like:
spark = SparkSession.builder.master("spark://my-cluster-master-node:7077").getOrCreate()
but with a proper configuration file in SPARK_CONF_DIR it should work with just SparkSession.builder.getOrCreate()
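If it's useful, here is roughly the same setup done in code instead of in the IDE's environment settings. This is only a sketch; all the Windows paths are placeholders for your own locations:

import glob
import os
import sys

# Placeholder paths -- point these at your own winutils / Spark / config copies
os.environ["HADOOP_HOME"] = r"C:\hadoop"                 # contains bin\winutils.exe
os.environ["SPARK_HOME"] = r"C:\spark"                   # local Spark folder
os.environ["SPARK_CONF_DIR"] = r"C:\spark\cluster-conf"  # copied spark-defaults / log4j

# Same effect as adding the two zips to the interpreter's PYTHONPATH
lib = os.path.join(os.environ["SPARK_HOME"], "python", "lib")
sys.path[:0] = [os.path.join(lib, "pyspark.zip"),
                glob.glob(os.path.join(lib, "py4j-*.zip"))[0]]

from pyspark.sql import SparkSession

# With spark-defaults.conf in SPARK_CONF_DIR no explicit .master() is needed
spark = SparkSession.builder.appName("ide-test").getOrCreate()
print(spark.range(5).count())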
Alternatively, you can set up your run configurations to use spark-submit directly.
Hope it helps
Created 04-03-2017 06:47 PM
@tuxnet
I don't know if you specifically need an IDE, but have you given Zeppelin a try? The zeppelin-server runs on your cluster itself and you can access it via a browser. You can submit your Spark jobs through either the %livy or %spark interpreter. %livy also provides additional features such as session timeouts, impersonation, etc.
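For example, a Zeppelin paragraph going through the Livy PySpark interpreter could look like the sketch below; the exact interpreter name (e.g. %livy.pyspark) depends on your Zeppelin version and interpreter settings:

%livy.pyspark
# sc is pre-created by the interpreter; the code runs on the cluster via Livy
rdd = sc.parallelize(range(100))
print(rdd.sum())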
Created 04-03-2017 07:00 PM
thanks for the reply. Forgot to mention, I am using Zeppelin and Jupyter right now. But, an IDE is more featureful and best suited in scenarios like module building. I have seen people using Spyder, pyCharm, Eclipse etc locally, but was looking to see if they could be integrated with remote multi-node Hadoop cluster.
Created 04-03-2017 07:08 PM
I see. One way you could try is to submit Spark jobs remotely via Livy. It does not require you to have a Spark client on your local machine. You need the Livy server installed and configured properly on your cluster, and then you can submit your jobs via its REST API.
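For illustration, a rough sketch of submitting a PySpark script as a Livy batch from Python with the requests library; the Livy URL and the HDFS path are placeholders (Livy usually listens on port 8998):

import json
import requests

# Placeholder endpoint -- Livy typically listens on port 8998 of the node it runs on
livy_url = "http://livy-host.example.com:8998/batches"

# The script must be reachable from the cluster, e.g. already uploaded to HDFS
payload = {"file": "hdfs:///user/myuser/my_spark_job.py"}

resp = requests.post(livy_url, data=json.dumps(payload),
                     headers={"Content-Type": "application/json"})
resp.raise_for_status()
batch = resp.json()
print("batch id:", batch["id"], "state:", batch["state"])

# Check the batch state later with GET /batches/{id}
print(requests.get("{0}/{1}".format(livy_url, batch["id"])).json()["state"])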
Created 04-05-2017 08:18 PM
Livy is a nice option, it's just that we would have to make curl calls to the API from outside the script(?). But something like what @Michael M suggests sounds more interesting.
Created 04-05-2017 08:22 PM
That's cool! So, does this setup need Spark version > 2? Also, what would be the master IP and port if using Spark on YARN? I am not a dev, so please excuse me if these sound stupid 😄
Created 04-05-2017 08:29 PM
It should work with Spark 1.6 as well. You can check the master URL in the spark-defaults.conf file on your cluster. If you set the SPARK_CONF_DIR variable and copy the spark-defaults config from your cluster into it, there is no need to specify the master explicitly.
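For the YARN question: with Spark on YARN there is no master host:port to point at; the master is simply "yarn" and the ResourceManager address is read from the Hadoop client configs. A rough sketch (Spark 2.x API; the HADOOP_CONF_DIR path is a placeholder for a local copy of your cluster's configs):

import os
from pyspark.sql import SparkSession

# Placeholder: a local copy of the cluster's core-site.xml / yarn-site.xml / hdfs-site.xml
os.environ["HADOOP_CONF_DIR"] = r"C:\hadoop-conf"

# On YARN the master is just "yarn" (client deploy mode from the IDE);
# the ResourceManager address comes from yarn-site.xml, not from a host:port here
spark = (SparkSession.builder
         .master("yarn")
         .appName("ide-yarn-test")
         .getOrCreate())
print(spark.range(10).count())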
Created 04-10-2017 11:53 AM
Looks like this is the better approach. I got some clear info from http://theckang.com/2015/remote-spark-jobs-on-yarn/ that matches your solution.
Thanks much!