Created on 06-10-2016 10:37 PM - edited 08-17-2019 12:05 PM
Platform: HDP 2.4 (Sandbox)
Hadoop version: 2.7
OS: CentOS 6.8
Python 2.7 and every dependent library your code uses (e.g. Pandas, Matplotlib, SciPy) must be installed on each data node that Spark will run on.
Below are the steps I performed to get iPython up and running. My notebook jobs are executed via YARN. I did this install as the 'root' user.
Install/configure the following 5 iPython dependencies by typing in the following commands:
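On CentOS 6 these are the usual build prerequisites for compiling Python 2.7 and its scientific libraries. The exact package list depends on your environment; treat the set below as an illustration, not an exact recipe:
yum install -y gcc gcc-c++ zlib-devel bzip2-devel openssl-devel sqlite-devel readline-devel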
iPython has a requirement for Python 2.7 or higher. Check which version of Python you are using:
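For example:
Command to run: python -V
(python --version works as well.)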
In my case, I had an older version of Python so I then had to install a new version.
Install the “Development tools” dependency for Python 2.7
Command to run: yum groupinstall "Development tools"
Install Python 2.7:
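Because the start-up script later in this article sources /opt/rh/python27/enable, Python 2.7 here comes from the CentOS Software Collections (SCL). A sketch of that install, assuming the SCL repository package names:
yum install -y centos-release-scl
yum install -y python27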
Verify 2.7 is there
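Assuming the SCL install above, enable the Python 2.7 environment and check the version:
source /opt/rh/python27/enable
python -V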
Download easy_install to configure pip (Python package installer):
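A sketch of this step using the setuptools bootstrap script that was current at the time (the download URL may have changed or been retired since, so adjust as needed):
wget https://bootstrap.pypa.io/ez_setup.py
python ez_setup.py
easy_install pip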
The following data science packages are a good set of libraries to start with. Install them now:
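These are the ones called out at the top of this article; add anything else your code imports. For example:
Command to run: pip install numpy scipy pandas matplotlib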
Install iPython notebook on a node with the Spark Client
Command to run: pip install "ipython[notebook]"
Create an IPython profile for pyspark
Command to run: ipython profile create pyspark
Create a jupyter config file
Command to run: jupyter notebook --generate-config
This will create file /root/.jupyter/jupyter_notebook_config.py
Create the shell script: start_ipython_notebook.sh
Add this to the file:
#!/bin/bash
source /opt/rh/python27/enable
IPYTHON_OPTS="notebook --port 8889 --notebook-dir='/usr/hdp/2.3.2.0-2950/spark/' --ip='*' --no-browser" pyspark
Give the shell script execute permissions.
chmod 755 start_ipython_notebook.sh
Add a port forwarding rule for port 8889
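How you add the rule depends on how you run the Sandbox. For the VirtualBox image, a NAT port-forwarding rule along these lines works (substitute your own VM name; "Hortonworks Sandbox" is just a placeholder):
Command to run on the host, with the VM powered off: VBoxManage modifyvm "Hortonworks Sandbox" --natpf1 "jupyter,tcp,,8889,,8889"
With the rule in place, the notebook UI is reachable from the host browser at http://127.0.0.1:8889 once it is started.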
Edit ~/.bash_profile and set PYSPARK_PYTHON to the path of your Python 2.7 binary, or set it from within your notebook; my PySpark code example below shows the notebook approach. A sample set of exports is shown after the list of optional variables below.
If this is not set, then you will see the following error in your notebook after submitting code:
Cannot run program "python2.7": error=2, No such file or directory
You might also need to set the following, although I did not:
HADOOP_HOME
JAVA_HOME
PYTHONPATH
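A sketch of the relevant lines in ~/.bash_profile, using the same Python 2.7 path as my notebook code (the optional variables are commented out because I did not need them, and the paths shown for them are only examples):
export PYSPARK_PYTHON=/usr/local/bin/python2.7
# export HADOOP_HOME=/usr/hdp/current/hadoop-client
# export JAVA_HOME=/usr/lib/jvm/java
# export PYTHONPATH=/usr/hdp/current/spark-client/python:$PYTHONPATH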
Run iPython:
./start_ipython_notebook.sh
Start a Notebook
Below is my PySpark code (from the notebook), a screenshot of my job running in Resource Manager showing the YARN resource allocation, and the output shown in the notebook.
CODE:
## stop existing SparkContext
## i did this because i create a new SparkContext with my specific properties
sc.stop()

import os
from pyspark import SparkContext, SparkConf

## set path to Python 2.7
os.environ["PYSPARK_PYTHON"] = "/usr/local/bin/python2.7"

sc = SparkContext(conf=SparkConf()
                  .setMaster("yarn-client")
                  .setAppName("ipython_yarn_test")
                  .set("spark.dynamicAllocation.enabled", "false")
                  .set("spark.executor.instances", "4")
                  .set("spark.executor.cores", 1)
                  .set("spark.executor.memory", "1G"))

## get a word count from a file in HDFS and list them in order by counts
## only showing top 10
text_file = sc.textFile("/tmp/test_data.dat")
word_counts = (text_file
               .flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b)
               .map(lambda x: (x[1], x[0]))
               .sortByKey(ascending=False))
word_counts.take(10)
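Note that dynamic allocation is disabled, so Spark requests a fixed set of four executors (1 core, 1 GB each) rather than scaling them up and down; that fixed allocation is what shows up in the Resource Manager screenshot.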
Finally, run the code in the Notebook. Below is the output from the Notebook and Resource Manager showing the job.