Created on 06-10-2016 10:37 PM - edited 08-17-2019 12:05 PM
Platform: HDP 2.4 (Sandbox)
Hadoop version: 2.7
OS: CentOS 6.8
Python 2.7 and all dependent libraries that your code uses (e.g. Pandas, Matplotlib, SciPy) must be installed on every data node that Spark will run on.
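For example, once Python 2.7 is present on a node, the libraries can be installed with pip (the pip path below is an assumption; adjust it to match your Python 2.7 install):
# run on each data node; pip2.7 location is assumed
/usr/local/bin/pip2.7 install pandas matplotlib scipy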
Below are the steps I performed to get IPython up and running. My notebook jobs are executed via YARN.
I did this install as the 'root' user.
STEP 1
Install/configure the following five IPython dependencies by typing in the following commands:
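(If you build Python 2.7 from source, a typical CentOS 6 dependency set is the one below; treat it as a sketch and verify the exact packages against your environment.)
# typical build dependencies for Python 2.7 on CentOS 6 (verify for your setup)
yum install -y zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel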
Edit .bash_profile and set PYSPARK_PYTHON to the path of your Python 2.7 installation, or set it in your notebook instead; my PySpark code example below shows how to do this. A sample .bash_profile snippet follows the list of optional variables below.
If PYSPARK_PYTHON is not set, you will see the following error in your notebook after submitting code:
Cannot run program "python2.7": error=2, No such file or directory
You might also need to set the following, though I did not:
HADOOP_HOME
JAVA_HOME
PYTHONPATH
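A minimal .bash_profile sketch, assuming Python 2.7 lives under /usr/local/bin (the optional variables are commented out, and the HDP paths shown for them are assumptions):
# interpreter the PySpark workers should use
export PYSPARK_PYTHON=/usr/local/bin/python2.7
# optional - I did not need these; paths are typical HDP defaults, verify yours
# export JAVA_HOME=/usr/lib/jvm/java
# export HADOOP_HOME=/usr/hdp/current/hadoop-client
# export PYTHONPATH=$PYTHONPATH:/usr/hdp/current/spark-client/python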
STEP 10
Run IPython:
./start_ipython_notebook.sh
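If you need to recreate this script, a hypothetical start_ipython_notebook.sh along these lines points Spark's pyspark launcher at IPython as the driver shell (the port, bind address, and spark-client path are assumptions):
#!/bin/bash
# use the IPython notebook as the PySpark driver front end
export PYSPARK_PYTHON=/usr/local/bin/python2.7
export PYSPARK_DRIVER_PYTHON=ipython
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --ip=0.0.0.0 --port=8889"
/usr/hdp/current/spark-client/bin/pyspark --master yarn-client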
STEP 11
Start a Notebook
Below is my PySpark code (from the notebook), a screenshot of my job running in Resource Manager showing YARN resource allocation, and the output shown in the notebook.
CODE:
## stop the existing SparkContext
## I did this because I create a new SparkContext with my specific properties
sc.stop()
import os
from pyspark import SparkContext, SparkConf
## set path to Python 2.7
os.environ["PYSPARK_PYTHON"] = "/usr/local/bin/python2.7"
sc = SparkContext(
    conf=SparkConf()
        .setMaster("yarn-client")
        .setAppName("ipython_yarn_test")
        .set("spark.dynamicAllocation.enabled", "false")
        .set("spark.executor.instances", "4")
        .set("spark.executor.cores", 1)
        .set("spark.executor.memory", "1G"))
## get a word count from a file in HDFS and list them in order by counts
## only showing top 10
text_file = sc.textFile("/tmp/test_data.dat")
word_counts = (text_file
    .flatMap(lambda line: line.split())
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)
    .map(lambda x: (x[1], x[0]))
    .sortByKey(ascending=False))
word_counts.take(10)
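Note that the (word, count) pairs are swapped to (count, word) before the sortByKey call, since sortByKey orders by key; take(10) therefore returns the ten most frequent words as (count, word) tuples.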
STEP 12
Finally, run the code in the Notebook. Below is the output from the Notebook and the Resource Manager view of the job.
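You can also confirm the job from the command line with the YARN CLI; the running application should show up under the name set by setAppName above:
yarn application -list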