Edit your bash_profile and set PYSPARK_PYTHON to the path of
Python 2.7, or set it from within your notebook. My PySpark code example
shows how to do this.
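For the bash_profile route, the export looks like the sketch below. The interpreter path matches the one used later in the notebook code; it is an assumption, so adjust it to wherever python2.7 lives on your system (check with `which python2.7`):

```shell
# Tell PySpark which Python executable to launch for workers.
# /usr/local/bin/python2.7 is an assumed path -- adjust for your system.
export PYSPARK_PYTHON=/usr/local/bin/python2.7
```

After adding the line, reload the profile (e.g. `source ~/.bash_profile`) before launching the notebook.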
If this is not set, then you will see the following error in
your notebook after submitting code:
Cannot run program "python2.7": error=2, No such
file or directory
You might need to set other related environment variables as well; however, I did not need to.
Start a Notebook
Below is my PySpark code (from the notebook), a screenshot of
my job running in Resource Manager showing the YARN resource allocation, and the
output shown in the notebook.
## stop the existing SparkContext
## i did this because i create a new SparkContext with my specific properties
import os
from pyspark import SparkContext, SparkConf

sc.stop()

## set path to Python 2.7 (must be set before the new SparkContext is created)
os.environ["PYSPARK_PYTHON"] = "/usr/local/bin/python2.7"

## create a new SparkContext with my specific properties
sc = SparkContext(conf=SparkConf())

## get a word count from a file in HDFS and list the words in order by count
## only showing the top 10
text_file = sc.textFile("/tmp/test_data.dat")
word_counts = text_file \
    .flatMap(lambda line: line.split()) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b) \
    .map(lambda x: (x[1], x[0])) \
    .sortByKey(ascending=False)
word_counts.take(10)
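The transformation chain can be sanity-checked without a cluster: count the words, swap each pair to (count, word), and sort descending by count. A minimal plain-Python sketch of that same logic (the sample text is made up for illustration):

```python
from collections import Counter

# made-up sample text standing in for the HDFS file
text = "spark yarn spark hdfs spark yarn"

# flatMap + map + reduceByKey amounts to counting tokens
counts = Counter(text.split())

# map(lambda x: (x[1], x[0])) + sortByKey(ascending=False):
# swap to (count, word), then sort descending by count
word_counts = sorted(((c, w) for w, c in counts.items()), reverse=True)

print(word_counts[:10])
```

Swapping to (count, word) before sortByKey matters because sortByKey orders by the first element of each pair; sorting the original (word, count) pairs would order alphabetically instead of by frequency.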
Finally, run the code in the Notebook. Below is the output from the Notebook and Resource Manager showing the job.