Created on 06-10-2016 10:37 PM - edited 08-17-2019 12:05 PM
Platform: HDP 2.4 (Sandbox)
Hadoop version: 2.7
OS: CentOS 6.8
Python 2.7 and every dependent library your code uses (e.g. Pandas, Matplotlib, SciPy) must be installed on each data node that Spark will run on.
Below are the steps I performed to get iPython up and running. My notebook jobs are executed via YARN. I did this install as the 'root' user.
Install/configure the following 5 iPython dependencies by typing in the following commands:
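On CentOS 6 these are the usual build prerequisites for compiling Python 2.7 and its scientific libraries. The exact package list depends on your environment; treat the set below as an illustration, not an exact recipe:
yum install -y gcc gcc-c++ zlib-devel bzip2-devel openssl-devel sqlite-devel readline-devel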
iPython has a requirement for Python 2.7 or higher. Check which version of Python you are using:
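For example:
Command to run: python -V
(python --version works as well.)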
In my case, I had an older version of Python so I then had to install a new version.
Install the “Development tools” dependency for Python 2.7
Command to run: yum groupinstall "Development tools"
Install Python 2.7:
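Because the start-up script later in this article sources /opt/rh/python27/enable, Python 2.7 here comes from the CentOS Software Collections (SCL). A sketch of that install, assuming the SCL repository package names:
yum install -y centos-release-scl
yum install -y python27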
Verify 2.7 is there
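Assuming the SCL install above, enable the Python 2.7 environment and check the version:
source /opt/rh/python27/enable
python -V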
Download easy_install to configure pip (Python package installer):
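A sketch of this step using the setuptools bootstrap script that was current at the time (the download URL may have changed or been retired since, so adjust as needed):
wget https://bootstrap.pypa.io/ez_setup.py
python ez_setup.py
easy_install pip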
The following data science packages are a good set of libraries to start with. Install them now:
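These are the ones called out at the top of this article; add anything else your code imports. For example:
Command to run: pip install numpy scipy pandas matplotlib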
Install iPython notebook on a node with the Spark Client
Command to run: pip install "ipython[notebook]"
Create an IPython profile for pyspark
Command to run: ipython profile create pyspark
Create a jupyter config file
Command to run: jupyter notebook --generate-config
This will create file /root/.jupyter/jupyter_notebook_config.py
Create the shell script: start_ipython_notebook.sh
Add this to the file:
#!/bin/bash
source /opt/rh/python27/enable
IPYTHON_OPTS="notebook --port 8889 --notebook-dir='/usr/hdp/2.3.2.0-2950/spark/' --ip='*' --no-browser" pyspark
Give the shell script execute permissions.
chmod 755 start_ipython_notebook.sh
Add a port forwarding rule for port 8889
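How you add the rule depends on how you run the Sandbox. For the VirtualBox image, a NAT port-forwarding rule along these lines works (substitute your own VM name; "Hortonworks Sandbox" is just a placeholder):
Command to run on the host, with the VM powered off: VBoxManage modifyvm "Hortonworks Sandbox" --natpf1 "jupyter,tcp,,8889,,8889"
With the rule in place, the notebook UI is reachable from the host browser at http://127.0.0.1:8889 once it is started.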
Edit ~/.bash_profile and set PYSPARK_PYTHON to the path of your Python 2.7 binary, or set it from within your notebook; my PySpark code example below shows the notebook approach. A sample set of exports is shown after the list of optional variables below.
If this is not set, then you will see the following error in your notebook after submitting code:
Cannot run program "python2.7": error=2, No such file or directory
You might also need to set the following, although I did not:
HADOOP_HOME
JAVA_HOME
PYTHONPATH
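A sketch of the relevant lines in ~/.bash_profile, using the same Python 2.7 path as my notebook code (the optional variables are commented out because I did not need them, and the paths shown for them are only examples):
export PYSPARK_PYTHON=/usr/local/bin/python2.7
# export HADOOP_HOME=/usr/hdp/current/hadoop-client
# export JAVA_HOME=/usr/lib/jvm/java
# export PYTHONPATH=/usr/hdp/current/spark-client/python:$PYTHONPATH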
Run iPython:
./start_ipython_notebook.sh
Start a Notebook
Below is my PySpark code (from the notebook), a screenshot of my job running in Resource Manager showing the YARN resource allocation, and the output shown in the notebook.
CODE:
## stop existing SparkContext
## i did this because i create a new SparkContext with my specific properties
sc.stop()

import os
from pyspark import SparkContext, SparkConf

## set path to Python 2.7
os.environ["PYSPARK_PYTHON"] = "/usr/local/bin/python2.7"

sc = SparkContext(conf=SparkConf()
                  .setMaster("yarn-client")
                  .setAppName("ipython_yarn_test")
                  .set("spark.dynamicAllocation.enabled", "false")
                  .set("spark.executor.instances", "4")
                  .set("spark.executor.cores", 1)
                  .set("spark.executor.memory", "1G"))

## get a word count from a file in HDFS and list them in order by counts
## only showing top 10
text_file = sc.textFile("/tmp/test_data.dat")
word_counts = (text_file
               .flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b)
               .map(lambda x: (x[1], x[0]))
               .sortByKey(ascending=False))
word_counts.take(10)
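Note that dynamic allocation is disabled, so Spark requests a fixed set of four executors (1 core, 1 GB each) rather than scaling them up and down; that fixed allocation is what shows up in the Resource Manager screenshot.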
Finally, run the code in the Notebook. Below is the output from the Notebook and Resource Manager showing the job.