Platform: HDP 2.4 (Sandbox)

Hadoop version: 2.7

OS: CentOS 6.8

Python 2.7 and all dependent libraries that your code uses (e.g. Pandas, Matplotlib, SciPy) must be installed on every data node that Spark will run on.
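
A quick way to confirm that a data node actually has these libraries for the interpreter Spark will use is to try importing them; a minimal sketch (the three names below are just the examples above, so swap in whatever your code imports):

## per-node check: run with the Python 2.7 interpreter that Spark will use
import importlib

for name in ("pandas", "matplotlib", "scipy"):
    try:
        mod = importlib.import_module(name)
        print("%-12s OK (version %s)" % (name, getattr(mod, "__version__", "unknown")))
    except ImportError as err:
        print("%-12s MISSING: %s" % (name, err))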

Below are the steps I performed to get iPython up and running. My notebook jobs are executed via YARN. I did this install as the ‘root’ user.

STEP 1

Install the iPython dependencies by running the following five commands:

  1. yum install nano centos-release-SCL zlib-devel
  2. yum install bzip2-devel openssl-devel ncurses-devel
  3. yum install sqlite-devel readline-devel tk-devel
  4. yum install gdbm-devel db4-devel libpcap-devel xz-devel
  5. yum install libpng-devel libjpeg-devel atlas-devel

STEP 2

iPython requires Python 2.7 or higher. Check which version of Python you are currently using:

[screenshot: checking the current Python version]

In my case I had an older version of Python, so I had to install a newer one.
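
You can also run this check from inside the interpreter itself; a minimal sketch that works on any recent Python:

## prints the full version string and whether it meets the 2.7 requirement
import sys

print(sys.version)
print(sys.version_info >= (2, 7))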

Install the “Development tools” package group, which is needed to build Python 2.7

Command to run: yum groupinstall "Development tools"

Install Python 2.7:

  1. wget http://python.org/ftp/python/2.7.6/Python-2.7.6.tar.xz
  2. tar xf Python-2.7.6.tar.xz
  3. cd Python-2.7.6
  4. ./configure --prefix=/usr/local --enable-unicode=ucs4 --enable-shared LDFLAGS="-Wl,-rpath /usr/local/lib"
  5. make && make altinstall
  6. source /opt/rh/python27/enable

Verify that 2.7 is there: [screenshot: verifying the new Python version]

STEP 3

Download easy_install and use it to set up pip (the Python package installer):

  1. wget https://bitbucket.org/pypa/setuptools/raw/0.7.4/ez_setup.py
  2. python ez_setup.py

This is a good set of libraries to start with.

Install the following data science packages:

  1. pip install numpy scipy pandas
  2. pip install scikit-learn tornado pyzmq
  3. pip install pygments matplotlib jsonschema
  4. pip install jinja2 --upgrade

STEP 4

Install iPython notebook on a node with the Spark Client

Command to run: pip install "ipython[notebook]"
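
Once that finishes, it is worth confirming that iPython was installed into the new 2.7 interpreter rather than the older system Python; a minimal check, run with the same python binary your pip is tied to:

## confirm which interpreter is running and that IPython imports from it
import sys
import IPython

print(sys.executable)
print(IPython.__version__)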

STEP 5

Create an IPython profile for pyspark

Command to run: ipython profile create pyspark

STEP 6

Create a jupyter config file

Command to run: jupyter notebook --generate-config

This will create the file /root/.jupyter/jupyter_notebook_config.py
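
The generated file is plain Python with every option commented out. If you prefer to keep the notebook settings there rather than passing them on the command line (STEP 7 below passes them via IPYTHON_OPTS), a minimal sketch of the equivalent entries; this assumes a Jupyter 4.x style NotebookApp config, so check the option names against your version:

## excerpt of /root/.jupyter/jupyter_notebook_config.py
c = get_config()

c.NotebookApp.ip = '*'              ## listen on all interfaces
c.NotebookApp.port = 8889           ## same port forwarded in STEP 8
c.NotebookApp.open_browser = False  ## headless sandbox, no local browser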

STEP 7

Create the shell script: start_ipython_notebook.sh

Add this to the file:

#!/bin/bash

source /opt/rh/python27/enable
IPYTHON_OPTS="notebook --port 8889 --notebook-dir='/usr/hdp/2.3.2.0-2950/spark/' --ip='*' --no-browser" pyspark

Give the shell script execute permissions:

chmod 755 start_ipython_notebook.sh

STEP 8

Add a port forwarding rule for port 8889

[screenshot: port forwarding rule for port 8889]

STEP 9

Edit your .bash_profile and set PYSPARK_PYTHON to the path of your Python 2.7 binary, or set it in your notebook; my PySpark code example below shows how to do this.

If this is not set, then you will see the following error in your notebook after submitting code:

Cannot run program "python2.7": error=2, No such file or directory

You might also need to set the following, although I did not:

HADOOP_HOME

JAVA_HOME

PYTHONPATH
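
If you prefer to set any of these from inside the notebook instead of .bash_profile, they can be exported through os.environ before the SparkContext is created; a minimal sketch (the paths shown are placeholders, so adjust them to your own environment):

## set before creating the SparkContext; paths below are examples only
import os

os.environ["PYSPARK_PYTHON"] = "/usr/local/bin/python2.7"
## the three below were not needed in my setup, but follow the same pattern:
## os.environ["HADOOP_HOME"] = "/usr/hdp/current/hadoop-client"
## os.environ["JAVA_HOME"]   = "/usr/lib/jvm/java"
## os.environ["PYTHONPATH"]  = "/usr/hdp/current/spark-client/python"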

STEP 10

Run iPython:

./start_ipython_notebook.sh

STEP 11

Start a Notebook

Below is my PySpark code (from the notebook), a screenshot of my job running in the Resource Manager showing the YARN resource allocation, and the output shown in the notebook.

CODE:

## stop the existing SparkContext
## I did this because I create a new SparkContext below with my specific properties
sc.stop()

import os
from pyspark import SparkContext, SparkConf

## set path to Python 2.7
os.environ["PYSPARK_PYTHON"] = "/usr/local/bin/python2.7"

sc = SparkContext(
        conf = SparkConf()
        .setMaster("yarn-client")
        .setAppName("ipython_yarn_test")
        .set("spark.dynamicAllocation.enabled", "false")
        .set("spark.executor.instances", "4")
        .set("spark.executor.cores", 1)
        .set("spark.executor.memory", "1G"))


## get a word count from a file in HDFS and list them in order by counts
## only showing top 10
text_file = sc.textFile("/tmp/test_data.dat")
word_counts = (text_file
    .flatMap(lambda line: line.split())
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)
    .map(lambda x: (x[1], x[0]))
    .sortByKey(ascending=False))

word_counts.take(10)
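
take(10) returns a plain Python list of (count, word) tuples; if you would rather see them one per line in the notebook than as a raw list, a small optional variation:

## optional: print the top 10 as "count  word" lines instead of the raw list
for count, word in word_counts.take(10):
    print("%6d  %s" % (count, word))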

STEP 12

Finally, run the code in the Notebook. Below is the output from the Notebook and Resource Manager showing the job.

[screenshots: notebook output and Resource Manager view of the job]

