This article demonstrates how to install Anaconda on an HDP 3.0.1 instance and how to configure Zeppelin so that the Anaconda Python libraries can be used with Apache Spark.

Prerequisites:

  • The bzip2 library must be installed before installing Anaconda (see the command below).
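
On RHEL/CentOS-based nodes, bzip2 can typically be installed with the system package manager; adjust the command for your distribution:

    sudo yum install -y bzip2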

Step 1. Download and Install Anaconda

https://www.anaconda.com/download/#linux

From the link above, download the Anaconda installer. It is a large bash script, roughly 600 MB in size. Place the file in an appropriate working directory such as /tmp.
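
For example, the installer can be fetched directly with wget. The exact filename depends on the version you choose; the Python 2 installer from December 2018 is assumed here for illustration:

    cd /tmp
    wget https://repo.anaconda.com/archive/Anaconda2-2018.12-Linux-x86_64.sh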

Once downloaded, provide the installer with executable permissions.
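
Assuming the installer filename from above:

    chmod +x /tmp/Anaconda2-2018.12-Linux-x86_64.sh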

Once completed, the installer can be executed. Please run it as a user with sudo privileges. Follow the prompts and accept the license agreement. As a best practice, install Anaconda in a directory other than the default; /opt is the default for my installations.
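
A typical run looks like the following; alternatively, the installer's -b (batch) and -p (prefix) flags perform a non-interactive install into a chosen directory. The /opt/anaconda2 prefix is an assumption matching this walkthrough:

    # interactive install; enter /opt/anaconda2 when prompted for a location
    sudo bash /tmp/Anaconda2-2018.12-Linux-x86_64.sh

    # or non-interactively:
    sudo bash /tmp/Anaconda2-2018.12-Linux-x86_64.sh -b -p /opt/anaconda2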

At the very end of the installation, the installer will prompt the user to initialize the install in the installing user's .bashrc file.
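
If you accept, reload your shell configuration and confirm which python binary is now first on the PATH:

    source ~/.bashrc
    which python
    python --version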

If you have a multi-node environment, please repeat this process on the remaining nodes (a simple loop is sketched below).
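
A minimal sketch for repeating the install on other nodes, assuming passwordless SSH and the hypothetical hostnames node2 and node3:

    for host in node2 node3; do
      scp /tmp/Anaconda2-2018.12-Linux-x86_64.sh $host:/tmp/
      ssh $host "sudo bash /tmp/Anaconda2-2018.12-Linux-x86_64.sh -b -p /opt/anaconda2"
    done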

Step 2. Configure Zeppelin

Log into Zeppelin and open the Interpreter page. Scroll down to the spark interpreter and look for the zeppelin.pyspark.python property. This property points to the Python binary used when executing Python applications; it defaults to the Python binary bundled with the OS and needs to be updated to the Anaconda Python binary.

Next, navigate to the location where you installed Anaconda. In this example the location is /opt/anaconda2/bin.
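
You can confirm the binary is present from the shell:

    ls -l /opt/anaconda2/bin/python*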

Set the property to the full path of the python binary in the /opt/anaconda2/bin directory, as shown below.
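
The resulting interpreter property, assuming the install prefix used in this walkthrough:

    zeppelin.pyspark.python = /opt/anaconda2/bin/python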

Save the settings for the interpreter. Finally, restart the interpreter and you should be good to go!
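
To verify, run a quick paragraph in a Zeppelin note; it should report the Anaconda Python version rather than the OS default:

    %pyspark
    import sys
    print(sys.version)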

If you run into any issues, make sure the zeppelin user has the file permissions needed to execute the new Python binary and libraries.
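
A quick way to test this, assuming Zeppelin runs as the zeppelin user:

    sudo -u zeppelin /opt/anaconda2/bin/python --version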


Comments

@jtaras, thank you for this wonderful article!

I have a follow-up question (just curious): do the Spark workers use the same custom Python path as the Zeppelin driver?

If yes, how do they know it? (The Zeppelin UI setting seems to apply only to the Zeppelin driver application and its SparkContext.)

If no, why is there no version conflict between the driver's Python and the workers' Python? (I tested with the default of 2.x and the custom Anaconda Python of 3.7.)

NOTE: I additionally had to set the PYSPARK_PYTHON environment variable in Spark > Config in Ambari to the same path in order to use all the Python libraries in scripts submitted via "spark-submit", but this does not affect the Zeppelin functionality in any way.
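
For reference, the spark-env entry the note above refers to would look something like this (set via Ambari under the Spark service configs; the exact section name may vary by HDP version):

    export PYSPARK_PYTHON=/opt/anaconda2/bin/python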