By default, Cloudera Machine Learning (CML) ships a Jupyter kernel as part of its base engine images. Data Scientists often prefer to use a specialized custom kernel in Jupyter that makes their work more efficient. In this community post, we will walk through how to customize a Docker container image with the sparkmagic Jupyter kernel and how to deploy it to a CML workspace.
Jupyter kernels are purpose-built add-ons to the basic Python notebook editor. For this tutorial, I chose sparkmagic, a kernel that provides convenient features for working with Spark, such as keeping SQL syntax clean in a cell. Sparkmagic relies on Livy to communicate with the Spark cluster. As of this writing, Livy is not supported in CML when running Spark on Kubernetes. However, a classic Spark cluster (for example, on Data Hub) will work with Livy and therefore with sparkmagic. For now, you simply need to know that installing sparkmagic is done with the following sequence of commands:
pip3 install sparkmagic
jupyter nbextension enable --py --sys-prefix widgetsnbextension
jupyter-kernelspec install sparkmagic/kernels/pysparkkernel
Note: The third command is executed after you cd into the directory that is created by the install. This location is platform dependent and is determined by running pip3 show sparkmagic after the install. We’ll have to take care of this in the Docker image definition.
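As a minimal sketch, the location lookup and the kernel registration can be combined into two commands; the grep/cut pipeline simply extracts the install path from pip’s output (this is the same technique used in the Dockerfile below):
cd $(pip3 show sparkmagic | grep Location | cut -d" " -f2)
jupyter-kernelspec install sparkmagic/kernels/pysparkkernel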
To create a custom Docker image, we first create a text file (I called it magic-dockr) that specifies the base image (the CML base engine on Ubuntu) along with the additional libraries we want to install. I will use CML to do the majority of the work.
First, create the below docker file in your CML project.
# Dockerfile
# Specify a Cloudera Machine Learning base image
FROM docker.repository.cloudera.com/cdsw/engine:9-cml1.1
# Update packages on the base image, install sparkmagic, and register the kernel
RUN apt-get update
RUN pip3 install sparkmagic
RUN jupyter nbextension enable --py --sys-prefix widgetsnbextension
RUN jupyter-kernelspec install --user $(pip3 show sparkmagic | grep Location | cut -d" " -f2)/sparkmagic/kernels/pysparkkernel
Now we use this image definition to build a deployable Docker container. Run the following commands in an environment where the docker.io binaries are installed.
docker build -t <your-repository>/cml-sparkmagic:v1.0 . -f magic-dockr
docker push <your-repository>/cml-sparkmagic:v1.0
This will build and distribute your Docker image to a repository of your choosing.
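Optionally, you can sanity-check the image before registering it with CML. The following is a hedged sketch that assumes a local Docker daemon and that the engine image permits overriding its entrypoint:
# Optional sanity check: confirm the PySpark kernel is registered in the image
docker run --rm --entrypoint jupyter-kernelspec <your-repository>/cml-sparkmagic:v1.0 list
If pysparkkernel appears in the output, the kernel registration step in the Dockerfile worked as expected.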
There are two steps to make the custom kernel available in your project: first, add the image to the CML workspace; second, enable the image for the project you are working on.
The first step requires Admin privileges. From the blade menu on the left, select Admin, then click the Engines tab. In the Engine Images section, enter a name for your custom image (e.g., Sparkmagic Kernel) and the repository:tag you used when building and pushing the image. Click Add.
Once the engine is added, we’ll need to tell CML how to launch a Jupyter notebook when this image is used to run a session. Click the Edit button next to the Sparkmagic Kernel you’ve added. Click + New Editor in the window that opens.
Enter the editor name as Jupyter Notebook and for the command use the following:
/usr/local/bin/jupyter-notebook --no-browser --ip=127.0.0.1 --port=8090 --NotebookApp.token= --NotebookApp.allow_remote_access=True --log-level=ERROR
Note that port 8090 is the default port, unless your administrator changed it.
Then click Save, and Save again. At this point, CML knows where to find your custom kernel and which editor to launch when a session starts.
Now we are ready to enable this custom engine inside a project.
Open a project where you would like to use your custom kernel. For me, it’s a project called Custom Kernel Project (yes, I’m not very creative when it comes to names). In the left panel, click Project Settings, then go to the Engine tab. In the Engine Image section, select your custom engine image from the drop-down.
To test the engine, go to Sessions and create a new session. You’ll see that the Engine Image is the custom Docker image you built earlier. Name your session and select Jupyter Notebook as your Editor.
When the session launches, in the Jupyter notebook interface you’ll be able to select PySpark when creating a new notebook.
You can start with the %%help magic and follow along with the sparkmagic documentation. Specifically, you’ll want to configure a connection to a Spark cluster using the JSON template provided.
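As a minimal sketch, the connection can be set up by writing sparkmagic’s config file before the kernel starts; the Livy hostname and port below are placeholders for illustration, and only the credentials block for the PySpark kernel is shown:
# Point the PySpark kernel at your Livy endpoint (hostname and port are placeholders)
mkdir -p ~/.sparkmagic
cat > ~/.sparkmagic/config.json <<'EOF'
{
  "kernel_python_credentials": {
    "username": "",
    "password": "",
    "url": "http://your-livy-host:8998",
    "auth": "None"
  }
}
EOF
Restart the PySpark kernel for the change to take effect. Once connected, magics such as %%sql let you keep SQL syntax clean in a cell, as mentioned earlier.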
That’s it!
CML brings you the flexibility to run any third-party editor on the platform, making development more efficient for Data Scientists and Data Engineers. Note that while this article focused on the sparkmagic custom kernel, the same procedure can be applied to any kernel you wish to run with Jupyter Notebook or JupyterLab.