By default, Cloudera Machine Learning (CML) ships a Jupyter kernel as part of its base engine images. Data Scientists often prefer to use a specialized custom kernel in Jupyter that makes their work more efficient. In this community post, we will walk through how to customize a Docker container image with the sparkmagic Jupyter kernel and how to deploy it to a CML workspace.
Jupyter kernels are purpose-built add-ons to the basic Python notebook editor. For this tutorial, I chose sparkmagic, a kernel that provides convenient features for working with Spark, such as keeping SQL syntax clean in a cell. Sparkmagic relies on Livy to communicate with the Spark cluster. As of this writing, Livy is not supported in CML when running Spark on Kubernetes. However, a classic Spark cluster (for example, on Data Hub) will work with Livy and therefore with sparkmagic. For now, you simply need to know that installing sparkmagic is done with the following sequence of commands:
pip3 install sparkmagic
jupyter nbextension enable --py --sys-prefix widgetsnbextension
jupyter-kernelspec install sparkmagic/kernels/pysparkkernel
Note: The third command is executed after you cd into the directory that is created by the install. This location is platform dependent and is determined by running pip3 show sparkmagic after the install. We’ll have to take care of this in the Docker image definition.
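As a minimal sketch, the location lookup and the kernel registration can be combined into two commands; the grep/cut pipeline simply extracts the install path from pip’s output (this is the same technique used in the Dockerfile below):
cd $(pip3 show sparkmagic | grep Location | cut -d" " -f2)
jupyter-kernelspec install sparkmagic/kernels/pysparkkernel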
To create a custom Docker image, we first create a text file (I called it magic-dockr) that specifies the base image (the CML base engine on Ubuntu) along with the additional libraries we want to install. I will use CML to do the majority of the work.
First, create the below docker file in your CML project.
# Dockerfile
# Specify a Cloudera Machine Learning base image
FROM docker.repository.cloudera.com/cdsw/engine:9-cml1.1
# Update packages on the base image, install sparkmagic, and register the kernel
RUN apt-get update
RUN pip3 install sparkmagic
RUN jupyter nbextension enable --py --sys-prefix widgetsnbextension
RUN jupyter-kernelspec install --user $(pip3 show sparkmagic | grep Location | cut -d" " -f2)/sparkmagic/kernels/pysparkkernel
Now we use this image definition to build a deployable Docker container. Run the following commands in an environment where the docker.io binaries are installed.
docker build -t <your-repository>/cml-sparkmagic:v1.0 . -f magic-dockr
docker push <your-repository>/cml-sparkmagic:v1.0
This will build and distribute your Docker image to a repository of your choosing.
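Optionally, you can sanity-check the image before registering it with CML. The following is a hedged sketch that assumes a local Docker daemon and that the engine image permits overriding its entrypoint:
# Optional sanity check: confirm the PySpark kernel is registered in the image
docker run --rm --entrypoint jupyter-kernelspec <your-repository>/cml-sparkmagic:v1.0 list
If pysparkkernel appears in the output, the kernel registration step in the Dockerfile worked as expected.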
There are two steps to make the custom kernel available in your project: first, add the image to the CML workspace; second, enable the image for the project you are working on.
The first step requires Admin privileges. From the blade menu on the left, select Admin, then click the Engines tab. In the Engine Images section, enter a name for your custom image (e.g., Sparkmagic Kernel) and the repository:tag you used when building and pushing the image. Click Add.
Once the engine is added, we’ll need to tell CML how to launch a Jupyter notebook when this image is used to run a session. Click the Edit button next to the Sparkmagic Kernel you’ve added. Click + New Editor in the window that opens.
Enter the editor name as Jupyter Notebook and for the command use the following:
/usr/local/bin/jupyter-notebook --no-browser --ip=127.0.0.1 --port=8090 --NotebookApp.token= --NotebookApp.allow_remote_access=True --log-level=ERROR
Note that port 8090 is the default port, unless your administrator changed it.
Then click Save, and Save again. At this point, CML knows where to find your custom kernel and which editor to launch when a session starts.
Now we are ready to enable this custom engine inside a project.
Open a project where you would like to use your custom kernel. For me, it’s a project called Custom Kernel Project (yes, I’m not very creative when it comes to names). In the left panel, click Project Settings, then go to the Engine tab. In the Engine Image section, select your custom engine image from the drop-down.
To test the engine, go to Sessions and create a new session. You’ll see that the Engine Image is the custom Docker image you built earlier. Name your session and select Jupyter Notebook as your Editor.
When the session launches, in the Jupyter notebook interface you’ll be able to select PySpark when creating a new notebook.
You can start with the %%help magic and follow along with the sparkmagic documentation. Specifically, you’ll want to configure a connection to a Spark cluster using the JSON template provided.
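As a minimal sketch, the connection can be set up by writing sparkmagic’s config file before the kernel starts; the Livy hostname and port below are placeholders for illustration, and only the credentials block for the PySpark kernel is shown:
# Point the PySpark kernel at your Livy endpoint (hostname and port are placeholders)
mkdir -p ~/.sparkmagic
cat > ~/.sparkmagic/config.json <<'EOF'
{
  "kernel_python_credentials": {
    "username": "",
    "password": "",
    "url": "http://your-livy-host:8998",
    "auth": "None"
  }
}
EOF
Restart the PySpark kernel for the change to take effect. Once connected, magics such as %%sql let you keep SQL syntax clean in a cell, as mentioned earlier.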
That’s it!
CML brings you the flexibility to run any third-party editor on the platform, making development more efficient for Data Scientists and Data Engineers. Note that while this article focused on the sparkmagic custom kernel, the same procedure can be applied to any kernel you wish to run with Jupyter Notebook or JupyterLab.