Member since
12-29-2016
12
Posts
1
Kudos Received
0
Solutions
10-05-2021
06:43 AM
2 Kudos
A recent update to Cloudera Machine Learning (CML) adds the ability to use custom code editors with ML Runtimes. This article shows how to create an ML Runtime that uses a different editor and add it to CML. First, you will create a Docker image that is configured to use a custom editor, specifically RStudio, and then add it to your workspace.
Step 1: Create and upload the Docker Image
Note: If you just want to use RStudio, you can skip this step and use an image that has already been uploaded: peterableda/rstudio-cloudera-runtime:2022.04-8
You will need Docker installed and running for this step. First, clone the repo and switch to the RStudio 1.4 directory inside it. In a terminal window, run the following commands:
$ git clone https://github.com/cloudera/community-ml-runtimes
$ cd rstudio_1.4
Now build the Docker image. Replace the peterableda/ part of the tag with the details of the container registry you need to use. The rstudio-cloudera-runtime:2022.04-8 part of the tag is up to you; 2022.04-8 follows the CalVer naming convention that we use for community images. The Dockerfile has some useful comments about its structure that can help you customize it for your own requirements.
$ docker build -t peterableda/rstudio-cloudera-runtime:2022.04-8 . -f Dockerfile
The next step is to push the image to your container registry.
$ docker push peterableda/rstudio-cloudera-runtime:2022.04-8
Assuming the image push worked, you are ready for the next step.
Step 2: Add the Runtime image to CML
Note: Adding a custom runtime requires the CML Public Cloud - August 31 release or a newer version. If you don't have the Runtime Catalog navigation item, or the Runtime Catalog page doesn't have the Add Runtime button, you might not have the right version or the right permissions. Please check with whoever manages your CDP environment.
Navigate to the Runtime Catalog from the CML start page, and click Add Runtime.
In the next step, paste in the link to the image you pushed in Step 1 and click Validate. Your CML instance needs access to this container registry to pull the image. If this is a restricted or air-gapped installation, public container registries might not work, and you will need a private container registry deployed in an accessible network location. The validation process confirms that the image has the correct labels and can be imported. Click Add to Catalog when you are ready.
Step 3: Use RStudio
Assuming all went well in the last step, you should now be able to use RStudio as an editor when starting a new session. From the New Session page, select RStudio as the editor.
Once the session launches, you will see a familiar view of RStudio embedded into the CML UI, in the same way that JupyterLab is embedded.
While this process is specific to RStudio, it should work for any web-based editor that can be configured to run on a specific port.
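The image reference used in Step 1 has three parts: a registry namespace, an image name, and a CalVer-style tag. The sketch below composes that reference and the two Docker commands as argument lists. It is illustrative only: the helper names are hypothetical, and the tag pattern is an assumption inferred from the single example 2022.04-8, not a rule enforced by CML or Docker.

```python
import re
from typing import List

# Assumed tag shape, inferred only from the article's example "2022.04-8".
CALVER_TAG = re.compile(r"^\d{4}\.\d{2}-\d+$")


def image_reference(namespace: str, name: str, tag: str) -> str:
    """Compose '<namespace>/<name>:<tag>' after a basic tag sanity check."""
    if not CALVER_TAG.match(tag):
        raise ValueError(f"tag {tag!r} does not look like YYYY.MM-N")
    return f"{namespace}/{name}:{tag}"


def docker_commands(reference: str) -> List[List[str]]:
    """Return the build and push commands from Step 1 as argv lists."""
    return [
        ["docker", "build", "-t", reference, ".", "-f", "Dockerfile"],
        ["docker", "push", reference],
    ]


ref = image_reference("peterableda", "rstudio-cloudera-runtime", "2022.04-8")
for argv in docker_commands(ref):
    print(" ".join(argv))
```

Swapping in your own namespace and image name gives the commands to run for your registry; the lists could also be passed to subprocess.run if you want to script the build.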
09-21-2021
06:21 AM
With the new runtimes feature available for both CML and CDSW, it is now possible to make better use of the remote editing capabilities with VS Code.
To use this feature, you will need VS Code installed locally and a CML/CDSW instance that supports runtimes and has remote editing enabled (it is enabled by default, but it can be disabled in the Admin settings). You also need to be able to reach the server from the machine you're connecting from. If you have all of that in place, you are good to go.
Step 1 - Configure cdswctl CLI
For the first step, you will need to have a copy of the cdswctl command line tool on your local machine. You can get the cli directly from the CML/CDSW instance by going to User Settings > Remote Editing.
The installation process is documented here and you need to get a version that works for your local OS. I'm on a Mac and I have the cdswctl cli in my /usr/local/bin directory:
% which cdswctl
/usr/local/bin/cdswctl
To connect to your CML/CDSW instance, you need to know the URL for the main page, your username, and your Legacy API key. You should already have the first two; your Legacy API key can be found by going to User Settings > API Keys.
Currently, CML/CDSW is still using the Legacy API key for remote authentication, but this will be converted to the new API Key format in an upcoming release. Make a note of the Legacy API key.
As the connection to the CML/CDSW instance is over SSH, you will need an SSH key pair on your local machine that you can use to authenticate with. If you don't have an SSH key pair, you can generate your own one. You then need to add the public key to the CML/CDSW server in User Settings > Remote Editing.
Following is an example of the public SSH key I used for this setup:
% cat ~/.ssh/id_rsa_hadoop.pub
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDbKNjtDWoeATXCj6byhs.....
I copied that value from the terminal window and pasted it into the SSH Public Key box in the User Settings > Remote Editing page, and clicked Add.
If it's a valid key, you should see a fingerprint ID for that key in the list.
Step 2 - Connect cdswctl cli to CML/CDSW Instance
Once you have the cdswctl cli installed and your SSH key added to the CML/CDSW instance, you can create a remote connection.
The command you use to connect is:
% cdswctl login -u http(s)://[your-cml-cdsw-instance-url]/ -n [your username] -y [your-legacy-api-key]
Example:
% cdswctl login -u https://ml-1651c51d-946.jf-ml-aw.a465-9q4k.cloudera.site/ -n jfletcher -y ud5r3hx3zlunjuazzvfhd5dj0y77ib2l
If it works, you will get back the response: Login succeeded
You have now configured the cdswctl cli to connect to the CML/CDSW instance you wish to use. The next step is to set up an ssh-endpoint that creates a tunnel from your local machine to a session running on the CML/CDSW instance in the project you want to work on. However, there are new steps here, as this project uses ML Runtimes, which work slightly differently from the legacy engine implementation. The cdswctl cli requires that you provide the identifier of the runtime used to start the session, so you need the numerical ID of the runtime you want to use. The cdswctl cli can list the available runtimes for you to pick from using the runtimes list option.
However, when you run the command, you will get a lot of hard-to-read JSON:
% cdswctl runtimes list
{"runtimes":[{"id":39,"imageIdentifier":" 3.6","edition":"Nvidia GPU","shortVersion":"2021.06","fullVersion":"2021.06.1-b5","maintenanceVersion":1,"description":"Python runtime with CUDA libraries provided by Cloudera"},{"id":40,"imageIdentifier":" 3.6","edition":"Standard","shortVersion":"2021.06","fullVersion":"2021.06.1-b5","maintenanceVersion":1,"description":"Standard edition JupyterLab Python runtime provided by Cloudera"},{"id":41,"imageIdentifier":" 3.7","edition":"Nvidia GPU","shortVersion":"2021.06","fullVersion":"2021.06.1-b5","maintenanceVersion":1,"description":"Python runtime with CUDA libraries provided by Cloudera"}
With 20+ runtimes by default, this becomes difficult to read. To fix this, use the jq tool. Once it is installed, you can pipe the output from cdswctl to jq to format and filter the results. Even without any filtering, jq presents the JSON in a much more readable format. For this project, let's assume we are not using GPUs, so we need a Runtime that has JupyterLab (for VS Code to use), Python 3.7 (why? because!), and the Standard edition, as we don't need any CUDA stuff. We can filter for this runtime using the following query in jq:
% cdswctl runtimes list | jq '.runtimes| .[] | select( .imageIdentifier | contains("docker.repository.cloudera.com/cdsw/ml-runtime-jupyterlab-python3.7-standard" ))'
{
"id": 42,
"imageIdentifier": "docker.repository.cloudera.com/cdsw/ml-runtime-jupyterlab-python3.7-standard:2021.06.1-b5",
"editor": "JupyterLab",
"kernel": "Python 3.7",
"edition": "Standard",
"shortVersion": "2021.06",
"fullVersion": "2021.06.1-b5",
"maintenanceVersion": 1,
"description": "Standard edition JupyterLab Python runtime provided by Cloudera"
}
The Runtime ID value we need is 42. If you don't want the additional JSON info, you can add | .id to the query to return only the Runtime ID value.
% cdswctl runtimes list | jq '.runtimes| .[] | select( .imageIdentifier | contains("docker.repository.cloudera.com/cdsw/ml-runtime-jupyterlab-python3.7-standard" )) | .id'
42
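If jq is not available, the same filter can be sketched in Python with the standard json module. This is only an illustration: find_runtime_ids is a hypothetical helper, and the sample data holds a trimmed copy of the record shown above plus one made-up entry for contrast. In practice you would feed it the real output of cdswctl runtimes list.

```python
import json
from typing import List

# Sample shaped like `cdswctl runtimes list` output. The first record is a
# trimmed copy of the one shown above; the second is a made-up entry.
SAMPLE = json.dumps({
    "runtimes": [
        {
            "id": 42,
            "imageIdentifier": "docker.repository.cloudera.com/cdsw/"
                               "ml-runtime-jupyterlab-python3.7-standard:2021.06.1-b5",
            "edition": "Standard",
        },
        {"id": 99, "imageIdentifier": "example.invalid/other-runtime:0.0"},
    ]
})


def find_runtime_ids(runtimes_json: str, image_fragment: str) -> List[int]:
    """Return IDs of runtimes whose imageIdentifier contains image_fragment."""
    runtimes = json.loads(runtimes_json).get("runtimes", [])
    return [
        rt["id"]
        for rt in runtimes
        if image_fragment in rt.get("imageIdentifier", "")
    ]


print(find_runtime_ids(SAMPLE, "ml-runtime-jupyterlab-python3.7-standard"))
```

Returning a list rather than a single value makes it obvious when a fragment matches more than one runtime, which is worth checking before passing an ID to the cli.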
Now that you have the required info, you can create a remote ssh-endpoint connection using:
% cdswctl ssh-endpoint -p test -r 42 -c 2 -m 4
Forwarding local port 4540 to port 2222 on session tkfm59z7hbowvv9p in project jfletcher/test.
You can SSH to the session using
ssh -p 4540 cdsw@localhost
The ssh-endpoint command takes a few options. -p test sets the project (with the incredibly creative name) to connect to. -r 42 is the runtime to use when launching the session. -c 2 -m 4 sets the number of CPUs and the GB of RAM for the session, respectively. This is the same as when you launch a session from the Workbench directly. The default session size for remote access sessions is too small for VS Code, as it installs and runs some helper files on the remote session to do useful things. If the session doesn't have enough memory, it will be killed.
Once this is working you can try connecting to the remote session from your local machine by running the SSH command as shown.
% ssh -p 4540 cdsw@localhost
cdsw@tkfm59z7hbowvv9p:~$
Each time you configure a new project or new CML/CDSW instance, the cli will allocate a random port number, but the port number remains the same for successive connections to that same project and CML/CDSW instance.
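Because the local port is chosen by the cli, any script that wraps the tunnel has to read the port from the command's output. Here is a rough sketch that builds the ssh-endpoint command and parses the "Forwarding local port" line shown above. The helper names are hypothetical, and the regular expression assumes the message format stays exactly as printed in this article.

```python
import re
from typing import List, NamedTuple

# Matches the status line printed by `cdswctl ssh-endpoint` in this article.
FORWARD_LINE = re.compile(
    r"Forwarding local port (\d+) to port (\d+) "
    r"on session (\S+) in project (\S+)\."
)


class Forwarding(NamedTuple):
    local_port: int
    remote_port: int
    session: str
    project: str


def ssh_endpoint_command(project: str, runtime_id: int,
                         cpu: int, memory_gb: int) -> List[str]:
    """Build the argv for the ssh-endpoint call using the -p/-r/-c/-m options."""
    return ["cdswctl", "ssh-endpoint", "-p", project, "-r", str(runtime_id),
            "-c", str(cpu), "-m", str(memory_gb)]


def parse_forwarding(output: str) -> Forwarding:
    """Extract ports, session ID and project from the cli output."""
    match = FORWARD_LINE.search(output)
    if match is None:
        raise ValueError("no forwarding line found in output")
    local_port, remote_port, session, project = match.groups()
    return Forwarding(int(local_port), int(remote_port), session, project)


SAMPLE_OUTPUT = (
    "Forwarding local port 4540 to port 2222 on session tkfm59z7hbowvv9p "
    "in project jfletcher/test.\n"
    "You can SSH to the session using\n"
    "    ssh -p 4540 cdsw@localhost\n"
)

info = parse_forwarding(SAMPLE_OUTPUT)
print(" ".join(ssh_endpoint_command("test", 42, 2, 4)))
print(f"ssh -p {info.local_port} cdsw@localhost")
```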
Step 3 - Configure VS Code
Now that you have established a remote connection, you need to configure VS Code. This process uses the Remote SSH extension for VS Code. You can find and install this in the Extensions section.
There are a couple of ways of connecting, but the easiest is to go to the new Remote SSH section and add a new target.
This will prompt you for the connection details; in this case: ssh -p 4540 cdsw@localhost
The process will ask you to add this host to your SSH config file and you will see the new SSH target available.
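The entry added to your SSH config file is a short Host stanza. As a sketch of what such an entry contains, the hypothetical helper below renders one from the port reported by cdswctl. The host alias and the optional identity file are assumptions that you would replace with your own values; localhost and the cdsw user come from the ssh command shown above.

```python
from typing import Optional


def ssh_config_entry(alias: str, port: int,
                     identity_file: Optional[str] = None) -> str:
    """Render an SSH config stanza for a cdswctl tunnel on localhost."""
    lines = [
        f"Host {alias}",
        "    HostName localhost",  # the tunnel always listens locally
        "    User cdsw",           # remote sessions use the cdsw user
        f"    Port {port}",        # local port reported by cdswctl
    ]
    if identity_file:
        lines.append(f"    IdentityFile {identity_file}")
    return "\n".join(lines) + "\n"


entry = ssh_config_entry("cml-test", 4540, "~/.ssh/id_rsa_hadoop")
print(entry)
```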
You can connect to this session now. There are a few ways to do this, but the easiest is just right click on the new host:
You can connect your existing window or open a new window.
Note: The first time you connect to this server, you'll be prompted to accept the server's SSH fingerprint.
You are now connected to CML/CDSW. VS Code still needs to install and run some helper services on the remote server, so you will see this running on the first connection and it takes a few minutes to complete.
Once completed, you can start editing code.
Click 'Open Folder' and navigate to /home/cdsw - this should auto-populate the path. You will now see the project files.
Step 4 - Configure VS Code Extensions
As it stands now, you can edit files and get access to the CML/CDSW terminal with the current remote connection. However, VS Code is significantly more useful if you install the language extensions. For this example, we will install the Python extensions in the remote session. If you do not have the VS Code python extensions installed in your local VS Code instance, do that first.
From there you can use the Extensions section to install the VS Code Python extension into the remote instance. VS Code includes Pylance and Jupyter extensions with the Python extension.
From here you are good to go for editing Python files and notebooks with all of VS Code's Python coding capabilities.
There is one last useful feature that I use a lot and that is the ability to run code selections in an interactive window. It's like a temporary Jupyter notebook and behaves more like the CML/CDSW workbench than a normal Jupyter notebook. To start the process, select some code and either hit Shift + Enter (that is on a Mac, but I think it's the same for Windows/Linux), or right-click and select - Run Selection / Line in Interactive Window.
This will start the temporary Jupyter session. The first time you do this or open a Jupyter Notebook for the first time, you will be prompted for the Jupyter connection method. Use the Default.
From there, you will have access to a Jupyter-like session that you can interact with from a normal Python file.
09-29-2020
04:13 AM
2 Kudos
The webinar for this Prototype is available on this Cloudera Events page.
This is the third of the Applied Machine Learning Prototypes. (Here are the links for the Customer Churn and Anomaly Detection prototypes.) These are prototypes that will help you build a fully working machine learning example in CML. The Prototypes include source data and walk through the following steps:
Ingest data into a useful place in CDP (e.g. a Hive Table)
Explore the data set
Create a plan to build a model
Train the model
Deploy the model
Build and deploy an application
Once you have deployed the template and all the CML artifacts that go with it, you can unpick and work it backward to map the process to your own data in your own environment.
Our latest Prototype - Sentiment Analysis - is now available. To get up and running with it, do the following:
Log in to your CML workspace and create a project using the following repo: https://github.com/fastforwardlabs/cml_sentiment_analysis
This is the URL to the Git section in the Initial Setup:
This will deploy the files into your CML instance and will look like the following:
From here, follow the instructions in the README. It will take you through all the steps required to build a Sentiment Analysis application using Shiny that lets you interact with two different sentiment prediction models.
07-15-2020
02:04 AM
See the webinar for this Prototype here: https://www.cloudera.com/about/events/webinars/building-diverse-anomaly-detection-capabilities-for-business-with-cdp.html
We now have our second Applied Machine Learning Prototype available (go here for the previous Prototype - Customer Churn). These are prototypes (or templates, as they were called) that will help you build a fully working machine learning example in CML. The Prototypes include source data and walk through the following steps:
Ingest data into a useful place in CDP (e.g. a Hive Table)
Explore the data set
Create a plan to build a model
Train the model
Deploy the model
Build and deploy an application
Once you have deployed the template and all the CML artifacts that go with it, you can unpick and work it backward to map the process to your own data in your own environment.
Our latest Prototype - Fraud Detection - is now available. To get up and running with it, do the following:
Log in to your CML workspace and create a project using the following repo: https://github.com/fastforwardlabs/cml_fraud_demo
This is the URL to the Git section in the Initial Setup:
This will deploy the files into your CML instance and will look like the following:
From here, follow the instructions in the README. If you just want to deploy the whole project and get the application up and running quickly, launch a new Workbench session:
Once the Workbench is open, open file 6_build_project.py and run the file:
When the script completes the run, your project will look like the following:
Launch the application from the Applications tab and click on the blue arrow next to the name:
This will open the application in a new window. This app is a Dash app that shows some sample data and the prediction that the model made.
05-18-2020
03:05 AM
Go here for the latest Prototype - Fraud Detection
See the webinar for this Prototype here: https://www.cloudera.com/about/events/webinars/build-a-customer-churn-insights-application-with-cdp.html
The complexity of creating an end-to-end machine learning workflow is one of the biggest hurdles data scientists and machine learning engineers face. It's not running model.fit() that is hard. The hard parts are ingesting data, getting the data into the right format for the model training process, deploying the model in a way that is accessible to other parts of the business, and running applications that consume the model. Machine Learning is useful when it's deployed with an end-to-end workflow.
We have been working to create Applied Machine Learning Prototypes for CML that will help you build a fully working machine learning example in CML. The Prototypes will include source data, and walk through various steps:
Ingest data into a useful place in CDP (e.g. a Hive Table)
Explore the data set
Create a plan to build a model
Train the model
Deploy the model
Build and deploy an application
Once you have deployed the template and all the CML artifacts that go with it, you can unpick and work it backward to map the process to your own data in your own environment.
The first Applied Machine Learning Prototype is now available - Churn. To get up and running with it, do the following:
Log in to your CML workspace and create a project using the following repo:
https://github.com/fastforwardlabs/cml_churn_demo_mlops
This is the URL to the Git section in the Initial Setup:
This will deploy the files into your CML instance and will look like the following:
From here, follow the instructions in the README. If you just want to deploy the whole project and get the application up and running quickly, launch a new Workbench session:
Once the Workbench is open, open file 8_build_project.py and run the file:
When the script completes the run, your project will look like the following:
Launch the application from the Applications tab and click on the blue arrow next to the name:
This will open the application in a new window. The initial view is a randomly selected table from the dataset. This shows a global view of which features are most important for the predictor model. The reds show increased importance for predicting a customer that will churn and the blues for customers that will not.
Click on any single row to view a "local" interpreted model for that particular data point instance. Here, you can see how adjusting any one of the features will change the instance's churn prediction.
Changing the InternetService to DSL lowers the probability of churn. Note: This does not mean that changing the Internet Service to DSL causes the probability to go down; this is just what the model would predict for a customer with those data points.
11-13-2019
01:11 AM
5 Kudos
A few years ago I switched to using VS Code as my main code / text editor. I find it meets all my personal code development needs. With the release of the new BYOE functionality in CDSW 1.6 and CML, you can now use VS Code to remotely edit (and debug) Python, R and probably Scala code too. Plus you can also run and edit Jupyter Notebooks, all inside VS Code. This is a quick how-to to get it working.
Getting Connected
To start, you need to set up the Remote Editing feature for your CDSW/CML cluster. You must download the CLI client and add an SSH public key. The next step is to authenticate and connect to the CDSW server using the CLI client from your local machine. Once you are connected, you should see something like this:
$ cdswctl ssh-endpoint -p ml-at-scale -m 4 -c 2
Forwarding local port 7847 to port 2222 on session bhsb7k4eqmonap62 in project jfletcher/ml-at-scale.
You can SSH to the session using
ssh -p 7847 cdsw@localhost
Now you need to add an entry into your SSH config file. On my Mac, I created the following:
$ cat ~/.ssh/config
Host cdsw-public
    HostName localhost
    IdentityFile ~/.ssh/id_rsa_hadoop
    User cdsw
    Port 7847
HostName is always localhost and User is always cdsw. You will get the Port number from the previous step.
Now for setting up VS Code. At a minimum, you need to install the Remote SSH extension. I find the Remote SSH - Edit extension useful for adding different servers to my SSH config file quickly as well. Additionally, you will probably want to install the Python and R extensions to help with coding tasks.
With everything installed and ready to go, you can start a remote connection to your CDSW/CML server. Start by opening the command palette and connecting to a remote host. Then connect to the host you added previously.
If it's the first time you are connecting to a new session, or the port number has changed, you need to accept the fingerprint. You might not see the prompt pop up, so pay attention to VS Code.
While VS Code connects and sets up the remote connection, it installs some helper applications on the CDSW/CML server. Sometimes the remote session dies. Just click Retry or, if it's taking a long time, restart the remote session and it will recover.
Note: If you get stuck in a loop during setup, with VS Code reconnecting every 30 seconds or so, the issue is with the lock file VS Code creates during the install. Close VS Code and, in a CML terminal, delete the /home/cdsw/.vscode-server/ directory and start again.
Once you are connected, you can open the Explorer and view and edit the files in the /home/cdsw directory. From there you can edit any of the files on your CDSW/CML server.
This already gets you to a good place to remotely edit and modify your CDSW/CML files, but VS Code has some powerful coding tools that you can take advantage of over the remote connection.
Python
To take full advantage of VS Code's Python tools, you must install the Python extension into the remote SSH session. You have to install the extension the first time you connect to a newly configured remote session, but it's reasonably quick.
With the extension installed, once you open your first Python file, you will be prompted to install the pylint linter. When you click Install, VS Code will open a terminal and run the code needed to install the linter. It's important to note that this is a remote terminal, running on an engine in CDSW/CML. It's the same as if you launched a terminal inside a running workbench.
If you want to run arbitrary Python code inside VS Code, open a Python file, select some code, right-click and choose "Run Selection/Line in Python Terminal". You can also just hit Shift-Enter in the code editor window. This will open up a new terminal if there isn't one and run the selected code.
And since this is a remote session, you can run pyspark directly inside VS Code. For more complex code requirements, you can also use the Python Debugging feature in VS Code.
R
The R extension provides similar capabilities to the Python one. This means you can edit R files with code completion and execute arbitrary code in the terminal. With sparklyr, you can run Spark code using R inside VS Code.
There is a trick to R though: you will have to set the path to the R binary correctly, as the default might not work. Check where your R binary lives in CDSW by running which R and then pasting the result into the R > Rterm: Linux setting in VS Code. It is most likely /usr/local/bin/R, but it's best to check.
Jupyter
The other really nice feature of VS Code is that you can work on Jupyter Notebooks within it. This gives you all the great code completion, syntax highlighting and documentation hints that are part of the VS Code experience, together with the interactivity of a Jupyter Notebook. Any changes you make to the Notebook will be reflected on the CDSW/CML server and can be viewed online by using Jupyter Notebook as a browser-based editor.
Getting Jupyter working on CML is slightly tricky though. Because of the way CML uses the internal networking and port forwarding of Kubernetes, when VS Code launches a Jupyter Server it binds to the wrong address and access is blocked. You therefore have to launch your own Jupyter Server and tell VS Code to connect to that. Note that this does not apply to CDSW.
The first setting you need to set is the Python > Data Science: Jupyter Server URI setting. Set this to http://127.0.0.1:8888/?token=[some-token]
Then you need to open a terminal to launch a Jupyter Notebook server. You can launch it using:
/usr/local/bin/jupyter-notebook --no-browser --ip=127.0.0.1 --NotebookApp.token=[some-token] --NotebookApp.allow_remote_access=True
This creates a Jupyter server that any new Notebooks you launch will run in.
Another feature that you can use with VS Code is running a temporary Notebook for executing random code snippets. Select the code you want to run, right-click and click on "Run Current File in Python Interactive Window". This is less robust though and will create loads of Untitled*.ipynb files in your home directory.
Git Integration
VS Code also has substantial Git integration. If you created your project from a git repo or a custom template, your changes and outside changes made to the repo will automatically appear.
One Final Tip
You can limit the number of files shown in the Explorer view. If you end up with loads of .[something] directories in /home/cdsw, it can be hard to navigate. If you add the **/.* pattern to the Files: Exclude setting, it will hide all those files and directories for you.
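For the Jupyter setup described above, the token in the Jupyter Server URI setting has to match the token passed to the notebook server. A minimal sketch that generates one token and derives both strings from it follows. The helper name is hypothetical; the flags and the 127.0.0.1:8888 address are taken from the command and URI in this article, and it assumes the server ends up on Jupyter's default port 8888.

```python
import secrets
from typing import List, Tuple


def jupyter_server_setup(token: str) -> Tuple[List[str], str]:
    """Return the server argv and the matching VS Code URI for one token."""
    argv = [
        "/usr/local/bin/jupyter-notebook",
        "--no-browser",
        "--ip=127.0.0.1",
        f"--NotebookApp.token={token}",
        "--NotebookApp.allow_remote_access=True",
    ]
    # Assumes the server listens on port 8888, as in the article's URI.
    uri = f"http://127.0.0.1:8888/?token={token}"
    return argv, uri


token = secrets.token_hex(16)  # random 32-character hex token
argv, uri = jupyter_server_setup(token)
print(" ".join(argv))  # run this in the remote terminal
print(uri)             # paste this into the Jupyter Server URI setting
```

Generating the token once and reusing it in both places avoids the mismatch that otherwise shows up as an authentication failure when VS Code tries to connect.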