subratadas · ‎02-18-2021

Running 'pyspark' applications in CML for model generation and prediction, with data residing in COD

With the recent addition of the Cloudera Operational Database (COD) experience to CDP Public Cloud, we want to explore how it can be leveraged in a real-life 'DataFlow' end-user scenario. This article describes how to run a Spark/pyspark job in Cloudera Machine Learning (CML) that performs a modeling task on data residing in COD. We read a table from COD and, once the prediction is done, write the score table back to COD.

Getting Started

CDP Runtime (supporting COD) >=7.2.2

We assume that the CDP environment, datalake, and datahub (Data Engineering) have been provisioned. We further assume that the COD and CML experiences have been provisioned for the target CDP environment.

Note: If you are just starting with CDP, please refer to "The world’s first enterprise data cloud" to see how all of these requirements can be put in place with ease.

Some of the following steps are already documented in this blog (thanks @shlomi Tubul). Here we elaborate and expand on what needs to be done for the CML-COD use case.

Main components used in this demo:

- Cloudera Operational Database (COD), as mentioned in my previous post, is a managed dbPaaS solution available as an experience in Cloudera Data Platform (CDP).
- CML is designed for data scientists and ML engineers, enabling them to create and manage ML projects from code to production. Main features of CML:
  - Development Environment for Data Scientists, Isolated, Containerized, and Elastic
  - Production ML Toolkit – Deploying, Serving, Monitoring, and Governance of ML models
  - App Serving – Build and Serve Custom applications for ML use-cases

Setting Up the Environment

The first thing we need to do is to create a database in COD:

1. Log in to the Cloudera Data Platform (CDP) Public Cloud 'Control Plane' (CP).
2. Select Operational Database and then click Create Database.
3. Select the environment to which the COD database will be attached, give the database a unique name, and then click Create Database.
4. Once the database is created, open its COD page and use the HBase Client Configuration URL to get the hbase-site.xml needed in CML.
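
If the client configuration from that URL arrives as a zip archive (an assumption here; check what your COD instance actually returns), a small helper can pull hbase-site.xml out of it before the upload to CML. The function name and archive layout below are illustrative and not part of the original script.

```python
import zipfile
from pathlib import Path


def extract_hbase_site(archive_path, dest_dir):
    """Copy hbase-site.xml out of a client-config zip into dest_dir."""
    dest = Path(dest_dir) / "hbase-site.xml"
    with zipfile.ZipFile(archive_path) as archive:
        # The file may sit in a subfolder, so match on the base name only.
        matches = [n for n in archive.namelist() if Path(n).name == "hbase-site.xml"]
        if not matches:
            raise FileNotFoundError("hbase-site.xml not found in " + str(archive_path))
        dest.write_bytes(archive.read(matches[0]))
    return dest
```

For example, `extract_hbase_site("clientConfig.zip", ".")` would leave hbase-site.xml in the current directory, ready to be uploaded to the CML project.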

Next, provision CML:

1. Log in to the CDP Public Cloud CP.
2. Select Machine Learning and click Provision Workspace.
3. Select the environment in which the CML workspace will be provisioned, give the workspace a unique name, and then click Provision Workspace.

Create Project in CML: Model and Prediction

Once CML is provisioned, we go ahead and create a project in the workspace. We will be using the local template and upload the required files to it. create_model_and_score_phoenixTable.py is the pyspark script that we will be using for the task.

[Screenshot 2021-02-18 at 3.19.47 PM.png]

CML: Configuration for use in CML session

1. Upload the configuration files we downloaded from COD (via the HBase Client Configuration URL in the COD setup steps). The CML session needs the hbase-site.xml file to connect to COD (see picture above).
2. Configure the spark-defaults.conf file with the jars to be used. If the data is read from external cloud storage, configure that too, so that Spark can authenticate with IDBroker and get access.

Note: Since our data is in an external S3 bucket, we added the appropriate IDBroker mapping to give the user access to this external bucket.
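
As a rough illustration of the spark-defaults.conf entries described above, the sketch below renders two standard Spark properties: `spark.jars` for extra jars and `spark.yarn.access.hadoopFileSystems` for additional filesystems. The helper name is made up for this example, the values are placeholders, and which properties your environment actually needs is an assumption to verify against the CML documentation.

```python
def render_spark_defaults(jars, external_filesystems):
    """Return spark-defaults.conf text for the given jars and filesystems."""
    lines = []
    if jars:
        # spark.jars takes a comma-separated list of jar paths.
        lines.append("spark.jars=" + ",".join(jars))
    if external_filesystems:
        # Extra filesystems (for example an external S3 bucket) the job reads from.
        lines.append(
            "spark.yarn.access.hadoopFileSystems=" + ",".join(external_filesystems)
        )
    return "\n".join(lines) + "\n"


# Placeholder values: replace them with the connector jar and bucket you use.
print(render_spark_defaults(
    ["<path to the Phoenix Spark connector jar>"],
    ["s3a://<external-bucket>"],
))
```

The printed text can be pasted into the project's spark-defaults.conf file.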

Running the Task

The pyspark script we used can be found here.

Although the code in this file was written for CML-CDSW integration (for on-prem setups), we modified it slightly to work on the cloud-native platform, i.e. CDP Public Cloud.

First, we added two lines at the start of the script file. They copy the hbase-site.xml config into Spark's default conf directory, which the connection to COD currently requires, and make the file readable by all users. (There is no way to override this as of now, so this workaround is needed.)

We also modified target_path, the location of the temporary files generated by the Spark job, because the user who runs the job (ours was given the "MLUser" permission on the environment) needs access to the specified location.

!cp /home/cdsw/hbase-site.xml /etc/spark/conf/
!chmod 644 /etc/spark/conf/hbase-site.xml

# ... same code section as in the git file ...
    target_path = "<path to the location (in our case, an external S3 bucket) where the data resides>"
# ... same code section as in the git file ...

The rest of the file is unchanged.
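
The two `!` lines rely on the notebook-style shell escape of the CML session. If the script has to run somewhere that escape is unavailable, the same workaround can be sketched in plain Python. This is a minimal equivalent using the paths from the commands above as defaults; the function name is hypothetical.

```python
import os
import shutil


def install_hbase_site(src="/home/cdsw/hbase-site.xml", conf_dir="/etc/spark/conf"):
    """Copy hbase-site.xml into Spark's conf dir and make it world-readable."""
    dest = os.path.join(conf_dir, "hbase-site.xml")
    shutil.copyfile(src, dest)
    os.chmod(dest, 0o644)  # read/write for the owner, read-only for everyone else
    return dest
```

Calling `install_hbase_site()` at the top of the script, before the Spark session is created, mirrors the `cp` and `chmod` commands.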

Start running the project

1. Click New Session.
2. Give the session a name and click the Start Session button at the bottom (adjust the Workbench, kernel, and Resource Profile if the project requires it).
3. Once the session has started, select the pyspark script file and click the Run icon in the menu above the file contents.
4. Once the execution starts, the session logs and task logs tabs appear on the right half of the screen.

The logs stop when the script execution completes (success or failure).

There we have it: on success, the table (BatchTable2) is created in COD.
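
To confirm the result from the same session, the table can be read back through the Phoenix Spark connector. The sketch below is not taken from the original script. It assumes the connector jar is on the Spark classpath, that the data source is registered as `phoenix` (older connector versions use `org.apache.phoenix.spark`), and that the ZooKeeper URL can be assembled from hbase-site.xml in the `quorum:port:znode` form. Both function names are illustrative.

```python
import xml.etree.ElementTree as ET


def zk_url_from_hbase_site(path):
    """Build a 'quorum:port:znode' ZooKeeper URL from hbase-site.xml."""
    props = {}
    for prop in ET.parse(path).getroot().iter("property"):
        props[prop.findtext("name")] = prop.findtext("value")
    quorum = props["hbase.zookeeper.quorum"]
    # Fall back to the usual HBase defaults when a property is absent.
    port = props.get("hbase.zookeeper.property.clientPort", "2181")
    znode = props.get("zookeeper.znode.parent", "/hbase")
    return quorum + ":" + port + ":" + znode


def read_phoenix_table(spark, table, zk_url, source="phoenix"):
    """Load a Phoenix table as a DataFrame using an existing SparkSession."""
    return (
        spark.read.format(source)
        .option("table", table)
        .option("zkUrl", zk_url)
        .load()
    )
```

With an active session, `read_phoenix_table(spark, "BatchTable2", zk_url_from_hbase_site("/etc/spark/conf/hbase-site.xml")).show(5)` would display a few rows of the score table, provided the assumptions above hold in your environment.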

The session can be closed manually by clicking the Stop button at the top right corner; otherwise it is killed by the auto timeout after a period of inactivity.