With the recent addition of the Cloudera Operational Database (COD) experience to CDP Public Cloud, we want to explore how it can be leveraged in a real-life 'DataFlow' end-user scenario. This article shows how to run a Spark/PySpark job in CML for a modeling task using data residing in COD: we read a table from COD and, once the prediction is done, write the score table back to COD.
CDP Runtime (supporting COD) >=7.2.2
We assume that a CDP environment, a Data Lake, and a Data Hub (Data Engineering) cluster have already been provisioned. We further assume that the COD and CML experiences have been provisioned for the target CDP environment.
Note: If you are just getting started with CDP, refer to The world’s first enterprise data cloud to learn how to put all of these prerequisites in place.
Some of the following steps are already documented in this blog (thanks @shlomi Tubul). On top of that, we elaborate and expand on what needs to be done for the CML-COD use case.
The first thing we need to do is to create a database in COD:
Next, provision CML:
Once CML is provisioned, we create a project in the workspace. We use the local template and upload the required files to it. create_model_and_score_phoenixTable.py is the PySpark script that we will use for the task.
The PySpark script we used can be found here. Although the code in this file was written for CML-CDSW integration (for on-premises setups), we modified it slightly to work on the cloud-native platform, i.e. CDP Public Cloud.
# Copy the HBase client configuration (hbase-site.xml) from the project directory
# into the Spark configuration directory so that Spark can reach the COD
# HBase/Phoenix cluster, and make the file readable.
!cp /home/cdsw/hbase-site.xml /etc/spark/conf/
!chmod 644 /etc/spark/conf/hbase-site.xml
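With the client configuration in place, the Spark session can be created as usual. The linked script contains the actual session setup; the snippet below is only a minimal sketch of the idea, and the connector jar path shown is a placeholder that depends on your CDP runtime.

from pyspark.sql import SparkSession

# Minimal sketch (not the exact code from the linked script): start a Spark session
# that can talk to COD. The jar path below is a placeholder for the Phoenix-Spark
# connector shipped with your CDP runtime.
spark = SparkSession.builder \
    .appName("CML-COD scoring") \
    .config("spark.jars", "<path-to-phoenix-spark-connector-jar>") \
    .getOrCreate()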
"""""""
same code section from the git file
""""""""
target_path = "<path to the location (in our case, an external S3 bucket) where the data resides>"
"""""""
same code section from the git file
""""""""
Everything else in the file remains the same.
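To give an idea of what the read from and write back to COD look like, here is a minimal sketch using the Phoenix-Spark connector. It is not the exact code from the script: the table names, the zkUrl value, and the scoring step are illustrative placeholders, and zkUrl may be unnecessary when hbase-site.xml is already on the Spark classpath.

# Sketch only: table names, columns, and zkUrl are illustrative placeholders.
# Read the input table from COD through the Phoenix-Spark connector.
input_df = spark.read \
    .format("phoenix") \
    .option("table", "INPUT_TABLE") \
    .option("zkUrl", "<cod-zookeeper-quorum>:2181") \
    .load()

# ... train the model and score input_df here (see the linked script) ...
scored_df = input_df  # placeholder for the DataFrame holding the predictions

# Write the score table back to COD; the Phoenix-Spark connector performs
# upserts and expects the overwrite save mode.
scored_df.write \
    .format("phoenix") \
    .option("table", "SCORE_TABLE") \
    .option("zkUrl", "<cod-zookeeper-quorum>:2181") \
    .mode("overwrite") \
    .save()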