Community Articles

pauldefusco · ‎11-16-2023

The Cloudera Data Platform (CDP) is a hybrid data platform designed to deliver faster and easier data management, analytics and AI at enterprise scale. Cloudera Machine Learning (CML) is one of CDP’s Cloud Native Data Services designed to enable secure, governed Data Science.

Data Scientists and Engineers use CML to securely analyze large amounts of data via interactive notebooks. Since 2022 CML enhances this capability with the Data Connections feature by providing boiler template code to access and push down SQL queries to the CDW service. This allows the user to run large scale queries directly in the Data Warehouse while accessing and visualizing results from a simple notebook with a small consumption footprint.

In 2023 CML doubled down with CML Custom Data Connections. Custom Data Connections allow the CML user to create their own boiler template code so they or their collaborators can easily connect to 3rd party tools such as Snowflake, Postgres, legacy on-prem databases (Oracle, MSSQL, MySQL), serverless cloud databases (Redshift, Snowflake, SAP HANA Cloud, BigQuery), APIs, and specialized data stores (Neo4j).

This article provides a tutorial on how to create a MySQL Custom Data Connection in your own CML Workspace. The code referenced is provided in this git repository.

Data Connections Refresher

You can use Data Connections from a CML Session to connect to a CDP environment (CDW or Spark) regardless of the Editor choice by following the below steps:

Open your CML Session and click on the “Data” tab at the top right of your screen.
Scroll down and select the CDW or Spark Data Lake connection of your choice.
Copy and paste the code in your Editor.
Run the sample SHOW DATABASES command and create your SQL statements.

Learning from the Postgres Custom Data Connection

The CML Engineering Team has created two custom data connection templates that you can reuse to connect to Snowflake and Postgres. In this section we will use the Postgres template to learn how to deploy an existing custom connection.

Steps to Deploy the Postgres Custom Connection to your Project

Create a CML Project by cloning the provided template with this URL: https://github.com/pdefusco/Using_CustomConn_CML.git
Notice the mysqlconn and postgresconn folders are included in the project. Open the postgres folder and familiarize yourself with the code in pg-conn.py. This is the source code for the Custom Connection.

The PostgresCustomImp class is a child of CustomConnection which is imported from cml.data_v1.customconnection. You must extend this class in order to create a custom connection.
The get_base_connection method uses the psycopg2 module to establish a connection to a given Postgres source. The connection requires a hostname, a default port, a database name, a user and password. When implementing your own custom connection you can choose which connection parameters to include as needed.
The get_pandas_dataframe method executes the provided SQL via the Pandas read_sql method.
The get_cursor method returns a cursor object. This is optional as the SQL command is executed in get_pandas_dataframe.
The parameters required to initialize a connection are set manually via the UI (see below). The override_parameters and check_params_or_env methods ensure that these are set correctly. While these methods are also optional, we recommend implementing them.

Navigate to the Workspace Site Administration page and open the Data Connections Tab. Click on the “New Connection” icon.
Fill out the form as shown below:

Name: Custom Postgres
Type: Custom Connection
Type Display: Postgres
Project: select your project
Connection Files: select the “postgresconn” folder
Custom Parameters: create the following four key value pairs:

PG_HOST : hh-pgsql-public.ebi.ac.uk
PG_PORT : 5432
PG_DB : pfmegrnargs
PG_USER : reader

Note: in this example we are connecting to the RNAcentral Public Postgres Database. Please visit this URL for more information: https://rnacentral.org/help/public-database

Navigate back to your project. Open the “Project Settings” page and then the “Data Connections” tab. Make the new connection available in the project by clicking on “Sync with Workspace”.
Launch a CML Session with JupyterLab as your Editor. A small resource profile without GPUs is ok. There is no need to enable a Spark Runtime Add-On.
The Data Connections window will load automatically. Notice the new “Custom Postgres” connection with a reusable template code block.
Open the “using_connections” notebook. Notice the code block to connect to the Postgres database has already been prepopulated for you. Normally you would copy and paste from the Data Connections pop up window.
Execute the first four cells and validate that query results are output in the notebook (do not run the entire notebook!).

Creating a MySQL Custom Data Connection

Now that we have connected to Postgres with a provided Custom Connection template we can follow the same steps to create a custom MySQL connection.

Open the mysqlconn folder and familiarize yourself with the code.

Notice the source code for this connection is in a separate folder. This is required in order to select a folder in the Custom Data Connection creation form in the Workspace Settings.
As in the previous example, we create a MySQLCustomImp class which inherits from the CustomConnection interface in the cml.data_v1.customconnection module. This module does not have to be pip installed and is already provided to you in CML by default.
The implemented class methods are similar to the Postgres example. Notice that in this case we don’t have a method to return the connection cursor. In addition, we leverage the mysql-connector-python package rather than psycopg2.
To use this custom connection, go through the same steps and set the following Custom Parameters:

MYSQL_HOST : mysql-rfam-public.ebi.ac.uk
MYSQL_PORT : 4497
MYSQL_DB : Rfam
MYSQL_USER: rfamro

Note: in this example we are connecting to the Rfam MySQL public database. For more information please visit this URL: https://docs.rfam.org/en/latest/database.html

Conclusions

In this article we highlighted CML Custom Data Connections. In summary:

CML is a Cloud Native Platform for Enterprise Machine Learning. The built-in integrations with the Cloudera Data Platform (CDP) allow CML Users to operate in a secure, governed Machine Learning environment at scale.
CML Data Connections enhance Data Analysis, Exploration, and Model Experimentation by providing Data Scientists with an easy interface to process large amounts of data from a notebook with minimum compute resources.
Custom Data Connections augment this capability by opening access to 3rd party systems such as external RDBMSs or Cloud Vendor Databases such as Snowflake.
CML Users are free to implement their own Custom Connections while giving the CML Admins the ability to approve them at the Workspace level.

What Custom Data Connection are you using in CML? Please don’t hesitate to comment with your favorite external data sources.

Cloudera Community

Community Articles

Using Custom Data Connections in Cloudera Machine Learning

Cloudera Machine Learning (CML)

PandasOnSpark in Cloudera Machine Learning (CML)

New Feature in Cloudera Machine Learning: Data Con...

Using Cloudera Machine Learning to run a Data Engi...

Writing files to Cloudera Machine Learning using A...

Cloudera Machine Learning (CML) - Questions & Answ...

Machine Learning with SQL using Apache Hive and Hi...

How to provision Workspaces in Cloudera Machine Le...

Spark Auto-Scaling with Kubernetes in Cloudera Dat...

Validating Jet Engine Predictive Models Using Clou...

Fraud Detection - Applied Machine Learning Prototy...