
Cloudera Machine Learning provides support for Python 3, which makes it straightforward to connect a session to an operational database.

Provision an Operational Database

  1. Log into a CDP instance
  2. Select Operational Database
  3. Select Create Database
  4. Choose the Cloud environment
  5. Provide a unique name for the database
  6. Click Create Database

Once the database has started, make a copy of the Phoenix (Thin) JDBC URL. This will be used as the connection string.
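The copied JDBC URL needs light trimming before phoenixdb can use it (the steps below describe the same edit by hand). As a minimal sketch, assuming the console's URL follows the usual jdbc:phoenix:thin:url=&lt;endpoint&gt;;key=value;... shape (avatica_url is a hypothetical helper, not part of phoenixdb):

```python
def avatica_url(jdbc_url):
    """Reduce a Phoenix (Thin) JDBC URL to the bare HTTPS endpoint
    expected by phoenixdb: drop the jdbc:phoenix:thin:url= prefix
    and any trailing ;key=value options, keeping the path."""
    prefix = 'jdbc:phoenix:thin:url='
    if jdbc_url.startswith(prefix):
        jdbc_url = jdbc_url[len(prefix):]
    return jdbc_url.split(';')[0]
```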

Create a Machine Learning Project

  1. Within your Cloudera Machine Learning (CML) workspace, create a new project.
  2. Provide a name, and choose a blank initial setup. Create a session, and install phoenixdb using the command:
    !pip3 install phoenixdb
  3. Create a new Python file in the project and paste in the following code.
    Import the required dependencies:
    import phoenixdb
    import io
    import json

    Set up the parameters required to establish the connection with ODB. Refer to the Thin client details.

    opts = {}
    opts['authentication'] = 'BASIC'
    opts['serialization'] = 'PROTOBUF'
    opts['avatica_user'] = 'xxxxxxxx'
    opts['avatica_password'] = 'xxxxxxxx'
    database_url = 'https://<the jdbc url copied from the ODB console>/'
    TABLENAME = "us_population"
    conn = phoenixdb.connect(database_url, autocommit=True, **opts)
  4. For the URL, remove everything before https:// and drop the trailing parameters (the ;key=value pairs at the end), while retaining any path details.

    Example:

    https://<server>/<instance name>/cdp-proxy-api/avatica/
  5. Create the table into which to insert the data:
    curs = conn.cursor()
    query = """
          CREATE TABLE IF NOT EXISTS """+TABLENAME+""" (
          state CHAR(2) NOT NULL,
          city VARCHAR NOT NULL,
          population BIGINT
          CONSTRAINT my_pk PRIMARY KEY (state, city))
            """
    curs.execute(query)
  6. Bulk insert a set of data, using a nested list with one row per record and executing the upserts in a single executemany call.
    sql = "upsert into " + TABLENAME + \
                " (state ,city, population) \
                 values (?,?,?)"
    
    data =[['NY','New York',8143197],
          ['CA','Los Angeles',3844829],
          ['IL','Chicago',2842518],
          ['TX','Houston',2016582],
          ['PA','Philadelphia',1463281],
          ['AZ','Phoenix',1461575],
          ['TX','San Antonio',1256509],
          ['CA','San Diego',1255540],
          ['TX','Dallas',1213825],
          ['CA','San Jose',912332]]
    results = curs.executemany(sql,data)
  7. Finally, run an aggregation query with GROUP BY, returning each row as a dictionary via the DictCursor factory.
    curs = conn.cursor(cursor_factory=phoenixdb.cursor.DictCursor)
    query = """SELECT state as "State",count(city) as "City Count",sum(population) as "Population Sum"
    FROM us_population
    GROUP BY state
    ORDER BY sum(population) DESC"""
    
    curs.execute(query)
    print(curs.fetchall())
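As an aside, step 6 passes all rows to executemany at once, which is fine for this small sample; for much larger datasets the rows can be chunked. A minimal sketch (batches is a hypothetical helper, and 500 is an arbitrary batch size):

```python
def batches(rows, size=500):
    """Yield successive slices of rows with at most `size` records each."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]

# Sketch of use with the cursor and upsert statement from step 6:
# for batch in batches(data, size=500):
#     curs.executemany(sql, batch)
```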

When the above is run in a session, it will return the following results.

[{'State': 'NY', 'City Count': 1, 'Population Sum': 8143197}, {'State': 'CA', 'City Count': 3, 'Population Sum': 6012701}, {'State': 'TX', 'City Count': 3, 'Population Sum': 4486916}, {'State': 'IL', 'City Count': 1, 'Population Sum': 2842518}, {'State': 'PA', 'City Count': 1, 'Population Sum': 1463281}, {'State': 'AZ', 'City Count': 1, 'Population Sum': 1461575}]
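The aggregation can be sanity-checked in plain Python against the data list from step 6; a quick sketch that recomputes the same group-by:

```python
from collections import defaultdict

# Same rows as inserted in step 6.
data = [['NY', 'New York', 8143197],
        ['CA', 'Los Angeles', 3844829],
        ['IL', 'Chicago', 2842518],
        ['TX', 'Houston', 2016582],
        ['PA', 'Philadelphia', 1463281],
        ['AZ', 'Phoenix', 1461575],
        ['TX', 'San Antonio', 1256509],
        ['CA', 'San Diego', 1255540],
        ['TX', 'Dallas', 1213825],
        ['CA', 'San Jose', 912332]]

counts = defaultdict(int)
sums = defaultdict(int)
for state, _city, population in data:
    counts[state] += 1
    sums[state] += population

# Rebuild the same rows as the Phoenix query, ordered by population sum.
rows = sorted(
    ({'State': s, 'City Count': counts[s], 'Population Sum': sums[s]} for s in sums),
    key=lambda r: r['Population Sum'],
    reverse=True,
)
print(rows)
```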

This example is based on the post: Phoenix in 15 minutes or less
