
Cloudera Machine Learning provides support for Python 3, which makes it straightforward to connect a session to an operational database.

Provision an Operational Database

  1. Log into a CDP instance
  2. Select Operational Database
  3. Select Create Database
  4. Choose the Cloud environment
  5. Provide a unique name for the database
  6. Click Create Database

Once the database has started, make a copy of the Phoenix (Thin) JDBC URL. This will be used as the connection string.
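The copied JDBC URL needs light trimming before phoenixdb can use it (the steps below describe the same edit by hand). As a minimal sketch, assuming the console's URL follows the usual jdbc:phoenix:thin:url=&lt;endpoint&gt;;key=value;... shape (avatica_url is a hypothetical helper, not part of phoenixdb):

```python
def avatica_url(jdbc_url):
    """Reduce a Phoenix (Thin) JDBC URL to the bare HTTPS endpoint
    expected by phoenixdb: drop the jdbc:phoenix:thin:url= prefix
    and any trailing ;key=value options, keeping the path."""
    prefix = 'jdbc:phoenix:thin:url='
    if jdbc_url.startswith(prefix):
        jdbc_url = jdbc_url[len(prefix):]
    return jdbc_url.split(';')[0]
```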

Create a Machine Learning Project

  1. Within your Cloudera Machine Learning (CML) workspace, create a new project.
  2. Provide a name, and choose a blank initial setup. Create a session, and install phoenixdb using the command:
    !pip3 install phoenixdb
  3. Create a new Python file in the project and paste in the following code.
    Import the required dependencies:
    import phoenixdb
    import io
    import json

    Set up the parameters required to establish the connection with ODB. Refer to the Thin client details.

    opts = {}
    opts['authentication'] = 'BASIC'
    opts['serialization'] = 'PROTOBUF'
    opts['avatica_user'] = 'xxxxxxxx'
    opts['avatica_password'] = 'xxxxxxxx'
    database_url = 'https://<the jdbc url copied from the ODB console>/'
    TABLENAME = "us_population"
    conn = phoenixdb.connect(database_url, autocommit=True, **opts)
  4. For the URL, remove everything before https:// and drop the trailing parameters (the ;key=value pairs at the end), while retaining any path details.

    Example:

    https://<server>/<instance name>/cdp-proxy-api/avatica/
  5. Create the table into which to insert the data:
    curs = conn.cursor()
    query = """
          CREATE TABLE IF NOT EXISTS """+TABLENAME+""" (
          state CHAR(2) NOT NULL,
          city VARCHAR NOT NULL,
          population BIGINT
          CONSTRAINT my_pk PRIMARY KEY (state, city))
            """
    curs.execute(query)
  6. Bulk insert a set of data, using a nested list with one row per record and executing the upserts in a single executemany call.
    sql = "upsert into " + TABLENAME + \
                " (state ,city, population) \
                 values (?,?,?)"
    
    data =[['NY','New York',8143197],
          ['CA','Los Angeles',3844829],
          ['IL','Chicago',2842518],
          ['TX','Houston',2016582],
          ['PA','Philadelphia',1463281],
          ['AZ','Phoenix',1461575],
          ['TX','San Antonio',1256509],
          ['CA','San Diego',1255540],
          ['TX','Dallas',1213825],
          ['CA','San Jose',912332]]
    results = curs.executemany(sql,data)
  7. Finally, run an aggregation query with GROUP BY, returning each row as a dictionary via the DictCursor factory.
    curs = conn.cursor(cursor_factory=phoenixdb.cursor.DictCursor)
    query = """SELECT state as "State",count(city) as "City Count",sum(population) as "Population Sum"
    FROM us_population
    GROUP BY state
    ORDER BY sum(population) DESC"""
    
    curs.execute(query)
    print(curs.fetchall())
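As an aside, step 6 passes all rows to executemany at once, which is fine for this small sample; for much larger datasets the rows can be chunked. A minimal sketch (batches is a hypothetical helper, and 500 is an arbitrary batch size):

```python
def batches(rows, size=500):
    """Yield successive slices of rows with at most `size` records each."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]

# Sketch of use with the cursor and upsert statement from step 6:
# for batch in batches(data, size=500):
#     curs.executemany(sql, batch)
```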

When the above is run in a session, it will return the following results.

[{'State': 'NY', 'City Count': 1, 'Population Sum': 8143197}, {'State': 'CA', 'City Count': 3, 'Population Sum': 6012701}, {'State': 'TX', 'City Count': 3, 'Population Sum': 4486916}, {'State': 'IL', 'City Count': 1, 'Population Sum': 2842518}, {'State': 'PA', 'City Count': 1, 'Population Sum': 1463281}, {'State': 'AZ', 'City Count': 1, 'Population Sum': 1461575}]
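The aggregation can be sanity-checked in plain Python against the data list from step 6; a quick sketch that recomputes the same group-by:

```python
from collections import defaultdict

# Same rows as inserted in step 6.
data = [['NY', 'New York', 8143197],
        ['CA', 'Los Angeles', 3844829],
        ['IL', 'Chicago', 2842518],
        ['TX', 'Houston', 2016582],
        ['PA', 'Philadelphia', 1463281],
        ['AZ', 'Phoenix', 1461575],
        ['TX', 'San Antonio', 1256509],
        ['CA', 'San Diego', 1255540],
        ['TX', 'Dallas', 1213825],
        ['CA', 'San Jose', 912332]]

counts = defaultdict(int)
sums = defaultdict(int)
for state, _city, population in data:
    counts[state] += 1
    sums[state] += population

# Rebuild the same rows as the Phoenix query, ordered by population sum.
rows = sorted(
    ({'State': s, 'City Count': counts[s], 'Population Sum': sums[s]} for s in sums),
    key=lambda r: r['Population Sum'],
    reverse=True,
)
print(rows)
```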

This example is based on the post: Phoenix in 15 minutes or less
