Member since 09-23-2015 · 12 Posts · 1 Kudos Received · 0 Solutions
02-09-2021
09:32 PM
1 Kudo
Cloudera Machine Learning provides a number of ways to connect to other CDP services and experiences, such as Cloudera Data Warehouse. In this post, we will connect using Python and the Impyla library, as well as through the embedded Cloudera Data Visualization.
Using Impyla
Within Cloudera Machine Learning, create a new project and set the language to Python 3.6. The connection details are available from the Data Warehouse console by copying the JDBC connection details, which will look like the following:
jdbc:impala://coordinator-aws-2-impala-prod.env-j2ln9x.dw.ylcu-atmi.cloudera.site:443/default;AuthMech=3;transportMode=http;httpPath=cliservice;ssl=1;UID=<workload username>;PWD=<workload password>
Use the following Python code to install Impyla and configure a connection:

!pip3 install impyla==0.16a3

import os
from impala.dbapi import connect

USERNAME='<workload username>'
IMPALA_HOST='coordinator-aws-2-impala-prod.env-j2ln9x.dw.ylcu-atmi.cloudera.site'
IMPALA_PORT=443

conn = connect(host=IMPALA_HOST,
               port=IMPALA_PORT,
               auth_mechanism='LDAP',
               user=USERNAME,
               password=os.environ['PASS'],
               use_http_transport=True,
               http_path='/cliservice',
               use_ssl=True)

cursor = conn.cursor()
cursor.execute('show databases')
for row in cursor:
    print(row)
Note: The PASS variable is an environment variable set in the project settings under the Advanced tab. This does not protect your password, but it does reduce the risk of it being copied into a version control service.
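If you want to work with the query results as a DataFrame, Impyla ships a small helper for this. The following is a minimal sketch, assuming the connection above succeeded and that pandas is installed in the session; the table name is a placeholder.

from impala.util import as_pandas

# run a query and load the result set into a pandas DataFrame
cursor.execute('SELECT * FROM default.sample_table LIMIT 100')  # sample_table is a placeholder
df = as_pandas(cursor)
print(df.head())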
Using Visual Applications
Create a Cloudera Data Visualization App by following the instructions at Accessing Data Visualization in CML.
Log out as your default user and log back into Cloudera Data Visualization using the local admin user account. Note: You can raise a support request if you don't have access to this.
Add a new connection under Basic settings using the following parameters.
Connection Name: Name your Connection
Hostname or IP Address: Use the hostname from the JDBC string
Port #: Use the SSL port of 443
Username: CDP Workload Username
Password: CDP Workload Password
Under Advanced Settings, set the following parameters.
Connection Type: HTTP
HTTP path: /cliservice
Socket Type: SSL
Test the connection.
02-09-2021
01:58 AM
Cloudera Machine Learning provides support for Python 3, and it is very straightforward to connect a session with an operational database.
Provision an Operational Database
Log into a CDP instance
Select Operational Database
Select Create Database
Choose the Cloud environment
Provide a unique name for the database
Click Create Database
Once the database has started, make a copy of the Phoenix (Thin) JDBC URL. This will be used as the connection string.
Create a Machine Learning Project
Within your Cloudera Machine Learning (CML) workspace, create a new project.
Provide a name, and choose a blank initial setup. Create a session, and install phoenixdb using the following command:

!pip3 install phoenixdb

Create a new Python file and paste the following code into the notebook. Import the required dependencies:

import phoenixdb
import io
import json
Set up the parameters required to establish the connection with ODB, referring to the Thin client details:
opts = {}
opts['authentication'] = 'BASIC'
opts['serialization'] = 'PROTOBUF'
opts['avatica_user'] = 'xxxxxxxx'
opts['avatica_password'] = 'xxxxxxxx'
database_url = 'https://<the jdbc url copied from the ODB console>/'
TABLENAME = "us_population"
conn = phoenixdb.connect(database_url, autocommit=True,**opts)
For the URL, remove everything before https:// and remove the parameters at the end, while retaining any path details.
Example:
https://<server>/<instance name>/cdp-proxy-api/avatica/
Create the table into which to insert the data:

curs = conn.cursor()
query = """
CREATE TABLE IF NOT EXISTS """+TABLENAME+""" (
    state CHAR(2) NOT NULL,
    city VARCHAR NOT NULL,
    population BIGINT
    CONSTRAINT my_pk PRIMARY KEY (state, city))
"""
curs.execute(query)
Bulk insert a set of data, using a nested list for each record, and execute multiple upserts:

sql = "upsert into " + TABLENAME + \
      " (state, city, population) values (?,?,?)"
data = [['NY','New York',8143197],
        ['CA','Los Angeles',3844829],
        ['IL','Chicago',2842518],
        ['TX','Houston',2016582],
        ['PA','Philadelphia',1463281],
        ['AZ','Phoenix',1461575],
        ['TX','San Antonio',1256509],
        ['CA','San Diego',1255540],
        ['TX','Dallas',1213825],
        ['CA','San Jose',912332]]
results = curs.executemany(sql, data)
Finally, run an aggregated group-by query and return the results as dictionary objects:

curs = conn.cursor(cursor_factory=phoenixdb.cursor.DictCursor)
query = """SELECT state as "State", count(city) as "City Count", sum(population) as "Population Sum"
FROM us_population
GROUP BY state
ORDER BY sum(population) DESC"""
curs.execute(query)
print(curs.fetchall())
When the above is run in a session, it will return the following results:
[{'State': 'NY', 'City Count': 1, 'Population Sum': 8143197}, {'State': 'CA', 'City Count': 3, 'Population Sum': 6012701}, {'State': 'TX', 'City Count': 3, 'Population Sum': 4486916}, {'State': 'IL', 'City Count': 1, 'Population Sum': 2842518}, {'State': 'PA', 'City Count': 1, 'Population Sum': 1463281}, {'State': 'AZ', 'City Count': 1, 'Population Sum': 1461575}]
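The same upsert statement can also be used to update a single record at a time, and the connection should be closed once you are done. A minimal sketch (the figures are illustrative only):

curs = conn.cursor()
curs.execute("upsert into " + TABLENAME + " (state, city, population) values (?,?,?)",
             ['TX', 'Austin', 961855])  # illustrative values only
curs.close()
conn.close()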
This example is based on the post: Phoenix in 15 minutes or less
01-18-2021
09:29 PM
2 Kudos
For a recent project, I was tasked with configuring DBeaver to connect to Phoenix running in an instance of the CDP Datahub. CDP provides a means of creating a Datahub for running an Operational Database (HBase) and querying it over JDBC via Phoenix. Let's start a Datahub from an Operational Database template.

Provision an Operational Database
Log into a CDP instance
Select Data Hub Clusters
Select Create Data Hub
Choose the Cloud environment
Choose the template 7.2.2 - Operational Database with SQL
Provide a unique name for the cluster
Click Provision Cluster

This will start a Datahub cluster running HBase and Phoenix, as well as all of the security dependencies provided by SDX, for example Knox, which will be important when connecting to our instance.

Once the cluster has started, we need to collect some configuration details. This is best done on the Datahub information page and in the Cloudera Manager console. We will use the Phoenix thin driver, which requires a JDBC string of the form:
jdbc:phoenix:thin:url=https://<knox endpoint>:443/<cluster name>/cdp-proxy-api/avatica/;serialization=PROTOBUF;authentication=BASIC;avatica_user=<workload username>;avatica_password=<workload password>

Once the cluster has started, select Endpoints and make a note of the Phoenix Query Server URI. It should look like this:
https://<server>/opdbtest/cdp-proxy-api/avatica/
The path details here are important, as they provide the proxy and the cluster name format that we need.

The next piece of information we need is the Knox server endpoint. This can be found in the Cloudera Manager console, under Knox > Instances. This will replace the <server> part above.

The final component we need is the JAR file containing the Phoenix Thin Client, which can be sourced from the Cloudera repository at https://repository.cloudera.com/. Search for phoenix-queryserver-client and download the latest release.

Configuring DBeaver
To install DBeaver, you can download a version from https://dbeaver.io/. In this example, we are using the OSX version; configuration fields and terms may vary by installation type.
Create a new Apache Phoenix connection to provide a baseline
For the host, use the machine that Knox is running on
For the port, use 443 (the default https port)
Provide your workload username and password
Edit the driver configuration and set:
Class name: org.apache.phoenix.queryserver.client.Driver
URL Template: jdbc:phoenix:thin:url={host}[:{port}]/opdbtest/cdp-proxy-api/avatica/;serialization=PROTOBUF;authentication=BASIC;avatica_user={user};avatica_password={password}
Add the driver JAR using Add File and select the JAR downloaded from the Cloudera repository

Note: Don't use the driver class search, as it may automatically discover an invalid driver class. The Class name configuration will override this. You may need to restart DBeaver if the class is set incorrectly.

Close the configuration and test the connection. The URL uses Knox so that access control can be managed centrally. Knox takes the https:// messages and proxies them through to the backend Phoenix services automatically.

A note on the Operational Database Experience
Shortly we will be providing an Operational Database Experience. We have significantly streamlined the provisioning of the Datahubs and publish a lot more metadata to help with configuring external clients. For example, the Maven links to the correct clients are provided directly, and examples of the JDBC links are presented right in the user interface.
All of these improvements have been made to make provisioning new instances easy, and to make connecting to those instances from applications and tools very quick. Our objective is to help you integrate CDP with your applications quickly and efficiently. We welcome your feedback on areas of our platform and documentation that can be improved to help us with this goal.

Tips
If you receive 404 or 401 errors, check that you are connecting to Knox and that the full https:// URL is correct.
If you receive errors related to serialization, make sure you have serialization=PROTOBUF set.

Documentation references
Setting up connections with a CDP Datahub
Connect to PQS through Apache Knox
Connecting to Apache Phoenix Query Server using the JDBC client
Connect to PQS directly
Setting up connections with CDP Operational Database Experience
Cloudera Operational Database JDBC support
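If you want to sanity-check the endpoint and credentials outside of DBeaver, the same Phoenix Query Server URL can be exercised from Python with the phoenixdb package. This is a minimal sketch only; the hostname, cluster name, and credentials are placeholders to substitute with your own values.

import phoenixdb

# the Phoenix Query Server URI noted from the Datahub Endpoints page
url = 'https://<knox endpoint>/opdbtest/cdp-proxy-api/avatica/'

opts = {
    'authentication': 'BASIC',
    'serialization': 'PROTOBUF',
    'avatica_user': '<workload username>',
    'avatica_password': '<workload password>',
}

conn = phoenixdb.connect(url, autocommit=True, **opts)
curs = conn.cursor()
curs.execute('SELECT TABLE_NAME FROM SYSTEM.CATALOG LIMIT 5')  # simple round trip to confirm connectivity
print(curs.fetchall())
conn.close()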
06-12-2020
06:00 AM
As noted, node labels are not supported in CDH 6.3.3: https://docs.cloudera.com/documentation/enterprise/6/release-notes/topics/rg_cdh_633_unsupported_features.html#yarn_600_unsupported

This is in part because GPU support varies between the Capacity Scheduler and the Fair Scheduler, and CDH implements the Fair Scheduler. You may want to review the YARN configuration steps to enable GPUs, although this may not meet your needs: https://docs.cloudera.com/documentation/enterprise/6/properties/6.3/topics/cm_props_cdh630_yarn_mr2included_.html#concept_6.3.x_nodemanager_props__section_gpu_management_props

Currently we have a couple of techniques for using GPU resources.

CDSW can be deployed alongside CDH, and it uses containerisation to target workloads toward a GPU resource. This is specifically designed to run machine learning workloads that can benefit from a GPU, for example TensorFlow. It can be installed via Cloudera Manager directly alongside your cluster and used to target workloads to the GPU resources: https://docs.cloudera.com/documentation/data-science-workbench/1-7-x/topics/cdsw_gpu.html

The latest CML release on CDP Public Cloud has developed this further and enables Spark and other frameworks to run in containers and target GPU resources: https://docs.cloudera.com/machine-learning/cloud/gpu/topics/ml-gpu.html

You may also want to explore Spark 3, which has additional features for running Spark ML workloads on GPUs. It uses the rapids.ai plug-in to offload Spark processing onto the GPU: https://docs.cloudera.com/runtime/7.0.3/cds-3/topics/spark-install-spark-3-parcel.html
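As an illustration of the Spark 3 route, enabling the RAPIDS accelerator is largely a matter of setting the plugin configuration when the session is created. The sketch below is indicative only; it assumes the RAPIDS Accelerator and cuDF jars are already available on the cluster, and the resource amounts will differ in your environment.

from pyspark.sql import SparkSession

# assumes the rapids-4-spark and cudf jars are on the driver/executor classpath
spark = (SparkSession.builder
         .appName("gpu-example")
         .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
         .config("spark.rapids.sql.enabled", "true")
         .config("spark.executor.resource.gpu.amount", "1")
         .config("spark.task.resource.gpu.amount", "0.25")
         .getOrCreate())

# a trivial aggregation the plugin can execute on the GPU
spark.range(0, 1000000).selectExpr("sum(id)").show()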
06-12-2020
05:43 AM
Cloudera Navigator is included in our CDH release. It provides powerful search and audit features as well as data policy lifecycle management: https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/cn_iu_introduce_navigator.html

Apache Atlas is included in our HDP releases and, going forward, in our latest CDP release: https://docs.cloudera.com/runtime/7.1.0/concepts-governance.html

In CDP we have brought Navigator's powerful search features across into Apache Atlas, and Apache Atlas will be the Cloudera solution for data governance going forward. As part of CDP, Apache Atlas provides broader support for services including Kafka, Spark, and events from CML Machine Learning models. The metadata options are further extended to provide a hierarchical model and relationships between entities. Apache Atlas also has deep integration with the Cloudera SDX services, so that labels applied to data objects can be assigned attribute-based permissions in Ranger.

The easiest way to experience the latest features in Cloudera data governance is via CDP. If you would like more information or a demonstration, please reach out to your account team.
06-12-2020
05:28 AM
Can I check the logic here? Are you using HiveQL to create the table and to add the new columns, and is it the Spark read that is then giving inconsistent results? There are known issues in how Spark 2.2 handles Hive schemas, for example https://issues.apache.org/jira/browse/SPARK-21841. If you can share the example Spark code, that may help. It looks like being explicit in how Spark reads the Hive table may help in this case.
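As a rough illustration of being explicit on the read side, you can refresh any cached table metadata and select the columns by name rather than relying on a previously inferred schema. This is a sketch only; the database, table, and column names are placeholders.

# refresh cached metadata so Spark picks up the newly added columns
spark.catalog.refreshTable("mydb.mytable")

# select the columns explicitly rather than relying on a cached schema
df = spark.sql("SELECT existing_col, new_col FROM mydb.mytable")
df.printSchema()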
05-01-2020
04:50 AM
For a comprehensive list of the 7.0.3 CDP-DC release components, you can reference the documentation here:
https://docs.cloudera.com/runtime/7.0.3/release-notes/topics/rt-runtime-component-versions.html
When 7.1 is available, this page will be updated to include the release components.
03-24-2020
03:07 AM
2 Kudos
Apache Airflow is a popular environment for scheduling and monitoring workflows. Running workflows is useful in machine learning for automation of data acquisition or the building and monitoring of machine learning models.
This tutorial will take you through the process of setting up Airflow as an Application in CML. This can be very useful in building out testing and prototype pipelines. It also provides a mechanism for calling chains of Models or Jobs.
Create a new Python project in CML using a blank template: New Project
Open up a workbench session and install Airflow; this can be scripted using the install instructions here.
The following shell and Python scripts can be used to automate this process. The installation needs to be performed only once.

install.py:

import os
os.system("./install.sh")

install.sh:

#!/bin/bash -x
export AIRFLOW_HOME=~/airflow
# install from pypi using pip
pip3 install apache-airflow
# initialize the database
airflow initdb
This will install the Airflow components into your project and will persist with the lifecycle of the project.
Now that the software is installed, the next step is starting the process as a long-lived application. This requires a Python file and a shell script to start up the services.

start.py:

import os
os.system("./start.sh")

start.sh:

#!/bin/bash -x
PORT=${CDSW_APP_PORT:-8090}
export AIRFLOW_HOME=~/airflow
# start the scheduler in the background
airflow scheduler &
# start the web server on the application port (defaults to 8090);
# in CML the UI is reached via the Application URL rather than localhost
airflow webserver -p $PORT -hn 127.0.0.1
The PORT variable is set to the application port provided by the environment, and the server is started on localhost. These scripts can be tested in a workbench session, or they can be added to an Application with the following settings. Create Application
Validate that the application starts up cleanly by checking the Application's logs: Monitor Logs
Opening the application will load directly into the Airflow management UI. Access Airflow UX
Using the Airflow HTTP Operator, it is possible to call CML Jobs directly via the CML Jobs API, as sketched below.
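The following is a minimal sketch of such a DAG, assuming the Airflow 1.10.x installed above. The HTTP connection ID and the jobs endpoint path are placeholders; check the CML/CDSW Jobs API documentation for the exact endpoint and authentication details for your version (the API key is typically supplied via the Airflow connection's credentials).

from datetime import datetime
from airflow import DAG
from airflow.operators.http_operator import SimpleHttpOperator

# 'cml_http' is a placeholder Airflow connection pointing at your CML workspace URL
dag = DAG('trigger_cml_job',
          schedule_interval='@daily',
          start_date=datetime(2020, 3, 1),
          catchup=False)

start_job = SimpleHttpOperator(
    task_id='start_cml_job',
    http_conn_id='cml_http',
    endpoint='api/v1/projects/<user>/<project>/jobs/<job-id>/start',  # placeholder path
    method='POST',
    headers={'Content-Type': 'application/json'},
    dag=dag)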
This approach hosts Airflow within a CML container. The deployment is isolated to the project and will continue to run and use resources associated with the project. It provides a good prototyping environment for building out more complete end-to-end data engineering pipelines that combine scheduling and flow with Python, shell, Scala, and Spark processes.
We would be interested to hear from you if you apply this to a project.
02-07-2020
01:20 AM
Can we confirm a detail? The DSSD service relates to using EMC DSSD-based storage on the data nodes. Are you using DSSD mode in Cloudera Manager?
02-07-2020
01:03 AM
1 Kudo
We would suggest using the Jobs function. Jobs have an API, so they can be triggered externally to CDSW: https://docs.cloudera.com/documentation/data-science-workbench/1-6-x/topics/cdsw_rest_apis.html You can then wrap a shell or Python script in a Job to perform data file operations, for example using wget, s3 sync, etc. Would this meet your requirements? If not, can you expand on your requirements further, please?
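As an illustration, a job script can be as simple as the following Python sketch, which downloads a file into the project storage; the source URL and target path are placeholders.

import os
import requests

url = 'https://example.com/data/daily_export.csv'    # placeholder source
target = '/home/cdsw/data/daily_export.csv'          # project-local destination (placeholder)

os.makedirs(os.path.dirname(target), exist_ok=True)

resp = requests.get(url, timeout=60)
resp.raise_for_status()
with open(target, 'wb') as f:
    f.write(resp.content)

print('Downloaded %d bytes to %s' % (len(resp.content), target))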
09-23-2015
08:54 AM
I think (speculation) this is the reason the fault occurs: if Navigator is added to the QuickStart VM, it will update the tables, and this results in conflicts. For some reason, my Activity Monitor in Cloudera Manager became configured with the Navigator database (nav). This meant that when the Activity Monitor restarted, it was not able to configure its schema correctly. By creating a separate database (amon) and pointing the Activity Monitor at it, things appear to be resolved.
09-23-2015
08:52 AM
I managed to resolve this as follows. In a console:

su root
mysql -u root -p
show databases;
create database amon DEFAULT CHARACTER SET utf8;
grant all on amon.* TO 'amon'@'%' IDENTIFIED BY '{password of your choice}';

Then in Cloudera Manager, click Cloudera Manager Service (bottom) > Activity Monitor > Configuration. Update the "Activity Monitor Database Name" and the "Activity Monitor Database Username" / "Password" to match the configuration above (amon), and restart the Activity Monitor.