Member since 07-10-2018
63 Posts
71 Kudos Received
0 Solutions
10-28-2020
02:54 AM
Thanks Paul, very helpful when setting up external clients and minimizing the privileges they have in Ranger.
10-20-2020
02:20 PM
1 Kudo
Cloudera Data Platform Public Cloud recently introduced the ability to back up and restore the data lake from a saved location. Specifically, the backup operation saves a full snapshot of data from all SDX services:
Atlas:
Audit events, saved in HBase tables
Lineage data, saved as Janus graph data in HBase tables
Edge, vertex, and full text indexes, saved in Solr collections
Ranger:
Audit logs, saved as a Solr collection
Permissions and tags, saved in RDBMS tables
HMS Metadata, saved in RDBMS tables
In this article, I will detail how to run backup and restore in CDP Public Cloud on AWS, via the CDP CLI.
Pre-Requisites
Stop operations that could affect backup
Make sure that no HMS-affecting operations are running (e.g. creating a table from CDW or a datahub)
Go to your Datalake Cloudera Manager, and shut down:
Atlas
Ranger
HMS
Make sure you have the proper IAM permissions
Datalake backup uses both the Ranger Audit and Datalake Admin roles to write the backups (more details on these roles here)
Therefore, the policies attached to both IAM roles must give write permissions to the location of your backup.
Here is an example of a policy attached to the Ranger Audit Role:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "FullObjectAccessUnderAuditDir",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::bckp-cdp-bucket/ranger/audit/*"
    },
    {
      "Sid": "FullObjectAccessUnderBackupDir",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::bckp-cdp-bucket/backups/*"
    },
    {
      "Sid": "LimitedAccessToDataLakeBucket",
      "Effect": "Allow",
      "Action": [
        "s3:AbortMultipartUpload",
        "s3:ListBucket",
        "s3:ListBucketMultipartUploads"
      ],
      "Resource": "arn:aws:s3:::bckp-cdp-bucket"
    }
  ]
}
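If you manage these roles from the AWS CLI rather than the console, a policy like the one above can be attached as an inline policy. This is only a sketch; the role name, policy name, and file path are placeholders for illustration:
$ aws iam put-role-policy \
    --role-name [your_ranger_audit_role] \
    --policy-name BackupLocationAccess \
    --policy-document file://backup-policy.json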
Install and configure CDP CLI
This is fairly straightforward, and documented in your management console, under Help > Download CLI.
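If you have not used it before, the install and configuration typically boil down to a couple of commands (a sketch; the CLI is distributed as the cdpcli Python package and is configured with an access key pair generated in the management console):
$ pip install cdpcli
$ cdp configure          # prompts for your CDP access key ID and private key
$ cdp iam get-user       # quick check that the credentials work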
Step 1: Running backup
Initiate backup
$ cdp datalake backup-datalake --datalake-name bckp-cdp-dl --backup-location s3a://bckp-cdp-bucket/backups/
{
"accountId": "558bc1d2-8867-4357-8524-311d51259233",
"backupId": "6c59a259-51ac-4db4-80d6-22f71f84cc4e",
"internalState": "{ATLAS_ENTITY_AUDIT_EVENTS_TABLE=IN_PROGRESS, EDGE_INDEX_COLLECTION=IN_PROGRESS, DATABASE=IN_PROGRESS, FULLTEXT_INDEX_COLLECTION=IN_PROGRESS, ATLAS_JANUS_TABLE=IN_PROGRESS, RANGER_AUDITS_COLLECTION=IN_PROGRESS, VERTEX_INDEX_COLLECITON=IN_PROGRESS}",
"status": "IN_PROGRESS",
"startTime": "2020-10-20 21:11:27.821",
"endTime": "",
"backupLocation": "s3a://bckp-cdp-bucket/backups/",
"failureReason": "null"
}
Monitor backup
$ cdp datalake backup-datalake-status --datalake-name bckp-cdp-dl
{
"accountId": "558bc1d2-8867-4357-8524-311d51259233",
"backupId": "6c59a259-51ac-4db4-80d6-22f71f84cc4e",
"userCrn": "crn:altus:iam:us-west-1:558bc1d2-8867-4357-8524-311d51259233:user:86c4e7d9-1560-4afa-ac14-794bdeec0896",
"internalState": "{ATLAS_ENTITY_AUDIT_EVENTS_TABLE=IN_PROGRESS, EDGE_INDEX_COLLECTION=IN_PROGRESS, DATABASE=IN_PROGRESS, FULLTEXT_INDEX_COLLECTION=IN_PROGRESS, ATLAS_JANUS_TABLE=IN_PROGRESS, RANGER_AUDITS_COLLECTION=IN_PROGRESS, VERTEX_INDEX_COLLECITON=IN_PROGRESS}",
"status": "IN_PROGRESS",
"startTime": "2020-10-20 21:11:27.821",
"endTime": "",
"backupLocation": "s3a://bckp-cdp-bucket/backups/",
"backupName": "",
"failureReason": "null"
}
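Since the status call returns JSON, you can also poll it from a shell until the backup leaves the IN_PROGRESS state. A minimal sketch, assuming jq is installed:
$ while [ "$(cdp datalake backup-datalake-status --datalake-name bckp-cdp-dl | jq -r '.status')" = "IN_PROGRESS" ]; do echo "backup still in progress..."; sleep 60; done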
Step 2: Restoring backup
Initiate restore
$ cdp datalake restore-datalake --datalake-name bckp-cdp-dl --backup-id 6c59a259-51ac-4db4-80d6-22f71f84cc4e
{
"accountId": "558bc1d2-8867-4357-8524-311d51259233",
"restoreId": "06c0bde4-cfc7-4b9e-a8e0-d9f2ddfcb5c5",
"backupId": "6c59a259-51ac-4db4-80d6-22f71f84cc4e",
"internalState": "{ATLAS_ENTITY_AUDIT_EVENTS_TABLE=IN_PROGRESS, DATABASE=IN_PROGRESS, EDGE_INDEX_COLLECTION_DELETE=IN_PROGRESS, RANGER_AUDITS_COLLECTION_DELETE=IN_PROGRESS, VERTEX_INDEX_COLLECITON_DELETE=IN_PROGRESS, ATLAS_JANUS_TABLE=IN_PROGRESS, FULLTEXT_INDEX_COLLECTION_DELETE=IN_PROGRESS}",
"status": "IN_PROGRESS",
"startTime": "2020-10-20 21:15:11.757",
"endTime": "",
"backupLocation": "s3a://bckp-cdp-bucket/backups/",
"failureReason": "null"
}
Monitor restore
$ cdp datalake restore-datalake-status --datalake-name bckp-cdp-dl
{
"accountId": "558bc1d2-8867-4357-8524-311d51259233",
"restoreId": "06c0bde4-cfc7-4b9e-a8e0-d9f2ddfcb5c5",
"backupId": "6c59a259-51ac-4db4-80d6-22f71f84cc4e",
"userCrn": "crn:altus:iam:us-west-1:558bc1d2-8867-4357-8524-311d51259233:user:86c4e7d9-1560-4afa-ac14-794bdeec0896",
"internalState": "{ATLAS_ENTITY_AUDIT_EVENTS_TABLE=IN_PROGRESS, EDGE_INDEX_COLLECTION=SUCCESSFUL, DATABASE=SUCCESSFUL, FULLTEXT_INDEX_COLLECTION=SUCCESSFUL, EDGE_INDEX_COLLECTION_DELETE=SUCCESSFUL, VERTEX_INDEX_COLLECITON_DELETE=SUCCESSFUL, RANGER_AUDITS_COLLECTION_DELETE=SUCCESSFUL, ATLAS_JANUS_TABLE=IN_PROGRESS, RANGER_AUDITS_COLLECTION=IN_PROGRESS, VERTEX_INDEX_COLLECITON=IN_PROGRESS, FULLTEXT_INDEX_COLLECTION_DELETE=SUCCESSFUL}",
"status": "IN_PROGRESS",
"startTime": "2020-10-20 21:15:11.757",
"endTime": "",
"backupLocation": "s3a://bckp-cdp-bucket/backups/",
"failureReason": "null"
}
Note: you can also monitor these events in the CDP Control Plane:
10-14-2020
09:53 PM
Hi Paul, I agree, a call with a shared screen would be better. So yes, I'm already in the Partner Program and I also got the Partner Development License in order to do this very same exercise with CDP. However, I can't access the cases in the Support Portal (I get a 'restricted access' message). If I look at the Partner Development Subscription, I see that Support is included with Gold and Platinum, but we are Silver... Maybe it depends on this? I contacted our Partner Sales Manager to see if he can help out. Best regards, Valerio
09-28-2020
05:14 AM
1 Kudo
Ha! Good catch!
08-31-2020
02:25 PM
2 Kudos
Cloudera Data Warehouse (CDW) provides highly efficient compute isolation and rapid scale-up and scale-down of data warehousing workloads, leveraging the latest container and caching technologies.
One of the great features of this architecture is the ability to bring on compute only on demand, as illustrated by the figure below:
This default setup is the most cost-effective, as only a few shared-services nodes (small nodes running services like UIs, Viz, Zookeeper, etc.) are long-lived. Each Virtual Warehouse has a set of nodes that only run when compute is needed (i.e. a new query on a non-cached dataset).
The caveat to this approach is that on a completely cold warehouse, the warm-up time from zero to compute is a minute or two.
An alternative to this default architecture is to leverage compute-reserved nodes that are shared between virtual warehouses, as depicted below:
With this architecture, a pool of reserved nodes can be used to enable the immediate availability of compute across virtual warehouses. In this article, I will showcase how to set up reserved instances in CDW.
Note: This article is a high-level tutorial. It is not my intent to detail the behavior of how reserved nodes are shared across warehouses, or recommend generic sizing. The number of instances and the VW behavior will depend on your implementation.
Step 1: Get your Kubeconfig
In CDW, go to your environment, click on the 3 dots on the environment box > Show Kubeconfig:
Grant your ARN access to the environment, and copy/download the kubeconfig (see this article for more details).
Step 2: Connect to your cluster
$ export KUBECONFIG=[path_to_your_kubeconfig]
$ kubectl get deployments -n cluster
NAME                                READY   UP-TO-DATE   AVAILABLE   AGE
ardent-ferret-efs-provisioner       2/2     2            2           4h46m
compute-reserved-node               0/0     0            0           4h46m
crusty-abalone-cluster-autoscaler   1/1     1            1           4h46m
nginx-default-backend               1/1     1            1           4h46m
nginx-service                       3/3     3            3           4h46m
shared-services-reserved-node       0/0     0            0           4h46m
Step 3: Modify the replicas of compute reserved nodes
kubectl edit deployment compute-reserved-node -n cluster
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
  creationTimestamp: "2020-08-31T16:28:52Z"
  generation: 1
  labels:
    app.kubernetes.io/instance: trendy-mastiff
    app.kubernetes.io/managed-by: Tiller
    app.kubernetes.io/name: cluster-overprovisioner
    cluster-overprovisioner-name: compute-reserved-node
    helm.sh/chart: cluster-overprovisioner-0.2.5
  name: compute-reserved-node
  namespace: cluster
  resourceVersion: "3476"
  selfLink: /apis/extensions/v1beta1/namespaces/cluster/deployments/compute-reserved-node
  uid: a5cb9ea1-729a-4665-9734-94c2f669984f
spec:
  progressDeadlineSeconds: 600
  replicas: 3
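If you prefer not to edit the manifest interactively, the same change can be made with kubectl scale (a sketch, using the same deployment and namespace as above):
$ kubectl scale deployment compute-reserved-node -n cluster --replicas=3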
Step 4: Verify your config
After a few minutes, you should see your configuration being applied:
$ kubectl get deployments -n cluster
NAME                                READY   UP-TO-DATE   AVAILABLE   AGE
ardent-ferret-efs-provisioner       2/2     2            2           4h54m
compute-reserved-node               3/3     3            3           4h54m
crusty-abalone-cluster-autoscaler   1/1     1            1           4h54m
nginx-default-backend               1/1     1            1           4h54m
nginx-service                       3/3     3            3           4h54m
shared-services-reserved-node       0/0     0            0           4h54m
08-19-2020
01:56 PM
1 Kudo
Inspired by @sunile_manjee's article How to use K9s to fetch metrics and logs for Cloudera Data Warehouse Experience, I decided to create the same tutorial for Cloudera Data Engineering.
The process is very similar, as you can see below.
Pre-Requisites
CDE environment setup
K9s installed on your machine (instructions here)
AWS user ARN
AWS configure (CLI) pointing to your AWS env (instructions here)
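Before starting, a quick sanity check that the AWS CLI is pointed at the right account can save some time (a sketch using standard AWS CLI calls; the Arn field returned by the second command is the user ARN you will need in Step 1):
$ aws configure list
$ aws sts get-caller-identity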
Step 1: Add your AWS user to the CDE environment
In your AWS account console, go to IAM > Users > Search for your user name > copy your ARN:
In the CDE main page, go to environment details:
Then, go to access and add your ARN:
Step 2: Download the kubeconfig
In the CDE main page, click on the 3 dots, then Download Kube Config:
Step 3: Run K9s
You can now run k9s, by running:
k9s --kubeconfig ~/Downloads/kubeconfig
Note: the path to your downloaded kubeconfig is subject to change of course.
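If k9s cannot connect, it can help to first confirm the kubeconfig itself works with a plain kubectl call (assuming kubectl is installed locally):
$ kubectl --kubeconfig ~/Downloads/kubeconfig get pods --all-namespaces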
08-10-2020
02:14 PM
2 Kudos
Moving data from your local machine to the cloud has never been easier using the NiFi site-to-site protocol and CDP Datahub. In this article, I will focus on how to set up site-to-site communication between your local machine and CDP Public Cloud, without using the default Knox CDP Proxy.
This configuration assumes that you already have a local instance of NiFi (or MiNiFi) and a CDP Datahub Cluster running NiFi. If you want to learn how to use CDP Public Cloud, please visit our overview page and documentation.
This setup will be executed in 4 steps:
Step 1: Open CDP to your local IP
Step 2: Download and configure stores on your local machine
Step 3: Configure a simple site-to-site flow
Step 4: Authorize this flow in Ranger
Step 1: Open CDP to your local IP
Go to your CDP Management Console, and find your datahub (here pvn-nifi).
At the bottom of the datahub page, click on Hardware and locate one of the instances running NiFi:
Click on the instance and you will be redirected to your cloud provider (here AWS):
At the bottom of the screen, click on the security group associated with your instance, and you will be redirected to that security group config page:
Click on Edit inbound rules and add a rule opening TCP port 8443 to your local IP:
Save these changes.
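Alternatively, if you manage security groups from the AWS CLI, the same rule can be added with authorize-security-group-ingress (a sketch; the group ID and IP address are placeholders):
$ aws ec2 authorize-security-group-ingress --group-id [your_security_group_id] --protocol tcp --port 8443 --cidr [your_local_ip]/32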
Step 2: Download and configure stores on your local machine
Connect to one of the NiFi machines with the Cloudbreak user and the key you used at deployment:
$ ssh -i [path_to_private_key] cloudbreak@[your_nifi_host]
Copy and authorize the key and trust stores:
$ sudo su
$ cp /var/lib/cloudera-scm-agent/agent-cert/cm-auto-host_keystore.jks /tmp
$ cp /var/lib/cloudera-scm-agent/agent-cert/cm-auto-global_truststore.jks /tmp
$ chmod a+rw /tmp/cm-auto-host_keystore.jks
$ chmod a+rw /tmp/cm-auto-global_truststore.jks
Disconnect from the remote machine and copy these stores:
$ cd ~/Desktop
$ scp -i [path_to_private_key] cloudbreak@[your_nifi_host]:/tmp/cm-auto-host_keystore.jks cm-auto-host_keystore.jks
$ scp -i [path_to_private_key] cloudbreak@[your_nifi_host]:/tmp/cm-auto-global_truststore.jks cm-auto-global_truststore.jks
Configure your local NiFi with these stores, by modifying your nifi.properties:
nifi.security.keystore=/Users/pvidal/Desktop/cm-auto-host_keystore.jks
nifi.security.keystoreType=JKS
nifi.security.keystorePasswd=[keystore_pw]
nifi.security.keyPasswd=[keystore_pw]
nifi.security.truststore=/Users/pvidal/Desktop/cm-auto-global_truststore.jks
nifi.security.truststoreType=JKS
nifi.security.truststorePasswd=[truststore_pw]
Note: To know the passwords of these stores, please connect with your Cloudera team.
Restart your local NiFi instance: nifi restart
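If the restart fails with keystore errors, you can check that the copied stores are intact with keytool, which ships with the JDK (you will be prompted for the store passwords mentioned above):
$ keytool -list -keystore ~/Desktop/cm-auto-host_keystore.jks
$ keytool -list -keystore ~/Desktop/cm-auto-global_truststore.jks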
Step 3: Configure a simple site-to-site flow
Local instance
Create a process group to host your flow (here called S2S Cloud):
In this process group, create a remote process group instance and configure it with the address of one of your cloud NiFi instances and the HTTP protocol:
Create a simple GenerateFlowFile processor and connect it to the remote process group. Note: Without configuring Ranger, you will get a Forbidden warning (see Step 4).
CDP Public Instance
Create a process group to host your flow (here called Receive from on prem):
In this process group, create an input port accepting remote connections:
Finally, create a flow that takes the data and logs it:
Start your flow.
Step 4: Authorize this flow in Ranger
From the Cloudera Management console, go to Ranger and your NiFi service:
From the list of policies, create a new policy (here called s2s) that will allow access to your specific process group and the site-to-site protocol (Ranger does auto completion):
Save this policy, and go back to your local machine; you can now enable the remote process group and start sending files!
Example of successful flows
Local Flow
CDP Public Flow
06-04-2020
11:38 AM
1 Kudo
Cloudera Machine Learning (and Cloudera Data Science Workbench) is built on a very robust and flexible framework to ease integration with third parties. In this article, I decided to explore the integration of Teradata with CML via ODBC.
A few notes before I dive in:
I could have easily used JDBC via JayDeBeApi (see my previous article), but where is the fun in that?
This article could be generalized to other ODBC connections, provided the proper parameters are set up
To get a fully rounded tutorial, I will go through these 3 steps:
Step 1 (optional if you already have a Teradata instance): Setting up Teradata Vantage in AWS
Step 2: Create a custom image with Teradata driver
Step 3: Configure and run python to Teradata ODBC
Step 1: Setting up Teradata Vantage in AWS
If you want to do development testing of Teradata, you can use Teradata Vantage Developer on the AWS Marketplace. I am not going to dive too much into this, as the configuration may change as Teradata evolves.
Pre-Requisites
The Teradata Vantage Developer is an AWS CloudFormation template that requires the following assets to be pre-created:
An Internet Gateway
A VPC associated with this IGW (and DNS / DNS Hostnames support)
A subnet (with a route to the IGW and Public IP creation on launch)
A placement group
Setup
Once the assets are created, go to Teradata Vantage Developer and select your configuration (make sure you select the same region as the assets you created):
From this configuration, launch the CloudFormation template and fill the parameters with the AWS assets you created:
The template will create two nodes by default. We will connect to one of them to create a test user:
$ ssh ec2-user@[your_public_hostname]
ec2-user@SMP001-01:~> bteq
Teradata BTEQ 16.20.00.01 for LINUX. PID: 17205
Copyright 1984-2017, Teradata Corporation. ALL RIGHTS RESERVED.
Enter your logon or BTEQ command:
.LOGON
UserId: dbc
Password: [password_set_in_cf_template]
CREATE user test AS password=test perm = 200000000, SPOOL = 100000000;
You can then connect with your favorite SQL editor to test the connection and run table creation:
CREATE TABLE "test".TEST_TABLE (
COLUMN1 VARCHAR(100),
COLUMN2 VARCHAR(100)
) ;
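Optionally, insert a test row so that the query used in Step 3 returns data (a hypothetical row, purely for illustration):
INSERT INTO "test".TEST_TABLE VALUES ('hello', 'world');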
Step 2: Create a custom image with Teradata driver
I created the following image by following the steps in the documentation on custom engines:
#Dockerfile
FROM docker.repository.cloudera.com/cdsw/engine:11-cml-2020.04-1
WORKDIR /tmp
#The RUN commands that install an editor
#For example: RUN apt-get install myeditor
RUN apt-get autoremove unixodbc -y
RUN apt-get update -y
RUN apt-get install lib32stdc++6 -y
RUN apt-get install wget -y
RUN wget [LOCATION_OF_DRIVERS]
RUN tar -xzvf tdodbc1620__ubuntu_indep.16.20.00.87-1.tar.gz
RUN dpkg -i tdodbc1620/tdodbc1620-16.20.00.87-1.noarch.deb
RUN apt-get install -y python-pyodbc
A few notes on this image:
I removed unixodbc because I read that it was causing issues with pyodbc, but it may not be required
You can find a built image on my dockerhub here
Finally, connect to CML and add this new engine:
Step 3: Configure and run python to Teradata ODBC
Go to your workbench in a new project, and create a session with the engine you created.
Then run the following: pip3 install pyodbc
After it is installed, go to your odbc.ini file: vi /home/cdsw/.odbc.ini
Configure your file as follows:
[ODBC Data Sources]
Teradata ODBC DSN = Teradata ODBC Driver 16.20
TEST = My DSN Description
[TEST]
Driver = /opt/teradata/client/16.20/odbc_64/lib/tdataodbc_sb64.so
Description = Teradata database
DBCName = [your_teradata_hostname]
UID = test
PWD = test
Finally, you can run the following code to test:
import pyodbc
import pandas as pd
conn = pyodbc.connect('DSN=TEST')
# Define Cursor
cus=conn.cursor()
query = "select * from test.TEST_TABLE;"
# Execute SQL statement and store the result in the cursor
cus.execute(query)
# Display the content of cursor
row = cus.fetchone()
print(row)
# Use Pandas to execute and retrieve results
df = pd.read_sql(query, conn)
print(df)
The output in the workbench should look something like the following:
05-12-2020
08:11 AM
Check out our series of SDX videos on our youtube channel for some examples of Atlas + Ranger in action as part of the CDP Shared Data Experience (SDX): https://www.youtube.com/playlist?list=PLe-h9HrA9qfCj2SI5BrvCdQOEOjHUWMBB
05-05-2020
06:11 PM
4 Kudos
This article explains how to set up Hive Warehouse Connector (HWC) in CDP Public Cloud (tested with CDP Public Cloud Datahub runtime 7.1).
Note: you must have admin privileges on your datahub cluster to do this configuration.
Step 1: Whitelist the path to the HWC jar
In your CDP Datahub, open Cloudera Manager (CM) via the link in the management console:
In CM, go to your cluster > Livy > Configuration, and search for livy-conf/livy.conf:
Add the following safety valve to the configuration file:
livy.file.local-dir-whitelist=/path_for_hwc/
In our example, we are using the /tmp/ folder.
Restart the Livy service via CM to propagate the configuration.
Step 2: Copy the HWC jar to the whitelisted location
Find the hostname of the node where Livy is installed (master3 here):
Connect to the node using your user/workload password, e.g.: ssh pvidal@viz-data-engineering-master3.viz-cdp.a465-9q4k.cloudera.site
Find the HWC jar: [pvidal@viz-data-engineering-master3 /]$ find / -name *hive-warehouse-connector* 2>/dev/null
./opt/cloudera/parcels/CDH-7.1.0-1.cdh7.1.0.p0.1922354/lib/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.7.1.0.0-714.jar
./opt/cloudera/parcels/CDH-7.1.0-1.cdh7.1.0.p0.1922354/jars/hive-warehouse-connector-assembly-1.0.0.7.1.0.0-714.jar
Copy it and add the right permissions:
cp /opt/cloudera/parcels/CDH-7.1.0-1.cdh7.1.0.p0.1922354/jars/hive-warehouse-connector-assembly-1.0.0.7.1.0.0-714.jar /tmp
chmod a+rw /tmp/hive-warehouse-connector-assembly-1.0.0.7.1.0.0-714.jar
Step 3: Add jar path to Zeppelin Livy interpreter
From your management console, open Zeppelin:
Go to the top right, and configure your Interpreters:
Edit the livy interpreter and add the following properties:
HWC Jar Location
  Name: livy.spark.jars
  Value: file:///[LOCATION_OF_YOUR_HWC_JAR]
Hive JDBC URL
  Name: livy.spark.sql.hive.hiveserver2.jdbc.url
  Value: [JDBC_URL_FROM_MANAGEMENT_CONSOLE];user=[your_user];password=[your_password]
  (You can find the JDBC URL in your datahub management console.)
Read via LLAP
  Name: livy.spark.datasource.hive.warehouse.read.via.llap
  Value: false
JDBC mode
  Name: livy.spark.datasource.hive.warehouse.read.jdbc.mode
  Value: client
Staging Dir
  Name: livy.spark.datasource.hive.warehouse.load.staging.dir
  Value: /tmp
Metastore URI
  Name: livy.spark.datasource.hive.warehouse.metastoreUri
  Value: [VALUE_FROM_HIVE_SITE_XML]
You can download hive-site.xml from CM, by going to Your Cluster > Hive > Download Client Configuration:
Save your configuration, and restart your interpreter.
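Before writing a full flow, you can run a quick smoke test in a Zeppelin paragraph against the Livy interpreter, using only the calls that appear later in this article; if the import resolves and the session builds, the HWC jar is on the classpath:
import com.hortonworks.hwc.HiveWarehouseSession
val hive = HiveWarehouseSession.session(spark).build()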
Step 4: Code away
Here is a simple example of Spark reading from a CSV and writing to a Hive table using HWC:
Read raw location data
val locationDf = spark.read.options(Map("inferSchema"->"true","delimiter"->",","header"->"true")).csv("s3a://viz-cdp-bucket/raw/locations.csv")
locationDf.printSchema()
Setup HWC session
import com.hortonworks.hwc.HiveWarehouseSession
import com.hortonworks.hwc.HiveWarehouseSession._
val hive = HiveWarehouseSession.session(spark).build()
Create database and save dataset to table
hive.executeUpdate("CREATE DATABASE worldwidebank");
hive.setDatabase("worldwidebank");
locationDf.write.format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector").option("table", "locations").save()
Query data
val ds = hive.sql("select * from locations limit 10")
ds.show()