Member since 07-10-2018
63 Posts
71 Kudos Received
0 Solutions
10-28-2020
02:54 AM
Thanks Paul, very helpful when setting up external clients and minimizing the privileges they have in Ranger.
10-20-2020
02:20 PM
1 Kudo
Cloudera Data Platform Public Cloud recently introduced the ability to back up and restore the data lake from a saved location. Specifically, the backup operation saves a full snapshot of data from all SDX services:
Atlas:
Audit events, saved in HBase tables
Lineage data, saved as Janus graph data in HBase tables
Edge, vertex, and full text indexes, saved in Solr collections
Ranger:
Audit logs, saved as a Solr collection
Permissions and tags, saved in RDBMS tables
HMS Metadata, saved in RDBMS tables
In this article, I will detail how to run backup and restore in CDP Public Cloud on AWS, via the CDP CLI.
Pre-Requisites
Stop operations that could affect backup
Make sure that no HMS-affecting operations are running (e.g. creating a table from CDW or a datahub)
Go to your Datalake Cloudera Manager, and shut down:
Atlas
Ranger
HMS
Make sure you have the proper IAM permissions
Datalake backup uses both the Ranger Audit and Datalake Admin roles to write the backups (more details on these roles here)
Therefore, the policies attached to both IAM roles must give write permissions to the location of your backup.
Here is an example of a policy attached to the Ranger Audit Role:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "FullObjectAccessUnderAuditDir",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::bckp-cdp-bucket/ranger/audit/*"
    },
    {
      "Sid": "FullObjectAccessUnderBackupDir",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::bckp-cdp-bucket/backups/*"
    },
    {
      "Sid": "LimitedAccessToDataLakeBucket",
      "Effect": "Allow",
      "Action": [
        "s3:AbortMultipartUpload",
        "s3:ListBucket",
        "s3:ListBucketMultipartUploads"
      ],
      "Resource": "arn:aws:s3:::bckp-cdp-bucket"
    }
  ]
}
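If you manage these roles from the AWS CLI rather than the console, a policy like the one above can be attached as an inline policy. This is only a sketch; the role name, policy name, and file path are placeholders for illustration:
$ aws iam put-role-policy \
    --role-name [your_ranger_audit_role] \
    --policy-name BackupLocationAccess \
    --policy-document file://backup-policy.json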
Install and configure CDP CLI
This is fairly straightforward, and documented in your management console, under Help > Download CLI.
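If you have not used it before, the install and configuration typically boil down to a couple of commands (a sketch; the CLI is distributed as the cdpcli Python package and is configured with an access key pair generated in the management console):
$ pip install cdpcli
$ cdp configure          # prompts for your CDP access key ID and private key
$ cdp iam get-user       # quick check that the credentials work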
Step 1: Running backup
Initiate backup
$ cdp datalake backup-datalake --datalake-name bckp-cdp-dl --backup-location s3a://bckp-cdp-bucket/backups/
{
"accountId": "558bc1d2-8867-4357-8524-311d51259233",
"backupId": "6c59a259-51ac-4db4-80d6-22f71f84cc4e",
"internalState": "{ATLAS_ENTITY_AUDIT_EVENTS_TABLE=IN_PROGRESS, EDGE_INDEX_COLLECTION=IN_PROGRESS, DATABASE=IN_PROGRESS, FULLTEXT_INDEX_COLLECTION=IN_PROGRESS, ATLAS_JANUS_TABLE=IN_PROGRESS, RANGER_AUDITS_COLLECTION=IN_PROGRESS, VERTEX_INDEX_COLLECITON=IN_PROGRESS}",
"status": "IN_PROGRESS",
"startTime": "2020-10-20 21:11:27.821",
"endTime": "",
"backupLocation": "s3a://bckp-cdp-bucket/backups/",
"failureReason": "null"
}
Monitor backup
$ cdp datalake backup-datalake-status --datalake-name bckp-cdp-dl
{
"accountId": "558bc1d2-8867-4357-8524-311d51259233",
"backupId": "6c59a259-51ac-4db4-80d6-22f71f84cc4e",
"userCrn": "crn:altus:iam:us-west-1:558bc1d2-8867-4357-8524-311d51259233:user:86c4e7d9-1560-4afa-ac14-794bdeec0896",
"internalState": "{ATLAS_ENTITY_AUDIT_EVENTS_TABLE=IN_PROGRESS, EDGE_INDEX_COLLECTION=IN_PROGRESS, DATABASE=IN_PROGRESS, FULLTEXT_INDEX_COLLECTION=IN_PROGRESS, ATLAS_JANUS_TABLE=IN_PROGRESS, RANGER_AUDITS_COLLECTION=IN_PROGRESS, VERTEX_INDEX_COLLECITON=IN_PROGRESS}",
"status": "IN_PROGRESS",
"startTime": "2020-10-20 21:11:27.821",
"endTime": "",
"backupLocation": "s3a://bckp-cdp-bucket/backups/",
"backupName": "",
"failureReason": "null"
}
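Since the status call returns JSON, you can also poll it from a shell until the backup leaves the IN_PROGRESS state. A minimal sketch, assuming jq is installed:
$ while [ "$(cdp datalake backup-datalake-status --datalake-name bckp-cdp-dl | jq -r '.status')" = "IN_PROGRESS" ]; do echo "backup still in progress..."; sleep 60; done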
Step 2: Restoring backup
Initiate restore
$ cdp datalake restore-datalake --datalake-name bckp-cdp-dl --backup-id 6c59a259-51ac-4db4-80d6-22f71f84cc4e
{
"accountId": "558bc1d2-8867-4357-8524-311d51259233",
"restoreId": "06c0bde4-cfc7-4b9e-a8e0-d9f2ddfcb5c5",
"backupId": "6c59a259-51ac-4db4-80d6-22f71f84cc4e",
"internalState": "{ATLAS_ENTITY_AUDIT_EVENTS_TABLE=IN_PROGRESS, DATABASE=IN_PROGRESS, EDGE_INDEX_COLLECTION_DELETE=IN_PROGRESS, RANGER_AUDITS_COLLECTION_DELETE=IN_PROGRESS, VERTEX_INDEX_COLLECITON_DELETE=IN_PROGRESS, ATLAS_JANUS_TABLE=IN_PROGRESS, FULLTEXT_INDEX_COLLECTION_DELETE=IN_PROGRESS}",
"status": "IN_PROGRESS",
"startTime": "2020-10-20 21:15:11.757",
"endTime": "",
"backupLocation": "s3a://bckp-cdp-bucket/backups/",
"failureReason": "null"
}
Monitor restore
$ cdp datalake restore-datalake-status --datalake-name bckp-cdp-dl
{
"accountId": "558bc1d2-8867-4357-8524-311d51259233",
"restoreId": "06c0bde4-cfc7-4b9e-a8e0-d9f2ddfcb5c5",
"backupId": "6c59a259-51ac-4db4-80d6-22f71f84cc4e",
"userCrn": "crn:altus:iam:us-west-1:558bc1d2-8867-4357-8524-311d51259233:user:86c4e7d9-1560-4afa-ac14-794bdeec0896",
"internalState": "{ATLAS_ENTITY_AUDIT_EVENTS_TABLE=IN_PROGRESS, EDGE_INDEX_COLLECTION=SUCCESSFUL, DATABASE=SUCCESSFUL, FULLTEXT_INDEX_COLLECTION=SUCCESSFUL, EDGE_INDEX_COLLECTION_DELETE=SUCCESSFUL, VERTEX_INDEX_COLLECITON_DELETE=SUCCESSFUL, RANGER_AUDITS_COLLECTION_DELETE=SUCCESSFUL, ATLAS_JANUS_TABLE=IN_PROGRESS, RANGER_AUDITS_COLLECTION=IN_PROGRESS, VERTEX_INDEX_COLLECITON=IN_PROGRESS, FULLTEXT_INDEX_COLLECTION_DELETE=SUCCESSFUL}",
"status": "IN_PROGRESS",
"startTime": "2020-10-20 21:15:11.757",
"endTime": "",
"backupLocation": "s3a://bckp-cdp-bucket/backups/",
"failureReason": "null"
}
Note: you can also monitor these events in the CDP Control Plane:
10-14-2020
09:53 PM
Hi Paul, I agree, a call with a shared screen would be better. So yes, I'm already in the Partner Program and I also got the Partner Development License in order to do this very same exercise with CDP. However, I can't access the cases in the Support Portal (I get a 'restricted access' message). If I look at the Partner Development Subscription, I see that Support is included with Gold and Platinum, but we are Silver... Maybe it depends on this? I contacted our Partner Sales Manager to see if he can help out. Best regards, Valerio
09-28-2020
05:14 AM
1 Kudo
Ha! Good catch!
08-31-2020
02:25 PM
2 Kudos
Cloudera Data Warehouse (CDW) provides highly efficient compute isolation and rapid scale-up and scale-down of data warehousing workloads, leveraging the latest container and caching technologies.
One of the great features of this architecture is the ability to bring on compute only on demand, as illustrated by the figure below:
This default setup is the most cost-effective, as only a few shared-services nodes (small nodes running services like UIs, Viz, Zookeeper, etc.) are long-lived. Each Virtual Warehouse has a set of nodes that only run when compute is needed (i.e. a new query on a non-cached dataset).
The caveat to this approach is that on a completely cold warehouse, the warm-up time from zero to compute is a minute or two.
An alternative to this default architecture is to leverage compute-reserved nodes that are shared between virtual warehouses, as depicted below:
With this architecture, a pool of reserved nodes can be used to enable the immediate availability of compute across virtual warehouses. In this article, I will showcase how to set up reserved instances in CDW.
Note: This article is a high-level tutorial. It is not my intent to detail the behavior of how reserved nodes are shared across warehouses, or recommend generic sizing. The number of instances and the VW behavior will depend on your implementation.
Step 1: Get your Kubeconfig
In CDW, go to your environment, click on the 3 dots on the environment box > Show Kubeconfig:
Grant your ARN access to the environment, and copy/download the kubeconfig (see this article for more details).
Step 2: Connect to your cluster
$ export KUBECONFIG=[path_to_your_kubeconfig]
$ kubectl get deployments -n cluster
NAME                                READY   UP-TO-DATE   AVAILABLE   AGE
ardent-ferret-efs-provisioner       2/2     2            2           4h46m
compute-reserved-node               0/0     0            0           4h46m
crusty-abalone-cluster-autoscaler   1/1     1            1           4h46m
nginx-default-backend               1/1     1            1           4h46m
nginx-service                       3/3     3            3           4h46m
shared-services-reserved-node       0/0     0            0           4h46m
Step 3: Modify the replicas of compute reserved nodes
kubectl edit deployment compute-reserved-node -n cluster
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
  creationTimestamp: "2020-08-31T16:28:52Z"
  generation: 1
  labels:
    app.kubernetes.io/instance: trendy-mastiff
    app.kubernetes.io/managed-by: Tiller
    app.kubernetes.io/name: cluster-overprovisioner
    cluster-overprovisioner-name: compute-reserved-node
    helm.sh/chart: cluster-overprovisioner-0.2.5
  name: compute-reserved-node
  namespace: cluster
  resourceVersion: "3476"
  selfLink: /apis/extensions/v1beta1/namespaces/cluster/deployments/compute-reserved-node
  uid: a5cb9ea1-729a-4665-9734-94c2f669984f
spec:
  progressDeadlineSeconds: 600
  replicas: 3
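If you prefer not to edit the manifest interactively, the same change can be made with kubectl scale (a sketch, using the same deployment and namespace as above):
$ kubectl scale deployment compute-reserved-node -n cluster --replicas=3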
Step 4: Verify your config
After a few minutes, you should see your configuration being applied:
$ kubectl get deployments -n cluster
NAME                                READY   UP-TO-DATE   AVAILABLE   AGE
ardent-ferret-efs-provisioner       2/2     2            2           4h54m
compute-reserved-node               3/3     3            3           4h54m
crusty-abalone-cluster-autoscaler   1/1     1            1           4h54m
nginx-default-backend               1/1     1            1           4h54m
nginx-service                       3/3     3            3           4h54m
shared-services-reserved-node       0/0     0            0           4h54m
08-19-2020
01:56 PM
1 Kudo
Inspired by @sunile_manjee's article How to use K9s to fetch metrics and logs for Cloudera Data Warehouse Experience, I decided to create the same tutorial for Cloudera Data Engineering.
The process is very similar, as you can see below.
Pre-Requisites
CDE environment setup
K9s installed on your machine (instructions here)
AWS user ARN
AWS configure (CLI) pointing to your AWS env (instructions here)
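Before starting, a quick sanity check that the AWS CLI is pointed at the right account can save some time (a sketch using standard AWS CLI calls; the Arn field returned by the second command is the user ARN you will need in Step 1):
$ aws configure list
$ aws sts get-caller-identity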
Step 1: Add your AWS user to the CDE environment
In your AWS account console, go to IAM > Users > Search for your user name > copy your ARN:
In the CDE main page, go to environment details:
Then, go to access and add your ARN:
Step 2: Download the kubeconfig
In the CDE main page, click on the 3 dots, then Download Kube Config:
Step 3: Run K9s
You can now run k9s, by running:
k9s --kubeconfig ~/Downloads/kubeconfig
Note: the path to your downloaded kubeconfig is subject to change of course.
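If k9s cannot connect, it can help to first confirm the kubeconfig itself works with a plain kubectl call (assuming kubectl is installed locally):
$ kubectl --kubeconfig ~/Downloads/kubeconfig get pods --all-namespaces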
08-10-2020
02:14 PM
2 Kudos
Moving data from your local machine to the cloud has never been easier using the NiFi site-to-site protocol and CDP Datahub. In this article, I will focus on how to set up site-to-site communication between your local machine and CDP Public Cloud, without using the default Knox CDP Proxy.
This configuration assumes that you already have a local instance of NiFi (or MiNiFi) and a CDP Datahub Cluster running NiFi. If you want to learn how to use CDP Public Cloud, please visit our overview page and documentation.
This setup will be executed in 4 steps:
Step 1: Open CDP to your local IP
Step 2: Download and configure stores on your local machine
Step 3: Configure a simple site-to-site flow
Step 4: Authorize this flow in Ranger
Step 1: Open CDP to your local IP
Go to your CDP Management Console, and find your datahub (here pvn-nifi).
At the bottom of the datahub page, click on Hardware and locate one of the instances running NiFi:
Click on the instance and you will be redirected to your cloud provider (here AWS):
At the bottom of the screen, click on the security group associated with your instance, and you will be redirected to that security group config page:
Click on Edit inbound rules and add a rule opening TCP port 8443 to your local IP:
Save these changes.
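Alternatively, if you manage security groups from the AWS CLI, the same rule can be added with authorize-security-group-ingress (a sketch; the group ID and IP address are placeholders):
$ aws ec2 authorize-security-group-ingress --group-id [your_security_group_id] --protocol tcp --port 8443 --cidr [your_local_ip]/32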
Step 2: Download and configure stores on your local machine
Connect to one of the NiFi machines with the Cloudbreak user and the key you used at deployment:
$ ssh -i [path_to_private_key] cloudbreak@[your_nifi_host]
Copy and authorize the key and trust stores:
$ sudo su
$ cp /var/lib/cloudera-scm-agent/agent-cert/cm-auto-host_keystore.jks /tmp
$ cp /var/lib/cloudera-scm-agent/agent-cert/cm-auto-global_truststore.jks /tmp
$ chmod a+rw /tmp/cm-auto-host_keystore.jks
$ chmod a+rw /tmp/cm-auto-global_truststore.jks
Disconnect from the remote machine and copy these stores:
$ cd ~/Desktop
$ scp -i [path_to_private_key] cloudbreak@[your_nifi_host]:/tmp/cm-auto-host_keystore.jks cm-auto-host_keystore.jks
$ scp -i [path_to_private_key] cloudbreak@[your_nifi_host]:/tmp/cm-auto-global_truststore.jks cm-auto-global_truststore.jks
Configure your local NiFi with these stores, by modifying your nifi.properties:
nifi.security.keystore=/Users/pvidal/Desktop/cm-auto-host_keystore.jks
nifi.security.keystoreType=JKS
nifi.security.keystorePasswd=[keystore_pw]
nifi.security.keyPasswd=[keystore_pw]
nifi.security.truststore=/Users/pvidal/Desktop/cm-auto-global_truststore.jks
nifi.security.truststoreType=JKS
nifi.security.truststorePasswd=[truststore_pw]
Note: To know the passwords of these stores, please connect with your Cloudera team.
Restart your local NiFi instance: nifi restart
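If the restart fails with keystore errors, you can check that the copied stores are intact with keytool, which ships with the JDK (you will be prompted for the store passwords mentioned above):
$ keytool -list -keystore ~/Desktop/cm-auto-host_keystore.jks
$ keytool -list -keystore ~/Desktop/cm-auto-global_truststore.jks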
Step 3: Configure a simple site-to-site flow
Local instance
Create a process group to host your flow (here called S2S Cloud):
In this process group, create a remote process group instance and configure it with the address of one of your cloud NiFi instances and the HTTP protocol:
Create a simple GenerateFlowFile processor and connect it to the remote process group. Note: Without configuring Ranger, you will get a Forbidden warning (see Step 4).
CDP Public Instance
Create a process group to host your flow (here called Receive from on prem):
In this process group, create an input port accepting remote connections:
Finally, create a flow that takes the data and logs it:
Start your flow.
Step 4: Authorize this flow in Ranger
From the Cloudera Management console, go to Ranger and your NiFi service:
From the list of policies, create a new policy (here called s2s) that will allow access to your specific process group and the site-to-site protocol (Ranger does auto completion):
Save this policy, and go back to your local machine; you can now enable the remote process group and start sending files!
Example of successful flows
Local Flow
CDP Public Flow
06-04-2020
11:38 AM
1 Kudo
Cloudera Machine Learning (and Cloudera Data Science Workbench) is built on a very robust and flexible framework to ease integration with third parties. In this article, I decided to explore the integration of Teradata with CML via ODBC.
A few notes before I dive in:
I could have easily used JDBC via JayDeBeApi (see my previous article), but where is the fun in that?
This article could be generalized to other ODBC connections, provided the proper parameters are set up
To get a fully rounded tutorial, I will go through these 3 steps:
Step 1 (optional if you already have a Teradata instance): Setting up Teradata Vantage in AWS
Step 2: Create a custom image with Teradata driver
Step 3: Configure and run python to Teradata ODBC
Step 1: Setting up Teradata Vantage in AWS
If you want to do development testing of Teradata, you can use Teradata Vantage Developer on the AWS Marketplace. I am not going to dive too much into this, as the configuration may change as Teradata evolves.
Pre-Requisites
The Teradata Vantage Developer is an AWS CloudFormation template that requires the following assets to be pre-created:
An Internet Gateway
A VPC associated with this IGW (and DNS / DNS Hostnames support)
A subnet (with a route to the IGW and Public IP creation on launch)
A placement group
Setup
Once the assets are created, go to Teradata Vantage Developer and select your configuration (make sure you select the same region as the assets you created):
From this configuration, launch the CloudFormation template and fill the parameters with the AWS assets you created:
The template will create two nodes by default. We will connect to one of them to create a test user:
$ ssh ec2-user@[your_public_hostname]
ec2-user@SMP001-01:~> bteq
Teradata BTEQ 16.20.00.01 for LINUX. PID: 17205
Copyright 1984-2017, Teradata Corporation. ALL RIGHTS RESERVED.
Enter your logon or BTEQ command:
.LOGON
UserId: dbc
Password: [password_set_in_cf_template]
CREATE user test AS password=test perm = 200000000, SPOOL = 100000000;
You can then connect with your favorite SQL editor to test the connection and run table creation:
CREATE TABLE "test".TEST_TABLE (
COLUMN1 VARCHAR(100),
COLUMN2 VARCHAR(100)
) ;
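Optionally, insert a test row so that the query used in Step 3 returns data (a hypothetical row, purely for illustration):
INSERT INTO "test".TEST_TABLE VALUES ('hello', 'world');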
Step 2: Create a custom image with Teradata driver
I created the following image by following the steps in the documentation on custom engines:
#Dockerfile
FROM docker.repository.cloudera.com/cdsw/engine:11-cml-2020.04-1
WORKDIR /tmp
#The RUN commands that install an editor
#For example: RUN apt-get install myeditor
RUN apt-get autoremove unixodbc -y
RUN apt-get update -y
RUN apt-get install lib32stdc++6 -y
RUN apt-get install wget -y
RUN wget [LOCATION_OF_DRIVERS]
RUN tar -xzvf tdodbc1620__ubuntu_indep.16.20.00.87-1.tar.gz
RUN dpkg -i tdodbc1620/tdodbc1620-16.20.00.87-1.noarch.deb
RUN apt-get install -y python-pyodbc
A few notes on this image:
I removed unixodbc because I read that it was causing issues with pyodbc, but it may not be required
You can find a built image on my dockerhub here
Finally, connect to CML and add this new engine:
Step 3: Configure and run python to Teradata ODBC
Go to your workbench in a new project, and create a session with the engine you created.
Then run the following: pip3 install pyodbc
After it is installed, go to your odbc.ini file: vi /home/cdsw/.odbc.ini
Configure your file as follows:
[ODBC Data Sources]
Teradata ODBC DSN = Teradata ODBC Driver 16.20
TEST = My DSN Description
[TEST]
Driver = /opt/teradata/client/16.20/odbc_64/lib/tdataodbc_sb64.so
Description = Teradata database
DBCName = [your_teradata_hostname]
UID = test
PWD = test
Finally, you can run the following code to test:
import pyodbc
import pandas as pd
conn = pyodbc.connect('DSN=TEST')
# Define Cursor
cus=conn.cursor()
query = "select * from test.TEST_TABLE;"
# Execute SQL statement and store the result in the cursor
cus.execute(query)
# Display the content of cursor
row = cus.fetchone()
print(row)
# Use Pandas to execute and retrieve results
df = pd.read_sql(query, conn)
print(df)
The output in the workbench should look something like the following:
05-12-2020
08:11 AM
Check out our series of SDX videos on our youtube channel for some examples of Atlas + Ranger in action as part of the CDP Shared Data Experience (SDX): https://www.youtube.com/playlist?list=PLe-h9HrA9qfCj2SI5BrvCdQOEOjHUWMBB
05-05-2020
06:11 PM
4 Kudos
This article explains how to set up Hive Warehouse Connector (HWC) in CDP Public Cloud (tested with CDP Public Cloud Datahub runtime 7.1).
Note: you must have admin privileges on your datahub cluster to do this configuration.
Step 1: Whitelist the path to the HWC jar
In your CDP Datahub, open Cloudera Manager (CM) via the link in the management console:
In CM, go to your cluster > Livy > Configuration, and search for livy-conf/livy.conf:
Add the following safety valve to the configuration file:
livy.file.local-dir-whitelist=/path_for_hwc/
In our example, we are using the /tmp/ folder.
Restart the Livy service via CM to propagate the configuration.
Step 2: Copy the HWC jar to the whitelisted location
Find the hostname of the node where Livy is installed (master3 here):
Connect to the node using your user/workload password, e.g.: ssh pvidal@viz-data-engineering-master3.viz-cdp.a465-9q4k.cloudera.site
Find the HWC jar: [pvidal@viz-data-engineering-master3 /]$ find / -name *hive-warehouse-connector* 2>/dev/null
./opt/cloudera/parcels/CDH-7.1.0-1.cdh7.1.0.p0.1922354/lib/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.7.1.0.0-714.jar
./opt/cloudera/parcels/CDH-7.1.0-1.cdh7.1.0.p0.1922354/jars/hive-warehouse-connector-assembly-1.0.0.7.1.0.0-714.jar
Copy it and add the right permissions:
cp /opt/cloudera/parcels/CDH-7.1.0-1.cdh7.1.0.p0.1922354/jars/hive-warehouse-connector-assembly-1.0.0.7.1.0.0-714.jar /tmp
chmod a+rw /tmp/hive-warehouse-connector-assembly-1.0.0.7.1.0.0-714.jar
Step 3: Add jar path to Zeppelin Livy interpreter
From your management console, open Zeppelin:
Go to the top right, and configure your Interpreters:
Edit the livy interpreter and add the following properties:
HWC Jar Location
  Name: livy.spark.jars
  Value: file:///[LOCATION_OF_YOUR_HWC_JAR]
Hive JDBC URL
  Name: livy.spark.sql.hive.hiveserver2.jdbc.url
  Value: [JDBC_URL_FROM_MANAGEMENT_CONSOLE];user=[your_user];password=[your_password]
  (You can find the JDBC URL in your datahub management console.)
Read via LLAP
  Name: livy.spark.datasource.hive.warehouse.read.via.llap
  Value: false
JDBC mode
  Name: livy.spark.datasource.hive.warehouse.read.jdbc.mode
  Value: client
Staging Dir
  Name: livy.spark.datasource.hive.warehouse.load.staging.dir
  Value: /tmp
Metastore URI
  Name: livy.spark.datasource.hive.warehouse.metastoreUri
  Value: [VALUE_FROM_HIVE_SITE_XML]
You can download hive-site.xml from CM, by going to Your Cluster > Hive > Download Client Configuration:
Save your configuration, and restart your interpreter.
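Before writing a full flow, you can run a quick smoke test in a Zeppelin paragraph against the Livy interpreter, using only the calls that appear later in this article; if the import resolves and the session builds, the HWC jar is on the classpath:
import com.hortonworks.hwc.HiveWarehouseSession
val hive = HiveWarehouseSession.session(spark).build()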
Step 4: Code away
Here is a simple example of Spark reading from a CSV and writing to a Hive table using HWC:
Read raw location data
val locationDf = spark.read.options(Map("inferSchema"->"true","delimiter"->",","header"->"true")).csv("s3a://viz-cdp-bucket/raw/locations.csv")
locationDf.printSchema()
Setup HWC session
import com.hortonworks.hwc.HiveWarehouseSession
import com.hortonworks.hwc.HiveWarehouseSession._
val hive = HiveWarehouseSession.session(spark).build()
Create database and save dataset to table
hive.executeUpdate("CREATE DATABASE worldwidebank");
hive.setDatabase("worldwidebank");
locationDf.write.format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector").option("table", "locations").save()
Query data
val ds = hive.sql("select * from locations limit 10")
ds.show()