Member since: 07-10-2018 | Posts: 63 | Kudos Received: 71 | Solutions: 0
09-24-2020
07:03 AM
Hey, have you checked this article? https://community.cloudera.com/t5/Community-Articles/How-to-connect-to-CDP-Impala-from-python/ta-p/296405
08-31-2020
02:25 PM
2 Kudos
Cloudera Data Warehouse (CDW) delivers highly efficient compute isolation and the ability to rapidly scale data warehousing workloads up and down, leveraging the latest container and caching technologies.
One of the great features of this architecture is the ability to only bring compute on-demand, as illustrated by the figure below:
This default setup is the most cost-effective, as only a few shared-services nodes (small nodes running services like UIs, Viz, ZooKeeper, etc.) are long-lived. Each Virtual Warehouse has a set of nodes that run only when compute is needed (i.e., a new query on a non-cached dataset).
The caveat to this approach is that, on a completely cold warehouse, the warm-up time from zero to available compute is a minute or two.
An alternative to this default architecture is to leverage compute-reserved nodes, which are shared between Virtual Warehouses, as depicted below:
With this architecture, a pool of reserved nodes enables the immediate availability of compute across Virtual Warehouses. In this article, I will showcase how to set up reserved instances in CDW.
Note: This article is a high-level tutorial. It is not my intent to detail how reserved nodes are shared across warehouses, or to recommend generic sizing; the number of instances and the Virtual Warehouse behavior will depend on your implementation.
Step 1: Get your Kubeconfig
In CDW, go to your environment, click on the 3 dots on the environment box > Show Kubeconfig:
Grant your ARN access to the environment, and copy/download the kubeconfig (see this article for more details).
Step 2: Connect to your cluster
$ export KUBECONFIG=[path_to_your_kubeconfig]
$ kubectl get deployments -n cluster
NAME                                READY   UP-TO-DATE   AVAILABLE   AGE
ardent-ferret-efs-provisioner       2/2     2            2           4h46m
compute-reserved-node               0/0     0            0           4h46m
crusty-abalone-cluster-autoscaler   1/1     1            1           4h46m
nginx-default-backend               1/1     1            1           4h46m
nginx-service                       3/3     3            3           4h46m
shared-services-reserved-node       0/0     0            0           4h46m
Step 3: Modify the replicas of compute reserved nodes
kubectl edit deployment compute-reserved-node -n cluster
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
  creationTimestamp: "2020-08-31T16:28:52Z"
  generation: 1
  labels:
    app.kubernetes.io/instance: trendy-mastiff
    app.kubernetes.io/managed-by: Tiller
    app.kubernetes.io/name: cluster-overprovisioner
    cluster-overprovisioner-name: compute-reserved-node
    helm.sh/chart: cluster-overprovisioner-0.2.5
  name: compute-reserved-node
  namespace: cluster
  resourceVersion: "3476"
  selfLink: /apis/extensions/v1beta1/namespaces/cluster/deployments/compute-reserved-node
  uid: a5cb9ea1-729a-4665-9734-94c2f669984f
spec:
  progressDeadlineSeconds: 600
  replicas: 3
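If you would rather not edit the deployment interactively, a kubectl scale command should achieve the same change. This is a minimal sketch assuming the deployment and namespace names from the listing above, and a replica count of 3 that you should adjust to your own sizing:
# Scale the reserved-node deployment without opening an editor (adjust replicas to your needs)
kubectl scale deployment compute-reserved-node --replicas=3 -n cluster
# Confirm the desired replica count was recorded
kubectl get deployment compute-reserved-node -n cluster -o jsonpath='{.spec.replicas}{"\n"}'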
Step 4: Verify your config
After a few minutes, you should see your configuration being applied:
$ kubectl get deployments -n cluster
NAME                                READY   UP-TO-DATE   AVAILABLE   AGE
ardent-ferret-efs-provisioner       2/2     2            2           4h54m
compute-reserved-node               3/3     3            3           4h54m
crusty-abalone-cluster-autoscaler   1/1     1            1           4h54m
nginx-default-backend               1/1     1            1           4h54m
nginx-service                       3/3     3            3           4h54m
shared-services-reserved-node       0/0     0            0           4h54m
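To watch the reserved pods (and the nodes the autoscaler brings up to host them) come online, something like the following can help; the label selector is taken from the deployment metadata above and may differ in your cluster:
# Reserved placeholder pods created by the overprovisioner deployment
kubectl get pods -n cluster -l cluster-overprovisioner-name=compute-reserved-node -o wide
# Nodes joining the cluster as the autoscaler reacts
kubectl get nodes --watch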
08-19-2020
01:56 PM
1 Kudo
Inspired by @sunile_manjee's article How to use K9s to fetch metrics and logs for Cloudera Data Warehouse Experience, I decided to create the same tutorial for Cloudera Data Engineering.
The process is very similar, as you can see below.
Pre-Requisites
CDE environment setup
K9s installed on your machine (instructions here)
AWS user ARN
AWS CLI configured to point to your AWS environment (instructions here); a quick verification sketch follows this list
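Before starting, a quick sanity check of the prerequisites can save time. This is a sketch assuming the AWS CLI and k9s are already on your PATH:
# Prints the user ARN you will need in Step 1
aws sts get-caller-identity --query Arn --output text
# Confirms k9s is installed and shows which version you are running
k9s version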
Step 1: Add your AWS user to the CDE environment
In your AWS account console, go to IAM > Users > Search for your user name > copy your ARN:
In the CDE main page, go to environment details:
Then, go to access and add your ARN:
Step 2: Download the kubeconfig
In the CDE main page, click on the 3 dots, then Download Kube Config:
Step 3: Run K9s
You can now run k9s:
k9s --kubeconfig ~/Downloads/kubeconfig
Note: The path to your downloaded kubeconfig is, of course, subject to change.
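The same kubeconfig also works with plain kubectl, which can be handy for one-off commands outside of the k9s UI; a small sketch assuming the default download path used above:
export KUBECONFIG=~/Downloads/kubeconfig
kubectl get pods --all-namespaces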
08-10-2020
02:14 PM
2 Kudos
Moving data from your local machine to the cloud has never been easier, using the NiFi site-to-site protocol and CDP Datahub. In this article, I will focus on how to set up site-to-site communication between your local machine and CDP Public Cloud, without using the default Knox CDP proxy.
This configuration assumes that you already have a local instance of NiFi (or MiNiFi) and a CDP Datahub Cluster running NiFi. If you want to learn how to use CDP Public Cloud, please visit our overview page and documentation.
This setup will be executed in 4 steps:
Step 1: Open CDP to your local IP
Step 2: Download and configure stores on your local machine
Step 3: Configure a simple site-to-site flow
Step 4: Authorize this flow in Ranger
Step 1: Open CDP to your local IP
Go to your CDP Management Console, and find your datahub (here pvn-nifi).
At the bottom of the datahub page, click on Hardware and locate one of the instances running NiFi:
Click on the instance and you will be redirected to your cloud provider (here, AWS):
At the bottom of the screen, click on the security group associated with your instance, and you will be redirected to that security group config page:
Click on Edit inbound rules and add a rule opening TCP port 8443 to your local IP:
Save these changes.
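If you prefer the AWS CLI over the console for this step, the rule can be added with authorize-security-group-ingress. This is a sketch with a hypothetical security group ID, and it derives your public IP from an external service:
# sg-0123456789abcdef0 is a placeholder; use the security group ID found above
my_ip=$(curl -s https://checkip.amazonaws.com)
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp --port 8443 \
  --cidr ${my_ip}/32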
Step 2: Download and configure stores on your local machine
Connect to one of the NiFi machines with the Cloudbreak user and the key you used at deployment:
$ ssh -i [path_to_private_key] cloudbreak@[your_nifi_host]
Copy and authorize the key and trust stores:
$ sudo su
$ cp /var/lib/cloudera-scm-agent/agent-cert/cm-auto-host_keystore.jks /tmp
$ cp /var/lib/cloudera-scm-agent/agent-cert/cm-auto-global_truststore.jks /tmp
$ chmod a+rw /tmp/cm-auto-host_keystore.jks
$ chmod a+rw /tmp/cm-auto-global_truststore.jks
Disconnect from the remote machine and copy these stores:
$ cd ~/Desktop
$ scp -i [path_to_private_key] cloudbreak@[your_nifi_host]:/tmp/cm-auto-host_keystore.jks cm-auto-host_keystore.jks
$ scp -i [path_to_private_key] cloudbreak@[your_nifi_host]:/tmp/cm-auto-global_truststore.jks cm-auto-global_truststore.jks
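Optionally, you can confirm the stores copied over intact by listing their entries with keytool; a sketch that assumes a local JDK and the store passwords mentioned in the note below:
keytool -list -keystore ~/Desktop/cm-auto-host_keystore.jks -storepass [keystore_pw]
keytool -list -keystore ~/Desktop/cm-auto-global_truststore.jks -storepass [truststore_pw]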
Configure your local NiFi with these stores by modifying your nifi.properties:
nifi.security.keystore=/Users/pvidal/Desktop/cm-auto-host_keystore.jks
nifi.security.keystoreType=JKS
nifi.security.keystorePasswd=[keystore_pw]
nifi.security.keyPasswd=[keystore_pw]
nifi.security.truststore=/Users/pvidal/Desktop/cm-auto-global_truststore.jks
nifi.security.truststoreType=JKS
nifi.security.truststorePasswd=[truststore_pw]
Note: To find out the passwords for these stores, please contact your Cloudera team.
Restart your local NiFi instance:
nifi restart
Step 3: Configure a simple site-to-site flow
Local instance
Create a process group to host your flow (here called S2S Cloud):
In this process group, create a remote process group and configure it with the address of one of your cloud NiFi instances, using the HTTP protocol:
Create a simple GenerateFlowFile processor and connect it to the remote process group:
Note: Without configuring Ranger, you will get a Forbidden warning (see Step 4).
CDP Public Instance
Create a process group to host your flow (here called Receive from on prem):
In this process group, create an input port accepting remote connections:
Finally, create a flow that takes the data and logs it:
Start your flow.
Step 4: Authorize this flow in Ranger
From the Cloudera Management console, go to Ranger and your NiFi service:
From the list of policies, create a new policy (here called s2s) that will allow access to your specific process group and the site-to-site protocol (Ranger does auto completion):
Save this policy, and go back to your local machine; you can now enable the remote process group and start sending files!
Example of successful flows
Local Flow
CDP Public Flow
06-04-2020
11:38 AM
1 Kudo
Cloudera Machine Learning (and Cloudera Data Science Workbench) is built on a very robust and flexible framework to ease integration with third parties. In this article, I decided to explore the integration of Teradata with CML via ODBC.
A few notes before I dive in:
I could have easily used JDBC via JayDeBeApi (see my previous article), but where is the fun in that?
This article can be generalized to other ODBC connections, provided the proper parameters are set up
To get a fully rounded tutorial, I will go through these 3 steps:
Step 1 (optional if you already have a Teradata instance): Setting up Teradata Vantage in AWS
Step 2: Create a custom image with Teradata driver
Step 3: Configure and run python to Teradata ODBC
Step 1: Setting up Teradata Vantage in AWS
If you want to do development testing of Teradata, you can use Teradata Vantage Developer on the AWS Market Place. I am not going to dive too much into this as this configuration may change as Teradata evolves.
Pre-Requisites
The Teradata Vantage Developer offering is an AWS CloudFormation template that requires the following assets to be pre-created (a CLI sketch for creating them follows this list):
An Internet Gateway
A VPC associated with this IGW (and DNS / DNS Hostnames support)
A subnet (with a route to the IGW and Public IP creation on launch)
A placement group
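For reference, here is a minimal sketch of creating those prerequisites with the AWS CLI; the CIDRs, names, and placement group strategy are hypothetical and should be adapted to what the CloudFormation template expects:
# Hypothetical names and CIDRs; adapt to your environment
vpc_id=$(aws ec2 create-vpc --cidr-block 10.20.0.0/16 | jq -r .Vpc.VpcId)
aws ec2 modify-vpc-attribute --vpc-id $vpc_id --enable-dns-support '{"Value":true}'
aws ec2 modify-vpc-attribute --vpc-id $vpc_id --enable-dns-hostnames '{"Value":true}'
# Internet gateway attached to the VPC
igw_id=$(aws ec2 create-internet-gateway | jq -r .InternetGateway.InternetGatewayId)
aws ec2 attach-internet-gateway --internet-gateway-id $igw_id --vpc-id $vpc_id
# Subnet with a route to the IGW and public IPs on launch
subnet_id=$(aws ec2 create-subnet --vpc-id $vpc_id --cidr-block 10.20.0.0/24 | jq -r .Subnet.SubnetId)
aws ec2 modify-subnet-attribute --subnet-id $subnet_id --map-public-ip-on-launch
rt_id=$(aws ec2 create-route-table --vpc-id $vpc_id | jq -r .RouteTable.RouteTableId)
aws ec2 create-route --route-table-id $rt_id --destination-cidr-block 0.0.0.0/0 --gateway-id $igw_id
aws ec2 associate-route-table --subnet-id $subnet_id --route-table-id $rt_id
# Placement group
aws ec2 create-placement-group --group-name td-vantage-pg --strategy cluster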
Setup
Once the assets are created, go to Teradata Vantage Developer and select your configuration (make sure you select the same region as the assets you created):
From this configuration, launch the CloudFormation template and fill the parameters with the AWS assets you created:
The template will create two nodes by default. We will connect to one of them to create a test user:
$ ssh ec2-user@[your_public_hostname]
ec2-user@SMP001-01:~> bteq
Teradata BTEQ 16.20.00.01 for LINUX. PID: 17205
Copyright 1984-2017, Teradata Corporation. ALL RIGHTS RESERVED.
Enter your logon or BTEQ command:
.LOGON
UserId: dbc
Password: [password_set_in_cf_template]
CREATE user test AS password=test perm = 200000000, SPOOL = 100000000;
You can then connect with your favorite SQL editor to test the connection and create a test table:
CREATE TABLE "test".TEST_TABLE (
COLUMN1 VARCHAR(100),
COLUMN2 VARCHAR(100)
) ;
Step 2: Create a custom image with Teradata driver
I created the following image by following the steps in the documentation on custom engines:
#Dockerfile
FROM docker.repository.cloudera.com/cdsw/engine:11-cml-2020.04-1
WORKDIR /tmp
#The RUN commands that install an editor
#For example: RUN apt-get install myeditor
RUN apt-get autoremove unixodbc -y
RUN apt-get update -y
RUN apt-get install lib32stdc++6 -y
RUN apt-get install wget -y
RUN wget [LOCATION_OF_DRIVERS]
RUN tar -xzvf tdodbc1620__ubuntu_indep.16.20.00.87-1.tar.gz
RUN dpkg -i tdodbc1620/tdodbc1620-16.20.00.87-1.noarch.deb
RUN apt-get install -y python-pyodbc
A few notes on this image (a build-and-push sketch follows these notes):
I removed unixodbc because I read it was causing issues with pyodbc, but this step may not be required
You can find a built image on my dockerhub here
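If you want to build and publish the image yourself rather than pull mine, a sketch along these lines should work from the directory containing the Dockerfile (the repository name and tag are placeholders):
docker build -t [your_dockerhub_user]/cml-teradata-odbc:16.20 .
docker push [your_dockerhub_user]/cml-teradata-odbc:16.20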
Finally, connect to CML and add this new engine:
Step 3: Configure and run python to Teradata ODBC
Go to your workbench in a new project, and create a session with the engine you created.
Then run the following:
pip3 install pyodbc
After it is installed, open your odbc.ini file:
vi /home/cdsw/.odbc.ini
Configure your file as follows (a quick driver sanity check follows this config):
[ODBC Data Sources]
Teradata ODBC DSN = Teradata ODBC Driver 16.20
TEST = My DSN Description
[TEST]
Driver = /opt/teradata/client/16.20/odbc_64/lib/tdataodbc_sb64.so
Description = Teradata database
DBCName = [your_teradata_hostname]
UID = test
PWD = test
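Before running the Python test, a quick check from a terminal session can confirm the driver and pyodbc are in place; a sketch assuming the driver path configured above:
# The driver library installed by the Teradata ODBC package
ls -l /opt/teradata/client/16.20/odbc_64/lib/tdataodbc_sb64.so
# Confirms pyodbc is importable in the session's Python
python3 -c "import pyodbc; print(pyodbc.version)"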
Finally, you can run the following code to test the connection:
import pyodbc
import pandas as pd
conn = pyodbc.connect('DSN=TEST')
# Define Cursor
cus=conn.cursor()
query = "select * from test.TEST_TABLE;"
# Execute SQL statement and store the result in the cursor
cus.execute(query)
# Display the content of cursor
row = cus.fetchone()
print(row)
# Use Pandas to execute and retrieve results
df = pd.read_sql(query, conn)
print(df)
The output in the workbench should look something like the following:
05-29-2020
08:05 AM
2 Kudos
Cloudera Data Platform recently introduced Cluster Connectivity Manager (CCM) in tech preview, a feature with which CDP can communicate with customer workload clusters without requiring inbound network connections to them.
In this article, I will highlight how to automate the setup of CCM for both AWS and Azure including:
Cloud Network Setup
CDP Environment setup
Before you continue reading:
You can find all automation on my github (here)
Please refer to the official documentation for more details (here)
This is just an example, different deployments require a different set of parameters!
Overview
As highlighted in the above image, CCM uses reverse tunneling to communicate back to your CDP dedicated control plane.
A few things to note:
Each tenant (i.e. customer) has its own dedicated tunnel using a dedicated Apache MINA server
The diagram above describes a setup in AWS. The Azure setup is very similar but does not require public subnets or NAT gateways.
Network Setup
Note: For both setups, you will notice that port 22 is open inbound for a specific CIDR. This is to enable SSH to the different boxes for CDP, and it can be restricted to a CIDR within the VPC.
AWS
For AWS, we will need the following setup:
One VPC
3 public subnets
1 internet gateway
1 public route table routing the public subnets to the IGW
3 private subnets
3 NAT gateways and private route tables (one per private subnet)
2 security groups
The following is a sample code:
#!/bin/bash
display_usage() {
echo "
Usage:
$(basename "$0") [--help or -h] <prefix> <region> <sg_cidr>
Description:
Creates network assets for CDP env deployment
Arguments:
prefix: prefix of your assets
region: AWS region
sg_cidr: CIDR to open in your security group
--help or -h: displays this help"
}
# check whether user had supplied -h or --help . If yes display usage
if [[ ( $1 == "--help") || $1 == "-h" ]]
then
display_usage
exit 0
fi
# Check the numbers of arguments
if [ $# -lt 3 ]
then
echo "Not enough arguments!" >&2
display_usage
exit 1
fi
if [ $# -gt 3 ]
then
echo "Too many arguments!" >&2
display_usage
exit 1
fi
prefix=$1
region=$2
sg_cidr=$3
# 1. Creating VPC
vpc_id=$(aws ec2 create-vpc --cidr-block 10.10.0.0/16 | jq -r .Vpc.VpcId)
aws ec2 create-tags --resources $vpc_id --tags Key=Name,Value="$prefix-cdp-vpc" > /dev/null 2>&1
# 2. Creating public subnets
# 2.1. Subnets
public_sub_1=$(aws ec2 create-subnet --vpc-id $vpc_id --cidr-block 10.10.0.0/24 --availability-zone "$region"a | jq -r .Subnet.SubnetId)
public_sub_2=$(aws ec2 create-subnet --vpc-id $vpc_id --cidr-block 10.10.1.0/24 --availability-zone "$region"b | jq -r .Subnet.SubnetId)
public_sub_3=$(aws ec2 create-subnet --vpc-id $vpc_id --cidr-block 10.10.2.0/24 --availability-zone "$region"c | jq -r .Subnet.SubnetId)
aws ec2 create-tags --resources $public_sub_1 --tags Key=Name,Value="$prefix-pub-subnet-1" > /dev/null 2>&1
aws ec2 create-tags --resources $public_sub_2 --tags Key=Name,Value="$prefix-pub-subnet-2" > /dev/null 2>&1
aws ec2 create-tags --resources $public_sub_3 --tags Key=Name,Value="$prefix-pub-subnet-3" > /dev/null 2>&1
# 2.2. Internet gateway
igw_id=$(aws ec2 create-internet-gateway | jq -r .InternetGateway.InternetGatewayId)
aws ec2 create-tags --resources $igw_id --tags Key=Name,Value="$prefix-igw"
aws ec2 attach-internet-gateway --internet-gateway-id $igw_id --vpc-id $vpc_id > /dev/null 2>&1
aws ec2 modify-vpc-attribute --enable-dns-support "{\"Value\":true}" --vpc-id $vpc_id > /dev/null 2>&1
aws ec2 modify-vpc-attribute --enable-dns-hostnames "{\"Value\":true}" --vpc-id $vpc_id > /dev/null 2>&1
# 2.3. Route
route_pub=$(aws ec2 create-route-table --vpc-id $vpc_id | jq -r .RouteTable.RouteTableId)
aws ec2 create-tags --resources $route_pub --tags Key=Name,Value="$prefix-pub-route" > /dev/null 2>&1
aws ec2 create-route --route-table-id $route_pub --destination-cidr-block 0.0.0.0/0 --gateway-id $igw_id > /dev/null 2>&1
aws ec2 associate-route-table --subnet-id $public_sub_1 --route-table-id $route_pub > /dev/null 2>&1
aws ec2 associate-route-table --subnet-id $public_sub_2 --route-table-id $route_pub > /dev/null 2>&1
aws ec2 associate-route-table --subnet-id $public_sub_3 --route-table-id $route_pub > /dev/null 2>&1
# 3. Creating private subnets
# 3.1. Subnets
private_sub_1=$(aws ec2 create-subnet --vpc-id $vpc_id --cidr-block 10.10.160.0/19 --availability-zone "$region"a | jq -r .Subnet.SubnetId)
private_sub_2=$(aws ec2 create-subnet --vpc-id $vpc_id --cidr-block 10.10.192.0/19 --availability-zone "$region"b | jq -r .Subnet.SubnetId)
private_sub_3=$(aws ec2 create-subnet --vpc-id $vpc_id --cidr-block 10.10.224.0/19 --availability-zone "$region"c | jq -r .Subnet.SubnetId)
aws ec2 create-tags --resources $private_sub_1 --tags Key=Name,Value="$prefix-priv-subnet-1" > /dev/null 2>&1
aws ec2 create-tags --resources $private_sub_2 --tags Key=Name,Value="$prefix-priv-subnet-2" > /dev/null 2>&1
aws ec2 create-tags --resources $private_sub_3 --tags Key=Name,Value="$prefix-priv-subnet-3" > /dev/null 2>&1
# 3.2. NAT gateways
alloc_id_1=$(aws ec2 allocate-address --domain vpc | jq -r .AllocationId)
alloc_id_2=$(aws ec2 allocate-address --domain vpc | jq -r .AllocationId)
alloc_id_3=$(aws ec2 allocate-address --domain vpc | jq -r .AllocationId)
nat_1=$(aws ec2 create-nat-gateway --subnet-id $public_sub_1 --allocation-id $alloc_id_1 | jq -r .NatGateway.NatGatewayId)
sleep 30
nat_2=$(aws ec2 create-nat-gateway --subnet-id $public_sub_2 --allocation-id $alloc_id_2 | jq -r .NatGateway.NatGatewayId)
sleep 30
nat_3=$(aws ec2 create-nat-gateway --subnet-id $public_sub_3 --allocation-id $alloc_id_3 | jq -r .NatGateway.NatGatewayId)
sleep 30
# 3.3. Routes
route_priv_1=$(aws ec2 create-route-table --vpc-id $vpc_id | jq -r .RouteTable.RouteTableId)
route_priv_2=$(aws ec2 create-route-table --vpc-id $vpc_id | jq -r .RouteTable.RouteTableId)
route_priv_3=$(aws ec2 create-route-table --vpc-id $vpc_id | jq -r .RouteTable.RouteTableId)
aws ec2 create-tags --resources $route_priv_1 --tags Key=Name,Value="$prefix-priv-route-1" > /dev/null 2>&1
aws ec2 create-tags --resources $route_priv_2 --tags Key=Name,Value="$prefix-priv-route-2" > /dev/null 2>&1
aws ec2 create-tags --resources $route_priv_3 --tags Key=Name,Value="$prefix-priv-route-3" > /dev/null 2>&1
aws ec2 create-route --route-table-id $route_priv_1 --destination-cidr-block 0.0.0.0/0 --nat-gateway-id $nat_1 > /dev/null 2>&1
aws ec2 create-route --route-table-id $route_priv_2 --destination-cidr-block 0.0.0.0/0 --nat-gateway-id $nat_2 > /dev/null 2>&1
aws ec2 create-route --route-table-id $route_priv_3 --destination-cidr-block 0.0.0.0/0 --nat-gateway-id $nat_3 > /dev/null 2>&1
aws ec2 associate-route-table --subnet-id $private_sub_1 --route-table-id $route_priv_1 > /dev/null 2>&1
aws ec2 associate-route-table --subnet-id $private_sub_2 --route-table-id $route_priv_2 > /dev/null 2>&1
aws ec2 associate-route-table --subnet-id $private_sub_3 --route-table-id $route_priv_3 > /dev/null 2>&1
# 4. VPC endpoints
s3_endpoint=$(aws ec2 create-vpc-endpoint --vpc-id $vpc_id --service-name com.amazonaws.${region}.s3 | jq -r .VpcEndpoint.VpcEndpointId)
dynamo_endpoint=$(aws ec2 create-vpc-endpoint --vpc-id $vpc_id --service-name com.amazonaws.${region}.dynamodb | jq -r .VpcEndpoint.VpcEndpointId)
aws ec2 modify-vpc-endpoint --vpc-endpoint-id $s3_endpoint --add-route-table-ids $route_pub $route_priv_1 $route_priv_2 $route_priv_3 > /dev/null 2>&1
aws ec2 modify-vpc-endpoint --vpc-endpoint-id $dynamo_endpoint --add-route-table-ids $route_pub $route_priv_1 $route_priv_2 $route_priv_3 > /dev/null 2>&1
# 5. Security groups
knox_sg_id=$(aws ec2 create-security-group --description "AWS CDP Knox security group" --group-name "$prefix-knox-sg" --vpc-id $vpc_id | jq -r .GroupId)
aws ec2 create-tags --resources $knox_sg_id --tags Key=Name,Value="$prefix-knox-sg" > /dev/null 2>&1
aws ec2 authorize-security-group-ingress --group-id $knox_sg_id --protocol tcp --port 22 --cidr $sg_cidr
aws ec2 authorize-security-group-ingress --group-id $knox_sg_id --protocol tcp --port 0-65535 --cidr 10.10.0.0/16 > /dev/null 2>&1
aws ec2 authorize-security-group-ingress --group-id $knox_sg_id --protocol udp --port 0-65535 --cidr 10.10.0.0/16 > /dev/null 2>&1
aws ec2 authorize-security-group-ingress --group-id $knox_sg_id --protocol tcp --port 0-65535 --cidr 10.10.224.0/19 > /dev/null 2>&1
aws ec2 authorize-security-group-ingress --group-id $knox_sg_id --protocol udp --port 0-65535 --cidr 10.10.224.0/19 > /dev/null 2>&1
default_sg_id=$(aws ec2 create-security-group --description "AWS default security group" --group-name "$prefix-default-sg" --vpc-id $vpc_id | jq -r .GroupId)
aws ec2 create-tags --resources $default_sg_id --tags Key=Name,Value="$prefix-default-sg"
aws ec2 authorize-security-group-ingress --group-id $default_sg_id --protocol tcp --port 0-65535 --cidr 10.10.0.0/16 > /dev/null 2>&1
aws ec2 authorize-security-group-ingress --group-id $default_sg_id --protocol udp --port 0-65535 --cidr 10.10.0.0/16 > /dev/null 2>&1
aws ec2 authorize-security-group-ingress --group-id $default_sg_id --protocol tcp --port 0-65535 --cidr 10.10.224.0/19 > /dev/null 2>&1
aws ec2 authorize-security-group-ingress --group-id $default_sg_id --protocol udp --port 0-65535 --cidr 10.10.224.0/19 > /dev/null 2>&1
echo "{\"VpcId\": \"$vpc_id\",
\"InternetGatewayId\": \"$igw_id\",
\"PublicSubnets\": [\"$public_sub_1\", \"$public_sub_2\", \"$public_sub_3\"],
\"PublicRouteTableId\": \"$route_pub\",
\"PublicNatGatewayIds\": [\"$nat_1\", \"$nat_2\", \"$nat_3\"],
\"PrivateSubnets\": [\"$private_sub_1\", \"$private_sub_2\", \"$private_sub_3\"],
\"PrivateRouteTableIds\": [\"$route_priv_1\", \"$route_priv_2\", \"$route_priv_3\"],
\"VPCEndpoints\": [\"$s3_endpoint\", \"$dynamo_endpoint\"],
\"KnoxGroupId\": \"$knox_sg_id\" ,
\"DefaultGroupId\": \"$default_sg_id\"}"
Azure
For Azure, we will need the following setup:
One vnet
3 private subnets
2 security groups
The following is a sample code:
#!/bin/bash
display_usage() {
echo "
Usage:
$(basename "$0") [--help or -h] <prefix> <sg_cidr>
Description:
Creates network assets for CDP env deployment
Arguments:
prefix: prefix of your assets
sg_cidr: CIDR to open in your security group
--help or -h: displays this help"
}
# check whether user had supplied -h or --help . If yes display usage
if [[ ( $1 == "--help") || $1 == "-h" ]]
then
display_usage
exit 0
fi
# Check the numbers of arguments
if [ $# -lt 2 ]
then
echo "Not enough arguments!" >&2
display_usage
exit 1
fi
if [ $# -gt 2 ]
then
echo "Too many arguments!" >&2
display_usage
exit 1
fi
prefix=$1
sg_cidr=$2
# 1. Vnets and subnets
az network vnet create -g $prefix-cdp-rg --name $prefix-cdp-vnet --address-prefix 10.10.0.0/16
az network vnet subnet create -g $prefix-cdp-rg --vnet-name $prefix-cdp-vnet -n $prefix-priv-subnet-1 --address-prefixes 10.10.160.0/19
az network vnet subnet create -g $prefix-cdp-rg --vnet-name $prefix-cdp-vnet -n $prefix-priv-subnet-2 --address-prefixes 10.10.192.0/19
az network vnet subnet create -g $prefix-cdp-rg --vnet-name $prefix-cdp-vnet -n $prefix-priv-subnet-3 --address-prefixes 10.10.224.0/19
az network vnet subnet update -n $prefix-priv-subnet-1 --vnet-name $prefix-cdp-vnet -g $prefix-cdp-rg --service-endpoints "Microsoft.Sql" "Microsoft.Storage"
az network vnet subnet update -n $prefix-priv-subnet-2 --vnet-name $prefix-cdp-vnet -g $prefix-cdp-rg --service-endpoints "Microsoft.Sql" "Microsoft.Storage"
az network vnet subnet update -n $prefix-priv-subnet-3 --vnet-name $prefix-cdp-vnet -g $prefix-cdp-rg --service-endpoints "Microsoft.Sql" "Microsoft.Storage"
# 2. NSG
az network nsg create -g $prefix-cdp-rg -n $prefix-knox-nsg
az network nsg create -g $prefix-cdp-rg -n $prefix-default-nsg
az network nsg rule create -g $prefix-cdp-rg --nsg-name $prefix-knox-nsg -n ssh_cidr --priority 102 --source-address-prefixes "$sg_cidr" --destination-address-prefixes '*' --destination-port-ranges 22 --direction Inbound --access Allow --protocol Tcp --description "Allow SSH to boxes from CIDR."
az network nsg rule create -g $prefix-cdp-rg --nsg-name $prefix-knox-nsg -n outbound --priority 107 --source-address-prefixes '*' --destination-address-prefixes '*' --destination-port-ranges '*' --direction Outbound --access Allow --protocol '*' --description "Allow outbound access."
az network nsg rule create -g $prefix-cdp-rg --nsg-name $prefix-default-nsg -n outbound --priority 107 --source-address-prefixes '*' --destination-address-prefixes '*' --destination-port-ranges '*' --direction Outbound --access Allow --protocol '*' --description "Allow outbound access."
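The CDP environment creation below needs the subnet names and NSG IDs this script creates; here is a sketch of collecting them with the az CLI (resource group and VNet names follow the prefix convention used above):
# Subnet names created above
az network vnet subnet list -g ${prefix}-cdp-rg --vnet-name ${prefix}-cdp-vnet --query "[].name" -o tsv
# NSG resource IDs for the security-access parameters
knox_nsg=$(az network nsg show -g ${prefix}-cdp-rg -n ${prefix}-knox-nsg --query id -o tsv)
default_nsg=$(az network nsg show -g ${prefix}-cdp-rg -n ${prefix}-default-nsg --query id -o tsv)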
CDP setup
Note: This setup uses the CDP CLI; you could also select these networks directly from the UI. It also assumes the creation of proper IAM and storage assets, which can also be automated using the scripts in my github.
AWS
cdp environments create-aws-environment --environment-name ${prefix}-cdp-env \
--credential-name ${credential} \
--region ${region} \
--security-access securityGroupIdForKnox="${knox_sg_id}",defaultSecurityGroupId="${default_sg_id}" \
--authentication publicKeyId="${key}" \
--log-storage storageLocationBase="${prefix}-cdp-bucket",instanceProfile="arn:aws:iam::$AWS_ACCOUNT_ID:instance-profile/${prefix}-log-role" \
--subnet-ids "${pub_sub_1}" "${pub_sub_2}" "${pub_sub_3}" "${priv_sub_1}" "${priv_sub_2}" "${priv_sub_3}" \
--vpc-id "${vpc}" \
--s3-guard-table-name ${prefix}-cdp-table \
--enable-tunnel \
--tags key="enddate",value="${END_DATE}" key="project",value="${PROJECT}"
Azure
cdp environments create-azure-environment --environment-name ${prefix}-cdp-env \
--credential-name ${credential} \
--region "${region}" \
--public-key "${key}" \
--security-access securityGroupIdForKnox="$knox_nsg",defaultSecurityGroupId="$default_nsg" \
--log-storage storageLocationBase="abfs://logs@${prefix}cdpsa.dfs.core.windows.net",managedIdentity="/subscriptions/${SUBSCRIPTION_ID}/resourceGroups/${prefix}-cdp-rg/providers/Microsoft.ManagedIdentity/userAssignedIdentities/loggerIdentity" \
--existing-network-params networkId="$network_id",resourceGroupName="$prefix-cdp-rg",subnetIds="$subnet_1","$subnet_2","$subnet_3" \
--tags key="enddate",value="${END_DATE}" key="project",value="${PROJECT}" \
--no-use-public-ip \
--enable-tunnel
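For either cloud, you can then poll the environment until it reports an available status; a sketch using the CDP CLI (the exact output structure may vary by CLI version):
cdp environments describe-environment --environment-name ${prefix}-cdp-env \
  | jq -r .environment.status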
05-21-2020
06:24 AM
1 Kudo
Continuing my series of how-to articles for CDP, today we explore how to connect to Impala via JDBC in Python. In my example, I will use a Jupyter notebook running in CML, but this can be generalized.
This process is actually fairly easy, so let's dive in.
Step 1: Setup Impala JDBC drivers
First, download the latest Impala JDBC drivers from Cloudera JDBC Driver 2.6.17 for Impala.
Then, upload them to your machine. Here is an example of a CML Jupyter sessions with the jars uploaded:
Finally, make sure that you set up your CLASSPATH properly by opening a terminal session and typing the following:
CLASSPATH=.:/home/cdsw/ImpalaJDBC4.jar:/home/cdsw/ImpalaJDBC41.jar:/home/cdsw/ImpalaJDBC42.jar
export CLASSPATH
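A quick terminal check can confirm the jars landed where the CLASSPATH expects them; a small sketch:
ls -l /home/cdsw/ImpalaJDBC*.jar
echo $CLASSPATH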
Step 2: Install JayDeBeApi
To install JayDeBeApi, run the following:
pip3 install JayDeBeApi
To avoid an error along the lines of "AttributeError: type object 'java.sql.Types' has no attribute '__javaclass__'", it is recommended to downgrade your jpype by running the following:
pip3 install --upgrade jpype1==0.6.3 --user
Restart your kernel when you perform the downgrade.
Step 3: Connect to Impala
Finally, connect to your Impala instance using the following sample code:
import jaydebeapi
conn = jaydebeapi.connect("com.cloudera.impala.jdbc.DataSource",
"jdbc:impala://[your_host]:443/;ssl=1;transportMode=http;httpPath=icml-data-mart/cdp-proxy-api/impala;AuthMech=3;",
{'UID': "[your_cdp_user]", 'PWD': "[your_workload_pwd]"},
'/home/cdsw/ImpalaJDBC41.jar')
curs = conn.cursor()
curs.execute("select * from default.locations")
curs.fetchall()
curs.close()
conn.close()
Note: You can get your Impala JDBC string either from the Datahub endpoint path or from the JDBC URL in CDW.
The following is a screenshot of my code in action:
05-11-2020
06:55 AM
Hi Abdul,
That's a very good question. To answer it, you need to broaden the horizon a bit from just Navigator and Atlas, and consider Atlas, Ranger, Sentry, and Navigator together. In short, all the functionality you have in CDH (using Navigator and Sentry) will be available to you in CDP (using Atlas and Ranger). Here is a high-level mapping of the different tools in our different platforms:
Function       Navigator (CDH)   Sentry (CDH)    Atlas (CDP)   Ranger (CDP)
Data Lineage   X                 -               X             -
Audits         X                 -               -             X
RBAC           -                 X               -             X
ABAC           -                 Not Supported   -             X
Of course, Atlas gives you a lot of extra functionality not available to you before, such as a more scalable back-end, open APIs for integration, and broad Apache community support. I recommend exploring our product pages to get a better understanding: https://www.cloudera.com/products/open-source/apache-hadoop/apache-atlas.html
05-08-2020
10:36 AM
2 Kudos
Here is a fun one: how do you connect from Python in Cloudera Machine Learning to a Kafka Datahub cluster? The documentation is pretty thorough, but it does not have an example of a Python client. That's what I'm going to highlight in this article.
The good news is that, since CML and Datahub run in the same network, you don't need to worry about opening the broker ports; you just need to follow these steps:
Step 1: Get and upload your FreeIPA certificate
Step 2: Find your broker hostnames
Step 3: Set up your client
Step 1: Get and upload your FreeIPA certificate
Go to your management console > your environment > Actions > Get FreeIPA Certificate:
Once downloaded, go to your CML workspace and upload your file (e.g. /home/cdsw/ca.crt).
Step 2: Find your broker hostnames
For this, go to your Kafka Datahub cluster > CM UI > Kafka > Instances; you can find the broker hosts here:
Step 3: Set up your client
Then, open a session in CML and use the following parameters:
from kafka import KafkaProducer
producer = KafkaProducer(bootstrap_servers=['<YOUR_BROKER_URL>:9093','<YOUR_BROKER_URL>:9093','<YOUR_BROKER_URL>:9093'],
security_protocol="SASL_SSL",
sasl_mechanism="PLAIN",
ssl_check_hostname=True,
ssl_cafile='/home/cdsw/ca.crt',
sasl_plain_username="<YOUR_WORKLOAD_USER>",
sasl_plain_password="<YOUR_WORKLOAD_PASSWORD>",
api_version_auto_timeout_ms=30000)
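If the producer fails to connect, a quick TLS handshake test against one broker from a CML terminal can help separate certificate issues from credential issues; a sketch with a placeholder broker host:
openssl s_client -connect <YOUR_BROKER_URL>:9093 -CAfile /home/cdsw/ca.crt </dev/null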
05-05-2020
06:24 PM
5 Kudos
Recently I came across an interesting problem: how to use boto to get data from a secure bucket in a Jupyter notebook in Cloudera Machine Learning. The missing piece was that I needed to get my code integrated with the AWS permissions given to me by IDBroker. Since CML had already authenticated me to Kerberos, all I needed was to get the goods from IDBroker. In this article, I will show you pseudo-code for getting these access keys both in bash and in Python.
Note: Special thanks to @Kevin Risden, to whom I owe this article and many more things.
Find your IDBroker URL
Regardless of the method, you will need to get the URL for your IDBroker host. This is done simply in the management console of your datalake. The following is an example:
Getting Access Keys in bash
After you are connected to one of your cluster's nodes and have run kinit, run the following:
IDBROKER_DT="$(curl -s --negotiate -u: "https://[IDBROKER_HOST]:8444/gateway/dt/knoxtoken/api/v1/token")"
IDBROKER_ACCESS_TOKEN="$(echo "$IDBROKER_DT" | python -c "import json,sys; print(json.load(sys.stdin)['access_token'])")"
IDBROKER_CREDENTIAL_OUTPUT="$(curl -s -H "Authorization: Bearer $IDBROKER_ACCESS_TOKEN" "https://[IDBROKER_HOST]:8444/gateway/aws-cab/cab/api/v1/credentials")"
The credentials can be found in the $IDBROKER_CREDENTIAL_OUTPUT variable.
Getting Access Keys in Python
Before getting started, install the following libraries:
pip3 install requests requests-kerberos boto3
Then, run the following code:
import requests
from requests_kerberos import HTTPKerberosAuth
r = requests.get("https://[IDBROKER_URL]:8444/gateway/dt/knoxtoken/api/v1/token", auth=HTTPKerberosAuth())
url = "https://[IDBROKER_URL]:8444/gateway/aws-cab/cab/api/v1/credentials"
headers = {
    'Authorization': "Bearer " + r.json()['access_token'],
    'cache-control': "no-cache"
}
response = requests.request("GET", url, headers=headers)
ACCESS_KEY = response.json()['Credentials']['AccessKeyId']
SECRET_KEY = response.json()['Credentials']['SecretAccessKey']
SESSION_TOKEN = response.json()['Credentials']['SessionToken']
import boto3
client = boto3.client(
    's3',
    aws_access_key_id=ACCESS_KEY,
    aws_secret_access_key=SECRET_KEY,
    aws_session_token=SESSION_TOKEN,
)
You can then access your buckets via the following:
data = client.get_object(Bucket='[YOUR_BUCKET]', Key='[FILE_PATH]')
contents = data['Body'].read()
Added on 2022-03-25
If your user is part of multiple groups with different IDBroker mappings, you might get the following error message: "Ambiguous group role mappings for the authenticated user." In this case, you need to adjust the following line in the code example to specify the group for which you would like to get the access credentials:
url = "https://[IDBROKER_URL]:8444/gateway/aws-cab/cab/api/v1/credentials/group/my_cdp_group"
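As a follow-up to the bash section above, the returned credentials can also be exported for use with the AWS CLI; a sketch that assumes jq is installed and reuses the $IDBROKER_CREDENTIAL_OUTPUT variable:
export AWS_ACCESS_KEY_ID="$(echo "$IDBROKER_CREDENTIAL_OUTPUT" | jq -r .Credentials.AccessKeyId)"
export AWS_SECRET_ACCESS_KEY="$(echo "$IDBROKER_CREDENTIAL_OUTPUT" | jq -r .Credentials.SecretAccessKey)"
export AWS_SESSION_TOKEN="$(echo "$IDBROKER_CREDENTIAL_OUTPUT" | jq -r .Credentials.SessionToken)"
aws s3 ls "s3://[YOUR_BUCKET]/"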