Member since: 07-10-2018 | Posts: 63 | Kudos Received: 71 | Solutions: 0
09-24-2020
07:03 AM
Hey, have you checked this article? https://community.cloudera.com/t5/Community-Articles/How-to-connect-to-CDP-Impala-from-python/ta-p/296405
08-31-2020
02:25 PM
2 Kudos
Cloudera Data Warehouse (CDW) delivers highly efficient compute isolation and the ability to rapidly scale data warehousing workloads up and down, leveraging the latest container and caching technologies.
One of the great features of this architecture is the ability to only bring compute on-demand, as illustrated by the figure below:
This default setup is the most cost-effective, as only a few shared-services nodes (small nodes running services like UIs, Viz, ZooKeeper, etc.) are long-lived. Each Virtual Warehouse has a set of nodes that run only when compute is needed (i.e., a new query on a non-cached dataset).
The caveat to this approach is that, on a completely cold warehouse, the warm-up time from zero to available compute is a minute or two.
An alternative to this default architecture is to leverage compute-reserved nodes, which are shared between Virtual Warehouses, as depicted below:
With this architecture, a pool of reserved nodes enables the immediate availability of compute across Virtual Warehouses. In this article, I will showcase how to set up reserved instances in CDW.
Note: This article is a high-level tutorial. It is not my intent to detail how reserved nodes are shared across warehouses, or to recommend generic sizing; the number of instances and the Virtual Warehouse behavior will depend on your implementation.
Step 1: Get your Kubeconfig
In CDW, go to your environment, click on the 3 dots on the environment box > Show Kubeconfig:
Grant your ARN access to the environment, and copy/download the kubeconfig (see this article for more details).
Step 2: Connect to your cluster
$ export KUBECONFIG=[path_to_your_kubeconfig]
$ kubectl get deployments -n cluster
NAME                                READY   UP-TO-DATE   AVAILABLE   AGE
ardent-ferret-efs-provisioner       2/2     2            2           4h46m
compute-reserved-node               0/0     0            0           4h46m
crusty-abalone-cluster-autoscaler   1/1     1            1           4h46m
nginx-default-backend               1/1     1            1           4h46m
nginx-service                       3/3     3            3           4h46m
shared-services-reserved-node       0/0     0            0           4h46m
Step 3: Modify the replicas of compute reserved nodes
kubectl edit deployment compute-reserved-node -n cluster
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
  creationTimestamp: "2020-08-31T16:28:52Z"
  generation: 1
  labels:
    app.kubernetes.io/instance: trendy-mastiff
    app.kubernetes.io/managed-by: Tiller
    app.kubernetes.io/name: cluster-overprovisioner
    cluster-overprovisioner-name: compute-reserved-node
    helm.sh/chart: cluster-overprovisioner-0.2.5
  name: compute-reserved-node
  namespace: cluster
  resourceVersion: "3476"
  selfLink: /apis/extensions/v1beta1/namespaces/cluster/deployments/compute-reserved-node
  uid: a5cb9ea1-729a-4665-9734-94c2f669984f
spec:
  progressDeadlineSeconds: 600
  replicas: 3
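If you would rather not edit the deployment interactively, a kubectl scale command should achieve the same change. This is a minimal sketch assuming the deployment and namespace names from the listing above, and a replica count of 3 that you should adjust to your own sizing:
# Scale the reserved-node deployment without opening an editor (adjust replicas to your needs)
kubectl scale deployment compute-reserved-node --replicas=3 -n cluster
# Confirm the desired replica count was recorded
kubectl get deployment compute-reserved-node -n cluster -o jsonpath='{.spec.replicas}{"\n"}'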
Step 4: Verify your config
After a few minutes, you should see your configuration being applied:
$ kubectl get deployments -n cluster
NAME                                READY   UP-TO-DATE   AVAILABLE   AGE
ardent-ferret-efs-provisioner       2/2     2            2           4h54m
compute-reserved-node               3/3     3            3           4h54m
crusty-abalone-cluster-autoscaler   1/1     1            1           4h54m
nginx-default-backend               1/1     1            1           4h54m
nginx-service                       3/3     3            3           4h54m
shared-services-reserved-node       0/0     0            0           4h54m
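To watch the reserved pods (and the nodes the autoscaler brings up to host them) come online, something like the following can help; the label selector is taken from the deployment metadata above and may differ in your cluster:
# Reserved placeholder pods created by the overprovisioner deployment
kubectl get pods -n cluster -l cluster-overprovisioner-name=compute-reserved-node -o wide
# Nodes joining the cluster as the autoscaler reacts
kubectl get nodes --watch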
08-19-2020
01:56 PM
1 Kudo
Inspired by @sunile_manjee's article How to use K9s to fetch metrics and logs for Cloudera Data Warehouse Experience, I decided to create the same tutorial for Cloudera Data Engineering.
The process is very similar, as you can see below.
Pre-Requisites
CDE environment setup
K9s installed on your machine (instructions here)
AWS user ARN
AWS CLI configured to point to your AWS environment (instructions here); a quick verification sketch follows this list
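Before starting, a quick sanity check of the prerequisites can save time. This is a sketch assuming the AWS CLI and k9s are already on your PATH:
# Prints the user ARN you will need in Step 1
aws sts get-caller-identity --query Arn --output text
# Confirms k9s is installed and shows which version you are running
k9s version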
Step 1: Add your AWS user to the CDE environment
In your AWS account console, go to IAM > Users > Search for your user name > copy your ARN:
In the CDE main page, go to environment details:
Then, go to access and add your ARN:
Step 2: Download the kubeconfig
In the CDE main page, click on the 3 dots, then Download Kube Config:
Step 3: Run K9s
You can now run k9s:
k9s --kubeconfig ~/Downloads/kubeconfig
Note: The path to your downloaded kubeconfig is, of course, subject to change.
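The same kubeconfig also works with plain kubectl, which can be handy for one-off commands outside of the k9s UI; a small sketch assuming the default download path used above:
export KUBECONFIG=~/Downloads/kubeconfig
kubectl get pods --all-namespaces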
08-10-2020
02:14 PM
2 Kudos
Moving data from your local machine to the cloud has never been easier, using the NiFi site-to-site protocol and CDP Datahub. In this article, I will focus on how to set up site-to-site communication between your local machine and CDP Public Cloud, without using the default Knox CDP proxy.
This configuration assumes that you already have a local instance of NiFi (or MiNiFi) and a CDP Datahub Cluster running NiFi. If you want to learn how to use CDP Public Cloud, please visit our overview page and documentation.
This setup will be executed in 4 steps:
Step 1: Open CDP to your local IP
Step 2: Download and configure stores on your local machine
Step 3: Configure a simple site-to-site flow
Step 4: Authorize this flow in Ranger
Step 1: Open CDP to your local IP
Go to your CDP Management Console, and find your datahub (here pvn-nifi).
At the bottom of the datahub page, click on Hardware and locate one of the instances running NiFi:
Click on the instance and you will be redirected to your cloud provider (here, AWS):
At the bottom of the screen, click on the security group associated with your instance, and you will be redirected to that security group config page:
Click on Edit inbound rules and add a rule opening TCP port 8443 to your local IP:
Save these changes.
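If you prefer the AWS CLI over the console for this step, the rule can be added with authorize-security-group-ingress. This is a sketch with a hypothetical security group ID, and it derives your public IP from an external service:
# sg-0123456789abcdef0 is a placeholder; use the security group ID found above
my_ip=$(curl -s https://checkip.amazonaws.com)
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp --port 8443 \
  --cidr ${my_ip}/32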
Step 2: Download and configure stores on your local machine
Connect to one of the NiFi machines with the Cloudbreak user and the key you used at deployment:
$ ssh -i [path_to_private_key] cloudbreak@[your_nifi_host]
Copy and authorize the key and trust stores:
$ sudo su
$ cp /var/lib/cloudera-scm-agent/agent-cert/cm-auto-host_keystore.jks /tmp
$ cp /var/lib/cloudera-scm-agent/agent-cert/cm-auto-global_truststore.jks /tmp
$ chmod a+rw /tmp/cm-auto-host_keystore.jks
$ chmod a+rw /tmp/cm-auto-global_truststore.jks
Disconnect from the remote machine and copy these stores:
$ cd ~/Desktop
$ scp -i [path_to_private_key] cloudbreak@[your_nifi_host]:/tmp/cm-auto-host_keystore.jks cm-auto-host_keystore.jks
$ scp -i [path_to_private_key] cloudbreak@[your_nifi_host]:/tmp/cm-auto-global_truststore.jks cm-auto-global_truststore.jks
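Optionally, you can confirm the stores copied over intact by listing their entries with keytool; a sketch that assumes a local JDK and the store passwords mentioned in the note below:
keytool -list -keystore ~/Desktop/cm-auto-host_keystore.jks -storepass [keystore_pw]
keytool -list -keystore ~/Desktop/cm-auto-global_truststore.jks -storepass [truststore_pw]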
Configure your local NiFi with these stores by modifying your nifi.properties:
nifi.security.keystore=/Users/pvidal/Desktop/cm-auto-host_keystore.jks
nifi.security.keystoreType=JKS
nifi.security.keystorePasswd=[keystore_pw]
nifi.security.keyPasswd=[keystore_pw]
nifi.security.truststore=/Users/pvidal/Desktop/cm-auto-global_truststore.jks
nifi.security.truststoreType=JKS
nifi.security.truststorePasswd=[truststore_pw]
Note: To find out the passwords for these stores, please contact your Cloudera team.
Restart your local NiFi instance:
nifi restart
Step 3: Configure a simple site-to-site flow
Local instance
Create a process group to host your flow (here called S2S Cloud):
In this process group, create a remote process group and configure it with the address of one of your cloud NiFi instances, using the HTTP protocol:
Create a simple GenerateFlowFile processor and connect it to the remote process group:
Note: Without configuring Ranger, you will get a Forbidden warning (see Step 4).
CDP Public Instance
Create a process group to host your flow (here called Receive from on prem):
In this process group, create an input port accepting remote connections:
Finally, create a flow that takes the data and logs it:
Start your flow.
Step 4: Authorize this flow in Ranger
From the Cloudera Management console, go to Ranger and your NiFi service:
From the list of policies, create a new policy (here called s2s) that will allow access to your specific process group and the site-to-site protocol (Ranger does auto completion):
Save this policy, and go back to your local machine; you can now enable the remote process group and start sending files!
Example of successful flows
Local Flow
CDP Public Flow
06-04-2020
11:38 AM
1 Kudo
Cloudera Machine Learning (and Cloudera Data Science Workbench) is built on a very robust and flexible framework to ease integration with third parties. In this article, I decided to explore the integration of Teradata with CML via ODBC.
A few notes before I dive in:
I could have easily used JDBC via JayDeBeApi (see my previous article), but where is the fun in that?
This article can be generalized to other ODBC connections, provided the proper parameters are set up
To get a fully rounded tutorial, I will go through these 3 steps:
Step 1 (optional if you already have a Teradata instance): Setting up Teradata Vantage in AWS
Step 2: Create a custom image with Teradata driver
Step 3: Configure and run python to Teradata ODBC
Step 1: Setting up Teradata Vantage in AWS
If you want to do development testing of Teradata, you can use Teradata Vantage Developer on the AWS Market Place. I am not going to dive too much into this as this configuration may change as Teradata evolves.
Pre-Requisites
The Teradata Vantage Developer offering is an AWS CloudFormation template that requires the following assets to be pre-created (a CLI sketch for creating them follows this list):
An Internet Gateway
A VPC associated with this IGW (and DNS / DNS Hostnames support)
A subnet (with a route to the IGW and Public IP creation on launch)
A placement group
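For reference, here is a minimal sketch of creating those prerequisites with the AWS CLI; the CIDRs, names, and placement group strategy are hypothetical and should be adapted to what the CloudFormation template expects:
# Hypothetical names and CIDRs; adapt to your environment
vpc_id=$(aws ec2 create-vpc --cidr-block 10.20.0.0/16 | jq -r .Vpc.VpcId)
aws ec2 modify-vpc-attribute --vpc-id $vpc_id --enable-dns-support '{"Value":true}'
aws ec2 modify-vpc-attribute --vpc-id $vpc_id --enable-dns-hostnames '{"Value":true}'
# Internet gateway attached to the VPC
igw_id=$(aws ec2 create-internet-gateway | jq -r .InternetGateway.InternetGatewayId)
aws ec2 attach-internet-gateway --internet-gateway-id $igw_id --vpc-id $vpc_id
# Subnet with a route to the IGW and public IPs on launch
subnet_id=$(aws ec2 create-subnet --vpc-id $vpc_id --cidr-block 10.20.0.0/24 | jq -r .Subnet.SubnetId)
aws ec2 modify-subnet-attribute --subnet-id $subnet_id --map-public-ip-on-launch
rt_id=$(aws ec2 create-route-table --vpc-id $vpc_id | jq -r .RouteTable.RouteTableId)
aws ec2 create-route --route-table-id $rt_id --destination-cidr-block 0.0.0.0/0 --gateway-id $igw_id
aws ec2 associate-route-table --subnet-id $subnet_id --route-table-id $rt_id
# Placement group
aws ec2 create-placement-group --group-name td-vantage-pg --strategy cluster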
Setup
Once the assets are created, go to Teradata Vantage Developer and select your configuration (make sure you select the same region as the assets you created):
From this configuration, launch the CloudFormation template and fill the parameters with the AWS assets you created:
The template will create two nodes by default. We will connect to one of them to create a test user:
$ ssh ec2-user@[your_public_hostname]
ec2-user@SMP001-01:~> bteq
Teradata BTEQ 16.20.00.01 for LINUX. PID: 17205
Copyright 1984-2017, Teradata Corporation. ALL RIGHTS RESERVED.
Enter your logon or BTEQ command:
.LOGON
UserId: dbc
Password: [password_set_in_cf_template]
CREATE user test AS password=test perm = 200000000, SPOOL = 100000000;
You can then connect with your favorite SQL editor to test the connection and create a test table:
CREATE TABLE "test".TEST_TABLE (
COLUMN1 VARCHAR(100),
COLUMN2 VARCHAR(100)
) ;
Step 2: Create a custom image with Teradata driver
I created the following image by following the steps in the documentation on custom engines:
#Dockerfile
FROM docker.repository.cloudera.com/cdsw/engine:11-cml-2020.04-1
WORKDIR /tmp
#The RUN commands that install an editor
#For example: RUN apt-get install myeditor
RUN apt-get autoremove unixodbc -y
RUN apt-get update -y
RUN apt-get install lib32stdc++6 -y
RUN apt-get install wget -y
RUN wget [LOCATION_OF_DRIVERS]
RUN tar -xzvf tdodbc1620__ubuntu_indep.16.20.00.87-1.tar.gz
RUN dpkg -i tdodbc1620/tdodbc1620-16.20.00.87-1.noarch.deb
RUN apt-get install -y python-pyodbc
A few notes on this image (a build-and-push sketch follows these notes):
I removed unixodbc because I read it was causing issues with pyodbc, but this step may not be required
You can find a built image on my dockerhub here
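If you want to build and publish the image yourself rather than pull mine, a sketch along these lines should work from the directory containing the Dockerfile (the repository name and tag are placeholders):
docker build -t [your_dockerhub_user]/cml-teradata-odbc:16.20 .
docker push [your_dockerhub_user]/cml-teradata-odbc:16.20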
Finally, connect to CML and add this new engine:
Step 3: Configure and run python to Teradata ODBC
Go to your workbench in a new project, and create a session with the engine you created.
Then run the following:
pip3 install pyodbc
After it is installed, open your odbc.ini file:
vi /home/cdsw/.odbc.ini
Configure your file as follows (a quick driver sanity check follows this config):
[ODBC Data Sources]
Teradata ODBC DSN = Teradata ODBC Driver 16.20
TEST = My DSN Description
[TEST]
Driver = /opt/teradata/client/16.20/odbc_64/lib/tdataodbc_sb64.so
Description = Teradata database
DBCName = [your_teradata_hostname]
UID = test
PWD = test
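Before running the Python test, a quick check from a terminal session can confirm the driver and pyodbc are in place; a sketch assuming the driver path configured above:
# The driver library installed by the Teradata ODBC package
ls -l /opt/teradata/client/16.20/odbc_64/lib/tdataodbc_sb64.so
# Confirms pyodbc is importable in the session's Python
python3 -c "import pyodbc; print(pyodbc.version)"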
Finally, you can run the following code to test the connection:
import pyodbc
import pandas as pd
conn = pyodbc.connect('DSN=TEST')
# Define Cursor
cus=conn.cursor()
query = "select * from test.TEST_TABLE;"
# Execute SQL statement and store the result in the cursor
cus.execute(query)
# Display the content of cursor
row = cus.fetchone()
print(row)
# Use Pandas to execute and retrieve results
df = pd.read_sql(query, conn)
print(df)
The output in the workbench should look something like the following:
05-29-2020
08:05 AM
2 Kudos
Cloudera Data Platform recently introduced Cluster Connectivity Manager (CCM) in tech preview, a feature with which CDP can communicate with customer workload clusters without requiring inbound network connections to them.
In this article, I will highlight how to automate the setup of CCM for both AWS and Azure including:
Cloud Network Setup
CDP Environment setup
Before you continue reading:
You can find all automation on my github (here)
Please refer to the official documentation for more details (here)
This is just an example, different deployments require a different set of parameters!
Overview
As highlighted in the above image, CCM uses reverse tunneling to communicate back to your CDP dedicated control plane.
A few things to note:
Each tenant (i.e. customer) has its own dedicated tunnel using a dedicated Apache MINA server
The diagram above describes a setup in AWS. The Azure setup is very similar but does not require public subnets or NAT gateways.
Network Setup
Note: For both setups, you will notice that port 22 is open inbound for a specific CIDR. This is to enable SSH to the different boxes for CDP, and it can be restricted to a CIDR within the VPC.
AWS
For AWS, we will need the following setup:
One VPC
3 public subnets
1 internet gateway
1 public route table routing the public subnets to the IGW
3 private subnets
3 NAT gateways and private route tables (one per private subnet)
2 security groups
The following is a sample code:
#!/bin/bash
display_usage() {
echo "
Usage:
$(basename "$0") [--help or -h] <prefix> <region> <sg_cidr>
Description:
Creates network assets for CDP env deployment
Arguments:
prefix: prefix of your assets
region: AWS region
sg_cidr: CIDR to open in your security group
--help or -h: displays this help"
}
# check whether user had supplied -h or --help . If yes display usage
if [[ ( $1 == "--help") || $1 == "-h" ]]
then
display_usage
exit 0
fi
# Check the numbers of arguments
if [ $# -lt 3 ]
then
echo "Not enough arguments!" >&2
display_usage
exit 1
fi
if [ $# -gt 3 ]
then
echo "Too many arguments!" >&2
display_usage
exit 1
fi
prefix=$1
region=$2
sg_cidr=$3
# 1. Creating VPC
vpc_id=$(aws ec2 create-vpc --cidr-block 10.10.0.0/16 | jq -r .Vpc.VpcId)
aws ec2 create-tags --resources $vpc_id --tags Key=Name,Value="$prefix-cdp-vpc" > /dev/null 2>&1
# 2. Creating public subnets
# 2.1. Subnets
public_sub_1=$(aws ec2 create-subnet --vpc-id $vpc_id --cidr-block 10.10.0.0/24 --availability-zone "$region"a | jq -r .Subnet.SubnetId)
public_sub_2=$(aws ec2 create-subnet --vpc-id $vpc_id --cidr-block 10.10.1.0/24 --availability-zone "$region"b | jq -r .Subnet.SubnetId)
public_sub_3=$(aws ec2 create-subnet --vpc-id $vpc_id --cidr-block 10.10.2.0/24 --availability-zone "$region"c | jq -r .Subnet.SubnetId)
aws ec2 create-tags --resources $public_sub_1 --tags Key=Name,Value="$prefix-pub-subnet-1" > /dev/null 2>&1
aws ec2 create-tags --resources $public_sub_2 --tags Key=Name,Value="$prefix-pub-subnet-2" > /dev/null 2>&1
aws ec2 create-tags --resources $public_sub_3 --tags Key=Name,Value="$prefix-pub-subnet-3" > /dev/null 2>&1
# 2.2. Internet gateway
igw_id=$(aws ec2 create-internet-gateway | jq -r .InternetGateway.InternetGatewayId)
aws ec2 create-tags --resources $igw_id --tags Key=Name,Value="$prefix-igw"
aws ec2 attach-internet-gateway --internet-gateway-id $igw_id --vpc-id $vpc_id > /dev/null 2>&1
aws ec2 modify-vpc-attribute --enable-dns-support "{\"Value\":true}" --vpc-id $vpc_id > /dev/null 2>&1
aws ec2 modify-vpc-attribute --enable-dns-hostnames "{\"Value\":true}" --vpc-id $vpc_id > /dev/null 2>&1
# 2.3. Route
route_pub=$(aws ec2 create-route-table --vpc-id $vpc_id | jq -r .RouteTable.RouteTableId)
aws ec2 create-tags --resources $route_pub --tags Key=Name,Value="$prefix-pub-route" > /dev/null 2>&1
aws ec2 create-route --route-table-id $route_pub --destination-cidr-block 0.0.0.0/0 --gateway-id $igw_id > /dev/null 2>&1
aws ec2 associate-route-table --subnet-id $public_sub_1 --route-table-id $route_pub > /dev/null 2>&1
aws ec2 associate-route-table --subnet-id $public_sub_2 --route-table-id $route_pub > /dev/null 2>&1
aws ec2 associate-route-table --subnet-id $public_sub_3 --route-table-id $route_pub > /dev/null 2>&1
# 3. Creating private subnets
# 3.1. Subnets
private_sub_1=$(aws ec2 create-subnet --vpc-id $vpc_id --cidr-block 10.10.160.0/19 --availability-zone "$region"a | jq -r .Subnet.SubnetId)
private_sub_2=$(aws ec2 create-subnet --vpc-id $vpc_id --cidr-block 10.10.192.0/19 --availability-zone "$region"b | jq -r .Subnet.SubnetId)
private_sub_3=$(aws ec2 create-subnet --vpc-id $vpc_id --cidr-block 10.10.224.0/19 --availability-zone "$region"c | jq -r .Subnet.SubnetId)
aws ec2 create-tags --resources $private_sub_1 --tags Key=Name,Value="$prefix-priv-subnet-1" > /dev/null 2>&1
aws ec2 create-tags --resources $private_sub_2 --tags Key=Name,Value="$prefix-priv-subnet-2" > /dev/null 2>&1
aws ec2 create-tags --resources $private_sub_3 --tags Key=Name,Value="$prefix-priv-subnet-3" > /dev/null 2>&1
# 3.2. NAT gateways
alloc_id_1=$(aws ec2 allocate-address --domain vpc | jq -r .AllocationId)
alloc_id_2=$(aws ec2 allocate-address --domain vpc | jq -r .AllocationId)
alloc_id_3=$(aws ec2 allocate-address --domain vpc | jq -r .AllocationId)
nat_1=$(aws ec2 create-nat-gateway --subnet-id $public_sub_1 --allocation-id $alloc_id_1 | jq -r .NatGateway.NatGatewayId)
sleep 30
nat_2=$(aws ec2 create-nat-gateway --subnet-id $public_sub_2 --allocation-id $alloc_id_2 | jq -r .NatGateway.NatGatewayId)
sleep 30
nat_3=$(aws ec2 create-nat-gateway --subnet-id $public_sub_3 --allocation-id $alloc_id_3 | jq -r .NatGateway.NatGatewayId)
sleep 30
# 3.3. Routes
route_priv_1=$(aws ec2 create-route-table --vpc-id $vpc_id | jq -r .RouteTable.RouteTableId)
route_priv_2=$(aws ec2 create-route-table --vpc-id $vpc_id | jq -r .RouteTable.RouteTableId)
route_priv_3=$(aws ec2 create-route-table --vpc-id $vpc_id | jq -r .RouteTable.RouteTableId)
aws ec2 create-tags --resources $route_priv_1 --tags Key=Name,Value="$prefix-priv-route-1" > /dev/null 2>&1
aws ec2 create-tags --resources $route_priv_2 --tags Key=Name,Value="$prefix-priv-route-2" > /dev/null 2>&1
aws ec2 create-tags --resources $route_priv_3 --tags Key=Name,Value="$prefix-priv-route-3" > /dev/null 2>&1
aws ec2 create-route --route-table-id $route_priv_1 --destination-cidr-block 0.0.0.0/0 --nat-gateway-id $nat_1 > /dev/null 2>&1
aws ec2 create-route --route-table-id $route_priv_2 --destination-cidr-block 0.0.0.0/0 --nat-gateway-id $nat_2 > /dev/null 2>&1
aws ec2 create-route --route-table-id $route_priv_3 --destination-cidr-block 0.0.0.0/0 --nat-gateway-id $nat_3 > /dev/null 2>&1
aws ec2 associate-route-table --subnet-id $private_sub_1 --route-table-id $route_priv_1 > /dev/null 2>&1
aws ec2 associate-route-table --subnet-id $private_sub_2 --route-table-id $route_priv_2 > /dev/null 2>&1
aws ec2 associate-route-table --subnet-id $private_sub_3 --route-table-id $route_priv_3 > /dev/null 2>&1
# 4. VPC endpoints
s3_endpoint=$(aws ec2 create-vpc-endpoint --vpc-id $vpc_id --service-name com.amazonaws.${region}.s3 | jq -r .VpcEndpoint.VpcEndpointId)
dynamo_endpoint=$(aws ec2 create-vpc-endpoint --vpc-id $vpc_id --service-name com.amazonaws.${region}.dynamodb | jq -r .VpcEndpoint.VpcEndpointId)
aws ec2 modify-vpc-endpoint --vpc-endpoint-id $s3_endpoint --add-route-table-ids $route_pub $route_priv_1 $route_priv_2 $route_priv_3 > /dev/null 2>&1
aws ec2 modify-vpc-endpoint --vpc-endpoint-id $dynamo_endpoint --add-route-table-ids $route_pub $route_priv_1 $route_priv_2 $route_priv_3 > /dev/null 2>&1
# 5. Security groups
knox_sg_id=$(aws ec2 create-security-group --description "AWS CDP Knox security group" --group-name "$prefix-knox-sg" --vpc-id $vpc_id | jq -r .GroupId)
aws ec2 create-tags --resources $knox_sg_id --tags Key=Name,Value="$prefix-knox-sg" > /dev/null 2>&1
aws ec2 authorize-security-group-ingress --group-id $knox_sg_id --protocol tcp --port 22 --cidr $sg_cidr
aws ec2 authorize-security-group-ingress --group-id $knox_sg_id --protocol tcp --port 0-65535 --cidr 10.10.0.0/16 > /dev/null 2>&1
aws ec2 authorize-security-group-ingress --group-id $knox_sg_id --protocol udp --port 0-65535 --cidr 10.10.0.0/16 > /dev/null 2>&1
aws ec2 authorize-security-group-ingress --group-id $knox_sg_id --protocol tcp --port 0-65535 --cidr 10.10.224.0/19 > /dev/null 2>&1
aws ec2 authorize-security-group-ingress --group-id $knox_sg_id --protocol udp --port 0-65535 --cidr 10.10.224.0/19 > /dev/null 2>&1
default_sg_id=$(aws ec2 create-security-group --description "AWS default security group" --group-name "$prefix-default-sg" --vpc-id $vpc_id | jq -r .GroupId)
aws ec2 create-tags --resources $default_sg_id --tags Key=Name,Value="$prefix-default-sg"
aws ec2 authorize-security-group-ingress --group-id $default_sg_id --protocol tcp --port 0-65535 --cidr 10.10.0.0/16 > /dev/null 2>&1
aws ec2 authorize-security-group-ingress --group-id $default_sg_id --protocol udp --port 0-65535 --cidr 10.10.0.0/16 > /dev/null 2>&1
aws ec2 authorize-security-group-ingress --group-id $default_sg_id --protocol tcp --port 0-65535 --cidr 10.10.224.0/19 > /dev/null 2>&1
aws ec2 authorize-security-group-ingress --group-id $default_sg_id --protocol udp --port 0-65535 --cidr 10.10.224.0/19 > /dev/null 2>&1
echo "{\"VpcId\": \"$vpc_id\",
\"InternetGatewayId\": \"$igw_id\",
\"PublicSubnets\": [\"$public_sub_1\", \"$public_sub_2\", \"$public_sub_3\"],
\"PublicRouteTableId\": \"$route_pub\",
\"PublicNatGatewayIds\": [\"$nat_1\", \"$nat_2\", \"$nat_3\"],
\"PrivateSubnets\": [\"$private_sub_1\", \"$private_sub_2\", \"$private_sub_3\"],
\"PrivateRouteTableIds\": [\"$route_priv_1\", \"$route_priv_2\", \"$route_priv_3\"],
\"VPCEndpoints\": [\"$s3_endpoint\", \"$dynamo_endpoint\"],
\"KnoxGroupId\": \"$knox_sg_id\" ,
\"DefaultGroupId\": \"$default_sg_id\"}"
Azure
For Azure, we will need the following setup:
One vnet
3 private subnets
2 security groups
The following is a sample code:
#!/bin/bash
display_usage() {
echo "
Usage:
$(basename "$0") [--help or -h] <prefix> <sg_cidr>
Description:
Creates network assets for CDP env deployment
Arguments:
prefix: prefix of your assets
sg_cidr: CIDR to open in your security group
--help or -h: displays this help"
}
# check whether user had supplied -h or --help . If yes display usage
if [[ ( $1 == "--help") || $1 == "-h" ]]
then
display_usage
exit 0
fi
# Check the numbers of arguments
if [ $# -lt 2 ]
then
echo "Not enough arguments!" >&2
display_usage
exit 1
fi
if [ $# -gt 2 ]
then
echo "Too many arguments!" >&2
display_usage
exit 1
fi
prefix=$1
sg_cidr=$2
# 1. Vnets and subnets
az network vnet create -g $prefix-cdp-rg --name $prefix-cdp-vnet --address-prefix 10.10.0.0/16
az network vnet subnet create -g $prefix-cdp-rg --vnet-name $prefix-cdp-vnet -n $prefix-priv-subnet-1 --address-prefixes 10.10.160.0/19
az network vnet subnet create -g $prefix-cdp-rg --vnet-name $prefix-cdp-vnet -n $prefix-priv-subnet-2 --address-prefixes 10.10.192.0/19
az network vnet subnet create -g $prefix-cdp-rg --vnet-name $prefix-cdp-vnet -n $prefix-priv-subnet-3 --address-prefixes 10.10.224.0/19
az network vnet subnet update -n $prefix-priv-subnet-1 --vnet-name $prefix-cdp-vnet -g $prefix-cdp-rg --service-endpoints "Microsoft.Sql" "Microsoft.Storage"
az network vnet subnet update -n $prefix-priv-subnet-2 --vnet-name $prefix-cdp-vnet -g $prefix-cdp-rg --service-endpoints "Microsoft.Sql" "Microsoft.Storage"
az network vnet subnet update -n $prefix-priv-subnet-3 --vnet-name $prefix-cdp-vnet -g $prefix-cdp-rg --service-endpoints "Microsoft.Sql" "Microsoft.Storage"
# 2. NSG
az network nsg create -g $prefix-cdp-rg -n $prefix-knox-nsg
az network nsg create -g $prefix-cdp-rg -n $prefix-default-nsg
az network nsg rule create -g $prefix-cdp-rg --nsg-name $prefix-knox-nsg -n ssh_cidr --priority 102 --source-address-prefixes "$sg_cidr" --destination-address-prefixes '*' --destination-port-ranges 22 --direction Inbound --access Allow --protocol Tcp --description "Allow SSH to boxes from CIDR."
az network nsg rule create -g $prefix-cdp-rg --nsg-name $prefix-knox-nsg -n outbound --priority 107 --source-address-prefixes '*' --destination-address-prefixes '*' --destination-port-ranges '*' --direction Outbound --access Allow --protocol '*' --description "Allow outbound access."
az network nsg rule create -g $prefix-cdp-rg --nsg-name $prefix-default-nsg -n outbound --priority 107 --source-address-prefixes '*' --destination-address-prefixes '*' --destination-port-ranges '*' --direction Outbound --access Allow --protocol '*' --description "Allow outbound access."
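The CDP environment creation below needs the subnet names and NSG IDs this script creates; here is a sketch of collecting them with the az CLI (resource group and VNet names follow the prefix convention used above):
# Subnet names created above
az network vnet subnet list -g ${prefix}-cdp-rg --vnet-name ${prefix}-cdp-vnet --query "[].name" -o tsv
# NSG resource IDs for the security-access parameters
knox_nsg=$(az network nsg show -g ${prefix}-cdp-rg -n ${prefix}-knox-nsg --query id -o tsv)
default_nsg=$(az network nsg show -g ${prefix}-cdp-rg -n ${prefix}-default-nsg --query id -o tsv)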
CDP setup
Note: This setup uses the CDP CLI; you could also select these networks directly from the UI. It also assumes the creation of proper IAM and storage assets, which can also be automated using the scripts in my github.
AWS
cdp environments create-aws-environment --environment-name ${prefix}-cdp-env \
--credential-name ${credential} \
--region ${region} \
--security-access securityGroupIdForKnox="${knox_sg_id}",defaultSecurityGroupId="${default_sg_id}" \
--authentication publicKeyId="${key}" \
--log-storage storageLocationBase="${prefix}-cdp-bucket",instanceProfile="arn:aws:iam::$AWS_ACCOUNT_ID:instance-profile/${prefix}-log-role" \
--subnet-ids "${pub_sub_1}" "${pub_sub_2}" "${pub_sub_3}" "${priv_sub_1}" "${priv_sub_2}" "${priv_sub_3}" \
--vpc-id "${vpc}" \
--s3-guard-table-name ${prefix}-cdp-table \
--enable-tunnel \
--tags key="enddate",value="${END_DATE}" key="project",value="${PROJECT}"
Azure
cdp environments create-azure-environment --environment-name ${prefix}-cdp-env \
--credential-name ${credential} \
--region "${region}" \
--public-key "${key}" \
--security-access securityGroupIdForKnox="$knox_nsg",defaultSecurityGroupId="$default_nsg" \
--log-storage storageLocationBase="abfs://logs@${prefix}cdpsa.dfs.core.windows.net",managedIdentity="/subscriptions/${SUBSCRIPTION_ID}/resourceGroups/${prefix}-cdp-rg/providers/Microsoft.ManagedIdentity/userAssignedIdentities/loggerIdentity" \
--existing-network-params networkId="$network_id",resourceGroupName="$prefix-cdp-rg",subnetIds="$subnet_1","$subnet_2","$subnet_3" \
--tags key="enddate",value="${END_DATE}" key="project",value="${PROJECT}" \
--no-use-public-ip \
--enable-tunnel
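For either cloud, you can then poll the environment until it reports an available status; a sketch using the CDP CLI (the exact output structure may vary by CLI version):
cdp environments describe-environment --environment-name ${prefix}-cdp-env \
  | jq -r .environment.status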
05-21-2020
06:24 AM
1 Kudo
Continuing my series of how-to articles for CDP, today we explore how to connect to Impala via JDBC in Python. In my example, I will use a Jupyter notebook running in CML, but this can be generalized.
This process is actually fairly easy, so let's dive in.
Step 1: Setup Impala JDBC drivers
First, download the latest Impala JDBC drivers from Cloudera JDBC Driver 2.6.17 for Impala.
Then, upload them to your machine. Here is an example of a CML Jupyter sessions with the jars uploaded:
Finally, make sure that you set up your CLASSPATH properly by opening a terminal session and typing the following:
CLASSPATH=.:/home/cdsw/ImpalaJDBC4.jar:/home/cdsw/ImpalaJDBC41.jar:/home/cdsw/ImpalaJDBC42.jar
export CLASSPATH
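A quick terminal check can confirm the jars landed where the CLASSPATH expects them; a small sketch:
ls -l /home/cdsw/ImpalaJDBC*.jar
echo $CLASSPATH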
Step 2: Install JayDeBeApi
To install JayDeBeApi, run the following:
pip3 install JayDeBeApi
To avoid an error along the lines of "AttributeError: type object 'java.sql.Types' has no attribute '__javaclass__'", it is recommended to downgrade your jpype by running the following:
pip3 install --upgrade jpype1==0.6.3 --user
Restart your kernel when you perform the downgrade.
Step 3: Connect to Impala
Finally, connect to your Impala instance using the following sample code:
import jaydebeapi
conn = jaydebeapi.connect("com.cloudera.impala.jdbc.DataSource",
"jdbc:impala://[your_host]:443/;ssl=1;transportMode=http;httpPath=icml-data-mart/cdp-proxy-api/impala;AuthMech=3;",
{'UID': "[your_cdp_user]", 'PWD': "[your_workload_pwd]"},
'/home/cdsw/ImpalaJDBC41.jar')
curs = conn.cursor()
curs.execute("select * from default.locations")
curs.fetchall()
curs.close()
conn.close()
Note: You can get your Impala JDBC string either from the Datahub endpoint path or from the JDBC URL in CDW.
The following is a screenshot of my code in action:
05-11-2020
06:55 AM
Hi Abdul,
That's a very good question. To answer it, you need to broaden the horizon a bit from just Navigator and Atlas, and consider Atlas, Ranger, Sentry, and Navigator together. In short, all the functionality you have in CDH (using Navigator and Sentry) will be available to you in CDP (using Atlas and Ranger). Here is a high-level mapping of the different tools in our different platforms:
Function       Navigator (CDH)   Sentry (CDH)    Atlas (CDP)   Ranger (CDP)
Data Lineage   X                 -               X             -
Audits         X                 -               -             X
RBAC           -                 X               -             X
ABAC           -                 Not Supported   -             X
Of course, Atlas gives you a lot of extra functionality not available to you before, such as a more scalable back-end, open APIs for integration, and broad Apache community support. I recommend exploring our product pages to get a better understanding: https://www.cloudera.com/products/open-source/apache-hadoop/apache-atlas.html
05-08-2020
10:36 AM
2 Kudos
Here is a fun one: how do you connect from Python in Cloudera Machine Learning to a Kafka Datahub cluster? The documentation is pretty thorough, but it does not have an example of a Python client. That's what I'm going to highlight in this article.
The good news is that, since CML and Datahub run in the same network, you don't need to worry about opening the broker ports; you just need to follow these steps:
Step 1: Get and upload your FreeIPA certificate
Step 2: Find your broker hostnames
Step 3: Set up your client
Step 1: Get and upload your FreeIPA certificate
Go to your management console > your environment > Actions > Get FreeIPA Certificate:
Once downloaded, go to your CML workspace and upload your file (e.g. /home/cdsw/ca.crt).
Step 2: Find your broker hostnames
For this, go to your Kafka Datahub cluster > CM UI > Kafka > Instances; you can find the broker hosts here:
Step 3: Set up your client
Then, open a session in CML and use the following parameters:
from kafka import KafkaProducer
producer = KafkaProducer(bootstrap_servers=['<YOUR_BROKER_URL>:9093','<YOUR_BROKER_URL>:9093','<YOUR_BROKER_URL>:9093'],
security_protocol="SASL_SSL",
sasl_mechanism="PLAIN",
ssl_check_hostname=True,
ssl_cafile='/home/cdsw/ca.crt',
sasl_plain_username="<YOUR_WORKLOAD_USER>",
sasl_plain_password="<YOUR_WORKLOAD_PASSWORD>",
api_version_auto_timeout_ms=30000)
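If the producer fails to connect, a quick TLS handshake test against one broker from a CML terminal can help separate certificate issues from credential issues; a sketch with a placeholder broker host:
openssl s_client -connect <YOUR_BROKER_URL>:9093 -CAfile /home/cdsw/ca.crt </dev/null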
05-05-2020
06:24 PM
5 Kudos
Recently I came across an interesting problem: how to use boto to get data from a secure bucket in a Jupyter notebook in Cloudera Machine Learning. The missing piece was that I needed to get my code integrated with the AWS permissions given to me by IDBroker. Since CML had already authenticated me to Kerberos, all I needed was to get the goods from IDBroker. In this article, I will show you pseudo-code for getting these access keys both in bash and in Python.
Note: Special thanks to @Kevin Risden, to whom I owe this article and many more things.
Find your IDBroker URL
Regardless of the method, you will need to get the URL for your IDBroker host. This is done simply in the management console of your datalake. The following is an example:
Getting Access Keys in bash
After you are connected to one of your cluster's nodes and have run kinit, run the following:
IDBROKER_DT="$(curl -s --negotiate -u: "https://[IDBROKER_HOST]:8444/gateway/dt/knoxtoken/api/v1/token")"
IDBROKER_ACCESS_TOKEN="$(echo "$IDBROKER_DT" | python -c "import json,sys; print(json.load(sys.stdin)['access_token'])")"
IDBROKER_CREDENTIAL_OUTPUT="$(curl -s -H "Authorization: Bearer $IDBROKER_ACCESS_TOKEN" "https://[IDBROKER_HOST]:8444/gateway/aws-cab/cab/api/v1/credentials")"
The credentials can be found in the $IDBROKER_CREDENTIAL_OUTPUT variable.
Getting Access Keys in Python
Before getting started, install the following libraries:
pip3 install requests requests-kerberos boto3
Then, run the following code:
import requests
from requests_kerberos import HTTPKerberosAuth
r = requests.get("https://[IDBROKER_URL]:8444/gateway/dt/knoxtoken/api/v1/token", auth=HTTPKerberosAuth())
url = "https://[IDBROKER_URL]:8444/gateway/aws-cab/cab/api/v1/credentials"
headers = {
    'Authorization': "Bearer " + r.json()['access_token'],
    'cache-control': "no-cache"
}
response = requests.request("GET", url, headers=headers)
ACCESS_KEY = response.json()['Credentials']['AccessKeyId']
SECRET_KEY = response.json()['Credentials']['SecretAccessKey']
SESSION_TOKEN = response.json()['Credentials']['SessionToken']
import boto3
client = boto3.client(
    's3',
    aws_access_key_id=ACCESS_KEY,
    aws_secret_access_key=SECRET_KEY,
    aws_session_token=SESSION_TOKEN,
)
You can then access your buckets via the following:
data = client.get_object(Bucket='[YOUR_BUCKET]', Key='[FILE_PATH]')
contents = data['Body'].read()
Added on 2022-03-25
If your user is part of multiple groups with different IDBroker mappings, you might get the following error message: "Ambiguous group role mappings for the authenticated user." In this case, you need to adjust the following line in the code example to specify the group for which you would like to get the access credentials:
url = "https://[IDBROKER_URL]:8444/gateway/aws-cab/cab/api/v1/credentials/group/my_cdp_group"
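As a follow-up to the bash section above, the returned credentials can also be exported for use with the AWS CLI; a sketch that assumes jq is installed and reuses the $IDBROKER_CREDENTIAL_OUTPUT variable:
export AWS_ACCESS_KEY_ID="$(echo "$IDBROKER_CREDENTIAL_OUTPUT" | jq -r .Credentials.AccessKeyId)"
export AWS_SECRET_ACCESS_KEY="$(echo "$IDBROKER_CREDENTIAL_OUTPUT" | jq -r .Credentials.SecretAccessKey)"
export AWS_SESSION_TOKEN="$(echo "$IDBROKER_CREDENTIAL_OUTPUT" | jq -r .Credentials.SessionToken)"
aws s3 ls "s3://[YOUR_BUCKET]/"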