07-15-2021
07:09 AM
1 Kudo
It may often come to pass that you need to utilize data that does not exist in your Environment's Data Lake. This article will cover 2 scenarios where you need to access data that is outside of your Data Lake (either in a Storage Account in the same resource group as your CDP Public Cloud or in a different resource group, albeit in the same subscription).
UPDATE: Scenarios 3 and 4 below cover RAZ!
Scenario 1 - Adding External Storage Account Access to your Data Access Identity (non RAZ)
In this scenario, you may want a single managed identity to be able to access both your Data Lake Storage Account as well as your External Storage Account. For this, I will choose to grant access to only a single container within the storage account (but, as you will see in the following scenario, you could grant access to the entire storage account - I'm showing both purely for illustration).
Find your DataAccessIdentity in the Azure Portal, and note its Object ID.
Find your External Storage Account and note its name, resource group, and the container name you wish to grant access to.
Pull up an Azure Cloud Shell and execute the following command (with the appropriate substitutions).
az role assignment create --assignee $DATAACCESS_OBJECTID --role 'ba92f5b4-2d11-453d-a403-e96b0029c9fe' --scope "/subscriptions/$SUBSCRIPTIONID/resourceGroups/$RESOURCEGROUPNAME/providers/Microsoft.Storage/storageAccounts/$STORAGEACCOUNTNAME/blobServices/default/containers/$CONTAINERNAME"
Note: 'ba92f5b4-2d11-453d-a403-e96b0029c9fe' is the GUID that maps to the built-in Azure role "Storage Blob Data Contributor", which allows edit access in the container.
Now, when you view the role assignments for this managed identity, you should see a new entry for the external storage account. (Note: it may take several minutes for this page to reflect your RBAC change.)
Anyone who has an IDBroker mapping to this MSI can now access this new container in the external storage account.
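If you prefer to verify from the CLI instead of the portal, a quick check along these lines (reusing the same placeholder variables as the command above) should list the new assignment:
az role assignment list --assignee $DATAACCESS_OBJECTID --scope "/subscriptions/$SUBSCRIPTIONID/resourceGroups/$RESOURCEGROUPNAME/providers/Microsoft.Storage/storageAccounts/$STORAGEACCOUNTNAME/blobServices/default/containers/$CONTAINERNAME" --output table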
Scenario 2 - Adding External Storage Account Access to a new Managed Identity (non RAZ)
In this scenario, we will create a brand new managed identity and provision access to the entire storage account.
Create a new MSI
Now, we'll use the Portal to add the role assignment. (Note: we used the Cloud Shell in Scenario 1 because the portal doesn't yet support scoping the role down to the container.) After a few minutes, you should see the role assignment appear.
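If you'd rather script Scenario 2 than click through the portal, here is a rough CLI sketch of the same steps. The identity name is made up for illustration, and the other variables are the same placeholders used in Scenario 1:
# Create a new user-assigned managed identity (name is illustrative)
az identity create --resource-group $RESOURCEGROUPNAME --name my-external-storage-msi
# Grab its object (principal) ID
MSI_OBJECTID=$(az identity show --resource-group $RESOURCEGROUPNAME --name my-external-storage-msi --query principalId --output tsv)
# Grant Storage Blob Data Contributor at the storage account scope (the whole account this time, not a single container)
# A brand new identity may need a short wait before the assignment succeeds
az role assignment create --assignee $MSI_OBJECTID --role 'Storage Blob Data Contributor' --scope "/subscriptions/$SUBSCRIPTIONID/resourceGroups/$RESOURCEGROUPNAME/providers/Microsoft.Storage/storageAccounts/$STORAGEACCOUNTNAME"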
You are now ready to map this MSI to users in CDP (via an IDBroker Mapping). Since this is a new managed identity, let's quickly review how to do that...
Head to the Properties blade for the new MSI and note the Resource ID
Now let's head over to the CDP Console. Head to your Environment and click Actions and Manage Access.
Click IDBroker Mappings and click Edit.
Click the plus sign to add a new mapping. Then start typing your user's name or Group and paste in the resource ID you noted in the Role field.
Click Save and Sync and you're done!
If you followed Scenario 1 and have the DataAccessIdentity mapped to your user, you should now be able to access both the data container in the Data Lake Storage Account and our new container in the External Storage Account.
If you followed Scenario 2 and have the new MSI mapped to your user, you should now be able to ONLY access the new container in the External Storage Account.
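As a quick sanity check from a node in your environment (assuming your user has synced and has a workload password), you can hit the external container with the hdfs CLI. The container and storage account names below are placeholders:
# List and write to the external container (names are illustrative)
hdfs dfs -ls abfs://mycontainer@myexternalstorageacct.dfs.core.windows.net/
hdfs dfs -put ./test.txt abfs://mycontainer@myexternalstorageacct.dfs.core.windows.net/test.txt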
Scenario 3 - Adding External Storage Account in the same Subscription (RAZ)
In Azure, this can be accomplished by adding the same two roles (Storage Blob Data Owner and Storage Blob Delegator) you added to your RAZ Managed Identity for your Datalake Storage Account to an external Storage Account.
Here is what your RAZ Managed Identity looks like for your minimal setup for CDP with RAZ:
Just add these same two roles to another Storage Account to allow RAZ/your CDP Environment to interact with another Storage Account:
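The screenshots walk through the portal; if you prefer the CLI, a hedged sketch of the same two role assignments looks like this (the object ID and storage account variables are placeholders):
# Grant the RAZ managed identity both roles on the external storage account
EXT_SCOPE="/subscriptions/$SUBSCRIPTIONID/resourceGroups/$RESOURCEGROUPNAME/providers/Microsoft.Storage/storageAccounts/$EXTERNALSTORAGEACCOUNT"
az role assignment create --assignee $RAZ_MSI_OBJECTID --role 'Storage Blob Data Owner' --scope "$EXT_SCOPE"
az role assignment create --assignee $RAZ_MSI_OBJECTID --role 'Storage Blob Delegator' --scope "$EXT_SCOPE"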
So that your Managed Identity Role Set looks like this:
Scenario 4 - Adding External Storage Account in a different Subscription (RAZ)
We follow the same procedure as above, but with a different scope (because we're integrating with a storage account in a different subscription).
We add Storage Blob Data Owner and Storage Blob Delegator on the Storage Account (in a different subscription)
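In CLI terms, the only change is that the scope now points at a storage account in the other subscription (all IDs below are placeholders):
OTHER_SCOPE="/subscriptions/$OTHER_SUBSCRIPTIONID/resourceGroups/$OTHER_RESOURCEGROUP/providers/Microsoft.Storage/storageAccounts/$OTHER_STORAGEACCOUNT"
az role assignment create --assignee $RAZ_MSI_OBJECTID --role 'Storage Blob Data Owner' --scope "$OTHER_SCOPE"
az role assignment create --assignee $RAZ_MSI_OBJECTID --role 'Storage Blob Delegator' --scope "$OTHER_SCOPE"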
So that the RAZ Managed Identity has this role set for the scope of our "other" subscription:
Based on Scenarios 3 and 4, we now can interact with a total of 3 storage accounts:
perro4 (our Datalake storage account)
perro4ext (our external storage account in the same subscription)
perroext (our external storage account in a different subscription)
DISCLAIMER: This article is contributed by an external user. The steps may not be verified by Cloudera and may not be applicable for all use cases and may be very specific to a particular distribution. Please follow with caution and at your own risk. If needed, raise a support case to get confirmation.
12-23-2020
11:30 AM
2 Kudos
Overview
CDP Public Cloud provides a mechanism for provisioning all of your resources in private subnets (i.e. resources that don't get assigned public IPs). I won't detail the specifics of how each CSP accomplishes this; you can find much more thorough overviews in each CSP's documentation. However, in using this architecture, the problem becomes: how do I interact (UI/API/etc.) with these resources if they aren't publicly accessible?
In an ideal/corporate world, some sort of VPN peering would be set up so that merely being on your corporate VPN would allow access to the resources in these subnets. You may find yourself in the unfortunate situation I do, where you happen to be using an AWS (or Azure) account that doesn't have the fancy VPN peering enabled. Enter the SOCKS5 proxy. I also won't go over the specifics of how this works, but suffice it to say, we will create a bastion host in a public subnet (with a very narrow security group configuration), create an SSH tunnel to that host, and forward all of our web traffic over that SSH tunnel to the bastion host (which will be able to communicate with our CDP resources).
One last note: I'm going to assume you have already provisioned your CDP environment with private networking (following these docs for AWS or these docs for Azure). The easiest way to confirm this is to take a look at your FreeIPA information: if the Public IP is listed as "N/A", then you have a private networking setup.
AWS
Find a Public Subnet
Step 1 is to identify a public subnet. If you were the one who set up this environment, you may have this information handy. If not (say you had CDP create your network, or you forgot), here are two ways you can figure this out:
1. Find a subnet with a route to an Internet Gateway
Using the Console
Using the CLI
aws ec2 describe-internet-gateways --filter Name=attachment.vpc-id,Values=<YOUR_VPC_ID> | jq -r '.InternetGateways[0].InternetGatewayId'
<Returns an IGW_ID>
aws ec2 describe-route-tables --filter Name=route.gateway-id,Values=<IGW_ID> | jq -r '.RouteTables[0].Associations[0].SubnetId'
<Returns a PUBLIC_SUBNET_ID>
2. Find a subnet that has a NAT Gateway
Using the Console
Using the CLI
aws ec2 describe-nat-gateways --filter Name=vpc-id,Values=<YOUR_VPC_ID> | jq -r '.NatGateways[0].SubnetId'
<Returns a PUBLIC_SUBNET_ID>
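If you want an extra (hedged) check that the subnet you found really behaves like a public subnet, you can look at whether it auto-assigns public IPs:
# "true" here is a good hint (though not a guarantee) that this is a public subnet
aws ec2 describe-subnets --subnet-ids <PUBLIC_SUBNET_ID> | jq -r '.Subnets[0].MapPublicIpOnLaunch'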
Create a Bastion Host
Now that you have the ID of a public subnet, we just need to create a bastion host in your VPC (the same VPC that has your CDP environment).
Using the Console
Select the button to launch a new instance. Search for your favorite Linux variant (CentOS 7 below); you may have to select the AWS Marketplace tab. Then, choose Select.
Choose an instance type that meets your needs. I'm selecting t2.medium here.
Configure the networking. You need to select the VPC that your CDP environment is running in. Also, choose the public subnet we found earlier. Finally, depending on how your subnet was created, you may have to change the Auto-assign Public IP setting to Enable.
In the Storage Options, you may want to check the option to Delete on Termination so you don't leave EBS volumes lying around after you delete your bastion.
Add any tags you may want (like Name).
You may have an existing security group you want to use. In the below example, I created a new one with a single ingress rule allowing TCP port 22 from my IP.
Launch your instance and select your Key Pair. You can now grab the public IP address of your bastion host.
Via the CLI
Find an Image ID (below uses a product code for CentOS 7):
aws ec2 describe-images \
  --owners 'aws-marketplace' \
  --filters 'Name=product-code,Values=aw0evgkw8e5c1q413zgy5pjce' \
  --query 'sort_by(Images, &CreationDate)[-1].[ImageId]' \
  --output 'text'
<Returns an IMAGE_ID>
Launch the instance (NB: you have to provide an existing security group; you can create one with a single ingress rule for TCP/Port 22/Your IP as in the Console walkthrough above):
aws ec2 run-instances --image-id <IMAGE_ID> --count 1 --instance-type t2.medium --key-name <YOUR_KEY_PAIR_NAME> --security-group-ids <EXISTING_SECURITY_GROUP> --subnet-id <PUBLIC_SUBNET_ID> --block-device-mapping DeviceName=/dev/sda1,Ebs={VolumeSize=8} --associate-public-ip-address --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value="my-bastion"}]' 'ResourceType=volume,Tags=[{Key=Name,Value="my-bastion"}]'
Grab the public IP address of your new bastion host:
aws ec2 describe-instances --filters "Name=tag:Name,Values=my-bastion" 2>/dev/null | jq -r '.Reservations[].Instances[] | select(.State.Name!="terminated")' | jq -r .PublicIpAddress
Azure
Find a Public Subnet
This is easier in Azure. Azure doesn't really make the distinction between public and private subnets the way AWS does. So, in the Console and CLI steps, you'll just see us picking whatever subnet is first in the list of subnets in the VNET.
Create the Bastion Host
Using the Console
1. Find the Virtual Machines service in the Azure Portal and start the process of creating a virtual machine. Make sure you create your bastion virtual machine in the same subscription your CDP environment is running in. Second, I recommend creating a new resource group for your bastion host (it makes deleting things much easier later on). You should pick the same region your CDP environment is based out of as well. You then need to decide on an image. I'm using CentOS, but any Linux flavour should work. Finally, choose a default username and decide on a public key / password. The last section will be used to create a skeleton Network Security Group. Leave "SSH" selected; we'll modify the Network Security Group later to restrict the source IP(s) to our own IP only.
2. Disks don't really matter. I down-selected to "Standard SSD". Here's an important bit: ensure you choose the virtual network that houses your CDP environment. As mentioned earlier, I left the subnet alone; any subnet in your VNET will do. Leave the option of creating a new Public IP. We'll need this to connect to our bastion.
3. Create your virtual machine!
4. Once your resources are deployed, head to your resource group and click on your Network Security Group.
5. Once you're at your NSG, click on the first ingress rule for SSH access.
6. In the details box on the right, change the source to "IP Addresses" and add your IP/32 to the source address CIDR. This will restrict SSH access to your bastion host to just your IP address.
7. Head back to your resource group and click on your virtual machine.
8. Make note of the public IP assigned to your virtual machine.
Using the CLI
First, we'll grab the first subnet in our VNET. NB: this needs to be the subnet ID, not the subnet name. If you try creating a VM using the CLI command below and provide the friendly subnet name, Azure will create a new VNET and subnet for you (which is not what we want).
az network vnet subnet list --resource-group "<Your CDP Resource Group>" --vnet-name "YOUR CDP VNET NAME" | jq -r '.[0].id'
<Returns a Subnet ID>
Let's create a new resource group for our VM, NSG, and IP address (mostly for ease of deletion later on):
az group create --name my-bastion-rg --location "YOUR CDP ENV REGION"
Now we can create a new network security group and restrict ingress to Port 22/SSH and your personal IP address:
az network nsg create -g my-bastion-rg -n my-bastion-nsg
az network nsg rule create -g my-bastion-rg --nsg-name my-bastion-nsg -n ssh_cidr --priority 102 --source-address-prefixes "YOUR_IP_ADDRESS/32" --destination-address-prefixes '*' --destination-port-ranges 22 --direction Inbound --access Allow --protocol Tcp --description "Allow SSH to boxes from CIDR."
Finally, we can create our virtual machine:
az vm create --name my-bastion --resource-group my-bastion-rg --image OpenLogic:CentOS:7.5:latest --location "YOUR CDP ENV REGION" --admin-username centos --public-ip-address $prefix-bastion-ip --subnet "PUBLIC_SUBNET_ID" --ssh-key-values "YOUR_PUBLIC_KEY" --nsg my-bastion-nsg
Just grab the public IP address of your new VM:
az vm list -g my-bastion-rg -d | jq -r '.[0].publicIps'
Start the SOCKS5 Proxy
Open a terminal and create an SSH connection to your bastion host:
ssh -i <Path to Private Key for Instance Key Pair> -CND 8157 centos@<Bastion Host Public IP>
Launch a Browser using your Proxy
Launch your browser of choice using your proxy. Below is how you can launch Chrome with a separate user data directory using a proxy server. There are obviously ways to do this in other browsers (and without using the command line).
"/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" --user-data-dir="$HOME/chrome-with-proxy" --proxy-server="socks5://localhost:8157"
Now you can navigate to the CDP Management Console and connect to your UIs. Happy Private Browsing!
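Before pointing a whole browser at the tunnel, you can sanity-check it from another terminal with curl; the hostname below is a placeholder for any private endpoint in your environment (for example, your Data Lake's Cloudera Manager host):
# Route a single request through the SOCKS5 tunnel (-k skips cert validation for the quick test)
curl --socks5-hostname localhost:8157 -skI https://<YOUR_PRIVATE_CM_OR_KNOX_HOSTNAME>/ | head -n 1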
11-18-2020
08:50 AM
2 Kudos
A quick way to get started with CDP and Cloudera Data Warehouse (CDW) is to hydrate your environment with some existing relational tables you have. In this article, we'll walk through ingesting existing relational tables into CDP, creating table definitions on top of this data, and then briefly explore the question of "now what?"
A quick note: In this article, we'll use Apache Sqoop to do some full table copies as well as some simple selects/copies based on a where condition. If you plan on doing anything more complex or have thoughts of setting up something more permanent, you may want to consider another solution (like NiFi).
Special thanks to @pvidal for the initial dataset and RDS creation automation.
Determine the Source Data
I have created an Oracle database in one of my AWS accounts. Keep in mind, whatever your source data, you'll need to make sure that proper ingress is enabled in your CDP environment. I'll skip showing the specifics for my case but suffice to say I needed to make sure that my CDP environment could make a JDBC connection to my RDS.
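Before going further, a quick way to confirm a Data Hub node can actually reach the database is a simple TCP check from one of the cluster hosts (the endpoint is a placeholder; 1521 is the default Oracle listener port):
# Either of these should succeed if ingress is configured correctly
nc -zv <your_rds_endpoint> 1521
timeout 5 bash -c 'cat < /dev/null > /dev/tcp/<your_rds_endpoint>/1521' && echo reachable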
Here's a quick snapshot of my data. I have three tables populated with some sample data:
cdc.location
cdc.vaccine
cdc.vaccination_rate
Sqoop Setup
Spin up a "Data Engineering" Data Hub cluster in your CDP environment (or reuse an existing one). You'll need your Environment Admin to do some quick prep for you - namely, drop your JDBC jar into Sqoop's lib directory. In this case, it's the OJDBC jar:
$ sudo cp ojdbc7-12.1.0.2.jar /opt/cloudera/parcels/CDH/lib/sqoop/lib/ && sudo chmod 777 /opt/cloudera/parcels/CDH/lib/sqoop/lib/ojdbc7-12.1.0.2.jar
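As a quick check that Sqoop can see the driver and reach the database, you can list the source tables before kicking off any imports (the connection details are the same ones used in the import commands below):
$ sqoop list-tables --connect jdbc:oracle:thin:@<your_rds>:<your_rds_port>:<db_sid> --username cdc --password cdpw0rksh0p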
Now we need to prep a storage location for our data to be copied to. Technically, you have two options here. You can ingest your data to the HDFS on your DataHub cluster (which is what will happen if you don't specify a target directory in your Sqoop command - it will go to your HDFS home directory). But in order to be able to utilize these tables across the entire CDP landscape, we'll copy them to Cloud Storage (our environment s3 bucket).
SSH into your Data Hub cluster (as yourself) and create a staging location on your Cloud Storage.
$ hdfs dfs -mkdir -p s3a://perro-cdp-bucket/perro-de/sqoop/cli/tables
This implies you have the proper IDBroker mappings. See the Cloudera Data Platform Docs for more details.
Sqoop Execution
Now, it's time to actually do the sqoop-ing! In the below examples, I just do a full copy of the three oracle tables. See the Sqoop Docs for additional examples and configurations (like custom select statements and utilizing more YARN resources).
The main parameters (that you'll have to change according to your setup) are:
the JDBC URL - e.g. jdbc:oracle:thin:@cmp-cdp-workshop-orcl-rds.cjofigxjk44c.us-east-1.rds.amazonaws.com:1521:ORCL, which is of the format jdbc:oracle:thin:@<your_rds>:<your_rds_port>:<db_sid>
the DB username - e.g. "cdc"
the DB user's password - e.g. "cdpw0rksh0p"
the table name - e.g. "CDC.LOCATION"
the target directory - e.g. "s3a://<your_cdp_bucket>"
$ sqoop import --connect jdbc:oracle:thin:@cmp-cdp-workshop-orcl-rds.cjofigxjk44c.us-east-1.rds.amazonaws.com:1521:ORCL --username cdc --password cdpw0rksh0p --table CDC.LOCATION --target-dir 's3a://perro-cdp-bucket/perro-de-2/sqoop/cli/tables/location'
$ sqoop import --connect jdbc:oracle:thin:@cmp-cdp-workshop-orcl-rds.cjofigxjk44c.us-east-1.rds.amazonaws.com:1521:ORCL --username cdc --password cdpw0rksh0p --table CDC.VACCINE --target-dir 's3a://perro-cdp-bucket/perro-de-2/sqoop/cli/tables/vaccine'
$ sqoop import --connect jdbc:oracle:thin:@cmp-cdp-workshop-orcl-rds.cjofigxjk44c.us-east-1.rds.amazonaws.com:1521:ORCL --username cdc --password cdpw0rksh0p --table CDC.VACCINATION_RATE --target-dir 's3a://perro-cdp-bucket/perro-de-2/sqoop/cli/tables/vaccination_rate'
Note: I did encounter some oddness with Oracle or Sqoop or the OJDBC jar - I *had* to capitalize the table names.
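One optional hardening note: instead of putting the password on the command line, Sqoop can also prompt for it with -P or read it from a file with --password-file. A hedged example (the file path is made up):
$ sqoop import --connect jdbc:oracle:thin:@<your_rds>:<your_rds_port>:<db_sid> --username cdc --password-file file:///home/<your_user>/.oracle_pw --table CDC.LOCATION --target-dir 's3a://<your_cdp_bucket>/sqoop/cli/tables/location'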
After your job completes, you can do a quick sanity check using the HDFS CLI:
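For example (the paths are the ones used above; the part file names may differ on your run):
$ hdfs dfs -ls s3a://perro-cdp-bucket/perro-de-2/sqoop/cli/tables/location
$ hdfs dfs -cat s3a://perro-cdp-bucket/perro-de-2/sqoop/cli/tables/location/part-m-00000 | head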
Next Steps
Now that you have hydrated your CDP environment with some data, the world is your oyster! You could
Process the data with Cloudera Data Engineering (or another Datahub)
Build Impala/Hive tables on top of the data and use Cloudera Data Warehouse
Leverage the data in Cloudera Machine Learning
BONUS - CDW and Viz Apps
Let's do the following to run with our second example "Next Step": building some Hive tables and using Cloudera Data Warehouse:
Presuming you have a CDW environment, data catalog, and virtual warehouse at your disposal, let's create the CDP tables:
CREATE EXTERNAL TABLE LOCATION
(
location_id INT,
location_name STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3a://perro-cdp-bucket/perro-data-eng/sqoop/cli/tables/location';
CREATE EXTERNAL TABLE VACCINE
(
vaccine_id INT,
vaccine_name STRING,
vaccine_dose INT,
vaccine_age STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3a://perro-cdp-bucket/perro-data-eng/sqoop/cli/tables/vaccine';
CREATE EXTERNAL TABLE VACCINATION_RATE
(
location_id INT,
vaccine_id INT,
year INT,
rate FLOAT,
upper_limit FLOAT,
lower_limit FLOAT,
confidence_interval FLOAT,
sample_size INT,
target FLOAT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3a://perro-cdp-bucket/perro-data-eng/sqoop/cli/tables/vaccination_rate';
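As a quick smoke test (assuming you have beeline available, e.g. on your Data Engineering Data Hub), you could point it at the JDBC URL copied from your Virtual Warehouse and query one of the new tables; the URL, user, and password below are placeholders:
$ beeline -u "<VIRTUAL_WAREHOUSE_JDBC_URL>" -n <your_user> -p <your_workload_password> -e "SELECT count(*) FROM vaccination_rate;"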
Now you can query/data engineer/machine learn across your CDP environment. I happened to create my virtual warehouse with "Data Visualization" enabled:
So, I'll continue on and create a quick Dashboard using Cloudera Data Visualization. Click on the options menu in your virtual warehouse, and, instead of selecting Hue or DAS, click on "Open Data Visualization":
Log in and Head to the "Data" tab:
From there, you can click on the Default Hive VW and "Connection Explorer" to check out the tables available to you:
Above, you can see the three tables we created in the default database. Let's create a quick dataset so we can use it to create a Dashboard. Click on the "New Dataset" button, give your Dataset a name and choose which database and table you want your Dataset to use.
You should now see your new Dataset in the "Datasets" tab. To create a Dashboard using this Dataset, click the icon next to your Dataset:
You should be presented with your data in a table, but that's boring. So let's create a real quick Bar Chart to wrap things up. Click the third little icon in the visuals tray to turn the object into a Bar Chart. Then, drag the "vaccine_age" dimension to "X-Axis" and "Record Count" measure to "Y-Axis". Finally, click the "Refresh Visual" button.
You now have a bar chart that shows the number of vaccines by the age at which they are administered!
08-28-2020
01:16 PM
4 Kudos
It may often come to pass that you need to utilize data that doesn't exist in your Environment's Data Lake. This article will cover the 3 different scenarios where you need to access data that is outside of your Data Lake:
A bucket in the same account as your CDP Environment
A public bucket
A bucket in a different account (than your CDP Environment)
Since the steps for variations 1 and 2 are the same, I'll cover them together.
Since IDBroker maps CDP users to explicit IAM roles, you need to perform the following steps to access even public buckets (unless you intend to interact with them via URL).
Public Bucket / Bucket in the Same Account
First, identify the public bucket in question. For purposes of this article, I will use the "commoncrawl" public AWS bucket (https://registry.opendata.aws/commoncrawl/)
You'll need the bucket ARN (in this case "arn:aws:s3:::commoncrawl"), the ability to create IAM policies and roles, as well as the EnvironmentAdmin role in CDP.
Next, we'll create a policy for accessing this bucket. In our example, I have created a policy with this definition:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:CreateJob",
"s3:GetAccountPublicAccessBlock",
"s3:HeadBucket",
"s3:ListJobs"
],
"Resource": "*"
},
{
"Sid": "AllowListingOfDataLakeFolder",
"Effect": "Allow",
"Action": [
"s3:GetAccelerateConfiguration",
"s3:GetAnalyticsConfiguration",
"s3:GetBucketAcl",
"s3:GetBucketCORS",
"s3:GetBucketLocation",
"s3:GetBucketLogging",
"s3:GetBucketNotification",
"s3:GetBucketPolicy",
"s3:GetBucketPolicyStatus",
"s3:GetBucketPublicAccessBlock",
"s3:GetBucketRequestPayment",
"s3:GetBucketTagging",
"s3:GetBucketVersioning",
"s3:GetBucketWebsite",
"s3:GetEncryptionConfiguration",
"s3:GetInventoryConfiguration",
"s3:GetLifecycleConfiguration",
"s3:GetMetricsConfiguration",
"s3:GetObject",
"s3:GetObjectAcl",
"s3:GetObjectTagging",
"s3:GetObjectVersion",
"s3:GetObjectVersionAcl",
"s3:GetObjectVersionTagging",
"s3:GetReplicationConfiguration",
"s3:ListBucket",
"s3:ListBucketMultipartUploads",
"s3:ListMultipartUploadParts",
"s3:PutObject",
"s3:PutObjectAcl"
],
"Resource": [
"arn:aws:s3:::commoncrawl",
"arn:aws:s3:::commoncrawl/*"
]
}
]
}
And attached it to a corresponding role:
A few notes about this role:
It's a role much like your "Datalake Admin Role", i.e. its trusted entity type is "Another AWS Account" where the account number is your account number.
I had attached the DynamoDB policy here as well, since my ultimate goal was to interact with this bucket via the hdfs CLI (which used to involve S3Guard). This is no longer needed: S3Guard is no longer required.
Here is the trust relationship (the same as your "Datalake Admin Role"):
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::<Your Acct Number>:role/<Your ID Broker Role>"
},
"Action": "sts:AssumeRole"
}
]
}
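If you'd rather use the AWS CLI than the console for these IAM steps, a rough sketch looks like the following; the policy and role names are made up, and policy.json / trust.json are the two documents above saved locally:
# Create the access policy and the role, then attach one to the other
aws iam create-policy --policy-name commoncrawl-access-policy --policy-document file://policy.json
aws iam create-role --role-name commoncrawl-access-role --assume-role-policy-document file://trust.json
aws iam attach-role-policy --role-name commoncrawl-access-role --policy-arn arn:aws:iam::<Your Acct Number>:policy/commoncrawl-access-policy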
Now, we can head back to CDP and update our IDBroker Mappings.
From your environment overview, click "Actions" and "Manage Access".
Click on "ID Broker Mappings" and then click "Edit"
From here, you can add the user (or group) you'd like to give access to the bucket in question as well as the role we just created:
Click "Save & Sync" and you're good to go!
You can now interact with this bucket via the hdfs CLI (here's an example from my Data Lake "master" node)
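For example, once the mapping has synced, something along these lines should work:
hdfs dfs -ls s3a://commoncrawl/ | head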
Bucket in a Different Account
But what about a bucket that's in a different account, you say? The process is very similar...
(Thanks to Nathan Anthony (@na) for the inspiration here)
Again, first, identify the bucket. I'll be using a bucket called "perro-ext" in an account that is different than the account that hosts my CDP environment. I will refer to the accounts as "CDP Account" and "External Account".
Next, create a policy (in the CDP Account) for accessing the bucket in the External Account:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject",
"s3:PutObjectAcl",
"s3:ListBucket",
"s3:ListBucketMultipartUploads",
"s3:ListMultipartUploadParts",
"s3:PutObject",
"s3:PutObjectAcl"
],
"Resource": [
"arn:aws:s3:::perro-ext",
"arn:aws:s3:::perro-ext/*"
]
}
]
}
And attach it to a role in the same way as above:
With the exact same trust policy as above:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::<CDP Account Number>:role/<Your ID Broker Role>"
},
"Action": "sts:AssumeRole"
}
]
}
Here's where we briefly divert from the first scenario. We need to edit the Bucket Policy on the bucket in the External Account.
In the External Account, head to the bucket and click on "Permissions" and "Bucket Policy"
Here's the policy I used:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::<Your CDP Account>:role/<Your CDP Account Role>"
},
"Action": [
"s3:GetObject",
"s3:PutObject",
"s3:PutObjectAcl",
"s3:ListBucket",
"s3:ListBucketMultipartUploads",
"s3:ListMultipartUploadParts",
"s3:GetBucketLocation"
],
"Resource": [
"arn:aws:s3:::perro-ext/*",
"arn:aws:s3:::perro-ext"
]
}
]
}
(The steps are loosely based on the first resolution method here).
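If you prefer the CLI to the console for this step, the rough equivalent (with the policy above saved locally as bucket-policy.json, a filename I've made up) is:
# Run with credentials for the External Account
aws s3api put-bucket-policy --bucket perro-ext --policy file://bucket-policy.json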
Now we're back to the same steps as the method above. Head back to the ID Broker mappings tab in Manage Access under your Environment and add the role to the user or group of your choice:
"Save and Sync" and now you can access the bucket in the "External Account" from the hdfs CLI:
But What About RAZ?
If you have RAZ enabled in your environment, the steps are largely the same, but you must add a policy to access the bucket in question to the RAZ role. The Bucket Policy (if reading/writing to a bucket in a different account) is the same.
And don't forget to update your cm_s3 ranger policies!
DISCLAIMER: This article is contributed by an external user. The steps may not be verified by Cloudera and may not be applicable for all use cases and may be very specific to a particular distribution. Please follow with caution and at your own risk. If needed, raise a support case to get confirmation.
05-19-2020
12:22 PM
CML provides some nice high-level session, model, and experiment auditing. But sometimes a situation arises where you would like audit logs at the level of file modification and user logins. This article provides a quick walkthrough of accessing the underlying RDBMS for this information. This page gives details on the tables and information available. This guide presupposes you are using AWS.
Thanks to @fletch_jeff for the help.
Step 1 - Install aws-iam-authenticator
Install according to your OS as per these instructions.
Step 2 - Grab your User ARN
From the command line:
==> aws sts get-caller-identity
{
"Account": "012345678910",
"UserId": "ABBBBBBBBB BXXXXXXXXXX",
"Arn": "arn:aws:iam::012345678910:user/cperro"
}
From AWS Console
Open the AWS Console.
Navigate to IAM.
Navigate to Users.
Select your user name.
Copy the ARN.
Step 3 - Add your ARN to your Workspace
Navigate to Machine Learning Workspaces in CDP.
Click on the Options icon (three vertical dots) for your workspace and select the Manage Remote Access option:
Paste your ARN and click Grant Access.
Click Download Kubeconfig:
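Before querying anything, a quick sanity check that the downloaded kubeconfig works is to list the pods in the mlx namespace (the kubeconfig filename below matches the one used in the next step; yours will differ):
kubectl --kubeconfig ~/Downloads/perro-small-workspace-kubeconfig.yaml get pods -n mlx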
Step 4 - Query the Audit Database
Note: the --kubeconfig file should be the KubeConfig you downloaded in the previous step
#List the tables in the database - interactive shell for db-0 pod in mlx namespace -> execute the psql command with user "sense"
kubectl --kubeconfig ~/Downloads/perro-small-workspace-kubeconfig.yaml exec -it db-0 -n mlx -- psql -P pager=off -U sense -c "\l"
#Show the last 10 user events
kubectl --kubeconfig ~/Downloads/perro-small-workspace-kubeconfig.yaml exec -it db-0 -n mlx -- psql -P pager=off -U sense -c "SELECT id, user_id, event_name, description, created_at FROM user_events order by created_at desc limit 10;"
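Building on the same user_events table, here is one more example that summarizes activity per user (it only uses columns already shown above):
#Count events per user and event type
kubectl --kubeconfig ~/Downloads/perro-small-workspace-kubeconfig.yaml exec -it db-0 -n mlx -- psql -P pager=off -U sense -c "SELECT user_id, event_name, count(*) FROM user_events GROUP BY user_id, event_name ORDER BY count(*) DESC;"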
05-19-2020
12:22 PM
2 Kudos
This article explains how to set up the Hive Warehouse Connector (HWC) in CDP Public Cloud CML (tested with CDP Public Cloud runtime 7.1).
Step 1 - Start a Virtual Warehouse
Navigate to Data Warehouses in CDP.
If you haven't activated your environment yet, do that first.
Once activated, provision a new Hive Virtual Warehouse:
Step 2 - Find your HiveServer2 JDBC URL
Once your Virtual Warehouse has been created, find it in the list of Virtual Warehouses.
Click the Options icon (three vertical dots) for your Virtual Warehouse.
Select the Copy JDBC URL option and note what is copied to the clipboard:
Step 3 - Find your Hive Metastore URI
Navigate back to your environment Overview page and select the Data Lake tab.
Click on the CM-UI Service:
Click on the Options icon (three vertical dots) for the Hive Metastore Service and click Configuration:
Click the Actions dropdown menu and select the Download Client Configuration item:
Extract the downloaded zip file, open the hive-site.xml file, and find the value for the hive.metastore.uris configuration. Make a note of this value.
Step 4 - Find the Hive Warehouse Connector Jar
A recent update has made this step unnecessary.
Navigate to your CML Workspace, select your Project, and launch a Python Workbench session.
In either the terminal or Workbench, find the hive warehouse connector jar (HWC):
find / -name "*hive-warehouse-connector*"
*Note: CML no longer allows "access" to /dev/null - so redirecting errors to that location no longer works. The above command will contain a lot of "Permission Denied" output, but the jar you're looking for should be somewhere mixed in - likely in /usr/lib.
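Since redirecting stderr to /dev/null isn't an option, one workaround is to filter the noise out of the combined output instead:
find / -name "*hive-warehouse-connector*" 2>&1 | grep -v "Permission denied"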
Step 5 - Configure your Spark Session
The Overview Page for the Hive Warehouse Connector provides details and current limitations. The Configuration Page details the two modes described below. Note: fine-grained Ranger access controls are bypassed in the High-Performance Read Mode (i.e. LLAP / Cluster Mode).
JDBC / Client Mode
from pyspark.sql import SparkSession
from pyspark_llap import HiveWarehouseSession
spark = SparkSession\
.builder\
.appName("PythonSQL-Client")\
.master("local[*]")\
.config("spark.yarn.access.hadoopFileSystems","s3a:///[STORAGE_LOCATION]")\
.config("spark.hadoop.yarn.resourcemanager.principal", "[Your_User]")\
.config("spark.sql.hive.hiveserver2.jdbc.url", "[VIRTUAL_WAREHOUSE_HS2_JDBC_URL];user=[Your_User];password=[Your_Workload_Password]")\
.config("spark.datasource.hive.warehouse.read.via.llap", "false")\
.config("spark.datasource.hive.warehouse.read.jdbc.mode", "client")\
.config("spark.datasource.hive.warehouse.metastoreUri", "[Hive_Metastore_Uris]")\
.config("spark.datasource.hive.warehouse.load.staging.dir", "/tmp")\
.getOrCreate()
# Note: adding the HWC jar via .config("spark.jars", "[HWC_Jar_Location]") is no longer necessary
hive = HiveWarehouseSession.session(spark).build()
LLAP / Cluster Mode
Note: LLAP / Cluster Mode doesn't require the HiveWarehouseSession, though you are free to use it for consistency between the modes.
from pyspark.sql import SparkSession
from pyspark_llap import HiveWarehouseSession
spark = SparkSession\
.builder\
.appName("PythonSQL-Cluster")\
.master("local[*]")\
.config("spark.yarn.access.hadoopFileSystems","s3a:///[STORAGE_LOCATION]")\
.config("spark.hadoop.yarn.resourcemanager.principal", "[Your_User]")\
.config("spark.sql.hive.hiveserver2.jdbc.url", "[VIRTUAL_WAREHOUSE_HS2_JDBC_URL];user=[Your_User];password=[Your_Workload_Password]")\
.config("spark.datasource.hive.warehouse.read.via.llap", "true")\
.config("spark.datasource.hive.warehouse.read.jdbc.mode", "cluster")\
.config("spark.datasource.hive.warehouse.metastoreUri", "[Hive_Metastore_Uris]")\
.config("spark.datasource.hive.warehouse.load.staging.dir", "/tmp")\
.config("spark.sql.hive.hwc.execution.mode", "spark")\
.config("spark.sql.extensions", "com.qubole.spark.hiveacid.HiveAcidAutoConvertExtension")\
.getOrCreate()
# Note: adding the HWC jar via .config("spark.jars", "[HWC_Jar_Location]") is no longer necessary
hive = HiveWarehouseSession.session(spark).build()
Step 6 - SQL All the Things
Add your Spark SQL...
JDBC / Client Mode
from pyspark.sql.types import *
#This table has column masking and row level filters in Ranger. The below query, using the HWC, has the policies applied.
hive.sql("select * from masking.customers").show()
#This query, using plain spark sql, will not have the column masking or row level filter policies applied.
spark.sql("select * from masking.customers").show()
LLAP / Cluster Mode
from pyspark.sql.types import *
#This table has column masking and row level filters in Ranger. Neither are applied in the below due to LLAP/Cluster Mode High Performance Reads
hive.sql("select * from masking.customers").show()
spark.sql("select * from masking.customers").show()