05-26-2021
07:49 AM
1 Kudo
Introduction
In Cloudera Data Engineering (CDE), you may need to move jobs from one virtual cluster to another, for example for testing purposes or to move jobs onto the latest CDE version. This can be easily achieved through the backup & restore process, which is well documented here. The purpose of this article is to provide additional tools to simplify this and to demonstrate the process. The example shown here moves jobs within the same CDE service, but you can use the same approach to move them across CDE services.
Steps
Ensure that the CDP CLI is set up on the system with the required access to list and describe CDE services & virtual clusters. Also, have your workload password handy, as it is required for the export and import process.
Download export & import scripts using the following commands. wget https://raw.githubusercontent.com/karthikeyanvijay/cdp-cde/main/export-import-cde-jobs/cde-vc-jobs-export.sh
wget https://raw.githubusercontent.com/karthikeyanvijay/cdp-cde/main/export-import-cde-jobs/cde-vc-jobs-import.sh
In this example, we will move the jobs from the virtual cluster sourceVC to targetVC. The sourceVC contains three jobs and the targetVC does not contain any job.
Run the following command to export the jobs from sourceVC. When prompted, enter the workload password. Make a note of the backup zip file name from the output.
Usage: ./cde-vc-jobs-export.sh <source-cluster-name> <source-vc-name> <workload-username>
Run the following command to import the jobs into targetVC. When prompted, enter the workload password.
Usage: ./cde-vc-jobs-import.sh <target-cluster-name> <target-vc-name> <workload-username> <file-name>
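For illustration, a hypothetical run could look like the following (the service name, workload user, and backup file name are placeholders; use the zip file name printed by the export script):
./cde-vc-jobs-export.sh cde-service-1 sourceVC csso_vijay
./cde-vc-jobs-import.sh cde-service-1 targetVC csso_vijay cde-sourceVC-jobs-backup.zip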
Here is a view of the targetVC after a successful import. You can use the more advanced options listed here to export jobs based on job names, include resources, and so on. Modify the script to suit your requirements.
04-15-2021
02:59 PM
1 Kudo
Introduction
This post covers the steps required to build a custom runtime for Cloudera Data Engineering (CDE). The process pulls the base image from container.repository.cloudera.com, builds a custom image from the Dockerfile provided, and uploads the custom image to Amazon ECR using AWS CodeBuild. All the files mentioned in this post can be downloaded from here.
Steps
Setup ECR & IAM role using AWS CloudFormation
The CloudFormation template cloudformation-ecr-codebuild.yml creates the Amazon Elastic Container Registry (ECR) repository and the IAM role required for AWS CodeBuild.
Update the files cloudformation-ecr-codebuild.yml, cloudformation-parameters.json, and cloudformation-tags.json as required
Create the CloudFormation stack using the following command:
aws cloudformation create-stack \
  --stack-name vkar-ecr \
  --template-body file://cloudformation-ecr-codebuild.yml \
  --parameters file://cloudformation-parameters.json \
  --tags file://cloudformation-tags.json \
  --capabilities CAPABILITY_NAMED_IAM
(If required) To update the stack using change sets, use the following commands.
aws cloudformation create-change-set \
  --stack-name vkar-ecr \
  --change-set-name change1 \
  --template-body file://cloudformation-ecr-codebuild.yml \
  --parameters file://cloudformation-parameters.json \
  --tags file://cloudformation-tags.json \
  --capabilities CAPABILITY_NAMED_IAM
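To review and apply the change set, the standard CloudFormation commands can be used; for example:
aws cloudformation describe-change-set --stack-name vkar-ecr --change-set-name change1
aws cloudformation execute-change-set --stack-name vkar-ecr --change-set-name change1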
Modify AWS CodeBuild config files
Update the Dockerfile with the customizations required on top of the base image. The base image for the Dockerfile is provided by the aws-codebuild.json file.
Modify the aws-codebuild.json file with the updates for the environment.
The parameters SOURCE_REPO_USERNAME and SOURCE_REPO_PASSWORD in the aws-codebuild.json file specify the credentials used to pull the base image; you can either embed the username and password directly here or specify the name of an AWS Secrets Manager secret.
Quickly review the buildspec.yml file and make modifications if necessary. In general, no changes will be needed here.
Build the custom image using AWS CodeBuild
Zip up the Dockerfile & buildspec.yml (with no root directory) and upload the archive to the S3 bucket specified in aws-codebuild.json.
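As a rough sketch, assuming the Dockerfile and buildspec.yml are in the current directory and the bucket name below is replaced with the source bucket configured in aws-codebuild.json:
zip -j source.zip Dockerfile buildspec.yml
aws s3 cp source.zip s3://<codebuild-source-bucket>/source.zip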
Create the CodeBuild project using the following command:
aws codebuild create-project --cli-input-json file://aws-codebuild.json
Run the build using the following command:
aws codebuild start-build --project-name cde-ml-xgboost-build
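To monitor the build from the CLI, you can poll CodeBuild as follows (the build ID is printed by the start-build command):
aws codebuild list-builds-for-project --project-name cde-ml-xgboost-build
aws codebuild batch-get-builds --ids <build-id>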
You can customize aws-codebuild.sh to automate the above steps.
Run CDE job with custom runtime image
Follow these steps to use the custom runtime image to run a job:
Create a resource of type custom-runtime-image:
cde resource create --type="custom-runtime-image" \
  --image-engine="spark2" \
  --name="cde-runtime-ml" \
  --image="123456789012.dkr.ecr.us-west-2.amazonaws.com/cde/cde-spark-runtime-2.4.5:ml-xgboost"
Create a job using the newly created resource:
cde job create --type spark --name ml-scoring-job \
  --runtime-image-resource-name cde-runtime-ml \
  --application-file ./ml-scoring.py \
  --num-executors 30 \
  --executor-memory 4G \
  --driver-memory 4G
Execute the job:
cde job run --name ml-scoring-job
-------------------
Vijay Anand Karthikeyan
04-14-2021
06:39 PM
You have to use --py-file via the CDE CLI to submit your Python files. This way, you can submit your custom Python scripts/packages as .py/.zip/.egg file(s). Similarly, via the API, use the "pyFiles": [ "string" ] configuration to submit your files.
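As a minimal sketch (the file names are placeholders), a submission with a packaged dependency could look like this:
cde spark submit ./main.py \
  --py-file ./dependencies.zip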
04-14-2021
12:50 PM
There may be a need to use Anaconda to set up a Python virtual environment or an R environment to run sparklyr on a CDP Public Cloud Data Hub cluster. CDP Public Cloud provides "Recipes" as extensions to install additional software on top of the base image. In this post, we will use a Recipe with a sample script to accomplish this.
Download this bootstrap/recipe script for the Anaconda installation:
wget https://raw.githubusercontent.com/karthikeyanvijay/cdp-publiccloud/main/datahub-recipes/setup-anaconda.sh
Review the parameters in the first section of the script and modify them as required. USER_GROUP_ADMIN will be the owner of the directory where Anaconda is installed. ACLs are added to the directory for USER_GROUP_1 and USER_GROUP_2 to grant sufficient access. If you have only one user group, remove all lines containing USER_GROUP_2 from the script. TAR_FILE_PATH is the location where the packaged Python/R tarballs are written; these can be shipped with YARN jobs so that the Python/R environment does not have to be installed on every host in the cluster.
ANACONDA_PATH=/hadoopfs/fs1/anaconda3
ANACONDA_DOWNLOAD_FILE=Anaconda3-2020.11-Linux-x86_64.sh
ANACONDA_DOWNLOAD_URL=https://repo.anaconda.com/archive/${ANACONDA_DOWNLOAD_FILE}
ANACONDA_DOWNLOAD_PATH=/tmp
USER_GROUP_ADMIN=sandbox-default-ps-admin
USER_GROUP_1=ps-sandbox-aws-env-user-group
USER_GROUP_2=cdp_sandbox_workers_ww
TAR_FILE_PATH=/hadoopfs/fs1
If you do not require an R or Python environment setup, you can remove the corresponding section in the script.
Also, update the packages required for your workloads in the Python and/or R sections.
You can now test this script independently, if possible, to ensure that there are no syntax errors and it will work as expected.
Upload the script into the recipe section in the CDP Control plane.
You can now attach the recipe during cluster provisioning. Here is an example where the 'setup-anaconda' recipe is attached to the gateway host.
Once the cluster is built, you should see the tar.gz files under TAR_FILE_PATH and have access to the conda commands on the node(s) where the recipe(s) were executed. Note that the user has to run conda init bash before running these commands. Here is a view of the commands from the gateway node.
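For reference, here is a hedged sketch of how the packaged environment can be shipped with a Spark-on-YARN job; the tarball and application names are placeholders, and the archive is unpacked in the containers under the alias "environment":
spark-submit --master yarn --deploy-mode cluster \
  --archives /hadoopfs/fs1/python_env.tar.gz#environment \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
  --conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python \
  my_pyspark_job.py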
-------------
Vijay Anand Karthikeyan
04-13-2021
03:20 PM
1 Kudo
Introduction
When working with CDP Public Cloud, there may be a need to access other AWS services from Apache Spark. One such scenario is to get a secret (like a database password) from AWS Secrets Manager. One approach is to use AWS access keys, but using long-term security credentials in a program may not be feasible or desirable. Within CDP, you can obtain temporary AWS credentials from the ID Broker and then use the AWS Java SDK to access AWS Secrets Manager.
Steps
Here are the steps to try this in a Spark shell first:
Ensure that the IAM role mapped to the user has 'Read' access to the AWS Secrets Manager.
To test out this program, you can launch a Spark shell using the following command:
spark-shell --master=yarn \
  --conf "spark.jars.packages=com.amazonaws:aws-java-sdk:1.11.984,org.scalaj:scalaj-http_2.11:0.3.15"
Run the commands in the file interactively to see the results.
// Change variables here
val id_broker_host = "ps-sandbox-aws-dl-idbroker0.ps-sandb.a465-9q4k.cloudera.site"
val secretName = "cde-cloudera-repo"
val region = "us-west-2"
// Retrieve credentials from ID Broker
import scalaj.http.{Http, HttpOptions}
import org.json4s.jackson.JsonMethods._
val id_broker_request = Http("https://"+id_broker_host+":8444/gateway/dt/knoxtoken/api/v1/token")
val id_broker_token = (parse(id_broker_request.asString) \ "access_token").values.toString
val auth_header = Map("Authorization" -> s"Bearer $id_broker_token", "cache-control" -> "no-cache")
val id_broker_credentials_request = Http("https://"+id_broker_host+":8444/gateway/aws-cab/cab/api/v1/credentials").headers(auth_header)
val id_broker_credentials = parse(id_broker_credentials_request.asString) \\ "Credentials"
val aws_access_key = (id_broker_credentials \ "AccessKeyId").values.toString
val aws_secret_key = (id_broker_credentials \ "SecretAccessKey").values.toString
val aws_session_token = (id_broker_credentials \ "SessionToken").values.toString
// Use the retrieved credentials
import com.amazonaws.auth.BasicSessionCredentials
import com.amazonaws.auth.AWSStaticCredentialsProvider
val aws_session_credentials = new BasicSessionCredentials(aws_access_key, aws_secret_key, aws_session_token)
val aws_credentials = new AWSStaticCredentialsProvider(aws_session_credentials)
// Access Secrets Manager service using AWS Java SDK with the temporary credentials
import com.amazonaws.services.secretsmanager.AWSSecretsManager
import com.amazonaws.services.secretsmanager.AWSSecretsManagerClient
import com.amazonaws.services.secretsmanager.model._
val secretsmanager_client = AWSSecretsManagerClient.builder.withCredentials(aws_credentials).withRegion(region).build
val getSecretValueRequest = new GetSecretValueRequest().withSecretId(secretName)
val getSecretValueResult = secretsmanager_client.getSecretValue(getSecretValueRequest)
val secret = getSecretValueResult.getSecretString()
print(secret)
You can easily embed these steps in an Apache Spark Scala program to get the secret before creating the Spark session. Do not forget to include the dependent jars when submitting the job.
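For example, a submission that pulls in the same dependencies used in the spark-shell test above might look like this (the class and jar names are placeholders):
spark-submit --master yarn \
  --packages com.amazonaws:aws-java-sdk:1.11.984,org.scalaj:scalaj-http_2.11:0.3.15 \
  --class com.example.SecretsApp \
  my-spark-app.jar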
Download this code sample from here
wget https://raw.githubusercontent.com/karthikeyanvijay/cdp-publiccloud/main/aws/scripts/getAWSCredentials.scala
Conclusion
This post provided an example of accessing AWS Secrets Manager from Apache Spark using temporary credentials from the ID Broker. The same approach can be used to access other AWS services.
-------------
Vijay Anand Karthikeyan
04-12-2021
09:17 AM
1 Kudo
Introduction
If the AWS IAM roles are not configured correctly and you are running into issues with them, the best way to troubleshoot is with the AWS CLI. This post contains the steps and the setup required for this.
Steps
If the data lake admin role is not configured correctly, you will see the following error during the data lake deployment.
Error: Failed to create HDFS directory., Command [CreateRangerAuditDir], with id [1546333865] failed: Failed to create HDFS directory s3a://<bucketname>/data/ranger/audit/hdfs., Command [RangerPluginCreateAuditDir].
To get to the root cause of the issue, check the S3 permissions on the IAM role. To troubleshoot further, you can assume the role from the ID Broker's instance profile and use the AWS CLI to reproduce the issue. To do this, use the following steps:
Log in to the ID Broker host as the cloudbreak user, using the PEM file provided during the environment build
Download this script
wget https://raw.githubusercontent.com/karthikeyanvijay/cdp-publiccloud/main/aws/scripts/get-dladmin-access.sh
chmod u+x get-dladmin-access.sh
Modify the role_arn in the script to point to the data lake admin role
Execute the script
./get-dladmin-access.sh
The temporary AWS credentials for the role are now exported to the AWS profile dladmin
Use the AWS CLI to troubleshoot S3 IAM issues
aws s3 cp sample.txt s3://<bucketname>/<path> --profile dladmin
You should now be able to reproduce the same error that occurred during the CDP deployment using AWS CLI.
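A couple of additional checks with the same profile can help confirm which role is actually in effect and whether the audit prefix is reachable:
aws sts get-caller-identity --profile dladmin
aws s3 ls s3://<bucketname>/data/ranger/audit/ --profile dladmin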
You can now tweak the permissions and come up with the right IAM permissions for the role. Resume the CDP deployment after fixing the permission issues.
Conclusion
Using the above process, you can accelerate troubleshooting and help narrow down permission issues. The example above shows troubleshooting issues with the data lake admin role, but the same process can be used to troubleshoot any AWS IAM roles.
-------------
Vijay Anand Karthikeyan
04-12-2021
07:44 AM
1 Kudo
Introduction
This post explains the process of submitting Spark jobs on a CDP Public Cloud Data Hub cluster to the Livy server and provides a sample wrapper script for job submission. To understand the Livy Spark submit process and how to arrive at the job configuration, please refer to this post.
The following steps submit the Spark Pi job from the spark_examples*.jar on S3.
Steps
Get the Livy endpoint for Data hub from the CDP control plane
Ensure that you have the workload password set for the environment
Download the Python wrapper script using the following command:
wget https://raw.githubusercontent.com/karthikeyanvijay/cdp-publiccloud/master/datahub-scripts/livy-cdp-spark-submit/cdp_spark_submit.py
Edit the script to modify the Livy endpoint, workload username, and password
Copy the Spark examples jar to S3:
hdfs dfs -cp /opt/cloudera/parcels/CDH-7.2.7-1.cdh7.2.7.p6.11615609/lib/spark/examples/jars/spark-examples_2.11-2.4.5.7.2.7.6-2.jar \
  s3a://vkarthikeyan/
Create a sample job configuration file as follows with the name jobconf.json
{ "className":"org.apache.spark.examples.SparkPi", "args": [1000], "file":"s3a://vkarthikeyan/spark-examples_2.11-2.4.5.7.2.7.6-2.jar", "driverMemory": "2G", "driverCores": 1, "executorCores": 2, "executorMemory": "4G", "numExecutors": 3, "queue": "default" }
Run the script
./cdp_spark_submit.py
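Under the hood, the wrapper essentially POSTs the job configuration to the Livy /batches endpoint. A minimal curl equivalent, assuming the Data Hub Livy endpoint accepts basic authentication with the workload credentials, would be:
curl -u <workload-username> \
  -H "Content-Type: application/json" \
  -H "X-Requested-By: <workload-username>" \
  -d @jobconf.json \
  https://<livy-endpoint>/batches
The X-Requested-By header is only required if CSRF protection is enabled on the Livy server.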
Conclusion
The job should now be submitted to the Data Hub cluster. The wrapper script can also be used on CDP Private Cloud Base clusters where Livy is configured.
------------
Vijay Anand Karthikeyan
03-11-2021
02:26 PM
2 Kudos
The following steps can be used to configure IntelliJ to run Spark jobs on Cloudera Data Engineering (CDE). This way, developers can test their jobs on CDE without having to leave the IDE. The steps shown here are for IntelliJ installed on a Windows PC.
Complete setting up CDE CLI with Git Bash as per this article.
In IntelliJ, go to File > Settings and modify the "shell path" to the following: "C:\Program Files\Git\bin\bash.exe" --login -i
Test this setting by launching a terminal from within the IntelliJ project and running CDE CLI commands. If you are unable to run CDE CLI commands (for example, because the CDE_CONFIG environment variable is not configured), you will have the option to set it up or override it in the next step.
Add a Run/Debug configuration with a new Shell Script with the following details. Ignore any error stating that the shell script is not found. You have the option of overriding the CDE_CONFIG environment variable (in case you want to submit to a different cluster than the system default). Use environment variables to supply arguments to your Spark job:
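A hypothetical shell script body for that Run/Debug configuration could look like the following; the config path, jar, class, and environment variable names are placeholders:
# Placeholder paths and names; adjust for your project
export CDE_CONFIG=${CDE_CONFIG:-/c/Users/<user>/cde/config-dev.yaml}
cde spark submit ./target/my-spark-app.jar "$APP_INPUT_PATH" \
  --class com.example.Main \
  --driver-memory 2g --executor-memory 2g --num-executors 2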
Run the cde spark submit from the Run/Debug menu. You should see the job run in the terminal window in IntelliJ.
03-11-2021
05:48 AM
1 Kudo
Download the Cloudera Data Engineering (CDE) CLI from Cloudera Data Engineering > CDE Services > Virtual Clusters.
Open the Cluster Details page and click the CLI TOOL button to download the CDE CLI.
Move the CLI to the desired directory. Here, it is moved to D:\Users\Vijay.Karthikeyan\Documents\utils\cde.
Add this directory to the PATH environment variable.
Create the directory ~/AppData/Local/cloudera/cde to store the CDE configuration files.
Make a note of your workload username from the profile page of the CDP control plane, and set workload password for the environment if not done already.
Create a file containing the workload password
Create the certificate file ca.pem holding the root certificates. Get the certificates here.
Concatenate them and save them as ~/AppData/Local/cloudera/cde/ca.pem.
Get the virtual cluster endpoint (this will be entered as the vcluster-endpoint value in the config.yaml file).
Create a file config.yaml with the following contents under ~/AppData/Local/cloudera/cde:
user: <Workload username>
vcluster-endpoint: <URL copied in step 8>
auth-pass-file: <path to file created in step 6>
tls-ca-certs: <File created in step 7>
Set the environment variable CDE_CONFIG to point to the config.yaml file created in the previous step.
Now that PATH is updated and a new environment variable is set, you should relaunch Git Bash for the changes to take effect.
Run any CDE command to validate connectivity.
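For example, listing the jobs and resources in the virtual cluster confirms that authentication and TLS are set up correctly:
cde job list
cde resource list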
You can also submit spark-submit commands using cde spark submit:
cde spark submit ./insurance-cde-app.jar \
s3a://devbucket/data/files/insurance_fraud \
--class org.cloudera.cde.app.Application \
--driver-cores 1 --driver-memory 3g \
--executor-cores 1 --executor-memory 3g --num-executors 3 \
--job-name insurance_fraud_job \
--log-level INFO \
--hide-logs
If you want to connect to another virtual cluster, you can export the CDE_CONFIG at a session-level to point to another file.
For CDE CLI usage, refer to the Cloudera official documentation.
08-07-2019
03:44 PM
As long as two of the three ZooKeeper servers are up and running, all functionality will continue to work. But when another one goes down, the ensemble will no longer be able to form a quorum. When the disk gets replaced and the third ZooKeeper server comes back online, it will "bootstrap" itself from the active servers.