In this article, we will learn how to run scheduled ETL workloads on CDE (Cloudera Data Engineering) and CDW (Cloudera Data Warehouse) using Airflow.


The CDE 1.4 release adds the capability to schedule Airflow jobs (Airflow ships along with CDE). To learn more about Airflow, refer to the Airflow website.

This release adds two Airflow operators: CDEJobRunOperator and CDWOperator.

Behind the scenes, the CDEJobRunOperator calls the CDE jobs API to run Spark jobs. More details on the jobs API can be found in the CDE documentation.
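As a rough sketch, a DAG task that triggers an existing CDE job looks like the following (the import path and job name are illustrative and should be verified against the CDE documentation for your release):

# Sketch only: trigger an existing CDE Spark job from an Airflow DAG.
# The import path follows the CDE documentation examples; verify it for your CDE release.
from cloudera.cdp.airflow.operators.cde_operator import CDEJobRunOperator

run_spark_etl = CDEJobRunOperator(
    task_id='run-spark-etl',
    job_name='airflow-cde-etl',  # illustrative: name of a job already created in the CDE virtual cluster
    dag=dag,                     # assumes a DAG object named `dag` is defined elsewhere in the file
)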


Behind the scenes, the CDWOperator uses a JDBC connection to interact with a CDW Hive or Impala Virtual Warehouse.
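A similarly hedged sketch of a CDWOperator task (the parameter names follow the CDE documentation examples; the connection ID and query are illustrative):

# Sketch only: run a SQL statement against a CDW Hive/Impala Virtual Warehouse over JDBC.
from cloudera.cdp.airflow.operators.cdw_operator import CDWOperator

load_to_cdw = CDWOperator(
    task_id='load-to-cdw',
    cli_conn_id='hive_conn',  # Airflow connection configured under Admin > Connections (see below)
    hql='SHOW TABLES',        # illustrative statement to run in the virtual warehouse
    schema='default',
    dag=dag,                  # assumes a DAG object named `dag` is defined elsewhere in the file
)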


To demonstrate an ETL workload, we will read a CSV file from an AWS S3 bucket with Spark and perform a transformation (via the CDEJobRunOperator), then load the processed data into the CDW warehouse using the Airflow CDWOperator.

Prerequisites

A CDP environment is created, and the CDW and CDE experiences are provisioned.

CDE Job

The CDE job will be run from Airflow using the CDEJobRunOperator. The job reads CSV data (a sample is available in GitHub) and writes the transformed data back to S3.
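The job itself is packaged as a jar (built below), but conceptually it performs a read-transform-write along these lines; the following PySpark sketch uses illustrative paths and a trivial transformation, and is not the code from the repo:

# Conceptual PySpark sketch of the read-transform-write performed by the CDE job.
# Paths are illustrative; the actual job is the jar built from the repo.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("airflow-cde-etl").getOrCreate()

# Read the raw CSV data set from S3.
raw_df = spark.read.option("header", "true").csv("s3a://test_data_bucket/Provider.csv")

# Example transformation: drop duplicate rows.
transformed_df = raw_df.dropDuplicates()

# Write the transformed data back to S3.
transformed_df.write.mode("overwrite").parquet("s3a://test_data_bucket/transformed/")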


Build the project with Maven in this repo; the resulting artifact will be airflow-cde-etl-1.0-SNAPSHOT.jar.

Upload the jar on the Create Job screen of the CDE virtual cluster UI, as shown in the following screenshot.

[Screenshot: CDE Create Job screen with the uploaded jar]

Note: {{ dataSetBucketPath }} and {{ writeBucketPath }} are variables that can be overridden from the Airflow DAG.
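For example, the operator examples in the CDE documentation expose a variables argument that can supply these values from the DAG; a hedged sketch (the bucket paths are placeholders):

# Sketch: override the job's {{ dataSetBucketPath }} and {{ writeBucketPath }} variables from the DAG.
run_spark_etl = CDEJobRunOperator(
    task_id='run-spark-etl',
    job_name='airflow-cde-etl',
    variables={
        'dataSetBucketPath': 's3a://test_data_bucket/Provider.csv',  # placeholder path
        'writeBucketPath': 's3a://test_data_bucket/transformed/',    # placeholder path
    },
    dag=dag,
)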

Configuring the Hive|Impala VW Connection in Airflow

Launch the Airflow UI of the CDE virtual cluster; the link is available on the virtual cluster details screen.

Add the Hive|Impala connection details under Admin > Connections, as shown in the following screenshot.

[Screenshot: Airflow Admin > Connections screen with the Hive|Impala connection details]

The hostname can be obtained from the Copy JDBC Url option of the Hive/Impala virtual warehouse as follows (use only the hostname portion of the JDBC URL, not the full URL):

[Screenshot: Copy JDBC Url option on the Hive/Impala Virtual Warehouse]

Note: The username/password are your CDP workload username and password.

Submit Airflow DAG 

Before submitting the job, upload Provider.csv to the CDP S3 bucket. Then, submit the Airflow DAG from the CDE Create Job UI as follows:


[Screenshot: submitting the Airflow DAG from the CDE Create Job screen]

Note: Replace test_data_bucket in the DAG code with the applicable bucket name.
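For reference, the overall wiring of such a DAG looks roughly like the following sketch; the import paths, job name, connection ID, schedule, table definition, and bucket paths are illustrative and not the exact code from the repo:

# Sketch of the end-to-end DAG: run the CDE Spark job, then load the result into CDW.
from datetime import datetime

from airflow import DAG
# Import paths follow the CDE documentation examples; verify them for your CDE release.
from cloudera.cdp.airflow.operators.cde_operator import CDEJobRunOperator
from cloudera.cdp.airflow.operators.cdw_operator import CDWOperator

with DAG(
    dag_id='cde-cdw-etl-demo',
    start_date=datetime(2021, 2, 1),
    schedule_interval='@daily',  # illustrative schedule
    catchup=False,
) as dag:

    # Step 1: run the Spark job in CDE, pointing it at the uploaded CSV in S3.
    run_spark_etl = CDEJobRunOperator(
        task_id='run-spark-etl',
        job_name='airflow-cde-etl',
        variables={
            'dataSetBucketPath': 's3a://test_data_bucket/Provider.csv',
            'writeBucketPath': 's3a://test_data_bucket/transformed/',
        },
    )

    # Step 2: expose the transformed data in the Hive virtual warehouse over JDBC.
    load_to_cdw = CDWOperator(
        task_id='load-to-cdw',
        cli_conn_id='hive_conn',
        schema='default',
        hql="""
            CREATE EXTERNAL TABLE IF NOT EXISTS provider_transformed (
                provider_id STRING,   -- illustrative columns
                provider_name STRING
            )
            STORED AS PARQUET
            LOCATION 's3a://test_data_bucket/transformed/'
        """,
    )

    run_spark_etl >> load_to_cdw

When the DAG is scheduled, the two tasks run on each schedule interval, with the CDW step starting only after the CDE job succeeds.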

There you go, we have scheduled an Airflow job that reads from the S3 bucket and loads into the CDW Hive virtual warehouse.


Comments
New Contributor

[Screenshot: error output when connecting to the Impala Virtual Warehouse]

Hello @SanthoshGowda, connecting to the Impala VW in the same way reports an error, while the Hive VW works fine. Any idea?

Expert Contributor

Hi @archer2012 , that error output doesn't give us a lot of information about what went wrong, but it looks like the connection wasn't successful. I recommend using Beeline in verbose mode from a command line on a suitable node to troubleshoot the connection independently. Once you have the connection to Impala working separately, then you can come back to Airflow and use the working connection settings.