
In this article, we will learn how to run scheduled ETL workloads on CDE (Cloudera Data Engineering) and CDW (Cloudera Data Warehouse) using Airflow.

 

The CDE 1.4 release provides the capability to schedule Airflow jobs (Airflow is shipped along with CDE). To learn more about Airflow, refer to the Airflow website.

This release adds two Airflow operators: CDEJobRunOperator and CDWOperator.

Behind the scenes, the CDEJobRunOperator calls the CDE jobs API to run Spark jobs. More details on the jobs API can be found here.
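As a minimal sketch, a DAG that triggers an existing CDE Spark job might look like the following (the import path follows the operator bundled with CDE's Airflow, and the job name etl-spark-job is an assumed placeholder for a job already created in the virtual cluster):

```python
# Minimal sketch, not the article's exact DAG; the job name is a placeholder.
from datetime import datetime

from airflow import DAG
from cloudera.cdp.airflow.operators.cde_operator import CDEJobRunOperator

with DAG(
    dag_id="cde_job_example",
    start_date=datetime(2021, 2, 1),
    schedule_interval=None,
) as dag:
    # Triggers a run of a Spark job already defined in the CDE virtual cluster;
    # behind the scenes this calls the CDE jobs API.
    run_spark_etl = CDEJobRunOperator(
        task_id="run_spark_etl",
        job_name="etl-spark-job",  # assumed placeholder job name
    )
```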

 

Behind the scenes, the CDWOperator uses a JDBC connection to interact with a CDW Hive or Impala virtual warehouse.
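Similarly, a minimal sketch of a CDWOperator task might look like this (the connection ID cdw_hive_conn and the query are assumed placeholders; the connection itself is configured later in this article):

```python
# Minimal sketch, assuming an Airflow connection named "cdw_hive_conn" that
# points to the CDW virtual warehouse; the query is a placeholder.
from datetime import datetime

from airflow import DAG
from cloudera.cdp.airflow.operators.cdw_operator import CDWOperator

with DAG(
    dag_id="cdw_query_example",
    start_date=datetime(2021, 2, 1),
    schedule_interval=None,
) as dag:
    # Runs a HiveQL statement against the CDW virtual warehouse over JDBC.
    sample_query = CDWOperator(
        task_id="sample_query",
        cli_conn_id="cdw_hive_conn",  # assumed placeholder connection ID
        schema="default",
        hql="SHOW TABLES",
    )
```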

 

To demonstrate an ETL workload in this article, we will read a CSV file from an AWS S3 bucket using Spark, perform a transformation (via the CDEJobRunOperator), and load the processed data into the CDW warehouse using the Airflow CDWOperator.

Prerequisites

A CDP environment is created, and the CDW and CDE experiences are provisioned.

CDE Job

The CDE job will be run from Airflow using the CDEJobRunOperator. The job will read CSV data (a sample is available in GitHub) and write the transformed data back to S3.
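The actual job is the Maven-built Spark application from the repo; purely to illustrate the read-transform-write flow, a PySpark equivalent might look like the sketch below (the paths are supplied at submission time, and the transformation shown is only an example):

```python
# Illustrative PySpark sketch only; the job used in this article is the
# Maven-built JAR from the repo. The S3 paths are placeholders.
import sys

from pyspark.sql import SparkSession


def main(data_set_bucket_path, write_bucket_path):
    spark = SparkSession.builder.appName("airflow-cde-etl").getOrCreate()

    # Read the raw CSV data from S3
    df = spark.read.option("header", "true").csv(data_set_bucket_path)

    # Example transformation: drop fully empty rows and de-duplicate
    transformed = df.dropna(how="all").dropDuplicates()

    # Write the transformed data back to S3
    transformed.write.mode("overwrite").parquet(write_bucket_path)

    spark.stop()


if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])
```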

 

Build the project with Maven in this repo; the resulting artifact will be airflow-cde-etl-1.0-SNAPSHOT.jar.

Upload the JAR in the Create Job screen of the CDE virtual cluster UI, as shown in the following screenshot.

[Screenshot: Create Job screen of the CDE virtual cluster UI]

Note: {{ dataSetBucketPath }} and {{ writeBucketPath }} are variables that can be overridden from the Airflow DAG (see the DAG sketch later in this article).

Configure the Hive|Impala VW Connection in Airflow

Launch the Airflow UI of the CDE virtual cluster; the link is available on the virtual cluster details screen.

Add the Hive|Impala connection details under Admin > Connections, as shown in the following screenshot.

[Screenshot: Airflow connection details under Admin > Connections]

The hostname can be obtained from the Copy JDBC Url option of the Hive/Impala virtual warehouse, as follows:

[Screenshot: Copy JDBC Url option on the Hive/Impala virtual warehouse]

Note: The username/password are your CDP workload username and password.
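For illustration, a hypothetical JDBC URL and the hostname portion that goes into the connection's Host field might look like this (the real URL comes from the Copy JDBC Url action on your virtual warehouse):

```python
# Hypothetical JDBC URL; copy the real one from the virtual warehouse.
jdbc_url = (
    "jdbc:hive2://hs2-my-hive-vw.env-abc1.dw.example.site/default;"
    "transportMode=http;httpPath=cliservice;ssl=true"
)

# The host portion between "jdbc:hive2://" and the next "/" is what goes
# into the Airflow connection's Host field.
hostname = jdbc_url.split("//", 1)[1].split("/", 1)[0]
print(hostname)  # hs2-my-hive-vw.env-abc1.dw.example.site
```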

Submit the Airflow DAG

Before submitting the job, upload Provider.csv to the CDP S3 bucket. Then, submit the Airflow DAG from the CDE Create Job UI as follows:

 

[Screenshot: Submitting the Airflow DAG from the CDE Create Job UI]

Note: Replace test_data_bucket in the DAG code with the applicable bucket name.
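For reference, a sketch of what such a DAG might look like is shown below. This is not the exact code from the repo: the job name, connection ID, schedule, bucket paths, and the HiveQL statement are placeholders, and the variables argument is assumed to map to the CDE job-run variables mentioned earlier.

```python
# Sketch of an ETL DAG wiring the two operators together; names, paths,
# schedule, and the HiveQL are placeholders rather than the repo's code.
from datetime import datetime

from airflow import DAG
from cloudera.cdp.airflow.operators.cde_operator import CDEJobRunOperator
from cloudera.cdp.airflow.operators.cdw_operator import CDWOperator

with DAG(
    dag_id="etl_s3_to_cdw",
    start_date=datetime(2021, 2, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Step 1: run the Spark job on CDE, overriding the templated bucket paths
    spark_etl = CDEJobRunOperator(
        task_id="spark_etl",
        job_name="etl-spark-job",  # placeholder CDE job name
        variables={
            "dataSetBucketPath": "s3a://test_data_bucket/data/Provider.csv",
            "writeBucketPath": "s3a://test_data_bucket/data/output",
        },
    )

    # Step 2: expose the transformed data in the CDW Hive virtual warehouse
    load_to_cdw = CDWOperator(
        task_id="load_to_cdw",
        cli_conn_id="cdw_hive_conn",  # placeholder Airflow connection ID
        schema="default",
        hql="""
            CREATE EXTERNAL TABLE IF NOT EXISTS provider_transformed
            STORED AS PARQUET
            LOCATION 's3a://test_data_bucket/data/output'
        """,
    )

    spark_etl >> load_to_cdw
```

Once submitted, the DAG appears in the Airflow UI of the virtual cluster and runs on the configured schedule.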

There you go: we have scheduled an Airflow job that reads from the S3 bucket and loads the data into the CDW Hive virtual warehouse.

