In this article, we will learn how to run scheduled ETL workloads on CDE (Cloudera Data Engineering) and CDW (Cloudera Data Warehouse) using Airflow.


The CDE 1.4 release adds the capability to schedule Airflow jobs (Airflow ships along with CDE). To learn more about Airflow, refer to the Airflow website.

This release adds two Airflow operators: CDEJobRunOperator and CDWOperator.

Behind the scenes, the CDEJobRunOperator calls the CDE jobs API to run Spark jobs. More details on the jobs API can be found in the CDE documentation.
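As a rough sketch, a DAG task that triggers an existing CDE job looks like the following (the import path and job name are illustrative and should be verified against the CDE documentation for your release):

# Sketch only: trigger an existing CDE Spark job from an Airflow DAG.
# The import path follows the CDE documentation examples; verify it for your CDE release.
from cloudera.cdp.airflow.operators.cde_operator import CDEJobRunOperator

run_spark_etl = CDEJobRunOperator(
    task_id='run-spark-etl',
    job_name='airflow-cde-etl',  # illustrative: name of a job already created in the CDE virtual cluster
    dag=dag,                     # assumes a DAG object named `dag` is defined elsewhere in the file
)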


Behind the scenes, the CDWOperator uses a JDBC connection to interact with a CDW Hive or Impala Virtual Warehouse.
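A similarly hedged sketch of a CDWOperator task (the parameter names follow the CDE documentation examples; the connection ID and query are illustrative):

# Sketch only: run a SQL statement against a CDW Hive/Impala Virtual Warehouse over JDBC.
from cloudera.cdp.airflow.operators.cdw_operator import CDWOperator

load_to_cdw = CDWOperator(
    task_id='load-to-cdw',
    cli_conn_id='hive_conn',  # Airflow connection configured under Admin > Connections (see below)
    hql='SHOW TABLES',        # illustrative statement to run in the virtual warehouse
    schema='default',
    dag=dag,                  # assumes a DAG object named `dag` is defined elsewhere in the file
)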


To demonstrate an ETL workload, we will read a CSV file from an AWS S3 bucket with Spark and perform a transformation (via the CDEJobRunOperator), then load the processed data into the CDW warehouse using the Airflow CDWOperator.

Prerequisites

A CDP environment is created, and the CDW and CDE experiences are provisioned.

CDE Job

The CDE job will be run from Airflow using the CDEJobRunOperator. The job reads CSV data (a sample is available in GitHub) and writes the transformed data back to S3.
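The job itself is packaged as a jar (built below), but conceptually it performs a read-transform-write along these lines; the following PySpark sketch uses illustrative paths and a trivial transformation, and is not the code from the repo:

# Conceptual PySpark sketch of the read-transform-write performed by the CDE job.
# Paths are illustrative; the actual job is the jar built from the repo.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("airflow-cde-etl").getOrCreate()

# Read the raw CSV data set from S3.
raw_df = spark.read.option("header", "true").csv("s3a://test_data_bucket/Provider.csv")

# Example transformation: drop duplicate rows.
transformed_df = raw_df.dropDuplicates()

# Write the transformed data back to S3.
transformed_df.write.mode("overwrite").parquet("s3a://test_data_bucket/transformed/")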


Build the project with Maven in this repo; the resulting artifact will be airflow-cde-etl-1.0-SNAPSHOT.jar.

Upload the jar on the Create Job screen of the CDE virtual cluster UI, as shown in the following screenshot.

[Screenshot: CDE Create Job screen with the uploaded jar]

Note: {{ dataSetBucketPath }} and {{ writeBucketPath }} are variables that can be overridden from the Airflow DAG.
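For example, the operator examples in the CDE documentation expose a variables argument that can supply these values from the DAG; a hedged sketch (the bucket paths are placeholders):

# Sketch: override the job's {{ dataSetBucketPath }} and {{ writeBucketPath }} variables from the DAG.
run_spark_etl = CDEJobRunOperator(
    task_id='run-spark-etl',
    job_name='airflow-cde-etl',
    variables={
        'dataSetBucketPath': 's3a://test_data_bucket/Provider.csv',  # placeholder path
        'writeBucketPath': 's3a://test_data_bucket/transformed/',    # placeholder path
    },
    dag=dag,
)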

Configuring the Hive|Impala VW Connection in Airflow

Launch the Airflow UI of the CDE virtual cluster; the link is available on the virtual cluster details screen.

Add the Hive|Impala connection details under Admin > Connections, as shown in the following screenshot.

[Screenshot: Airflow Admin > Connections screen with the Hive|Impala connection details]

The hostname can be obtained from the Copy JDBC Url option of the Hive/Impala virtual warehouse as follows (use only the hostname portion of the JDBC URL, not the full URL):

[Screenshot: Copy JDBC Url option on the Hive/Impala Virtual Warehouse]

Note: The username/password are your CDP workload username and password.

Submit Airflow DAG 

Before submitting the job, upload Provider.csv to the CDP S3 bucket. Then, submit the Airflow DAG from the CDE Create Job UI as follows:


[Screenshot: submitting the Airflow DAG from the CDE Create Job screen]

Note: Replace test_data_bucket in the DAG code with the applicable bucket name.
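For reference, the overall wiring of such a DAG looks roughly like the following sketch; the import paths, job name, connection ID, schedule, table definition, and bucket paths are illustrative and not the exact code from the repo:

# Sketch of the end-to-end DAG: run the CDE Spark job, then load the result into CDW.
from datetime import datetime

from airflow import DAG
# Import paths follow the CDE documentation examples; verify them for your CDE release.
from cloudera.cdp.airflow.operators.cde_operator import CDEJobRunOperator
from cloudera.cdp.airflow.operators.cdw_operator import CDWOperator

with DAG(
    dag_id='cde-cdw-etl-demo',
    start_date=datetime(2021, 2, 1),
    schedule_interval='@daily',  # illustrative schedule
    catchup=False,
) as dag:

    # Step 1: run the Spark job in CDE, pointing it at the uploaded CSV in S3.
    run_spark_etl = CDEJobRunOperator(
        task_id='run-spark-etl',
        job_name='airflow-cde-etl',
        variables={
            'dataSetBucketPath': 's3a://test_data_bucket/Provider.csv',
            'writeBucketPath': 's3a://test_data_bucket/transformed/',
        },
    )

    # Step 2: expose the transformed data in the Hive virtual warehouse over JDBC.
    load_to_cdw = CDWOperator(
        task_id='load-to-cdw',
        cli_conn_id='hive_conn',
        schema='default',
        hql="""
            CREATE EXTERNAL TABLE IF NOT EXISTS provider_transformed (
                provider_id STRING,   -- illustrative columns
                provider_name STRING
            )
            STORED AS PARQUET
            LOCATION 's3a://test_data_bucket/transformed/'
        """,
    )

    run_spark_etl >> load_to_cdw

When the DAG is scheduled, the two tasks run on each schedule interval, with the CDW step starting only after the CDE job succeeds.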

There you go, we have scheduled an Airflow job that reads from the S3 bucket and loads into the CDW Hive virtual warehouse.


Comments
New Contributor

[Screenshot: error output when connecting to the Impala Virtual Warehouse]

Hello @SanthoshGowda, connecting to the Impala VW in the same way reports an error, while the Hive VW works fine. Any idea?

Expert Contributor

Hi @archer2012 , that error output doesn't give us a lot of information about what went wrong, but it looks like the connection wasn't successful. I recommend using Beeline in verbose mode from a command line on a suitable node to troubleshoot the connection independently. Once you have the connection to Impala working separately, then you can come back to Airflow and use the working connection settings.