Created on 02-17-2021 09:33 AM - edited on 02-17-2021 07:28 PM by subratadas
In this article, we will learn how to run scheduled ETL workloads on CDE (Cloudera Data Engineering) and CDW (Cloudera Data Warehouse) using Airflow.
The CDE 1.4 release provides the capability to schedule Airflow jobs (Airflow is shipped along with CDE). To learn more about Airflow, refer to the Airflow website.
This release adds two Airflow operators: CDEJobRunOperator and CDWOperator.
Behind the scenes, the CDEJobRunOperator calls the CDE jobs API to run Spark jobs. More details on the jobs API can be found here.
Behind the scenes, the CDWOperator uses a JDBC connection to interact with a CDW Hive or Impala virtual warehouse.
To demonstrate an ETL workload in this article, we will read a CSV file from an AWS S3 bucket using Spark, perform a transformation (via the CDEJobRunOperator), and load the processed data into the CDW warehouse using the Airflow CDWOperator. A minimal sketch of such a DAG is shown below.
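The sketch below assumes the operator import paths bundled with CDE's embedded Airflow; the job name read-transform-csv, the connection id hive_conn, the bucket path, and the HiveQL statement are placeholders, not values taken from the article.

from datetime import datetime, timedelta

from airflow import DAG
from cloudera.cdp.airflow.operators.cde_operator import CDEJobRunOperator
from cloudera.cdp.airflow.operators.cdw_operator import CDWOperator

default_args = {
    "owner": "dataeng",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="etl_csv_to_cdw",
    default_args=default_args,
    start_date=datetime(2021, 2, 1),
    schedule_interval="@daily",      # run the ETL once a day
    catchup=False,
) as dag:

    # Step 1: trigger the Spark job already defined in the CDE virtual cluster.
    read_and_transform = CDEJobRunOperator(
        task_id="read_and_transform_csv",
        job_name="read-transform-csv",   # hypothetical name of the CDE Spark job
    )

    # Step 2: run HiveQL on the CDW virtual warehouse over the transformed data.
    load_to_cdw = CDWOperator(
        task_id="load_to_cdw",
        cli_conn_id="hive_conn",         # Airflow connection added under Admin > Connections
        hql="LOAD DATA INPATH 's3a://test_data_bucket/processed/' INTO TABLE provider_processed",
        schema="default",
    )

    read_and_transform >> load_to_cdw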
As a prerequisite, a CDP environment is created, and the CDW and CDE experiences are provisioned.
The CDE job will be run from Airflow using the CDEJobRunOperator. The job will read CSV data (a sample is available in GitHub) and write the transformed data back to S3.
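The job in the repo is packaged as a JAR; the PySpark sketch below only illustrates the read-transform-write flow it performs. The paths and the provider_name column are hypothetical.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("read-transform-csv").getOrCreate()

input_path = "s3a://test_data_bucket/raw/Provider.csv"   # corresponds to dataSetBucketPath
output_path = "s3a://test_data_bucket/processed/"        # corresponds to writeBucketPath

# Read the raw CSV with a header row.
providers = spark.read.option("header", "true").csv(input_path)

# Example transformation: trim whitespace and upper-case a name column.
transformed = providers.withColumn(
    "provider_name", F.upper(F.trim(F.col("provider_name")))
)

# Write the transformed data back to S3 as Parquet.
transformed.write.mode("overwrite").parquet(output_path)

spark.stop()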
Run a Maven build in this repo; the resulting artifact will be airflow-cde-etl-1.0-SNAPSHOT.jar.
Upload the JAR in the Create Job screen of the CDE virtual cluster UI, as shown in the following screenshot.
Note: {{ dataSetBucketPath }} and {{ writeBucketPath }} are variables that can be overridden from the Airflow DAG, as sketched below.
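For instance, assuming the operator's variables argument (which maps to the templated variables of the CDE job), the job-level defaults could be overridden per run from the DAG like this; the paths are placeholders.

from cloudera.cdp.airflow.operators.cde_operator import CDEJobRunOperator

# Override the job's templated variables for this run (placeholder paths).
read_and_transform = CDEJobRunOperator(
    task_id="read_and_transform_csv",
    job_name="read-transform-csv",
    variables={
        "dataSetBucketPath": "s3a://test_data_bucket/raw/Provider.csv",
        "writeBucketPath": "s3a://test_data_bucket/processed/",
    },
)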
Launch the Airflow UI of the CDE virtual cluster; the link is available in the virtual cluster details screen.
Add the Hive/Impala connection details as shown in the following screenshot (available under Admin > Connections).
The hostname can be obtained from the Copy JDBC Url option of the Hive/Impala virtual warehouse as follows:
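For illustration only, here is a made-up URL of that shape and the substring that goes into the Host field of the Airflow connection (everything between the // and the first /):

# Made-up JDBC URL of the shape produced by "Copy JDBC Url"; only the host part
# goes into the Host field of the Airflow connection.
jdbc_url = (
    "jdbc:hive2://hs2-my-vw.env-abc123.dw.example.site/default;"
    "transportMode=http;httpPath=cliservice;ssl=true"
)
host = jdbc_url.split("//", 1)[1].split("/", 1)[0]
print(host)  # hs2-my-vw.env-abc123.dw.example.site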
Note: The username/password will be your CDP workload username and password.
Before submitting the job, upload Provider.csv to the CDP S3 bucket. Then, submit the Airflow DAG from the CDE job creation UI as follows:
Note: Replace test_data_bucket in the DAG code with the applicable bucket name.
There you go, we have scheduled an Airflow job to read from the S3 bucket and load the data into the CDW Hive virtual warehouse.
Created on 08-04-2023 06:21 AM
Hello @SanthoshGowda, connecting the Impala VW in the same way reports an error, while the Hive VW is OK.
Any idea?
Created on 08-04-2023 08:11 AM
Hi @archer2012 , that error output doesn't give us a lot of information about what went wrong, but it looks like the connection wasn't successful. I recommend using Beeline in verbose mode from a command line on a suitable node to troubleshoot the connection independently. Once you have the connection to Impala working separately, then you can come back to Airflow and use the working connection settings.