Created on 01-20-2021 12:55 PM - edited 05-18-2022 07:48 AM
Credits to @mbawa (Mandeep Singh Bawa), who co-built all the assets in this article. Thank you!
We (Mandeep and I) worked on a customer use case in which Cloudera Data Engineering (Spark) jobs were triggered once a file landed in S3 (details on how to trigger CDE from Lambda here). Triggering CDE jobs is quite simple; however, we needed much more. Here are a few of the requirements:
It may look as though we are trying to turn NiFi into an orchestration engine for CDE. That's not the case. Rather, we are meeting some core objectives by leveraging capabilities already within the platform. CDE ships with Apache Airflow, a much richer orchestration engine; here we are simply integrating AWS triggers, multiple CDE clusters, monitoring, alerting, and a single API for multiple clusters.
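As a rough sketch of the Lambda-side trigger mentioned above, the handler below reacts to an S3 PUT event and submits a CDE job run. The virtual cluster host, job name, token placeholder, and the exact shape of the override spec are all assumptions for illustration, not the flow from this article:

```python
import json
import urllib.request

# Hypothetical values -- substitute your own CDE virtual cluster
# Jobs API base, job name, and access token.
CDE_JOBS_API = "https://<vc-host>/dex/api/v1"
CDE_JOB_NAME = "s3-file-processor"

def job_spec_from_s3_event(event):
    """Build a CDE job-run override spec from an S3 PUT event record."""
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]
    # Pass the landed file to the Spark job as an argument.
    return {"overrides": {"spark": {"args": [f"s3a://{bucket}/{key}"]}}}

def lambda_handler(event, context):
    spec = job_spec_from_s3_event(event)
    req = urllib.request.Request(
        f"{CDE_JOBS_API}/jobs/{CDE_JOB_NAME}/run",
        data=json.dumps(spec).encode("utf-8"),
        headers={"Authorization": "Bearer <token>",
                 "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return {"statusCode": resp.status}
```

The NiFi flow described in this article sits in place of the direct `urlopen` call, adding queuing, retries, and multi-cluster routing on top of the same job spec.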
At a high level, the NiFi workflow does the following:
The following NiFi parameters will be required:
Note: When you run the workflow for the first time, the Kafka topics will generally be created for you automatically.
Once a CDE job spec is sent to NiFi, NiFi does the following:
To get started, the API details for the primary and secondary (if available) CDE clusters are needed as NiFi parameters:
https://service.cde-zzzzzz.moad-aw.aaaaa-aaaa.cloudera.site/grafana/d/sK1XDusZz/kubernetes?orgId=1&refresh=5s
service.cde-zzzzzz.moad-aw.aaaaa-aaaa.cloudera.site
Now get the Jobs API URL for both the primary and secondary (if available) clusters. For a virtual cluster, it looks like:
https://aaa.cde-aaa.moad-aw.aaa-aaa.cloudera.site/dex/api/v1
aaa.cde-aaa.moad-aw.aaa-aaa.cloudera.site
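Given the two hosts above (the service host from the Grafana URL and the virtual cluster host from the Jobs API URL), the CDE endpoints NiFi needs can be derived as below. This is a minimal sketch; the token path follows the usual CDE Knox pattern, but verify it against your cluster's documentation:

```python
def cde_endpoints(service_host, vc_host):
    """Derive CDE auth-token and Jobs API URLs from the two hostnames.

    service_host -- host from the cluster's Grafana URL (no scheme/path)
    vc_host      -- host from the virtual cluster's Jobs API URL
    """
    return {
        # The access-token endpoint lives on the service (cluster) host;
        # it is typically called with workload user/password basic auth.
        "token": f"https://{service_host}/gateway/authtkn/knoxtoken/api/v1/token",
        # The Jobs API lives on the virtual cluster host.
        "jobs": f"https://{vc_host}/dex/api/v1/jobs",
    }
```

In the NiFi flow these two values map directly onto the primary/secondary cluster parameters described above.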
Inside of the NiFi workflow, there is a test flow to verify the NiFi CDE jobs pipeline works:
To run the flow, set the URL inside InvokeHTTP to one of the NiFi nodes. Run it; if the integration is working, you will see a job running in CDE.
Enjoy! Oh, by the way, I plan to publish a video walking through the NiFi flow.