Created on 07-05-2022 03:49 PM - edited 07-06-2022 07:23 AM
There are numerous ways of building event-driven architectures in the cloud with Cloudera Data Platform (CDP). Lately I was working on a use case that required Spark inferences and transformations on streaming log data in an Azure cloud environment on a near-real-time basis. The goal is to kick off Spark transformations as soon as files land in the Azure storage container.
The preferred way of doing Spark transformations for Data Engineering use cases in CDP is Cloudera Data Engineering (CDE), which runs Spark on Kubernetes. CDE provides much better isolation, job orchestration with the help of Airflow, and Spark version independence. It also scales faster, uses cloud-native resources efficiently, and, best of all, handles dependency management like a rockstar. It just makes your life easy and simple.
CDE exposes a Jobs API for integration with your CI/CD platforms. There are multiple patterns for building streaming architectures with CDE, such as leveraging:
In this article we will discuss the latter option: how to leverage Azure Event Grid and Azure Functions to trigger CDE jobs. Azure Event Grid integrates natively with many other services, and the same pattern applies to them, but for the sake of this article we will explore how Event Grid integrates with an Azure Storage container to monitor for incoming files and invoke a serverless Azure Function written in Python, which in turn invokes a Cloudera Data Engineering (CDE) Airflow/PySpark job using the CDE Jobs API.
Create Airflow and/or Spark jobs in CDE following these instructions
Gather the JOBS API URL for the Airflow/Spark job and the token_url for retrieving the AUTH token
Get the TOKEN URL for extracting the AUTH token for authentication purposes
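With these two URLs, the call that the Azure Function needs to make can be sketched roughly as follows. This is a minimal sketch, not the exact code used in this article. It assumes the token endpoint accepts HTTP basic authentication with your workload user and returns JSON containing an `access_token` field, and that a job run is started with a POST to `<JOBS API URL>/jobs/<job name>/run`. The helper names (`build_token_request`, `build_run_request`, `trigger_cde_job`) are made up for illustration; verify the endpoints against the API documentation of your own CDE virtual cluster.

```python
import base64
import json
import urllib.parse
import urllib.request


def build_token_request(token_url, user, password):
    """Build a GET request for the token endpoint using HTTP basic auth (assumed)."""
    creds = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        token_url,
        headers={"Authorization": f"Basic {creds}"},
        method="GET",
    )


def build_run_request(jobs_api_url, job_name, access_token):
    """Build a POST request to <jobs_api_url>/jobs/<job_name>/run (assumed path)."""
    safe_name = urllib.parse.quote(job_name, safe="")
    url = f"{jobs_api_url.rstrip('/')}/jobs/{safe_name}/run"
    return urllib.request.Request(
        url,
        data=b"{}",  # empty JSON body: run the job with its stored configuration
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def trigger_cde_job(token_url, jobs_api_url, job_name, user, password, timeout=30):
    """Fetch a token, then start the job; returns the parsed JSON response."""
    token_req = build_token_request(token_url, user, password)
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        # Assumption: the token response is JSON with an "access_token" field.
        access_token = json.load(resp)["access_token"]

    run_req = build_run_request(jobs_api_url, job_name, access_token)
    with urllib.request.urlopen(run_req, timeout=timeout) as resp:
        return json.load(resp)
```

Keeping the request construction separate from the network calls makes the URL and header logic easy to check locally before deploying. In a real function, the user name and password would come from application settings or a secret store rather than from code.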
pip freeze > requirements.txt
Run the command above from the root folder of your local project to collect all your Python dependencies into a requirements.txt file.
2. Test the function locally (optional). Deploy the function to the Azure Functions app that was created in step 1-i above. You should be able to do that from Visual Studio Code, though you might have to configure your Azure environment to work with it.
/blobServices/default/containers/data/blobs/hrongtest/
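The path above has the shape of the `subject` field that Event Grid attaches to blob storage events: `/blobServices/default/containers/<container>/blobs/<blob path>`. Inside the function it can be useful to apply the same prefix check and to split the subject into container and blob path before triggering the CDE job. The sketch below assumes the event arrives as a dictionary in the Event Grid schema with `eventType` and `subject` fields; the helper names and the sample file name `app.log` are hypothetical.

```python
BLOB_CREATED = "Microsoft.Storage.BlobCreated"
SUBJECT_ROOT = "/blobServices/default/containers/"


def parse_blob_subject(subject):
    """Split an Event Grid blob subject into (container, blob_path)."""
    if not subject.startswith(SUBJECT_ROOT):
        raise ValueError(f"not a blob storage subject: {subject!r}")
    container, sep, blob_path = subject[len(SUBJECT_ROOT):].partition("/blobs/")
    if not sep or not container or not blob_path:
        raise ValueError(f"malformed blob subject: {subject!r}")
    return container, blob_path


def should_trigger(event, subject_prefix):
    """True only for blob-created events whose subject starts with the prefix."""
    return (
        event.get("eventType") == BLOB_CREATED
        and event.get("subject", "").startswith(subject_prefix)
    )


# Example with a hypothetical file landing under the filtered folder.
sample_event = {
    "eventType": BLOB_CREATED,
    "subject": "/blobServices/default/containers/data/blobs/hrongtest/app.log",
}
prefix = "/blobServices/default/containers/data/blobs/hrongtest/"
if should_trigger(sample_event, prefix):
    container, blob_path = parse_blob_subject(sample_event["subject"])
    print(container, blob_path)
```

Filtering in the Event Grid subscription already keeps unrelated events from reaching the function; repeating the check in code is a cheap safeguard in case the subscription filter is later changed.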
Cloudera Data Engineering (CDE) makes it super simple to run Spark jobs at scale. Azure cloud-native integration patterns such as Azure Event Grid make it even more flexible in terms of what you can build. Happy building in Azure with Cloudera Data Engineering (CDE)!