Recently I was engaged in a use case where CDE processing needed to be triggered as soon as data landed on s3. In AWS, the s3 trigger is implemented with a Lambda function: as files land in s3, the Lambda function fires and calls CDE to process them. At trigger time, the Lambda event includes the names and locations of the files the trigger was executed on, and those locations are passed on to the CDE job so it knows exactly which files to pick up and process.
Create a Lambda function on an s3 bucket (Code provided above)
Trigger on put/post
Load a file or files on s3 (any file)
AWS Lambda is triggered by this event and calls CDE. The call to CDE includes the locations and names of all the files the trigger was executed on
CDE launches, processes the files, and ends gracefully
It's quite simple.
Create a CDE Job with the following settings (a sketch of what the application does follows the list):
Name: Any Name. I called it testjob
Spark Application: Jar file provided above
Main Class: com.cloudera.examples.SimpleCDERun
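The actual application is the jar provided above; for illustration only, here is a rough PySpark sketch of the same idea, reading whatever file locations are passed in as program arguments. This is not the provided SimpleCDERun code, just an equivalent outline:

```python
# Illustration only: a rough PySpark outline of what the job does.
# The article's actual job is the provided jar with main class
# com.cloudera.examples.SimpleCDERun; the names here are assumptions.
import sys

from pyspark.sql import SparkSession


def main():
    spark = SparkSession.builder.appName("testjob").getOrCreate()

    # The Lambda trigger passes the s3 location(s) of the new file(s)
    # as program arguments (visible later in the job's stdout logs).
    paths = sys.argv[1:]
    print(f"Received file locations from Lambda: {paths}")

    for path in paths:
        # Process each file; here we simply read it and print a row count.
        df = spark.read.option("header", "true").csv(path)
        print(f"{path}: {df.count()} rows")

    spark.stop()


if __name__ == "__main__":
    main()
```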
Lambda
Create an AWS Lambda function that triggers on put/post for s3. The Lambda function code is simple: it calls CDE for each file posted to s3. The full function is provided in the artifacts section above.
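For orientation, here is a minimal sketch of what such a handler looks like, assuming a Python runtime. The CDE jobs API URL, request payload shape, job name, and token handling are placeholders and assumptions; use the function from the artifacts section and your own virtual cluster details for the real thing.

```python
# Minimal sketch of the Lambda handler (Python runtime assumed).
# The CDE jobs API URL, payload shape, and token are placeholders --
# consult your CDE virtual cluster and the provided Lambda function
# in the artifacts section for the real values.
import json
import urllib.parse
import urllib.request

CDE_JOBS_API = "https://<your-virtual-cluster>/dex/api/v1"  # placeholder
CDE_JOB_NAME = "testjob"
CDE_TOKEN = "<access token>"  # in practice, obtain and refresh this securely


def lambda_handler(event, context):
    # Collect the bucket/key of every object the trigger fired on.
    files = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        files.append(f"s3a://{bucket}/{key}")

    # Pass the file locations to the CDE job as Spark arguments.
    # The body shape below is an assumption; check the CDE jobs API docs.
    payload = json.dumps({"overrides": {"spark": {"args": files}}}).encode("utf-8")
    req = urllib.request.Request(
        url=f"{CDE_JOBS_API}/jobs/{CDE_JOB_NAME}/run",
        data=payload,
        headers={
            "Authorization": f"Bearer {CDE_TOKEN}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        run = json.loads(resp.read().decode("utf-8"))

    # The response contains the job run ID (14 in the example below).
    print(f"Triggered CDE job '{CDE_JOB_NAME}' for {files}: {run}")
    return run
```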
The following are the s3 properties:
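If you prefer to wire up the same put/post trigger from code rather than the console, a boto3 sketch looks like this; the bucket name and Lambda ARN are placeholders, and the Lambda must already allow invocation from s3:

```python
# Sketch: attach the put/post trigger to the bucket with boto3.
# Bucket name and Lambda ARN are placeholders; the Lambda also needs
# a resource-based policy allowing s3 to invoke it.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_notification_configuration(
    Bucket="my-landing-bucket",  # placeholder
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": "arn:aws:lambda:us-east-1:111111111111:function:trigger-cde",  # placeholder
                "Events": ["s3:ObjectCreated:Put", "s3:ObjectCreated:Post"],
            }
        ]
    },
)
```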
Trigger CDE
Upload a file to s3 and Lambda will trigger the CDE job. For example, I uploaded a file named test.csv to s3; once the upload completed, Lambda called CDE to execute the job on that file.
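The upload itself can be done from the console, the AWS CLI, or a couple of lines of boto3 (the bucket name here is a placeholder):

```python
# Upload a sample file to fire the trigger (bucket name is a placeholder).
import boto3

boto3.client("s3").upload_file("test.csv", "my-landing-bucket", "test.csv")
```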
Lambda Log
The first arrow shows the file name (test.csv). The second arrow shows the CDE Job Run ID returned by the call, which in this case is 14.
In CDE, Job Run ID: 14
In CDE, the stdout logs show that the job received the location and name of the file that Lambda was triggered on.
As I said in my last post, CDE is making things super simple. Enjoy.