The all-new Cloudera Data Engineering Experience
I recently had the opportunity to work with Cloudera Data Engineering to stream data from Kafka. It's quite interesting how I was able to deploy code without worrying much about how to configure the back-end components.
Demonstration
This demo will pull from the Twitter API using NiFi and write the payload to a Kafka topic named "twitter". Spark Structured Streaming on the Cloudera Data Engineering Experience (CDE) will pull from the twitter topic, extract the text field from the payload (which is the tweet itself), and write it back to another Kafka topic named "tweet".
The following is an example of a Twitter payload. The objective is to extract only the text field:
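An abbreviated, illustrative payload is shown below; the field values here are made up, and only the shape of the JSON matters:

```json
{
  "created_at": "Fri Aug 28 09:44:00 +0000 2020",
  "id_str": "1299260785365848064",
  "text": "This is the tweet text we want to extract",
  "user": {
    "screen_name": "some_user",
    "followers_count": 42
  },
  "lang": "en"
}
```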
What is Cloudera Data Engineering?
Cloudera Data Engineering (CDE) is a serverless service for Cloudera Data Platform that allows you to submit Spark jobs to an auto-scaling cluster. CDE enables you to spend more time on your applications, and less time on infrastructure.
How do I begin with Cloudera Data Engineering (CDE)?
If you're not interested in building the jar, that's fine; I've made the job jar available here.
Oct 26, 2020 update - I added source code showing how to connect CDE to a Kafka Data Hub cluster, available here. Users should be able to run the code as is, without needing a JAAS config or keytab.
Kafka Setup
This article is focused on Spark Structured Streaming with CDE, so I'll be super brief here.
Create two Kafka topics (see the sketch after this list for one way to create them programmatically):
- twitter: used to ingest the firehose data from the Twitter API
- tweet: receives the extracted tweet text produced by Spark Structured Streaming
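Topics can be created with the standard Kafka tooling; as one option, here is a minimal sketch using Kafka's AdminClient from Scala. The broker address, partition count, and replication factor are assumptions; adjust them for your cluster.

```scala
import java.util.Properties
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}
import scala.collection.JavaConverters._

object CreateTopics extends App {
  val props = new Properties()
  // Placeholder broker address; point this at your Kafka cluster.
  props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-host:9092")

  val admin = AdminClient.create(props)
  try {
    // 3 partitions and replication factor 1 are illustrative defaults.
    val topics = Seq(
      new NewTopic("twitter", 3, 1.toShort),
      new NewTopic("tweet", 3, 1.toShort)
    )
    admin.createTopics(topics.asJava).all().get() // block until creation completes
  } finally {
    admin.close()
  }
}
```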
NiFi Setup
This article is focused on Spark Structured Streaming with CDE. I'll be super brief here.
Use the GetTwitter processor (which requires a Twitter API developer account, free of charge) and write the payload to the Kafka twitter topic.
The Spark Structured Streaming job will pull from the source Kafka topic (twitter), extract the text value from the payload (which is the tweet itself), and write to the target topic (tweet).
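A minimal sketch of such a job is shown below, assuming Spark 2.4 with the spark-sql-kafka-0-10 connector and a broker reachable at broker-host:9092; the object name and option values are illustrative, not the exact contents of the linked jar.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, get_json_object}

object TweetExtractor {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("spark-kafka-streaming").getOrCreate()

    // Read raw Twitter payloads from the source topic.
    val source = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker-host:9092") // placeholder broker
      .option("subscribe", "twitter")
      .load()

    // Kafka values arrive as binary; cast to string and pull out the text field.
    val tweets = source
      .select(get_json_object(col("value").cast("string"), "$.text").alias("value"))
      .filter(col("value").isNotNull)

    // Write only the tweet text to the target topic. The Kafka sink requires
    // a checkpoint location so the stream can recover after a restart.
    val query = tweets.writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker-host:9092")
      .option("topic", "tweet")
      .option("checkpointLocation", "/tmp/tweet-extract-checkpoint")
      .start()

    query.awaitTermination()
  }
}
```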
CDE
Assuming CDE access is available, navigate to Virtual Clusters > View Jobs.
Click on Create Job:
Job Details
- Name: the job name
- Spark Application File: the jar created from the sbt package, spark-kafka-streaming_2.11-1.0.jar (see the build.sbt sketch below). Another option is to simply provide a URL where the jar is available.
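For reference, a minimal build.sbt that would produce a jar with that name might look like the following; the Spark version is an assumption inferred from the _2.11 Scala suffix in the jar name, not taken from the original project.

```scala
// build.sbt - minimal sketch; the Spark version is an assumption.
name := "spark-kafka-streaming"
version := "1.0"
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  // "provided" because the Spark runtime is supplied by the cluster.
  "org.apache.spark" %% "spark-sql" % "2.4.5" % "provided",
  "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.4.5"
)
```

With this layout, running sbt package emits target/scala-2.11/spark-kafka-streaming_2.11-1.0.jar, matching the file name referenced above.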