Created on 08-28-2020 09:44 AM - edited 10-26-2020 09:45 AM
I recently had the opportunity to work with Cloudera Data Engineering to stream data from Kafka. What stood out is that I was able to deploy code without worrying much about how to configure the back-end components.
This demo pulls from the Twitter API using NiFi and writes the payload to a Kafka topic named "twitter". Spark Structured Streaming on Cloudera Data Engineering Experience (CDE) pulls from the twitter topic, extracts the text field from the payload (which is the tweet itself), and writes it back to another Kafka topic named "tweet".
The following is an example of a twitter payload. The objective is to extract only the text field:
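As a minimal sketch of that extraction outside of Spark, the snippet below parses a payload and returns only its text field. The sample payload is a trimmed-down, hypothetical tweet (the values are invented for illustration), and `extract_text` is an illustrative helper, not part of the job's source code.

```python
import json


def extract_text(payload):
    """Return the 'text' field of a tweet payload, or None if it is missing or invalid."""
    try:
        # Kafka delivers values as bytes, so accept both bytes and str.
        if isinstance(payload, bytes):
            payload = payload.decode("utf-8")
        tweet = json.loads(payload)
    except ValueError:
        # Covers both malformed JSON and undecodable bytes.
        return None
    if not isinstance(tweet, dict):
        return None
    text = tweet.get("text")
    return text if isinstance(text, str) else None


# Hypothetical, heavily trimmed payload; real payloads carry many more fields.
SAMPLE_PAYLOAD = json.dumps({
    "created_at": "Fri Aug 28 13:44:00 +0000 2020",
    "id": 1,
    "text": "Hello from CDE",
    "user": {"screen_name": "example_user"},
    "lang": "en",
})

print(extract_text(SAMPLE_PAYLOAD))
```

Returning `None` for malformed or text-less payloads keeps bad records from stopping the pipeline; the caller can simply drop them.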
Cloudera Data Engineering (CDE) is a serverless service for Cloudera Data Platform that allows you to submit Spark jobs to an auto-scaling cluster. CDE enables you to spend more time on your applications, and less time on infrastructure.
Complete setup instructions here.
I posted all my source code here.
If you're not interested in building the jar, that's fine. I’ve made the job Jar available here.
Oct 26, 2020 update: I added source code showing how to connect CDE to Kafka DH, available here. Users should be able to run the code as is, without needing a JAAS file or keytab.
This article is focused on Spark Structured Streaming with CDE. I'll be super brief here.
Use the GetTwitter processor (which requires a free Twitter API developer account) and write the output to the Kafka twitter topic.
The Spark job pulls from the source Kafka topic (twitter), extracts the text value from the payload (which is the tweet itself), and writes it to the target topic (tweet).
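The flow just described can be sketched in PySpark roughly as follows. This is not the job's actual source code (which is packaged as a jar); it is an illustrative equivalent that assumes the Spark Kafka connector is on the classpath. The broker address and checkpoint directory are placeholders you would supply, the helper names are hypothetical, and security settings (for example SASL or Kerberos options) are omitted.

```python
def kafka_source_options(brokers, topic):
    """Options for reading a Kafka topic with Structured Streaming."""
    return {
        "kafka.bootstrap.servers": brokers,
        "subscribe": topic,
        "startingOffsets": "latest",
    }


def kafka_sink_options(brokers, topic, checkpoint_dir):
    """Options for writing to a Kafka topic; the Kafka sink requires a checkpoint location."""
    return {
        "kafka.bootstrap.servers": brokers,
        "topic": topic,
        "checkpointLocation": checkpoint_dir,
    }


def start_tweet_stream(spark, brokers, checkpoint_dir):
    """Read 'twitter', keep only the tweet text, and write it to 'tweet'."""
    # Imported here so the option helpers above can be used without Spark installed.
    from pyspark.sql.functions import col, get_json_object

    raw = (
        spark.readStream.format("kafka")
        .options(**kafka_source_options(brokers, "twitter"))
        .load()
    )
    # Kafka values arrive as bytes: cast to string, then pull out $.text.
    tweets = raw.select(
        get_json_object(col("value").cast("string"), "$.text").alias("value")
    ).where(col("value").isNotNull())

    return (
        tweets.writeStream.format("kafka")
        .options(**kafka_sink_options(brokers, "tweet", checkpoint_dir))
        .start()
    )
```

Inside a Spark application you would call `start_tweet_stream(spark, "<broker-host>:9092", "<checkpoint-dir>").awaitTermination()` with your own broker and checkpoint values. The output column is aliased to `value` because the Kafka sink writes whatever is in the `value` column as the message body.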
Job Details
At this point, only the text (tweet) from the twitter payload is being written to the tweet Kafka topic.
That's it! You now have a Spark Structured Streaming job running on CDE, fully autoscaled. Enjoy!