Created on 09-28-202005:54 AM - edited on 10-01-202002:25 AM by VidyaSargur
This article is inspired by another article Spark Structured Streaming example with CDE, which talks about how to use CDE to read/write from Kafka using Spark Streaming. The difference in this article is that this article talks about how to connect to "Kafka" of the CDP Streams Messaging data hub cluster.
The following are the differences in CDP Streams Messaging data hub cluster:
The assumption is that the CDP environment, Data lake, Flow Management, and Streaming Messaging datahub cluster are already created.
Flow Management Datahub Cluster
Once the above clusters are provisioned, setup NiFi to read tweets from twitter and publish it to Kafka.
Use this NiFi template to upload to NiFi and setup the Read Twitter > Push Kafka pipeline:
Edit the KafkaProducer processor and set the Kafka broker to the hostname of Kafka of the SMM cluster.
Edit the GetTwitter processor group with the twitter developer credentials [access key, secret].
Streaming Messaging Manager Datahub Cluster
Nothing to be done 🙂
CDE Cluster
The overall goal of the Spark job is to read from Kafka topic named "twitter" pushed by NiFi, extract the text from tweet messages, and push to another Kafka topic "tweet".
Note: Spark streaming job needs to talk to Kafka with SSL, this would mean the CA certificate of Kafka needs to be configured in the Spark job. The JRE of the default spark docker image already has the CA certificate of Kafka. The following is the link to the code that points to the same:
To configure a CA certificate of Kafka in the Spark job, do the following:
Note: For custom trust-store, the CA certificate of Kafka can be downloaded from Control Plane UI ( "Get FreeIPA Certificate" ) and the docker path to the same can be provided in the job.
Build the spark streaming application from the following repo.