This article is inspired by the article Spark Structured Streaming example with CDE, which describes how to use CDE to read from and write to Kafka with Spark Streaming. The difference here is that this article covers how to connect to Kafka on a CDP Streams Messaging Data Hub cluster.
The following are the differences when connecting to Kafka on a CDP Streams Messaging Data Hub cluster. The assumption is that the CDP environment, Data Lake, Flow Management, and Streams Messaging Data Hub clusters are already created; on that front, there is nothing to be done 🙂
The overall goal of the Spark job is to read from the Kafka topic named "twitter" that NiFi populates, extract the text from the tweet messages, and push it to another Kafka topic named "tweet". A minimal sketch of such a job follows.
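As a reference, here is a sketch of what such a job could look like. It is an illustration, not the actual job (the real class, linked below, is org.cloudera.qe.e2es.TwitterStructuredStreaming); the broker-list argument (kbrokers), the "$.text" JSON field, and the checkpoint path are assumptions.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Illustrative sketch only; the real job is org.cloudera.qe.e2es.TwitterStructuredStreaming.
object TwitterStructuredStreamingSketch {
  def main(args: Array[String]): Unit = {
    val kbrokers = args(0) // Kafka broker list, passed as a job argument (assumption)

    val spark = SparkSession.builder()
      .appName("twitter-streaming")
      .getOrCreate()

    // Read raw tweets from the "twitter" topic that NiFi populates.
    // SSL options are omitted here; see the note on the CA certificate below.
    val tweets = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", kbrokers)
      .option("subscribe", "twitter")
      .load()

    // Extract the tweet text from the JSON payload; the "$.text" field name is an assumption.
    val text = tweets
      .selectExpr("CAST(value AS STRING) AS json")
      .select(get_json_object(col("json"), "$.text").alias("value"))

    // Push the extracted text to the "tweet" topic.
    text.writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", kbrokers)
      .option("topic", "tweet")
      .option("checkpointLocation", "/tmp/twitter-streaming-checkpoint")
      .start()
      .awaitTermination()
  }
}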
Note: The Spark streaming job needs to talk to Kafka over SSL, which means the CA certificate of Kafka must be configured in the Spark job. The JRE of the default Spark docker image already contains the CA certificate of Kafka. The following link points to the code that relies on this:
https://github.com/bgsanthosh/spark-streaming/blob/master/spark-streaming/src/main/scala/org/cloudera/qe/e2es/TwitterStructureStreaming.scala#L23
To configure the CA certificate of Kafka in the Spark job, do the following:
Note: For a custom truststore, the CA certificate of Kafka can be downloaded from the Control Plane UI ("Get FreeIPA Certificate") and the docker path to it can be provided in the job; a sketch of the relevant options is shown below.
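As a sketch under those assumptions, the Kafka source options with SSL could look like the following. The truststore location is the JRE default inside the image and the password is the well-known JRE default; both are assumptions, and for a custom truststore the location would be the docker path of the downloaded FreeIPA certificate.

import org.apache.spark.sql.{DataFrame, SparkSession}

object KafkaSslOptionsSketch {
  // Sketch: reading the "twitter" topic over SSL. Paths and password are assumed defaults.
  def readTweets(spark: SparkSession, kbrokers: String): DataFrame =
    spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", kbrokers)
      .option("subscribe", "twitter")
      .option("kafka.security.protocol", "SSL")
      // JRE default truststore shipped in the Spark docker image (assumed path);
      // for a custom truststore, point this at the docker path of the FreeIPA certificate.
      .option("kafka.ssl.truststore.location", "/usr/java/default/jre/lib/security/cacerts")
      .option("kafka.ssl.truststore.password", "changeit") // JRE default password
      .load()
}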
Build the Spark job from https://github.com/bgsanthosh/spark-streaming/tree/master/spark-streaming, then create a CDE resource, upload the jar, and define the job:

cde resource create --name spark-streaming
cde resource upload --name spark-streaming --local-path spark-streaming.jar --resource-path spark-streaming.jar
cat twitter-streaming.json
{
  "mounts": [
    {
      "resourceName": "spark-streaming"
    }
  ],
  "name": "twitter-streaming",
  "spark": {
    "className": "org.cloudera.qe.e2es.TwitterStructuredStreaming",
    "args": [],
    "driverCores": 1,
    "driverMemory": "1g",
    "executorCores": 1,
    "executorMemory": "1g",
    "file": "spark-streaming.jar",
    "pyFiles": [],
    "files": [],
    "numExecutors": 4
  }
}
cde job import --file twitter-streaming.json
cde job run --name twitter-streaming

There you go!!! Spark Streaming on CDE is up and running!!
To create a CDP environment and provision the Flow Management, Streams Messaging Data Hub, and CDE clusters, see the following documents:
Created on 10-22-2020 05:19 PM
Thanks for the article, Santosh. I think we need to update the job definition with kbrokers as an argument for the job to run.