Created on 09-28-2020 05:54 AM - edited on 10-01-2020 02:25 AM by VidyaSargur
This article is inspired by the article Spark Structured Streaming example with CDE, which describes how to use CDE to read from and write to Kafka with Spark Streaming. The difference here is that this article covers connecting to Kafka on a CDP Streams Messaging data hub cluster.
The following are the differences when using a CDP Streams Messaging data hub cluster:
The assumption is that the CDP environment, Data Lake, Flow Management, and Streams Messaging data hub clusters are already created.
Nothing to be done 🙂
The overall goal of the Spark job is to read from the Kafka topic named "twitter" (populated by NiFi), extract the text from the tweet messages, and push it to another Kafka topic named "tweet".
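As a rough sketch, such a job could look like the following. This is not the actual TwitterStructureStreaming code from the linked repository: the object name, the "$.text" JSON field, and the checkpoint path are assumptions, and the broker list is taken from the job arguments.

```scala
// Sketch: read tweets from the "twitter" topic, extract the text,
// and write it to the "tweet" topic over SSL.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, get_json_object}

object TwitterStreamingSketch {
  def main(args: Array[String]): Unit = {
    val kbrokers = args(0) // Kafka bootstrap servers, passed via the job definition's "args"

    val spark = SparkSession.builder.appName("twitter-streaming").getOrCreate()

    // Source: Kafka topic "twitter", over SSL
    val tweets = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", kbrokers)
      .option("kafka.security.protocol", "SSL")
      .option("subscribe", "twitter")
      .load()

    // Extract the tweet text from the JSON payload;
    // the Kafka sink expects the payload in a "value" column
    val text = tweets
      .selectExpr("CAST(value AS STRING) AS json")
      .select(get_json_object(col("json"), "$.text").alias("value"))

    // Sink: Kafka topic "tweet"
    text.writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", kbrokers)
      .option("kafka.security.protocol", "SSL")
      .option("topic", "tweet")
      .option("checkpointLocation", "/tmp/twitter-streaming-checkpoint")
      .start()
      .awaitTermination()
  }
}
```

The sketch assumes the broker endpoint is supplied as the first job argument, which matches the comment at the end of this article about passing kbrokers to the job.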
Note: The Spark streaming job needs to talk to Kafka over SSL, which means the Kafka CA certificate must be configured in the Spark job. The JRE of the default Spark docker image already includes the Kafka CA certificate. The following link points to the relevant code:
https://github.com/bgsanthosh/spark-streaming/blob/master/spark-streaming/src/main/scala/org/cloudera/qe/e2es/TwitterStructureStreaming.scala#L23
To configure the Kafka CA certificate in the Spark job, do the following:
Note: For a custom trust-store, the Kafka CA certificate can be downloaded from the Control Plane UI ("Get FreeIPA Certificate"), and the docker path to it can be provided in the job.
https://github.com/bgsanthosh/spark-streaming/tree/master/spark-streaming
cde resource create --name spark-streaming
cde resource upload --name spark-streaming --local-path spark-streaming.jar --resource-path spark-streaming.jar
cat twitter-streaming.json
{
  "mounts": [
    {
      "resourceName": "spark-streaming"
    }
  ],
  "name": "twitter-streaming",
  "spark": {
    "className": "org.cloudera.qe.e2es.TwitterStructuredStreaming",
    "args": [
    ],
    "driverCores": 1,
    "driverMemory": "1g",
    "executorCores": 1,
    "executorMemory": "1g",
    "file": "spark-streaming.jar",
    "pyFiles": [],
    "files": [],
    "numExecutors": 4
  }
}
cde job import --file twitter-streaming.json
cde job run --name twitter-streaming
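To sanity-check the pipeline once the job is running, one option is to consume the output topic with the standard Kafka console consumer. This is a fragment, not a runnable command: the broker endpoint is a placeholder, and client-ssl.properties stands for a properties file carrying your SSL settings.

```
kafka-console-consumer.sh \
  --bootstrap-server <kafka-broker-host>:9093 \
  --topic tweet \
  --consumer.config client-ssl.properties \
  --from-beginning
```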
There you go! Spark Streaming on CDE is up and running!
To create a CDP environment and to provision the Flow Management, Streams Messaging, and CDE clusters, see the following documents:
Created on 10-22-2020 05:19 PM
Thanks for the article Santosh. I think we need to update the job definition with kbrokers as an argument for the job to run.
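Following up on that comment, one way to sketch the fix is to pass the broker list through the empty "args" array in the job definition; the endpoint below is a placeholder:

```
"args": [
  "<kafka-broker-host>:9093"
]
```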