
How to read from a Kafka topic using Spark (streaming) in a secure Cluster?

Explorer

Specifications:

  • HDP 2.3.2
  • Kerberos enabled
  • Kafka topic exists and user <username> has read access
  • Kafka topic is readable/writable using the Kafka command line tools with specified user
  • We already have a Spark Streaming application that works fine in an unsecured cluster, reading from a Kafka topic.

What would be a working example of a Spark Streaming job that reads input from Kafka in a secure cluster under the above conditions?

We got a Spark Streaming job working that reads from and writes to secure HBase, and thought it couldn't be that different to do it with Kafka.

1 ACCEPTED SOLUTION

Spark Streaming hasn't yet enabled security for its Kafka connector.


21 REPLIES

Spark Streaming hasn't yet enabled security for its Kafka connector.

@schintalapani - Do you have more details? For example, are there JIRA issues or dev email threads tracking this?

Security is not enabled for the direct API connector, but it is enabled with the old receiver-based one.
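
For context, here is a minimal Scala sketch of the two consumption paths in Spark 1.3-1.5 (host names and the topic are placeholders; only the receiver-based path is targeted by the HDP security patches discussed below):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object OldVsDirect {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("OldVsDirect"), Seconds(10))

    // Old receiver-based API: consumes through ZooKeeper. This is the code path
    // that can be made to work against a kerberized Kafka.
    val receiverStream = KafkaUtils.createStream(
      ssc,
      "zk-host:2181",        // ZooKeeper quorum (placeholder)
      "my-consumer-group",   // consumer group id
      Map("my-topic" -> 1))  // topic -> number of receiver threads

    // Direct API: talks to the brokers directly; no Kerberos support in these versions.
    val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc,
      Map("metadata.broker.list" -> "broker-host:6667"),
      Set("my-topic"))

    receiverStream.map(_._2).print()
    ssc.start()
    ssc.awaitTermination()
  }
}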

New Contributor

Can you shed some light on how you use Spark Streaming to read/write to HBase? I'm having a hard time using Spark/Scala to access HBase. Thanks!

Expert Contributor

I have made a small change to the Spark Streaming code here: https://github.com/davidtambjss/spark-release/tree/HDP-2.3.0.0-KERBEROS_KAFKA_STREAMING and was able to get Spark Streaming to work with Kafka with Kerberos.

All you need to do is rebuild spark-streaming-kafka_2.10-1.3.1.2.3.0.0-2557.jar from there and use that jar.

New Contributor

Hi David,

Can you please provide the running code?

In pom.xml I'm getting the compilation error below:

Project build error: Non-resolvable parent POM for com.accenture.ngap:spark-test:[unknown-version]: Failure to transfer

Thanks,

Krishna

Checked with engineering today, and the feature will officially land in the upcoming HDP 2.4.2 release. There will be documentation around the fact that users will need to use the HDP Spark Streaming Kafka jars (instead of the vanilla Apache ones).

Expert Contributor

@Ali Bajwa we have HDP 2.4.2, and when we try to consume messages from the secured Kafka topics using Spark Streaming (Spark 1.6.1), we can't consume any messages.

I followed the documentation on https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.2/bk_spark-guide/content/spark-streaming-kafk...

Did this patch land after 2.4.2, or am I missing something?

Thanks.

Expert Contributor

What is the issue you are facing after following the above documentation in HDP 2.4.2?

Explorer

If you want to test it, even on HDP 2.3.x (Spark 1.5.2) you can use the following library, which has been compiled with the necessary changes:

https://github.com/beto983/Streaming-Kafka-Spark-1.5.2/blob/master/spark-streaming-kafka_2.10-1.5.2....

Do you have the code with the changes that you made in that jar?

Contributor

Hi Stefan Kupstaitis-Dunkler,

We are using HDP 2.3.4.0 and run Kafka and Spark Streaming (Scala & Python) on a (Kerberos + Ranger) secured cluster.

You need to add a JAAS config location to the spark-submit command. We are using it in yarn-client mode. The kafka_client_jaas.conf file is sent as a resource with the --files option and is available in the YARN container. We have not got ticket renewal working yet...

spark-submit (all your stuff) \
  --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.conf=kafka_client_jaas.conf" \
  --files "your_other_files,kafa_client_jaas.conf,serviceaccount.headless.keytab" \
  (rest of your stuff)

# --principal and --keytab do not work here; they conflict with the keytab shipped via --files.
# The jaas file will be placed in the yarn-containers by Spark.
# The file contains the reference to the keytab-file and the principal for Kafka and ZooKeeper:

KafkaClient {
  com.sun.security.auth.module.Krb5LoginModule required
  useTicketCache=false
  useKeyTab=true
  principal="serviceaccount@DOMAIN.COM"
  keyTab="serviceaccount.headless.keytab"
  renewTicket=true
  storeKey=true
  serviceName="kafka";
};
Client {
  com.sun.security.auth.module.Krb5LoginModule required
  useTicketCache=false
  useKeyTab=true
  principal="serviceaccount@DOMAIN.COM"
  keyTab="serviceaccount.headless.keytab"
  renewTicket=true
  storeKey=true
  serviceName="zookeeper";
};

If you need more info, feel free to ask.

Greetings,

Alexander

I have tried this approach and it doesn't work 😞. Spark doesn't throw any error, but it doesn't read anything from the kerberized Kafka.

After a hard day our team solved the issue, and we have created a little Spark application that can read from a kerberized Kafka, launched in yarn-cluster mode. Attached you have the pom.xml needed (very important: use the Kafka 0.9 dependency) and the Java code.

It is very important to use the old Kafka API and to use PLAINTEXTSASL (and not SASL_PLAINTEXT, a value coming from the new Kafka API) when configuring the Kafka Stream.
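
As a sketch of what that looks like in Scala, assuming the HDP-patched spark-streaming-kafka jar that passes security.protocol through to the old consumer (broker, topic and group names are placeholders):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object SecureKafkaStream {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("SecureKafkaStream"), Seconds(10))

    val kafkaParams = Map(
      "zookeeper.connect" -> "zk-host:2181",    // the old consumer registers through ZooKeeper
      "group.id"          -> "my-consumer-group",
      "security.protocol" -> "PLAINTEXTSASL")   // HDP value; NOT SASL_PLAINTEXT from the new API

    // Old receiver-based API with explicit kafkaParams, so security.protocol is passed through.
    val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Map("my-topic" -> 1), StorageLevel.MEMORY_AND_DISK_SER)

    stream.map(_._2).print()
    ssc.start()
    ssc.awaitTermination()
  }
}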

The jaas.conf file is the same one that @Alexander Bij shared. Be careful: the keytab path is not a local path; it is the path where the executor stores the file (by default you only have to indicate the name of the keytab).

KafkaClient {
  com.sun.security.auth.module.Krb5LoginModule required
  useTicketCache=false
  useKeyTab=true
  principal="serviceaccount@DOMAIN.COM"
  keyTab="serviceaccount.headless.keytab"
  renewTicket=true
  storeKey=true
  serviceName="kafka";
};
Client {
  com.sun.security.auth.module.Krb5LoginModule required
  useTicketCache=false
  useKeyTab=true
  principal="serviceaccount@DOMAIN.COM"
  keyTab="serviceaccount.headless.keytab"
  renewTicket=true
  storeKey=true
  serviceName="zookeeper";
};

Kafka will use this information to retrieve the Kerberos configuration required when connecting to ZooKeeper and to the broker. To do that, the executor needs the keytab file to obtain the Kerberos token before connecting to the broker. That is the reason for also sending the keytab to the container when launching the application. Please notice that extraJavaOptions refers to the container-local path (the file will be placed in the root folder of the container's temporary configuration), while on the machine that launches the application the path must point to where the jaas file sits locally.

spark-submit (all your stuff) \
  --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=kafka_client_jaas.conf" \
  --files "your_other_files,kafka_client_jaas.conf,serviceaccount.headless.keytab" \
  (rest of your stuff)

Also verify that the parameter where you indicate the jaas file is java.security.auth.login.config.
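
If messages still do not arrive, one hypothetical sanity check (not from the original posts) is to print that property from inside a task and confirm it reached the executor JVMs:

import org.apache.spark.{SparkConf, SparkContext}

object JaasCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("JaasCheck"))
    // Run a few dummy tasks and collect the JAAS property as each executor JVM sees it.
    val seen = sc.parallelize(1 to 4, 4)
      .map(_ => String.valueOf(System.getProperty("java.security.auth.login.config")))
      .collect()
    seen.foreach(println)  // should print kafka_client_jaas.conf, not "null"
    sc.stop()
  }
}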

And that's all. Following the above instructions and using the attached code, you will be able to read from kerberized Kafka using Spark 1.5.2.

In our kerberized environment we have Ambari 2.2.0.0, HDP 2.3.4.0, and Spark 1.5.2. I hope this information helps you.

New Contributor

Hi Javier,

Can you please upload the running code?

In pom.xml it's giving the compilation error below:

Project build error: Non-resolvable parent POM for com.accenture.ngap:spark-test:[unknown-version]: Failure to transfer com.accenture.ngap:parent:pom:0.0.1 from http://repo.hortonworks.com/content/repositories/releases/ was cached in the local repository, resolution will not be reattempted until the update interval of hortonworks has elapsed or updates are forced. Original error: Could not transfer artifact com.accenture.ngap:parent:pom:0.0.1 from/to hortonworks (http://repo.hortonworks.com/content/repositories/releases/): connect timed out and 'parent.relativePath' points at wrong local POM

New Contributor

Are there any examples for HDP 2.3.2 - Spark Streaming (1.4.1) and Kafka (0.8.2) with Java in a Kerberos environment?

Did you try the code that I shared at https://community.hortonworks.com/comments/30193/view.html a few comments above?

New Contributor

Hi Team,

I am currently using Spark Streaming with Kafka without any SSL, and it works fine.

But we need to enable SSL for Spark Streaming with Kafka.

Is this supported?

I see that Spark 2.0 also supports Kafka 0.8, but SSL was only added in Kafka 0.9.

Is there any way through which HDP provides SSL-enabled Spark Streaming with Kafka?

If yes, can anyone share the documentation link or any info?

Thanks

Expert Contributor

I think we also have the same problem with Kafka 0.9, Spark 1.6.1, and HDP 2.4.2.
