Support Questions

Hitesh88 · ‎07-12-2018

I'm running spark streaming consumer job for kafka and kafka is running in SASL_SSL mode.

The job is configured to run in yarn client mode which means spark driver will run on local edge node (from where job is invoked) and spark executors run on hadoop cluster. Thus, I'm sending all required config files (jaas, keytab, keystore etc.) to executor cache via --files option. However, driver also needs keystore and keytab file to communicate to kafka along with executor but driver is not able to locate the keystore file.

Is there a way to configure spark driver to load keystore and keytab files (or any other files present locally on edge node) in yarn client mode on driver's cache (way i can do for executors using --files option).

Has anyone faced similar issue. Any help is appreciated.

Below is the error message -

org.apache.kafka.common.KafkaException: Failed to construct kafka consumer
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.kafka.common.KafkaException: org.apache.kafka.common.KafkaException: java.io.FileNotFoundException: kafka_client_truststore.jks (No such file or directory)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.kafka.common.KafkaException: java.io.FileNotFoundException: kafka_client_truststore.jks (No such file or directory)
Caused by: java.io.FileNotFoundException: kafka_client_truststore.jks (No such file or directory)

Yuexin Zhang · ‎07-12-2018

There is a complete sample in the Cloudera engineering blog [1], note the requirements mentioned there.

You will need to provide the jaas file with java options, see [2], notice the options used in spark2-submit:

 --driver-java-options "-Djava.security.auth.login.config=./spark_jaas.conf"...

 --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./spark_jaas.conf"

along with distributed files via "--files".

You will also need to set the SSL parameters when initlizing the kafka client, see [3].

[1] https://blog.cloudera.com/blog/2017/05/reading-data-securely-from-apache-kafka-to-apache-spark/

[2] https://github.com/markgrover/spark-secure-kafka-app

[3] https://github.com/markgrover/spark-secure-kafka-app/blob/master/src/main/java/com/cloudera/spark/ex...

Hitesh88 · ‎07-13-2018

Hi Yuexin,

Thanks for your response. I have gone through all these links and many more to research on this issue.

And I have already done all this configuration of setting jaas file for both driver and executor and also setting kafka ssl settings in the kafkaparams in the program.

$SPARK_HOME/bin/spark-submit \

--conf spark.yarn.queue=$yarnQueue \

--conf spark.hadoop.yarn.timeline-service.enabled=false \

--conf spark.yarn.archive=$sparkYarnArchive \

$sparkOpts \

--properties-file $sparkPropertiesFile \

--files /conf/kafka/kafka_client_jaas_dev.conf,/conf/kafka/krb5_dev.conf,/conf/keystore/kafka_client_truststore.jks,conf/kafka/kafka/kafkausr.keytab \

--conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=kafka_client_jaas_dev.conf -Djava.security.krb5.conf=krb5_dev.conf -Dsun.security.krb5.debug=true" \

--driver-java-options "-Djava.security.auth.login.config=/conf/kafka_client_jaas_dev.conf -Djava.security.krb5.conf=/conf/krb5_dev.conf \

--class com.commerzbank.streams.KafkaHDFSPersister $moduleJar \

$1 $2 $3 $4 $KafkaParamsconfFile \

Problem here is running in yarn client mode (and all links you mentioned also talks about yarn cluster mode) but i have to run in yarn client mode only here due to some project constraints , and issue is if I specify full path of keystore file(where it is located on edge node where I run the command) in 'ssl.truststore.location' parameter then executors cannot find this file in their cache as it looks for complete path+file name and executor cache contains file with name 'kafka_client_truststore.jks'.

And when i pass the keystore file (kafka_client_truststore.jks) without path to 'ssl.truststore.location' parameter then it fails for driver as driver looks in current path of edge node from where the job is run (and if I run the job from same directory /conf/keystore where this keystore file is present on edge node, then job succeeds)

Is there way to solve this in your view or better way to load same set of files for driver and executors running in yarn client mode.

Regards,

Hitesh

Cloudera Community

Support Questions

Issue in running spark streaming job in yarn client mode with kerberized and ssl kafka