Created on 10-19-201805:05 AM - edited 08-17-201906:06 AM
This is a quick how-to on integrating spark on zeppelin with kerberised kafka. To do this first we will enable impersonation support in zeppelin for spark. By Impersonation, i mean jobs will be executed / submitted to spark by zeppelin as the user logged in to zeppelin and not the default zeppelin user. In my case i will be using my user "Karthik"
Now to enable impersonation while submitting to spark, i would recommend using the livy server. The livy server and the interpreter natively support impersonation. Spark interpreter impersonation is possible, but you may have to configure SSh keys or sudo capability for zeppelin which may not be recommended. You can livy to hdp by going to a host and installing "Spark client" , followed by the livy server.
Restart Zeppelin , to enable the livy interpreter. Ensure that livy impersonation and livy proxy permissions are setup correctly in ambari. This is done when you install livy server by default.
Log in to zeppelin using your credentials.click the drop-down with your username on the top right and select interpreter option.
on the interpreter menu, scroll down to the livy interpreter and change interpreter instantiation to be "per user". you can change process option to scoped / isolated based on your needs, either option works.
At this point you write a test spark code and execute. check the yarn resourcemanger ui for the task, it should run under the logged in user, "karthik" in my case.
I attempted this using spark streaming, to enable spark streaming to connect with kafka, you need to include the spark-streaming-kafka jar as a dependency for your spark job. Please add the below property and value to your livy interpreter. ensure the correct jar version based on kafka and spark versions.
Now to enable the spark job to authenticate itself to a kerberised kafka. The trick is to specify the jaas config along with correct keytab to both the spark driver and executor. To do this i created a kafka-jass config with reference to my users keytab and put in /tmp/ in HDFS. I also put in the /tmp/ folder of all my data nodes. I also copied my keytab to all data nodes under /tmp
When passing the jaas conf or keytab to spark, i have two options
1. I can push them both to HDFS and refer to them in my livy spark options. The options are exactly the same as what you would pass to spark-submit, except you add "livy." in front of them. In this case since i copied the keytabs already to the datanode /tmp folders , i only had to pass the jaas config as a file to livy.
I then passed the name of the file (the part after # ) to my spark driver and executer as extrajava options...
This what my jaas-config looks like . Since, i had already copied my keytab to all compute nodes and set it with right permissions for me, i did not have to pass them in the config. I think it may be also better to do that with jaas-config so users can pass them from inside their code and not have to depend on a interpreter config. I will show how to do this later.
2. The second option is to simply pass the jaas config file location and envrionment variable using the System.setProperty() right inside the spark code. This will ensure that the user can use his own jaas conf and keytab without having to change interpreter config.
Run, the spark job and you should be able to connect to the kerberised kafka broker and access topics. if it fails for some reason, check you have uploaded the jaas-config to right location in hdfs, your keytabs have proper permission and your user has permissions to the topic in ranger or acls.