Member since
09-23-2015
70
Posts
87
Kudos Received
7
Solutions
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 2067 | 09-20-2016 09:11 AM
 | 1274 | 05-17-2016 11:58 AM
 | 1008 | 04-18-2016 07:27 PM
 | 1084 | 04-14-2016 08:25 AM
 | 1322 | 03-24-2016 07:16 PM
09-26-2016
03:05 PM
3 Kudos
Controlling the environment of an application is vital for its functionality and stability. Especially in a distributed environment it is important for developers to have control over the versions of dependencies. In such a scenario it is a critical task to ensure that possibly conflicting requirements of multiple applications do not disturb each other. That is why frameworks like YARN ensure that each application is executed in a self-contained environment - typically in a Linux Container or Docker Container - that is controlled by the developer. In this post we show what this means for Python environments being used by Spark.
YARN Application Deployment
As mentioned earlier, YARN executes each application in a self-contained environment on each host. This ensures the execution in a controlled environment managed by individual developers. The way this works, in a nutshell, is that the dependencies of an application are distributed to each node, typically via HDFS. This figure simplifies the fact that HDFS is actually being used to distribute the application. See HDFS distributed cache for reference.
The files are uploaded to a staging folder /user/${username}/.${application} of the submitting user in HDFS. Because of the distributed architecture of HDFS it is ensured that multiple nodes have local copies of the files. In fact, to ensure that a large fraction of the cluster has a local copy of the application files and does not need to download them over the network, the HDFS replication factor is set much higher for these files than the default of 3. Often a number between 10 and 20 is chosen for the replication factor.
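If you want to verify this behavior on your own cluster, the replication factor of the staged files can be checked with a small script like the following (a minimal sketch; the staging path is an assumption and depends on the submitting user and framework, for Spark typically a .sparkStaging folder):
import subprocess
# assumption: Spark stages its application files under /user/<user>/.sparkStaging/<applicationId>/
staging_glob = "/user/vagrant/.sparkStaging/*/*"
# "%r" prints the replication factor, "%n" the file name; hdfs expands the glob itself
out = subprocess.check_output(["hdfs", "dfs", "-stat", "%r %n", staging_glob])
print(out.decode())
The replication reported for the staged archives should be noticeably higher than the cluster default.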
During the preparation of the container on a node, you will notice in the logs that commands similar to the below example are being executed:
ln -sf "/hadoop/yarn/local/usercache/vagrant/filecache/72/pyspark.zip" "pyspark.zip"
The folder /hadoop/yarn/local/ is the configured location on each node where YARN stores its needed files and logs locally. Creating a symbolic link like this inside the container makes the content of the zip file available. It is being referenced as "pyspark.zip".
Using Conda Env
For application developers this means that they can package and ship their controlled environment with each application. Other solutions like NFS or Amazon EFS shares are not needed, especially since solutions like shared folders make for a bad architecture that is not designed to scale very well and that makes the development of applications less agile.
The following example demonstrates the use of conda env to transport a Python environment with a PySpark application that needs to be executed. This sample application uses the NLTK package, with the additional requirement of making tokenizer and tagger resources available to the application as well. Our sample application:
import os
import sys
from pyspark import SparkContext
from pyspark import SparkConf
conf = SparkConf()
conf.setAppName("spark-ntlk-env")
sc = SparkContext(conf=conf)
data = sc.textFile('hdfs:///user/vagrant/1970-Nixon.txt')
def word_tokenize(x):
import nltk
return nltk.word_tokenize(x)
def pos_tag(x):
import nltk
return nltk.pos_tag([x])
words = data.flatMap(word_tokenize)
words.saveAsTextFile('hdfs:///user/vagrant/nixon_tokens')
pos_word = words.map(pos_tag)
pos_word.saveAsTextFile('hdfs:///user/vagrant/nixon_token_pos')
Preparing the sample input data
For our example we use the samples provided by NLTK (http://www.nltk.org/nltk_data/) and upload them to HDFS:
(nltk_env)$ python -m nltk.downloader -d nltk_data all
(nltk_env)$ hdfs dfs -put nltk_data/corpora/state_union/1970-Nixon.txt /user/vagrant/
No Hard (Absolute) Links!
Before we actually go and create our environment, let's first take a quick moment to recap how an environment is typically composed. On a machine the environment is made up of variables linking to different target folders containing executables or other resource files. So if you execute a command it is either referenced from your PATH, PYTHON_LIBRARY, or any other defined variable. These variables link to files in directories like /usr/bin, /usr/local/bin or any other referenced location. They are called hard links or absolute references as they start from root /.
Environments using hard links are not easily transportable, as they make strict assumptions about the overall execution environment (your OS, for example) they are being used in. Therefore it is necessary to use relative links in a transportable/relocatable environment.
This is especially true for conda env, as it creates hard links by default. By making the conda env relocatable it can be used in an application by referencing it from the application root . (current dir) instead of the overall root /. By using the --copy option during the creation of the environment, packages are copied instead of linked.
Creating our relocatable environment together with nltk and numpy:
conda create -n nltk_env --copy -y -q python=3 nltk numpy
Fetching package metadata .......
Solving package specifications: ..........
Package plan for installation in environment /home/datalab_user01/anaconda2/envs/nltk_env:
The following packages will be downloaded:
package | build
---------------------------|-----------------
python-3.5.2 | 0 17.2 MB
nltk-3.2.1 | py35_0 1.8 MB
numpy-1.11.1 | py35_0 6.1 MB
setuptools-23.0.0 | py35_0 460 KB
wheel-0.29.0 | py35_0 82 KB
pip-8.1.2 | py35_0 1.6 MB
------------------------------------------------------------
Total: 27.2 MB
The following NEW packages will be INSTALLED:
mkl: 11.3.3-0 (copy)
nltk: 3.2.1-py35_0 (copy)
numpy: 1.11.1-py35_0 (copy)
openssl: 1.0.2h-1 (copy)
pip: 8.1.2-py35_0 (copy)
python: 3.5.2-0 (copy)
readline: 6.2-2 (copy)
setuptools: 23.0.0-py35_0 (copy)
sqlite: 3.13.0-0 (copy)
tk: 8.5.18-0 (copy)
wheel: 0.29.0-py35_0 (copy)
xz: 5.2.2-0 (copy)
zlib: 1.2.8-3 (copy)
#
# To activate this environment, use:
# $ source activate nltk_env
#
# To deactivate this environment, use:
# $ source deactivate
#
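To double-check that the environment really consists of copies rather than links back into the central package cache, a quick script like the following can help (a minimal sketch; the environment path is an assumption based on the default Anaconda location used above):
import os
env_root = os.path.expanduser("~/anaconda2/envs/nltk_env")  # assumption: default Anaconda env location
sym_links, hard_links = [], []
for root, dirs, files in os.walk(env_root):
    for name in files:
        path = os.path.join(root, name)
        if os.path.islink(path):
            sym_links.append(path)
        elif os.stat(path).st_nlink > 1:
            hard_links.append(path)
print("symlinked files: %d, hard linked files: %d" % (len(sym_links), len(hard_links)))
For an environment created with --copy the hard link count in particular should be zero, since nothing should point back into the central package cache anymore.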
This also works for different Python versions, 3.x or 2.x!
Zip it and Ship it!
Now that we have our relocatable environment all set, we are able to package it and ship it as part of our sample PySpark job.
$ cd ~/anaconda2/envs/
$ zip -r nltk_env.zip nltk_env
To make this available during the execution of our application in a YARN container, we have to, for one, distribute the package and, for another, change the default Python environment of Spark to our location. The variable controlling the Python environment for Python applications in Spark is named PYSPARK_PYTHON.
PYSPARK_PYTHON=./NLTK/nltk_env/bin/python spark-submit --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./NLTK/nltk_env/bin/python --master yarn-cluster --archives nltk_env.zip#NLTK spark_nltk_sample.py
Our virtual environment is linked under the name NLTK, which is why the path in PYSPARK_PYTHON points to ./NLTK/content/of/zip/... . The exact command being executed during container creation is something like this:
ln -sf "/hadoop/yarn/local/usercache/vagrant/filecache/71/nltk_env.zip" "NLTK"
Shipping additional resources with an application is controlled by the --files and --archives options as shown here. The options being used here are documented in the Spark YARN Configuration and Spark Environment Variables pages for reference.
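When debugging a setup like this it can be useful to see what actually ends up inside the containers. A small diagnostic job such as the following prints the Python interpreter and the working directory content of an executor (a minimal sketch; the application name and function name are illustrative):
from pyspark import SparkConf
from pyspark import SparkContext
conf = SparkConf()
conf.setAppName("container-env-check")
sc = SparkContext(conf=conf)
def describe_container(_):
    # imports inside the function so they are resolved on the executor
    import os
    import sys
    # sys.executable is the interpreter the executor actually uses,
    # os.listdir('.') shows what YARN linked into the container working directory
    return [(sys.executable, sorted(os.listdir('.')))]
print(sc.parallelize([0], 1).mapPartitions(describe_container).collect())
Submitted with the same --archives option as above, it should report an interpreter path inside ./NLTK/nltk_env/bin/ and list the NLTK link in the container working directory.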
Packaging tokenizer and taggers
Doing just the above will unfortunately fail, because using the NLTK parser in the way we are using it in the example program has some additional dependencies. If you have followed the above steps, submitting it to your YARN cluster will result in the following exception at container level:
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
...
raise LookupError(resource_not_found)
LookupError:
**********************************************************************
Resource 'taggers/averaged_perceptron_tagger/averaged_perceptron
_tagger.pickle' not found. Please use the NLTK Downloader to
obtain the resource: >>> nltk.download()
Searched in:
- '/home/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
**********************************************************************
at org.apache.spark.api.python.PythonRunner$anon$1.read(PythonRDD.scala:166)
at org.apache.spark.api.python.PythonRunner$anon$1.<init>(PythonRDD.scala:207)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
... 1 more
The problem is that NLTK expects the following resource, tokenizers/punkt/english.pickle, to be available in one of the following locations:
Searched in:
- '/home/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
The good thing about this is that by now we should know how we can ship the required dependency to our application. We can do it the same way we did with our Python environment. Again it is important to check how the resource is going to be referenced. NLTK expects it by default relative to the current location under tokenizers/punkt/english.pickle, which is why we navigate into the folder for packaging and reference the zip file with tokenizers.zip#tokenizers.
(nltk_env)$ cd nltk_data/tokenizers/
(nltk_env)$ zip -r ../../tokenizers.zip *
(nltk_env)$ cd ../../
(nltk_env)$ cd nltk_data/taggers/
(nltk_env)$ zip -r ../../taggers.zip *
(nltk_env)$ cd ../../
At a later point our program will expect a tagger in the same fashion already demonstrated in the above snippet.
Using YARN Locations
We can ship those zip resources the same way we shipped our conda env. In addition, environment variables can be used to control resource discovery and allocation. For NLTK you can use the environment variable NLTK_DATA to control the path. Setting this in Spark can be done similarly to the way we set PYSPARK_PYTHON:
--conf spark.yarn.appMasterEnv.NLTK_DATA=./
Additionally YARN exposes the container path via the environment variable PWD. This can be used in NLTK to add it to the search path as follows:
def word_tokenize(x):
import nltk
nltk.data.path.append(os.environ.get('PWD'))
return nltk.word_tokenize(x)
The submission of your application now becomes:
PYSPARK_PYTHON=./NLTK/nltk_env/bin/python spark-submit --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./NLTK/nltk_env/bin/python --conf spark.yarn.appMasterEnv.NLTK_DATA=./ --master yarn-cluster --archives nltk_env.zip#NLTK,tokenizers.zip#tokenizers,taggers.zip#taggers spark_nltk_sample.py
The expected result should be something like the following:
(nltk_env)$ hdfs dfs -cat /user/datalab_user01/nixon_tokens/* | head -n 20
Annual
Message
to
the
Congress
on
the
State
of
the
Union
.
January
22
,
1970
Mr.
Speaker
,
Mr.
And:
(nltk_env)$ hdfs dfs -cat /user/datalab_user01/nixon_token_pos/* | head -n 20
[(u'Annual', 'JJ')]
[(u'Message', 'NN')]
[(u'to', 'TO')]
[(u'the', 'DT')]
[(u'Congress', 'NNP')]
[(u'on', 'IN')]
[(u'the', 'DT')]
[(u'State', 'NNP')]
[(u'of', 'IN')]
[(u'the', 'DT')]
[(u'Union', 'NN')]
[(u'.', '.')]
[(u'January', 'NNP')]
[(u'22', 'CD')]
[(u',', ',')]
[(u'1970', 'CD')]
[(u'Mr.', 'NNP')]
[(u'Speaker', 'NN')]
[(u',', ',')]
[(u'Mr.', 'NNP')]
Further Readings
PySpark Internals
Spark NLTK Example
NLTK Data
Virtualenv in Hadoop Streaming
Conda Intro
Conda Env with Spark
Python Env support in Spark (SPARK-13587)
This post was first published here: http://henning.kropponline.de/2016/09/24/running-pyspark-with-conda-env/
09-20-2016
09:52 AM
Well, you would have the login, but not the Kerberos init. You would still have two realms with user credentials, the KRB5 realm and the LDAP realm, depending on your setup. Actually the KRB5 realm can be included inside LDAP, or put differently, Kerberos can be configured to use LDAP as its user DB; that would give you the possibility to combine both. This is essentially what FreeIPA is.
09-20-2016
09:49 AM
Yes with pam_ldap integration: http://www.tldp.org/HOWTO/archived/LDAP-Implementation-HOWTO/pamnss.html
09-20-2016
09:11 AM
A list of recommended tools is:
SSSD https://fedorahosted.org/sssd/ / https://help.ubuntu.com/lts/serverguide/sssd-ad.html
FreeIPA (introduces an additional realm besides AD and needs a trust to be established between the two) https://www.freeipa.org/page/Main_Page
Winutils
Centrify (commercial) https://www.centrify.com/
VAS / Quest (commercial) https://software.dell.com/products/authentication-services/
....
Please check the material of this workshop for reference: https://community.hortonworks.com/articles/1143/cheatsheet-on-configuring-authentication-authoriza.html
https://community.hortonworks.com/repos/4465/workshops-on-how-to-setup-security-on-hadoop-using.html
09-19-2016
02:29 PM
1 Kudo
This does not sound like a good idea. Edge nodes by definition typically just hold client programs, not services like the DataNode or NodeManager. YARN manages the resource allocation based on data and utilization of the nodes, which is why it is often also not a good idea to run NodeManagers without DataNodes on one machine. Concerning your "But can i .. bring the data to the edge .. run the task if all other nodes are busy?": YARN does the resource negotiation and scheduling for distributed frameworks like MR and Spark. I would advise not to do this manually but to let YARN do this for you. I hope this helps?
09-15-2016
05:43 PM
Currently Spark does not support deployment to YARN from a SparkContext. Use spark-submit instead. For unit testing it is recommended to use the [local] runner. The problem is that you cannot set the Hadoop conf from outside the SparkContext; it is read from the *-site.xml configs under HADOOP_HOME during spark-submit. So you cannot point to your remote cluster from Eclipse unless you set up the correct *-site.xml configs on your laptop and use spark-submit. SparkSubmit is available as a Java class, but I doubt that you will achieve what you are looking for with it. But you would be able to launch a Spark job from Eclipse to a remote cluster, if this is sufficient for you. Have a look at the Oozie Spark launcher as an example. SparkContext is changing dramatically in Spark 2, I think in favor of a SparkClient that supports multiple SparkContexts, but I am not sure what the situation is with that.
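As an illustration of the local runner: a minimal PySpark test setup needs no cluster or *-site.xml configuration at all (a sketch; the computation and assertion are just placeholders):
from pyspark import SparkConf, SparkContext
# local[*] runs Spark inside the current process, so no cluster configuration is needed
conf = SparkConf().setMaster("local[*]").setAppName("unit-test")
sc = SparkContext(conf=conf)
result = sc.parallelize([1, 2, 3]).map(lambda x: x * 2).collect()
assert result == [2, 4, 6]
sc.stop()
The same idea applies to Java/Scala tests by setting the master to local[*] on the SparkConf.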
09-15-2016
11:48 AM
1 Kudo
@Smart Solutions could you please check if this article is of any help for you: https://community.hortonworks.com/content/kbentry/56704/secure-kafka-java-producer-with-kerberos.html
09-15-2016
11:33 AM
6 Kudos
The most recent release of Kafka, 0.9, with its comprehensive security implementation has reached an important milestone. In his blog post Kafka Security 101, Ismael from Confluent describes the security features that are part of the release very well. As part II of the previously published post about Kafka Security with Kerberos, this post discusses a sample implementation of a Java Kafka producer with authentication. It is part of a mini series of posts discussing secure HDP clients, connecting services to a secured cluster, and kerberizing the HDP Sandbox (Download HDP Sandbox). At the end of this post we will also create a Kafka Servlet to publish messages to a secured broker.
Kafka provides SSL and Kerberos authentication. Only Kerberos is discussed here. Kafka from now on supports four different communication protocols between Consumers, Producers, and Brokers. Each protocol considers different security aspects, while PLAINTEXT is the old insecure communication protocol.
PLAINTEXT (non-authenticated, non-encrypted)
SSL (SSL authentication, encrypted)
PLAINTEXT+SASL (authentication, non-encrypted)
SSL+SASL (encrypted authentication, encrypted transport)
A
Kafka client needs to be configured to use the protocol of the
corresponding broker. This tells the client to use authentication for
communication with the broker:
Properties props = new Properties();
props.put("security.protocol", "PLAINTEXTSASL"); Making use of Kerberos authentication in Java is provided by the Java Authentication and Authorization Service (JAAS)
which is a pluggable authentication method similar to PAM supporting
multiple authentication methods. In this case the authentication method
being used is GSS-API for Kerberos. Demo Setup For
JAAS a proper configuration of GSS would be needed in addition to being
in possession of proper credentials, obviously. Some credentials can be
created with MIT Kerberos like this: (as root)
$ kadmin.local -q "addprinc -pw hadoop kafka-user"
$ kadmin.local -q "xst -k /home/kafka-user/kafka-user.keytab kafka-user@MYCORP.NET"
(Creating a keytab will make the existing password invalid. To change your password back to hadoop use as root:)
$ kadmin.local -q "cpw -pw hadoop hdfs-user" The last line is
not necessarily needed as it creates us a so called keytab - basically
an encrypted password of the user - that can be used for password less
authentication for example for automated services. We will make use of
that here as well. First we need to prepare a test topic to publish messages with proper privileges for our kafka-user: # Become Kafka admin
$ kinit -kt /etc/security/keytabs/kafka.service.keytab kafka/one.hdp@MYCORP.NET
# Set privileges for kafka-user
$ /usr/hdp/current/kafka-broker/bin/kafka-acls.sh --add --allow-principals user:kafka-user --operation ALL --topic test --authorizer-properties zookeeper.connect=one.hdp:2181
Adding following acls for resource: Topic:test
user:kafka-user has Allow permission for operations: All from hosts: *
Following is list of acls for resource: Topic:test
user:kafka-user has Allow permission for operations: All from hosts: *
As a sample producer we will use this:
package hdp.sample;
import java.util.Date;
import java.util.Properties;
import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;
public class KafkaProducer {
public static void main(String... args) {
String topic = args[1];
Properties props = new Properties();
props.put("metadata.broker.list", args[0]);
props.put("serializer.class", "kafka.serializer.StringEncoder");
props.put("request.required.acks", "1");
props.put("security.protocol", "PLAINTEXTSASL");
ProducerConfig config = new ProducerConfig(props);
Producer producer = new Producer<String, String>(config);
for (int i = 0; i < 10; i++){
producer.send(new KeyedMessage<String, String>(topic, "Test Date: " + new Date()));
}
}
}
With this setup we can go ahead and demonstrate two ways to use a JAAS context to authenticate with the Kafka broker. First we will configure a context to use the existing privileges possessed by the executing user. Next we use a so-called keytab to demonstrate a password-less login for automated producer processes. Finally we will look at a Servlet implementation provided here.
Authentication with User Login
We configure a JAAS config with useKeyTab set to false and useTicketCache set to true, so that the privileges of the current user are being used:
KafkaClient {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=false
useTicketCache=true
serviceName="kafka";
};
We store this in a file under /home/kafka-user/kafka-jaas.conf and execute the producer like this:
# list current user context
$ klist
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: kafka-user@MYCORP.NET
Valid starting Expires Service principal
21.02.2016 16:13:13 22.02.2016 16:13:13 krbtgt/MYCORP.NET@MYCORP.NET
# execute java producer
$ java -Djava.security.auth.login.config=/home/kafka-user/kafka-jaas.conf -Djava.security.krb5.conf=/etc/krb5.conf -Djavax.security.auth.useSubjectCredsOnly=false -cp hdp-kafka-sample-1.0-SNAPSHOT.jar:/usr/hdp/current/kafka-broker/libs/* hdp.sample.KafkaProducer one.hdp:6667 test
# consume sample messages for test
$ /usr/hdp/current/kafka-broker/bin/kafka-simple-consumer-shell.sh --broker-list one.hdp:6667 --topic test --security-protocol PLAINTEXTSASL --partition 0
{metadata.broker.list=one.hdp:6667, request.timeout.ms=1000, client.id=SimpleConsumerShell, security.protocol=PLAINTEXTSASL}
Test Date: Sun Feb 21 16:12:05 UTC 2016
Test Date: Sun Feb 21 16:12:06 UTC 2016
Test Date: Sun Feb 21 16:12:06 UTC 2016
Test Date: Sun Feb 21 16:12:06 UTC 2016
Test Date: Sun Feb 21 16:12:06 UTC 2016
Test Date: Sun Feb 21 16:12:06 UTC 2016
Test Date: Sun Feb 21 16:12:06 UTC 2016
Test Date: Sun Feb 21 16:12:06 UTC 2016
Test Date: Sun Feb 21 16:12:06 UTC 2016
Test Date: Sun Feb 21 16:12:06 UTC 2016
Using Keytab to Login
Next
we will configure the JAAS context to use a generated keytab file
instead of the security context of the executing user. Before we can do
this we need to create the keytab, storing it under /home/kafka-user/kafka-user.keytab:
$ kadmin.local -q "xst -k /home/kafka-user/kafka-user.keytab kafka-user@MYCORP.NET"
Authenticating as principal kafka-user/admin@MYCORP.NET with password.
Entry for principal kafka-user@MYCORP.NET with kvno 2, encryption type aes256-cts-hmac-sha1-96 added to keytab WRFILE:/home/kafka-user/kafka-user.keytab.
Entry for principal kafka-user@MYCORP.NET with kvno 2, encryption type aes128-cts-hmac-sha1-96 added to keytab WRFILE:/home/kafka-user/kafka-user.keytab.
Entry for principal kafka-user@MYCORP.NET with kvno 2, encryption type des3-cbc-sha1 added to keytab WRFILE:/home/kafka-user/kafka-user.keytab.
Entry for principal kafka-user@MYCORP.NET with kvno 2, encryption type arcfour-hmac added to keytab WRFILE:/home/kafka-user/kafka-user.keytab.
Entry for principal kafka-user@MYCORP.NET with kvno 2, encryption type des-hmac-sha1 added to keytab WRFILE:/home/kafka-user/kafka-user.keytab.
Entry for principal kafka-user@MYCORP.NET with kvno 2, encryption type des-cbc-md5 added to keytab WRFILE:/home/kafka-user/kafka-user.keytab.
$ chown kafka-user. /home/kafka-user/kafka-user.keytab
The JAAS configuration can now be changed to look like this:
KafkaClient {
com.sun.security.auth.module.Krb5LoginModule required
doNotPrompt=true
useTicketCache=true
principal="kafka-user@MYCORP.NET"
useKeyTab=true
serviceName="kafka"
keyTab="/home/kafka-user/kafka-user.keytab"
client=true;
};
This will use the keytab stored under /home/kafka-user/kafka-user.keytab, while the user executing the producer does not need to be logged in to any security controller:
$ klist
klist: Credentials cache file '/tmp/krb5cc_0' not found
$ java -Djava.security.auth.login.config=/home/kafka-user/kafka-jaas.conf -Djava.security.krb5.conf=/etc/krb5.conf -Djavax.security.auth.useSubjectCredsOnly=true -cp hdp-kafka-sample-1.0-SNAPSHOT.jar:/usr/hdp/current/kafka-broker/libs/* hdp.sample.KafkaProducer one.hdp:6667 test
Kafka Producer Servlet
In a last example we will add a Kafka Servlet to the hdp-web-sample project previously described in this post. Our Servlet will get the topic and message as GET parameters. The Servlet looks as follows:
package hdp.webapp;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Properties;
import javax.servlet.Servlet;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;
public class KafkaServlet extends HttpServlet implements Servlet {
protected void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
String topic = request.getParameter("topic");
String msg = request.getParameter("msg");
Properties props = new Properties();
props.put("metadata.broker.list", "one.hdp:6667");
props.put("serializer.class", "kafka.serializer.StringEncoder");
props.put("request.required.acks", "1");
props.put("security.protocol", "PLAINTEXTSASL");
ProducerConfig config = new ProducerConfig(props);
Producer producer = new Producer<String, String>(config);
producer.send(new KeyedMessage<String, String>(topic, msg));
PrintWriter out = response.getWriter();
out.println("<html>");
out.println("<head><title>Write to topic: "+ topic +"</title></head>");
out.println("<body><h1>/"+ msg +"</h1>");
out.println("</html>");
out.close();
}
}
Again we are changing the JAAS config of the Tomcat service to be able to make use of the previously generated keytab. The jaas.conf of Tomcat will now contain this:
KafkaClient {
com.sun.security.auth.module.Krb5LoginModule required
doNotPrompt=true
useTicketCache=true
principal="kafka-user@MYCORP.NET"
useKeyTab=true
serviceName="kafka"
keyTab="/home/kafka-user/kafka-user.keytab"
client=true;
};
com.sun.security.jgss.krb5.initiate {
com.sun.security.auth.module.Krb5LoginModule required
doNotPrompt=true
principal="tomcat/one.hdp@MYCORP.NET"
useKeyTab=true
keyTab="/etc/tomcat/tomcat.keytab"
storeKey=true;
};
After deploying the web app and restarting Tomcat with this newly adapted JAAS config, you should be able to publish messages to a secured broker by triggering the following GET address from a browser: http://one.hdp:8099/hdp-web/kafka?topic=test&msg=Test1 . The response should be a 200 OK.
You might be having some issues and in particular seeing this exception:
SEVERE: Servlet.service() for servlet [KafkaServlet] in context with path [/hdp-web] threw exception [Servlet execution threw an exception] with root cause
javax.security.auth.login.LoginException: Unable to obtain password from user
at com.sun.security.auth.module.Krb5LoginModule.promptForPass(Krb5LoginModule.java:897)
at com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(Krb5LoginModule.java:760)
at com.sun.security.auth.module.Krb5LoginModule.login(Krb5LoginModule.java:617)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at javax.security.auth.login.LoginContext.invoke(LoginContext.java:755)
at javax.security.auth.login.LoginContext.access$000(LoginContext.java:195)
at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682)
at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680)
at javax.security.auth.login.LoginContext.login(LoginContext.java:587)
at org.apache.kafka.common.security.kerberos.Login.login(Login.java:298)
at org.apache.kafka.common.security.kerberos.Login.<init>(Login.java:104)
at kafka.common.security.LoginManager$.init(LoginManager.scala:36)
at kafka.producer.Producer.<init>(Producer.scala:50)
at kafka.producer.Producer.<init>(Producer.scala:73)
at kafka.javaapi.producer.Producer.<init>(Producer.scala:26)
at hdp.webapp.KafkaServlet.doGet(KafkaServlet.java:33)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:620)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:727)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:303)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:220)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)
at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:501)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:950)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:408)
at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1040)
at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:607)
at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:314)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
at java.lang.Thread.run(Thread.java:745)
If you are seeing the message javax.security.auth.login.LoginException: Unable to obtain password from user, it likely refers to your keytab file, which acts as the user's password. So make sure that the tomcat user is able to read that file stored under /home/kafka-user/kafka-user.keytab, for example.
Further Readings
Kafka Security 101
Kafka Security
Kafka Sasl/Kerberos and SSL Implementation
Oracle Doc: JAAS Authentication
Krb5LoginModule
Flume with kerberized Kafka
JAAS Login Configuration File
This article was first published under: http://henning.kropponline.de/2016/02/21/secure-kafka-java-producer-with-kerberos/
05-17-2016
11:58 AM
2 Kudos
Hi @sarfarazkhan pathan, if I am not mistaken this is just a warning. You typically can install the cluster with users already present from previous installs. In some cases this obviously can cause some issues, but in general it doesn't. Also, if you add and remove users frequently from multiple installs, you are increasing the uid count of your system. So it should be safe to ignore the warning. You can read here about cleaning up nodes from previous installs: http://henning.kropponline.de/2016/04/24/uninstalling-and-cleaning-a-hdp-node/ EDIT: BTW, to be safe, a restart of the Ambari agent and simply re-running the checks should be enough. No restart of the node or similar should be required. Regards
04-22-2016
07:49 PM
2 Kudos
To me it looks like in the Ranger admin config your ambari_ranger_admin and ambari_ranger_password are empty. Could you please check if they exist and are not blank? Please note that the ambari_ranger_admin is different from the Ranger admin_user. But both are defined in the Ranger configuration in Ambari.
04-18-2016
08:46 PM
@Benjamin Leonhardi Try this https://github.com/hkropp/vagrant-hdp/blob/master/bin/ambari-shell.jar and run it with % java -jar ambari-shell.jar --ambari.host=
04-18-2016
07:27 PM
1 Kudo
No, Ranger does not have a shell. But that is an interesting idea. If you are interested, Ambari does have a shell for its REST API. It uses Spring Shell. Check out this https://cwiki.apache.org/confluence/display/AMBARI/Ambari+Shell and this http://docs.spring.io/spring-shell/docs/current/reference/htmlsingle/
04-14-2016
08:25 AM
1. First you need to make sure that your non-root user has sufficient sudoers rights. Please check this document: http://docs.hortonworks.com/HDPDocuments/Ambari-2.2.1.1/bk_Ambari_Security_Guide/content/_configuring_ambari_for_non-root.html
2. Next, I suspect you set up passwordless SSH for the root user, not the non-root user? I always prefer manual agent registration, but that is just me. Please check: http://docs.hortonworks.com/HDPDocuments/Ambari-2.2.1.1/bk_ambari_reference_guide/content/_install_the_ambari_agents_manually.html
A simple sed helps to do the manual registration:
$ sed -i 's/hostname=localhost/hostname=<ambari_server_fqdn>/' /etc/ambari-agent/conf/ambari-agent.ini
04-12-2016
05:52 PM
The provided workaround by @Alessio Ubaldi seems to work. But you should also try to first upgrade to 2.3.7.
04-11-2016
02:40 PM
1 Kudo
Can I configure two authentication providers for Knox, and if so, how would that work?
Labels:
- Apache Knox
04-06-2016
09:51 AM
3 Kudos
The error message in /var/log/knox/gateway.log says that the certificate used by Knox will only become valid in the future:
Failed to start gateway: org.apache.hadoop.gateway.services.ServiceLifecycleException: Gateway SSL Certificate is not yet valid. Server will not start.
-> "not yet valid"
Knox refuses to start, because using such a certificate will result in an SSL exception for almost any client. You will need to check the certificate you are using for Knox. This is stored as gateway-identity in gateway.jks under /var/lib/knox/data*/keystore. Please refer to this: http://knox.apache.org/books/knox-0-6-0/user-guide.html#Management+of+Security+Artifacts
What should also work: if you simply remove the gateway-identity from the keystore, Knox should create a self-signed certificate for you upon start.
Could you share how the certificate was generated? Did you change it after the install? Are you using ntp?
04-05-2016
02:24 PM
1 Kudo
Can you also provide what you find in /var/log/knox/gateway.log
04-02-2016
09:32 PM
1 Kudo
Actually this does not quite answer the question, but gives a good hint to dfs.internal.nameservices. The parameter is needed to distinguish between the local nameservice and other configured nameservices, but does not by itself support distcp between two HA clusters. dfs.internal.nameservices for example is relevant for DataNodes so they don't register with the other cluster.
To support distcp between multiple HA clusters you simply have to define multiple nameservices, as in this example:
<configuration>
<!-- services -->
<property>
<name>dfs.nameservices</name>
<value>serviceId1,serviceId2</value>
</property>
<!-- serviceId2 properties -->
<property>
<name>dfs.client.failover.proxy.provider.nameservices2</name>
<value>org.apache.hadoop.hdfs.server
.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
<name>dfs.ha.namenodes.serviceId2</name>
<value>nn201,nn202</value>
</property>
<property>
<name>dfs.namenode.rpc-address.serviceId2.nn201</name>
<value>nn201.pro.net:8020</value>
</property>
<property>
<name>dfs.namenode.servicerpc-address.serviceId2.nn201</name>
<value>nn201.pro.net:54321</value>
</property>
<property>
<name>dfs.namenode.http-address.serviceId2.nn201</name>
<value>nn201.pro.net:50070</value>
</property>
<property>
<name>dfs.namenode.https-address.serviceId2.nn201</name>
<value>nn201.prod.com:50470</value>
</property>
<property>
<name>dfs.namenode.rpc-address.serviceId2.nn202</name>
<value>nn202.pro.net:8020</value>
</property>
<property>
<name>dfs.namenode.servicerpc-address.serviceId2.nn202</name>
<value>nn202.pro.net:54321</value>
</property>
<property>
<name>dfs.namenode.http-address.serviceId2.nn202</name>
<value>nn202.pro.net:50070</value>
</property>
<property>
<name>dfs.namenode.https-address.serviceId2.nn202</name>
<value>nn202.prod.net:50470</value>
</property>
<!-- serviceId1 -->
<property>
<name>dfs.client.failover.proxy.provider.nameservices1</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.
ConfiguredFailoverProxyProvider</value>
</property>
<property>
<name>dfs.ha.namenodes.nameservices1</name>
<value>nn101,nn102</value>
</property>
<property>
<name>dfs.namenode.rpc-address.serviceId1.nn101</name>
<value>nn101.poc.net:8020</value>
</property>
<property>
<name>dfs.namenode.servicerpc-address.serviceId1.nn101</name>
<value>nn101.poc.net:54321</value>
</property>
<property>
<name>dfs.namenode.http-address.serviceId1.nn101</name>
<value>nn101.poc.net:50070</value>
</property>
<property>
<name>dfs.namenode.https-address.serviceId1.nn101</name>
<value>nn101.poc.net:50470</value>
</property>
<property>
<name>dfs.namenode.rpc-address.serviceId1.nn102</name>
<value>nn102.poc.net:8020</value>
</property>
<property>
<name>dfs.namenode.servicerpc-address.serviceId1.nn102</name>
<value>nn102.poc.net:54321</value>
</property>
<property>
<name>dfs.namenode.http-address.serviceId1.nn102</name>
<value>nn102.poc.net:50070</value>
</property>
<property>
<name>dfs.namenode.https-address.serviceId1.nn102</name>
<value>nn102.poc.net:50470</value>
</property>
</configuration>
Adding this to the hdfs-site config makes both nameservices serviceId1,serviceId2 available.
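Once both nameservices are defined, distcp can address them by nameservice URI. A minimal sketch (paths are illustrative; it simply shells out to the distcp command):
import subprocess
# assumption: both nameservices from the hdfs-site snippet above resolve on this client
subprocess.check_call(["hadoop", "distcp", "hdfs://serviceId1/tmp/source", "hdfs://serviceId2/tmp/target"])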
03-24-2016
07:16 PM
1 Kudo
HCatalog does not support writing into a bucketed table. HCat explicitly checks if a table is bucketed, and if so disables storing to it, to avoid writing to the table in a destructive way. From HCatOutputFormat:
if (sd.getBucketCols() != null && !sd.getBucketCols().isEmpty()) {
throw new HCatException(ErrorType.ERROR_NOT_SUPPORTED, "Store into a partition with bucket definition from Pig/Mapreduce is not supported");
}
03-22-2016
07:22 PM
1 Kudo
+1 for the aspect of reusing the Spark code itself
03-22-2016
04:17 PM
1 Kudo
In an advanced architecture you would leverage Zookeeper to announce a new model to the topology without taking it offline.
03-22-2016
04:14 PM
3 Kudos
You can use PMML (https://de.wikipedia.org/wiki/Predictive_Model_Markup_Language). Spark supports exporting (not all) models to PMML: http://spark.apache.org/docs/latest/mllib-pmml-model-export.html (UPDATE: As @Simon Elliston Ball rightfully points out in his answer, in case the PMML model is not supported, the Spark libs can be reused, as most of them have no dependency on the SparkContext.) One way could be to use JPMML with Java in Storm: http://henning.kropponline.de/2015/09/06/jpmml-example-random-forest/ https://github.com/jpmml/jpmml-storm The other could be to use R in Storm. I have seen it done, but don't have a reference at hand.
03-10-2016
09:54 PM
1 Kudo
Thank you for pointing me to that documentation. That helped me with something else.
03-10-2016
09:53 PM
2 Kudos
I was told that to prevent users from being able to append custom configurations to a config file, supports_adding_forbidden can be added to the configuration tag like so: <configuration supports_final="true" supports_adding_forbidden="true"> ... </configuration>
03-10-2016
08:47 AM
2 Kudos
I created a custom service with config files. How do I disable the Custom config section for that config file in Ambari?
Labels:
- Apache Ambari
02-20-2016
11:43 PM
1 Kudo
Origin: http://henning.kropponline.de/2015/05/19/hivesink-for-flume/
With the most recent release of HDP (v2.2.4) Hive Streaming is shipped as technical preview. It can for example be used with Storm to ingest streaming data collected from Kafka, as demonstrated here. But it also still has some serious limitations and, in case of Storm, a major bug. Nevertheless Hive Streaming is likely to become the tool of choice when it comes to streamlining data ingestion into Hadoop. So it is worth exploring already today.
Flume's upcoming release 1.6 will contain a HiveSink capable of leveraging Hive Streaming for data ingestion. In the following post we will use it as a replacement for the HDFS sink used in a previous post here. Other than replacing the HDFS sink with a HiveSink, none of the previous setup will change, except for the Hive table schema, which needs to be adjusted as part of the requirements that currently exist around Hive Streaming. So let's get started by looking into these restrictions.
Hive Streaming Limitations
The only file format supported is ORC. So the original schema of the stocks table needs to be adjusted to reflect that:
DROP TABLE IF EXISTS stocks;
CREATE EXTERNAL TABLE stocks (
date STRING,
open DOUBLE,
high DOUBLE,
low DOUBLE,
close DOUBLE,
volume BIGINT,
adj_close DOUBLE)
PARTITIONED BY(year STRING)
CLUSTERED BY (date) into 3 buckets
STORED AS ORC
LOCATION '/ingest/stocks';
As you can see from the schema, the table now also is bucketed, which is required by Hive Streaming. Furthermore we need to set the following:
hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
hive.compactor.initiator.on = true
hive.compactor.worker.threads > 0 (eg. 5)
Configuration diff as given by Ambari.
Also
important to know is that the current Streaming API only supports
delimited input data (CSV, tab separated) or JSON (strict syntax).
Flume Hive Sink
For the Flume Hive Sink the following configurations with their defaults can or must be configured:
hive.metastore
hive.database
hive.table
hive.partition
hive.txnsPerBatchAsk = 100
batchSize = 15000
serializer (delimited | json)
serializer.delimiter = ,
serializer.fieldnames
idleTimeout = 0
callTimeout = 10000
heartBeatInterval = 240
maxOpenConnections
useLocalTimeStamp
timeZone
roundUnit (hour | minute | second)
round
roundValue
With the previous example
we can use the following Flume configuration. The batch size and the
transactions per batch are not set very high, which would probably be different in a production setup, but this also depends on the data stream to expect.
flume-hive-ingest.sources = src1
flume-hive-ingest.channels = chan1
flume-hive-ingest.sinks = sink1
flume-hive-ingest.sources.src1.type = spooldir
flume-hive-ingest.sources.src1.channels = chan1
flume-hive-ingest.sources.src1.spoolDir = /vagrant/flume_log
flume-hive-ingest.sources.src1.interceptors = skipHeadI dateI
flume-hive-ingest.sources.src1.interceptors.skipHeadI.type = regex_filter
flume-hive-ingest.sources.src1.interceptors.skipHeadI.regex = ^Date.*
flume-hive-ingest.sources.src1.interceptors.skipHeadI.excludeEvents = true
flume-hive-ingest.sources.src1.interceptors.dateI.type = regex_extractor
flume-hive-ingest.sources.src1.interceptors.dateI.regex = ^(\d+)-.*
flume-hive-ingest.sources.src1.interceptors.dateI.serializers = y
flume-hive-ingest.sources.src1.interceptors.dateI.serializers.y.name = year
flume-hive-ingest.channels.chan1.type = memory
flume-hive-ingest.channels.chan1.capacity = 1000
flume-hive-ingest.channels.chan1.transactionCapacity = 100
flume-hive-ingest.sinks.sink1.type = hive
flume-hive-ingest.sinks.sink1.channel = chan1
flume-hive-ingest.sinks.sink1.hive.metastore = thirft://one.hdp:9083
flume-hive-ingest.sinks.sink1.hive.database = default
flume-hive-ingest.sinks.sink1.hive.table = stocks
flume-hive-ingest.sinks.sink1.hive.partition = %{year}
flume-hive-ingest.sinks.sink1.hive.txnsPerBatchAsk = 2
flume-hive-ingest.sinks.sink1.batchSize = 10
flume-hive-ingest.sinks.sink1.serializer = delimited
flume-hive-ingest.sinks.sink1.serializer.delimiter = ,
flume-hive-ingest.sinks.sink1.serializer.fieldnames = date,open,high,low,close,volume,adj_close
Before starting a Flume agent with this configuration you might need to set HIVE_HOME and HCAT_HOME, as flume-ng will only put the required Hive jars into the classpath with this logic:
add_hive_paths(){
if [ -d "${HIVE_HOME}/lib" ]; then
info "Including Hive libraries found via ($HIVE_HOME) for Hive access"
FLUME_CLASSPATH="$FLUME_CLASSPATH:$HIVE_HOME/lib/*"
fi
if [ -d "${HCAT_HOME}/share/hcatalog" ]; then
info "Including HCatalog libraries found via ($HCAT_HOME) for Hive access"
FLUME_CLASSPATH="$FLUME_CLASSPATH:${HCAT_HOME}/share/hcatalog/*"
fi
}
Setting them in my case was pretty straightforward:
export HIVE_HOME=/usr/hdp/current/hive-server2
export HCAT_HOME=/usr/hdp/current/hive-webhcat
Now we can start the Flume agent, obviously after we have created the stocks table:
$ hcat -f data/stocks_schema.hive
$ apache-flume-1.6.0-bin/bin/flume-ng agent -f data/flume-file-hive-ingest.conf -n flume-hive-ingest
When working correctly you should be able to see output similar to this, once you copy the stocks data into the spooling directory:
16/05/15 15:19:18 INFO ql.Driver: OK
16/05/15 15:19:18 INFO log.PerfLogger: <PERFLOG method=releaseLocks from=org.apache.hadoop.hive.ql.Driver>
16/05/15 15:19:18 INFO log.PerfLogger: </PERFLOG method=releaseLocks start=1431703158539 end=1431703158543 duration=4 from=org.apache.hadoop.hive.ql.Driver>
16/05/15 15:19:18 INFO log.PerfLogger: </PERFLOG method=Driver.run start=1431703158452 end=1431703158543 duration=91 from=org.apache.hadoop.hive.ql.Driver>
16/05/15 15:19:20 INFO hive.metastore: Trying to connect to metastore with URI thirft://one.hdp:9083
16/05/15 15:19:20 INFO hive.metastore: Connected to metastore.
16/05/15 15:19:20 INFO hive.HiveWriter: Acquired Txn Batch TxnIds=[1743...1744] on endPoint = {metaStoreUri='thirft://one.hdp:9083', database='default', table='stocks', partitionVals=[1996] }. Switching to first txn
16/05/15 15:19:20 INFO hive.HiveWriter: Committing Txn 1742 on EndPoint: {metaStoreUri='thirft://one.hdp:9083', database='default', table='stocks', partitionVals=[1997] }
16/05/15 15:19:20 INFO hive.HiveWriter: Acquired Txn Batch TxnIds=[1745...1746] on endPoint = {metaStoreUri='thirft://one.hdp:9083', database='default', table='stocks', partitionVals=[1997] }. Switching to first txn
Troubleshooting
If something goes wrong, for example with a failing connection to the metastore, please:
Check the requirements posted here or on the Hive wiki. Also check that your schema is bucketed and read the exception message carefully.
Increase the timeout for the HiveWriter to connect to the Metastore and again read the exception message carefully.
Make hdfs://tmp/hive and file:///tmp/hive writable (eg. chmod 777).
A typical error message could look like this:
16/05/15 14:53:39 WARN hive.HiveSink: sink1 : Failed connecting to EndPoint {metaStoreUri='one.hdp:9083', database='default', table='stocks', partitionVals=[${year}] }
org.apache.flume.sink.hive.HiveWriter$ConnectException: Failed connecting to EndPoint {metaStoreUri='one.hdp:9083', database='default', table='stocks', partitionVals=[${year}] }
at org.apache.flume.sink.hive.HiveWriter.<init>(HiveWriter.java:98)
at org.apache.flume.sink.hive.HiveSink.getOrCreateWriter(HiveSink.java:343)
at org.apache.flume.sink.hive.HiveSink.drainOneBatch(HiveSink.java:296)
at org.apache.flume.sink.hive.HiveSink.process(HiveSink.java:254)
at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.flume.sink.hive.HiveWriter$ConnectException: Failed connecting to EndPoint {metaStoreUri='one.hdp:9083', database='default', table='stocks', partitionVals=[${year}] }
at org.apache.flume.sink.hive.HiveWriter.newConnection(HiveWriter.java:320)
at org.apache.flume.sink.hive.HiveWriter.<init>(HiveWriter.java:86)
... 6 more
Caused by: java.lang.NullPointerException
at org.apache.thrift.transport.TSocket.open(TSocket.java:168)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:358)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:215)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:161)
at org.apache.hive.hcatalog.streaming.HiveEndPoint$ConnectionImpl.getMetaStoreClient(HiveEndPoint.java:448)
at org.apache.hive.hcatalog.streaming.HiveEndPoint$ConnectionImpl.<init>(HiveEndPoint.java:274)
at org.apache.hive.hcatalog.streaming.HiveEndPoint$ConnectionImpl.<init>(HiveEndPoint.java:243)
at org.apache.hive.hcatalog.streaming.HiveEndPoint.newConnectionImpl(HiveEndPoint.java:180)
at org.apache.hive.hcatalog.streaming.HiveEndPoint.newConnection(HiveEndPoint.java:157)
at org.apache.hive.hcatalog.streaming.HiveEndPoint.newConnection(HiveEndPoint.java:110)
at org.apache.flume.sink.hive.HiveWriter$6.call(HiveWriter.java:316)
at org.apache.flume.sink.hive.HiveWriter$6.call(HiveWriter.java:313)
at org.apache.flume.sink.hive.HiveWriter$9.call(HiveWriter.java:366)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
... 1 more
16/05/15 14:53:39 ERROR flume.SinkRunner: Unable to deliver event. Exception follows.
Further Readings
Hive Streaming
Flume HiveSink
HDP Flume
Apache Flume: Distributed Log Collection for Hadoop - Second Edition (Amazon)
Apache Hive Essentials (Amazon)
02-20-2016
11:43 PM
2 Kudos
Origin: http://henning.kropponline.de/2015/09/27/storm-serialization-with-avro-using-kryo-serializer/ Working with complex data events can be a challenge designing Storm
topologies for real-time data processing. In such cases emitting single
values for multiple and varying event characteristics soon reveals its
limitations. For message serialization Storm leverages the Kryo
serialization framework used by many other projects. Kryo keeps a
registry of serializers being used for corresponding Class types.
Mappings in that registry can be overridden or added making the
framework extendable to diverse type serializations. On the other hand Avro is a very popular "data serialization system"
that bridges between many different programming languages and tools.
While the fact that data objects can be described in JSON makes it
really easy to use, Avro is often being used for its support of schema
evolution. With support for schema evolution the same implementation
(Storm topology) could be capable of reading different versions of the
same data event without adaptation. This makes it a very good fit for
Storm as an intermediary between data ingestion points and data storage
in today's Enterprise Data Architectures.
Storm Enterprise Data Architecture
The example here does not provide complex event samples to illustrate that point, but it gives an end-to-end implementation of a Storm topology where events get sent to a Kafka queue as Avro objects and are processed natively by a real-time processing topology. The example can be found here.
It's a simple Hive Streaming example where stock events are read from a CSV file and sent to Kafka. Stock events are a flat, non-complex data type as already mentioned, but we'll still use them to demo serialization using Avro.
Deserialization in Storm
Before
we look at the beginning, let's start with the end. When we have
everything working properly we should be able to use our defined event
object as such in any bolt that is part of the topology:
Stock stock = (Stock) tuple.getValueByField("myobj_fieldname");
// OR by index //
Stock stock = (Stock) tuple.getValue(0);
As demonstrated we should be
able to cast our object simply from the tuple as it will already be
present in serialized form inside the tuple. Storm will take care of the
serialization for us. Remember Storm internally is using Kryo for
Serialization as described here.
It is using this for all data types in a tuple. To make this work with
our object described in Avro we simply have to register a custom serializer with Storm's Kryo.
The above snippet also implies that if we try to retrieve the data in any other way, for example like this: tuple.getBinary(0), we will receive an error. An exception in such a case could look like this:
2015-09-23 10:52:57 s.AvroStockDataBolt [ERROR] java.lang.ClassCastException: storm_hive_streaming_example.model.Stock cannot be cast to [B
java.lang.ClassCastException: storm_hive_streaming_example.model.Stock cannot be cast to [B
at backtype.storm.tuple.TupleImpl.getBinary(TupleImpl.java:144) ~[storm-core-0.10.0.2.3.0.0-2557.jar:0.10.0.2.3.0.0-2557]
at storm_hive_streaming_example.FieldEmitBolt.execute(FieldEmitBolt.java:34) ~[stormjar.jar:na]
The sample error message clearly states that our object, already deserialized by Storm, simply cannot be cast to a binary. So how do we set things up from the start?
Spout Scheme
Let's
return to the beginning of all, the ingestion of events into a queue
for example. The part responsible for reading an event off a data source, like for example a message broker, is known in Storm as a Spout. Typically we have one spout for a specific data source, rather than single-purpose spouts per topology. Hence a spout needs to be adaptive to the use case and the events being issued. Storm uses so
called "Scheme" to configure the data declaration of receiving and
emitting events by the Spout. The Scheme interface declares the method deserialize(byte[] pojoBytes)
for deserializing the event collected. It returns a list of objects
instead of just one object as one event could potentially be serialized
into several data fields. Here the StockAvroScheme emits the complete Stock object in one field. The second method that needs to be implemented by the Scheme interface is the getOutputFields()
method. This method is responsible for advertising the field definition
to the receiving bolts. As in the implementation below, the stock object gets sent in one field.
public class StockAvroScheme implements Scheme {
private static final Logger LOG = LoggerFactory.getLogger(Stock.class);
// deserializing the message recieved by the Spout
public List<Object> deserialize(byte[] pojoBytes) {
StockAvroSerializer serializer = new StockAvroSerializer(); // Kryo Serializer
Stock stock = serializer.read(null, new Input(pojoBytes), Stock.class);
List<Object> values = new ArrayList<>();
values.add(0, stock);
return values;
}
// defining the output fields of the Spout
public Fields getOutputFields() {
return new Fields(new String[]{ FieldNames.STOCK_FIELD });
}
}
This Scheme can be used as illustrated below by the YAML topology configuration using Storm Flux:
components:
# defines a scheme for the spout to emit a Stock.class object
- id: "stockAvroScheme"
className: "storm_hive_streaming_example.serializer.StockAvroScheme"
# adding the defined stock scheme to the multi-scheme that can be assigned to the spout
- id: "stockMultiScheme"
className: "backtype.storm.spout.SchemeAsMultiScheme"
constructorArgs:
- ref: "stockAvroScheme"
- id: "zkHosts"
className: "storm.kafka.ZkHosts"
constructorArgs:
- "${hive-streaming-example.zk.hosts}"
# configuring the spout to read bytes from Kafka and emit Stock.class
- id: "stockSpoutConfig"
className: "storm.kafka.SpoutConfig"
constructorArgs:
- ref: "zkHosts" # brokerHosts
- "${hive-streaming-example.kafka.topic}" # topic
- "${hive-streaming-example.kafka.zkRoot}" # zkRoot
- "${hive-streaming-example.kafka.spoutId}" # id
properties:
- name: "scheme"
ref: "stockMultiScheme" # use the stock scheme previously defined Last but not least we still need to register our customer serializer with Storm. Registering the Serializer Tuples
are send to Spouts and Bolts running in a separate JVMs either on the
same or on a remote host. In case of sending the tuple it needs to get
serialized and deserialized prior to placing the tuple on the the output
collector. For the serialization Storm uses Kryo Serializer. In
order to use a custom Serializer implementation it needs to get
registered with the Kryo instance being used by Storm. This can be done
as part of the topology configuration. Here is the configuration
definition using Storm Flux:
name: "hive-streaming-example-avro-scheme"
config:
topology.workers: 1
# define serializers being used by tuples de-/serializing values. See http://storm.apache.org/documentation/Serialization.html
topology.kryo.register:
- storm_hive_streaming_example.model.Stock: storm_hive_streaming_example.serializer.StockAvroSerializer
With this registration of the custom Kryo Serializer the AvroStockDataBolt can simply cast the Stock object from the tuple value and emit it to the FieldEmitBolt, which decomposes the Stock instance into the separate fields being used by the HiveBolt. Having the AvroStockDataBolt and FieldEmitBolt would not make sense in a real implementation, as the Scheme could obviously already be configured to do all that - deserialize and emit fields to the HiveBolt. Having these two extra bolts is just for demonstration purposes. Finally, the custom Kryo Serializer, which implements write(Kryo kryo, Output output, Stock object) and read(Kryo kryo, Input input, Class<Stock> type). Having a general implementation for generic Avro types would be ideal.
public class StockAvroSerializer extends Serializer<Stock> {
private static final Logger LOG = LoggerFactory.getLogger(StockAvroSerializer.class);
private Schema SCHEMA = Stock.getClassSchema();
public void write(Kryo kryo, Output output, Stock object) {
DatumWriter<Stock> writer = new SpecificDatumWriter<>(SCHEMA);
ByteArrayOutputStream out = new ByteArrayOutputStream();
BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
try {
writer.write(object, encoder);
encoder.flush();
} catch (IOException e) {
LOG.error(e.toString(), e);
}
IOUtils.closeQuietly(out);
byte[] outBytes = out.toByteArray();
output.writeInt(outBytes.length, true);
output.write(outBytes);
}
public Stock read(Kryo kryo, Input input, Class<Stock> type) {
byte[] value = input.getBuffer();
SpecificDatumReader<Stock> reader = new SpecificDatumReader<>(SCHEMA);
Stock record = null;
try {
record = reader.read(null, DecoderFactory.get().binaryDecoder(value, null));
} catch (IOException e) {
LOG.error(e.toString(), e);
}
return record;
}
}
Further Readings
Storm Serialization
Storm Hive Streaming Example (Github)
Storm Flux
Avro Specification
Kafka Storm Starter (Github)
Kryo Serializable
http://www.confluent.io/blog/stream-data-platform-2/
Simple Example Using Kryo
Storm Blueprints: Patterns for Distributed Realtime Computation (Amazon)
Storm Applied: Strategies for Real-Time Event Processing (Amazon)
02-20-2016
09:54 AM
3 Kudos
Repo Description
Repo Info
Github Repo URL: https://github.com/hkropp/storm-hive-streaming-example
Github account name: hkropp
Repo name: storm-hive-streaming-example
02-09-2016
08:36 PM
1 Kudo
Hi, I can't remember how or if I solved it 😉 but could you try to
configure the Hive View with manual configuration instead of the option
to get the cluster configuration? Let me know if this works. Thanks!
02-09-2016
08:36 PM
1 Kudo
Hi, I can't remember how or if I solved it 😉 but could you try to configure the Hive View with manual configuration instead of the option to get the cluster configuration? Let me know if this works. Thanks!