Created 12-18-2017 02:25 PM
We have been facing an intermittent issue in our QA environment where the PutHDFS processor gets stuck and is unable to release items from the upstream processor's queue. It does not look load-related, as neither the number of messages in the queue nor the queue size was high. We tried to stop and start the processor, but that does not work: after stopping the processor, the start button is no longer shown on PutHDFS. The only fix we currently have is restarting NiFi, which we would like to avoid in the Prod environment.
While the processor was stuck, the UI showed 2 threads on it, although it is configured with the default of 1 concurrent task. To drill down further, we took multiple thread dumps and noticed that one particular thread was always BLOCKED with the same stack trace, and that the trace was linked to the PutHDFS processor.
Stack trace is as below.
"Timer-Driven Process Thread-2" Id=106 BLOCKED on java.io.BufferedInputStream@b7146b3
    at java.io.BufferedInputStream.read(BufferedInputStream.java:336)
    at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
    at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
    -------
    at org.apache.nifi.processors.hadoop.PutHDFS$1.run(PutHDFS.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
Attachments: puthdfs-stuck.jpg, thread-dump.txt, thread-dump-2.txt, thread-dump-3.txt
I am attaching a UI screenshot taken while the processor was stuck, along with the thread dumps, for reference.
Can someone please help us find the root cause?
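For anyone repeating the comparison of several thread dumps described above, here is a minimal sketch that lists threads which are BLOCKED with the same top stack frames in every dump. It assumes each thread section starts with a header line shaped like the one quoted in this post ("name" Id=N STATE ...) and that sections are separated by blank lines; the function names and the `depth` parameter are hypothetical and not part of NiFi.

```python
import re
from typing import Dict, List

# Header line in the shape quoted above, e.g.
#   "Timer-Driven Process Thread-2" Id=106 BLOCKED on java.io.BufferedInputStream@b7146b3
HEADER = re.compile(r'^"(?P<name>[^"]+)" Id=(?P<id>\d+) (?P<state>[A-Z_]+)')


def blocked_threads(dump_text: str) -> Dict[str, List[str]]:
    """Map each BLOCKED thread name to its stack frames (top frame first)."""
    result: Dict[str, List[str]] = {}
    current = None
    for raw in dump_text.splitlines():
        line = raw.strip()
        header = HEADER.match(line)
        if header:
            # Only track the section if the thread is BLOCKED.
            current = header.group("name") if header.group("state") == "BLOCKED" else None
            if current is not None:
                result[current] = []
        elif current is not None and line.startswith("at "):
            result[current].append(line[3:])
        elif not line:
            current = None  # a blank line ends the current thread section
    return result


def persistently_blocked(dumps: List[str], depth: int = 3) -> Dict[str, List[str]]:
    """Threads that are BLOCKED with the same top `depth` frames in every dump."""
    if not dumps:
        return {}
    per_dump = [blocked_threads(d) for d in dumps]
    stuck: Dict[str, List[str]] = {}
    for name, frames in per_dump[0].items():
        top = frames[:depth]
        if all(name in d and d[name][:depth] == top for d in per_dump[1:]):
            stuck[name] = top
    return stuck
```

Passing the text of each dump file (for example the three attached dumps) to `persistently_blocked` returns only the threads that never moved between dumps, which is the pattern reported here for the PutHDFS thread.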
Created 12-19-2017 12:22 AM
I have been investigating this issue for more than a few days, but so far I have not been able to reproduce it in any environment to which I have access. What you're seeing is the end result of a set of conditions that cause the JAAS configuration and the previously authenticated principal to be contextually lost, which leads Krb5LoginModule.promptForName to prompt interactively for the principal name.
If you wouldn't mind sharing details about your configuration, we can work together to diagnose what's causing PutHDFS to fail. Could you please answer the following questions:
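There are a few settings you can add or change to help provide more information for debugging the issue: an environment variable for the NiFi process, a logger entry in logback.xml, and a JVM argument in bootstrap.conf, respectively.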
export HADOOP_JAAS_DEBUG=true
<logger name="org.apache.hadoop.security" level="DEBUG"/>
java.arg.100=-Dsun.security.krb5.debug=true
Please provide the nifi-app and nifi-bootstrap logs after restarting NiFi and observing the stuck threads, and I'll take a look at them.
Created 12-21-2017 06:46 AM
Thank you, @Jeff Storck, for the reply. Please see my responses inline below.
What values are set for the ticket lifetime and ticket renewal lifetime in the KDC for the principal you have set in PutHDFS?
What values are set for the ticket lifetime and ticket renewal lifetime in the krb5.conf that you have set for NiFi in Ambari?
Does this issue occur consistently? Would it seem to happen around the time that the principal's kerberos ticket would be getting renewed, some time between 80% and 100% of the ticket lifetime?
How often are files sent to PutHDFS incoming queue? On a regular interval, or is it sporadic?
What is the "Relogin Period" property set to in PutHDFS' configuration?
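To check whether a stuck thread lines up with the renewal window mentioned in the third question, the window can be computed from the ticket's issue time and lifetime. This is a minimal sketch: the 80% figure is taken from the question above, while the function names, the issue time, and the 10-hour lifetime are hypothetical values chosen only for illustration.

```python
from datetime import datetime, timedelta
from typing import Tuple


def renewal_window(issued_at: datetime, lifetime: timedelta,
                   start_fraction: float = 0.8) -> Tuple[datetime, datetime]:
    """Return (start, end) of the period between 80% and 100% of the ticket lifetime."""
    return issued_at + lifetime * start_fraction, issued_at + lifetime


def in_renewal_window(event: datetime, issued_at: datetime, lifetime: timedelta) -> bool:
    """True if `event` falls inside the renewal window of the given ticket."""
    start, end = renewal_window(issued_at, lifetime)
    return start <= event <= end


# Hypothetical ticket: issued at 04:00 with a 10-hour lifetime.
issued = datetime(2017, 12, 18, 4, 0)
lifetime = timedelta(hours=10)
start, end = renewal_window(issued, lifetime)
```

Comparing the timestamp of the first thread dump that shows the BLOCKED thread against `start` and `end` gives a quick indication of whether the hang coincides with ticket renewal.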
Unfortunately, we don't have the nifi-app and nifi-bootstrap logs from the last time the issue occurred in the QA env. However, we also noticed the same behavior in the Dev env with the FetchHDFS processor; the thread dumps and logs from that time are attached: dev-fetchhdfs-stuck.jpg, thread-dump-1.txt, thread-dump-2.txt
Created 12-26-2017 02:55 PM
Hello @Tarun Kumar. Set this system property 'javax.security.auth.useSubjectCredsOnly' to true.
To configure it this way in NiFi you can add this line, for example, to your nifi/conf/bootstrap.conf file.
java.arg.101=-Djavax.security.auth.useSubjectCredsOnly=true
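When the same change has to be rolled out to several environments (QA, Dev, Prod), it can help to verify that the line is actually present and not commented out. The following is a minimal sketch that works on the text of a bootstrap.conf file; the function names are hypothetical, and it only assumes the `java.arg.<suffix>=<value>` format shown above with `#` starting a comment line.

```python
from typing import Dict

WANTED = "-Djavax.security.auth.useSubjectCredsOnly=true"


def jvm_args(bootstrap_text: str) -> Dict[str, str]:
    """Collect java.arg.* entries from bootstrap.conf text, skipping comments."""
    args: Dict[str, str] = {}
    for raw in bootstrap_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, since the value itself contains '='.
        key, value = line.split("=", 1)
        key = key.strip()
        if key.startswith("java.arg."):
            args[key] = value.strip()
    return args


def subject_creds_only_enabled(bootstrap_text: str) -> bool:
    """True if some active java.arg.* entry sets useSubjectCredsOnly to true."""
    return WANTED in jvm_args(bootstrap_text).values()
```

Reading nifi/conf/bootstrap.conf and passing its contents to `subject_creds_only_enabled` should return True once the line has been added; NiFi still needs a restart for the JVM argument to take effect.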
Created 12-26-2017 03:28 PM
Thank you, @jwitt. Could you please confirm whether this is the recommended fix for the stuck HDFS processors in NiFi discussed in this thread?
Please also share any relevant information or links that explain how this setting relates to the issue.
Created 12-26-2017 03:43 PM
Yes, this solves the original issue of this thread (promptForName). What is happening is that the JDK/JRE security code allows the search for other methods of obtaining the principal when a failure has occurred and a retry is being blocked, most likely due to insufficient time. We've spent a considerable amount of time debugging this condition.
The link to the system property that explains its meaning/role is here https://docs.oracle.com/javase/8/docs/technotes/guides/security/jgss/single-signon.html. Specifically read the 'Exceptions to the Model' case where this property is described.
Setting it ensures the JDK/JRE does not attempt any methods or mechanisms other than the ones we have said we want. Specifically, it avoids the scenario where the JDK/JRE tries to prompt the user for a name at the command prompt, which would obviously never work and, worse yet, leaves our thread stuck until a restart.
So, yes, add this system property and you should be in far better shape with regard to the prompt for name issue.
Created 12-27-2017 01:29 PM
Thank you very much, @jwitt, for the useful insights.
Created 02-22-2018 05:17 PM
To follow up on your question, with the release of HDF 3.1, the issues with promptForName/stuck threads in Hadoop components should be resolved. In addition to the property that @jwitt mentioned (javax.security.auth.useSubjectCredsOnly=true), several code changes were made to how HDFS/HBase/Hive components in NiFi acquire a UGI.