Yarn Diagnostics Collection 0.0.0.0:8088/conf HTTP Error 401, kerberos enabled

Hello community,
 
In a six-node cluster with Kerberos enabled, we get intermittent errors when collecting YARN application diagnostic data. The error occurs on only one of the nodes (node3); when the request is executed on node4, there is no problem.
 
...
+ PROCESS_DIR_NAME=4727-YARN-yarn-RESOURCEMANAGER-5f5c0e4720198a69ec479dbba122f003-YarnApplicationDiagnosticsCollection
+ DIAGNOSTICS_DUMP_DIR=/tmp/4727-YARN-yarn-RESOURCEMANAGER-5f5c0e4720198a69ec479dbba122f003-YarnApplicationDiagnosticsCollection
+ COLLECT_APP_DATA_ARGS='--app_ids application_1519381726056_0027 --output_dir /tmp/4727-YARN-yarn-RESOURCEMANAGER-5f5c0e4720198a69ec479dbba122f003-YarnApplicationDiagnosticsCollection/diagnostics --hadoop_conf_dir /var/run/cloudera-scm-agent/process/4727-YARN-yarn-RESOURCEMANAGER-5f5c0e4720198a69ec479dbba122f003-YarnApplicationDiagnosticsCollection'
+ /usr/lib64/cmf/service/../agent/build/env/bin/python /usr/lib64/cmf/service/yarn/../support/collect_jobs_stats/collect_jobs_stats.py --app_ids application_1519381726056_0027 --output_dir /tmp/4727-YARN-yarn-RESOURCEMANAGER-5f5c0e4720198a69ec479dbba122f003-YarnApplicationDiagnosticsCollection/diagnostics --hadoop_conf_dir /var/run/cloudera-scm-agent/process/4727-YARN-yarn-RESOURCEMANAGER-5f5c0e4720198a69ec479dbba122f003-YarnApplicationDiagnosticsCollection
2018-02-27 14:48:32,748 CRITICAL GSSAPI Error: Unspecified GSS failure.  Minor code may provide more information/Cannot determine realm for numeric host address
2018-02-27 14:48:32,749 ERROR Fail to fetch 'http://0.0.0.0:8088/conf': HTTP Error 401: Authentication required
Traceback (most recent call last):
 File "/usr/lib64/cmf/service/yarn/../support/collect_jobs_stats/collect_jobs_stats.py", line 356, in <module>
   sys.exit(main(sys.argv))
 File "/usr/lib64/cmf/service/yarn/../support/collect_jobs_stats/collect_jobs_stats.py", line 351, in main
   gatherer = MRJobInfoGatherer(app_ids.split(','), output_dir, hadoop_conf_dir)
 File "/usr/lib64/cmf/service/yarn/../support/collect_jobs_stats/collect_jobs_stats.py", line 79, in __init__
   self._yarn_conf = self._load_yarn_conf()
 File "/usr/lib64/cmf/service/yarn/../support/collect_jobs_stats/collect_jobs_stats.py", line 83, in _load_yarn_conf
   data = self._read_rm_url('/conf')
 File "/usr/lib64/cmf/service/yarn/../support/collect_jobs_stats/collect_jobs_stats.py", line 105, in _read_rm_url
   raise ex
urllib2.HTTPError: HTTP Error 401: Authentication required': HTTP Error 401: Authentication required
...
 
Scenario
 
  • 6-node cluster
  • Kerberos/Sentry enabled
  • 2 ResourceManagers (one on node3, the other on node4)
  • Bind ResourceManager to Wildcard Address = TRUE
 
Analysis
 
The error is in the script "/usr/lib64/cmf/service/support/collect_jobs_stats/collect_jobs_stats.py". It fails in the call "res = urlopen_with_timeout(url, timeout=URL_TIMEOUT).read()" when the url variable equals "http://0.0.0.0:8088/conf". Comparing the contents of "self._rm_addresses" on the two nodes shows the difference: on node3 the list starts with the element "http://0.0.0.0:8088", whereas on node4 the same list starts with "http://node4.xxxx.xxx:8088".
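
The GSSAPI message in the log ("Cannot determine realm for numeric host address") suggests that any numeric entry in the RM address list will fail Kerberos negotiation, not only 0.0.0.0. A minimal sketch for flagging such entries follows. It is written for Python 3 (the agent script itself is Python 2), and the helper names and the example.com host are hypothetical, not part of collect_jobs_stats.py.

```python
import ipaddress
from urllib.parse import urlsplit


def is_numeric_host(url):
    """Return True if the URL's host is an IP literal rather than a name."""
    host = urlsplit(url).hostname
    if host is None:
        return False
    try:
        ipaddress.ip_address(host)
    except ValueError:
        # Not parseable as IPv4/IPv6, so treat it as a hostname.
        return False
    return True


def numeric_rm_addresses(addresses):
    """Return the RM addresses whose host part is a numeric IP."""
    return [addr for addr in addresses if is_numeric_host(addr)]


# Hypothetical list resembling what was observed on node3.
rm_addresses = ["http://0.0.0.0:8088", "http://node4.example.com:8088"]
print(numeric_rm_addresses(rm_addresses))
```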
 
 
Method that fails in the script collect_jobs_stats.py:
 
 def _read_rm_url(self, path):
   """Return data read from an RM web url. Not thread-safe."""
   res = None
   active_rm = None
 
   for rm_addr in self._rm_addresses:
     url = urlparse.urljoin(rm_addr, path)
     # In case of a HTTPS request, the parsed url might contain dirty
     # unicode character(s), say of the format - u'https://hostname'.
     # Converting it to a string gives us the right format
     # - 'https://hostname'
     url = str(url)
     try:
       res = urlopen_with_timeout(url, timeout=URL_TIMEOUT).read()  # <===== this operation fails
       if not res.startswith("This is standby RM."):
         active_rm = rm_addr
         break
     except urllib2.HTTPError, ex:
       LOG.error("Fail to fetch '%s': %s" % (url, ex))
       raise ex
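
The loop above re-raises on the first HTTP error, so a bad first entry hides the remaining addresses. One possible alternative, sketched here in Python 3 and not taken from the Cloudera script, records the error and moves on to the next ResourceManager. The function name, the injectable fetch parameter, and the example.com host are assumptions made for illustration.

```python
from urllib.parse import urljoin

STANDBY_PREFIX = "This is standby RM."


def read_from_active_rm(rm_addresses, path, fetch):
    """Try each RM address in order; return (address, body) of the active RM.

    `fetch` is any callable taking a URL and returning the response body as
    a string. I/O errors (urllib's HTTPError is an OSError subclass) are
    collected and the next address is tried instead of aborting.
    """
    errors = []
    for rm_addr in rm_addresses:
        url = urljoin(rm_addr, path)
        try:
            body = fetch(url)
        except OSError as ex:
            errors.append((url, str(ex)))
            continue
        if not body.startswith(STANDBY_PREFIX):
            return rm_addr, body
    raise RuntimeError("no active RM answered %r; errors: %r" % (path, errors))


# Stand-in fetcher that mimics the 401 seen for the wildcard address.
def fake_fetch(url):
    if "0.0.0.0" in url:
        raise OSError("HTTP Error 401: Authentication required")
    return "<configuration/>"


addresses = ["http://0.0.0.0:8088", "http://node4.example.com:8088"]
print(read_from_active_rm(addresses, "/conf", fake_fetch))
```

Whether skipping a failed address is acceptable depends on why the script aborts early in the first place; this only illustrates the control flow.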
How to simulate
 
This error can be reproduced when Kerberos is enabled and "Bind ResourceManager to Wildcard Address" is enabled.
 
Code to reproduce the error (run kinit as yarn first):
 
#!/usr/lib64/cmf/agent/build/env/bin/python
from cmf.url_util import urlopen_with_timeout
res = urlopen_with_timeout('http://0.0.0.0:8088/conf', timeout=60).read()
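
For comparison, a possible workaround to experiment with is to swap the wildcard host for the local fully qualified name before making the request. The sketch below is Python 3 and uses only the standard library; the function name and the node3.example.com host are hypothetical, and it assumes socket.getfqdn() returns the name the ResourceManager's HTTP principal is registered under, which may not hold on every host.

```python
import socket
from urllib.parse import urlsplit, urlunsplit

# Addresses that mean "all interfaces" rather than a reachable host name.
WILDCARD_HOSTS = {"0.0.0.0", "::"}


def replace_wildcard_host(url, fqdn=None):
    """Return url with a wildcard host replaced by fqdn (default: local FQDN)."""
    parts = urlsplit(url)
    if parts.hostname not in WILDCARD_HOSTS:
        return url
    host = fqdn or socket.getfqdn()
    netloc = host if parts.port is None else "%s:%d" % (host, parts.port)
    return urlunsplit(parts._replace(netloc=netloc))


print(replace_wildcard_host("http://0.0.0.0:8088/conf", fqdn="node3.example.com"))
```

The rewritten URL could then be passed to urlopen_with_timeout in the reproduction script to check whether the 401 disappears.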
 
 
I believe the URL 0.0.0.0... appears because "Bind ResourceManager to Wildcard Address" is enabled. What negative consequences would there be if I disabled this flag, and is there another workaround for this situation?
 
Best regards,
Nelson Antunes