
Ambari install failing because unable to obtain HDP version on some nodes during install

Rising Star

I am doing a clean install of HDP 2.3.4.7 using Ambari 2.2.1.1. By default, the HDP repo URL does not point to the latest version of HDP right now, which is 2.3.4.7. I conferred with @Neeraj Sabharwal to determine whether it is feasible to change that URL to 2.3.4.7 and do a clean install of the latest version directly, avoiding the need to install the earlier 2.3.4 release and then immediately upgrade. He said yes, so that's what I did. I then encountered errors in the final "Install, Start, and Test" step, where all the services are installed on the nodes. I was seeing errors from the scripts trying to obtain the HDP versions, so I wondered whether my URL tweak was actually causing a problem. In the end it was not.

Here is a sample error trace:

stderr:
<script id="metamorph-39638-start" type="text/x-placeholder"></script>Traceback (most recent call last):
  File "/var/lib/ambari-agent/cache/stacks/HDP/2.0.6/hooks/after-INSTALL/scripts/hook.py", line 37, in <module>
    AfterInstallHook().execute()
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 219, in execute
    method(env)
  File "/var/lib/ambari-agent/cache/stacks/HDP/2.0.6/hooks/after-INSTALL/scripts/hook.py", line 31, in hook
    setup_hdp_symlinks()
  File "/var/lib/ambari-agent/cache/stacks/HDP/2.0.6/hooks/after-INSTALL/scripts/shared_initialization.py", line 44, in setup_hdp_symlinks
    hdp_select.select_all(version)
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/hdp_select.py", line 122, in select_all
    Execute(command, only_if = only_if_command)
  File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", line 154, in __init__
    self.env.run()
  File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 158, in run
    self.run_action(resource, action)
  File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 121, in run_action
    provider_action()
  File "/usr/lib/python2.6/site-packages/resource_management/core/providers/system.py", line 238, in action_run
    tries=self.resource.tries, try_sleep=self.resource.try_sleep)
  File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 70, in inner
    result = function(command, **kwargs)
  File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 92, in checked_call
    tries=tries, try_sleep=try_sleep)
  File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 140, in _call_wrapper
    result = _call(command, **kwargs_copy)
  File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 291, in _call
    raise Fail(err_msg)
resource_management.core.exceptions.Fail: Execution of 'ambari-sudo.sh /usr/bin/hdp-select set all `ambari-python-wrap /usr/bin/hdp-select versions | grep ^2.3 | tail -1`' returned 1. Traceback (most recent call last):
  File "/usr/bin/hdp-select", line 378, in <module>
    printVersions()
  File "/usr/bin/hdp-select", line 235, in printVersions
    result[tuple(map(int, versionRegex.split(f)))] = f
ValueError: invalid literal for int() with base 10: 'hadoop'
ERROR: set command takes 2 parameters, instead of 1
 
usage: hdp-select [-h] [<command>] [<package>] [<version>]
 
Set the selected version of HDP.
 
positional arguments:
  <command>   One of set, status, versions, or packages
  <package>   the package name to set
  <version>   the HDP version to set
 
optional arguments:
  -h, --help  show this help message and exit
  -r, --rpm-mode  if true checks if there is symlink exists and creates the symlink if it doesn't
 
Commands:
  set      : set the package to a specified version
  status   : show the version of the package
  versions : show the currently installed versions
  packages : show the individual package names

Being a Python guy, I dug into /usr/bin/hdp-select and found where the error was occurring. Note that it is the embedded "ambari-python-wrap /usr/bin/hdp-select versions | grep ^2.3 | tail -1" sub-command that fails; because it produces no output, the outer "hdp-select set all" call is left with only one parameter, which explains the second error message in the trace. There are two underlying issues:

  1. This function is not very bulletproof and is prone to errors
  2. The error occurs because an unexpected directory exists in /usr/hdp, where this function looks for installed versions to parse

The function in question is:

# Print the installed packages
def printVersions():
  result = {}
  for f in os.listdir(root):
    if f not in [".", "..", "current", "share", "lost+found"]:
      result[tuple(map(int, versionRegex.split(f)))] = f 
  keys = result.keys()
  keys.sort()
  for k in keys:
     print result[k]

The problem is that if the scan finds any entry other than the excluded ones whose name does not split into purely numeric pieces, the function raises an exception when it tries to map the split output to int(); in my case the offending value was "hadoop". On some of my nodes, the /usr/hdp directory looks like this (which does NOT cause the function to throw an exception):

[nn01.qa] out: drwxr-xr-x. 13 root root  4096 Apr 23 07:09 2.3.4.7-4
[nn01.qa] out: drwxr-xr-x.  2 root root  4096 Apr 23 16:32 current
[nn01.qa] out: drwx------.  2 root root 16384 Apr 19 16:09 lost+found

But others look like this (note the presence of an unexpected hadoop directory, which was causing the exception):

[nn02.qa] out: drwxr-xr-x. 13 root root  4096 Apr 23 07:09 2.3.4.7-4
[nn02.qa] out: drwxr-xr-x.  2 root root  4096 Apr 23 16:32 current
[nn02.qa] out: drwxr-xr-x.  4 root root  4096 Apr 23 16:12 hadoop
[nn02.qa] out: drwx------.  2 root root 16384 Apr 19 16:09 lost+found
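
To see why that single extra entry breaks the scan, here is a minimal standalone reproduction. The versionRegex pattern below is my assumption about what hdp-select defines internally; the key point is simply that "hadoop" does not split into numeric pieces:

# Standalone sketch reproducing the failure; versionRegex here is an assumed
# stand-in for the pattern hdp-select defines at module level.
import re

versionRegex = re.compile('[-.]')   # splits "2.3.4.7-4" into ['2','3','4','7','4']

for name in ["2.3.4.7-4", "hadoop"]:
  try:
    print name, "->", tuple(map(int, versionRegex.split(name)))
  except ValueError as e:
    # "hadoop" splits into ['hadoop'], and int('hadoop') raises the exact
    # ValueError shown in the stderr trace above.
    print name, "->", "ValueError:", e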

I worked around the problem by changing that function as follows (adding exception handling), copying the update to all my nodes, and retrying the install. It completed successfully after that.

# Print the installed packages
def printVersions():
  result = {}
  for f in os.listdir(root):
    if f not in [".", "..", "current", "share", "lost+found"]:
      try:
        result[tuple(map(int, versionRegex.split(f)))] = f
      except:
        # quick hack: silently skip any entry that does not parse as a version
        pass
  keys = result.keys()
  keys.sort()
  for k in keys:
     print result[k]

I don't recommend this code as a final fix. Instead of a blacklist condition, I would get rid of that code and use a regex filter to verify that each directory name looks like "^\d+\.\d+.*", or something similar that requires it to at least resemble the expected version format of 2.3.4.7-n. Then it does not matter how many unexpected entries are found; you just gracefully skip over them.
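
Something along these lines is what I have in mind. It is only a sketch that keeps the structure of the original function and swaps the blacklist for a version-shaped whitelist; root and versionRegex are the module-level names hdp-select already defines:

# Print the installed packages (whitelist sketch; root and versionRegex are the
# existing module-level names in hdp-select)
import os
import re

looksLikeVersion = re.compile(r'^\d+\.\d+.*')   # e.g. matches "2.3.4.7-4"

def printVersions():
  result = {}
  for f in os.listdir(root):
    # Only consider entries that resemble a version string; "current", "share",
    # "lost+found", stray "hadoop" dirs, etc. are skipped gracefully.
    if looksLikeVersion.match(f):
      result[tuple(map(int, versionRegex.split(f)))] = f
  keys = result.keys()
  keys.sort()
  for k in keys:
    print result[k]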

One interesting point about this bug... The first time I ran the "Install, Start, and Test" step in the Ambari GUI, all 8 nodes failed: one with a red failure and the others orange, stopped due to warnings. I guess when one fails, the others abort as well with warnings. Just for the heck of it, I then clicked the "Retry" button and one node completed (turned blue), but another failed and the rest again stopped with warnings. I thought that was weird, so I kept retrying over and over. I got to the point where 4 of the 8 turned blue, but the rest would never complete no matter how many times I retried. When I looked at the contents of all the nodes that made it to blue, they did NOT have a "hadoop" subdir in /usr/hdp, which is why the function above did not raise an exception. It seems like, at some stage of the install, /usr/hdp/hadoop gets created when certain applications need to be installed there and, if you have to retry after that, you are dead. However, after my fixed code was deployed and I retried, all nodes installed completely, and EVERY node now has a /usr/hdp/hadoop directory.

@Ryan Chapin

7 REPLIES

Re: Ambari install failing because unable to obtain HDP version on some nodes during install

Yes, that's a known problem. Some people have run into trouble by accidentally creating files in /usr/hdp, after which "hdp-select versions" doesn't work, and the output of that command is used as input to other commands, and so on. So one either has to be careful to put nothing else in /usr/hdp, or edit the Ambari Python code. By the way, as far as I know /usr/hdp/hadoop is not supposed to be there, and no module is supposed to use it. So I think in your case that is the primary bug, and the weak hdp-select code was merely triggered into producing the error. Can you list /usr/hdp/hadoop and find out what's there? The cluster is supposed to work without it. As a test, can you try moving /usr/hdp/hadoop to /tmp and restarting the cluster?


Re: Ambari install failing because unable to obtain HDP version on some nodes during install

Rising Star

Good point, and thanks for the quick response. I believe the directories are there because I failed to set up some configs completely in the earlier steps of the install. For example, my dfs.namenode.name.dir is set to the value below, which is bogus. I am going to immediately convert to NN HA anyway, so all of this gets fixed. But, as you noted, the /usr/hdp/hadoop dir appears as one of the entries below, contributing to the error.

/usr/hdp/hadoop/hdfs/namenode,/tmp/hadoop/hdfs/namenode,/zk/hadoop/hdfs/namenode,/qjm/hadoop/hdfs/namenode,/opt/hadoop/hdfs/namenode,/var/hadoop/hdfs/namenode

That sure seems like a strange concoction of directories for defaults. In any case, it all has to change for the HA config anyway. I was just trying to get to a basic base install; the next phase is a complete config review. This is why I am building this out on 8 VMs first before building it in PROD on real HW. Working through the details on this test set first! :)


Re: Ambari install failing because unable to obtain HDP version on some nodes during install

By default, Ambari lists all mount points (partitions) as NN directories on the masters and as data directories on the DNs. So I guess you mounted /usr/hdp as a separate partition, right? You need to sort that out quickly, and be careful now, because HDFS may already have populated some of those directories with its files. So check your DNs as well.


Re: Ambari install failing because unable to obtain HDP version on some nodes during install

Rising Star

Ya, it was fun undoing that. Ultimately, I just blew away the data and namenode dirs and recreated them from scratch, after first removing all the undesired partitions from the HDFS configs. IMO, the Ambari setup wizard should make NO guesses on critical configurations like this. It should REQUIRE you to enter the dfs.namenode.name.dir and dfs.datanode.data.dir lists of directories before you can proceed. I can think of a few other path-based parameters that it should NOT try to default for you. Anyway, thanks for the response!


Re: Ambari install failing because unable to obtain HDP version on some nodes during install

New Contributor

Agreed, this just caused me a major amount of grief as well. The "educated guessing" done for these locations has required manual intervention in every cluster I have configured (several dozen, across several releases/topologies). For others reading this: several services will also default to only a single partition, unlike HDFS, which tries to spread across multiple. In this case I saw YARN default several locations to /usr/hdp/ and an hbase tmp directory pointing to /usr/hdp/var/lib/..., which is really odd.

Very glad I found this thread!


Re: Ambari install failing because unable to obtain HDP version on some nodes during install

Explorer

Thank you Mark! I found the same issue; it came up when trying to restart an install that had failed partway through due to space limitations in /usr. I'm not a Python coder, but this is what I used for the printVersions piece of hdp-select:

# Print the installed packages
def printVersions():
  result = {}
  for f in os.listdir(root):
    matchObj = re.match(r'^\d\..*', f, re.M|re.I)
    if matchObj:
      if f not in [".", "..", "current", "share", "lost+found"]:
        try:
          result[tuple(map(int, versionRegex.split(f)))] = f
        except ValueError:
          print ("ERROR: Unexpected file/directory found in %s: %s" % (root, f))
          sys.exit(1)
  keys = result.keys()
  keys.sort()
  for k in keys:
    print result[k]

Hopefully someone else will find that useful (or suggest something better ;)

Tom


Re: Ambari install failing because unable to obtain HDP version on some nodes during install

Expert Contributor

This error is a huge problem. It needs to be patched, and a warning check should be in place BEFORE starting an install.