hdfs balancer is getting failed after 30 mins in ambari 2.2.0.

Re: hdfs balancer is getting failed after 30 mins in ambari 2.2.0.

Mentor

@Saurabh Kumar tagging an expert again :). @Chris Nauroth I know there was an alert about HA and the balancer in an earlier release of Ambari. The document in my original response lists all the properties that need to be removed. What else do you think could be causing this?

Re: hdfs balancer is getting failed after 30 mins in ambari 2.2.0.

Guru

@Neeraj Sabharwal: Thanks a lot for your help. I read the JIRA and the bug doc, but I noticed the issue was reported in Ambari 2.1.2 and resolved in Ambari 2.2.0. I am on Ambari 2.2.0 and I am still facing this error.

Re: hdfs balancer is getting failed after 30 mins in ambari 2.2.0.

Contributor

@Neeraj Sabharwal

This is still happening with Ambari 2.2.1.1.

Re: hdfs balancer is getting failed after 30 mins in ambari 2.2.0.

Contributor

@Neeraj Sabharwal @Saurabh Kumar I am also having the same issue on HDP 2.4 with Ambari 2.2.2.0.

I am trying to put 30 TB on HDFS, but it keeps filling the DataNode where I run the hdfs put command. I need the HDFS balancer working.

Re: hdfs balancer is getting failed after 30 mins in ambari 2.2.0.

Expert Contributor
@Saurabh Kumar

@Sushil Saxena

I am facing the same issue. I have HDP 2.4, Ambari 2.2.2.0, JDK 1.8.

How did you resolve this issue?

Re: hdfs balancer is getting failed after 30 mins in ambari 2.2.0.

New Contributor

The timeout for the rebalance operation is hard-wired in the stack definition (the value used is the one set for the NameNode command script, which is 1800 s). When the rebalance is issued, the server increases this value by 10 more minutes to give the operation a chance to finish.

If the rebalance takes longer than that, the server times out the operation by killing the process (some resources created by the balancer remain on HDFS, however) and eventually retries the command. When the command is issued again, whether retried by the server or triggered manually, the balancer concludes that another balancer is running because it finds /system/balancer.id on HDFS.
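To check whether a killed run left that file behind, something like the following Python sketch may help. It only wraps the standard `hdfs dfs -test -e` and `hdfs dfs -rm` shell commands. The function names and the injectable `run` parameter are my own, for illustration. Removing the file as a cleanup step is an assumption on my part, not something confirmed in this thread, and it should only be done after verifying that no balancer process is actually still running.

```python
import subprocess

# Path named in this thread as the marker the balancer looks for.
BALANCER_ID_PATH = "/system/balancer.id"


def balancer_id_exists(run=subprocess.run):
    """Return True if the balancer marker file is present on HDFS."""
    # `hdfs dfs -test -e <path>` exits with status 0 when the path exists.
    result = run(["hdfs", "dfs", "-test", "-e", BALANCER_ID_PATH])
    return result.returncode == 0


def remove_stale_balancer_id(run=subprocess.run):
    """Delete the marker file if it exists.

    Returns True if the file was found and the delete command succeeded,
    False if there was nothing to delete or the delete failed.
    ASSUMPTION: the caller has already confirmed no balancer is running.
    """
    if not balancer_id_exists(run):
        return False
    result = run(["hdfs", "dfs", "-rm", BALANCER_ID_PATH])
    return result.returncode == 0
```

Run as the HDFS superuser on a node with the `hdfs` client, calling `remove_stale_balancer_id()` would attempt the cleanup; calling `balancer_id_exists()` alone is the non-destructive check.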

The problem is addressed here: https://issues.apache.org/jira/browse/AMBARI-20175

Until the fix is available, the only workaround is to raise the timeout in the stack definition to a value large enough that the rebalance cannot time out. The Ambari server must be restarted for this to take effect, and all NameNode operation timeouts will be set to the new value.
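As a rough illustration of that edit, here is a Python sketch that rewrites the NameNode command script timeout in a stack definition file. It assumes the timeout lives in the HDFS service's metainfo.xml as a `<timeout>` element under the NAMENODE component's `<commandScript>`; that layout, the function name `set_namenode_timeout`, and the example value 86400 are my assumptions, so check them against your own stack files (and back the file up) before using anything like this.

```python
import xml.etree.ElementTree as ET


def set_namenode_timeout(metainfo_xml: str, new_timeout_s: int) -> str:
    """Return the XML text with the NAMENODE commandScript timeout replaced.

    ASSUMPTION: the file has <component> elements with a <name> child, and
    the timeout sits at component/commandScript/timeout.
    Note: ElementTree drops XML comments when parsing, so the output will
    not preserve comments from the original file.
    """
    root = ET.fromstring(metainfo_xml)
    changed = 0
    for component in root.iter("component"):
        if component.findtext("name") != "NAMENODE":
            continue
        # Only the component-level command script, not custom commands.
        for timeout in component.iterfind("commandScript/timeout"):
            timeout.text = str(new_timeout_s)
            changed += 1
    if changed == 0:
        raise ValueError("no NAMENODE commandScript timeout found")
    return ET.tostring(root, encoding="unicode")
```

For example, reading the file's text, passing it through `set_namenode_timeout(text, 86400)`, and writing the result back would set the timeout to one day; the Ambari server restart mentioned above is still required afterwards.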