Support Questions

Find answers, ask questions, and share your expertise

HDFS Balancer exits without balancing

avatar
Contributor

Command ran through shell script:

....Logging
sudo -u hdfs -b hdfs balancer -threshold 5
....

Log: The Balance exits successfully without balancing.

17/05/26 16:38:51 INFO balancer.Balancer: Using a threshold of 5.0
17/05/26 16:38:51 INFO balancer.Balancer: namenodes  = [hdfs://belongcluster1]
17/05/26 16:38:51 INFO balancer.Balancer: parameters = Balancer.BalancerParameters [BalancingPolicy.Node, threshold = 5.0, max idle iteration = 5, #excluded nodes = 0, #included nodes = 0, #source nodes = 0, #blockpools = 0, run during upgrade = false]
17/05/26 16:38:51 INFO balancer.Balancer: included nodes = []
17/05/26 16:38:51 INFO balancer.Balancer: excluded nodes = []
17/05/26 16:38:51 INFO balancer.Balancer: source nodes = []
Time Stamp               Iteration#  Bytes Already Moved  Bytes Left To Move  Bytes Being Moved
17/05/26 16:38:53 INFO balancer.KeyManager: Block token params received from NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
17/05/26 16:38:53 INFO block.BlockTokenSecretManager: Setting block keys
17/05/26 16:38:53 INFO balancer.KeyManager: Update block keys every 2hrs, 30mins, 0sec
17/05/26 16:38:53 INFO balancer.Balancer: dfs.balancer.movedWinWidth = 5400000 (default=5400000)
17/05/26 16:38:53 INFO balancer.Balancer: dfs.balancer.moverThreads = 1000 (default=1000)
17/05/26 16:38:53 INFO balancer.Balancer: dfs.balancer.dispatcherThreads = 200 (default=200)
17/05/26 16:38:53 INFO balancer.Balancer: dfs.datanode.balance.max.concurrent.moves = 5 (default=5)
17/05/26 16:38:53 INFO balancer.Balancer: dfs.balancer.getBlocks.size = 2147483648 (default=2147483648)
17/05/26 16:38:53 INFO balancer.Balancer: dfs.balancer.getBlocks.min-block-size = 10485760 (default=10485760)
17/05/26 16:38:53 INFO block.BlockTokenSecretManager: Setting block keys
17/05/26 16:38:53 INFO balancer.Balancer: dfs.balancer.max-size-to-move = 10737418240 (default=10737418240)
17/05/26 16:38:53 INFO balancer.Balancer: dfs.blocksize = 134217728 (default=134217728)
17/05/26 16:38:53 INFO net.NetworkTopology: Adding a new node: /default-rack/58.XXX.144.YYY:50010
17/05/26 16:38:53 INFO net.NetworkTopology: Adding a new node: /default-rack/58.XXX.144.YYY:50010
17/05/26 16:38:53 INFO net.NetworkTopology: Adding a new node: /default-rack/58.XXX.145.YY:50010
17/05/26 16:38:53 INFO net.NetworkTopology: Adding a new node: /default-rack/58.XXX.145.YY:50010
17/05/26 16:38:53 INFO net.NetworkTopology: Adding a new node: /default-rack/58.XXX.145.YY:50010
17/05/26 16:38:53 INFO net.NetworkTopology: Adding a new node: /default-rack/58.XXX.144.YY:50010
17/05/26 16:38:53 INFO balancer.Balancer: 0 over-utilized: []
17/05/26 16:38:53 INFO balancer.Balancer: 0 underutilized: []
The cluster is balanced. Exiting...
May 26, 2017 4:38:53 PM           0                  0 B                 0 B               -1 B
May 26, 2017 4:38:54 PM  Balancing took 2.773 seconds

The Ambari Host view indicates that the data is still not balanced across the nodes:

hostviewambari.jpg

( Updated) The cluster has HA configuration (Primary-Secondary).

Ambari : 2.2.1.0

Hadoop : 2.7.1.2.4.0.0-169

Any input will be helpfull.

12 REPLIES 12

avatar
Super Guru
@Suhel
  1. sudo -u hdfs -b hdfs balancer -threshold 5

What do you have "-b" for in this command? Shouldn't this be

sudo -u hdfs hdfs balancer -threshold 5

avatar
Contributor

@mqureshi I used "-b" option to push the processing to background. I have also tried the following from server that has NN. Trial 1: (on Command Prompt)

nohup sudo -u hdfs hdfs balancer -threshold 5 > /var/log/hadoop/hdfs/balancer.$(date +%F_%H-%M-%S.%N).log 2>&1 &
Trial 2: (on Command Prompt) . DH05 needs to be offloaded as its the most unbalanced
sudo -u hdfs -b hdfs balancer -threshold 5 -source DH05 > /var/log/hadoop/hdfs/balancer.$(date +%F_%H-%M-%S.%N).log 2>&1 &

I get the same output from Balancer as it exists stating that "The cluster is balanced". It's somehow not able to get the current stats on data in the datanodes.

avatar
Super Guru

@Suhel

Do you also have a standby namenode? Can you try the following:

sudo -u hdfs -b hdfs balancer -fs hdfs://<your name node>:8020 -threshold 5

avatar
Contributor

@mqureshi The cluster has a primary and secondary configuration for NN. When i run the balance command as you indicated, i get an error stating "Another Balancer is running". But ps -ef | grep balancer does not show any running balancer process

[root@dh01 ~]# sudo -u hdfs hdfs balancer -fs hdfs://dh01.int.belong.com.au:8020 -threshold 5
17/05/29 12:14:53 INFO balancer.Balancer: Using a threshold of 5.0
17/05/29 12:14:53 INFO balancer.Balancer: namenodes  = [hdfs://belongcluster1, hdfs://dh01.int.belong.com.au:8020]
17/05/29 12:14:53 INFO balancer.Balancer: parameters = Balancer.BalancerParameters [BalancingPolicy.Node, threshold = 5.0, max idle iteration = 5, #excluded nodes = 0, #included nodes = 0, #source nodes = 0, #blockpools = 0, run during upgrade = false]
17/05/29 12:14:53 INFO balancer.Balancer: included nodes = []
17/05/29 12:14:53 INFO balancer.Balancer: excluded nodes = []
17/05/29 12:14:53 INFO balancer.Balancer: source nodes = []
Time Stamp               Iteration#  Bytes Already Moved  Bytes Left To Move  Bytes Being Moved
17/05/29 12:14:54 INFO balancer.KeyManager: Block token params received from NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
17/05/29 12:14:54 INFO block.BlockTokenSecretManager: Setting block keys
17/05/29 12:14:54 INFO balancer.KeyManager: Update block keys every 2hrs, 30mins, 0sec
17/05/29 12:14:55 INFO block.BlockTokenSecretManager: Setting block keys
17/05/29 12:14:55 INFO balancer.KeyManager: Block token params received from NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
17/05/29 12:14:55 INFO block.BlockTokenSecretManager: Setting block keys
17/05/29 12:14:55 INFO balancer.KeyManager: Update block keys every 2hrs, 30mins, 0sec
java.io.IOException: Another Balancer is running..  Exiting ...
May 29, 2017 12:14:55 PM Balancing took 2.431 seconds


avatar
Super Guru

avatar
Contributor

@mqureshi

About : https://community.hortonworks.com/articles/4595/balancer-not-working-in-hdfs-ha.html

my hdfs-site.xml has 2 entries .. i am not sure if i need to delete both or NN2 only..

    <property>
      <name>dfs.namenode.rpc-address.belongcluster1.nn1</name>
      <value>dh01.int.belong.com.au:8020</value>
    </property>
    
    <property>
      <name>dfs.namenode.rpc-address.belongcluster1.nn2</name>
      <value>dh02.int.belong.com.au:8020</value>
    </property>

avatar
Super Guru

@Suhel

I think you should have only one value and it should point to your "name service". You should have a value for name service when you have HA enabled. See the following link on how this works in HA - third row:

https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml

avatar
Contributor

@mqureshi I found another thread with similar issue: https://community.hortonworks.com/questions/22105/hdfs-balancer-is-getting-failed-after-30-mins-in-a... here they say indicate that if HA is enabled then one would need to remove dfs.namenode.rpc-address . I ran a check on Ambari Server using the configs.sh:

/var/lib/ambari-server/resources/scripts/configs.sh -u admin -p admin -port 8080 get dh01.int.belong.com.au belong1 hdfs-site

and the output does not contain the dfs.namenode.rpc-address property.

########## Performing 'GET' on (Site:hdfs-site, Tag:version1470359698835)
"properties" : {
"dfs.block.access.token.enable" : "true",
"dfs.blockreport.initialDelay" : "120",
"dfs.blocksize" : "134217728",
"dfs.client.block.write.replace-datanode-on-failure.enable" : "NEVER",
"dfs.client.failover.proxy.provider.belongcluster1" : "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider",
"dfs.client.read.shortcircuit" : "true",
"dfs.client.read.shortcircuit.streams.cache.size" : "4096",
"dfs.client.retry.policy.enabled" : "false",
"dfs.cluster.administrators" : " hdfs",
"dfs.content-summary.limit" : "5000",
"dfs.datanode.address" : "0.0.0.0:50010",
"dfs.datanode.balance.bandwidthPerSec" : "6250000",
"dfs.datanode.data.dir" : "/data/hadoop/hdfs/data",
"dfs.datanode.data.dir.perm" : "750",
"dfs.datanode.du.reserved" : "1073741824",
"dfs.datanode.failed.volumes.tolerated" : "0",
"dfs.datanode.http.address" : "0.0.0.0:50075",
"dfs.datanode.https.address" : "0.0.0.0:50475",
"dfs.datanode.ipc.address" : "0.0.0.0:8010",
"dfs.datanode.max.transfer.threads" : "16384",
"dfs.domain.socket.path" : "/var/lib/hadoop-hdfs/dn_socket",
"dfs.encrypt.data.transfer.cipher.suites" : "AES/CTR/NoPadding",
"dfs.encryption.key.provider.uri" : "",
"dfs.ha.automatic-failover.enabled" : "true",
"dfs.ha.fencing.methods" : "shell(/bin/true)",
"dfs.ha.namenodes.belongcluster1" : "nn1,nn2",
"dfs.heartbeat.interval" : "3",
"dfs.hosts.exclude" : "/etc/hadoop/conf/dfs.exclude",
"dfs.http.policy" : "HTTP_ONLY",
"dfs.https.port" : "50470",
"dfs.journalnode.edits.dir" : "/hadoop/hdfs/journal",
"dfs.journalnode.https-address" : "0.0.0.0:8481",
"dfs.namenode.accesstime.precision" : "0",
"dfs.namenode.acls.enabled" : "true",
"dfs.namenode.audit.log.async" : "true",
"dfs.namenode.avoid.read.stale.datanode" : "true",
"dfs.namenode.avoid.write.stale.datanode" : "true",
"dfs.namenode.checkpoint.dir" : "/tmp/hadoop/hdfs/namesecondary",
"dfs.namenode.checkpoint.edits.dir" : "${dfs.namenode.checkpoint.dir}",
"dfs.namenode.checkpoint.period" : "21600",
"dfs.namenode.checkpoint.txns" : "1000000",
"dfs.namenode.fslock.fair" : "false",
"dfs.namenode.handler.count" : "200",
"dfs.namenode.http-address" : "dh01.int.belong.com.au:50070",
"dfs.namenode.http-address.belongcluster1.nn1" : "dh01.int.belong.com.au:50070",
"dfs.namenode.http-address.belongcluster1.nn2" : "dh02.int.belong.com.au:50070",
"dfs.namenode.https-address" : "dh01.int.belong.com.au:50470",
"dfs.namenode.https-address.belongcluster1.nn1" : "dh01.int.belong.com.au:50470",
"dfs.namenode.https-address.belongcluster1.nn2" : "dh02.int.belong.com.au:50470",
"dfs.namenode.name.dir" : "/data/hadoop/hdfs/namenode",
"dfs.namenode.name.dir.restore" : "true",
"dfs.namenode.rpc-address.belongcluster1.nn1" : "dh01.int.belong.com.au:8020",
"dfs.namenode.rpc-address.belongcluster1.nn2" : "dh02.int.belong.com.au:8020",
"dfs.namenode.safemode.threshold-pct" : "0.99",
"dfs.namenode.shared.edits.dir" : "qjournal://dh03.int.belong.com.au:8485;dh02.int.belong.com.au:8485;dh01.int.belong.com.au:8485/belongcluster1",
"dfs.namenode.stale.datanode.interval" : "30000",
"dfs.namenode.startup.delay.block.deletion.sec" : "3600",
"dfs.namenode.write.stale.datanode.ratio" : "1.0f",
"dfs.nameservices" : "belongcluster1",
"dfs.permissions.enabled" : "true",
"dfs.permissions.superusergroup" : "hdfs",
"dfs.replication" : "3",
"dfs.replication.max" : "50",
"dfs.support.append" : "true",
"dfs.webhdfs.enabled" : "true",
"fs.permissions.umask-mode" : "022",
"nfs.exports.allowed.hosts" : "* rw",
"nfs.file.dump.dir" : "/tmp/.hdfs-nfs"
}

Are you suggesting that i just keep 1 namenode service address and point it to primary name node host:port. Something like the below:

    <property>
      <name>dfs.namenode.rpc-address.belongcluster1</name>
      <value>dh01.int.belong.com.au:8020</value>
    </property>

avatar
Super Guru
@Suhel

Actually, don't delete anything. Your version of Ambari does not seem to be affected by this bug. Try the following:

sudo -u hdfs -b hdfs balancer -fs hdfs://belongcluster1:8020 -threshold 5

My guess is you were only missing port number. Can you please try it.