Uneven DFS data storage across cluster.

Expert Contributor

I have a four-node cluster (HDP 2.4). On the Ambari hosts page, I can see that the space consumption on one of the nodes is very high. First of all, I do not understand what is causing this. I would also like to understand how easy it is to distribute the data evenly across all the nodes, so that every node holds an equal amount of DFS data.

1 ACCEPTED SOLUTION

Super Guru

@Pradeep kumar

1. Go to the data directory on that datanode and run du -sh * to check how much space each entry uses.

It might be the case that non-DFS data is present on that node.

2. You can distribute data evenly across the datanodes using the Balancer, as shown below.

[Screenshot: 4044-screen-shot-2016-05-05-at-30231-pm.png]

6 REPLIES

Expert Contributor

@Sagar Shimpi Thanks for pointing me to the "Rebalance HDFS" utility. After I clicked on Rebalance HDFS, the progress bar finished quickly and reported success. Shouldn't this be a long procedure, with lots of data being sent from one node to another to balance the cluster? If the process has not actually ended, how do I know when it will finish? After clicking that link, I do not see any immediate change.

Super Guru

@Pradeep kumar

  1. Browse to the NameNode UI at http://<namenode-host>:50070/
  2. On the top panel, click on "Datanodes"
  3. Check the "Used" and "Non DFS Used" columns. "Used" is the space used by HDFS. "Non DFS Used" is space on the datanode's data volumes (the disks that hold the dfs.datanode.data.dir directories, i.e. the datanode's HDFS data path) taken up by files that are not HDFS blocks.

Please check the values above. It seems there was very little HDFS data, which is why the balancer finished so quickly. Please also let me know which Ambari and HDP versions you are using.
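The same two figures can also be read from the command line instead of the NameNode UI. The sketch below is only an illustration: summarize_report is a hypothetical helper name, and it assumes that each datanode section of the `hdfs dfsadmin -report` output contains lines beginning with "Hostname:", "DFS Used:" and "Non DFS Used:" followed by a byte count. Check the layout on your own cluster before relying on it.

```shell
#!/usr/bin/env bash
# Summarize per-datanode "DFS Used" and "Non DFS Used" from report text on stdin.
# Assumption: the report labels lines "Hostname:", "DFS Used:" and
# "Non DFS Used:", with the byte count as the first value after the label.
summarize_report() {
  awk '
    /^Hostname:/     { host = $2 }   # remember which datanode section we are in
    /^DFS Used:/     { dfs = $3 }    # bytes used by HDFS blocks
    /^Non DFS Used:/ {               # bytes used by everything else on the volumes
      if (host != "")
        printf "%s dfs_used_bytes=%s non_dfs_used_bytes=%s\n", host, dfs, $4
    }
  '
}
```

A typical invocation would be `hdfs dfsadmin -report | summarize_report`, usually run as the HDFS superuser. A node whose non_dfs_used_bytes is far larger than that of its peers is the one to inspect on the local filesystem.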

Expert Contributor

@Sagar Shimpi I have checked the NameNode UI. "Non DFS Used" shows 77.15 GB and "Used" shows just 1.25 GB. 77.15 GB is very high compared to the other three nodes. What should I do next? How do I free up space on this node? As for the versions, HDP is 2.4 and Ambari is 2.2.1.1.

Expert Contributor

I found out what was occupying the non-DFS space: the log files under /var/log/hive, which held around 67 GB of log data. I removed it and the space has now been reclaimed. Thanks for your help. (I ran du -kscx * in the log folder to see the size of each subfolder.)
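For anyone repeating this search, here is a small sketch of the same du-based approach with the output sorted so the largest entry comes first. largest_entries is a hypothetical helper name, and the default of 10 lines is an arbitrary choice. It leaves out the -c grand total so that the total does not end up sorted in with the entries.

```shell
#!/usr/bin/env bash
# Print the size in KB of each entry directly under a directory, largest first.
# du flags: -k sizes in kilobytes, -s one total per entry, -x stay on one filesystem.
# $1: directory to inspect (default: current directory)
# $2: number of entries to show (default: 10)
largest_entries() {
  local dir="${1:-.}"
  local count="${2:-10}"
  # The subshell keeps the cd from affecting the caller;
  # "--" protects against entry names that start with a dash.
  ( cd "$dir" && du -ksx -- * 2>/dev/null | sort -rn | head -n "$count" )
}
```

For example, `largest_entries /var/log 5` lists the five largest entries under /var/log. Running it again inside the biggest one narrows the search one level at a time.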

Contributor

There are a number of things that cause HDFS imbalance. This post explains some of those causes in more detail. The balancer should be run regularly in a production system (you can kick it off from the command line, so you can schedule it using cron, for example). The balancer can take a while to complete if there are a lot of blocks to move.
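As a rough sketch of the command-line route mentioned above: the balancer is started with `hdfs balancer`, and its -threshold option sets how many percentage points a datanode's utilization may differ from the cluster average before blocks are moved (the default is 10). The wrapper name run_balancer, the log path and the cron schedule below are all assumptions for illustration, not part of any Hadoop or Ambari tooling.

```shell
#!/usr/bin/env bash
# Run the HDFS balancer with a given threshold and append its output to a log.
# $1: threshold in percent (default 10, the balancer's own default)
# $2: log file (hypothetical default: /tmp/hdfs-balancer.log)
run_balancer() {
  local threshold="${1:-10}"
  local log="${2:-/tmp/hdfs-balancer.log}"
  if ! command -v hdfs >/dev/null 2>&1; then
    echo "hdfs command not found on PATH" >&2
    return 127
  fi
  hdfs balancer -threshold "$threshold" >>"$log" 2>&1
}

# Hypothetical crontab entry (for the user that runs HDFS, commonly "hdfs"),
# starting a script that calls run_balancer every Sunday at 02:00:
#   0 2 * * 0 /path/to/run-balancer.sh
```

A lower threshold gives a more even spread but means more blocks to move, so the run takes longer. When every node is already within the threshold, as with the very small "Used" figure reported earlier in this thread, the balancer has nothing to move and exits almost immediately.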

Note that when HDFS moves a block, the old copy is "marked for deletion" but is not deleted immediately. HDFS cleans up these unused blocks over time.