How to increase DFS space on existing cluster

Contributor

Hello,

We have a 14-node HDP cluster. Some servers have 4 TB of disk space and some have 2 TB. Across the 14 nodes we get 44.8 TB of disk space for HDFS, and Disk Usage (Non DFS Used) shows 12.1 TB / 44.8 TB (27.04%). That means we are losing a large amount of space without storing any data in it. I learned that we can increase the DFS space by changing "Reserved space for HDFS" in the Ambari config. The current value is "270415951872 bytes"; what value should we set to get a good amount of space? Is it necessary to keep 30% of the space for non-DFS use? Thanks in advance.

Expert Contributor

As you have heterogeneous worker nodes, I'd recommend setting up two separate host config groups first, then managing the HDFS settings separately for each group.

Here is the link to how to set up config groups in Ambari:

https://docs.hortonworks.com/HDPDocuments/Ambari-2.5.1.0/bk_ambari-operations/content/using_host_con...

For each host group, you can configure the non-DFS reservation by setting a proper value for 'dfs.datanode.du.reserved' (in bytes per volume); normally it should be 20%-25% of the disk storage.
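
To make the arithmetic concrete, here is a minimal sketch in Python. It assumes each DataNode exposes a single data volume of 2 TB or 4 TB, counted in decimal terabytes; the function name and that volume layout are illustrative assumptions, not details from this cluster. If a node has several data directories, the same percentage would apply to the size of each volume.

```python
TB = 10**12  # decimal terabyte (assumption: sizes are quoted the way disk vendors count)


def reserved_bytes(volume_capacity_bytes, reserved_percent):
    """Bytes to reserve for non-DFS use on one volume (hypothetical helper)."""
    if not 0 <= reserved_percent < 100:
        raise ValueError("reserved_percent must be between 0 and 99")
    # Integer arithmetic keeps the result exact for large byte counts.
    return volume_capacity_bytes * reserved_percent // 100


# Candidate values for the two host config groups at 20% and 25%.
for label, capacity in (("2 TB group", 2 * TB), ("4 TB group", 4 * TB)):
    for percent in (20, 25):
        print(label, f"{percent}%", reserved_bytes(capacity, percent), "bytes")
```

Each printed number is a candidate value for 'dfs.datanode.du.reserved' in the matching config group, not a tested recommendation.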

Also, keep in mind that non-DFS usage can grow into the reserved DFS storage, so you should regularly delete logs and other non-HDFS data that take up a lot of local storage. I normally use commands like 'du -hsx * | sort -rh | head -10' to identify the 10 largest folders.
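
If you would rather script that check, for example to run it on every node and collect the results, the following is a rough Python sketch of the same idea. The function names are made up for illustration, and it sums apparent file sizes rather than allocated disk blocks, so the numbers can differ somewhat from what 'du' reports.

```python
import os


def dir_size(path):
    """Sum the apparent sizes of all files under path (symlinks are not followed)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                # File vanished or is unreadable; skip it.
                pass
    return total


def largest_entries(parent, limit=10):
    """Return up to `limit` (size_in_bytes, name) pairs for entries in parent, largest first."""
    sizes = []
    with os.scandir(parent) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                size = dir_size(entry.path)
            else:
                size = entry.stat(follow_symlinks=False).st_size
            sizes.append((size, entry.name))
    return sorted(sizes, reverse=True)[:limit]
```

Calling largest_entries on the mount point of a data disk returns the ten largest top-level entries, which is usually enough to see which directory the non-DFS usage is coming from.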

Contributor
@dsun

Thanks for your suggestions. I tried to find which directories are taking up a lot of space. Logs are not taking much, yet non-DFS usage is still very high, in the TBs.

Thanks

Expert Contributor

@sysadmin CreditVidya Assuming you are referring to 'Non DFS Used:' on the NameNode UI page, that figure is the total across the whole cluster, so it could be in the TBs depending on the size of your total storage. Also, that number means 'how much of the configured DFS capacity is occupied by non-DFS use'. Here is a good discussion of it:

https://stackoverflow.com/questions/18477983/what-exactly-non-dfs-used-means

Hope that helps.

Contributor

@dsun I had already gone through the above URL. We have configured 270 GB for the DFS reserved space, but non-DFS usage is taking TBs on some of the servers. After analyzing the servers, we realized that most of the space is going to MapReduce jobs. Is there a good tool or process to analyze this further?

Expert Contributor

@sysadmin CreditVidya There are several approaches I can think of that might help:

1. It appears that MR intermediate data is not being purged properly by Hadoop itself. You can manually delete files/folders under the directories configured in mapreduce.cluster.local.dir after MR jobs have completed, say files/folders older than 3 days. You can probably create a cron job for that purpose.

2. Make sure to implement the cleanup() method in each mapper/reducer class, so that it cleans up local resources and aggregates before the task exits.

3. Run the hdfs balancer regularly, normally weekly or bi-weekly. That way you won't have much more HDFS data stored on some nodes than on others, since MR jobs always try to use the local copy of the data first. Also keep an eye on 'disk usage' for each host in Ambari.
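
For the first item, a cron job could call something along these lines. This is only a sketch in Python: the function names are made up for illustration, it assumes that anything not modified for 3 days belongs to a finished job, and it defaults to a dry run so you can review the list before anything is deleted. Files of jobs that are still running must not be removed, so check the age threshold against your longest-running jobs first.

```python
import os
import time

SECONDS_PER_DAY = 86400


def find_stale_files(root, max_age_days=3):
    """Return paths of files under root whose last modification is older than max_age_days."""
    cutoff = time.time() - max_age_days * SECONDS_PER_DAY
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.lstat(path).st_mtime < cutoff:
                    stale.append(path)
            except OSError:
                # File disappeared while scanning; nothing to do.
                continue
    return stale


def delete_stale_files(root, max_age_days=3, dry_run=True):
    """Delete stale files under root; with dry_run=True only report what would be deleted."""
    handled = []
    for path in find_stale_files(root, max_age_days):
        if not dry_run:
            try:
                os.remove(path)
            except OSError:
                # Could not delete (permissions, already gone); leave it out of the report.
                continue
        handled.append(path)
    return handled
```

Point root at one of the directories listed in mapreduce.cluster.local.dir on the node, run it with dry_run=True first, and only switch to dry_run=False once the reported paths look safe to remove. The sketch leaves empty directories in place.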

Hope that helps.

Expert Contributor

Please don't forget to 'accept' the answer if it helped, thanks.
