How to increase DFS space on existing cluster
Labels: Apache Ambari, Apache Hadoop
Created ‎07-17-2017 10:50 AM
Hello,
We have a 14-node HDP cluster. Some servers have 4 TB of disk space and some have 2 TB. Across the 14 nodes we get 44.8 TB of disk space for HDFS, and of that, Disk Usage (Non DFS Used) shows 12.1 TB / 44.8 TB (27.04%). So we are losing a large amount of space without storing any data in it. I came to know that we can increase the DFS space by changing "Reserved space for HDFS" in the Ambari config. Right now the value is "270415951872 bytes"; what should the value be to get a good amount of space? Is it necessary to keep 30% of the space as Non DFS? Thanks in advance.
Created ‎07-17-2017 05:10 PM
As you have heterogeneous worker nodes, I'd recommend setting up two separate host config groups first, then managing the HDFS settings for each group separately.
Here is the link to how to set up config groups in Ambari:
For each host config group, you can control non-DFS usage by setting a proper value for 'dfs.datanode.du.reserved' (in bytes per volume); normally it should be 20%-25% of the disk storage.
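As a rough, illustrative calculation (the 2 TB / 4 TB disk sizes below are just round numbers taken from your description, so adjust to the real volume sizes):
  # current 'Reserved space for HDFS' (dfs.datanode.du.reserved), per volume
  echo $(( 270415951872 / 1024 / 1024 / 1024 ))   # -> 251 (roughly 252 GiB)
  # ~20% reservation per volume for a 2 TB and a 4 TB data disk, in bytes
  echo $(( 2000000000000 * 20 / 100 ))            # -> 400000000000 (~400 GB)
  echo $(( 4000000000000 * 20 / 100 ))            # -> 800000000000 (~800 GB)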
Also, keep in mind that non-DFS usage can grow into the reserved DFS storage, so you should regularly delete logs and other non-HDFS data that take up a lot of local storage. I normally use a command like 'du -hsx * | sort -rh | head -10' to identify the 10 largest folders.
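For example (the log path below is only an assumption; substitute wherever your services actually write their logs):
  # show the 10 largest folders under an assumed Hadoop log root
  cd /var/log/hadoop && du -hsx * | sort -rh | head -10
  # list log files older than 14 days; add -delete once the output looks right
  find /var/log/hadoop -type f -name '*.log*' -mtime +14 -print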
Created ‎07-18-2017 12:53 PM
Thanks for your suggestions. I tried to find which directories are taking up the most space; the logs are not taking much space, yet the Non DFS usage is still very high, in the TBs.
Thanks
Created ‎07-18-2017 01:51 PM
@sysadmin CreditVidya Assuming you are referring to 'Non DFS Used:' on the NameNode UI page: that number is the total across the whole cluster, so it can easily be in the TBs depending on the size of your total storage. Also, it refers to how much of the configured DFS capacity is occupied by non-DFS use; here is a good article about it:
https://stackoverflow.com/questions/18477983/what-exactly-non-dfs-used-means
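The short version, as I understand it from that thread (the numbers below are purely illustrative, only the reserved value is yours):
  # non_dfs_used = configured_capacity - dfs_remaining - dfs_used
  # configured_capacity = raw disk size - dfs.datanode.du.reserved
  raw_disk=4000000000000        # assumed 4 TB data disk on one DataNode
  reserved=270415951872         # your current 'Reserved space for HDFS'
  dfs_used=1500000000000        # illustrative
  dfs_remaining=1200000000000   # illustrative
  echo $(( raw_disk - reserved - dfs_remaining - dfs_used ))  # ~1 TB counted as Non DFS Used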
Hope that helps.
Created ‎07-19-2017 07:12 AM
@dsun I went through the above URL already and came to know that we have configured ~270 GB for the DFS reserved space, but non-DFS usage is still in the TBs on some of the servers. After analyzing the servers, we realized that most of the space is going to MapReduce jobs. Is there any good tool/process to analyze this further?
Created ‎07-19-2017 03:00 PM
@sysadmin CreditVidya There are several approaches I can think of that might help:
1. It appears the MR intermediate data is not being purged properly by Hadoop itself. You can manually delete the files/folders under the directories configured in mapreduce.cluster.local.dir after the MR jobs have completed, say anything older than 3 days; a cron job works well for that (see the sketch after this list).
2. Make sure to implement the cleanup() method in each mapper/reducer class, which will clean up local resources and aggregates before the task exits.
3. Run the HDFS balancer regularly, normally weekly or bi-weekly, so you don't end up with much more HDFS data on some nodes than on others (MR jobs always try to use the local copy of the data first), and always keep an eye on 'Disk Usage' for each host in Ambari.
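A minimal sketch for #1 and #3; the local-dir path and the thresholds are assumptions, so substitute the directories actually configured in mapreduce.cluster.local.dir (or yarn.nodemanager.local-dirs on YARN) and your own retention window:
  # 1. list intermediate files older than 3 days under an assumed local dir;
  #    append -delete once the output is verified, e.g. from a nightly cron entry:
  #    0 2 * * * find /hadoop/yarn/local -type f -mtime +3 -delete
  find /hadoop/yarn/local -type f -mtime +3 -print

  # 3. rebalance HDFS until no DataNode deviates more than 10% from the cluster average
  hdfs balancer -threshold 10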
Hope that helps.
Created ‎07-19-2017 11:49 PM
Please don't forget to 'accept' the answer if it helped, thanks.
