How to increase DFS space on existing cluster
Labels: Apache Ambari, Apache Hadoop
Created ‎07-17-2017 10:50 AM
Hello,
We have a 14-node HDP cluster. Some servers have 4 TB of disk space and some have 2 TB. Across the 14 nodes we get 44.8 TB of disk space for HDFS, and of that, Disk Usage (Non DFS Used) shows 12.1 TB / 44.8 TB (27.04%). So we are losing a large amount of space without storing any data in it. I came to know that we can increase the DFS space by changing "Reserved space for HDFS" in the Ambari config. Right now the value is "270415951872 bytes"; what should the value be to get a good amount of space? Is it necessary to keep 30% of the space as Non DFS? Thanks in advance.
Created ‎07-17-2017 05:10 PM
As you have heterogeneous worker nodes, I'd recommend setting up two separate host config groups first, then managing the HDFS settings for each group separately.
Here is the link to how to set up config groups in Ambari:
For each host config group, you can control non-DFS usage by setting a proper value for 'dfs.datanode.du.reserved' (in bytes per volume); normally it should be 20%-25% of the disk storage.
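As a rough, illustrative calculation (the 2 TB / 4 TB disk sizes below are just round numbers taken from your description, so adjust to the real volume sizes):
  # current 'Reserved space for HDFS' (dfs.datanode.du.reserved), per volume
  echo $(( 270415951872 / 1024 / 1024 / 1024 ))   # -> 251 (roughly 252 GiB)
  # ~20% reservation per volume for a 2 TB and a 4 TB data disk, in bytes
  echo $(( 2000000000000 * 20 / 100 ))            # -> 400000000000 (~400 GB)
  echo $(( 4000000000000 * 20 / 100 ))            # -> 800000000000 (~800 GB)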
Also, keep in mind that non-DFS usage can grow into the reserved DFS storage, so you should regularly delete logs and other non-HDFS data that take up a lot of local storage. I normally use a command like 'du -hsx * | sort -rh | head -10' to identify the 10 largest folders.
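For example (the log path below is only an assumption; substitute wherever your services actually write their logs):
  # show the 10 largest folders under an assumed Hadoop log root
  cd /var/log/hadoop && du -hsx * | sort -rh | head -10
  # list log files older than 14 days; add -delete once the output looks right
  find /var/log/hadoop -type f -name '*.log*' -mtime +14 -print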
Created ‎07-18-2017 12:53 PM
Thanks for your suggestions. I tried to find which directories are taking up the most space; the logs are not taking much space, yet the Non DFS usage is still very high, in the TBs.
Thanks
Created ‎07-18-2017 01:51 PM
@sysadmin CreditVidya Assuming you are referring to 'Non DFS Used:' on the NameNode UI page: that number is the total across the whole cluster, so it can easily be in the TBs depending on the size of your total storage. Also, it refers to how much of the configured DFS capacity is occupied by non-DFS use; here is a good article about it:
https://stackoverflow.com/questions/18477983/what-exactly-non-dfs-used-means
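The short version, as I understand it from that thread (the numbers below are purely illustrative, only the reserved value is yours):
  # non_dfs_used = configured_capacity - dfs_remaining - dfs_used
  # configured_capacity = raw disk size - dfs.datanode.du.reserved
  raw_disk=4000000000000        # assumed 4 TB data disk on one DataNode
  reserved=270415951872         # your current 'Reserved space for HDFS'
  dfs_used=1500000000000        # illustrative
  dfs_remaining=1200000000000   # illustrative
  echo $(( raw_disk - reserved - dfs_remaining - dfs_used ))  # ~1 TB counted as Non DFS Used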
Hope that helps.
Created ‎07-19-2017 07:12 AM
@dsun I went through the above URL already and came to know that we have configured ~270 GB for the DFS reserved space, but non-DFS usage is still in the TBs on some of the servers. After analyzing the servers, we realized that most of the space is going to MapReduce jobs. Is there any good tool/process to analyze this further?
Created ‎07-19-2017 03:00 PM
@sysadmin CreditVidya There are several approaches I can think of that might help:
1. It appears the MR intermediate data is not being purged properly by Hadoop itself. You can manually delete the files/folders under the directories configured in mapreduce.cluster.local.dir after the MR jobs have completed, say anything older than 3 days; a cron job works well for that (see the sketch after this list).
2. Make sure to implement the cleanup() method in each mapper/reducer class, which will clean up local resources and aggregates before the task exits.
3. Run the HDFS balancer regularly, normally weekly or bi-weekly, so you don't end up with much more HDFS data on some nodes than on others (MR jobs always try to use the local copy of the data first), and always keep an eye on 'Disk Usage' for each host in Ambari.
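A minimal sketch for #1 and #3; the local-dir path and the thresholds are assumptions, so substitute the directories actually configured in mapreduce.cluster.local.dir (or yarn.nodemanager.local-dirs on YARN) and your own retention window:
  # 1. list intermediate files older than 3 days under an assumed local dir;
  #    append -delete once the output is verified, e.g. from a nightly cron entry:
  #    0 2 * * * find /hadoop/yarn/local -type f -mtime +3 -delete
  find /hadoop/yarn/local -type f -mtime +3 -print

  # 3. rebalance HDFS until no DataNode deviates more than 10% from the cluster average
  hdfs balancer -threshold 10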
Hope that helps.
Created ‎07-19-2017 11:49 PM
Please don't forget to 'accept' the answer if it helped, thanks.
