HDFS Disk Usage and datanode storage thresholds


HDP 2.5.3, Ambari 2.4.2

18 data nodes, 190 TB

HDFS disk usage is at about 92% (~15 TB free), with critical alarms or warnings on almost all of the data nodes.

The Percent DataNodes With Available Space alert is firing as well.

Are there best practice recommendations for setting these thresholds and for managing the percentage of HDFS disk usage? Are there concerns with running HDFS disk usage above a certain percentage?



You are living dangerously when you get to 80% disk usage. This is because batch jobs write intermediate data to local non-HDFS disk (MapReduce writes a lot of data to local disk, Tez less so), and that temporary data can approach or exceed 20% of available disk, depending on the jobs you are running. Also, if you are on physical servers (vs. cloud), you need lead time to provision, rack, and stack new data nodes when you scale out, and you will likely continue to ingest new data during that lead time.
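To make the lead-time point concrete, here is a rough sketch of how you might estimate the days of ingest left before a cluster crosses the 80% mark. The function name, the sample cluster, the 1 TB/day ingest rate, and the 45-day lead time are all made-up illustrations, not figures from this thread, and all sizes are assumed to be raw disk (that is, after replication).

```python
def days_until_threshold(capacity_tb, used_tb, daily_ingest_tb, threshold=0.80):
    """Days of ingest left before used space crosses threshold * capacity.

    All sizes are raw disk in TB (after replication). Returns 0.0 if the
    cluster is already at or above the threshold.
    """
    if daily_ingest_tb <= 0:
        raise ValueError("daily_ingest_tb must be positive")
    headroom_tb = threshold * capacity_tb - used_tb
    return max(headroom_tb, 0.0) / daily_ingest_tb


# Hypothetical cluster: 200 TB raw capacity, 120 TB used, growing 1 TB/day.
days_left = days_until_threshold(200.0, 120.0, 1.0)

# Assumed time needed to procure, rack, and stack new data nodes.
lead_time_days = 45

print(f"Days until 80%: {days_left:.0f}")
print(f"Start provisioning now: {days_left <= lead_time_days}")
```

If the days left are fewer than your provisioning lead time, you are already late, which is the situation the 80% warning is meant to avoid.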

It is good practice to set the alert threshold at 70% and have a plan in place for when usage reaches it. (If you ingest large volumes on a scheduled basis, you may want to go lower.)
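For a sense of scale, the sketch below works out what getting back to 70% would take for the cluster in the question (190 TB, 92% used, 18 data nodes). It assumes every data node has the same capacity and that new nodes would match the existing ones; the function name and that uniformity assumption are mine, not something stated in the question.

```python
import math


def plan_to_reach_target(capacity_tb, used_fraction, nodes, target=0.70):
    """Return (tb_to_free, extra_capacity_tb, extra_nodes) to reach `target`.

    tb_to_free:        data to delete or shrink with capacity unchanged
    extra_capacity_tb: capacity to add with data unchanged
    extra_nodes:       extra_capacity_tb expressed in whole nodes, assuming
                       new nodes match the current average node size
    """
    used_tb = capacity_tb * used_fraction
    tb_to_free = max(used_tb - target * capacity_tb, 0.0)
    extra_capacity_tb = max(used_tb / target - capacity_tb, 0.0)
    per_node_tb = capacity_tb / nodes
    extra_nodes = math.ceil(extra_capacity_tb / per_node_tb)
    return tb_to_free, extra_capacity_tb, extra_nodes


# Figures from the question: 190 TB total, about 92% used, 18 data nodes.
to_free, extra_tb, extra_nodes = plan_to_reach_target(190.0, 0.92, 18)
print(f"Free about {to_free:.1f} TB, or add about {extra_tb:.1f} TB "
      f"({extra_nodes} more nodes of the current size)")
```

Under those assumptions this comes to roughly 42 TB to free, or about 60 TB (6 nodes) to add, before accounting for any new data ingested in the meantime.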

Another good practice is to compress data that you rarely process using non-splittable codecs (you can decompress on the rare occasions you need the data), and possibly to compress other data that is still processed using splittable codecs. Automating compression is desirable. Compression is a somewhat involved topic. This is a useful first reference:

I would compress or delete data in the cluster you are referencing, and add more data nodes ASAP.
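If you want to automate the compression of rarely processed data, a first step could be picking the candidates. The sketch below is only an illustration: the `FileInfo` record, the 180-day idle cutoff, the suffix list, and the sample paths are all hypothetical, and how you obtain the file listing and run the actual compression depends on your own tooling.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400

# Suffixes treated as "already compressed"; adjust to the codecs you use.
COMPRESSED_SUFFIXES = (".gz", ".bz2", ".snappy", ".lzo", ".zst")


@dataclass
class FileInfo:
    """Hypothetical record for one file in a listing you have collected."""
    path: str
    size_bytes: int
    last_used_epoch: float  # last time the file was read or processed


def cold_candidates(files, now_epoch, min_idle_days=180):
    """Return uncompressed files idle for at least min_idle_days, largest first."""
    cutoff = now_epoch - min_idle_days * SECONDS_PER_DAY
    picked = [
        f for f in files
        if f.last_used_epoch <= cutoff
        and not f.path.endswith(COMPRESSED_SUFFIXES)
    ]
    return sorted(picked, key=lambda f: f.size_bytes, reverse=True)


# Made-up listing, with "now" fixed so the example is reproducible.
now = 1_000 * SECONDS_PER_DAY
listing = [
    FileInfo("/data/raw/2015/events.csv", 500, now - 400 * SECONDS_PER_DAY),
    FileInfo("/data/raw/2015/clicks.csv", 900, now - 365 * SECONDS_PER_DAY),
    FileInfo("/data/raw/2015/logs.gz", 700, now - 400 * SECONDS_PER_DAY),
    FileInfo("/data/raw/current/events.csv", 800, now - 2 * SECONDS_PER_DAY),
]

for f in cold_candidates(listing, now):
    print(f.path, f.size_bytes)
```

Sorting largest first means the first files you compress free the most space, which matters when the cluster is already above 90%.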