I have an HDP cluster with 60% of its HDFS space consumed. When I run Hive queries from Zeppelin, HDFS usage climbs to 89%, which triggers continuous warnings from Ambari. I found that the space used under "/tmp/hive/hive" grows gradually while the query runs. I had to kill the YARN job to stop the query. Once the job is killed, the extra HDFS space is released and usage reverts to 60%.
As per my understanding "/tmp/hive/hive" directory is being used to store the temp data while executing MapReduce.
So my question is: how much free storage should be left on HDFS so that my cluster does not run out of space while running queries?
Please correct me on the below statement :-
(Total HDFS Size = Size of Data Stored + Buffered Storage to execute queries)
To add more details for the above question :-
Replication Factor =3
Data stored Format = Parquet
It's generally considered a best practice to add at least 20% of your data size as additional storage to handle temporary working files, views, scratch space, etc. It's also a good idea to account for 20% additional head room above that.
If your raw data is 100GB, then you add about 20GB for working space and another 20GB for extra head room. You should target an HDFS size of 140GB before data is loaded. While 3x replication is the default, that is typically mitigated by compression of the data stored on HDFS, which provides 2x-3x compression for most scenarios.
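The arithmetic above can be sketched in a few lines of Python. This is only a rough illustration: the function names and the `compression_ratio` parameter are made up for this example, and the way replication and compression are combined (multiply by one, divide by the other) is an assumption based on the "3x replication offset by 2x-3x compression" remark, not an exact model of how HDFS accounts for space.

```python
def target_hdfs_size_gb(raw_gb, working_pct=20, headroom_pct=20):
    """Logical size to plan for: raw data + working space + head room.

    Both percentages are taken relative to the raw data size,
    matching the 100GB -> 20GB + 20GB example.
    """
    working_gb = raw_gb * working_pct / 100
    headroom_gb = raw_gb * headroom_pct / 100
    return raw_gb + working_gb + headroom_gb


def physical_size_gb(logical_gb, replication=3, compression_ratio=1.0):
    """Rough on-disk footprint (assumption: replication multiplies,
    compression divides; real ratios depend on the data and codec)."""
    return logical_gb * replication / compression_ratio


logical = target_hdfs_size_gb(100)
print(logical)  # 140.0

# With 3x replication and an assumed 3x compression the two cancel out.
print(physical_size_gb(logical, replication=3, compression_ratio=3.0))
```

With only 2x compression the same 140GB would need roughly 210GB of physical disk, so it is worth measuring the actual compression ratio of your Parquet data rather than assuming the best case.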
Average daily ingest rate: 1 TB
Replication factor: 3
Daily raw consumption: 3 TB (ingest × replication)
MapReduce temp space reserve: 25% (for intermediate MapReduce data)
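As a rough sketch, the daily figures can be combined as follows. The function name is hypothetical, and applying the 25% reserve on top of the replicated daily consumption is an assumption; the figures above do not say what the 25% is measured against.

```python
def daily_hdfs_consumption_tb(ingest_tb, replication=3, temp_reserve_pct=25):
    """Return (raw, reserve, total) daily HDFS consumption in TB.

    raw     = ingest x replication
    reserve = temp_reserve_pct of raw (assumption, see lead-in)
    """
    raw_tb = ingest_tb * replication
    reserve_tb = raw_tb * temp_reserve_pct / 100
    return raw_tb, reserve_tb, raw_tb + reserve_tb


raw, reserve, total = daily_hdfs_consumption_tb(1)
print(raw, reserve, total)  # 3 0.75 3.75
```

Under these assumptions, 1 TB of daily ingest needs about 3.75 TB of HDFS capacity per day once replication and MapReduce temp space are counted.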