Created 06-25-2018 09:13 PM
When we run a beeline query on a big table, /tmp usage grows to 1 TB, and total usage reaches 3 TB due to the replication factor.
Can we minimize /tmp space usage?
Created 06-26-2018 04:32 AM
Hi @Alpesh Virani!
Usually, /tmp grows because of the intermediate stages of the job. As you pointed out, a big table consumes a lot of space, and unfinished or failed jobs will also leave their data behind in /tmp.
You can reduce this by compressing the intermediate data with:
hive.exec.compress.intermediate=true;
tez.runtime.compress=true;
Choosing a good compression codec such as Snappy will also help shrink the tmp directory.
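As a rough sketch, the settings above can be applied per-session in beeline before running the query; the Snappy codec class name below is an assumption and requires the Snappy native libraries to be installed on your cluster:

```sql
-- Compress intermediate data written between job stages (per-session settings)
SET hive.exec.compress.intermediate=true;
SET tez.runtime.compress=true;
-- Assumption: Snappy is available on the cluster nodes
SET tez.runtime.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
```

These only affect the current session; to make them permanent, set them in hive-site.xml / tez-site.xml via Ambari.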
If you have a moment, take a look at these links:
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.1/bk_hive-performance-tuning/content/ch_hive-...
https://cwiki.apache.org/confluence/display/Hive/Vectorized+Query+Execution
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.2/bk_command-line-installation/content/ref-ff...
They cover some good tips such as vectorization, map joins, CBO, ORC, and bucketing.
Hope this helps! 🙂
Created 06-30-2018 03:31 AM
Thanks Vinicius Higa Murakami, it is very helpful.
Created 06-30-2018 04:17 PM
Hi @Alpesh Virani!
Good to know!
If your issue has been solved, I'd kindly ask you to accept this as the answer.
Doing so will help other users find the answer as well!