I am running an EC2 cluster backed by S3. When I run a Hive query or a Hadoop command that operates on very large data, it stages temporary files on the local disk of each node before/after copying data to/from S3. I know the staging directory can be configured with the 'fs.s3.buffer.dir' property. Ideally Hadoop should delete these files afterwards, and it usually does, but in some cases it does not, so .tmp files accumulate (gigabytes of them) on all the nodes and cause disk-space issues.
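For reference, a minimal sketch of the relevant core-site.xml property (the paths and the demo output file below are illustrative assumptions, not values from my cluster):

```shell
# Demo: write out the fs.s3.buffer.dir property snippet that would go into
# core-site.xml on every node. SNIPPET is just a scratch file for this demo.
SNIPPET=/tmp/fs-s3-buffer-property.xml
cat > "$SNIPPET" <<'EOF'
<property>
  <name>fs.s3.buffer.dir</name>
  <!-- example paths: point the staging buffer at volumes with headroom -->
  <value>/mnt/s3buffer,/mnt1/s3buffer</value>
</property>
EOF
cat "$SNIPPET"
```

Moving the buffer onto a larger volume does not stop the leak, but it buys time before the nodes fill up.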
Is there any way to avoid creating the .tmp files altogether?
Or can we somehow identify why, in some cases, those .tmp files are not deleted, and fix that?
Please suggest the best solution in this case.
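As a stopgap I am considering a periodic cleanup on each node. A sketch, assuming the buffer directory matches fs.s3.buffer.dir and that files older than a day are safe to remove (both assumptions; the demo uses a temp directory as a stand-in):

```shell
# Demo of a cleanup pass for stale S3 buffer files.
BUFFER_DIR="$(mktemp -d)"                  # stand-in for the real fs.s3.buffer.dir
touch -d '3 days ago' "$BUFFER_DIR/part-0001.tmp"   # simulate a leftover file
touch "$BUFFER_DIR/in-use.tmp"                      # recent file, should survive
# Delete .tmp files older than one day; could be run from cron on each node.
find "$BUFFER_DIR" -name '*.tmp' -type f -mtime +1 -delete
```

The age threshold matters: deleting .tmp files that a running job is still writing would corrupt that job's output, so the cutoff should be longer than your longest-running query.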
Thanks for such an informative reply. I have already implemented s3a://, and yes, that is the only real solution.
The other one, i.e. changing the /tmp dir, is a clever workaround.