Disk space issue on local disk due to buffering of S3 data

New Contributor

I am running an EC2 cluster with S3. When I run a Hive query or a Hadoop command that operates on very large data, temporary files are copied to the local disk on the nodes before/after the data is copied to/from S3. I know this can be configured with the 'fs.s3.buffer.dir' property. Ideally those files should be deleted afterwards, and usually they are, but in some cases they are not, which leads to an accumulation of .tmp files (in the GBs) on all the nodes and, eventually, to disk space issues.

Is there any way to avoid creating these .tmp files?

Or can we identify why, in some cases, the .tmp files are not deleted, and correct that?

Please suggest the best solution in this case.

1 ACCEPTED SOLUTION

Mentor
If the JVM that is doing the buffering in the local directory dies from a SIGKILL or a similarly abrupt interruption, the cleanup procedures never get a chance to run, so the .tmp files are left behind.

When running in MR mode, try setting the buffer directory to ./tmp (a relative path) so that the files are created under each task's working directory; those directories are removed automatically when the TaskTracker/NodeManager cleans up the task's environment after it is killed.
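
For example (a sketch only; the job, paths and bucket name below are placeholders, not from this thread), the property can be overridden per session or per job rather than cluster-wide:

    -- in a Hive session, before running the heavy query
    SET fs.s3.buffer.dir=./tmp;

    # or for a single Hadoop job that accepts the generic -D options
    hadoop distcp -Dfs.s3.buffer.dir=./tmp /data/input s3://my-bucket/output

Scoping it to the job this way keeps the cluster-wide default untouched while still letting the task cleanup reclaim the space.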

Also, have you tried using S3A (s3a://) instead? It generally works better than the older S3 filesystem and does not use a local buffer directory. S3A has been included in CDH 5 for a while now.
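
For reference, switching is mostly a matter of changing the URI scheme and supplying the S3A credential properties (the bucket name, table and keys below are placeholders):

    # list a bucket through the S3A connector
    hadoop fs -Dfs.s3a.access.key=YOUR_ACCESS_KEY \
              -Dfs.s3a.secret.key=YOUR_SECRET_KEY \
              -ls s3a://my-bucket/warehouse/

    -- and point Hive tables at s3a:// locations
    CREATE EXTERNAL TABLE logs (line STRING)
    LOCATION 's3a://my-bucket/logs/';

If the keys are already set in core-site.xml, the -D overrides are not needed.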

New Contributor

Thanks for such an informative reply. I have already implemented s3a:// and yes, that is the solution.

The other one, i.e. changing the buffer directory to ./tmp, is an intelligent workaround.