Support Questions

LovekeshBansal · ‎11-23-2015

I am running an EC2 cluster backed by S3. When I run a Hive query or a Hadoop command that operates on very large data, temporary files are written to the local disk on the nodes before or after the data is copied to or from S3. I know this location can be configured with the 'fs.s3.buffer.dir' property. Ideally these files should be deleted, and usually they are, but in some cases they are not. The result is a buildup of .tmp files (gigabytes in size) on all the nodes, which causes disk space problems.

Is there any way to avoid creating the .tmp files?

Or can we identify why the .tmp files are not deleted in some cases, and correct that?

Please suggest the best solution for this case.

Harsh J · ‎12-06-2015

If the JVM that is buffering to the local directory is killed by a SIGKILL or a similarly abrupt interruption, its cleanup procedures never run, so the buffered .tmp files are left behind.
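The effect can be reproduced outside Hadoop. The following is a minimal Python sketch, not Hadoop code: a child process creates a .tmp file and registers an exit-time cleanup hook (standing in for the JVM's cleanup), and the parent either lets it exit normally or kills it. The names `CHILD` and `run_child` are made up for this illustration.

```python
import os
import subprocess
import sys
import tempfile
import textwrap

# Child program: creates a buffer file, registers cleanup, reports the path, then waits.
CHILD = textwrap.dedent("""
    import atexit, os, sys, tempfile, time
    fd, path = tempfile.mkstemp(suffix=".tmp", dir=sys.argv[1])
    os.close(fd)
    atexit.register(os.remove, path)  # cleanup hook, analogous to a JVM shutdown hook
    print(path, flush=True)
    time.sleep(float(sys.argv[2]))
""")


def run_child(buffer_dir, kill):
    """Run the child, optionally kill it abruptly, and report whether its file survived."""
    sleep_for = "60" if kill else "0"
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD, buffer_dir, sleep_for],
        stdout=subprocess.PIPE,
        text=True,
    )
    path = proc.stdout.readline().strip()  # wait until the file exists and the hook is set
    if kill:
        proc.kill()  # sends SIGKILL on POSIX; the cleanup hook never gets a chance to run
    proc.wait()
    proc.stdout.close()
    return os.path.exists(path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as buffer_dir:
        print("file left behind after normal exit:", run_child(buffer_dir, kill=False))
        print("file left behind after kill:", run_child(buffer_dir, kill=True))
```

On a normal exit the hook removes the file; after the kill the file stays in the buffer directory until something else deletes it.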

When running in MR mode, try setting the buffer directory to ./tmp (a relative path) so that the files are created under the task's working directory. They can then be deleted automatically when the TaskTracker/NodeManager cleans up the task's environment after it is killed.
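As a rough sketch of what that setting looks like, the Python helpers below render it in two common forms: a Hadoop-style `<property>` entry and a Hive `SET` statement. The property name fs.s3.buffer.dir comes from the question; the helper names `render_property` and `hive_set` are hypothetical, and whether a per-job override takes effect on a given cluster is an assumption to verify.

```python
from xml.etree import ElementTree as ET


def render_property(name, value):
    """Render a Hadoop-style <property> element for a configuration file."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    return ET.tostring(prop, encoding="unicode")


def hive_set(name, value):
    """Render a Hive session-level SET statement for the same property."""
    return f"SET {name}={value};"


if __name__ == "__main__":
    # Relative path, so the buffer files land under each task's working directory.
    print(render_property("fs.s3.buffer.dir", "./tmp"))
    print(hive_set("fs.s3.buffer.dir", "./tmp"))
```

The relative value is the important part: an absolute path would put the buffer files outside the task's working directory, where the task cleanup would not remove them.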

Also, have you tried using S3A (s3a://) instead? It may work better than the older S3 FS, and it does not use the fs.s3.buffer.dir setting (its upload buffering is configured separately, through fs.s3a.buffer.dir). S3A has been included in CDH5 for a while now.

LovekeshBansal · ‎12-07-2015

Thanks for such an informative reply. I have already implemented s3a://, and yes, it is the only solution.

The other one, i.e. changing to the ./tmp dir, is an intelligent workaround.
