Disk space issue on local disk due to buffering of S3 data
Labels: HDFS
Created ‎11-23-2015 09:55 PM
I am running an EC2 cluster with S3. When I run a Hive query or a Hadoop command that operates on very large data, temporary files are written to the local disk on the nodes before/after copying data to/from S3. I know the location can be configured with the 'fs.s3.buffer.dir' property. Ideally these files should be deleted afterwards, and usually they are, but in some cases they are not, resulting in the accumulation of many .tmp files (GBs' worth) on all the nodes and causing disk space issues.
Is there any way to avoid the creation of the .tmp files?
Or can we somehow identify why the .tmp files are sometimes not deleted, and correct it?
Please suggest the best solution in this case.
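As a stopgap while diagnosing the leak, a small shell sketch like the following can quantify and clean up leftover buffer files. The directory path and file names here are hypothetical stand-ins for whatever 'fs.s3.buffer.dir' points to on your nodes; the demo setup lines exist only to make the sketch self-contained.

```shell
# Hypothetical buffer directory; substitute your fs.s3.buffer.dir value.
BUFFER_DIR="/tmp/s3_demo_buffer"

# Demo setup only: simulate a buffer dir with leftover .tmp files.
mkdir -p "$BUFFER_DIR"
touch "$BUFFER_DIR/output-0001.tmp" "$BUFFER_DIR/output-0002.tmp"

# List leftover .tmp files and report the directory's total size.
find "$BUFFER_DIR" -name '*.tmp' -print
du -sh "$BUFFER_DIR"

# Delete .tmp files older than one day (use with care on live nodes,
# since recent .tmp files may belong to still-running tasks).
find "$BUFFER_DIR" -name '*.tmp' -mtime +1 -delete
```

Restricting the delete to files older than a day avoids racing against tasks that are still writing; a cron job running this on each node keeps accumulation bounded even if the root cause is not yet fixed.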
Created ‎12-06-2015 10:00 PM
When running in MR mode, try setting the buffer directory to ./tmp (a relative path) so that the files are created under each task's working directory; these are then deleted automatically when the TaskTracker/NodeManager cleans up the task's environment after it finishes.
Also, have you tried using S3A (s3a://) instead? It may work better than the older S3 filesystem and does not use a buffer directory. S3A has been included in CDH 5 for a while now.
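A minimal sketch of both suggestions as a core-site.xml fragment. The property names are stock Hadoop; the bucket name and credential values are placeholders:

```xml
<configuration>
  <!-- Relative path: buffer files land under each task's working
       directory and are removed with the task's environment. -->
  <property>
    <name>fs.s3.buffer.dir</name>
    <value>./tmp</value>
  </property>

  <!-- Credentials for the S3A connector (placeholder values). -->
  <property>
    <name>fs.s3a.access.key</name>
    <value>YOUR_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>YOUR_SECRET_KEY</value>
  </property>
</configuration>
```

With S3A configured, queries and commands can reference s3a://bucket/path URIs directly instead of s3://, which sidesteps the buffer directory entirely.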
Created ‎12-07-2015 09:19 PM
Thanks for such an informative reply. I have already implemented s3a://, and yes, that is the solution.
The other one, i.e. changing the buffer directory to a relative ./tmp, is a clever workaround.
