How do I delete MultipleOutput files from a map task when the task performs a retry?
- Labels: Apache Hadoop
Created on 06-02-2014 12:53 PM - edited 09-16-2022 01:59 AM
I'm using a map-only Hadoop job to transfer files from S3 into a local cluster. Along the way, I split the lines into their own directories based on record type using MultipleOutputs. When a map task dies due to S3 connection issues, it leaves behind its MultipleOutputs directories, which makes retries impossible because those output paths already exist.
Is there a way to avoid this? Can I ask the mapper which files a named MultipleOutputs output will write to, so I can delete them in the setup() call?
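For reference, the setup looks roughly like this. This is only a minimal sketch of the pattern, with a hypothetical class name and record format (first tab-separated field as the record type), not my exact code:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Map-only job: route each input line into a sub-directory named after its record type.
public class SplitByTypeMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
    private MultipleOutputs<NullWritable, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<>(context);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Hypothetical record format: the record type is the first tab-separated field.
        String recordType = value.toString().split("\t", 2)[0];
        // Write the line under output/<recordType>/part-m-XXXXX.
        mos.write(NullWritable.get(), value, recordType + "/part");
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}
```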
Created on 06-19-2014 12:59 PM - edited 06-19-2014 01:01 PM
This turned out to be an issue with speculative execution: the job was launching a speculative duplicate of a task before the previous attempt had cleaned up after itself, and MultipleOutputs doesn't handle speculative execution very well. Disabling it fixed the problem (e.g. conf.set("mapred.map.tasks.speculative.execution", "false");).
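For anyone hitting the same thing, here is a minimal sketch of where that setting goes in the job driver. The class and job names are hypothetical; on newer Hadoop releases the equivalent property key is "mapreduce.map.speculative":

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class S3TransferDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Disable speculative execution for map tasks so a duplicate attempt
        // cannot collide with the MultipleOutputs files of a running attempt.
        conf.setBoolean("mapred.map.tasks.speculative.execution", false);

        Job job = Job.getInstance(conf, "s3-to-cluster-split");
        job.setNumReduceTasks(0); // map-only job
        // ... set mapper class, input/output formats and paths as usual ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```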