Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant

How do I delete MultipleOutputs files from a map task when the task is retried?

Contributor

I'm using a map-only Hadoop job to transfer files from S3 into a local cluster. Along the way, I split the lines into separate directories by record type using MultipleOutputs. When a map task dies due to S3 connection issues, it leaves its MultipleOutputs directories behind, making retries impossible.

Is there a way to avoid this? Can I ask a mapper which files a named MultipleOutputs output will write to, and delete them in the setup call?

1 ACCEPTED SOLUTION

Contributor

This turned out to be an issue with speculative execution. It was causing the job to launch a new task attempt before the previous attempt had cleaned up after itself. It turns out that MultipleOutputs doesn't handle speculative execution very well. Turning speculative execution off for map tasks fixed it, e.g. conf.set("mapred.map.tasks.speculative.execution", "false");
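
For anyone landing here later, a minimal sketch of the setting above. The class name SpeculationSettings and its method are hypothetical, made up for illustration. The second property name, mapreduce.map.speculative, is not from the post: as far as I know it is the Hadoop 2+ name for the same switch, so setting both should cover either generation. A plain Map stands in for the Hadoop Configuration so the snippet runs without Hadoop on the classpath.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiConsumer;

/** Hypothetical helper (not from the original post) that turns off map-side speculation. */
final class SpeculationSettings {
    // Property name quoted in the post (classic "mapred" naming).
    static final String OLD_MAP_SPECULATION = "mapred.map.tasks.speculative.execution";
    // Assumed equivalent name for the same switch in Hadoop 2 and later.
    static final String NEW_MAP_SPECULATION = "mapreduce.map.speculative";

    private SpeculationSettings() {}

    /** Writes "false" for both property names through the supplied setter. */
    static void disableMapSpeculation(BiConsumer<String, String> setter) {
        setter.accept(OLD_MAP_SPECULATION, "false");
        setter.accept(NEW_MAP_SPECULATION, "false");
    }

    public static void main(String[] args) {
        // Stand-in for a Hadoop Configuration, so no Hadoop dependency is needed here.
        Map<String, String> conf = new LinkedHashMap<>();
        disableMapSpeculation(conf::put);
        conf.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

In a real driver you would pass conf::set instead, where conf is the org.apache.hadoop.conf.Configuration used to build the job, and call it before the job is submitted. Note this only disables speculation for map tasks, which is all a map-only job needs.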
