
FileOutputCommitter is an OutputCommitter that commits files to the job output directory, i.e. ${mapreduce.output.fileoutputformat.outputdir}.

The file output committer algorithm version is set in mapred-site.xml through the property mapreduce.fileoutputcommitter.algorithm.version. Valid version numbers are 1 and 2; the default is 1.

The file output committer has three phases: commit task, recover task, and commit job.

How each phase behaves depends on the version value you choose.

Version 1:

1. CommitTask renames the directory $joboutput/_temporary/$appAttemptID/_temporary/$taskAttemptID/ to $joboutput/_temporary/$appAttemptID/$taskID/. Put simply, it removes one level of subdirectory.

2. RecoverTask also performs a rename, from $joboutput/_temporary/$appAttemptID/$taskID/ to $joboutput/_temporary/($appAttemptID + 1)/$taskID/.

3. CommitJob merges every task output in $joboutput/_temporary/$appAttemptID/$taskID/ into $joboutput/, the path specified by mapreduce.output.fileoutputformat.outputdir. It then deletes $joboutput/_temporary/ and writes $joboutput/_SUCCESS.

Version 1 has a performance bottleneck: if a job generates many files to commit, the commitJob call at the end of the job can take minutes. The commit is single-threaded and does not start until all tasks have completed.
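The version 1 steps above can be sketched as a small local simulation. This is only an illustration under stated assumptions: it uses Python and the local filesystem rather than Hadoop's FileSystem API, and the function names (commit_task_v1, recover_task_v1, commit_job_v1, demo_v1) and the task and attempt IDs are hypothetical, not part of Hadoop.

```python
import os
import shutil
import tempfile


def commit_task_v1(job_output, app_attempt_id, task_id, task_attempt_id):
    """Step 1: rename the task attempt directory to the committed task directory."""
    attempt_root = os.path.join(job_output, "_temporary", str(app_attempt_id))
    src = os.path.join(attempt_root, "_temporary", task_attempt_id)
    dst = os.path.join(attempt_root, task_id)
    os.rename(src, dst)
    return dst


def recover_task_v1(job_output, previous_app_attempt_id, task_id):
    """Step 2: move a committed task directory to the next application attempt."""
    src = os.path.join(job_output, "_temporary", str(previous_app_attempt_id), task_id)
    dst_root = os.path.join(job_output, "_temporary", str(previous_app_attempt_id + 1))
    os.makedirs(dst_root, exist_ok=True)
    dst = os.path.join(dst_root, task_id)
    os.rename(src, dst)
    return dst


def commit_job_v1(job_output, app_attempt_id):
    """Step 3: merge every committed task directory, one after another."""
    attempt_root = os.path.join(job_output, "_temporary", str(app_attempt_id))
    for task_id in sorted(os.listdir(attempt_root)):
        if task_id == "_temporary":
            continue  # leftover parent of the task attempt directories
        task_dir = os.path.join(attempt_root, task_id)
        # Every file is moved here, in a single loop, after all tasks finished.
        for name in os.listdir(task_dir):
            os.rename(os.path.join(task_dir, name), os.path.join(job_output, name))
    shutil.rmtree(os.path.join(job_output, "_temporary"))
    open(os.path.join(job_output, "_SUCCESS"), "w").close()


def demo_v1():
    """Run three hypothetical tasks through commit task and commit job."""
    job_output = tempfile.mkdtemp()
    for i in range(3):
        attempt_id = f"attempt_{i}_0"
        attempt_dir = os.path.join(job_output, "_temporary", "0", "_temporary", attempt_id)
        os.makedirs(attempt_dir)
        with open(os.path.join(attempt_dir, f"part-{i:05d}"), "w") as f:
            f.write("data")
        commit_task_v1(job_output, 0, f"task_{i}", attempt_id)
    commit_job_v1(job_output, 0)
    listing = sorted(os.listdir(job_output))
    shutil.rmtree(job_output)
    return listing
```

In this sketch all file moves into the final directory happen inside commit_job_v1, which is why the job-level commit grows with the number of output files.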

Version 2:

Algorithm version 2 changes the behavior of commitTask, recoverTask, and commitJob.

1. CommitTask renames all files in $joboutput/_temporary/$appAttemptID/_temporary/$taskAttemptID/ to $joboutput/.

2. RecoverTask does not strictly need to do anything. To cover an upgrade from version 1 to version 2, however, it checks whether there are any files in $joboutput/_temporary/($appAttemptID - 1)/$taskID/ and renames them to $joboutput/.

3. CommitJob simply deletes $joboutput/_temporary and writes $joboutput/_SUCCESS.

This algorithm reduces the output commit time for large jobs: tasks commit directly to the final output directory as they complete, so commitJob has very little left to do.
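The version 2 flow can be sketched the same way, as a local simulation. As before, this is a hedged illustration rather than Hadoop's implementation: it works on the local filesystem, and the names commit_task_v2, commit_job_v2, and demo_v2, along with the attempt IDs, are hypothetical.

```python
import os
import shutil
import tempfile


def commit_task_v2(job_output, app_attempt_id, task_attempt_id):
    """Step 1: move the task attempt's files straight into the job output."""
    src = os.path.join(
        job_output, "_temporary", str(app_attempt_id), "_temporary", task_attempt_id
    )
    for name in os.listdir(src):
        os.rename(os.path.join(src, name), os.path.join(job_output, name))


def commit_job_v2(job_output):
    """Step 3: only clean up and write the success marker."""
    shutil.rmtree(os.path.join(job_output, "_temporary"))
    open(os.path.join(job_output, "_SUCCESS"), "w").close()


def demo_v2():
    """Run three hypothetical tasks and record what is visible before commit job."""
    job_output = tempfile.mkdtemp()
    for i in range(3):
        attempt_id = f"attempt_{i}_0"
        attempt_dir = os.path.join(job_output, "_temporary", "0", "_temporary", attempt_id)
        os.makedirs(attempt_dir)
        with open(os.path.join(attempt_dir, f"part-{i:05d}"), "w") as f:
            f.write("data")
        commit_task_v2(job_output, 0, attempt_id)
    # Part files are already in the final directory before the job commits.
    visible_early = sorted(n for n in os.listdir(job_output) if n.startswith("part-"))
    commit_job_v2(job_output)
    final = sorted(os.listdir(job_output))
    shutil.rmtree(job_output)
    return visible_early, final
```

Here the per-file moves are spread across the task commits, so commit_job_v2 does a constant amount of work regardless of how many files the job produced. The sketch also shows the other side of that choice: part files are visible in the output directory before _SUCCESS is written.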
