The file output committer is an OutputCommitter that commits files to the job output directory, i.e. ${mapreduce.output.fileoutputformat.outputdir}.

The file output committer algorithm version is set in mapred-site.xml. Valid version numbers are 1 and 2; the default is 1.
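As a minimal sketch of what this setting looks like, the snippet below embeds a mapred-site.xml fragment as a string and checks that the configured value is valid. The property name is mapreduce.fileoutputcommitter.algorithm.version; the helper name read_committer_version is hypothetical and exists only for this illustration.

```python
import xml.etree.ElementTree as ET

# Fragment as it might appear inside mapred-site.xml.
MAPRED_SITE = """
<configuration>
  <property>
    <name>mapreduce.fileoutputcommitter.algorithm.version</name>
    <value>2</value>
  </property>
</configuration>
"""

PROPERTY = "mapreduce.fileoutputcommitter.algorithm.version"


def read_committer_version(xml_text, default=1):
    """Return the configured algorithm version, or `default` if it is unset.

    Hypothetical helper for illustration; it is not part of Hadoop.
    """
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == PROPERTY:
            version = int(prop.findtext("value"))
            if version not in (1, 2):
                raise ValueError(f"invalid algorithm version: {version}")
            return version
    return default


print(read_committer_version(MAPRED_SITE))
```

When the property is absent, the helper falls back to 1, matching the default stated above.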

The file output committer has three phases: commit task, recover task, and commit job.

The behavior of each phase depends on the version you choose.

Version 1:

1. CommitTask renames the directory $joboutput/_temporary/$appAttemptID/_temporary/$taskAttemptID/ to $joboutput/_temporary/$appAttemptID/$taskID/. Put simply, it removes one level of subdirectory.

2. RecoverTask likewise renames $joboutput/_temporary/$appAttemptID/$taskID/ to $joboutput/_temporary/($appAttemptID + 1)/$taskID/.

3. CommitJob merges every task output in $joboutput/_temporary/$appAttemptID/$taskID/ into $joboutput/, the path specified by mapreduce.output.fileoutputformat.outputdir. It then deletes $joboutput/_temporary/ and writes $joboutput/_SUCCESS.

Version 1 has a performance bottleneck: if a job generates many files to commit, the commitJob call at the end of the job can take minutes. The commit is single-threaded and does not start until all tasks have completed.
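The version 1 flow can be sketched on a local filesystem. This is a simplified simulation that assumes flat task directories and uses hypothetical function names (commit_task_v1, commit_job_v1); it is not the Hadoop implementation. It shows that every file move happens inside the job commit.

```python
import os
import shutil
import tempfile


def commit_task_v1(job_output, app_attempt, task_id, task_attempt):
    """Rename the task attempt directory to the committed task directory."""
    attempt_root = os.path.join(job_output, "_temporary", str(app_attempt))
    src = os.path.join(attempt_root, "_temporary", task_attempt)
    dst = os.path.join(attempt_root, task_id)
    os.rename(src, dst)


def commit_job_v1(job_output, app_attempt):
    """Merge each committed task directory into the job output, file by file."""
    attempt_root = os.path.join(job_output, "_temporary", str(app_attempt))
    moved = 0
    for task_id in sorted(os.listdir(attempt_root)):
        if task_id == "_temporary":
            continue  # leftover parent of the task attempt directories
        task_dir = os.path.join(attempt_root, task_id)
        for name in sorted(os.listdir(task_dir)):
            os.rename(os.path.join(task_dir, name),
                      os.path.join(job_output, name))
            moved += 1
    shutil.rmtree(os.path.join(job_output, "_temporary"))
    with open(os.path.join(job_output, "_SUCCESS"), "w"):
        pass  # empty marker file
    return moved


# Simulate three tasks, each writing one part file and then committing.
job_output = tempfile.mkdtemp()
for i in range(3):
    task_attempt = f"attempt_{i}_0"
    work = os.path.join(job_output, "_temporary", "0", "_temporary", task_attempt)
    os.makedirs(work)
    with open(os.path.join(work, f"part-{i:05d}"), "w") as f:
        f.write("data\n")
    commit_task_v1(job_output, 0, f"task_{i}", task_attempt)

moved = commit_job_v1(job_output, 0)
print(moved, sorted(os.listdir(job_output)))
```

In this sketch the number of moves performed by commit_job_v1 grows with the number of output files, which is the cost described above.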

Version 2:

Algorithm version 2 changes the behavior of commitTask, recoverTask, and commitJob.

1. CommitTask renames all files in $joboutput/_temporary/$appAttemptID/_temporary/$taskAttemptID/ to $joboutput/.

2. RecoverTask does not strictly need to do anything. However, to support an upgrade from version 1 to version 2, it checks whether there are any files in $joboutput/_temporary/($appAttemptID - 1)/$taskID/ and renames them to $joboutput/.

3. CommitJob simply deletes $joboutput/_temporary and writes $joboutput/_SUCCESS.

This algorithm reduces the output commit time for large jobs: tasks commit directly to the final output directory as they complete, so commitJob has very little left to do.
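The version 2 flow can be sketched the same way. As before, this is a simplified local-filesystem simulation with hypothetical function names (commit_task_v2, commit_job_v2) and flat task directories, not the Hadoop implementation.

```python
import os
import shutil
import tempfile


def commit_task_v2(job_output, app_attempt, task_attempt):
    """Move the task attempt's files straight into the job output directory."""
    src = os.path.join(job_output, "_temporary", str(app_attempt),
                       "_temporary", task_attempt)
    for name in sorted(os.listdir(src)):
        os.rename(os.path.join(src, name), os.path.join(job_output, name))


def commit_job_v2(job_output):
    """Only clean up the temporary tree and write the success marker."""
    shutil.rmtree(os.path.join(job_output, "_temporary"))
    with open(os.path.join(job_output, "_SUCCESS"), "w"):
        pass  # empty marker file


# Simulate three tasks, each writing one part file and then committing.
job_output = tempfile.mkdtemp()
for i in range(3):
    task_attempt = f"attempt_{i}_0"
    work = os.path.join(job_output, "_temporary", "0", "_temporary", task_attempt)
    os.makedirs(work)
    with open(os.path.join(work, f"part-{i:05d}"), "w") as f:
        f.write("data\n")
    commit_task_v2(job_output, 0, task_attempt)

# The part files are already in place before the job commit runs.
visible_before = sorted(n for n in os.listdir(job_output) if n != "_temporary")
commit_job_v2(job_output)
print(visible_before, sorted(os.listdir(job_output)))
```

Here commit_job_v2 performs no file moves at all; the work has been spread across the task commits.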

Last update: 02-25-2020 10:38 AM