
FileOutputCommitter is an OutputCommitter that commits files to the job output directory, i.e. ${mapreduce.output.fileoutputformat.outputdir}.

The commit algorithm is selected with the mapreduce.fileoutputcommitter.algorithm.version property in mapred-site.xml. Valid values are 1 and 2; the default is 1.
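As a small illustration of the setting above, the following Python sketch builds the property element that would go inside the configuration block of mapred-site.xml. The helper name committer_version_property is hypothetical and exists only for this example; the property name is the standard Hadoop one, and the range check simply mirrors the "1 or 2" rule stated above.

```python
import xml.etree.ElementTree as ET

# Hypothetical helper: renders the <property> element for mapred-site.xml.
def committer_version_property(version):
    # Only versions 1 and 2 are valid, as described in the article.
    if version not in (1, 2):
        raise ValueError("algorithm version must be 1 or 2")
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = "mapreduce.fileoutputcommitter.algorithm.version"
    ET.SubElement(prop, "value").text = str(version)
    return ET.tostring(prop, encoding="unicode")

snippet = committer_version_property(2)
print(snippet)
```

The printed element can be pasted into mapred-site.xml, or the same name and value can be set on an individual job's configuration.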

The file output committer has three phases: commit task, recover task, and commit job.

The behavior of each phase depends on the algorithm version you choose.

Version 1:

1. commitTask renames the directory $joboutput/_temporary/$appAttemptID/_temporary/$taskAttemptID/ to $joboutput/_temporary/$appAttemptID/$taskID/. Put simply, it removes one level of subdirectory.

2. recoverTask also performs a rename, from $joboutput/_temporary/$appAttemptID/$taskID/ to $joboutput/_temporary/($appAttemptID + 1)/$taskID/.

3. commitJob merges every task output in $joboutput/_temporary/$appAttemptID/$taskID/ into $joboutput/ (the path specified by mapreduce.output.fileoutputformat.outputdir), then deletes $joboutput/_temporary/ and writes $joboutput/_SUCCESS.
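The three version 1 steps can be sketched as a local-filesystem simulation in Python. This is not Hadoop code: the function names, the task and attempt IDs, and the use of os.rename on a temporary directory are all assumptions made for illustration, and the merge only handles flat task directories.

```python
import os
import shutil
import tempfile

def task_attempt_dir(job_output, app_attempt, task_attempt):
    return os.path.join(job_output, "_temporary", str(app_attempt),
                        "_temporary", task_attempt)

def committed_task_dir(job_output, app_attempt, task_id):
    return os.path.join(job_output, "_temporary", str(app_attempt), task_id)

def commit_task_v1(job_output, app_attempt, task_id, task_attempt):
    # One directory rename that drops the inner _temporary level.
    os.rename(task_attempt_dir(job_output, app_attempt, task_attempt),
              committed_task_dir(job_output, app_attempt, task_id))

def recover_task_v1(job_output, prev_app_attempt, task_id):
    # Move already committed output under the next application attempt.
    dst = committed_task_dir(job_output, prev_app_attempt + 1, task_id)
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    os.rename(committed_task_dir(job_output, prev_app_attempt, task_id), dst)

def commit_job_v1(job_output, app_attempt):
    attempt_root = os.path.join(job_output, "_temporary", str(app_attempt))
    # Single-threaded merge of every committed task directory.
    for task_id in sorted(os.listdir(attempt_root)):
        if task_id == "_temporary":
            continue  # leftover parent of uncommitted task attempts
        task_dir = os.path.join(attempt_root, task_id)
        for name in os.listdir(task_dir):
            os.rename(os.path.join(task_dir, name),
                      os.path.join(job_output, name))
    shutil.rmtree(os.path.join(job_output, "_temporary"))
    open(os.path.join(job_output, "_SUCCESS"), "w").close()

def write_part(job_output, app_attempt, task_attempt, part_name):
    # Hypothetical stand-in for a task writing its output file.
    d = task_attempt_dir(job_output, app_attempt, task_attempt)
    os.makedirs(d)
    with open(os.path.join(d, part_name), "w") as f:
        f.write("data\n")

def run_demo():
    job_output = tempfile.mkdtemp()
    try:
        # Task 0 commits during application attempt 1.
        write_part(job_output, 1, "attempt_0", "part-00000")
        commit_task_v1(job_output, 1, "task_0", "attempt_0")
        # The application restarts as attempt 2 and recovers task 0.
        recover_task_v1(job_output, 1, "task_0")
        # Task 1 runs and commits during attempt 2.
        write_part(job_output, 2, "attempt_1", "part-00001")
        commit_task_v1(job_output, 2, "task_1", "attempt_1")
        commit_job_v1(job_output, 2)
        return sorted(os.listdir(job_output))
    finally:
        shutil.rmtree(job_output, ignore_errors=True)

final_listing = run_demo()
print(final_listing)
```

Note that in this sketch no part file appears in the output directory until commit_job_v1 runs, which is where all of the per-file renames happen.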

Version 1 has a performance problem: if a job generates many files to commit, the commitJob call at the end of the job can take minutes. The commit is single-threaded and does not start until all tasks have completed.

Version 2:

Algorithm version 2 changes the behavior of commitTask, recoverTask, and commitJob.

1. commitTask renames all files in $joboutput/_temporary/$appAttemptID/_temporary/$taskAttemptID/ to $joboutput/.

2. recoverTask does not strictly need to do anything, but to handle an upgrade from version 1 to version 2 it checks whether there are any files in $joboutput/_temporary/($appAttemptID - 1)/$taskID/ and renames them to $joboutput/.

3. commitJob simply deletes $joboutput/_temporary and writes $joboutput/_SUCCESS.

This algorithm reduces the output commit time for large jobs: tasks commit directly to the final output directory as they complete, so commitJob has very little left to do.
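The version 2 flow can be sketched the same way, as a local-filesystem simulation in Python rather than real Hadoop code. The function names and the task attempt IDs are hypothetical, only flat task output is handled, and the upgrade-time recoverTask case is left out for brevity.

```python
import os
import shutil
import tempfile

def task_attempt_dir(job_output, app_attempt, task_attempt):
    return os.path.join(job_output, "_temporary", str(app_attempt),
                        "_temporary", task_attempt)

def commit_task_v2(job_output, app_attempt, task_attempt):
    # Each task moves its own files straight into the final output directory.
    src = task_attempt_dir(job_output, app_attempt, task_attempt)
    for name in os.listdir(src):
        os.rename(os.path.join(src, name), os.path.join(job_output, name))
    os.rmdir(src)

def commit_job_v2(job_output):
    # Nothing left to merge: clean up and mark success.
    shutil.rmtree(os.path.join(job_output, "_temporary"), ignore_errors=True)
    open(os.path.join(job_output, "_SUCCESS"), "w").close()

def write_part(job_output, app_attempt, task_attempt, part_name):
    # Hypothetical stand-in for a task writing its output file.
    d = task_attempt_dir(job_output, app_attempt, task_attempt)
    os.makedirs(d)
    with open(os.path.join(d, part_name), "w") as f:
        f.write("data\n")

def run_demo():
    job_output = tempfile.mkdtemp()
    try:
        write_part(job_output, 1, "attempt_0", "part-00000")
        write_part(job_output, 1, "attempt_1", "part-00001")
        commit_task_v2(job_output, 1, "attempt_0")
        # Output of task 0 is already visible before the job commits.
        before = sorted(n for n in os.listdir(job_output) if n != "_temporary")
        commit_task_v2(job_output, 1, "attempt_1")
        commit_job_v2(job_output)
        after = sorted(os.listdir(job_output))
        return before, after
    finally:
        shutil.rmtree(job_output, ignore_errors=True)

visible_before_job_commit, final_listing = run_demo()
print(visible_before_job_commit)
print(final_listing)
```

In this sketch the rename work is spread across the task commits, so the job-level commit is reduced to a delete and a marker file. The first printed list also shows that task output lands in the final directory before the job has committed.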

Last update: 02-25-2020 10:38 AM