
FileOutputCommitter is an OutputCommitter that commits files to the job output directory, i.e. ${mapreduce.output.fileoutputformat.outputdir}.

The file output committer algorithm version is set in mapred-site.xml with the property mapreduce.fileoutputcommitter.algorithm.version. Valid version numbers are 1 and 2; the default is 1.
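As a rough illustration of the setting described above, the following sketch renders the property element that would go inside the configuration block of mapred-site.xml. The property name mapreduce.fileoutputcommitter.algorithm.version is the one listed in Hadoop's mapred-default.xml; the helper name algorithm_version_property is hypothetical and exists only for this example.

```python
import xml.etree.ElementTree as ET

# Property name as listed in Hadoop's mapred-default.xml.
PROPERTY_NAME = "mapreduce.fileoutputcommitter.algorithm.version"


def algorithm_version_property(version: int) -> str:
    """Render a <property> element for mapred-site.xml (hypothetical helper)."""
    if version not in (1, 2):
        raise ValueError("valid algorithm version numbers are 1 and 2")
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = PROPERTY_NAME
    ET.SubElement(prop, "value").text = str(version)
    # encoding="unicode" returns a str without an XML declaration.
    return ET.tostring(prop, encoding="unicode")


print(algorithm_version_property(2))
```

The same property can usually also be passed per job rather than cluster-wide, which may be a safer way to try version 2 on a single workload first.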

The file output committer has three phases: commit task, recover task, and commit job.

What each phase does depends on the version value you choose.

Version 1 :

1. Commit task renames the directory $joboutput/_temporary/$appAttemptID/_temporary/$taskAttemptID/ to $joboutput/_temporary/$appAttemptID/$taskID/. Put simply, it removes one level of subdirectory.

2. RecoverTask also performs a rename, from $joboutput/_temporary/$appAttemptID/$taskID/ to $joboutput/_temporary/($appAttemptID + 1)/$taskID/.

3. Commit Job merges every task output in $joboutput/_temporary/$appAttemptID/$taskID/ into $joboutput/, the path specified by mapreduce.output.fileoutputformat.outputdir. It then deletes $joboutput/_temporary/ and writes $joboutput/_SUCCESS.

Version 1 has a performance regression: if a job generates many files to commit, the commitJob method call at the end of the job can take minutes. The commit is single-threaded and does not start until all tasks have completed.
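The version 1 phases can be sketched as a small simulation on the local filesystem. This is not Hadoop's implementation: the function names, the task and attempt IDs, and the assumption that each task writes a flat set of part files are all made up for illustration. It does show where the cost sits: commit_job_v1 moves every file serially after all tasks have finished.

```python
import shutil
import tempfile
from pathlib import Path


def task_attempt_dir(job_output: Path, app_attempt: int, task_attempt: str) -> Path:
    # $joboutput/_temporary/$appAttemptID/_temporary/$taskAttemptID/
    return job_output / "_temporary" / str(app_attempt) / "_temporary" / task_attempt


def committed_task_dir(job_output: Path, app_attempt: int, task_id: str) -> Path:
    # $joboutput/_temporary/$appAttemptID/$taskID/
    return job_output / "_temporary" / str(app_attempt) / task_id


def commit_task_v1(job_output: Path, app_attempt: int, task_id: str, task_attempt: str) -> None:
    # One directory rename per task: drops the inner _temporary level.
    src = task_attempt_dir(job_output, app_attempt, task_attempt)
    src.rename(committed_task_dir(job_output, app_attempt, task_id))


def recover_task_v1(job_output: Path, prev_app_attempt: int, task_id: str) -> None:
    # Carry an already committed task over to the next application attempt.
    dst = committed_task_dir(job_output, prev_app_attempt + 1, task_id)
    dst.parent.mkdir(parents=True, exist_ok=True)
    committed_task_dir(job_output, prev_app_attempt, task_id).rename(dst)


def commit_job_v1(job_output: Path, app_attempt: int) -> int:
    # Serial merge: every file of every committed task is moved one by one.
    renames = 0
    attempt_root = job_output / "_temporary" / str(app_attempt)
    for task_dir in sorted(attempt_root.iterdir()):
        if task_dir.name == "_temporary":
            continue  # leftover parent of the task attempt directories
        for part in sorted(task_dir.iterdir()):
            part.rename(job_output / part.name)
            renames += 1
    shutil.rmtree(job_output / "_temporary")
    (job_output / "_SUCCESS").touch()
    return renames


def run_v1_demo(num_tasks: int = 3):
    # Simulate num_tasks tasks, each writing one part file (an assumption).
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "joboutput"
        for i in range(num_tasks):
            attempt = task_attempt_dir(out, 0, f"attempt_{i}_0")
            attempt.mkdir(parents=True)
            (attempt / f"part-{i:05d}").write_text("data")
            commit_task_v1(out, 0, f"task_{i}", f"attempt_{i}_0")
        renames = commit_job_v1(out, 0)
        return renames, sorted(p.name for p in out.iterdir())


print(run_v1_demo())
```

In this sketch the number of renames in commit_job_v1 grows with the total number of output files, which mirrors why the final commit can take minutes for jobs with many files.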

Version 2 :

Algorithm version 2 will change the behavior of commitTask, recoverTask, and commitJob.

1. CommitTask will rename all files in $joboutput/_temporary/$appAttemptID/_temporary/$taskAttemptID/ to $joboutput/

2. RecoverTask does not strictly need to do anything. To support an upgrade from version 1 to version 2, however, it checks whether there are any files in $joboutput/_temporary/($appAttemptID - 1)/$taskID/ and renames them to $joboutput/

3. CommitJob can simply delete $joboutput/_temporary and write $joboutput/_SUCCESS

This algorithm reduces the output commit time for large jobs: tasks commit directly to the final output directory as they complete, so commitJob has very little to do.
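The version 2 behavior can be sketched the same way, again as a local-filesystem simulation and not Hadoop's actual code. The function names, IDs, and the flat one-file-per-task layout are assumptions made for the example. Here the per-file renames happen inside each task's commit, so the job commit only cleans up and writes the marker file.

```python
import shutil
import tempfile
from pathlib import Path


def task_attempt_dir(job_output: Path, app_attempt: int, task_attempt: str) -> Path:
    # $joboutput/_temporary/$appAttemptID/_temporary/$taskAttemptID/
    return job_output / "_temporary" / str(app_attempt) / "_temporary" / task_attempt


def commit_task_v2(job_output: Path, app_attempt: int, task_attempt: str) -> None:
    # Each task moves its own files straight into the final output directory.
    src = task_attempt_dir(job_output, app_attempt, task_attempt)
    for part in sorted(src.iterdir()):
        part.rename(job_output / part.name)


def recover_task_v2(job_output: Path, app_attempt: int, task_id: str) -> None:
    # Only matters after an upgrade from version 1: a previous application
    # attempt may have left a committed task directory behind.
    prev = job_output / "_temporary" / str(app_attempt - 1) / task_id
    if prev.is_dir():
        for part in sorted(prev.iterdir()):
            part.rename(job_output / part.name)


def commit_job_v2(job_output: Path) -> None:
    # Nothing left to merge: remove the scratch tree and write the marker.
    shutil.rmtree(job_output / "_temporary", ignore_errors=True)
    (job_output / "_SUCCESS").touch()


def run_v2_demo(num_tasks: int = 3):
    # Simulate num_tasks tasks, each writing one part file (an assumption).
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "joboutput"
        for i in range(num_tasks):
            attempt = task_attempt_dir(out, 0, f"attempt_{i}_0")
            attempt.mkdir(parents=True)
            (attempt / f"part-{i:05d}").write_text("data")
            commit_task_v2(out, 0, f"attempt_{i}_0")
        # Output files are already in $joboutput/ before the job commits.
        visible_early = sorted(p.name for p in out.glob("part-*"))
        commit_job_v2(out)
        return visible_early, sorted(p.name for p in out.iterdir())


print(run_v2_demo())
```

One consequence visible in the sketch is that task output appears in the final directory before _SUCCESS is written, so readers of the output directory should wait for the marker file before consuming results.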

Related JIRAs : https://hortonworks.jira.com/browse/BUG-59560 https://hortonworks.jira.com/browse/BUG-57410

Last update: 03-23-2017 07:24 PM