YC
New Contributor
Posts: 1
Registered: ‎04-24-2014

MapReduce MultiFileOutput behavior: append vs overwrite

I have a MapReduce program in which the reducers use my MultiFileOutput class to write multiple HDFS files, one per key value.

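        // Uses org.apache.hadoop.io.Text and
        // org.apache.hadoop.mapred.lib.MultipleTextOutputFormat (old "mapred" API).
        static class MultiFileOutput extends MultipleTextOutputFormat<Text, Text> {
                // Name the output file after the key; the proposed leaf name is ignored.
                @Override
                protected String generateFileNameForKeyValue(Text key, Text value, String name) {
                        return key.toString();
                }
        }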

When I ran this on Apache Hadoop 1.0.1, the reducers all appended their output to separate output files based on the value of the key. However, when I ran it on Cloudera 4.6.0 (both MapReduce1 and YARN), the reducers overwrote each other's output instead of appending whenever they wrote to the same output file name. Is this the expected behavior in Cloudera? Why is it different from Apache Hadoop? Is there a quick fix for this issue?

Thanks!

Yongcheng

Posts: 1,836
Kudos: 416
Solutions: 295
Registered: ‎07-31-2013

Re: MapReduce MultiFileOutput behavior: append vs overwrite

Do you perhaps have a small test case illustrating this? The reducers should not be overwriting each other's files, because the leaf name still carries the reducer's ID.

Additionally, are you using the MR2 libraries or MR1 for the APIs and cluster?
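
To illustrate the leaf-name point, here is a rough stand-alone sketch. It is a simplified model, not Hadoop code: it assumes the `name` argument passed to generateFileNameForKeyValue is the per-reducer leaf name (for example `part-00000`), and that two reduce tasks end up generating a file for the same key. The helper names `keyOnly`, `keyWithLeaf`, and `commit`, and the key `apple`, are hypothetical. It compares an override that drops the leaf name, as the one in the question does, with one that keeps it.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-alone simulation of output file naming; no Hadoop classes are used.
class LeafNameDemo {

    // Mirrors the override in the question: the leaf name is discarded.
    static String keyOnly(String key, String leafName) {
        return key;
    }

    // Hypothetical alternative: keep the leaf name so each reducer gets its own file.
    static String keyWithLeaf(String key, String leafName) {
        return key + "-" + leafName;
    }

    // Models two reducers placing a file for the same key into one output directory.
    // The map stands in for the directory: a second put() under the same file name
    // replaces the first, which is the "overwrite" case.
    static Map<String, String> commit(boolean keepLeaf) {
        Map<String, String> outputDir = new HashMap<>();
        String[] leafNames = {"part-00000", "part-00001"}; // assumed reducer leaf names
        for (String leaf : leafNames) {
            String file = keepLeaf ? keyWithLeaf("apple", leaf) : keyOnly("apple", leaf);
            outputDir.put(file, "data from " + leaf);
        }
        return outputDir;
    }

    public static void main(String[] args) {
        System.out.println(commit(false).keySet()); // a single file name
        System.out.println(commit(true).keySet());  // two distinct file names
    }
}
```

If this model matches what you are seeing, returning something like key.toString() + "-" + name from generateFileNameForKeyValue would give each reducer its own file per key, at the cost of having several files per key to read back afterwards.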