How to suppress mapper output files if the output file does not have any data?

Rising Star

Apologies if I haven't phrased the question well.

I have a combined file format that returns the file name as the key and the file content as the value.

I customized the Mapper class's run method so that it calls the map method only if the file meets specific conditions. Let's say it calls the map method only if the file content is larger than 200 KB.

If 200 files are sent as input, 200 mappers will start, and even if only 100 files meet the criteria and run the map method, we will still have 200 output files in the output folder.

Is there a way to ensure that no output file is created when a mapper has no data to write? Or, the other way around, to create output files only when there is data for them?

1 ACCEPTED SOLUTION

Mentor
From Hadoop: The Definitive Guide (Tom White):

"""
About LazyOutputFormat
-----------------------
A typical MapReduce program can produce output files that are empty,
depending on your implementation.
If you want to suppress creation of empty files, you need to leverage
LazyOutputFormat.
Two lines in your driver will do the trick:
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
and
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
"""
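
For context, here is a minimal sketch of where those two lines could sit in a new-API driver. The class name, job name, the map-only setting, and the use of the default identity Mapper are assumptions made for illustration; only the LazyOutputFormat call comes from the quote above. With LazyOutputFormat wrapping TextOutputFormat, a task's part file should only be created once that task writes its first record.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Hypothetical driver: the class and job names are illustrative only.
class LazyOutputDriver {

    // Wrap the real output format so part files are created on first write.
    static void useLazyOutput(Job job) {
        LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "lazy-output-example");
        job.setJarByClass(LazyOutputDriver.class);

        // Assumption: a map-only job, so mapper output goes straight to the
        // output format. No mapper is set here, so Hadoop's default identity
        // Mapper runs; plug in your own mapper and input format instead.
        job.setNumReduceTasks(0);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Use this instead of job.setOutputFormatClass(TextOutputFormat.class).
        useLazyOutput(job);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Note that the helper replaces any earlier job.setOutputFormatClass(...) call, so it is safest to call it last among the output settings.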



5 REPLIES 5

Champion Alumni

Hello Harsh,

I tried the same with AvroMultipleOutputs, and it still generates empty Avro files. Does something additional need to be done when using AvroMultipleOutputs? I am using Avro 1.7.7 and CDH 5.4. Please let me know if you have faced this issue.

Thanks,
Nishanth

Champion Alumni
The issue in my case was that I was not closing the AvroMultipleOutputs instance in the mapper. Combining LazyOutputFormat with closing the AvroMultipleOutputs instance in the mapper fixed the issue for me.
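
A rough sketch of what closing the instance in the mapper can look like, for anyone hitting the same thing. The mapper class, the "records" named output, the input types, and the blank-line filter are all hypothetical and not taken from my actual job. It also assumes the driver has registered "records" with AvroMultipleOutputs.addNamedOutput (using AvroKeyOutputFormat and a string schema) and has set LazyOutputFormat as in the accepted solution.

```java
import java.io.IOException;

import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapreduce.AvroMultipleOutputs;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: names and the filter condition are illustrative only.
class AvroLinesMapper extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

    private AvroMultipleOutputs outputs;

    @Override
    protected void setup(Context context) {
        // One AvroMultipleOutputs instance per task.
        outputs = new AvroMultipleOutputs(context);
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String text = line.toString().trim();
        if (text.isEmpty()) {
            return; // condition not met: write nothing for this record
        }
        // Assumes "records" was registered in the driver with a string schema.
        outputs.write("records", new AvroKey<CharSequence>(text), NullWritable.get());
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // The step I was missing: close the instance when the task finishes.
        outputs.close();
    }
}
```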

New Contributor

Hello Harsh,

Can you please also suggest a solution for the old mapred API? My code generates empty part-xxxx files when the mapper conditions are not met, and because of this the reducer throws exceptions when it reaches 80%. So I need to suppress writing the empty part-xxxx files in the mapper stage itself. Your input would be very helpful. Thanks in advance!

BR//Hareeharan

Mentor
LazyOutputFormat is available for both APIs. Here's the one for the older API: http://archive.cloudera.com/cdh5/cdh/5/hadoop/api/org/apache/hadoop/mapred/lib/LazyOutputFormat.html
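
A minimal sketch of the old-API equivalent, assuming a JobConf-based driver. The class and job names are hypothetical, and no mapper or reducer is set, so Hadoop's identity defaults apply; substitute your own classes. The only call that matters here is the one on org.apache.hadoop.mapred.lib.LazyOutputFormat.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.LazyOutputFormat;

// Hypothetical old-API driver: the class and job names are illustrative only.
class OldApiLazyOutputDriver {

    // Wrap the old-API TextOutputFormat so part files are created lazily.
    static void useLazyOutput(JobConf conf) {
        LazyOutputFormat.setOutputFormatClass(conf, TextOutputFormat.class);
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(OldApiLazyOutputDriver.class);
        conf.setJobName("lazy-output-old-api");

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // Use this instead of conf.setOutputFormat(TextOutputFormat.class).
        useLazyOutput(conf);

        JobClient.runJob(conf);
    }
}
```

Be careful to import the mapred versions of both classes; mixing them with the mapreduce ones will not compile against JobConf.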