how to suppress mapper output files if the output file does not have any data?

Explorer

Apologies if I haven't phrased the question properly.

I have a combined file input format that returns the file name as the key and the file content as the value.

I overrode the Mapper class's run method so that it calls the map method only if the file meets specific conditions.

For example, it calls the map method only if the file content is larger than 200 KB.

If 200 files are sent as input, 200 mappers start. If only 100 of those files meet the criteria and go through the map method, we still end up with 200 output files in the output folder.

Is there a way to ensure that no output file is created when a mapper produces no data? Or, put the other way around, to create output files only when there is data to write?
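
For concreteness, the setup described above might look roughly like the sketch below. The class name, the key/value types (Text file name, BytesWritable content) and the placeholder map logic are assumptions for illustration, not the actual code from this job. It assumes the Hadoop 2.x new-API jars are on the classpath.

```java
import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/** Hypothetical mapper: key = file name, value = whole file content. */
public class SizeFilteringMapper extends Mapper<Text, BytesWritable, Text, Text> {

    // Threshold taken from the example in the question: 200 KB.
    static final int MIN_BYTES = 200 * 1024;

    @Override
    public void run(Context context) throws IOException, InterruptedException {
        setup(context);
        try {
            while (context.nextKeyValue()) {
                // Files that fail the condition are skipped; map() is never called for them.
                if (context.getCurrentValue().getLength() > MIN_BYTES) {
                    map(context.getCurrentKey(), context.getCurrentValue(), context);
                }
            }
        } finally {
            cleanup(context);
        }
    }

    @Override
    protected void map(Text fileName, BytesWritable content, Context context)
            throws IOException, InterruptedException {
        // Placeholder logic (assumption): emit the file name and its size in bytes.
        context.write(fileName, new Text(Integer.toString(content.getLength())));
    }
}
```

With a plain TextOutputFormat, the task's output file is opened when the record writer is created, not on the first write, so a mapper that skips its only file still leaves an empty part file behind.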

1 ACCEPTED SOLUTION

Re: how to suppress mapper output files if the output file does not have any data?

Master Guru
From Hadoop: The Definitive Guide (Tom White):

"""
About LazyOutputFormat
-----------------------
A typical MapReduce program can produce output files that are empty,
depending on your implementation.
If you want to suppress creation of empty files, you need to leverage
LazyOutputFormat.
Two lines in your driver will do the trick:
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
and
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
"""

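To show where those two lines sit, here is a minimal driver sketch for the new (org.apache.hadoop.mapreduce) API. The class name, job name and the use of args[0]/args[1] as input and output paths are assumptions for illustration; the mapper and input format setup is left out, so as written the job runs with Hadoop's default identity mapper.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

/** Hypothetical map-only driver that suppresses empty part files. */
public class LazyOutputDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "lazy-output-example");
        job.setJarByClass(LazyOutputDriver.class);

        // Set your own mapper, input format and key/value classes here.
        job.setNumReduceTasks(0); // map-only, as in the question

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Wraps TextOutputFormat so the output file is only created on the first write.
        // Use this instead of job.setOutputFormatClass(TextOutputFormat.class).
        LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Calling job.setOutputFormatClass(...) after the LazyOutputFormat line would replace the lazy wrapper again, so the LazyOutputFormat call should be the only place the output format is set.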

5 REPLIES

Re: how to suppress mapper output files if the output file does not have any data?

Champion Alumni

Hello Harsh,

I tried the same with AvroMultipleOutputs, and it still generates empty Avro files. Should something additional be done when using AvroMultipleOutputs? I am using Avro 1.7.7 and CDH 5.4. Please let me know if you have faced this issue.

Thanks,

Nishanth

Re: how to suppress mapper output files if the output file does not have any data?

Champion Alumni
The issue in my case was that I was not closing the AvroMultipleOutputs instance in the mapper. Combining LazyOutputFormat with closing the AvroMultipleOutputs instance in the mapper fixed the issue for me.
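
A rough sketch of what that looks like in a mapper, assuming Avro's org.apache.avro.mapreduce.AvroMultipleOutputs and GenericRecord keys. The class name and the named output "records" are made up for illustration; the named output is assumed to have been registered in the driver with AvroMultipleOutputs.addNamedOutput, and LazyOutputFormat is assumed to wrap the job's Avro output format there.

```java
import java.io.IOException;

import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapreduce.AvroMultipleOutputs;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Mapper;

/** Hypothetical mapper that writes through AvroMultipleOutputs and closes it. */
public class AvroSplittingMapper
        extends Mapper<AvroKey<GenericRecord>, NullWritable, AvroKey<GenericRecord>, NullWritable> {

    // Assumed name; must match the named output registered in the driver.
    static final String NAMED_OUTPUT = "records";

    private AvroMultipleOutputs outputs;

    @Override
    protected void setup(Context context) {
        outputs = new AvroMultipleOutputs(context);
    }

    @Override
    protected void map(AvroKey<GenericRecord> key, NullWritable value, Context context)
            throws IOException, InterruptedException {
        // Records go to the named output rather than to context.write().
        outputs.write(NAMED_OUTPUT, key, value);
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // The step that was missing in my case: close the instance when the task ends.
        outputs.close();
    }
}
```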

Re: how to suppress mapper output files if the output file does not have any data?

New Contributor

Hello Harsh,

Could you please also suggest a solution for code written against the old mapred API? My code generates empty part-xxxx files when the mapper conditions are not met, and because of this the reducer throws exceptions when it reaches 80%. I need to suppress writing the empty part-xxxx files in the mapper stage itself. Your input would be very helpful. Thanks in advance!

BR//Hareeharan

Re: how to suppress mapper output files if the output file does not have any data?

Master Guru
LazyOutputFormat is available for both APIs. Here's the one for the older API: http://archive.cloudera.com/cdh5/cdh/5/hadoop/api/org/apache/hadoop/mapred/lib/LazyOutputFormat.html
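
With the older API the call looks much the same, except that it takes a JobConf and the classes come from org.apache.hadoop.mapred. A minimal sketch, where the class name, job name and the args[0]/args[1] paths are assumptions for illustration and the mapper/reducer setup is omitted (Hadoop's identity defaults apply as written):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.LazyOutputFormat;

/** Hypothetical old-API driver using the mapred LazyOutputFormat. */
public class OldApiLazyOutputDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(OldApiLazyOutputDriver.class);
        conf.setJobName("lazy-output-old-api");

        // Set your own mapper, reducer and key/value classes here.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // Old-API equivalent: wrap the real output format in LazyOutputFormat.
        LazyOutputFormat.setOutputFormatClass(conf, TextOutputFormat.class);

        JobClient.runJob(conf);
    }
}
```

Note that both the LazyOutputFormat and the TextOutputFormat imports must come from the old org.apache.hadoop.mapred packages here; mixing in the new-API classes will not compile against JobConf.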