Seeing output of a MR job as a single file even if reducer has created multiple part-r-0000* files?


Hi,

Can anyone suggest a possible way to see the output of an MR job as a single file, even when the reducer has created multiple part-r-0000* files?

1 ACCEPTED SOLUTION

Master Guru

@Rushikesh Deshmukh, -getmerge will download all parts from HDFS to your local machine and merge them into a single file at a local destination. If you have 8 parts of, say, 128M each, you will end up downloading 1G of data, so it only makes sense for small outputs.
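
For illustration, a getmerge run might look like this (the paths here are made up, adjust them to your job):

# /user/rushikesh/outputdir and /tmp/merged.txt are illustrative paths
hdfs dfs -getmerge /user/rushikesh/outputdir /tmp/merged.txt
# check how much data actually landed locally
ls -lh /tmp/merged.txt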

However, if you want to keep the resulting file on HDFS, one way to do it is to run an MR job with identity mappers and a single identity reducer. For example:

hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
    -Dmapred.reduce.tasks=1 \
    -mapper cat \
    -reducer cat \
    -input 'inputdir/part-r-*' \
    -output outputdir

If you keep the output on HDFS, another question is: seen by whom as a single file? The above command will create a single output file, but if you run another MR job using that file as input, the MR framework will by default "see" it as 8 input splits, one per HDFS block (assuming a 128M block size), and will process it with 8 mappers.
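
If you want to confirm how the framework will split the merged file, fsck can list its blocks (the path below is illustrative):

# list the blocks of the merged file; the file name may differ
# depending on the job's output format
# with a 128M block size a 1G file shows 8 blocks, so a follow-up
# MR job will schedule roughly one mapper per block
hdfs fsck /user/rushikesh/outputdir/part-00000 -files -blocks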


4 REPLIES


I got the answer below:

hdfs dfs -getmerge <src> <localdst> [addnl]

The above command worked for me.

Master Mentor

@Rushikesh Deshmukh, which command worked for you? Did @Predrag Minovic's answer help you? If so, please accept his answer rather than your own, since otherwise we do not know what actually solved it for you. You do not earn points for accepting your own answers, and it is not good practice unless you genuinely found the solution yourself.


@Artem Ervits, the above-mentioned command let me see the output as a single file even though the reducer had created multiple part files, so I accepted that answer.
