Seeing output of a MR job as a single file even if reducer has created multiple part-r-0000* files?


Hi,

Can anyone suggest a possible way to see the output of an MR job as a single file, even when the reducer has created multiple part-r-0000* files?

1 ACCEPTED SOLUTION

Master Guru

@Rushikesh Deshmukh, -getmerge will download all parts from HDFS to your local machine and merge them into a single file at a local destination. If you have 8 parts of, say, 128M each, you will end up downloading 1G of data, so it only makes sense for small outputs.
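
For illustration, a getmerge run might look like this (the paths here are made up, adjust them to your job):

# /user/rushikesh/outputdir and /tmp/merged.txt are illustrative paths
hdfs dfs -getmerge /user/rushikesh/outputdir /tmp/merged.txt
# check how much data actually landed locally
ls -lh /tmp/merged.txt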

However, if you want to keep the resulting file on HDFS, one way to do it is to run an MR job with identity mappers and a single identity reducer. For example:

hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
    -Dmapred.reduce.tasks=1 \
    -mapper cat \
    -reducer cat \
    -input 'inputdir/part-r-*' \
    -output outputdir

If you keep the output on HDFS, another question is: seen by whom as a single file? The above command will create a single output file, but if you run another MR job using that file as input, the MR framework will by default "see" it as 8 input splits, one per HDFS block (assuming a 128M block size), and will process it with 8 mappers.
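
If you want to confirm how the framework will split the merged file, fsck can list its blocks (the path below is illustrative):

# list the blocks of the merged file; the file name may differ
# depending on the job's output format
# with a 128M block size a 1G file shows 8 blocks, so a follow-up
# MR job will schedule roughly one mapper per block
hdfs fsck /user/rushikesh/outputdir/part-00000 -files -blocks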


4 REPLIES


I got the answer below:

hdfs dfs -getmerge <src> <localdst> [addnl]

The above command worked for me.

Master Mentor

@Rushikesh Deshmukh, which command worked for you? Did @Predrag Minovic's answer help you? If so, please accept his answer rather than your own, since otherwise we do not know what actually solved it for you. You do not earn points for accepting your own answers, and it is not good practice unless you genuinely found the solution yourself.


@Artem Ervits, the above-mentioned command let me see the output as a single file even though the reducer had created multiple part files, so I accepted that answer.
