Created 02-19-2016 07:06 AM
Hi,
Can anyone suggest a way to see the output of an MR job as a single file, even when the reducer has created multiple part-r-0000* files?
Created 02-19-2016 08:15 AM
@Rushikesh Deshmukh, -getmerge will download all parts from HDFS to your local machine and merge them into a single file at a local destination. If you have 8 parts of, say, 128M each, you will end up downloading 1G of data, so this only makes sense for small outputs.
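For example, a typical invocation looks like this (the paths here are just placeholders, not from your job):
hdfs dfs -getmerge outputdir /tmp/merged.txt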
However, if you want to keep the resulting file on HDFS, one way to do it is to run an MR job with identity (unit) mappers and a single identity reducer. For example:
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar -Dmapred.reduce.tasks=1 -mapper cat -reducer cat -input 'inputdir/part-r-*' -output outputdir
If you keep the output on HDFS, another question is: "seen as a single file by whom?" The above command will create a single output file, but if you run another MR job using that file as input, the MR framework will by default still "see" it as 8 splits, corresponding to its 8 HDFS blocks (assuming a block size of 128M), and will process it with 8 mappers.
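If you want to confirm how the merged file will be split, you can inspect its block layout with fsck; for example (the path is a placeholder):
hdfs fsck outputdir/part-r-00000 -files -blocks
A 1G file with a 128M block size should report 8 blocks, which is why a downstream MR job launches 8 mappers against it.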
Created 02-19-2016 07:15 AM
I found the following answer:
hdfs dfs -getmerge <src> <localdst> [addnl]
The above command worked for me.
Created 02-19-2016 11:12 AM
@Rushikesh Deshmukh, which command worked for you? Did @Predrag Minovic's answer help you? If so, please accept his answer and not your own, since we do not know what solved it for you. You do not earn points for accepting your own answers, and it's not good practice unless you actually found your own solution.
Created 02-19-2016 11:25 AM
@Artem Ervits, the above-mentioned command worked in my case to see the output as a single file when the reducer had created multiple part files, so I accepted that answer.