
merge file in hdfs

New Contributor

I need to merge n part files in HDFS, but I don't have enough space in the local FS to generate the merged file with getmerge. Is there another way to do this?

6 REPLIES

Re: merge file in hdfs

New Contributor

Re: merge file in hdfs

Expert Contributor

Re: merge file in hdfs

New Contributor

i dont have enough space in the local FS

Re: merge file in hdfs

Expert Contributor

@eric valoschin the solution in the above link does not store the output on the local FS. It streams the output from HDFS back to HDFS:

============================

A command line scriptlet to do this could be as follows:

hadoop fs -text *_fileName.txt | hadoop fs -put - targetFilename.txt

This will cat all files that match the glob to standard output; that stream is then piped to the put command, which writes it to an HDFS file named targetFilename.txt.

=============================
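Not stated in the thread, but for completeness: hadoop fs -put with "-" reads from standard input, and hadoop fs -appendToFile with "-" does the same for an existing target, so the pattern can be written either way. The paths below are illustrative, not taken from the thread, and these commands assume access to a running cluster:

```shell
# Stream the decoded text of every matching part file straight back into
# HDFS, never touching the local filesystem. The -put target must not
# already exist.
hadoop fs -text '/data/parts/part-*' | hadoop fs -put - /data/merged.txt

# Or append the same stream to an HDFS file that already exists:
hadoop fs -text '/data/parts/part-*' | hadoop fs -appendToFile - /data/merged.txt
```

Quoting the glob keeps the local shell from expanding it, so HDFS performs the matching.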


Re: merge file in hdfs

New Contributor

It is a compressed bz2 file, and I get an error about the codec when trying to get the new file.

INFO compress.CodecPool: Got brand-new decompressor [.bz2]

text: Unable to write to output stream.
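Not from the thread, but one likely workaround: bzip2 is a concatenable format, meaning that appending one valid .bz2 stream to another yields a valid .bz2 stream. So compressed part files can be merged as raw bytes with -cat instead of decompressing them with -text, and no codec is involved at all. The hadoop paths below are illustrative:

```shell
# Merging compressed parts as raw bytes (cluster command, illustrative paths):
#   hadoop fs -cat '/data/parts/part-*.bz2' | hadoop fs -put - /data/merged.bz2
# Local demonstration that concatenated bzip2 streams decompress cleanly:
printf 'hello\n' | bzip2 -c > p1.bz2
printf 'world\n' | bzip2 -c > p2.bz2
cat p1.bz2 p2.bz2 > merged.bz2
bzip2 -dc merged.bz2
```

The final command decompresses both members in order, printing both lines.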

Re: merge file in hdfs

Super Guru

@eric valoschin

Can you try the following command?

hadoop jar /usr/hdp/2.5.3.0-37/hadoop-mapreduce/hadoop-streaming-2.7.3.2.5.3.0-37.jar \
                   -Dmapred.reduce.tasks=1 \
                   -input "<path-to-input-directory>" \
                   -output "<path-to-output-directory>" \
                   -mapper cat \
                   -reducer cat

Check which version of the Hadoop streaming jar you are using by looking under

/usr/hdp

then give the input path, and make sure the output directory does not already exist: this job merges the files and creates the output directory for you.

Here is what I tried:

#hdfs dfs -ls /user/yashu/folder2/
Found 2 items 
-rw-r--r--   3 hdfs hdfs        150 2017-09-26 17:55 /user/yashu/folder2/part1.txt 
-rw-r--r--   3 hdfs hdfs         20 2017-09-27 09:07 /user/yashu/folder2/part1_sed.txt
#hadoop jar /usr/hdp/2.5.3.0-37/hadoop-mapreduce/hadoop-streaming-2.7.3.2.5.3.0-37.jar \
>                    -Dmapred.reduce.tasks=1 \
>                    -input "/user/yashu/folder2/" \
>                    -output "/user/yashu/folder1/" \
>                    -mapper cat \
>                    -reducer cat

folder2 had 2 files. After running the above command, the merged output is stored in the folder1 directory, and the 2 files were merged into 1 file, as you can see below:

#hdfs dfs -ls /user/yashu/folder1/
Found 2 items 
-rw-r--r--   3 hdfs hdfs          0 2017-10-09 16:00 /user/yashu/folder1/_SUCCESS 
-rw-r--r--   3 hdfs hdfs        174 2017-10-09 16:00 /user/yashu/folder1/part-00000
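As a follow-up (an assumption on my part, not part of the reply above): the merged data lands in part-00000, which can be renamed within HDFS if a friendlier name is wanted. The target name is hypothetical; the paths follow the example above, and the command assumes access to the cluster:

```shell
# hadoop fs -mv renames/moves a file inside HDFS without copying data.
hadoop fs -mv /user/yashu/folder1/part-00000 /user/yashu/folder1/merged.txt
```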