Support Questions


merge file in hdfs

Explorer

I need to merge N part files in HDFS, but I don't have enough space in the local FS to generate the merged file with getmerge. Is there another way to do this?

6 REPLIES

Explorer

Super Collaborator

Explorer

I don't have enough space in the local FS

Super Collaborator

@eric valoschin the solution in the above link does not store the output on the local FS. It streams the output from HDFS to HDFS:

============================

A command line scriptlet to do this could be as follows:

hadoop fs -text *_fileName.txt | hadoop fs -put - targetFilename.txt

This will cat all files that match the glob to standard output, then pipe that stream into the put command, which writes it to an HDFS file named targetFilename.txt

============================

Explorer

It is a compressed bz2 file, and I get an error about the codec when trying to generate the new file.

INFO compress.CodecPool: Got brand-new decompressor [.bz2]

text: Unable to write to output stream.
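A note on why the codec gets involved: `hadoop fs -text` decompresses while reading, so the merged stream it produces is plain text, not bz2. bzip2 has the useful property that concatenated compressed streams still form a valid bzip2 file, which is why raw concatenation of the .bz2 parts (e.g. with `hadoop fs -cat` piped into `-put`) can yield a valid merged .bz2 without decompressing anything. A local sketch of that property, assuming the `bzip2`/`bunzip2` CLI tools are installed (file names are illustrative, not from the thread):

```shell
# Illustrative sketch: bzip2 streams can be concatenated byte-for-byte
# and the result is still a valid bzip2 file.
workdir=$(mktemp -d)
printf 'alpha\n' | bzip2 > "$workdir/part-00000.bz2"
printf 'beta\n'  | bzip2 > "$workdir/part-00001.bz2"
# Raw concatenation of the compressed parts -- no decompression involved.
cat "$workdir"/part-*.bz2 > "$workdir/merged.bz2"
# bunzip2 reads both concatenated streams back out: alpha, then beta.
bunzip2 -c "$workdir/merged.bz2"
```

The same idea applied to HDFS would keep the merged file compressed, at the cost of the parts needing to be bz2 (gzip shares this property; many other codecs do not).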

Master Guru

@eric valoschin

Can you try the following command?

hadoop jar /usr/hdp/2.5.3.0-37/hadoop-mapreduce/hadoop-streaming-2.7.3.2.5.3.0-37.jar \
                   -Dmapred.reduce.tasks=1 \
                   -input "<path-to-input-directory>" \
                   -output "<path-to-output-directory>" \
                   -mapper cat \
                   -reducer cat

Check which version of the hadoop-streaming jar you are using by looking under

/usr/hdp

then give the input path, and make sure the output directory does not exist, as this job will merge the files and create the output directory for you.
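One caveat with this approach: even though both mapper and reducer are plain `cat`, every record passes through the shuffle phase, which sorts lines by key (the text up to the first tab). So the single reduce task does produce one merged file, but its lines come out in sorted order rather than the original file order. A rough local model of the job, with illustrative file names (not the actual MapReduce machinery):

```shell
# Local model of the streaming job:
#   mapper = cat, shuffle = sort by key, reducer = cat,
#   one reduce task => one merged output file (in sorted order).
workdir=$(mktemp -d)
printf 'banana\napple\n' > "$workdir/part1.txt"
printf 'cherry\n'        > "$workdir/part2.txt"
cat "$workdir"/part*.txt | sort | cat > "$workdir/part-00000"
cat "$workdir/part-00000"   # apple, banana, cherry: merged but sorted
```

If the original line order matters, the `hadoop fs -cat ... | hadoop fs -put -` approach above preserves it, while the streaming job does not.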

Here is what I tried:

#hdfs dfs -ls /user/yashu/folder2/
Found 2 items 
-rw-r--r--   3 hdfs hdfs        150 2017-09-26 17:55 /user/yashu/folder2/part1.txt 
-rw-r--r--   3 hdfs hdfs         20 2017-09-27 09:07 /user/yashu/folder2/part1_sed.txt
#hadoop jar /usr/hdp/2.5.3.0-37/hadoop-mapreduce/hadoop-streaming-2.7.3.2.5.3.0-37.jar \
>                    -Dmapred.reduce.tasks=1 \
>                    -input "/user/yashu/folder2/" \
>                    -output "/user/yashu/folder1/" \
>                    -mapper cat \
>                    -reducer cat

folder2 had 2 files; after running the above command, the merged output is stored in the folder1 directory, and the 2 files were merged into 1 file, as you can see below.

#hdfs dfs -ls /user/yashu/folder1/
Found 2 items 
-rw-r--r--   3 hdfs hdfs          0 2017-10-09 16:00 /user/yashu/folder1/_SUCCESS 
-rw-r--r--   3 hdfs hdfs        174 2017-10-09 16:00 /user/yashu/folder1/part-00000