Best way to merge multi-part files into a single file?

We have a huge data set in HDFS spread across multiple files and want to merge them all into a single file for our customers. We tried the hdfs dfs -getmerge command but ran into OOM issues on the edge node. Are there other ways to achieve this merge?

5 REPLIES

Thanks! Will try this.

If you are using Spark, use the code below:

sc.textFile("hdfs://...../part*").coalesce(1).saveAsTextFile("hdfs://...../filename")

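This merges all part files into a single partition and saves the result back to HDFS. Note that saveAsTextFile creates a directory at the given path and writes the data into a single part file (part-00000) inside it, and that coalesce(1) routes all of the data through one task.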
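For anyone who wants the one-liner as a reusable step, here is a minimal PySpark-style sketch. The function name merge_part_files and its parameters are hypothetical. It assumes you already have a SparkContext (sc) and that you pass in your own input glob and output path.

```python
def merge_part_files(sc, input_glob, output_dir):
    """Read every file matching input_glob and write the lines back as one part file.

    sc is an existing SparkContext; input_glob and output_dir are
    caller-supplied paths (hypothetical parameters, not from the thread).
    """
    lines = sc.textFile(input_glob)   # one RDD over all matching part files
    single = lines.coalesce(1)        # collapse to a single partition (no shuffle by default)
    # saveAsTextFile creates output_dir and writes the data as a part file
    # inside it; by default it fails if output_dir already exists.
    single.saveAsTextFile(output_dir)
```

Since everything passes through one task, this is likely to be slow for very large inputs, so it may be worth testing on a subset first.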

Is there also a way to combine Snappy-compressed files without decompressing and recompressing them? I have about 50 small files per hour, Snappy-compressed (framed stream, 65k chunk size), that I would like to combine into a single file without recompressing, which should not be needed according to the Snappy documentation.

With the above parameters, the input files are decompressed on the fly. I could of course recompress them during the reduce phase, but that would waste CPU resources.
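As I read the Snappy framing format description, a decoder ignores repeated stream identifier chunks, so framed streams can be appended byte for byte. Below is a minimal sketch of that idea, not a tested production tool. The function name concat_framed_snappy is hypothetical, it works on local files only (reading from HDFS would need a separate client), and it assumes the inputs really use the framing format rather than Hadoop's block-based Snappy codec layout.

```python
import shutil

# Stream identifier chunk that opens every Snappy framed stream.
STREAM_ID = b"\xff\x06\x00\x00sNaPpY"


def concat_framed_snappy(input_paths, output_path, buffer_size=1024 * 1024):
    """Append framed Snappy files byte for byte, without decompressing them."""
    with open(output_path, "wb") as out:
        for path in input_paths:
            with open(path, "rb") as src:
                # Refuse inputs that do not start with the framing header.
                header = src.read(len(STREAM_ID))
                if header != STREAM_ID:
                    raise ValueError(f"{path} is not a framed Snappy stream")
                out.write(header)
                # Copy the rest in fixed-size chunks so memory use stays bounded.
                shutil.copyfileobj(src, out, buffer_size)
```

Because the copy is chunked, memory use does not grow with file size, and no CPU is spent on compression or decompression.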
