Best way to merge multi-part files into a single file?
Labels: Apache Hadoop
Created 06-29-2016 07:37 PM
We have a huge data set in HDFS split across multiple files, and we want to merge them all into a single file to hand to our customers. We tried the hdfs getmerge command but ran into OOM issues on the edge node. Is there any other way to achieve this merge?
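For reference, the getmerge invocation mentioned above would typically look like the sketch below; the paths here are placeholders, not the actual ones from this cluster:
$ hadoop fs -getmerge /hdfs/input/dir /tmp/merged_file.txt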
Created 06-29-2016 07:43 PM
Use a hadoop-streaming job (with a single reducer) to merge the data from all part files into a single HDFS file on the cluster itself, then use hdfs dfs -get to fetch that single file to the local system.
$ hadoop jar /usr/hdp/2.3.2.0-2950/hadoop-mapreduce/hadoop-streaming-2.7.1.2.3.2.0-2950.jar \
    -Dmapred.reduce.tasks=1 \
    -input "/hdfs/input/dir" \
    -output "/hdfs/output/dir" \
    -mapper cat \
    -reducer cat
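With a single reducer, the streaming job writes one output file under the output directory (typically named part-00000, though the exact name can vary). Fetching it to the local file system, as described above, would look roughly like this sketch; the local destination path is a placeholder:
$ hdfs dfs -get /hdfs/output/dir/part-00000 /tmp/merged_file.txt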
Created 06-29-2016 07:44 PM
Thanks! Will try this.
Created 01-23-2017 10:15 AM
If you are using Spark, then use the code below:
sc.textFile("hdfs://...../part*").coalesce(1).saveAsTextFile("hdfs://...../filename")
This will merge all part files into one and save the result back to an HDFS location.
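Note that saveAsTextFile writes a directory (containing a single part file when coalesce(1) is used), not a plain file with that name. To end up with one local file you would still fetch that part file, roughly as in this sketch; the paths and the part file name are placeholders and may differ:
$ hdfs dfs -get /path/to/output_dir/part-00000 /tmp/merged_file.txt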
Created 01-24-2017 09:32 AM
Is there also an approach to combine Snappy-compressed files without decompressing/recompressing them? I have about 50 small files per hour, Snappy compressed (framed stream, 65k chunk size), that I would like to combine into a single file without recompressing (which should not be needed according to the Snappy documentation).
With the above parameters the input files are decompressed on the fly. I could of course recompress them during the reduce phase, but that would be a waste of CPU resources.
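Since the framed Snappy format allows independent streams to be concatenated (readers are expected to accept repeated stream-identifier chunks), one option may be plain byte-level concatenation: hdfs dfs -cat emits the raw, still-compressed bytes (unlike hdfs dfs -text, which decompresses), and hdfs dfs -put - reads from stdin. A rough sketch, assuming your downstream readers accept a concatenated framed stream; the paths are placeholders:
# -cat emits raw compressed bytes; -put - writes stdin to a new HDFS file
$ hdfs dfs -cat /data/in/hour=01/*.snappy | hdfs dfs -put - /data/merged/hour=01.snappy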
Created 05-25-2018 02:17 AM
This seems like a better solution to me.
