Merge CSV files into one file based on a timestamp column
- Labels: Apache Hadoop, Apache Spark
Created ‎03-03-2017 09:21 AM
I have 4 CSV files that I want to join and merge into a single file based on a timestamp column, using Spark or Hadoop. Any help would be appreciated.
Created ‎03-03-2017 02:51 PM
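Sol 1: Reduce-side join
Create a separate mapper for each of the 4 CSV files. Each mapper emits the timestamp as the key; the value is the remaining fields plus a tag field identifying which file the record came from.
On the reduce side, all records with the same timestamp arrive together, so the reducer can use the tags to merge them into one output record.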
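To make the data flow of Sol 1 concrete, here is a minimal sketch in plain Python. It only simulates the map, shuffle, and reduce phases in memory; it is not Hadoop API code. The column names, the sample data, the file tags `a` and `b`, and the helper names are all made up for illustration, and it assumes each timestamp appears at most once per file.

```python
import csv
import io
from collections import defaultdict


def map_records(tag, csv_text, key_column="timestamp"):
    """Mapper: emit (timestamp, (tag, remaining fields)) for every row."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = row.pop(key_column)
        yield key, (tag, row)


def shuffle(pairs):
    """Group mapper output by key, as the framework does before the reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_join(key, tagged_values):
    """Reducer: merge the fields of all records that share one timestamp."""
    merged = {"timestamp": key}
    for tag, fields in tagged_values:
        for name, value in fields.items():
            # Prefix with the tag so equal column names do not collide.
            merged[f"{tag}.{name}"] = value
    return merged


# Hypothetical inputs standing in for two of the CSV files.
files = {
    "a": "timestamp,temp\n1,20\n2,21\n",
    "b": "timestamp,humidity\n1,55\n2,60\n",
}

pairs = [pair for tag, text in files.items() for pair in map_records(tag, text)]
groups = shuffle(pairs)
joined = [reduce_join(key, groups[key]) for key in sorted(groups)]
```

In a real job the shuffle is done by the framework, and the reducer would write each merged record to the output file instead of collecting it in a list.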
Sol 2: Map-side join (if 3 of the files are small)
Add the 3 small CSV files to the distributed cache and merge them with the large CSV file in the mapper.
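The idea behind Sol 2 can be sketched the same way in plain Python. The dictionaries below stand in for the small files loaded from the distributed cache; this is an in-memory illustration, not Hadoop API code. The column names, sample data, and helper names are hypothetical, and the sketch keeps every row of the large file even when a small file has no match (a left outer join), which may or may not be what you want.

```python
import csv
import io


def load_lookup(csv_text, key_column="timestamp"):
    """Load one small file into a dict keyed by timestamp (the cached side)."""
    table = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = row.pop(key_column)
        table[key] = row
    return table


def map_join(large_csv_text, lookups, key_column="timestamp"):
    """Mapper: stream the large file and enrich each row from the cached tables."""
    for row in csv.DictReader(io.StringIO(large_csv_text)):
        merged = dict(row)
        for table in lookups:
            # Rows with no match in a small file are kept without its columns.
            merged.update(table.get(row[key_column], {}))
        yield merged


# Hypothetical inputs: two small files and one large file.
small_files = [
    "timestamp,humidity\n1,55\n2,60\n",
    "timestamp,pressure\n1,1012\n",
]
large_file = "timestamp,temp\n1,20\n2,21\n3,19\n"

lookups = [load_lookup(text) for text in small_files]
joined = list(map_join(large_file, lookups))
```

Because the join happens entirely in the mapper, no shuffle or reducer is needed, which is why this approach only works when the small files fit in memory on every node.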
Created ‎03-03-2017 04:22 PM
I see you asked a similar question at the link below.
I answered it there; the answer covers combining any files, whether CSV or TXT:
https://community.hortonworks.com/questions/85230/erge-csv-files-in-one-file.html#answer-85245
