Merge and Rename files in HDFS - Pig?
Labels:
- Apache Hadoop
- Apache Pig
Created 06-26-2016 04:29 PM
Hi experts,
I used Apache Pig to add a new column to my 3 text files stored on HDFS. The three text files were:
- 2016-06-25.txt
- 2016-06-24.txt
- 2016-06-23.txt
However, after I execute my Pig code I have 7 files in HDFS (because of MapReduce):
- part-m-0000
- part-m-0001
- part-m-0002
- part-m-0003
- ...
- part-m-0006
How can I obtain only 3 files with their original names? Basically I want to add the new column but still keep the same files with the same names...
My code is:
Src = LOAD '/data/Src/' using PigStorage(' ','-tagFile');
STORE Src INTO '/data/Src/Src2' USING PigStorage(' ');
Created 06-27-2016 01:05 AM
There's currently no mechanism to force the name of MapReduce output files.
Once you've loaded all the data and added the extra column, you can split your alias into one per date, then store each one in a different directory. e.g.
SPLIT Src INTO Src23 IF date==ToDate('2016-06-23', 'yyyy-MM-dd'), Src24 IF date==ToDate('2016-06-24', 'yyyy-MM-dd'), Src25 IF date==ToDate('2016-06-25', 'yyyy-MM-dd');
STORE Src23 INTO '/data/Src/2016-06-23' using PigStorage(' ');
This way, you could merge the output files in each date directory using -getmerge (and specify the resulting file name), and then copy them back onto HDFS.
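For example, a rough sketch of that merge step from the command line (the local file name and the /data/merged destination are just placeholders):
hdfs dfs -getmerge /data/Src/2016-06-23 2016-06-23.txt
hdfs dfs -put 2016-06-23.txt /data/merged/2016-06-23.txt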
Another option is to force a reduce job to occur (yours is map only) and set PARALLEL 1. It will be a slower job, but you will get one output file. E.g.
Ordered23 = ORDER Src23 BY somecolumn PARALLEL 1;
STORE Ordered23 INTO '/data/Src/2016-06-23' using PigStorage(' ');
You would still have to rename the files outside of this process.
Created 06-27-2016 01:32 PM
Many thanks Emily. One problem, I think: my column "date" isn't identified as a date because it appears as the file name, "2016-06-23.txt". So I think it was created as a String. Can I do the Split in the same way?
Created 06-28-2016 12:21 AM
Hi @João Souza, no problem. Yes, you should still be able to use SPLIT, just with
IF (date=='2016-06-23')
comparing a string type instead of a date type.
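A minimal sketch of the full statement with string comparisons (note the literal has to match the tagged value exactly, which may include the ".txt" extension):
SPLIT Src INTO Src23 IF date=='2016-06-23', Src24 IF date=='2016-06-24', Src25 IF date=='2016-06-25';
STORE Src23 INTO '/data/Src/2016-06-23' using PigStorage(' ');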
Hope this helps!
Created 06-28-2016 04:45 PM
Emily, only one more question. My current code is attached (pig-statement.txt). It executes successfully, however my final data sets come back empty... Do you know why?
Created 06-30-2016 12:21 AM
I haven't tested it, but I believe using -tagFile will prepend the file name, which will place it at position 0 instead of 1. I.e.
GENERATE (chararray)$0 AS Filename, (chararray)$1 AS ID, etc.
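For illustration, a sketch of the projection with the file name at position 0 (the column names beyond Filename and ID are hypothetical):
-- assuming Src was loaded with PigStorage(' ', '-tagFile'), so $0 holds the file name
Src2 = FOREACH Src GENERATE (chararray)$0 AS Filename, (chararray)$1 AS ID, (chararray)$2 AS SomeValue;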
Hope this solves it!
Created 06-27-2016 01:49 AM
Hi @João Souza, it's not a good idea to base your design on file names in HDFS. You can use file names only in phase 1 of your processing flow (which you are already doing with "-tagFile"); after that, just consider your input as a "data set". Using directories, as Emily suggested, is a much better idea, and is often used to partition data for MR jobs and Hive tables.
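For instance, a typical date-partitioned layout (the Hive-style "date=" naming here is just an illustration) would look like:
/data/Src/date=2016-06-23/part-m-00000
/data/Src/date=2016-06-24/part-m-00000
/data/Src/date=2016-06-25/part-m-00000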
