Merge and Rename files in HDFS - Pig?

Rising Star

Hi experts,

I used Apache Pig to add a new column to my three text files stored in HDFS. The three text files were:

  • 2016-06-25.txt
  • 2016-06-24.txt
  • 2016-06-23.txt

However, after running my Pig code I have 7 files in HDFS (because of MapReduce):

  • part-m-0000
  • part-m-0001
  • part-m-0002
  • part-m-0003
  • ...
  • part-m-0006

How can I obtain only 3 files with their original names? Basically, I want to add the new column but keep the same files with the same names.

My code is:

  • Src = LOAD '/data/Src/' using PigStorage(' ','-tagFile');
  • STORE Src INTO '/data/Src/Src2' USING PigStorage(' ');
6 REPLIES

Rising Star

Many thanks, Emily. One problem, I think: my "date" column isn't identified as a date because it appears like the file name, "2016-06-23.txt". So I think it was created as a string. Can I do the split in the same way?

Rising Star

Hi @João Souza, no problem. Yes you should still be able to use split, just with

IF (date=='2016-06-23')

comparing string type instead of date type.

Hope this helps!
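
For illustration, a string-based split along these lines could look like the sketch below. It is untested and rests on a few assumptions: the tagged file name keeps its ".txt" extension (as described in the question above), the relation names d23, d24 and d25 are made up, and the output paths under /data/out are hypothetical.

```pig
-- Load with -tagFile so the source file name is prepended as field $0.
Src = LOAD '/data/Src/' USING PigStorage(' ', '-tagFile');

-- The tagged file name is text, not a date, so cast it to chararray
-- and compare it against string literals.
SPLIT Src INTO
    d23 IF (chararray)$0 == '2016-06-23.txt',
    d24 IF (chararray)$0 == '2016-06-24.txt',
    d25 IF (chararray)$0 == '2016-06-25.txt';

-- Hypothetical output locations, one directory per original file.
STORE d23 INTO '/data/out/2016-06-23' USING PigStorage(' ');
STORE d24 INTO '/data/out/2016-06-24' USING PigStorage(' ');
STORE d25 INTO '/data/out/2016-06-25' USING PigStorage(' ');
```

If the comparison literal does not match the stored value exactly (for example, if the extension is missing on one side), every branch of the split will come out empty.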

Rising Star

Emily, only one more question. My current code is attached (pig-statement.txt). It executes successfully, but my final data sets come back empty. Do you know why?

Rising Star

I haven't tested it, but I believe using -tagFile will prepend the file name, which places it at position 0 instead of 1, i.e.:

GENERATE
(chararray)$0 AS Filename, 
(chararray)$1 AS ID, etc.

Hope this solves it!
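
Put together with the LOAD statement from the question, the position shift might be sketched as follows. This is also untested; the attached script is not shown in the thread, so the relation name Named is hypothetical and only the two fields mentioned above are projected.

```pig
-- -tagFile prepends the source file name, so it lands in $0
-- and every original column moves one position to the right.
Src = LOAD '/data/Src/' USING PigStorage(' ', '-tagFile');

Named = FOREACH Src GENERATE
    (chararray)$0 AS Filename,  -- tagged file name
    (chararray)$1 AS ID;        -- first column of the original data
```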

Master Guru

Hi @João Souza, it's not a good idea to base your design on file names in HDFS. You can use file names only in phase 1 of your processing flow (which you are already doing with "-tagFile"); after that, just treat your input as a "data set". Using directories, as Emily suggested, is a much better idea, and is often used to partition data for MR jobs and Hive tables.
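
One possible way to get a directory per source file in a single STORE is Piggybank's MultiStorage, sketched below. This is not necessarily what was suggested earlier in the thread, and it is untested here: the jar location and the /data/out path are assumptions, and the exact names of the files written inside each subdirectory depend on the Pig version.

```pig
-- Hypothetical jar location; adjust to your installation.
REGISTER /usr/lib/pig/piggybank.jar;

Src = LOAD '/data/Src/' USING PigStorage(' ', '-tagFile');

-- Partition the output on field 0 (the tagged file name):
-- arguments are parent path, split field index, compression, field delimiter.
STORE Src INTO '/data/out'
    USING org.apache.pig.piggybank.storage.MultiStorage('/data/out', '0', 'none', ' ');
```

Each distinct value of field 0 should then end up in its own subdirectory under /data/out, which downstream jobs can read as a partition rather than relying on individual file names.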