Pig - Load data from two different path

avatar
Expert Contributor

Hi Friends,

I have a question about a Pig script. I have to load data from two different HDFS paths into a single Pig relation.

Ex: one file is /data/input1.csv and the other is /inputdata/input1 or input2.csv.

Is it possible to load these two files into a single Pig relation?

Thanks,

Satish.

1 ACCEPTED SOLUTION

avatar
Guru

If these two sources have the same schema, it is a simple matter of using the UNION operator in these three steps:

Source_1 = LOAD '/data/input1.csv' USING PigStorage(',') ...
Source_2 = LOAD '/data/input2.csv' USING PigStorage(',') ...
Source = UNION Source_1, Source_2;
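
A fuller sketch of the same three steps, using the two locations from the question. The two-field schema (`id`, `name`) is a hypothetical placeholder, since the thread never states the real columns:

```pig
-- Hypothetical schema (id, name); replace with the real column list.
Source_1 = LOAD '/data/input1.csv' USING PigStorage(',') AS (id:int, name:chararray);
Source_2 = LOAD '/inputdata/input2.csv' USING PigStorage(',') AS (id:int, name:chararray);

-- UNION merges the two relations; it does not preserve tuple order
-- and does not remove duplicate tuples.
Source = UNION Source_1, Source_2;
DUMP Source;
```

Declaring the same schema on both LOAD statements keeps the fields of the combined relation addressable by name.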

See these references for elaboration:


7 REPLIES

avatar
Expert Contributor

Thanks Greg, but is there any way to load both files from different paths into a single relation using LOAD alone?

avatar
Guru

@Satish Sarapuri Yes, you can GLOB the filename pattern. This will work:

Source = LOAD '/data/input{1,2}.csv' USING PigStorage(',')...

You can use other GLOB patterns. See https://books.google.com/books?id=Nff49D7vnJcC&pg=PA60&lpg=PA60&dq=hdfs+glob&source=bl&ots=IjkvXt9zU...

avatar
Expert Contributor

@Greg Keys, the two source files are in different paths.

avatar
Guru

@Satish Sarapuri You can use globs anywhere in the path (not just the filename). There are quite a few glob operators (similar to Linux), as shown in the link above, so if the paths have enough in common you should be able to use globs for the parts that differ. If none of that works, you can still use a glob with full paths:

Source = LOAD '/{path1,path2}' USING PigStorage(',')...
where path1 and path2 can be any file path.
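
Applied to the two locations from the original question, the full-path glob might look like the sketch below. The schema (`id`, `name`) is a hypothetical placeholder, and the example assumes the second file is /inputdata/input2.csv:

```pig
-- The brace glob lists two complete paths relative to the root, so a single
-- LOAD reads /data/input1.csv and /inputdata/input2.csv into one relation.
-- Hypothetical schema (id, name); replace with the real column list.
Source = LOAD '/{data/input1.csv,inputdata/input2.csv}' USING PigStorage(',') AS (id:int, name:chararray);
DUMP Source;
```

This should yield the same rows as loading each file separately and combining them with UNION, provided both files share the same layout.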

avatar
New Contributor

Hi, I am new to Big Data. Is this code equivalent to UNION? That is, @Greg Keys, you wrote two versions of the code. Do they work the same way? Thank you for answering...

avatar
Expert Contributor

@Greg Keys

Both source files are in different paths.