Apache Pig - Load 80 files into another directory

Rising Star

Hi experts,

I'm trying to do some simple data transformations on my text files using Apache Pig. I have 80 text files in HDFS and I want to add a new column based on the filename. I tested the code on a single text file and it works fine. But when I run the code against all the files, it doesn't do the job (it stays at 0% for a long time). Here is my code:

A = LOAD '/user/data' using PigStorage(' ','-tagFile');
STORE A INTO '/user/data/Data_Transformation/SourceFiles' USING PigStorage(' ');

In your opinion, is Pig the best way to do this? Thanks!!

1 ACCEPTED SOLUTION

I agree with the notes in the comments between you and @Benjamin Leonhardi that the gotcha is the zip files. http://stackoverflow.com/questions/17200731/pig-tagsource-doesnt-work-with-multiple-files suggests setting pig.splitCombination to false to work around the case where multiple files are being processed in a single mapper.
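As a rough sketch of that suggestion applied to the script from the question (untested here; the paths are the ones from the question, and whether it resolves the stall will depend on your files and cluster):

```pig
-- Sketch only: assumes the 80 text files sit directly under /user/data
-- and that the output directory does not exist yet.

-- Turn off split combination so small input files are not merged
-- into a single mapper.
SET pig.splitCombination false;

-- -tagFile prepends the source file name as the first field of each record.
A = LOAD '/user/data' USING PigStorage(' ', '-tagFile');

STORE A INTO '/user/data/Data_Transformation/SourceFiles' USING PigStorage(' ');
```

With split combination off, you should expect roughly one mapper per input file, which is also an easy thing to check in the job's application page.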

That said, I did a simple test on the 2.4 Sandbox with three files (named file1.txt, file2.txt and file3.txt), each with the following contents.

a,b,c
a,b,c
a,b,c

I then ran the following simple script (tried it with MR as well as Tez as the execution engine).

a = LOAD '/user/maria_dev/multFiles' using PigStorage(' ','-tagFile');
DUMP a;

And I got the following (expected) output.

(file1.txt,a,b,c)
(file1.txt,a,b,c)
(file1.txt,a,b,c)
(file2.txt,a,b,c)
(file2.txt,a,b,c)
(file2.txt,a,b,c)
(file3.txt,a,b,c)
(file3.txt,a,b,c)
(file3.txt,a,b,c)

Good luck!

5 REPLIES

Master Guru

What happens in the YARN ResourceManager? Do you see 80 mappers in the YARN application that is kicked off? Have a look at the logs of one of them to see what is going on.

Rising Star

No, only one 😞 I have now put the files into 2 zipped files. I don't know if I will get better performance by doing this...

Master Guru

If the files are small, Pig will group them together. You could disable that if you wanted to. Now, if you zip them, each zipped file will get one mapper, since compressed files cannot be split. (I hope you mean GZ, not zip; the latter will not work.)

Anyway, if Pig groups them into one map task, it sounds like they are not too big, so the question is why they take so long. I would still look into the ResourceManager logs and see what is going on. You can see how many bytes go in and out of each mapper, and check the logs for details.

Rising Star

Thanks Benjamin for your support 🙂 When you speak of the Resource Manager, do you mean the Job Browser, to see the logs?
