Created 06-16-2016 10:01 AM
Hi experts,
I'm trying to do some simple data transformations on my text files using Apache Pig. I have 80 text files in HDFS and I want to add a new column based on the filename. I tested the code on a single text file and it works fine. But when I run the code against all the files it doesn't do the job (it stays at 0% for a long time). Here is my code:
A = LOAD '/user/data' USING PigStorage(' ','-tagFile');
STORE A INTO '/user/data/Data_Transformation/SourceFiles' USING PigStorage(' ');
In your opinion, is Pig the best way to do this? Thanks!!
Created 07-19-2016 03:20 PM
I agree with the notes identified in the comments section between you and @Benjamin Leonhardi that the gotcha is the zip files. http://stackoverflow.com/questions/17200731/pig-tagsource-doesnt-work-with-multiple-files suggests that you can set pig.splitCombination to false to work around the case where multiple files are being processed in a single mapper.
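As a rough sketch of that suggestion, the setting can go at the top of the script. This assumes the paths from the original question, and I have not verified that it resolves the 0% stall; pig.splitCombination is simply the property named in the Stack Overflow thread.

```pig
-- Sketch only: turn off Pig's combining of small input files into one split,
-- so that each input file is handled by its own mapper.
SET pig.splitCombination false;

-- Paths below are taken from the original question.
A = LOAD '/user/data' USING PigStorage(' ', '-tagFile');
STORE A INTO '/user/data/Data_Transformation/SourceFiles' USING PigStorage(' ');
```

With 80 input files this should show up as roughly 80 map tasks in the YARN application instead of one.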
That said, I did a simple test on the 2.4 Sandbox having three files (named file1.txt, file2.txt and file3.txt) with the following contents.
a,b,c
a,b,c
a,b,c
I then ran the following simple script (tried it with MR as well as Tez as the execution engine).
a = LOAD '/user/maria_dev/multFiles' using PigStorage(' ','-tagFile');
DUMP a;
And I got the following (expected) output.
(file1.txt,a,b,c)
(file1.txt,a,b,c)
(file1.txt,a,b,c)
(file2.txt,a,b,c)
(file2.txt,a,b,c)
(file2.txt,a,b,c)
(file3.txt,a,b,c)
(file3.txt,a,b,c)
(file3.txt,a,b,c)
Good luck!
Created 06-16-2016 10:50 AM
What happens in the YARN ResourceManager? Do you see 80 mappers in the YARN application that is kicked off? Have a look at the logs of one of them to see what is going on.
Created 06-16-2016 03:17 PM
No, only one 😞 I have now put the files into 2 zipped files. I don't know whether this will give better performance...
Created 06-16-2016 10:18 PM
If the files are small, Pig will group them together. You could disable that if you wanted to. Now, if you compress them, each compressed file will get one mapper, since compressed files cannot be split (I hope you mean GZ and not zip; the latter will not work).
Anyway, if Pig groups them into one map task, it sounds like they are not too big, so the question is why they take so long. I would still look into the ResourceManager logs to see what is going on. You can see how many bytes go in and out of each mapper and check the task logs for details.
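If one mapper for everything is too few and one mapper per file is too many, a possible middle ground is to cap how much data Pig combines into a single split. A minimal sketch, assuming the pig.maxCombinedSplitSize property and an arbitrary 64 MB cap (the value is illustrative only, not tuned for this data):

```pig
-- Assumption: limit each combined split to about 64 MB (64 * 1024 * 1024 bytes).
-- Small files are grouped together until roughly this size is reached.
SET pig.maxCombinedSplitSize 67108864;

-- Paths below are taken from the original question.
A = LOAD '/user/data' USING PigStorage(' ', '-tagFile');
STORE A INTO '/user/data/Data_Transformation/SourceFiles' USING PigStorage(' ');
```

The number of mappers shown for the job in the ResourceManager should then depend on the total input size divided by the cap, which makes it easy to check whether the setting took effect.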
Created 06-17-2016 09:47 AM
Thanks Benjamin for your support 🙂 When you mention the Resource Manager, do you mean the Job Browser, to see the logs?