Apache Pig - Load 80 files into another directory

Hi experts,

I'm trying to do some simple data transformations on my text files using Apache Pig. I have 80 text files in HDFS and I want to add a new column based on the file name. I tested the code on a single text file and it works fine, but when I run it against all the files it doesn't do the job (it stays at 0% for a long time). Here is my code:

A = LOAD '/user/data' using PigStorage(' ','-tagFile');
STORE A INTO '/user/data/Data_Transformation/SourceFiles' USING PigStorage(' ');

In your opinion, is Pig the best way to do this? Thanks!!

1 ACCEPTED SOLUTION

Re: Apache Pig - Load 80 files into another directory

I agree with the notes in the comments between you and @Benjamin Leonhardi that the gotcha is the zip files. http://stackoverflow.com/questions/17200731/pig-tagsource-doesnt-work-with-multiple-files suggests setting pig.splitCombination to false, which helps if the problem is that multiple files are being processed in a single mapper.
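For reference, a minimal sketch of what that setting might look like at the top of a script. The input path is the one from my test below; the output path is made up for illustration, and I have not confirmed that this resolves the original 80-file case.

```pig
-- Turn off split combination so Pig does not merge several small
-- input files into a single map task.
SET pig.splitCombination 'false';

-- -tagFile prepends the source file name as the first field of each tuple.
a = LOAD '/user/maria_dev/multFiles' USING PigStorage(' ', '-tagFile');

-- Hypothetical output directory; it must not already exist.
STORE a INTO '/user/maria_dev/multFiles_tagged' USING PigStorage(' ');
```

With combination turned off you would expect roughly one mapper per input file, which is easy to check in the YARN application that gets kicked off.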

That said, I did a simple test on the 2.4 Sandbox having three files (named file1.txt, file2.txt and file3.txt) with the following contents.

a,b,c
a,b,c
a,b,c

I then ran the following simple script (tried it with MR as well as Tez as the execution engine).

a = LOAD '/user/maria_dev/multFiles' using PigStorage(' ','-tagFile');
DUMP a;

And I got the following (expected) output.

(file1.txt,a,b,c)
(file1.txt,a,b,c)
(file1.txt,a,b,c)
(file2.txt,a,b,c)
(file2.txt,a,b,c)
(file2.txt,a,b,c)
(file3.txt,a,b,c)
(file3.txt,a,b,c)
(file3.txt,a,b,c)
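If the goal is a new column derived from the file name rather than the raw name itself, something along these lines should work. This is only a sketch: the aliases filename, line and source are names I picked, and the REPLACE pattern assumes every file ends in .txt.

```pig
a = LOAD '/user/maria_dev/multFiles' USING PigStorage(' ', '-tagFile');

-- No schema was declared on the LOAD, so fields are addressed by position
-- and cast explicitly. $0 is the file name added by -tagFile.
b = FOREACH a GENERATE (chararray)$0 AS filename, (chararray)$1 AS line;

-- Derive a new column by stripping the .txt extension from the file name.
c = FOREACH b GENERATE REPLACE(filename, '[.]txt$', '') AS source, line;

DUMP c;
```

Note that the test files have no spaces in them, so with a space delimiter each whole line lands in $1.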

Good luck!


REPLIES

Re: Apache Pig - Load 80 files into another directory

What happens in the YARN ResourceManager? Do you see 80 mappers in the YARN application that is kicked off? Have a look at the logs of one of them to see what is going on.

Re: Apache Pig - Load 80 files into another directory

No, only one :( I have now put the files into 2 zipped files. I don't know if I will get better performance by doing this...

Re: Apache Pig - Load 80 files into another directory

If the files are small, Pig will group them together. You could disable that if you wanted to. Now, if you zip them, each zipped file will get one mapper, since compressed files cannot be split. (I hope you mean GZ, not zip; the latter will not work.)

Anyway, if Pig groups them into one map task, it sounds like they are not too big, so the question is why they take so long. I would still look into the ResourceManager logs and see what is going on. You can see how many bytes go in and out of each mapper and check the logs for what is happening.

Re: Apache Pig - Load 80 files into another directory

Thanks Benjamin for your support :) When you refer to the Resource Manager, do you mean the Job Browser, to see the logs?
