Pig: fragment-replicate join

Expert Contributor

Hi Friends,

I have a question about Pig's fragment-replicate join. Could you please let me know whether my understanding below is correct?

I have two files, File A (400MB) and File B (50MB). When I join them using the keyword "replicated", the small file (File B) is loaded into memory. File A is distributed across the Hadoop cluster in 4 blocks (3 × 128MB plus one of 16MB), so during the join Pig loads only part of File A into memory at a time (one block), joins it with File B, then loads the next block, and so on.

Thanks,

Satish.
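
For reference, the block layout described in the question can be checked with a few lines of Python. This is only a sketch: it assumes the 128MB block size mentioned in the post, and `split_sizes` is a hypothetical helper, not part of Pig or Hadoop.

```python
def split_sizes(file_mb, block_mb=128):
    """Return the sizes (in MB) of the blocks a file of file_mb would occupy.

    Assumes a fixed block size; block_mb=128 is taken from the question.
    """
    full_blocks, remainder = divmod(file_mb, block_mb)
    # Every full block has the same size; the last block holds what is left over.
    return [block_mb] * full_blocks + ([remainder] if remainder else [])


blocks = split_sizes(400)
print(blocks)  # [128, 128, 128, 16]
```

So a 400MB file occupies three full 128MB blocks and a final 16MB block, and the 50MB File B fits in a single block.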

1 ACCEPTED SOLUTION

Yep, as http://pig.apache.org/docs/r0.14.0/perf.html#replicated-joins describes, you have the gist of what's happening here. The (obvious) goal is to do a map-side join instead of a more classical reduce-side join.

4 REPLIES

Expert Contributor

Hi Lester Martin,

Thanks for your reply, but I am still not clear.

Is my understanding of this correct or not? 🙂 Please let me know.

Correct. File B will be loaded into memory and used in that context for each block of File A, and the blocks of File A are processed independently of each other.
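
To make the idea concrete, here is a rough Python simulation of a map-side join. It is a sketch of the general technique, not Pig's actual implementation: the function names and sample rows are made up for illustration, and a single shared lookup table stands in for the copy of File B that each map task would hold in memory.

```python
from collections import defaultdict


def build_lookup(small_rows):
    """Hash the small relation (File B) by join key; this is the in-memory copy."""
    lookup = defaultdict(list)
    for key, value in small_rows:
        lookup[key].append(value)
    return lookup


def map_task(block_rows, lookup):
    """Join one block of the large relation (File A) against the in-memory table.

    No reduce phase is needed: each block is joined on its own.
    """
    return [
        (key, a_value, b_value)
        for key, a_value in block_rows
        for b_value in lookup.get(key, [])  # keys missing from File B yield no rows
    ]


# Hypothetical sample data: File B is small, File A arrives as separate blocks.
file_b = [(1, "x"), (2, "y")]
file_a_blocks = [
    [(1, "a1"), (3, "a3")],
    [(2, "a2"), (1, "a4")],
]

lookup = build_lookup(file_b)
joined = [row for block in file_a_blocks for row in map_task(block, lookup)]
print(joined)  # [(1, 'a1', 'x'), (2, 'a2', 'y'), (1, 'a4', 'x')]
```

Because each call to `map_task` depends only on its own block and the lookup table, the blocks could be processed in any order or in parallel.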

Expert Contributor

Thanks Lester.