Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant.

Pig fragment-replicate join

Expert Contributor

Hi Friends,

I have a question regarding Pig's fragment-replicate join. Could you please let me know whether my understanding (below) is correct?

I have two files, File A (400 MB) and File B (50 MB). When I join these two files using the keyword "replicated", the small file, i.e. File B, is loaded into memory (because it is the small one). During the join with File A (which, at 400 MB, is stored in the Hadoop cluster as 4 blocks: 3 × 128 MB plus one 16 MB block), Pig will load only part of File A (one block at a time) for the join with File B, and once that block is done it will load the next block, and so on.
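For reference, the join described above is written like this in Pig Latin (the relation and field names below are made up for illustration; what matters is that every relation listed after the first one is the one replicated into memory):

```pig
-- Hypothetical relation and field names, for illustration only.
big   = LOAD 'fileA' AS (key:chararray, val_a:chararray);  -- 400 MB, streamed
small = LOAD 'fileB' AS (key:chararray, val_b:chararray);  -- 50 MB, in memory

-- All relations after the first are replicated to each mapper's memory,
-- so the small relation must come last.
joined = JOIN big BY key, small BY key USING 'replicated';
```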

Thanks,

Satish.

1 ACCEPTED SOLUTION

avatar

Yep, as http://pig.apache.org/docs/r0.14.0/perf.html#replicated-joins explains, you have the gist of what's happening here. The goal is to do a map-side join instead of a more classical reduce-side join.




Expert Contributor

Hi Lester Martin,

Thanks for your reply, but I am still not clear.

Is my understanding correct? 🙂 Please let me know.


Correct. File B will be loaded into memory and used in that context for each block of File A, and those blocks are processed independently of each other.
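To make that concrete, here is a minimal sketch in plain Python (with hypothetical data, not Pig's actual implementation) of what each map task effectively does in a fragment-replicate join: the small relation is materialized as an in-memory hash table, and the records of the large relation's block are streamed past it one at a time, so the large side never needs to fit in memory.

```python
from collections import defaultdict

# File B: small enough to replicate into every mapper's memory.
small_relation = [
    ("k1", "b1"),
    ("k2", "b2"),
]

# One HDFS block of File A, streamed record by record.
large_block = [
    ("k1", "a1"),
    ("k2", "a2"),
    ("k3", "a3"),  # no match in File B, so an inner join drops it
]

def map_side_join(block, small):
    # Build the replicated hash table once per map task.
    table = defaultdict(list)
    for key, value in small:
        table[key].append(value)
    # Stream the large side; emit a joined tuple for every key match.
    joined = []
    for key, a_value in block:
        for b_value in table.get(key, []):
            joined.append((key, a_value, b_value))
    return joined

print(map_side_join(large_block, small_relation))
```

Each map task runs this same routine against its own block, which is why the blocks of File A can be processed independently and no reduce phase is needed.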

avatar
Expert Contributor

Thanks Lester.