Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant.

Pig fragment-replicate join

Expert Contributor

Hi Friends,

I have a question regarding Pig's fragment-replicate join. Could you please let me know whether my understanding (below) is correct?

I have two files, File A (400 MB) and File B (50 MB). When I join these two files using the keyword "replicated", the small file, i.e. File B, is loaded into memory (because it is the small one). During the join with File A (which, at 400 MB, is stored in the Hadoop cluster as 4 blocks: 3 × 128 MB plus one 16 MB block), Pig will load only part of File A (one block at a time) for the join with File B, and once that block is done it will load the next block, and so on.
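For reference, the join described above is written like this in Pig Latin (the relation and field names below are made up for illustration; what matters is that every relation listed after the first one is the one replicated into memory):

```pig
-- Hypothetical relation and field names, for illustration only.
big   = LOAD 'fileA' AS (key:chararray, val_a:chararray);  -- 400 MB, streamed
small = LOAD 'fileB' AS (key:chararray, val_b:chararray);  -- 50 MB, in memory

-- All relations after the first are replicated to each mapper's memory,
-- so the small relation must come last.
joined = JOIN big BY key, small BY key USING 'replicated';
```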

Thanks,

Satish.

1 ACCEPTED SOLUTION

avatar

Yep, as http://pig.apache.org/docs/r0.14.0/perf.html#replicated-joins explains, you have the gist of what's happening here. The goal is to do a map-side join instead of a more classical reduce-side join.




Expert Contributor

Hi Lester Martin,

Thanks for your reply, but I am still not clear.

Is my understanding correct? 🙂 Please let me know.


Correct. File B will be loaded into memory and used in that context for each block of File A, and those blocks are processed independently of each other.
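To make that concrete, here is a minimal sketch in plain Python (with hypothetical data, not Pig's actual implementation) of what each map task effectively does in a fragment-replicate join: the small relation is materialized as an in-memory hash table, and the records of the large relation's block are streamed past it one at a time, so the large side never needs to fit in memory.

```python
from collections import defaultdict

# File B: small enough to replicate into every mapper's memory.
small_relation = [
    ("k1", "b1"),
    ("k2", "b2"),
]

# One HDFS block of File A, streamed record by record.
large_block = [
    ("k1", "a1"),
    ("k2", "a2"),
    ("k3", "a3"),  # no match in File B, so an inner join drops it
]

def map_side_join(block, small):
    # Build the replicated hash table once per map task.
    table = defaultdict(list)
    for key, value in small:
        table[key].append(value)
    # Stream the large side; emit a joined tuple for every key match.
    joined = []
    for key, a_value in block:
        for b_value in table.get(key, []):
            joined.append((key, a_value, b_value))
    return joined

print(map_side_join(large_block, small_relation))
```

Each map task runs this same routine against its own block, which is why the blocks of File A can be processed independently and no reduce phase is needed.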

avatar
Expert Contributor

Thanks Lester.