Created 05-23-2017 10:52 PM
Hi Friends,
I have a question regarding Pig's fragment-replicate join. Could you please let me know whether my understanding (below) is correct or not?
I have two files: File A (400 MB) and File B (50 MB). When I join these two files using the keyword "replicated", the small file, i.e. File B, is loaded into memory (because it's the small file). During the join with File A (which is 400 MB and distributed across the Hadoop cluster in 4 blocks: 3 × 128 MB plus one 16 MB block), Pig processes File A one block at a time: each map task reads its own block, joins it against the in-memory copy of File B, and then the next block is handled the same way, and so on.
Thanks,
Satish.
Created 05-24-2017 03:18 PM
Yep, as http://pig.apache.org/docs/r0.14.0/perf.html#replicated-joins details, you have the gist of what's happening here. The (obvious) goal is to do a map-side join instead of a more classical reduce-side join.
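For reference, a minimal Pig Latin sketch of such a join (the paths and schemas here are made up for illustration; adjust them to your data):

```pig
-- Hypothetical paths/schemas, purely for illustration.
big   = LOAD 'fileA' AS (id:int, val:chararray);   -- the ~400 MB relation
small = LOAD 'fileB' AS (id:int, name:chararray);  -- the ~50 MB relation

-- The replicated (small) relation must be listed last; it is shipped to
-- every map task and held fully in memory, while the large relation is
-- read block by block by the map tasks.
joined = JOIN big BY id, small BY id USING 'replicated';
```

Note that, per the Pig docs linked above, the replicated relation really does have to fit in memory; the fragment-replicate join is only a win when one side is genuinely small.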
Created 05-24-2017 03:36 PM
Hi Lester Martin,
Thanks for your reply, but I am still not clear.
Is my understanding of this correct or not? 🙂 Please let me know.
Created 05-24-2017 06:25 PM
Correct. File B will be loaded into memory and used in that context for each block of File A, with each block processed independently of the others.
Created 05-24-2017 11:02 PM
Thanks Lester.