
How does the replicated join give better performance compared to other join types? Can we call it an efficient join? In which scenarios should we use it?

Expert Contributor
1 ACCEPTED SOLUTION

Master Guru

I assume you mean a map-side join in Hive, i.e. the small dataset is replicated to all map tasks and the join is done on the map side, versus the standard shuffle (distributed) join, which redistributes both tables across the cluster.

It's actually easy.

Assume you have one table of 1 TB and one table of 1 MB, and assume as well that we have 50 nodes. (I think Hadoop copies distributed-cache datasets only once per node, not once per task, but I might be wrong about that.)

So, map-side join:

You have to copy the small table to every node: 50 × 1 MB = 50 MB of data is copied across the network.

Shuffle join:

Both tables need to be copied once, i.e. 1 TB + 1 MB ≈ 1.000001 TB will be copied over the network.

The map-side join is much better than the shuffle join in this case.

Now assume you have one table of 1 TB and one table of 500 GB:

Map-side join:

The 500 GB table needs to be copied to all 50 nodes, i.e. 50 × 500 GB = 25 TB of data is copied over the network.

Shuffle join:

1.5 TB of data needs to be copied over the network.

So in this case the map-side join is much worse than the shuffle join. The cost-based optimizer (CBO) tries to figure out which one is better, which gets harder once WHERE conditions and the like come into play.

The same is true for the corresponding join types in Pig, Spark, etc.

You can do the math yourself 🙂
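For example, in Pig (where this strategy is the replicated join the question asks about) the two variants might look like this minimal sketch; the relation and file names are hypothetical:

-- hypothetical large fact input and small dimension input
orders = LOAD 'orders_data' AS (order_id, cust_id, amount);
custs = LOAD 'customer_data' AS (cust_id, name);

-- map-side (replicated) join: 'custs' is shipped to every map task,
-- so it must be small enough to fit into memory
J1 = JOIN orders BY cust_id, custs BY cust_id USING 'replicated';

-- default shuffle join: both inputs are partitioned on the join key
-- and redistributed over the network to the reducers
J2 = JOIN orders BY cust_id, custs BY cust_id;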


7 REPLIES

Expert Contributor

Please post examples if you have any.


As http://pig.apache.org/docs/r0.15.0/perf.html#replicated-joins indicates, replicated joins exist to allow a much more efficient map-side join instead of the standard reduce-side join. Simply put, a replicated join makes sense when the smaller relation you are joining on fits into memory. In Hive this happens automatically when possible, but in Pig you have to request it explicitly. The explain plan should show you the difference, and remember that you are responsible for testing that this will work: the script will fail if the "smaller" relation does not fit into memory.

NOTE: Pig's replicated joins only work as inner or left outer joins, because when a given map task sees a record in the replicated (small) input that does not match any record in the fragment (large) input, it doesn't know whether a matching record might exist in another map task, so it can't determine whether to emit a record in the current task. The replicated (small) input always needs to come last – on the right – in the join order, which is why this technique can't be used for right or full outer joins.
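A minimal sketch of the supported left outer case (using the same hypothetical aliases as the documentation example quoted below); note that the replicated (small) input still comes last:

big = LOAD 'big_data' AS (b1,b2,b3);
tiny = LOAD 'tiny_data' AS (t1,t2,t3);
-- left outer is fine: unmatched records from 'big' are kept
C = JOIN big BY b1 LEFT OUTER, tiny BY t1 USING 'replicated';
-- a RIGHT or FULL outer join USING 'replicated' would fail,
-- for the reason described above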


Expert Contributor

@Benjamin Leonhardi, thanks for your answer. So, in the end, which one is better depends only on the size of the small dataset. Is that right?

Master Guru

Basically yes. But it is more complex, since it is often hard to predict the size of a join's inputs. Normally you have WHERE conditions, multiple hierarchy joins, and so on, so it's hard to say how big a dataset will be at that point in the query, depending on the filter conditions.

That's where statistics come in: the CBO matters, and using the ANALYZE statement to gather statistics matters.
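In Hive, that gathering step is the ANALYZE statement; a minimal sketch, assuming a hypothetical table named orders:

-- collect table-level and column-level statistics for the optimizer
ANALYZE TABLE orders COMPUTE STATISTICS;
ANALYZE TABLE orders COMPUTE STATISTICS FOR COLUMNS;
-- make sure the cost-based optimizer can actually use them
SET hive.cbo.enable=true;
SET hive.stats.fetch.column.stats=true;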

Expert Contributor

Thanks @Benjamin

Master Mentor

From the Apache Pig documentation: http://pig.apache.org/docs/r0.15.0/perf.html#replicated-joins

Replicated Joins

Fragment replicate join is a special type of join that works well if one or more relations are small enough to fit into main memory. In such cases, Pig can perform a very efficient join because all of the Hadoop work is done on the map side. In this type of join the large relation is followed by one or more small relations. The small relations must be small enough to fit into main memory; if they don't, the process fails and an error is generated.

Perform a replicated join with the USING clause (see inner joins and outer joins). In this example, a large relation is joined with two smaller relations. Note that the large relation comes first, followed by the smaller relations; and all small relations together must fit into main memory, otherwise an error is generated.

big = LOAD 'big_data' AS (b1,b2,b3);
tiny = LOAD 'tiny_data' AS (t1,t2,t3);
mini = LOAD 'mini_data' AS (m1,m2,m3);
C = JOIN big BY b1, tiny BY t1, mini BY m1 USING 'replicated';