Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Specify node distribution to improve two big table joins?

Specify node distribution to improve two big table joins?

New Contributor

Im trying to improve query performance on joining two big table data.

 

E.g. Orders and Customer Site Visit Table.

 

Is it possible to specify that customer A data should always go to node 1 and Customer  Z on node 100 for both entry to the table and prevent broadcast of data when joining the tables?

3 REPLIES 3

Re: Specify node distribution to improve two big table joins?

Explorer

You might try the "[SHUFFLE]" hint, documented at the URL below, which uses a partitioned hash join instead of a broadcast join.

 

https://www.cloudera.com/documentation/enterprise/latest/topics/impala_hints.html

Re: Specify node distribution to improve two big table joins?

New Contributor

There will be a lot of shuffling which will reduce performance.

 

Hive seems to be able to do bucket join if you bucketed both tables by the same key as I suggest.

Wonder if Impala can do this ?

 http://stackoverflow.com/questions/20199077/hive-efficient-join-of-two-tables

 

Does Hive Index impact Impala performance?

Re: Specify node distribution to improve two big table joins?

New Contributor

The link below essentially describer what I am trying to do

 

https://www.mysql.com/products/cluster/scalability.html