I'm trying to improve query performance when joining two large tables.
E.g. an Orders table and a Customer Site Visit table.
Is it possible to specify that customer A's data should always go to node 1 and customer Z's data to node 100, for both tables, so that data is not broadcast when the tables are joined?
You might try the "[SHUFFLE]" hint, documented at the URL below, which uses a partitioned hash join instead of a broadcast join.
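To illustrate what a partitioned hash join does, here is a toy Python sketch. It is not Impala's implementation: the sample rows, the `partition_of` helper, and the node count are all made up for the example. Rows from both tables are routed by a hash of the join key, so matching customers land on the same node and neither table has to be copied to every node.

```python
from collections import defaultdict


def partition_of(key, num_nodes):
    # Deterministic stand-in for the engine's hash function (an assumption).
    return sum(key.encode("utf-8")) % num_nodes


def shuffle_join(orders, visits, num_nodes):
    """Toy partitioned hash join: route rows of both inputs by hash(join key)."""
    order_parts = defaultdict(list)
    visit_parts = defaultdict(list)
    for row in orders:
        order_parts[partition_of(row[0], num_nodes)].append(row)
    for row in visits:
        visit_parts[partition_of(row[0], num_nodes)].append(row)

    joined = []
    for node in range(num_nodes):
        # On each node, build a hash table on one side and probe with the other.
        build = defaultdict(list)
        for customer, page in visit_parts[node]:
            build[customer].append(page)
        for customer, order_id in order_parts[node]:
            for page in build.get(customer, []):
                joined.append((customer, order_id, page))
    return joined


# Hypothetical sample data: (customer, order_id) and (customer, page).
orders = [("A", 1), ("Z", 2), ("A", 3)]
visits = [("A", "/home"), ("Z", "/cart"), ("B", "/faq")]
result = sorted(shuffle_join(orders, visits, num_nodes=4))
```

Each row is sent to exactly one node, whereas a broadcast join would send a full copy of one table to every node.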
That will still involve a lot of shuffling, which will reduce performance.
Hive seems to be able to do a bucket join if both tables are bucketed by the same key, as I suggested.
I wonder whether Impala can do this?
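To make the bucketing idea concrete, here is a toy Python sketch. It says nothing about whether Impala supports this. The bucket count, the `bucket_of` helper, and the sample rows are assumptions for illustration. Both tables are bucketed on customer id at write time with the same bucket count, so the join only pairs bucket i with bucket i and moves no rows at query time.

```python
NUM_BUCKETS = 4  # hypothetical; both tables must use the same bucket count


def bucket_of(customer_id):
    # Deterministic stand-in for the engine's bucketing hash (an assumption).
    return sum(customer_id.encode("utf-8")) % NUM_BUCKETS


def write_bucketed(rows):
    """Place each row in a bucket at write time, keyed on customer id."""
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for row in rows:
        buckets[bucket_of(row[0])].append(row)
    return buckets


def bucket_join(order_buckets, visit_buckets):
    """Join bucket i with bucket i only; no rows move between buckets."""
    joined = []
    for order_rows, visit_rows in zip(order_buckets, visit_buckets):
        for customer, order_id in order_rows:
            for visitor, page in visit_rows:
                if customer == visitor:
                    joined.append((customer, order_id, page))
    return joined


# Hypothetical sample data: (customer, order_id) and (customer, page).
order_buckets = write_bucketed([("A", 1), ("Z", 2)])
visit_buckets = write_bucketed([("A", "/home"), ("Z", "/cart"), ("B", "/faq")])
bucket_result = sorted(bucket_join(order_buckets, visit_buckets))
```

The difference from a shuffle join is when the data is placed: the hashing cost is paid once at load time rather than on every query.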
Does a Hive index affect Impala performance?