02-04-2016 01:24 AM
I would appreciate any help with my query which I try to optimize. The main problem seems to be in the join type impala uses. I specify inner join in the query, but the query is executed with cross join instead.
The second question would be more general. Can I create table using CREATE TABLE tbl_name AS SELECT... with partitions directly?
Thank you for any points.
Query
CREATE TABLE result AS SELECT TA.*, TB.type FROM big_table AS TA INNER JOIN small_table AS TB ON (TA.product_class = TB.produc_class AND TA.id = TB.id) OR TA.name = TB.name WHERE TA._partition_ IN (1,7,12)
Query Plan
---------------- Estimated Per-Host Requirements: Memory=2.28GB VCores=2 F00:PLAN FRAGMENT [RANDOM] WRITE TO HDFS [result, OVERWRITE=false] | partitions=1 | hosts=6 per-host-mem=100.00KB | 02:CROSS JOIN [BROADCAST] | predicates: (TA.product_class = TB.produc_class AND TA.id = TB.id) OR TA.name = TB.name | hosts=6 per-host-mem=3.87KB | tuple-ids=0,1 row-size=68B cardinality=21166283493 | |--03:EXCHANGE [BROADCAST] | hosts=3 per-host-mem=0B | tuple-ids=1 row-size=28B cardinality=143 | 00:SCAN HDFS [big_table TA, RANDOM] partitions=72/238 files=2986 size=47.11GB table stats: 4058041841 rows total column stats: all hosts=6 per-host-mem=2.25GB tuple-ids=0 row-size=40B cardinality=1357944665 F01:PLAN FRAGMENT [RANDOM] DATASTREAM SINK [FRAGMENT=F00, EXCHANGE=03, BROADCAST] 01:SCAN HDFS [small_table TB, RANDOM] partitions=1/1 files=1 size=5.20KB table stats: 143 rows total column stats: all hosts=3 per-host-mem=32.00MB tuple-ids=1 row-size=28B cardinality=143 ----------------
02-04-2016 09:07 AM
For the first question, that looks like a bug to me, but I can't reproduce it yet in my recent build. What version are you on? Does it happen without the create table part? Does it happen with straight_join?
For the second question, based on http://www.cloudera.com/documentation/enterprise/latest/topics/impala_create_table.html?scroll=creat... , I believe we do not support that yet.
02-04-2016 06:15 PM
You can't evaluate the join conditions with a hash join because of the OR, so Impala falls back to using the cross join algorithm (which will be very slow for large data sets).
ON (TA.product_class = TB.produc_class AND TA.id = TB.id) OR TA.name = TB.name