Reply
Highlighted
New Contributor
Posts: 2
Registered: ‎02-04-2016

Running impala query with inner join

I would appreciate any help with my query which I try to optimize. The main problem seems to be in the join type impala uses. I specify inner join in the query, but the query is executed with cross join instead.

 

The second question would be more general. Can I create table using CREATE TABLE tbl_name AS SELECT... with partitions directly?

 

Thank you for any points.

 

Query

CREATE TABLE result AS
SELECT
   TA.*,
   TB.type
FROM
   big_table AS TA
   INNER JOIN small_table AS TB
   ON 
      (TA.product_class = TB.produc_class AND TA.id = TB.id) 
      OR TA.name = TB.name
WHERE 
   TA._partition_ IN (1,7,12)

 

Query Plan

----------------
Estimated Per-Host Requirements: Memory=2.28GB VCores=2

F00:PLAN FRAGMENT [RANDOM]
  WRITE TO HDFS [result, OVERWRITE=false]
  |  partitions=1
  |  hosts=6 per-host-mem=100.00KB
  |
  02:CROSS JOIN [BROADCAST]
  |  predicates: (TA.product_class = TB.produc_class AND TA.id = TB.id) OR TA.name = TB.name
  |  hosts=6 per-host-mem=3.87KB
  |  tuple-ids=0,1 row-size=68B cardinality=21166283493
  |
  |--03:EXCHANGE [BROADCAST]
  |     hosts=3 per-host-mem=0B
  |     tuple-ids=1 row-size=28B cardinality=143
  |
  00:SCAN HDFS [big_table TA, RANDOM]
     partitions=72/238 files=2986 size=47.11GB
     table stats: 4058041841 rows total
     column stats: all
     hosts=6 per-host-mem=2.25GB
     tuple-ids=0 row-size=40B cardinality=1357944665

F01:PLAN FRAGMENT [RANDOM]
  DATASTREAM SINK [FRAGMENT=F00, EXCHANGE=03, BROADCAST]
  01:SCAN HDFS [small_table TB, RANDOM]
     partitions=1/1 files=1 size=5.20KB
     table stats: 143 rows total
     column stats: all
     hosts=3 per-host-mem=32.00MB
     tuple-ids=1 row-size=28B cardinality=143
----------------

 

 

 

 

 

 

 

Cloudera Employee
Posts: 35
Registered: ‎10-20-2015

Re: Running impala query with inner join

For the first question, that looks like a bug to me, but I can't reproduce it yet in my recent build. What version are you on? Does it happen without the create table part? Does it happen with straight_join?

 

For the second question, based on http://www.cloudera.com/documentation/enterprise/latest/topics/impala_create_table.html?scroll=creat... , I believe we do not support that yet.

Cloudera Employee
Posts: 422
Registered: ‎07-29-2015

Re: Running impala query with inner join

You can't evaluate the join conditions with a hash join because of the OR, so Impala falls back to using the cross join algorithm (which will be very slow for large data sets).

 

 

  ON 
      (TA.product_class = TB.produc_class AND TA.id = TB.id) 
      OR TA.name = TB.name