Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Running impala query with inner join

Highlighted

Running impala query with inner join

New Contributor

I would appreciate any help with my query which I try to optimize. The main problem seems to be in the join type impala uses. I specify inner join in the query, but the query is executed with cross join instead.

 

The second question would be more general. Can I create table using CREATE TABLE tbl_name AS SELECT... with partitions directly?

 

Thank you for any points.

 

Query

CREATE TABLE result AS
SELECT
   TA.*,
   TB.type
FROM
   big_table AS TA
   INNER JOIN small_table AS TB
   ON 
      (TA.product_class = TB.produc_class AND TA.id = TB.id) 
      OR TA.name = TB.name
WHERE 
   TA._partition_ IN (1,7,12)

 

Query Plan

----------------
Estimated Per-Host Requirements: Memory=2.28GB VCores=2

F00:PLAN FRAGMENT [RANDOM]
  WRITE TO HDFS [result, OVERWRITE=false]
  |  partitions=1
  |  hosts=6 per-host-mem=100.00KB
  |
  02:CROSS JOIN [BROADCAST]
  |  predicates: (TA.product_class = TB.produc_class AND TA.id = TB.id) OR TA.name = TB.name
  |  hosts=6 per-host-mem=3.87KB
  |  tuple-ids=0,1 row-size=68B cardinality=21166283493
  |
  |--03:EXCHANGE [BROADCAST]
  |     hosts=3 per-host-mem=0B
  |     tuple-ids=1 row-size=28B cardinality=143
  |
  00:SCAN HDFS [big_table TA, RANDOM]
     partitions=72/238 files=2986 size=47.11GB
     table stats: 4058041841 rows total
     column stats: all
     hosts=6 per-host-mem=2.25GB
     tuple-ids=0 row-size=40B cardinality=1357944665

F01:PLAN FRAGMENT [RANDOM]
  DATASTREAM SINK [FRAGMENT=F00, EXCHANGE=03, BROADCAST]
  01:SCAN HDFS [small_table TB, RANDOM]
     partitions=1/1 files=1 size=5.20KB
     table stats: 143 rows total
     column stats: all
     hosts=3 per-host-mem=32.00MB
     tuple-ids=1 row-size=28B cardinality=143
----------------

 

 

 

 

 

 

 

2 REPLIES 2

Re: Running impala query with inner join

Contributor

For the first question, that looks like a bug to me, but I can't reproduce it yet in my recent build. What version are you on? Does it happen without the create table part? Does it happen with straight_join?

 

For the second question, based on http://www.cloudera.com/documentation/enterprise/latest/topics/impala_create_table.html?scroll=creat... , I believe we do not support that yet.

Highlighted

Re: Running impala query with inner join

You can't evaluate the join conditions with a hash join because of the OR, so Impala falls back to using the cross join algorithm (which will be very slow for large data sets).

 

 

  ON 
      (TA.product_class = TB.produc_class AND TA.id = TB.id) 
      OR TA.name = TB.name

 

Don't have an account?
Coming from Hortonworks? Activate your account here