Support Questions

Find answers, ask questions, and share your expertise

Extremely slow join

New Contributor

I am working on a POC that consumes messages from an SQS queue via Spark and inserts the data into two separate but similar tables.

Both of the tables below have roughly 228M records in them in the partition '2017-12-11'. The select query below takes roughly 330 seconds to complete and if I add in clustering the query never completes. I have tested with vectorization, tez index filtering and modifying the tez container size. None of these adjustments have helped improve the overall performance.

Does anyone have any suggestions on how I might improve this performance?

SELECT
  cds.serialid, 
  cds.serialid, 
  get_json_object(cd.data, '$.device_sku') as device_sku
from
  component_data_status cds
INNER JOIN component_data cd ON cds.serialid = cd.serialid AND cds.deviceid = cd.deviceid
WHERE
  cds.xdate = '2017-12-11'
and
  cd.xdate = '2017-12-11'
LIMIT 10;
CREATE TABLE IF NOT EXISTS component_data ( 
  serialid string,
  deviceid string,
  json_blob string) 
PARTITIONED BY (xdate string) 
STORED AS ORCFILE 
tblproperties ("orc.compress"="SNAPPY")
CREATE TABLE IF NOT EXISTS component_data_status (
  serialid string,
  deviceid string,
  component string)
PARTITIONED BY (xdate string) 
STORED AS ORCFILE 
tblproperties ("orc.compress"="SNAPPY")
1 REPLY 1

Explorer

Take a look at the explain plan. The join type matters a lot. Stats collection typically helps CBO pick the right join. Full shuffle join is typically slowest, distributed hash join is faster, broad cast join typically faster still. If you run into memory issues either use the distributed hash join or you could try bucketing.