Impala insert select very slow

New Contributor

I am denormalizing a big table by joining several tables together. The total number of rows being populated is 2B+, and the whole process took about 3 hours. Although my cluster is relatively small (6 nodes with 48 virtual CPUs each), I would still expect much better performance. The CDH version is 2.6.0-cdh5.7.1.
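
For context, the statement is of the general shape sketched below; the table and column names are placeholders, not the poster's actual schema (only the database name appears in the profile):

-- Hypothetical shape of the denormalizing statement (placeholder tables/columns):
INSERT INTO zinc_bda_poc.denorm_target
SELECT f.*, d1.attr_a, d2.attr_b
FROM zinc_bda_poc.fact_table f
JOIN zinc_bda_poc.dim_one d1 ON f.dim1_id = d1.id
JOIN zinc_bda_poc.dim_two d2 ON f.dim2_id = d2.id;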

Query info is as follows:

 

Query Info

  • Query ID: ff40d8437f0131ee:dc6db8ef01f9e1b5
  • User: zinc
  • Database: zinc_bda_poc
  • Coordinator: 
  • Query Type: DML
  • Query State: FINISHED
  • Start Time: Feb 26, 2018 9:59:59 PM
  • End Time: Feb 27, 2018 4:14:59 PM
  • Duration: 18h, 14m
  • Rows Produced: 2,104,465,970
  • Admission Result: Admitted immediately
  • Admission Wait Time: 0ms
  • Aggregate Peak Memory Usage: 124.4 GiB
  • Bytes Streamed: 21.6 GiB
  • Client Fetch Wait Time: 15.3h
  • Client Fetch Wait Time Percentage: 84
  • Connected User: zinc
  • Estimated per Node Peak Memory: 4.7 GiB
  • File Formats: PARQUET/SNAPPY
  • HDFS Average Scan Range: 107.7 MiB
  • HDFS Bytes Read: 21.6 GiB
  • HDFS Bytes Read From Cache: 0 B
  • HDFS Bytes Read From Cache Percentage: 0
  • HDFS Bytes Written: 39.6 GiB
  • HDFS Local Bytes Read: 21.6 GiB
  • HDFS Local Bytes Read Percentage: 100
  • HDFS Remote Bytes Read: 0 B
  • HDFS Remote Bytes Read Percentage: 0
  • HDFS Scanner Average Read Throughput: 244.4 MiB/s
  • HDFS Short Circuit Bytes Read: 21.6 GiB
  • HDFS Short Circuit Bytes Read Percentage: 100
  • Impala Version: impalad version 2.5.0-cdh5.7.1 RELEASE (build 27a4325c18c2a01c7a8097681a0eccf6d4335ea1)
  • Memory Accrual: 232.7 GiB hours
  • Memory Spilled: 0 B
  • Network Address: 171.187.160.71:40044
  • Node with Peak Memory Usage: 
  • Out of Memory: false
  • Per Node Peak Memory Usage: 23.2 GiB
  • Planning Wait Time: 117ms
  • Planning Wait Time Percentage: 0
  • Pool: root.default
  • Query Status: OK
  • Rows Inserted: 2,104,465,970
  • Session ID: b04243d6c5e7f097:637037be300bed8d
  • Session Type: HIVESERVER2
  • Statistics Missing: false
  • Threads: CPU Time: 9.0d
  • Threads: CPU Time Percentage: 100
  • Threads: Network Receive Wait Time: 5.5m
  • Threads: Network Receive Wait Time Percentage: 0
  • Threads: Network Send Wait Time: 1.1m
  • Threads: Network Send Wait Time Percentage: 0
  • Threads: Storage Wait Time: 1.9m
  • Threads: Storage Wait Time Percentage: 0
  • Threads: Total Time: 9.0d
  • Work CPU Time: 9.0d
 

Query Timeline

  1. Start execution: 23.39us (23.39us)
  2. Planning finished: 117ms (117ms)
  3. Submit for admission: 119ms (1ms)
  4. Completed admission: 119ms (65.20us)
  5. Ready to start 29 remote fragments: 119ms (245.52us)
  6. All 29 remote fragments started: 125ms (6ms)
  7. DML data written: 3.0h (3.0h)
  8. DML Metastore update finished: 3.0h (661ms)
  9. Request finished: 3.0h (555.57us)
  10. Unregister query: 18.2h (15.3h)

Planner Timeline

  1. Analysis finished: 24ms (24ms)
  2. Equivalence classes computed: 49ms (25ms)
  3. Single node plan created: 70ms (21ms)
  4. Runtime filters computed: 71ms (647.00us)
  5. Distributed plan created: 81ms (9ms)
  6. Lineage info computed: 93ms (12ms)
  7. Planning finished: 106ms (13ms)
4 Replies

Champion

Are you still looking for a solution? Could you let me know?

New Contributor

Yes, I am. Actually, I found that even when I insert into an Impala table from another Impala table with INSERT ... SELECT ..., it is very slow too. I also compared the data volume of the underlying Parquet files written by Impala and by Spark; sometimes the Spark-written Parquet files are much smaller.
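
Regarding the file-size comparison, one way to see per-file sizes from within Impala is SHOW FILES; a minimal sketch, using the table names that appear in the plan later in the thread (the tables actually compared may differ):

-- List each data file and its size for the target and source tables
SHOW FILES IN zinc_bda_poc.zc_inst;
SHOW FILES IN zinc_bda_poc.zc_inst_snappy;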

Champion

Could you share the query explain plan?

Did you compute stats on the table?

What file format?

Is there any compression?
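
For reference, statistics are gathered in Impala with COMPUTE STATS; a minimal sketch against the table that the plan further down flags as missing stats:

COMPUTE STATS zinc_bda_poc.zc_inst_snappy;      -- gathers table and column statistics
SHOW TABLE STATS zinc_bda_poc.zc_inst_snappy;   -- verify #Rows is no longer -1
SHOW COLUMN STATS zinc_bda_poc.zc_inst_snappy;  -- verify per-column NDV / size estimates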

 

New Contributor

I tried a simpler insert statement to make sure that the bottleneck is the Parquet writing. Here are the explain plan and query info.

 

Yes, I do have the default Snappy compression. I also tried gzip compression by setting compression_codec=gzip; it improves things, but only marginally.
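
The codec is a per-session query option, so it has to be set before running the insert in the same session; a minimal sketch:

SET COMPRESSION_CODEC=gzip;   -- default is SNAPPY; NONE is also accepted
-- then rerun the INSERT ... SELECT in this same session so the new codec applies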

 

Query Details

Query Plan
----------------
Estimated Per-Host Requirements: Memory=5.69GB VCores=1
WARNING: The following tables are missing relevant table and/or column statistics.
zinc_bda_poc.zc_inst_snappy

WRITE TO HDFS [zinc_bda_poc.zc_inst, OVERWRITE=false]
|  partitions=1
|  hosts=6 per-host-mem=1.00GB
|
00:SCAN HDFS [zinc_bda_poc.zc_inst_snappy, RANDOM]
   partitions=1/1 files=562 size=50.37GB
   table stats: unavailable
   column stats: unavailable
   hosts=6 per-host-mem=4.69GB
   tuple-ids=0 row-size=3.40KB cardinality=unavailable
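
For reference, the statement behind this plan is presumably of the form below, reconstructed from the WRITE/SCAN targets above; the SELECT list is an assumption:

EXPLAIN
INSERT INTO zinc_bda_poc.zc_inst
SELECT * FROM zinc_bda_poc.zc_inst_snappy;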

Query Info
  • Query ID: 6b4a35a73f643524:11ae80d0b4e0fca5
  • User: zinc
  • Database: zinc_bda_poc
  • Coordinator: 
  • Query Type: DML
  • Query State: FINISHED
  • Start Time: Mar 5, 2018 3:31:07 PM
  • End Time: Mar 5, 2018 6:40:39 PM
  • Duration: 3h, 9m
  • Rows Produced: 2,104,465,970
  • Admission Result: Admitted immediately
  • Admission Wait Time: 0ms
  • Aggregate Peak Memory Usage: 15.4 GiB
  • Client Fetch Wait Time: 1.3h
  • Client Fetch Wait Time Percentage: 42
  • Connected User: zinc
  • Estimated per Node Peak Memory: 5.7 GiB
  • File Formats: PARQUET/SNAPPY
  • HDFS Average Scan Range: 91.8 MiB
  • HDFS Bytes Read: 50.4 GiB
  • HDFS Bytes Read From Cache: 0 B
  • HDFS Bytes Read From Cache Percentage: 0
  • HDFS Bytes Written: 82.3 GiB
  • HDFS Local Bytes Read: 50.4 GiB
  • HDFS Local Bytes Read Percentage: 100
  • HDFS Remote Bytes Read: 0 B
  • HDFS Remote Bytes Read Percentage: 0
  • HDFS Scanner Average Read Throughput: 221.1 MiB/s
  • HDFS Short Circuit Bytes Read: 50.4 GiB
  • HDFS Short Circuit Bytes Read Percentage: 100
  • Impala Version: impalad version 2.5.0-cdh5.7.1 RELEASE (build 27a4325c18c2a01c7a8097681a0eccf6d4335ea1)
  • Memory Accrual: 23.0 GiB hours
  • Memory Spilled: 0 B
  • Network Address: 171.187.160.71:60372
  • Node with Peak Memory Usage: uswxaprefdd41.uslttrr.baml.com:22000
  • Out of Memory: false
  • Per Node Peak Memory Usage: 2.7 GiB
  • Planning Wait Time: 85ms
  • Planning Wait Time Percentage: 0
  • Pool: root.default
  • Query Status: OK
  • Rows Inserted: 2,104,465,970
  • Session ID: 843980de5063bcb:59aafff0cb2924a8
  • Session Type: HIVESERVER2
  • Statistics Missing: true
  • Threads: CPU Time: 5.0d
  • Threads: CPU Time Percentage: 100
  • Threads: Network Receive Wait Time: 0ms
  • Threads: Network Receive Wait Time Percentage: 0
  • Threads: Network Send Wait Time: 0ms
  • Threads: Network Send Wait Time Percentage: 0
  • Threads: Storage Wait Time: 7.5m
  • Threads: Storage Wait Time Percentage: 0
  • Threads: Total Time: 5.0d
  • Work CPU Time: 5.0d
 
Query Timeline
  1. Start execution: 22.14us (22.14us)
  2. Planning finished: 85ms (85ms)
  3. Submit for admission: 86ms (1ms)
  4. Completed admission: 86ms (62.61us)
  5. Ready to start 6 remote fragments: 86ms (122.75us)
  6. All 6 remote fragments started: 91ms (5ms)
  7. DML data written: 1.8h (1.8h)
  8. DML Metastore update finished: 1.8h (828ms)
  9. Request finished: 1.8h (317.25us)
  10. Unregister query: 3.2h (1.3h)
Planner Timeline
  1. Analysis finished: 12ms (12ms)
  2. Equivalence classes computed: 15ms (2ms)
  3. Single node plan created: 30ms (15ms)
  4. Runtime filters computed: 30ms (21.80us)
  5. Distributed plan created: 40ms (9ms)
  6. Lineage info computed: 54ms (14ms)
  7. Planning finished: 69ms (14ms)

