New Contributor
Posts: 3
Registered: ‎02-27-2018

Impala insert select very slow

I am denormalizing a big table by joining several tables together. The total number of rows I am populating is 2 billion+, and the whole process took 3 hours. Although my cluster is relatively small (6 nodes with 48 virtual CPUs each), I still expect much better performance. The CDH version is 2.6.0-cdh5.7.1.
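
For context, the statement is roughly of this shape; the table and column names below are placeholders, since the actual join query is not shown here:

-- Hypothetical sketch of the denormalizing INSERT ... SELECT (placeholder names)
INSERT INTO zinc_bda_poc.denorm_target
SELECT f.*, d1.attr_a, d2.attr_b
FROM zinc_bda_poc.fact_big f
JOIN zinc_bda_poc.dim_one d1 ON f.d1_key = d1.id
JOIN zinc_bda_poc.dim_two d2 ON f.d2_key = d2.id;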

 Query info is as follows:

 

Query Info

  • Query ID: ff40d8437f0131ee:dc6db8ef01f9e1b5
  • User: zinc
  • Database: zinc_bda_poc
  • Coordinator: 
  • Query Type: DML
  • Query State: FINISHED
  • Start Time: Feb 26, 2018 9:59:59 PM
  • End Time: Feb 27, 2018 4:14:59 PM
  • Duration: 18h, 14m
  • Rows Produced: 2,104,465,970
  • Admission Result: Admitted immediately
  • Admission Wait Time: 0ms
  • Aggregate Peak Memory Usage: 124.4 GiB
  • Bytes Streamed: 21.6 GiB
  • Client Fetch Wait Time: 15.3h
  • Client Fetch Wait Time Percentage: 84
  • Connected User: zinc
  • Estimated per Node Peak Memory: 4.7 GiB
  • File Formats: PARQUET/SNAPPY
  • HDFS Average Scan Range: 107.7 MiB
  • HDFS Bytes Read: 21.6 GiB
  • HDFS Bytes Read From Cache: 0 B
  • HDFS Bytes Read From Cache Percentage: 0
  • HDFS Bytes Written: 39.6 GiB
  • HDFS Local Bytes Read: 21.6 GiB
  • HDFS Local Bytes Read Percentage: 100
  • HDFS Remote Bytes Read: 0 B
  • HDFS Remote Bytes Read Percentage: 0
  • HDFS Scanner Average Read Throughput: 244.4 MiB/s
  • HDFS Short Circuit Bytes Read: 21.6 GiB
  • HDFS Short Circuit Bytes Read Percentage: 100
  • Impala Version: impalad version 2.5.0-cdh5.7.1 RELEASE (build 27a4325c18c2a01c7a8097681a0eccf6d4335ea1)
  • Memory Accrual: 232.7 GiB hours
  • Memory Spilled: 0 B
  • Network Address: 171.187.160.71:40044
  • Node with Peak Memory Usage: 
  • Out of Memory: false
  • Per Node Peak Memory Usage: 23.2 GiB
  • Planning Wait Time: 117ms
  • Planning Wait Time Percentage: 0
  • Pool: root.default
  • Query Status: OK
  • Rows Inserted: 2,104,465,970
  • Session ID: b04243d6c5e7f097:637037be300bed8d
  • Session Type: HIVESERVER2
  • Statistics Missing: false
  • Threads: CPU Time: 9.0d
  • Threads: CPU Time Percentage: 100
  • Threads: Network Receive Wait Time: 5.5m
  • Threads: Network Receive Wait Time Percentage: 0
  • Threads: Network Send Wait Time: 1.1m
  • Threads: Network Send Wait Time Percentage: 0
  • Threads: Storage Wait Time: 1.9m
  • Threads: Storage Wait Time Percentage: 0
  • Threads: Total Time: 9.0d
  • Work CPU Time: 9.0d
 

Query Timeline

  1. Start execution: 23.39us (23.39us)
  2. Planning finished: 117ms (117ms)
  3. Submit for admission: 119ms (1ms)
  4. Completed admission: 119ms (65.20us)
  5. Ready to start 29 remote fragments: 119ms (245.52us)
  6. All 29 remote fragments started: 125ms (6ms)
  7. DML data written: 3.0h (3.0h)
  8. DML Metastore update finished: 3.0h (661ms)
  9. Request finished: 3.0h (555.57us)
  10. Unregister query: 18.2h (15.3h)

Planner Timeline

  1. Analysis finished: 24ms (24ms)
  2. Equivalence classes computed: 49ms (25ms)
  3. Single node plan created: 70ms (21ms)
  4. Runtime filters computed: 71ms (647.00us)
  5. Distributed plan created: 81ms (9ms)
  6. Lineage info computed: 93ms (12ms)
  7. Planning finished: 106ms (13ms)

Champion
Posts: 731
Registered: ‎05-16-2016

Re: Impala insert select very slow

Are you still looking for a solution? Could you let me know?

New Contributor
Posts: 3
Registered: ‎02-27-2018

Re: Impala insert select very slow

Yes, I am. Actually, I found that even when I insert into an Impala table from another Impala table with INSERT ... SELECT, it is very slow too. I also compared the size of the underlying Parquet files written by Impala and by Spark; sometimes the Spark-written Parquet files are much smaller.
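
For the file-size comparison, something like the following works in impala-shell (SHOW FILES lists each data file and its size; the second table name is a placeholder for the Spark-written copy):

-- Compare Parquet data files written by Impala vs. Spark
SHOW FILES IN zinc_bda_poc.zc_inst;              -- Impala-written table
SHOW FILES IN zinc_bda_poc.zc_inst_spark_copy;   -- placeholder: the Spark-written equivalent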

Champion
Posts: 731
Registered: ‎05-16-2016

Re: Impala insert select very slow

Could you share the query explain plan?

Did you compute stats on the table?

What file format is it?

Is there any compression? A minimal sketch of these checks is below.
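
A minimal sketch of those checks in impala-shell, assuming a placeholder table name:

-- Check whether stats exist (#Rows = -1 means missing), then compute them
SHOW TABLE STATS zinc_bda_poc.your_table;
SHOW COLUMN STATS zinc_bda_poc.your_table;
COMPUTE STATS zinc_bda_poc.your_table;

-- File format and compression can be seen in the table DDL and session options
SHOW CREATE TABLE zinc_bda_poc.your_table;
SET COMPRESSION_CODEC=snappy;   -- session option used for Parquet INSERTs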

 

New Contributor
Posts: 3
Registered: ‎02-27-2018

Re: Impala insert select very slow

I tried a simpler INSERT statement to make sure that the bottleneck is the Parquet writing. Here are the explain plan and query info.

 

Yes, I have the default Snappy compression. I tried gzip compression by setting compression_codec=gzip; it improved things, but only marginally.
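
The simpler test was essentially a straight table-to-table copy, along these lines (a sketch consistent with the plan below, not the exact statement):

-- Straight copy to isolate the Parquet write cost
SET COMPRESSION_CODEC=gzip;   -- the codec change mentioned above; the default is snappy
INSERT INTO zinc_bda_poc.zc_inst
SELECT * FROM zinc_bda_poc.zc_inst_snappy;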

 

Query Details

Query Plan
----------------
Estimated Per-Host Requirements: Memory=5.69GB VCores=1
WARNING: The following tables are missing relevant table and/or column statistics.
zinc_bda_poc.zc_inst_snappy

WRITE TO HDFS [zinc_bda_poc.zc_inst, OVERWRITE=false]
|  partitions=1
|  hosts=6 per-host-mem=1.00GB
|
00:SCAN HDFS [zinc_bda_poc.zc_inst_snappy, RANDOM]
   partitions=1/1 files=562 size=50.37GB
   table stats: unavailable
   column stats: unavailable
   hosts=6 per-host-mem=4.69GB
   tuple-ids=0 row-size=3.40KB cardinality=unavailable

Query Info
  • Query ID: 6b4a35a73f643524:11ae80d0b4e0fca5
  • User: zinc
  • Database: zinc_bda_poc
  • Coordinator: 
  • Query Type: DML
  • Query State: FINISHED
  • Start Time: Mar 5, 2018 3:31:07 PM
  • End Time: Mar 5, 2018 6:40:39 PM
  • Duration: 3h, 9m
  • Rows Produced: 2,104,465,970
  • Admission Result: Admitted immediately
  • Admission Wait Time: 0ms
  • Aggregate Peak Memory Usage: 15.4 GiB
  • Client Fetch Wait Time: 1.3h
  • Client Fetch Wait Time Percentage: 42
  • Connected User: zinc
  • Estimated per Node Peak Memory: 5.7 GiB
  • File Formats: PARQUET/SNAPPY
  • HDFS Average Scan Range: 91.8 MiB
  • HDFS Bytes Read: 50.4 GiB
  • HDFS Bytes Read From Cache: 0 B
  • HDFS Bytes Read From Cache Percentage: 0
  • HDFS Bytes Written: 82.3 GiB
  • HDFS Local Bytes Read: 50.4 GiB
  • HDFS Local Bytes Read Percentage: 100
  • HDFS Remote Bytes Read: 0 B
  • HDFS Remote Bytes Read Percentage: 0
  • HDFS Scanner Average Read Throughput: 221.1 MiB/s
  • HDFS Short Circuit Bytes Read: 50.4 GiB
  • HDFS Short Circuit Bytes Read Percentage: 100
  • Impala Version: impalad version 2.5.0-cdh5.7.1 RELEASE (build 27a4325c18c2a01c7a8097681a0eccf6d4335ea1)
  • Memory Accrual: 23.0 GiB hours
  • Memory Spilled: 0 B
  • Network Address: 171.187.160.71:60372
  • Node with Peak Memory Usage: uswxaprefdd41.uslttrr.baml.com:22000
  • Out of Memory: false
  • Per Node Peak Memory Usage: 2.7 GiB
  • Planning Wait Time: 85ms
  • Planning Wait Time Percentage: 0
  • Pool: root.default
  • Query Status: OK
  • Rows Inserted: 2,104,465,970
  • Session ID: 843980de5063bcb:59aafff0cb2924a8
  • Session Type: HIVESERVER2
  • Statistics Missing: true
  • Threads: CPU Time: 5.0d
  • Threads: CPU Time Percentage: 100
  • Threads: Network Receive Wait Time: 0ms
  • Threads: Network Receive Wait Time Percentage: 0
  • Threads: Network Send Wait Time: 0ms
  • Threads: Network Send Wait Time Percentage: 0
  • Threads: Storage Wait Time: 7.5m
  • Threads: Storage Wait Time Percentage: 0
  • Threads: Total Time: 5.0d
  • Work CPU Time: 5.0d
 
Query Timeline
  1. Start execution: 22.14us (22.14us)
  2. Planning finished: 85ms (85ms)
  3. Submit for admission: 86ms (1ms)
  4. Completed admission: 86ms (62.61us)
  5. Ready to start 6 remote fragments: 86ms (122.75us)
  6. All 6 remote fragments started: 91ms (5ms)
  7. DML data written: 1.8h (1.8h)
  8. DML Metastore update finished: 1.8h (828ms)
  9. Request finished: 1.8h (317.25us)
  10. Unregister query: 3.2h (1.3h)
Planner Timeline
  1. Analysis finished: 12ms (12ms)
  2. Equivalence classes computed: 15ms (2ms)
  3. Single node plan created: 30ms (15ms)
  4. Runtime filters computed: 30ms (21.80us)
  5. Distributed plan created: 40ms (9ms)
  6. Lineage info computed: 54ms (14ms)
  7. Planning finished: 69ms (14ms)

