Created on 06-23-2015 12:03 AM - edited 09-16-2022 02:32 AM
3 node cluster, CDH 5.2.0, 5.4.2
I have an interesting finding that a query with an ORDER BY clause runs much faster than without ORDER BY clause (4m17s vs 37m4s). The query has no selection criteria, but used a complicated SQL with concat(), many regexp and extract() to format a timestamp to string of specific format, sort of testing of an import job by rebuilding the original data.
Create table statement:
create table impala_timestp_txt ( batch_id string, setup_time_str string, setup_time_ts timestamp) ROW FORMAT DELIMITED FIELDS TERMINATED BY ';' stored as textfile;
Some sample data:
NCWEUI320150301.05;00:06:57.490 HKT Sun May 11 2008;2008-05-11 00:06:57.490000000 NCWEUI320150301.05;00:34:15.141 HKT Sun May 11 2008;2008-05-11 00:34:15.141000000 NCWEUI320150301.05;01:05:46.346 HKT Sun May 11 2008;2008-05-11 01:05:46.346000000 NCWEUI320150301.05;01:06:49.306 HKT Sun May 11 2008;2008-05-11 01:06:49.306000000 NCWEUI320150301.05;01:07:50.340 HKT Sun May 11 2008;2008-05-11 01:07:50.340000000 NCWEUI320150301.05;01:08:46.359 HKT Sun May 11 2008;2008-05-11 01:08:46.359000000 NCWEUI320150301.05;01:09:50.341 HKT Sun May 11 2008;2008-05-11 01:09:50.341000000 NCWEUI320150301.05;01:25:45.978 HKT Sun May 11 2008;2008-05-11 01:25:45.978000000 NCWEUI320150301.05;01:26:49.958 HKT Sun May 11 2008;2008-05-11 01:26:49.958000000 NCWEUI320150301.05;01:27:47.975 HKT Sun May 11 2008;2008-05-11 01:27:47.975000000
Number of Rows: 3.67M
Partitioned: No (similar result if partitioned)
Number of rows per BATCH_ID fluctuates between 100 to 100000 rows. Mostly around 5000.
The following SQL was run:
select batch_id, concat(setup_time_prefix, lpad(cast(hour (setup_time) as string), 2, '0'), ':', lpad(cast(minute (setup_time) as string), 2, '0'), ':', lpad(cast(second (setup_time) as string), 2, '0'), '.', lpad(cast(extract(millisecond from setup_time) as string), 3, '0'), ' ', setup_time_tz, ' ', case dayofweek(v.setup_time) when 1 then 'Sun ' when 2 then 'Mon ' when 3 then 'Tue ' when 4 then 'Wed ' when 5 then 'Thu ' when 6 then 'Fri ' when 7 then 'Sat ' end, case month(v.setup_time) when 1 then 'Jan ' when 2 then 'Feb ' when 3 then 'Mar ' when 4 then 'Apr ' when 5 then 'May ' when 6 then 'Jun ' when 7 then 'Jul ' when 8 then 'Aug ' when 9 then 'Sep ' when 10 then 'Oct ' when 11 then 'Nov ' when 12 then 'Dec ' end, cast(day(v.setup_time) as string), ' ', cast(year(v.setup_time) as string) ) from ( select batch_id, case when regexp_extract(setup_time_str, '(.*) (.*) (.* .* .* .*)', 2)='UTC' then hours_add(setup_time_ts, -8) else setup_time_ts end setup_time, regexp_extract(v.setup_time_str, '^([\*\.])', 1) setup_time_prefix, regexp_extract(v.setup_time_str, '(.*) (.*) (.* .* .* .*)', 2) setup_time_tz from impala_timestp_txt v ) v order by batch_id /* Comment out for not sorted version */;
(Please ignore problem with data structure or time zone usage. However, suggestion on improving the formatting is appreciated)
I generated the execution summary and profile. It seems without the ORDER BY clause, all Impala Daemon sends raw data to the coordinator, so that it needs to do all the formatting. Howver, with ORDER BY clause, besides sorting the data, the daemons also do the formatting, so that the complicated work was distributed among all daemons. (But this cannot explain the over 9x difference in performance on a 3 node cluster)
I can provide the complete or more sample data for reproducing the case, but I think it is easy to generate similar data and reproduce the case.
The result was basically the same with data files in parquet.
Sorted Query Execution Plan and Execution Summary (note that according to summary, the SQL took a few seconds only):
Sorted Query Execution Plan and Execution Summary (note that according to summary, the SQL took a few seconds only): F01:PLAN FRAGMENT [UNPARTITIONED] 02:MERGING-EXCHANGE [UNPARTITIONED] order by: batch_id ASC hosts=3 per-host-mem=unavailable tuple-ids=2 row-size=63B cardinality=unavailable F00:PLAN FRAGMENT [RANDOM] DATASTREAM SINK [FRAGMENT=F01, EXCHANGE=02, UNPARTITIONED] 01:SORT | order by: batch_id ASC | hosts=3 per-host-mem=0B | tuple-ids=2 row-size=63B cardinality=unavailable | 00:SCAN HDFS [anthony.impala_timestp_txt v, RANDOM] partitions=1/1 files=1 size=282.02MB table stats: unavailable column stats: unavailable hosts=3 per-host-mem=176.00MB tuple-ids=0 row-size=46B cardinality=unavailable +---------------------+--------+----------+----------+-------+------------+-----------+---------------+------------------------------+ | Operator | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. Peak Mem | Detail | +---------------------+--------+----------+----------+-------+------------+-----------+---------------+------------------------------+ | 02:MERGING-EXCHANGE | 1 | 753.03ms | 753.03ms | 3.67M | -1 | 0 B | -1 B | UNPARTITIONED | | 01:SORT | 3 | 88.50s | 147.82s | 3.67M | -1 | 176.32 MB | 0 B | | | 00:SCAN HDFS | 3 | 1.18s | 1.85s | 3.67M | -1 | 34.74 MB | 176.00 MB | impala_timestp_txt v | +---------------------+--------+----------+----------+-------+------------+-----------+---------------+------------------------------+
Unsorted Query Execution Plan and Execution Summary:
F01:PLAN FRAGMENT [UNPARTITIONED] 01:EXCHANGE [UNPARTITIONED] hosts=3 per-host-mem=unavailable tuple-ids=0 row-size=46B cardinality=unavailable F00:PLAN FRAGMENT [RANDOM] DATASTREAM SINK [FRAGMENT=F01, EXCHANGE=01, UNPARTITIONED] 00:SCAN HDFS [anthony.impala_timestp_txt v, RANDOM] partitions=1/1 files=1 size=282.02MB table stats: unavailable column stats: unavailable hosts=3 per-host-mem=176.00MB tuple-ids=0 row-size=46B cardinality=unavailable +--------------+--------+----------+----------+-------+------------+----------+---------------+------------------------------+ | Operator | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. Peak Mem | Detail | +--------------+--------+----------+----------+-------+------------+----------+---------------+------------------------------+ | 01:EXCHANGE | 1 | 7.41s | 7.41s | 3.67M | -1 | 0 B | -1 B | UNPARTITIONED | | 00:SCAN HDFS | 3 | 566.29ms | 1.12s | 3.67M | -1 | 34.68 MB | 176.00 MB | impala_timestp_txt v | +--------------+--------+----------+----------+-------+------------+----------+---------------+------------------------------+
Profile values with significant differences:
Sorted query:
Query Runtime Profile: Query (id=4446304fc3a13547:10fea8c6a4752287): Summary: Session ID: 9d4999d13980722c:981b5c7d788c3db2 Session Type: BEESWAX Start Time: 2015-06-15 10:14:13.038578000 End Time: 2015-06-15 10:18:30.939273000 ... Query Timeline: 4m17s - Start execution: 52.752us (52.752us) - Planning finished: 200.34ms (199.981ms) - Ready to start remote fragments: 201.467ms (1.432ms) - Remote fragments started: 648.550ms (447.83ms) - Rows available: 2m31s (2m30s) - First row fetched: 2m31s (478.400ms) - Unregister query: 4m17s (1m45s) ImpalaServer: - ClientFetchWaitTimer: 1m9s - RowMaterializationTimer: 36s073ms Execution Profile 4446304fc3a13547:10fea8c6a4752287:(Total: 2m32s, non-child: 0ns, % non-child: 0.00%) ... Coordinator Fragment F01:(Total: 2m31s, non-child: 10.567ms, % non-child: 0.01%) ... - TotalCpuTime: 4m17s - TotalNetworkReceiveTime: 0ns ... Averaged Fragment F00:(Total: 3m52s, non-child: 0ns, % non-child: 0.00%) ... - TotalCpuTime: 2m55s - TotalNetworkReceiveTime: 0ns - TotalNetworkSendTime: 2m22s
Unsorted query:
Query Runtime Profile: Query (id=794de9ed5ccc55ac:ab1391c885dbf2b9): Summary: Session ID: e14d21c04cf7dd00:bd027a1ab2b142b9 Session Type: BEESWAX Start Time: 2015-06-15 11:21:09.472389000 End Time: 2015-06-15 11:58:13.872735000 ... Query Timeline: 37m4s - Start execution: 81.543us (81.543us) - Planning finished: 332.875ms (332.793ms) - Ready to start remote fragments: 339.182ms (6.307ms) - Remote fragments started: 1s386ms (1s047ms) - Rows available: 1s388ms (2.486ms) - First row fetched: 1s489ms (100.985ms) - Unregister query: 37m4s (37m2s) ImpalaServer: - ClientFetchWaitTimer: 1m26s - RowMaterializationTimer: 35m25s Execution Profile 794de9ed5ccc55ac:ab1391c885dbf2b9:(Total: 8s577ms, non-child: 0ns, % non-child: 0.00%) ... Coordinator Fragment F01:(Total: 7s439ms, non-child: 25.752ms, % non-child: 0.35%) ... - TotalCpuTime: 36m56s - TotalNetworkReceiveTime: 7s277ms ... Averaged Fragment F00:(Total: 22m46s, non-child: 0ns, % non-child: 0.00%) ... - TotalCpuTime: 21m59s - TotalNetworkReceiveTime: 0ns - TotalNetworkSendTime: 22m45s
Created 06-24-2015 02:24 PM
Sorry for the wait. Here's what I think is happening.
Impala deals with inline views by substituting the select-list expressions from the inline view in the parent query block. What that means in your case, is that many of the expensive expressions inside your inline view are executed multiple times in the slow non-ORDER-BY version of your query.
For example, every reference to "setup_time" in the outer select list is replaced by the corresponding expression from the inline view, i.e.
setup_time --> case when regexp_extract(...) then hours_add(...) else setup_time_ts end setup_time
As a result, not only are those expensive expressions only executed at the coordinator, but the expensive expressions are evaluated multiple times because they are referenced multiple times in the outer select list.
In the ORDER BY version of the query, this redundant expression evaluation is avoided because the ORDER BY materialized its input, so while the same inline view expression substitutions takes place, the outer references are substituted with materialized column references (i.e., the expensive expression is only evaluated once), i.e.
setup_time --> materialized column produced by the ORDER BY
Hope this makes sense!
Alex
Created 06-23-2015 09:17 AM
Thank you for posting such a detalied description. Your observation regarding expression evaluation is correct: Impala evaluates the expressions lazily.
To summarize:
- In the slow version without ORDER BY, the SCAN sends the raw data to the coordinator which then evaluates all expressions including those from your inline view.
- In the fast version with ORDER BY, the expressions from the inline view are evaulated and materialized at the SORT NODE, i.e., in paralell on all nodes.
Now, you had already observed this and you asked how this can explain the 10x difference whereas you'd only expect a 3x difference based on the 3x increased paralellizm.
The answer is that in the slow version the entire query execution is CPU bound by the single coordinator node.
Impala's execution engine is streaming, so the coordinator will apply backpressure to the stream sender and in turn the SCANs, if it cannot process the rows quickly enough (which in this case it obviously cannot). So it means while the coordinator is still processing a batch rows, the SCANs will not be able to make progress (it's not quite as simple as this, but it explains the mechanichs).
Created 06-23-2015 09:29 AM
Actually, since there's buffering on both sender/receiver sides I don't see how there could be a 10x difference between the queries. I believe ther is a much simpler explanation. Stay tuned for another response.
Created 06-24-2015 02:24 PM
Sorry for the wait. Here's what I think is happening.
Impala deals with inline views by substituting the select-list expressions from the inline view in the parent query block. What that means in your case, is that many of the expensive expressions inside your inline view are executed multiple times in the slow non-ORDER-BY version of your query.
For example, every reference to "setup_time" in the outer select list is replaced by the corresponding expression from the inline view, i.e.
setup_time --> case when regexp_extract(...) then hours_add(...) else setup_time_ts end setup_time
As a result, not only are those expensive expressions only executed at the coordinator, but the expensive expressions are evaluated multiple times because they are referenced multiple times in the outer select list.
In the ORDER BY version of the query, this redundant expression evaluation is avoided because the ORDER BY materialized its input, so while the same inline view expression substitutions takes place, the outer references are substituted with materialized column references (i.e., the expensive expression is only evaluated once), i.e.
setup_time --> materialized column produced by the ORDER BY
Hope this makes sense!
Alex
Created 06-25-2015 12:41 AM
Created 06-25-2015 03:05 PM
Makes sense.
I appreciate your thorough question, and I completely agree that we should point out this expression-substitution behavior in the performance guide.
It's not the first time it has come, and I'd imagine it will not be the last 🙂
Btw, if you really really want to get the materialization behavior with an inline view without an ORDER BY, then you can apply the following terrible hack.
Original query:
select a, b, c from (select f(x) as a, f(y) as b, f(z) as c from mytable) v
Modified query to force materialization of inline view:
select a, b, c from
(select f(x) as a, f(y) as b, f(z) as c from mytable
union all
select NULL, NULL, NULL from mytable where false) v
The "union all" will force materialization, but the second union operand will be dropped due to the "false" predicate.
Obviously, that behavior is implementation defined and subject to change any time, so it would be wise not to rely on it.