Support Questions

athtsang · ‎06-23-2015

3 node cluster, CDH 5.2.0, 5.4.2

I have an interesting finding that a query with an ORDER BY clause runs much faster than without ORDER BY clause (4m17s vs 37m4s). The query has no selection criteria, but used a complicated SQL with concat(), many regexp and extract() to format a timestamp to string of specific format, sort of testing of an import job by rebuilding the original data.

Create table statement:

create table impala_timestp_txt (
batch_id string,
setup_time_str string,
setup_time_ts timestamp)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ';'
stored as textfile;

Some sample data&colon;

NCWEUI320150301.05;00:06:57.490 HKT Sun May 11 2008;2008-05-11 00:06:57.490000000
NCWEUI320150301.05;00:34:15.141 HKT Sun May 11 2008;2008-05-11 00:34:15.141000000
NCWEUI320150301.05;01:05:46.346 HKT Sun May 11 2008;2008-05-11 01:05:46.346000000
NCWEUI320150301.05;01:06:49.306 HKT Sun May 11 2008;2008-05-11 01:06:49.306000000
NCWEUI320150301.05;01:07:50.340 HKT Sun May 11 2008;2008-05-11 01:07:50.340000000
NCWEUI320150301.05;01:08:46.359 HKT Sun May 11 2008;2008-05-11 01:08:46.359000000
NCWEUI320150301.05;01:09:50.341 HKT Sun May 11 2008;2008-05-11 01:09:50.341000000
NCWEUI320150301.05;01:25:45.978 HKT Sun May 11 2008;2008-05-11 01:25:45.978000000
NCWEUI320150301.05;01:26:49.958 HKT Sun May 11 2008;2008-05-11 01:26:49.958000000
NCWEUI320150301.05;01:27:47.975 HKT Sun May 11 2008;2008-05-11 01:27:47.975000000

Number of Rows: 3.67M

Partitioned: No (similar result if partitioned)
Number of rows per BATCH_ID fluctuates between 100 to 100000 rows. Mostly around 5000.

The following SQL was run:

select batch_id, concat(setup_time_prefix,
       lpad(cast(hour (setup_time) as string), 2, '0'), ':',
       lpad(cast(minute (setup_time) as string), 2, '0'), ':',
       lpad(cast(second (setup_time) as string), 2, '0'), '.',
       lpad(cast(extract(millisecond from setup_time) as string), 3, '0'), ' ',
       setup_time_tz, ' ',
       case dayofweek(v.setup_time)
       when 1 then 'Sun ' when 2 then 'Mon ' when 3 then 'Tue ' when 4 then 'Wed '
       when 5 then 'Thu ' when 6 then 'Fri ' when 7 then 'Sat '
       end,
       case month(v.setup_time)
       when 1 then 'Jan ' when 2 then 'Feb ' when 3 then 'Mar ' when 4 then 'Apr '
       when 5 then 'May ' when 6 then 'Jun ' when 7 then 'Jul ' when 8 then 'Aug '
       when 9 then 'Sep ' when 10 then 'Oct ' when 11 then 'Nov ' when 12 then 'Dec '
       end,
       cast(day(v.setup_time) as string), ' ',
       cast(year(v.setup_time) as string)
)
from ( select batch_id,
case when regexp_extract(setup_time_str, '(.*) (.*) (.* .* .* .*)', 2)='UTC' then
hours_add(setup_time_ts, -8) else setup_time_ts end setup_time,
regexp_extract(v.setup_time_str, '^([\*\.])', 1) setup_time_prefix,
regexp_extract(v.setup_time_str, '(.*) (.*) (.* .* .* .*)', 2) setup_time_tz
from impala_timestp_txt v
) v
order by batch_id /* Comment out for not sorted version */;

(Please ignore problem with data structure or time zone usage. However, suggestion on improving the formatting is appreciated)

I generated the execution summary and profile. It seems without the ORDER BY clause, all Impala Daemon sends raw data to the coordinator, so that it needs to do all the formatting. Howver, with ORDER BY clause, besides sorting the data, the daemons also do the formatting, so that the complicated work was distributed among all daemons. (But this cannot explain the over 9x difference in performance on a 3 node cluster)

I can provide the complete or more sample data for reproducing the case, but I think it is easy to generate similar data and reproduce the case.

The result was basically the same with data files in parquet.

Sorted Query Execution Plan and Execution Summary (note that according to summary, the SQL took a few seconds only):

Sorted Query Execution Plan and Execution Summary (note that according to summary, the SQL took a few seconds only):
F01:PLAN FRAGMENT [UNPARTITIONED]
  02:MERGING-EXCHANGE [UNPARTITIONED]
     order by: batch_id ASC
     hosts=3 per-host-mem=unavailable
     tuple-ids=2 row-size=63B cardinality=unavailable

F00:PLAN FRAGMENT [RANDOM]
  DATASTREAM SINK [FRAGMENT=F01, EXCHANGE=02, UNPARTITIONED]
  01:SORT
  |  order by: batch_id ASC
  |  hosts=3 per-host-mem=0B
  |  tuple-ids=2 row-size=63B cardinality=unavailable
  |
  00:SCAN HDFS [anthony.impala_timestp_txt v, RANDOM]
     partitions=1/1 files=1 size=282.02MB
     table stats: unavailable
     column stats: unavailable
     hosts=3 per-host-mem=176.00MB
     tuple-ids=0 row-size=46B cardinality=unavailable


+---------------------+--------+----------+----------+-------+------------+-----------+---------------+------------------------------+
| Operator            | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem  | Est. Peak Mem | Detail                       |
+---------------------+--------+----------+----------+-------+------------+-----------+---------------+------------------------------+
| 02:MERGING-EXCHANGE | 1      | 753.03ms | 753.03ms | 3.67M | -1         | 0 B       | -1 B          | UNPARTITIONED                |
| 01:SORT             | 3      | 88.50s   | 147.82s  | 3.67M | -1         | 176.32 MB | 0 B           |                              |
| 00:SCAN HDFS        | 3      | 1.18s    | 1.85s    | 3.67M | -1         | 34.74 MB  | 176.00 MB     | impala_timestp_txt v         |
+---------------------+--------+----------+----------+-------+------------+-----------+---------------+------------------------------+

Unsorted Query Execution Plan and Execution Summary:

F01:PLAN FRAGMENT [UNPARTITIONED]
  01:EXCHANGE [UNPARTITIONED]
     hosts=3 per-host-mem=unavailable
     tuple-ids=0 row-size=46B cardinality=unavailable

F00:PLAN FRAGMENT [RANDOM]
  DATASTREAM SINK [FRAGMENT=F01, EXCHANGE=01, UNPARTITIONED]
  00:SCAN HDFS [anthony.impala_timestp_txt v, RANDOM]
     partitions=1/1 files=1 size=282.02MB
     table stats: unavailable
     column stats: unavailable
     hosts=3 per-host-mem=176.00MB
     tuple-ids=0 row-size=46B cardinality=unavailable
+--------------+--------+----------+----------+-------+------------+----------+---------------+------------------------------+
| Operator     | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. Peak Mem | Detail                       |
+--------------+--------+----------+----------+-------+------------+----------+---------------+------------------------------+
| 01:EXCHANGE  | 1      | 7.41s    | 7.41s    | 3.67M | -1         | 0 B      | -1 B          | UNPARTITIONED                |
| 00:SCAN HDFS | 3      | 566.29ms | 1.12s    | 3.67M | -1         | 34.68 MB | 176.00 MB     | impala_timestp_txt v         |
+--------------+--------+----------+----------+-------+------------+----------+---------------+------------------------------+

Profile values with significant differences:

Sorted query:

Query Runtime Profile:
Query (id=4446304fc3a13547:10fea8c6a4752287):
  Summary:
    Session ID: 9d4999d13980722c:981b5c7d788c3db2
    Session Type: BEESWAX
    Start Time: 2015-06-15 10:14:13.038578000
    End Time: 2015-06-15 10:18:30.939273000
...
    Query Timeline: 4m17s
       - Start execution: 52.752us (52.752us)
       - Planning finished: 200.34ms (199.981ms)
       - Ready to start remote fragments: 201.467ms (1.432ms)
       - Remote fragments started: 648.550ms (447.83ms)
       - Rows available: 2m31s (2m30s)
       - First row fetched: 2m31s (478.400ms)
       - Unregister query: 4m17s (1m45s)
  ImpalaServer:
     - ClientFetchWaitTimer: 1m9s
     - RowMaterializationTimer: 36s073ms
  Execution Profile 4446304fc3a13547:10fea8c6a4752287:(Total: 2m32s, non-child: 0ns, % non-child: 0.00%)
...
    Coordinator Fragment F01:(Total: 2m31s, non-child: 10.567ms, % non-child: 0.01%)
...
       - TotalCpuTime: 4m17s
       - TotalNetworkReceiveTime: 0ns
...
    Averaged Fragment F00:(Total: 3m52s, non-child: 0ns, % non-child: 0.00%)
...
       - TotalCpuTime: 2m55s
       - TotalNetworkReceiveTime: 0ns
       - TotalNetworkSendTime: 2m22s

Unsorted query:

Query Runtime Profile:
Query (id=794de9ed5ccc55ac:ab1391c885dbf2b9):
  Summary:
    Session ID: e14d21c04cf7dd00:bd027a1ab2b142b9
    Session Type: BEESWAX
    Start Time: 2015-06-15 11:21:09.472389000
    End Time: 2015-06-15 11:58:13.872735000
...
    Query Timeline: 37m4s
       - Start execution: 81.543us (81.543us)
       - Planning finished: 332.875ms (332.793ms)
       - Ready to start remote fragments: 339.182ms (6.307ms)
       - Remote fragments started: 1s386ms (1s047ms)
       - Rows available: 1s388ms (2.486ms)
       - First row fetched: 1s489ms (100.985ms)
       - Unregister query: 37m4s (37m2s)
  ImpalaServer:
     - ClientFetchWaitTimer: 1m26s
     - RowMaterializationTimer: 35m25s
  Execution Profile 794de9ed5ccc55ac:ab1391c885dbf2b9:(Total: 8s577ms, non-child: 0ns, % non-child: 0.00%)
...
    Coordinator Fragment F01:(Total: 7s439ms, non-child: 25.752ms, % non-child: 0.35%)
...
       - TotalCpuTime: 36m56s
       - TotalNetworkReceiveTime: 7s277ms
...
    Averaged Fragment F00:(Total: 22m46s, non-child: 0ns, % non-child: 0.00%)
...
       - TotalCpuTime: 21m59s
       - TotalNetworkReceiveTime: 0ns
       - TotalNetworkSendTime: 22m45s

alex.behm · ‎06-24-2015

Sorry for the wait. Here's what I think is happening.

Impala deals with inline views by substituting the select-list expressions from the inline view in the parent query block. What that means in your case, is that many of the expensive expressions inside your inline view are executed multiple times in the slow non-ORDER-BY version of your query.

For example, every reference to "setup_time" in the outer select list is replaced by the corresponding expression from the inline view, i.e.

setup_time --> case when regexp_extract(...) then hours_add(...) else setup_time_ts end setup_time

As a result, not only are those expensive expressions only executed at the coordinator, but the expensive expressions are evaluated multiple times because they are referenced multiple times in the outer select list.

In the ORDER BY version of the query, this redundant expression evaluation is avoided because the ORDER BY materialized its input, so while the same inline view expression substitutions takes place, the outer references are substituted with materialized column references (i.e., the expensive expression is only evaluated once), i.e.

setup_time --> materialized column produced by the ORDER BY

Hope this makes sense!

Alex

View solution in original post

alex.behm · ‎06-23-2015

Thank you for posting such a detalied description. Your observation regarding expression evaluation is correct: Impala evaluates the expressions lazily.

To summarize:

- In the slow version without ORDER BY, the SCAN sends the raw data to the coordinator which then evaluates all expressions including those from your inline view.

- In the fast version with ORDER BY, the expressions from the inline view are evaulated and materialized at the SORT NODE, i.e., in paralell on all nodes.

Now, you had already observed this and you asked how this can explain the 10x difference whereas you'd only expect a 3x difference based on the 3x increased paralellizm.

The answer is that in the slow version the entire query execution is CPU bound by the single coordinator node.

Impala's execution engine is streaming, so the coordinator will apply backpressure to the stream sender and in turn the SCANs, if it cannot process the rows quickly enough (which in this case it obviously cannot). So it means while the coordinator is still processing a batch rows, the SCANs will not be able to make progress (it's not quite as simple as this, but it explains the mechanichs).

alex.behm · ‎06-23-2015

Actually, since there's buffering on both sender/receiver sides I don't see how there could be a 10x difference between the queries. I believe ther is a much simpler explanation. Stay tuned for another response.

alex.behm · ‎06-24-2015

Sorry for the wait. Here's what I think is happening.

Impala deals with inline views by substituting the select-list expressions from the inline view in the parent query block. What that means in your case, is that many of the expensive expressions inside your inline view are executed multiple times in the slow non-ORDER-BY version of your query.

For example, every reference to "setup_time" in the outer select list is replaced by the corresponding expression from the inline view, i.e.

setup_time --> case when regexp_extract(...) then hours_add(...) else setup_time_ts end setup_time

As a result, not only are those expensive expressions only executed at the coordinator, but the expensive expressions are evaluated multiple times because they are referenced multiple times in the outer select list.

In the ORDER BY version of the query, this redundant expression evaluation is avoided because the ORDER BY materialized its input, so while the same inline view expression substitutions takes place, the outer references are substituted with materialized column references (i.e., the expensive expression is only evaluated once), i.e.

setup_time --> materialized column produced by the ORDER BY

Hope this makes sense!

Alex

athtsang · ‎06-25-2015

It makes sense and provides much insight on Impala operation.

Actually, I always thought Impala would materialize inline views / CTE before moving to "next step". Thanks for clarification!

I suggest Cloudera to consider discussing this in performance guide (if not yet), or optimize the materialization of intermediate data.

alex.behm · ‎06-25-2015

Makes sense.

I appreciate your thorough question, and I completely agree that we should point out this expression-substitution behavior in the performance guide.

It's not the first time it has come, and I'd imagine it will not be the last 🙂

Btw, if you really really want to get the materialization behavior with an inline view without an ORDER BY, then you can apply the following terrible hack.

Original query:

select a, b, c from (select f(x) as a, f(y) as b, f(z) as c from mytable) v

Modified query to force materialization of inline view:

select a, b, c from

(select f(x) as a, f(y) as b, f(z) as c from mytable

union all

select NULL, NULL, NULL from mytable where false) v

The "union all" will force materialization, but the second union operand will be dropped due to the "false" predicate.

Obviously, that behavior is implementation defined and subject to change any time, so it would be wise not to rely on it.

Cloudera Community

Support Questions

Performance Reduced after Removing ORDER BY clause