Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Impala predicate pushdown issues with unix_timestamp()

Highlighted

Impala predicate pushdown issues with unix_timestamp()

Explorer

We are running into an issue with Impala not doing predicate pushdown efficiently with the unix_timestamp() function.

 

We have a table with about 300M rows and a timestamp column that is in epoch time * 1000. Because of the way our tool generates queries, we get a query something like this:

 

SELECT COUNT(*)
FROM table1 AS raw_tracking
WHERE
((cast(raw_tracking.`timestamp`/1000 as bigint)) >= (unix_timestamp('2015-05-01'))

AND

(cast(raw_tracking.`timestamp`/1000 as bigint)) < (unix_timestamp('2015-06-01')))

 

This takes about 50s.

 

If I plug in the actual values for unix_timestamp() like this:

 

SELECT COUNT(*)
FROM table1 AS raw_tracking
WHERE
((cast(raw_tracking.`timestamp`/1000 as bigint)) >= 1430438400

AND

(cast(raw_tracking.`timestamp`/1000 as bigint)) < (1433116800))

 

then the query takes about 2.1s.

 

Here are times for two other permutations of the same query:

 

SELECT COUNT(*)
FROM table1 AS raw_tracking
WHERE
raw_tracking.`timestamp` >= 1430438400000 AND raw_tracking.`timestamp` < 1433116800000

 

2.0s

 

SELECT COUNT(*)
FROM table1 AS raw_tracking
WHERE
raw_tracking.`timestamp` >= unix_timestamp('2015-05-01')*1000 AND raw_tracking.`timestamp` < unix_timestamp('2015-06-01')*1000

 

50s

 

Is anyone else having issues with queries like these, and is there a workaround? Are there other functions to avoid?

 

4 REPLIES 4
Highlighted

Re: Impala predicate pushdown issues with unix_timestamp()

Master Collaborator

Hi Todd,

 

is raw_tracking.`timestamp` a partition column or a regular column?

Can you paste the EXPLAIN plans for the fast and slow versions?

Thanks!

 

Alex

Re: Impala predicate pushdown issues with unix_timestamp()

Explorer

Thanks for your response, Alex. The EXPLAIN is actually the same for both, but I'll include them here.

 

I was also able to get a copy of the output for PROFILE, but they are too large to include. I'll get them to you via Jeff B.

 

The raw_tracking.`timestamp` column is a BIGINT column in epoch*1000. It's not partitioned.

 

There are a few extra points that I wanted to add:

 

1. This appears to be a regression introduced in 2.2.0. I'm trying to figure out what version Impala was upgraded from.

 

2. We found a work around using CTEs like this:

 

WITH
bounds AS
  (SELECT
  unix_timestamp('2015-05-01')*1000 AS lower,
  unix_timestamp('2015-06-01')*1000 AS upper)
SELECT COUNT(*)
FROM raw_tracking_mirror AS raw_tracking
CROSS JOIN bounds
WHERE
raw_tracking.`timestamp` >= bounds.lower AND raw_tracking.`timestamp` < bounds.upper

 

 

Here are the queries and EXPLAIN outputs:

 

=== query 1 (slow) ===

SELECT COUNT(*)
FROM table1 AS raw_tracking
WHERE
raw_tracking.`timestamp` >= unix_timestamp('2015-05-01')*1000 AND raw_tracking.`timestamp` < unix_timestamp('2015-06-01')*1000

+-----------+
| count(*) |
+-----------+
| 326006739 |
+-----------+
Fetched 1 row(s) in 444.91s

 

=== query 2 (fast) ===

SELECT COUNT(*)
FROM table1 AS raw_tracking
WHERE
raw_tracking.`timestamp` >= 1430438400000 AND raw_tracking.`timestamp` < 1433116800000
+-----------+
| count(*) |
+-----------+
| 326006739 |
+-----------+
Fetched 1 row(s) in 9.92s

 

=== EXPLAIN for query 1 ===

Estimated Per-Host Requirements: Memory=98.00MB VCores=1

 

F01:PLAN FRAGMENT [UNPARTITIONED]

  03:AGGREGATE [FINALIZE]

  |  output: count:merge(*)

  |  hosts=6 per-host-mem=unavailable

  |  tuple-ids=1 row-size=8B cardinality=1

  |

  02:EXCHANGE [UNPARTITIONED]

     hosts=6 per-host-mem=unavailable

     tuple-ids=1 row-size=8B cardinality=1

 

F00:PLAN FRAGMENT [RANDOM]

  DATASTREAM SINK [FRAGMENT=F01, EXCHANGE=02, UNPARTITIONED]

  01:AGGREGATE

  |  output: count(*)

  |  hosts=6 per-host-mem=10.00MB

  |  tuple-ids=1 row-size=8B cardinality=1

  |

  00:SCAN HDFS [default.raw_tracking_mirror raw_tracking, RANDOM]

     partitions=1/1 files=134 size=40.78GB

     predicates: raw_tracking.`timestamp` >= unix_timestamp('2015-05-01') * 1000, raw_tracking.`timestamp` < unix_timestamp('2015-06-01') * 1000

     table stats: 341732584 rows total

     column stats: all

     hosts=6 per-host-mem=88.00MB

     tuple-ids=0 row-size=8B cardinality=3417326

 

 

=== EXPLAIN for query 2 ===

Estimated Per-Host Requirements: Memory=98.00MB VCores=1

 

F01:PLAN FRAGMENT [UNPARTITIONED]

  03:AGGREGATE [FINALIZE]

  |  output: count:merge(*)

  |  hosts=6 per-host-mem=unavailable

  |  tuple-ids=1 row-size=8B cardinality=1

  |

  02:EXCHANGE [UNPARTITIONED]

     hosts=6 per-host-mem=unavailable

     tuple-ids=1 row-size=8B cardinality=1

 

F00:PLAN FRAGMENT [RANDOM]

  DATASTREAM SINK [FRAGMENT=F01, EXCHANGE=02, UNPARTITIONED]

  01:AGGREGATE

  |  output: count(*)

  |  hosts=6 per-host-mem=10.00MB

  |  tuple-ids=1 row-size=8B cardinality=1

  |

  00:SCAN HDFS [default.raw_tracking_mirror raw_tracking, RANDOM]

     partitions=1/1 files=134 size=40.78GB

     predicates: raw_tracking.`timestamp` >= 1430438400000, raw_tracking.`timestamp` < 1433116800000

     table stats: 341732584 rows total

     column stats: all

     hosts=6 per-host-mem=88.00MB

     tuple-ids=0 row-size=8B cardinality=3417326

 

Highlighted

Re: Impala predicate pushdown issues with unix_timestamp()

Explorer

I verified that Impala was upgraded from Impala Version 2.1.0 / CDH 5.3.0 to Impala Version 2.2.0 / CDH 5.4.0 if that helps narrow down the regression.

Highlighted

Re: Impala predicate pushdown issues with unix_timestamp()

Master Collaborator

Thanks! That is indeed helpful. We'll need some time to dig into this.

Don't have an account?
Coming from Hortonworks? Activate your account here