Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Impala performance degradation after upgrading from CDH 5.4.5 to 5.5.1

Highlighted

Re: Impala performance degradation after upgrading from CDH 5.4.5 to 5.5.1

Expert Contributor
Tim,

Great.!
I will upgrade to cdh 5.5.1 on my staging and test if there is any major performance issue.

Thank you
Gatsby

Re: Impala performance degradation after upgrading from CDH 5.4.5 to 5.5.1

Master Collaborator

You can rewrite your query in a way that won't evaluate the time for every row by putting it in a subquery.

 
SELECT 
    user_id, B.house_id, style, category_id
FROM
    c2.buzz_p B,
    c2.houses_p H,
(select 
    myTimeStamp
from
    (select from_unixtime(unix_timestamp(now() - interval 365 days), 'yyyy-MM-dd HH:mm:ss.S') as myTimeStamp) a) TimeSource
WHERE
    B.house_id = H.house_id
        AND B.created > myTimeStamp

Re: Impala performance degradation after upgrading from CDH 5.4.5 to 5.5.1

Explorer

Interesting finding.  So is now() the killer or from_unixtimestamp?  If it's now() then in theory we can inject 'now' in the query before running it... in any case we probably prefer having a consistant time being used there across all partitions of the query.

 

regarding it being a regression, I'm guessing the planner just planned this a little different?  We're still working on setting up a true testing environment where we can hopefully repro this.

 

-Sean

Re: Impala performance degradation after upgrading from CDH 5.4.5 to 5.5.1

Master Collaborator

It looked like unixtimestamp() executes a system call to get the timezone. It doesn't look like this system call changed from 5.4 to 5.5.


I gather you're running queries with the same condition in the where clause under 5.4? It'd be interesting to see the profile to see if there's any difference in the plan. Generally it's an optimisation to push conditions to the scan but this is an exception.

 

There may be something subtle like the function being treated as a constant expression in one version or not the other.

Re: Impala performance degradation after upgrading from CDH 5.4.5 to 5.5.1

New Contributor

Here's the profile comparison for the same query:

5.4: https://gist.github.com/birdychang/923af04c5db893efc555

5.5: https://gist.github.com/sobrien05/4cdeeb85d314d08d4e2a

 

It seems like the major difference is in MaterializeTupleTime. However, I don't understand why the HDFS_SCAN_NODE TotalTime in 5.5 is much lower.

 

Also, I don't think unixtimestamp() is causing the slowness since this was happening for simple table joins as well. 

Re: Impala performance degradation after upgrading from CDH 5.4.5 to 5.5.1

New Contributor
Yes, we should compute the stats for that table. But I don't think it's the root cause.
Don't have an account?
Coming from Hortonworks? Activate your account here