You can rewrite your query in a way that won't evaluate the time for every row by putting it in a subquery.
SELECT user_id, B.house_id, style, category_id FROM c2.buzz_p B, c2.houses_p H, (select myTimeStamp from (select from_unixtime(unix_timestamp(now() - interval 365 days), 'yyyy-MM-dd HH:mm:ss.S') as myTimeStamp) a) TimeSource WHERE B.house_id = H.house_id AND B.created > myTimeStamp
Interesting finding. So is now() the killer or from_unixtimestamp? If it's now() then in theory we can inject 'now' in the query before running it... in any case we probably prefer having a consistant time being used there across all partitions of the query.
regarding it being a regression, I'm guessing the planner just planned this a little different? We're still working on setting up a true testing environment where we can hopefully repro this.
It looked like unixtimestamp() executes a system call to get the timezone. It doesn't look like this system call changed from 5.4 to 5.5.
I gather you're running queries with the same condition in the where clause under 5.4? It'd be interesting to see the profile to see if there's any difference in the plan. Generally it's an optimisation to push conditions to the scan but this is an exception.
There may be something subtle like the function being treated as a constant expression in one version or not the other.
Here's the profile comparison for the same query:
It seems like the major difference is in MaterializeTupleTime. However, I don't understand why the HDFS_SCAN_NODE TotalTime in 5.5 is much lower.
Also, I don't think unixtimestamp() is causing the slowness since this was happening for simple table joins as well.