I have a env called ENV-a with 50 GB RAM allocated to impala and i have another env called ENV-b with 92 GB RAM allocated to impala.both have 8 core cpus. impala is installed on 3 nodes on each env.
i am running same query on both ,ENV-a is taking 2 mins to run and consuming 3 GB on each nodes, whereas ENV-b is taking 22 mins to run and taking 70GB memory on 2 machines and 2 GB on 3rd Machine.
after seeing the profile i figured out that ENV-b is using most of the times either 1 nodes or 2 nodes but ENV-a is using all the three nodes. so i guess that may be the problem.
I have done all compute stats and invalidation of the tables but no impact.
what could be the solution ?
48:HASH JOIN 2 31.973ms 38.336ms 83.47K 66.69K 24.24 MB 3.51 MB INNER JOIN, BROADCAST
This is from Env-a, why there is so much diffrence,see the memory consumption,records and time.
Can there be diffrence in planning of query if impala version are diffrent ? becuase in Env-a i am using impala version 2.1.1 which is faster and in Env-b i am using 2.7.0 which is taking 25 mins to complete and taking 70 GB memory on each machines where as estimates says 500 mb memory required.
It seems possible that the query plan was different in both versions, but hard to know without seeing the plans.
It's hard to speculate what the difference could be since there's a big different between those versions. We've seen cases in the past where a new join inversion rule in the planner (https://issues.apache.org/jira/browse/IMPALA-3828) caused some queries to run slower - e.g. https://issues.apache.org/jira/browse/IMPALA-4040 https://issues.apache.org/jira/browse/IMPALA-5612 https://issues.apache.org/jira/browse/IMPALA-4174 . It did improve a lot of queries but there were cases where the previous heuristic worked better for specific queries.