Here I am again with another PROD problem:
I have 6 data nodes, 3 in data centre US1 and 3 in another data centre US2, with a replication factor of 4. impalad is running on all 6 data nodes.
When I run a query, it completes in 18-20 mins. But if I shut down the US1 data centre and keep only US2 alive, it takes 5-6 mins to complete. Similarly, if I shut down US2 and keep US1 alive, it takes about the same, 5-7 mins.
At the same time, if I run the query on Hive-on-Tez with all 6 nodes up, it takes 9-10 mins to complete, which looks good to me.
Can you advise why Impala behaves this way, and what the solution is?
Note: I am running the query through JDBC and using impalad version 2.7.0-cdh5.9.0.
Hi mbigelow, thanks for the reply.
Yes, you are correct, the datanodes are split between the two DCs.
I do an Impala refresh on all 6 impalad nodes every time data is loaded, and I run incremental compute stats too. When I run the query, I don't run anything else on the cluster. When I run it through Hive, it creates the mappers locally where the data is. I saw that it created containers on only 4 nodes; the other 2 were idle.
Now, what I know about Impala is: regardless of the block's location, which is on 4 nodes, it will use all 6 nodes, as it works on an MPP-based architecture.
Now, we have a replication factor of 4, which is high for this size of cluster. What would have happened with a replication factor of 2 or 3, or on a bigger cluster, say 50 nodes with a replication factor of 5 or 10? So I don't think Impala will always get the opportunity to run locally.
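To put rough numbers on the replication question, here is a small Python sketch of what share of the nodes hold a copy of any one block. It is only back-of-the-envelope arithmetic: it assumes each replica sits on a different node and ignores rack or data-centre placement rules, and the function name is my own, not anything from Impala or HDFS.

```python
from fractions import Fraction


def local_replica_fraction(replication: int, nodes: int) -> Fraction:
    """Share of nodes that hold a copy of any one block.

    Assumption: every replica is placed on a distinct node, so a block
    can never have more copies than there are nodes.
    """
    return Fraction(min(replication, nodes), nodes)


# The cases mentioned above: the current 6-node cluster with replication
# factors 4, 3 and 2, and a hypothetical 50-node cluster with 5 and 10.
for replication, nodes in [(4, 6), (3, 6), (2, 6), (5, 50), (10, 50)]:
    share = local_replica_fraction(replication, nodes)
    print(f"replication={replication} nodes={nodes} "
          f"share of nodes with a local copy={float(share):.0%}")
```

The smaller that share, the fewer nodes can read a given block from local disk, which is the concern behind my question.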
I don't see anything amiss. I looked at the profile too, and there is enough memory and resources. What could be the reason for this problem? I am not able to figure it out.