Support Questions
Find answers, ask questions, and share your expertise

impala taking much time on cross data centre impala cluster setup



Here i am again into another PROD problem :

I have 6 data nodes,3 in data centre US1 and 3 in another data centre US2 with 4 data replication. imapald running on all the 6 data nodes.

when i run a query it get completes in 18-20 mins. but if if i shutdown US1 data centre and US2 data centre only alive then it takes 5-6 mins to complete,similarly if i shut down the data centre US2 and keep US1 datacentre alive then it takes same 5-7 mins.


but at same  time if i run the query on Hive-on-Tez  having all 6 nodes up,it takes 9-10 mins to complete which is looks good to me.

Can you advise why such behaviour of impala ? and what is the solution ?


note : I am running query through jdbc and using  impalad version 2.7.0-cdh5.9.0





To be clear, you have one cluster consisting of 6 datanodes that are split between two DCs. Do I have that correct?

My short answer is, don't do that if you are (caveat is if you are using WanDisco then go ahead).

You have a replication factor of four to ensure that at least one replica ends up on a DN in the other DC. This works for most as they are in YARN and it is likely that YARN has enough room to run everything locally or is better at achieving data locality out of the gate.

I can't be certain but you should be able to check the Hive jobs to see how many mappers are running locally. Impala is more difficult to check if it is running locally. You could sift through the logs to see all of the daemons that handles portions of the query. There may be some CM charts to help as well but I am not certain.

A reminder to run 'refresh' on any tables were new data has been added outside of Impala or items have been replicated (this may have happened when you shutdown US1 depending on how long it had it down and the minimum replication factor).

In summary, if Impala doesn't know the location of the blocks before hand then it has a higher risk of not running locally; if it is not running locally it is going to run much slower. There is the obvious added time of streaming the data over the network (Hive would have this too, though) but a decent chunk of Impala's performance boost comes from being able to access the data directly (meaning it doesn't even talk to HDFS).

One last time, please don't run a cluster like this. While it can work it is very hacky. A replication factor of 4 does get one replica to the second DC but then you are limited in the number of nodes or racks you have in any given DC or you have to keep increasing the replication factor to keep up which creates huge storage overhead. There is just not enough control over how HDFS writes the replicas out to date.









; ;