How new data can be accessed/written across multiple nodes

Explorer

Hello team,

From our web application, whenever we fetch data from Impala, the query appears to be served from one particular node. The cluster has multiple data nodes, HAProxy acts as a load balancer, and the replication factor is set to 3. Is there any configuration in Impala that would allow data to be fetched from multiple nodes?

Thanks and regards,
Sayak

Re: How new data can be accessed/written across multiple nodes

Master Collaborator
I think you are mixing up data distribution and Impala query orchestration. The first is about where the data lives: if the table is large enough, its blocks will probably be spread across the cluster, so every DataNode (assuming each is co-located with an Impala daemon) holds part of the data for the query. The second is about how the query runs: one Impala daemon acts as the coordinator (the master for that query) and distributes the work to the Impala daemons on the nodes where the data is. The final fetch, sort, and merge are done on this coordinator node. So I think the answer is:
-> The Impala coordinator daemon (the one you connect to) is very likely already using all the other Impala daemon nodes.
-> If you are 100% sure that the processing is local, it is because all the data is located on that particular Impala daemon's node, so no parallel execution is needed.

You should check the daemon's web UI: on the page for running queries you can see how the work is distributed across the cluster.
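To make the difference between data distribution and query orchestration concrete, here is a minimal toy sketch in Python. It is not Impala's actual scheduler: the function names (place_blocks, plan_query), the node names, and the simple placement and least-loaded rules are all assumptions made for illustration. It only shows the general shape of the idea, where the client talks to one coordinator while the scan work lands on the nodes that hold the blocks.

```python
def place_blocks(num_blocks, datanodes, replication=3):
    """Toy placement: put each block's replicas on consecutive nodes.

    This only mimics the idea of a replication factor; real HDFS
    placement is rack-aware and more involved.
    """
    copies = min(replication, len(datanodes))
    return {
        block: [datanodes[(block + r) % len(datanodes)] for r in range(copies)]
        for block in range(num_blocks)
    }


def plan_query(coordinator, placement):
    """Toy plan: assign each block to the least-loaded node holding a replica."""
    nodes = sorted({n for replicas in placement.values() for n in replicas})
    load = {n: [] for n in nodes}
    for block in sorted(placement):
        replicas = placement[block]
        # Ties are broken by node name so the result is deterministic.
        target = min(replicas, key=lambda n: (len(load[n]), n))
        load[target].append(block)
    # Results from all nodes are merged on the coordinator.
    return {"coordinator": coordinator, "scan_assignments": load}


if __name__ == "__main__":
    nodes = ["node1", "node2", "node3", "node4"]  # hypothetical host names

    # A table with many blocks: every node ends up with scan work.
    print(plan_query("node1", place_blocks(8, nodes)))

    # A tiny single-block table: only one node has anything to scan.
    print(plan_query("node1", place_blocks(1, nodes)))
```

In both runs the client only ever sees node1, the coordinator, which is why the application can look as if it reads from a single node even when the scan is spread across the cluster.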

Re: How new data can be accessed/written across multiple nodes

Explorer
Thanks for the detailed explanation.

When the query runs, you say it picks one Impala daemon. Can the choice of that daemon be managed?
An Impala daemons load balancer is configured.

Re: How new data can be accessed/written across multiple nodes

Contributor

When a load balancer is configured, it is the load balancer that decides which Impala daemon the query is submitted to; impala-shell has no control over the choice of coordinator.

If you are not using a load balancer, then running impala-shell -i <impalad_hostname> submits the query to the impalad running on <impalad_hostname>, and that host acts as the coordinator for the query.
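The two cases above can be sketched in Python. This is an illustration under stated assumptions, not a description of how HAProxy is configured on this cluster: the host names are hypothetical, port 21000 is assumed as the impala-shell port, and plain round-robin stands in for whatever balancing policy the real load balancer uses. Only the -i and -q options of impala-shell are used.

```python
import itertools
import shlex


def impala_shell_command(host, query, port=21000):
    """Build an impala-shell argument list that targets one endpoint.

    The endpoint may be a specific impalad (which then becomes the
    coordinator) or a load balancer front end. Port 21000 is an assumption.
    """
    return ["impala-shell", "-i", f"{host}:{port}", "-q", query]


class RoundRobinBalancer:
    """Toy stand-in for a load balancer: it, not the client, picks the daemon."""

    def __init__(self, backends):
        self._cycle = itertools.cycle(list(backends))

    def pick(self):
        return next(self._cycle)


if __name__ == "__main__":
    # Direct connection: the named host coordinates the query.
    direct = impala_shell_command("impalad-2.example.com", "SELECT 1")
    print(" ".join(shlex.quote(arg) for arg in direct))

    # Through a balancer: the client names only the front end, and the
    # balancer chooses a different backend for successive connections.
    lb = RoundRobinBalancer(["impalad-1", "impalad-2", "impalad-3"])
    print([lb.pick() for _ in range(4)])
```

When the client goes through the balancer, it passes the balancer's address to -i, so it has no way to say which backend should coordinate.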

Re: How new data can be accessed/written across multiple nodes

Guru
Some extra information you might be interested in: from CDH 5.12 onwards, Impala supports dedicated coordinators, so you can set up an Impala daemon to do only the coordinator job and no query processing.

https://www.cloudera.com/documentation/enterprise/5-12-x/topics/impala_dedicated_coordinator.html

But as Venkat mentioned, once the daemons are behind a load balancer, you have no control over which one is picked.
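As a rough sketch of what a dedicated-coordinator layout might look like, the Python below models each daemon's role and derives two things from it: the role-related impalad startup flags (is_coordinator and is_executor) and the list of hosts a load balancer would point at. The host names and the helper names are hypothetical, and the exact flag spelling should be checked against the linked documentation for your release.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImpalaDaemon:
    host: str
    is_coordinator: bool = True  # accepts client connections, plans queries
    is_executor: bool = True     # scans data and runs query fragments

    def startup_flags(self):
        """Role flags for this daemon, rendered as command-line options."""
        return [
            f"--is_coordinator={str(self.is_coordinator).lower()}",
            f"--is_executor={str(self.is_executor).lower()}",
        ]


def load_balancer_backends(daemons):
    """Only coordinators accept client queries, so only they go behind the LB."""
    return [d.host for d in daemons if d.is_coordinator]


if __name__ == "__main__":
    cluster = [
        ImpalaDaemon("coord-1", is_coordinator=True, is_executor=False),
        ImpalaDaemon("coord-2", is_coordinator=True, is_executor=False),
        ImpalaDaemon("worker-1", is_coordinator=False, is_executor=True),
        ImpalaDaemon("worker-2", is_coordinator=False, is_executor=True),
    ]
    for daemon in cluster:
        print(daemon.host, " ".join(daemon.startup_flags()))
    print("LB backends:", load_balancer_backends(cluster))
```

With this split, the load balancer still chooses which coordinator receives a query, but the choice is limited to the daemons set aside for coordination.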