Explorer
Posts: 14
Registered: ‎01-13-2017

How new data can be accessed/written across multiple nodes

Hello team, Whenever our web application fetches data from Impala, the data always seems to come from one particular node. The cluster has multiple data nodes, HAProxy acts as the load balancer, and the replication factor is set to 3. Is there any Impala configuration that would allow data to be fetched from multiple nodes? Thanks and regards, Sayak

Master
Posts: 377
Registered: ‎07-01-2015

Re: How new data can be accessed/written across multiple nodes

I think you are mixing up data distribution and Impala query orchestration. First, the distribution of the data: if the table is large enough, it will probably be spread across the cluster, so each datanode (if it is co-located with an Impala daemon) holds part of the data for the query. Second, when the query runs (orchestration), one Impala daemon acts as the coordinator and distributes the work across the Impala nodes where the data resides. The final fetch, sort and merge are done on this coordinator node. So I think the answer is:
-> Your Impala coordinator daemon (the one you connect to) is very likely already using the other Impala daemon nodes.
-> If you are 100% sure that the processing is local, then it is because all the data is located on that particular Impala daemon's node, so no parallel execution is needed.

You should check the running queries in the daemon's web UI, where you can see how the work is distributed across the cluster.
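
As a rough illustration of the scheduling idea described above, here is a toy Python model. It is not Impala's actual scheduler: the function name assign_scan_ranges, the block names and the node names are all made up for the example, and the "least-loaded replica" rule is just an assumption to keep the sketch simple.

```python
def assign_scan_ranges(block_replicas, coordinator):
    """Toy model: hand each block to the least-loaded daemon holding a replica.

    block_replicas maps a (hypothetical) block name to the list of nodes
    that store a replica of it. The coordinator only plans and merges here.
    """
    load = {}   # node -> number of blocks assigned so far
    plan = {}   # block -> node chosen to scan it
    for block, hosts in block_replicas.items():
        # Prefer the replica holder with the fewest assigned blocks;
        # break ties by name so the result is deterministic.
        chosen = min(hosts, key=lambda h: (load.get(h, 0), h))
        plan[block] = chosen
        load[chosen] = load.get(chosen, 0) + 1
    return {
        "coordinator": coordinator,
        "plan": plan,
        "hosts_used": sorted(set(plan.values())),
    }


# A larger table: four blocks, replication factor 3, spread over four nodes.
big_table = {
    "block-0": ["node1", "node2", "node3"],
    "block-1": ["node2", "node3", "node4"],
    "block-2": ["node1", "node3", "node4"],
    "block-3": ["node1", "node2", "node4"],
}
# A tiny table: a single block, so only one node has any work to do.
small_table = {"block-0": ["node1", "node2", "node3"]}

big_result = assign_scan_ranges(big_table, coordinator="node1")
small_result = assign_scan_ranges(small_table, coordinator="node1")
print(big_result["hosts_used"])
print(small_result["hosts_used"])
```

In this model the larger table is scanned on all four nodes even though the client only talks to node1, while the single-block table is processed on one node only, which matches the two cases listed above.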
Explorer
Posts: 14
Registered: ‎01-13-2017

Re: How new data can be accessed/written across multiple nodes

Thanks for the detailed explanation.

When the query runs, you say it picks one Impala daemon. Can the choice of Impala daemon be controlled?
A load balancer for the Impala daemons is configured.
Cloudera Employee
Posts: 33
Registered: ‎12-11-2015

Re: How new data can be accessed/written across multiple nodes

When a load balancer is configured, the load balancer decides which impalad the query is submitted to, and impala-shell has no control over choosing the coordinator.

 

If you are not using a load balancer, then running impala-shell -i <impalad_hostname> submits the query to the impalad running on <impalad_hostname>. That host acts as the coordinator for the query.
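
To make the difference concrete, here is a minimal Python sketch of how a load balancer might hand out coordinators. It assumes a plain round-robin policy; the real choice depends on how the balancer is configured. The class name RoundRobinBalancer and the impalad-N hostnames are hypothetical, and 21000 is the default impala-shell port.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Toy stand-in for a load balancer in front of several impalad hosts."""

    def __init__(self, backends):
        if not backends:
            raise ValueError("at least one backend is required")
        self._backends = cycle(backends)

    def pick_coordinator(self):
        # The client never names a host; the balancer decides which impalad
        # receives the connection and therefore coordinates the query.
        return next(self._backends)


# Hypothetical impalad hosts behind the balancer.
lb = RoundRobinBalancer(["impalad-1:21000", "impalad-2:21000", "impalad-3:21000"])

# Five consecutive client connections each get the next backend in turn.
coordinators = [lb.pick_coordinator() for _ in range(5)]
print(coordinators)
```

Without the balancer, the equivalent of pick_coordinator is simply whatever host you pass to impala-shell -i.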
