Created on 08-14-2016 11:59 PM - edited 09-16-2022 03:34 AM
Hi
We have a single query consists of 253 plan fragments on a 43 clusters. We encountered an issue saying that "couldn't get a client for cdh-datanode-010.xxxxx.storage:22000" in the middle of the execution. I'm wondering is this because of the dedicated tcp connections required by each channel? The query consists of 212 HDFS SCAN NODE on each impalad node. Each of them is broadcasting/shuffling data to other 42 nodes, which requires, I think, 42 channels/data stream sender/scan node/per server. If each of them requires a tcp connection, then it would be 377496 connections all together, is this correct???
If this is the case, would you have any optimization suggestion to this query?
We only have a partial profile for this query as it stops in the middle of execution
Any comments and suggestion will be appreciated.
Thanks
We are using Impala 2.3
Created 08-15-2016 09:54 AM
This is a known issue that we're actively working on: https://issues.cloudera.org/browse/IMPALA-2567
Your analysis is accurate. Part of the problem is the number of connections and the other part is the # of threads per connection. You may be able to change some operating system config settings to increase limits here (depending on which limit you're hitting).
In order to reduce the # of tcp conncetions required you would either need to reduce the number of fragments or reduce the number of node executing the query.
You could reduce the # of fragments by breaking up the query into smaller queries. E.g. creating temporary tables with the results of some of the subqueries.
You could also try executing the query on a single node by setting num_nodes=1 if the data size is small enough that this makes sense. I suspect your query is too large for that to work, but it's hard to tell (that's a huge query plan!)
Created 08-15-2016 09:54 AM
This is a known issue that we're actively working on: https://issues.cloudera.org/browse/IMPALA-2567
Your analysis is accurate. Part of the problem is the number of connections and the other part is the # of threads per connection. You may be able to change some operating system config settings to increase limits here (depending on which limit you're hitting).
In order to reduce the # of tcp conncetions required you would either need to reduce the number of fragments or reduce the number of node executing the query.
You could reduce the # of fragments by breaking up the query into smaller queries. E.g. creating temporary tables with the results of some of the subqueries.
You could also try executing the query on a single node by setting num_nodes=1 if the data size is small enough that this makes sense. I suspect your query is too large for that to work, but it's hard to tell (that's a huge query plan!)