We have been having an issue with Impala in our environment: connections between Impala daemons are sometimes refused, which throws an error and causes queries to fail.
This has been happening intermittently in the cluster, at random times during the day.
Actual Error Message:
1) Couldn't open transport for <node_address>:22000 (connect() failed: Connection refused)
Failed to create thread SenderThread(1:1) in category DataStreamSender:boost::thread_resource_error: Resource temporarily unavailable
Sender timed out waiting for receiver fragment instance: <query_id>, dest node
Actually, I can see two different types of error in your error messages. "Connection refused" usually means that port 22000 was not open on the peer node. It would be worth checking whether the Impala daemon on the peer node (<node_address>) was stopped at that time. The "Resource temporarily unavailable" error is most likely related to thread resource limits: the Impala daemon could not create a thread because of insufficient resources, so it threw this error. I suggest having a look at IMPALA-5605, which should be helpful.
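To tell a stopped daemon apart from a transient refusal, you can probe the port from another node while the problem is occurring. The following is only a minimal Python sketch: the helper name can_connect and the 3-second timeout are illustrative assumptions, not part of Impala. A successful connect only shows that something is listening on the port, not that the daemon is healthy.

```python
import socket


def can_connect(host, port=22000, timeout=3.0):
    """Try a plain TCP connection to host:port.

    Returns (True, None) on success, or (False, error_text) if the
    connection is refused, times out, or the host cannot be resolved.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, None
    except OSError as exc:
        return False, str(exc)


if __name__ == "__main__":
    # Replace the placeholder with the peer node from the error message.
    ok, err = can_connect("<node_address>")
    print("port open" if ok else f"connect failed: {err}")
```

Running this in a loop during the failure window would show whether the refusals line up with the daemon being down or only with periods of high load.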
Thanks for the reply.
I will check that.
Also, can you check the following errors and let me know why jobs are getting cancelled?
The first error means that connect() failed because the peer Impala daemon did not accept the connection in time. You can look at the charts in CM (Cloudera Manager) to check the CPU usage and the number of threads in the Impala daemon on the peer node at that time. Like "Resource temporarily unavailable", this error could also be related to CPU load or thread resource limits.
The second error means the connection was lost. You can review the Impala daemon logs on p1i-hdp-srv07.lnt.com to find the reason.
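If the CM charts are not detailed enough, the thread count and the limit can also be read directly on the peer node. This is a rough sketch that assumes a Linux host with /proc mounted and that you already know the impalad process ID; the function names are hypothetical. Reading another user's /proc/<pid>/limits normally requires running as that user or as root.

```python
def parse_thread_count(status_text):
    """Return the integer 'Threads:' value from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split(":", 1)[1].strip())
    return None


def parse_max_processes(limits_text):
    """Return (soft, hard) from the 'Max processes' row of /proc/<pid>/limits.

    Values are kept as strings because either one may be 'unlimited'.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max processes"):
            fields = line.split()
            return fields[2], fields[3]
    return None


def thread_report(pid):
    """Read the current thread count and the process limit for one PID."""
    with open(f"/proc/{pid}/status") as fh:
        threads = parse_thread_count(fh.read())
    with open(f"/proc/{pid}/limits") as fh:
        limit = parse_max_processes(fh.read())
    return {"threads": threads, "max_processes": limit}
```

On Linux the "Max processes" limit counts threads as well as processes for the user, so a thread count close to the soft limit would be consistent with the "Resource temporarily unavailable" error.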
Thank you again.
I also see jobs getting cancelled. Can you give me any reason?
You are welcome.
The query was cancelled because of an exception, but your query info contains no details of that exception. You can download the text query profile from CM. If the query profile still does not show the details, grep for the query ID 3348f74b129b0dae:1666447600000000 in the Impala INFO log files on p1i-hdp-srv11.lnt.com. You should be able to see which query instance hit the exception. Then grep for that instance ID in the Impala INFO log files on the host where the instance was running to find the cause.
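If the log files are large or rotated into several files, the same search can be scripted. The sketch below is just a plain substring search in Python, equivalent to grep; the function name grep_id is hypothetical, and the log path in the usage part is an assumption that you should replace with the log directory configured for your Impala daemons.

```python
import glob


def grep_id(file_pattern, needle):
    """Yield (path, line_number, line) for every line containing needle.

    file_pattern is a glob such as '<log_dir>/impalad.INFO*' so that
    rotated log files are searched as well.
    """
    for path in sorted(glob.glob(file_pattern)):
        # errors="replace" avoids failing on stray non-UTF-8 bytes in logs.
        with open(path, errors="replace") as fh:
            for lineno, line in enumerate(fh, 1):
                if needle in line:
                    yield path, lineno, line.rstrip("\n")


if __name__ == "__main__":
    # Assumed log location; adjust to your cluster's configuration.
    pattern = "/var/log/impalad/impalad.INFO*"
    query_id = "3348f74b129b0dae:1666447600000000"
    for path, lineno, line in grep_id(pattern, query_id):
        print(f"{path}:{lineno}: {line}")
```

Run it first with the query ID on the coordinator host, then again with the instance ID on the host where that instance was running.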