Impala jobs fail

Explorer

We have been having an issue with Impala in our environment: connections between Impala daemons are sometimes refused, which throws an error and causes the queries to fail.
The issue occurs intermittently in the cluster, at random times during the day.

Actual error messages:

1) Couldn't open transport for <node_address>:22000 (connect() failed: Connection refused)
Failed to create thread SenderThread(1:1) in category DataStreamSender:boost::thread_resource_error: Resource temporarily unavailable
Sender timed out waiting for receiver fragment instance: <query_id>, dest node


Expert Contributor

I can see two different types of error in your messages. "Connection refused" usually means that port 22000 was not open on the peer node, so I'd first check whether the Impala daemon on the peer node (<node_address>) had stopped at that time. The "Resource temporarily unavailable" error is most likely related to thread resource limits: the Impala daemon could not create a thread because of insufficient resources, so it threw this error [1]. I suggest having a look at IMPALA-5605, which should be helpful.

[1] https://github.com/apache/impala/blob/53ef115e8e5cac231ef948f8670106c348d197fe/be/src/util/thread.cc...
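
To compare a daemon's thread usage against its limit, something like the following could be run on the affected host. It is only a rough sketch, assuming a Linux host where /proc is readable for the impalad process; the function names are made up for illustration. Note that the "Max processes" limit applies to all threads owned by the user, not only to this process, and kernel-wide settings such as kernel.threads-max or kernel.pid_max can also cause the same error.

```python
import os
import re


def parse_max_processes(limits_text):
    """Return (soft, hard) 'Max processes' limits from /proc/<pid>/limits text.

    Each value is an int, or None when the kernel reports 'unlimited'.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max processes"):
            # Columns are: name, soft limit, hard limit, units
            soft, hard = re.split(r"\s{2,}", line.strip())[1:3]
            return tuple(None if v == "unlimited" else int(v) for v in (soft, hard))
    raise ValueError("no 'Max processes' line found")


def thread_count(pid):
    """Count the threads of a process by listing /proc/<pid>/task (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/task"))


def report(pid):
    """Summarise the current thread count and the nproc limits for one process."""
    with open(f"/proc/{pid}/limits") as f:
        soft, hard = parse_max_processes(f.read())
    return {"threads": thread_count(pid), "soft_limit": soft, "hard_limit": hard}
```

Calling `report(pid)` with the impalad process ID around the time of the failures should show whether the thread count is getting close to the soft limit.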

Explorer

Thanks for the reply. I will check that.

Also, can you check the following errors and let me know why the jobs are getting cancelled?

  • Query Status: ExecQueryFInstances rpc query_id=2e4fe80a7382061f:3ef80d1500000000 failed: RPC client failed to connect: Couldn't open transport for p1i-hdp-srv06.lnt.com:22000 (connect() failed: Connection timed out)
  • RPC Error: Client for p1i-hdp-srv07.lnt.com:22000 hit an unexpected exception: No more data to read., type: N6apache6thrift9transport19TTransportExceptionE, rpc: N6impala19TTransmitDataResultE, send: done

Expert Contributor

The first error means that connect() failed because the peer Impala daemon did not accept the connection in time. You can look at the charts in CM to check the CPU usage and the number of threads in the Impala daemon on the peer node at that time. Like "Resource temporarily unavailable", this error could be related to CPU load or thread resource limits.

The second error means the connection was lost. You can review the Impala daemon logs on p1i-hdp-srv07.lnt.com to look for the reason.
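
If you want to tell "Connection refused" apart from "Connection timed out" from another node while the problem is happening, a small probe along these lines may help. This is a minimal sketch using only the Python standard library; the function name and the 3 second timeout are arbitrary choices, not anything Impala provides.

```python
import socket


def probe_port(host, port, timeout=3.0):
    """Try one TCP connection and classify the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on this port.
        return "refused"
    except socket.timeout:
        # No answer in time: overloaded peer, full accept backlog, or a network drop.
        return "timed out"
    except OSError as exc:
        # Anything else, e.g. a name resolution failure or an unreachable network.
        return f"error: {exc}"
```

For example, `probe_port("p1i-hdp-srv06.lnt.com", 22000)` run in a loop from the coordinator host would show whether the peer is refusing connections or just slow to accept them.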

 

Explorer

Thank you again.

I also see jobs getting cancelled. Can you tell me why?

  • Query ID: 3348f74b129b0dae:1666447600000000
  • User: pslc_mnr_bu
  • Database: default
  • Coordinator: p1i-hdp-srv11.lnt.com
  • Query Type: QUERY
  • Query State: EXCEPTION
  • Start Time: Nov 14, 2019 10:12:22 AM
  • End Time: Nov 14, 2019 10:13:02 AM
  • Duration: 39.6s
  • Rows Produced: 0
  • Admission Result: Admitted immediately
  • Admission Wait Time: 0ms
  • Aggregate Peak Memory Usage: 260.5 MiB
  • Bytes Streamed: 46.9 MiB
  • Client Fetch Wait Time: 0ms
  • Client Fetch Wait Time Percentage: 0
  • Connected User: psvc_mpr_bi
  • Estimated per Node Peak Memory: 2.3 GiB
  • File Formats: PARQUET/SNAPPY
  • HDFS Average Scan Range: 356.3 KiB
  • HDFS Bytes Read: 2.0 GiB
  • HDFS Bytes Read From Cache: 0 B
  • HDFS Bytes Read From Cache Percentage: 0
  • HDFS Local Bytes Read: 1.0 GiB
  • HDFS Local Bytes Read Percentage: 50
  • HDFS Remote Bytes Read: 1.0 GiB
  • HDFS Remote Bytes Read Percentage: 50
  • HDFS Scanner Average Read Throughput: 412.7 MiB/s
  • HDFS Short Circuit Bytes Read: 1.0 GiB
  • HDFS Short Circuit Bytes Read Percentage: 50
  • Impala Version: impalad version 3.0.0-cdh6.0.1 RELEASE (build 9a74a5053de5f7b8dd983802e6d75e58d31472db)
  • Memory Accrual: 479,253,548 byte seconds
  • Memory Spilled: 0 B
  • Node with Peak Memory Usage: p1i-hdp-srv03.lnt.com:22000
  • Number of Backends: 13
  • Number of Query Fragments Instances: 485
  • Out of Memory: false
  • Per Node Peak Memory Usage: 220.5 MiB
  • Planning Wait Time: 4.55s
  • Planning Wait Time Percentage: 11
  • Query Status: Cancelled
  • Session ID: 3041e4d70d3697bc:efc9a3fe12cfa1a2
  • Session Type: HIVESERVER2
  • Statistics Corrupt: false
  • Statistics Missing: false
  • Threads: CPU Time: 3.6m
  • Threads: CPU Time Percentage: 9
  • Threads: Network Receive Wait Time: 13.5m
  • Threads: Network Receive Wait Time Percentage: 33
  • Threads: Network Send Wait Time: 41.99s
  • Threads: Network Send Wait Time Percentage: 2
  • Threads: Storage Wait Time: 23.5m
  • Threads: Storage Wait Time Percentage: 57
  • Threads: Total Time: 41.3m

 

Thanks

Expert Contributor

You are welcome.

The query was cancelled because of an exception, but your query info contains no details of that exception. You can download the text query profile from CM. If the profile still doesn't show the details, grep for the query ID 3348f74b129b0dae:1666447600000000 in the Impala INFO log files on the coordinator, p1i-hdp-srv11.lnt.com. You should be able to see which query fragment instance hit the exception. Then grep for that instance ID in the Impala INFO log files on the host where the instance was running to find the cause.
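
If plain grep gets tedious, the two-step search could be scripted roughly as below. Treat it as a sketch: the helper names are hypothetical, and `related_ids` assumes that fragment instance IDs look like the query ID with different last 8 hex digits, which you should verify against your own logs.

```python
import re


def matching_lines(lines, needle):
    """Yield log lines that contain the given ID (plain substring match, like grep -F)."""
    for line in lines:
        if needle in line:
            yield line.rstrip("\n")


def related_ids(lines, query_id):
    """Collect IDs that share the query ID's prefix but differ in the last 8 hex digits.

    Assumption: fragment instance IDs follow this pattern. Check it in your logs.
    """
    prefix = query_id[:-8]
    pattern = re.compile(re.escape(prefix) + r"[0-9a-f]{8}")
    found = {m.group(0) for line in lines for m in pattern.finditer(line)}
    found.discard(query_id)
    return sorted(found)


def scan_file(path, needle):
    """Return all lines of one log file that mention the ID."""
    with open(path, errors="replace") as f:
        return list(matching_lines(f, needle))
```

On the coordinator you would call something like `scan_file("/var/log/impalad/impalad.INFO", "3348f74b129b0dae:1666447600000000")`; the log path here is an assumption, so use whatever log directory your cluster is configured with. Then repeat `scan_file` with the instance ID on the host that ran that instance.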