04-17-2018 03:53 PM
After successfully upgrading the cluster to the latest releases, CM 5.14.0 / CDH 5.14.2, I have run into this problem on 6 of my nodes: on the first queries the Impala daemon suddenly stops, the query is cancelled, and I get the error messages below:
Cancelled due to unreachable impalad(s): node1.example.com:22000
Status: RPC Error: Client for node5.example.com:22000 hit an unexpected exception: Unknown: Interrupted system call, type: N6apache6thrift9transport19TTransportExceptionE, rpc: N6impala19TTransmitDataResultE, send: done
Impala Daemon log file:
CancelQueryFInstances query_id= 3423055f3fda78a:a2446bea00000000 failed to connect to node2.example.com:22000 :Couldn't open transport for node2.example.com:22000 (connect() failed: Connection refused)
Statestore log file:
I0413 20:07:01.767758 64122 statestore.cc:729] Unable to send heartbeat message to subscriber firstname.lastname@example.org:22000, received error: Couldn't open transport for node5.example.com:23000 (connect() failed: Connection refused)
While looking for the source of the issue, I found this crash message in the Impala Daemon logs:
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGILL (0x4) at pc=0x0000000000d863e5, pid=13065, tid=0x00007efc499cf700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_144-b01) (build 1.8.0_144-b01)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.144-b01 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C [impalad+0x9863e5] impala::HdfsScanNodeBase::StopAndFinalizeCounters()+0x965
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /var/run/cloudera-scm-agent/process/13339-impala-IMPALAD/hs_err_pid13065.log
#
# If you would like to submit a bug report, please visit:
# http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
We run CentOS 6.9 on the 6 servers. I tried upgrading and downgrading to several CentOS 6.9 kernel releases and JDK versions, with no result. Here are the releases used:
CentOS 6.9 kernel:
Remark: These 6 nodes are the only nodes that do not support SSE4_2.
Thanks in advance.
04-17-2018 05:09 PM
What version of CDH were you running before the upgrade? Were you running on the same hardware?
Can you include the CPU info from your impalad.INFO log? It looks something like this:
I0417 17:05:31.064653 8873 init.cc:237] Cpu Info:
  Model: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
  Cores: 8
  Max Possible Cores: 8
  L1 Cache: 32.00 KB (Line: 64.00 B)
  L2 Cache: 256.00 KB (Line: 64.00 B)
  L3 Cache: 8.00 MB (Line: 64.00 B)
  Hardware Supports: ssse3 sse4_1 sse4_2 popcnt avx avx2 pclmulqdq
  Numa Nodes: 1
  Numa Nodes of Cores: 0->0 | 1->0 | 2->0 | 3->0 | 4->0 | 5->0 | 6->0 | 7->0 |
04-18-2018 01:46 AM
Hi @Tim Armstrong
Here is the CPU info from impalad.INFO :
I0417 20:54:12.845438 13375 init.cc:230] Cpu Info:
  Model: Intel(R) Xeon(R) CPU E5405 @ 2.00GHz
  Cores: 8
  Max Possible Cores: 8
  L1 Cache: 32.00 KB (Line: 64.00 B)
  L2 Cache: 6.00 MB (Line: 64.00 B)
  L3 Cache: 0 (Line: 0)
  Hardware Supports: ssse3 sse4_1
  Numa Nodes: 1
  Numa Nodes of Cores: 0->0 | 1->0 | 2->0 | 3->0 | 4->0 | 5->0 | 6->0 | 7->0 |
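For anyone comparing hosts the same way, here is a minimal Python sketch that pulls the "Hardware Supports:" list out of a Cpu Info log entry and diffs two hosts. The function name `hardware_supports` is made up for illustration, the two strings are shortened copies of the log entries posted in this thread, and the parsing assumes the feature list ends at "Numa Nodes:" as it does in those entries.

```python
import re

def hardware_supports(cpu_info_log):
    """Return the feature names listed after 'Hardware Supports:' in a Cpu Info log entry."""
    # Assumption: the feature list runs up to 'Numa Nodes:' (or the end of the text).
    match = re.search(r"Hardware Supports:(.*?)(?:Numa Nodes:|$)", cpu_info_log, re.DOTALL)
    if match is None:
        return set()
    return set(match.group(1).split())

# Shortened copies of the two Cpu Info entries from this thread.
reference = ("Cpu Info: Model: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz Cores: 8 "
             "Hardware Supports: ssse3 sse4_1 sse4_2 popcnt avx avx2 pclmulqdq Numa Nodes: 1")
affected = ("Cpu Info: Model: Intel(R) Xeon(R) CPU E5405 @ 2.00GHz Cores: 8 "
            "Hardware Supports: ssse3 sse4_1 Numa Nodes: 1")

# Features the reference host reports that the affected host does not.
missing = sorted(hardware_supports(reference) - hardware_supports(affected))
print(missing)  # ['avx', 'avx2', 'pclmulqdq', 'popcnt', 'sse4_2']
```

The difference includes both sse4_2 and popcnt, which matches the remark in the first post and the SIGILL (illegal instruction) in the crash message.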
04-18-2018 02:41 PM
Do you have the JVM error dump file?
I filed https://issues.apache.org/jira/browse/IMPALA-6882 to investigate the issue. I took a look at the code and it doesn't look like anything has changed, so it probably requires deeper investigation.
04-19-2018 02:07 AM
06-13-2018 04:47 PM
I am running into the same problem on a fresh install of CDH 5.14.3. According to the ticket that Tim pasted above, the issue is fixed. Is there a timeline for when this fix will be available for general release? Is there a workaround for this that one can utilize now?
06-13-2018 07:03 PM
I expect it will be included in the 5.14.4 maintenance release. I'm not aware of a workaround aside from avoiding running on affected hardware without popcnt support.
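To find out which hosts are affected before placing Impala daemons on them, a rough Python sketch like the one below can check the CPU flags that Linux reports in /proc/cpuinfo. The helper names and the sample text are hypothetical, and the default list of required features (sse4_2, popcnt) is only an assumption taken from the features mentioned in this thread, not an official requirements list.

```python
def cpu_flags(cpuinfo_text):
    """Collect the CPU feature flags from /proc/cpuinfo-style text (Linux, x86)."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            flags.update(value.split())
    return flags

def missing_features(cpuinfo_text, required=("sse4_2", "popcnt")):
    """Return the required features (assumed list) that the CPU does not report."""
    present = cpu_flags(cpuinfo_text)
    return [name for name in required if name not in present]

# Hypothetical, shortened sample resembling a CPU without SSE4.2 or popcnt.
sample = (
    "processor\t: 0\n"
    "model name\t: Intel(R) Xeon(R) CPU E5405 @ 2.00GHz\n"
    "flags\t\t: fpu sse sse2 ssse3 sse4_1\n"
)
print(missing_features(sample))  # ['sse4_2', 'popcnt']
```

On a real host, pass the contents of /proc/cpuinfo instead of the sample, for example `missing_features(open("/proc/cpuinfo").read())`. An empty list means the host reports every feature in the assumed list.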