Created on 04-17-2018 03:53 PM - edited 09-16-2022 06:06 AM
Hi,
After successfully upgrading the cluster to the latest releases (CM 5.14.0 / CDH 5.14.2), I ran into the following problem on 6 of my nodes: on the first queries, the Impala daemon suddenly stops, the query is cancelled, and the error messages below are returned:
Impala-shell:
Cancelled due to unreachable impalad(s): node1.example.com:22000
ODBC:
Status: RPC Error: Client for node5.example.com:22000 hit an unexpected exception: Unknown: Interrupted system call, type: N6apache6thrift9transport19TTransportExceptionE, rpc: N6impala19TTransmitDataResultE, send: done
Impala Daemon log file:
CancelQueryFInstances query_id= 3423055f3fda78a:a2446bea00000000 failed to connect to node2.example.com:22000 :Couldn't open transport for node2.example.com:22000 (connect() failed: Connection refused)
Statestore log file:
I0413 20:07:01.767758 64122 statestore.cc:729] Unable to send heartbeat message to subscriber impalad@node5.exaple.com:22000, received error: Couldn't open transport for node5.exaple.com:23000 (connect() failed: Connection refused)
While looking for the source of the issue, I found this crash message in the Impala Daemon logs:
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGILL (0x4) at pc=0x0000000000d863e5, pid=13065, tid=0x00007efc499cf700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_144-b01) (build 1.8.0_144-b01)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.144-b01 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C [impalad+0x9863e5] impala::HdfsScanNodeBase::StopAndFinalizeCounters()+0x965
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /var/run/cloudera-scm-agent/process/13339-impala-IMPALAD/hs_err_pid13065.log
#
# If you would like to submit a bug report, please visit:
# http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
All 6 servers run CentOS 6.9. I tried upgrading and downgrading across several CentOS 6.9 kernel releases and JDK versions, with no change. These are the releases I tried:
CentOS 6.9 kernel:
2.6.32-696.23.1.el6.x86_64
2.6.32-696.16.1.el6.x86_64
2.6.32-696.13.2.el6.x86_64
2.6.32-642.15.1.el6.x86_64
2.6.32-642.11.1.el6.x86_64
JDK:
jdk.1.8.0_144
jdk.1.8.0_121
Remark: These 6 nodes are the only nodes that do not support SSE4_2.
Thanks in advance.
Created 04-17-2018 05:09 PM
What version of CDH were you running before the upgrade? Were you running on the same hardware?
Can you include the CPU info from your impalad.INFO log? It looks something like this:
I0417 17:05:31.064653 8873 init.cc:237] Cpu Info:
  Model: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
  Cores: 8
  Max Possible Cores: 8
  L1 Cache: 32.00 KB (Line: 64.00 B)
  L2 Cache: 256.00 KB (Line: 64.00 B)
  L3 Cache: 8.00 MB (Line: 64.00 B)
  Hardware Supports: ssse3 sse4_1 sse4_2 popcnt avx avx2 pclmulqdq
  Numa Nodes: 1
  Numa Nodes of Cores: 0->0 | 1->0 | 2->0 | 3->0 | 4->0 | 5->0 | 6->0 | 7->0 |
Created 04-18-2018 01:46 AM
Hi @Tim Armstrong
Here is the CPU info from impalad.INFO:
I0417 20:54:12.845438 13375 init.cc:230] Cpu Info:
  Model: Intel(R) Xeon(R) CPU E5405 @ 2.00GHz
  Cores: 8
  Max Possible Cores: 8
  L1 Cache: 32.00 KB (Line: 64.00 B)
  L2 Cache: 6.00 MB (Line: 64.00 B)
  L3 Cache: 0 (Line: 0)
  Hardware Supports: ssse3 sse4_1
  Numa Nodes: 1
  Numa Nodes of Cores: 0->0 | 1->0 | 2->0 | 3->0 | 4->0 | 5->0 | 6->0 | 7->0 |
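To compare several nodes, the `Hardware Supports` entries can be pulled out of a Cpu Info log line like the one above with a short script. This is only a minimal Python sketch: the function names are made up for illustration, the default list of features to look for (`sse4_2`, `popcnt`) is taken from what is mentioned in this thread rather than from Impala's own checks, and it assumes the feature list is followed by `Numa Nodes:` or the end of the text.

```python
import re


def hardware_supports(cpu_info_text):
    """Return the feature names listed after 'Hardware Supports:'.

    Assumes the list ends at 'Numa Nodes:' or at the end of the text,
    as in the log lines quoted in this thread.
    """
    match = re.search(
        r"Hardware Supports:\s*(.*?)\s*(?:Numa Nodes:|$)",
        cpu_info_text,
        re.DOTALL,
    )
    if match is None:
        return set()
    return set(match.group(1).split())


def missing_features(cpu_info_text, wanted=("sse4_2", "popcnt")):
    """List the wanted features that the log line does not report."""
    have = hardware_supports(cpu_info_text)
    return [name for name in wanted if name not in have]


if __name__ == "__main__":
    # Shortened copy of the Xeon E5405 line from the post above.
    line = (
        "Cpu Info: Model: Intel(R) Xeon(R) CPU E5405 @ 2.00GHz Cores: 8 "
        "Hardware Supports: ssse3 sse4_1 Numa Nodes: 1"
    )
    print(missing_features(line))  # prints ['sse4_2', 'popcnt']
```

Running the same function on a line that lists `sse4_2` and `popcnt` returns an empty list.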
Created 04-18-2018 02:41 PM
Do you have the JVM error dump file?
/var/run/cloudera-scm-agent/process/13339-impala-IMPALAD/hs_err_pid13065.log
I filed https://issues.apache.org/jira/browse/IMPALA-6882 to investigate the issue. I took a look at the code and it doesn't look like anything has changed, so this probably requires deeper investigation.
Created 04-19-2018 02:07 AM
Hi @Tim Armstrong
Thank you for your response.
Here is the JVM error dump file: https://ufile.io/j0zat
I reformatted 2 servers and reinstalled them with CentOS 6.9 (kernel 2.6.32-696.23.1.el6.x86_64), but the problem is still the same!
I hope this bug can be resolved ASAP. Good luck.
Created 06-13-2018 04:47 PM
Hello,
I am running into the same problem on a fresh install of CDH 5.14.3. According to the ticket that Tim pasted above, the issue is fixed. Is there a timeline for when this fix will be available for general release? Is there a workaround for this that one can utilize now?
Created 06-13-2018 07:03 PM
I expect it will be included in the 5.14.4 maintenance release. I'm not aware of a workaround aside from avoiding running on affected hardware without popcnt support.
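To find out which hosts fall into the affected group, one option is to look for the `popcnt` flag in `/proc/cpuinfo` on each node. The following is a minimal Python sketch, not an official tool: the function names are hypothetical, and it assumes an x86 Linux host where `/proc/cpuinfo` contains `flags` lines.

```python
def cpu_flags(cpuinfo_text):
    """Collect CPU flags from /proc/cpuinfo-style text.

    Assumes the x86 layout, where each processor entry has a
    'flags : ...' line with space-separated flag names.
    """
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            flags.update(value.split())
    return flags


def has_popcnt(cpuinfo_text):
    """True if the 'popcnt' flag appears in the given cpuinfo text."""
    return "popcnt" in cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        print("could not read /proc/cpuinfo on this host")
    else:
        print("popcnt supported:", has_popcnt(text))
```

Run on each node, this makes it easy to list the hosts that report no `popcnt` flag and keep the Impala daemon role off them until the fixed release is installed.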
Created 07-16-2018 01:12 PM
Hi,
I am happy to report that after updating to CDH 5.14.4, this crash bug seems to be fixed. We can run Impala queries now! This is the first time we've used Impala and it looks amazingly fast - glad we can use it now 🙂 Thank you for fixing!
Created 07-16-2018 01:37 PM
@AntonyN thanks for following up - glad to hear it!