After upgrading to CDH 5.14.2, the Impala daemon stops suddenly

avatar
Master Collaborator

Hi,

After successfully upgrading the cluster to the latest releases (CM 5.14.0 / CDH 5.14.2), I am facing a problem on 6 of my nodes: on the first queries, the Impala daemon suddenly stops, the query is cancelled, and I get the error messages below:

Impala-shell:

Cancelled due to unreachable impalad(s): node1.example.com:22000

ODBC:

Status: RPC Error: Client for node5.example.com:22000 hit an unexpected exception: Unknown: Interrupted system call, type: N6apache6thrift9transport19TTransportExceptionE, rpc: N6impala19TTransmitDataResultE, send: done

Impala Daemon log file:

 

CancelQueryFInstances query_id= 3423055f3fda78a:a2446bea00000000 failed to connect to node2.example.com:22000 :Couldn't open transport for node2.example.com:22000 (connect() failed: Connection refused)

Statestore log file:

I0413 20:07:01.767758 64122 statestore.cc:729] Unable to send heartbeat message to subscriber impalad@node5.example.com:22000, received error: Couldn't open transport for node5.example.com:23000 (connect() failed: Connection refused)


While looking for the source of the issue, I found this crash message in the Impala Daemon logs:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGILL (0x4) at pc=0x0000000000d863e5, pid=13065, tid=0x00007efc499cf700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_144-b01) (build 1.8.0_144-b01)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.144-b01 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [impalad+0x9863e5]  impala::HdfsScanNodeBase::StopAndFinalizeCounters()+0x965
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /var/run/cloudera-scm-agent/process/13339-impala-IMPALAD/hs_err_pid13065.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
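The two most useful facts in a report like this are the signal and the problematic frame. SIGILL is the signal raised for an illegal instruction, and the "C" marker on the frame denotes native code, which matches the report's own note that the crash happened outside the JVM. Below is a minimal Python sketch for pulling those fields out of the header. The function name `summarize_crash` and the regular expressions are illustrative assumptions based only on the lines quoted above, not a full parser for the hs_err format.

```python
import re

# Header lines copied from the crash report quoted above.
CRASH_HEADER = """\
#  SIGILL (0x4) at pc=0x0000000000d863e5, pid=13065, tid=0x00007efc499cf700
# Problematic frame:
# C  [impalad+0x9863e5]  impala::HdfsScanNodeBase::StopAndFinalizeCounters()+0x965
"""


def summarize_crash(header: str) -> dict:
    """Extract signal, pid and problematic frame from an hs_err-style header.

    Hypothetical helper: the patterns only cover the layout shown above.
    """
    summary = {}
    sig = re.search(
        r"^#\s+(SIG[A-Z]+) \(0x[0-9a-fA-F]+\) at pc=0x[0-9a-fA-F]+, pid=(\d+)",
        header,
        re.M,
    )
    if sig:
        summary["signal"] = sig.group(1)
        summary["pid"] = int(sig.group(2))
    frame = re.search(
        r"^# ([A-Za-z])\s+\[(\S+?)\+0x[0-9a-fA-F]+\]\s+(.+)$",
        header,
        re.M,
    )
    if frame:
        summary["frame_type"] = frame.group(1)  # "C" means native code
        summary["module"] = frame.group(2)
        summary["symbol"] = frame.group(3).strip()
    return summary


if __name__ == "__main__":
    for key, value in summarize_crash(CRASH_HEADER).items():
        print(f"{key}: {value}")
```
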


The 6 servers run CentOS 6.9. I tried upgrading and downgrading to several CentOS 6.9 kernel releases and JDK versions, but with no result. Here are the releases I tried:

CentOS 6.9 kernel:
2.6.32-696.23.1.el6.x86_64
2.6.32-696.16.1.el6.x86_64
2.6.32-696.13.2.el6.x86_64
2.6.32-642.15.1.el6.x86_64
2.6.32-642.11.1.el6.x86_64

JDK:
jdk.1.8.0_144
jdk.1.8.0_121


Remark: These 6 nodes are the only nodes that do not support SSE4_2.

Thanks in advance.


avatar

What version of CDH were you running before the upgrade? Were you running on the same hardware?

 

Can you include the CPU info from your impalad.INFO log? It looks something like this:

I0417 17:05:31.064653  8873 init.cc:237] Cpu Info:
  Model: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
  Cores: 8
  Max Possible Cores: 8
  L1 Cache: 32.00 KB (Line: 64.00 B)
  L2 Cache: 256.00 KB (Line: 64.00 B)
  L3 Cache: 8.00 MB (Line: 64.00 B)
  Hardware Supports:
    ssse3
    sse4_1
    sse4_2
    popcnt
    avx
    avx2
    pclmulqdq
  Numa Nodes: 1
  Numa Nodes of Cores: 0->0 | 1->0 | 2->0 | 3->0 | 4->0 | 5->0 | 6->0 | 7->0 |

avatar
Master Collaborator
Thanks for the reply, Tim.
It was CDH 5.12.0, and it was working great on the same servers.
I'll share the CPU info of those nodes ASAP.

avatar
Master Collaborator

Hi @Tim Armstrong

Here is the CPU info from impalad.INFO:

I0417 20:54:12.845438 13375 init.cc:230] Cpu Info:
  Model: Intel(R) Xeon(R) CPU           E5405  @ 2.00GHz
  Cores: 8
  Max Possible Cores: 8
  L1 Cache: 32.00 KB (Line: 64.00 B)
  L2 Cache: 6.00 MB (Line: 64.00 B)
  L3 Cache: 0 (Line: 0)
  Hardware Supports:
    ssse3
    sse4_1
  Numa Nodes: 1
  Numa Nodes of Cores: 0->0 | 1->0 | 2->0 | 3->0 | 4->0 | 5->0 | 6->0 | 7->0 |
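Comparing the two "Hardware Supports" lists by eye works for one host, but with several nodes it may be easier to script. The following is a rough Python sketch that reads the "Cpu Info" block from impalad.INFO text and reports which features are absent. The helper name `hardware_supports` and the `FEATURES_OF_INTEREST` set are assumptions made for illustration (the set just names the two features mentioned in this thread), and the parsing relies only on the layout shown in the excerpts above.

```python
# Assumption: the two CPU features discussed in this thread.
FEATURES_OF_INTEREST = {"sse4_2", "popcnt"}

# Excerpt copied from the impalad.INFO output posted above.
E5405_LOG = """\
I0417 20:54:12.845438 13375 init.cc:230] Cpu Info:
  Model: Intel(R) Xeon(R) CPU           E5405  @ 2.00GHz
  Cores: 8
  Hardware Supports:
    ssse3
    sse4_1
  Numa Nodes: 1
"""


def hardware_supports(log_text: str) -> set:
    """Return the feature names listed under 'Hardware Supports:'.

    Hypothetical helper: feature lines are treated as single tokens, and
    the block is assumed to end at the next 'Key: value' line.
    """
    features = set()
    in_block = False
    for line in log_text.splitlines():
        stripped = line.strip()
        if stripped == "Hardware Supports:":
            in_block = True
            continue
        if in_block:
            if not stripped or ":" in stripped:
                in_block = False
                continue
            features.add(stripped)
    return features


if __name__ == "__main__":
    found = hardware_supports(E5405_LOG)
    missing = sorted(FEATURES_OF_INTEREST - found)
    print("supported:", sorted(found))
    print("missing:", missing)
```
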




avatar

Do you have the JVM error dump file?

/var/run/cloudera-scm-agent/process/13339-impala-IMPALAD/hs_err_pid13065.log

 

I filed https://issues.apache.org/jira/browse/IMPALA-6882 to investigate the issue. I took a look at the code and it doesn't look like anything has changed, so it probably requires deeper investigation.

avatar
Master Collaborator

Hi @Tim Armstrong

Thank you for your response.

Here is the JVM error dump file: https://ufile.io/j0zat
I have formatted 2 servers and reset them to CentOS 6.9 (kernel 2.6.32-696.23.1.el6.x86_64), but the problem is still the same!


I hope we can resolve this bug ASAP. Good luck.

avatar
Explorer

Hello,

 

I am running into the same problem on a fresh install of CDH 5.14.3.  According to the ticket that Tim pasted above, the issue is fixed.  Is there a timeline for when this fix will be available for general release?  Is there a workaround for this that one can utilize now? 

avatar

I expect it will be included in the 5.14.4 maintenance release. I'm not aware of a workaround aside from avoiding running on affected hardware without popcnt support.
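To find out which hosts fall into the affected group before scheduling Impala daemons on them, one option is to check the CPU flags directly on each machine. Here is a minimal Python sketch, assuming a Linux host where /proc/cpuinfo exposes a "flags" line. The function names and the default list of required features are illustrative assumptions, not part of Impala or Cloudera Manager.

```python
from pathlib import Path


def cpu_flags(cpuinfo_text: str) -> set:
    """Return the flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_features(cpuinfo_text: str, required=("popcnt", "sse4_2")) -> list:
    """List the required features (an assumed default) absent from the CPU flags."""
    flags = cpu_flags(cpuinfo_text)
    return [name for name in required if name not in flags]


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")
    if path.exists():
        missing = missing_features(path.read_text())
        print("missing:", ", ".join(missing) if missing else "none")
    else:
        print("/proc/cpuinfo not found; this sketch assumes Linux")
```

Running this on each node and collecting the output would give a quick list of hosts to keep out of the Impala role until the fix is installed.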

avatar
Explorer

Hi,

 

I am happy to report that after updating to CDH 5.14.4, this crash bug seems to be fixed. We can run Impala queries now! This is the first time we've used Impala and it looks amazingly fast - glad we can use it now 🙂 Thank you for fixing it!

avatar

@AntonyN thanks for following up - glad to hear it!