Explorer
Posts: 23
Registered: ‎02-19-2019

HBase Thrift Server failed


Here is the error message I am seeing in the role logs/stderr logs recently. Do I have to make any configuration changes in CM and then restart this server?

+ HBASE_REST_OPTS=
++ replace_pid -Xms1073741824 -Xmx1073741824 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled -XX:+HeapDumpOnOutOfMemoryError '-XX:HeapDumpPath=/tmp/hbase_hbase-HBASETHRIFTSERVER-8b300b6e4f58408d2e4b5170d308788e_pid{{PID}}.hprof' -XX:OnOutOfMemoryError=/usr/lib64/cmf/service/common/killparent.sh
++ echo -Xms1073741824 -Xmx1073741824 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled -XX:+HeapDumpOnOutOfMemoryError '-XX:HeapDumpPath=/tmp/hbase_hbase-HBASETHRIFTSERVER-8b300b6e4f58408d2e4b5170d308788e_pid{{PID}}.hprof' -XX:OnOutOfMemoryError=/usr/lib64/cmf/service/common/killparent.sh
++ sed 's#{{PID}}#5940#g'
+ export 'HBASE_THRIFT_OPTS=-Xms1073741824 -Xmx1073741824 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/hbase_hbase-HBASETHRIFTSERVER-8b300b6e4f58408d2e4b5170d308788e_pid5940.hprof -XX:OnOutOfMemoryError=/usr/lib64/cmf/service/common/killparent.sh'
+ HBASE_THRIFT_OPTS='-Xms1073741824 -Xmx1073741824 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/hbase_hbase-HBASETHRIFTSERVER-8b300b6e4f58408d2e4b5170d308788e_pid5940.hprof -XX:OnOutOfMemoryError=/usr/lib64/cmf/service/common/killparent.sh'

 

Cloudera Employee
Posts: 29
Registered: ‎01-26-2016

Re: HBase Thrift Server failed

The shared stderr snippet looks normal. What do you see in the role logs (under the /var/log/hbase folder) during the startup failure? Can you share the complete stack trace, if there is any?

Explorer
Posts: 23
Registered: ‎02-19-2019

Re: HBase Thrift Server failed

 

It doesn't look like an issue to me in the role log. Do I have to look into something specific? This seems like a common error I am facing:

2019-03-03 10:39:39,626 INFO org.apache.hadoop.hbase.http.HttpServer: Jetty bound to port 9095
2019-03-03 10:39:39,626 INFO org.mortbay.log: jetty-6.1.26.cloudera.4
2019-03-03 10:39:40,001 INFO org.mortbay.log: Started SslSocketConnectorSecure@0.0.0.0:9095
2019-03-03 10:39:40,074 INFO org.apache.hadoop.hbase.thrift.ThriftServerRunner: starting TBoundedThreadPoolServer on /0.0.0.0:9090 with readTimeout 60000ms; min worker threads=200, max worker threads=1000, max queued requests=1000
2019-03-04 16:23:27,419 INFO org.apache.hadoop.hbase.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1404ms
No GCs detected
2019-03-05 16:36:39,673 INFO org.apache.hadoop.hbase.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1035ms
No GCs detected
2019-03-06 16:10:42,642 INFO org.apache.hadoop.hbase.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1003ms
No GCs detected

Explorer
Posts: 23
Registered: ‎02-19-2019

Re: HBase Thrift Server failed

There is nothing in the logs, but the rest of the services on the same node are also showing health issues because of a Java heap dump directory free space issue. Do I have to increase the heap size to permanently resolve this?

Cloudera Employee
Posts: 27
Registered: ‎11-22-2017

Re: HBase Thrift Server failed

You're observing JVM pauses but "No GCs detected". This indicates a problem with the underlying host, typically kernel-level CPU lockups or general process hangs.

 

Check the host's /var/log/messages to find clues about such issues. When found, rectify them and then see if you still face Thrift issues. We'll take it further accordingly.
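
As a rough illustration (this is not the actual HBase JvmPauseMonitor code, just a sketch of the idea): such a monitor simply sleeps for a fixed interval and measures how much longer the sleep actually took. If the extra time is large and the GC counters did not move, the stall came from the host rather than from garbage collection.

import time

SLEEP_MS = 500    # requested sleep per iteration
WARN_MS = 1000    # report stalls beyond the requested sleep above this threshold

while True:
    start = time.monotonic()
    time.sleep(SLEEP_MS / 1000)
    extra_ms = (time.monotonic() - start) * 1000 - SLEEP_MS
    if extra_ms > WARN_MS:
        # If GC counters did not change during this window, the stall came from the
        # host (CPU lockups, swapping, a frozen VM), not from garbage collection.
        print(f"Detected pause in JVM or host machine: approximately {extra_ms:.0f}ms")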

Posts: 1,896
Kudos: 433
Solutions: 303
Registered: ‎07-31-2013

Re: HBase Thrift Server failed

If your HBase Thrift Server is not running under a secured cluster, there's a good chance it is crashing out with spurious OutOfMemoryError aborts.

Part of the problem is that the Thrift RPC layer does not check incoming request packets for validity, which ends up allowing things such as HTTP requests or random protocol scans from security scanner software (Qualys, etc.) through to the RPC layer. These can be misinterpreted as very large allocation requests, causing an OutOfMemoryError in Java due to the size the server thinks the RPC request is attempting to send, based on its first few bytes.
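
To illustrate the mechanism (a sketch of my own, not something from your logs): with an unframed transport, the first bytes of whatever arrives on the port effectively get read as a size field, so the ASCII bytes of an HTTP probe decode to a request of over a gigabyte.

import struct

probe = b"GET / HTTP/1.1\r\n"                       # what a scanner or browser might send
(pretend_size,) = struct.unpack(">i", probe[:4])    # the bytes "GET " read as a big-endian int
print(pretend_size)                                 # 1195725856
print(f"{pretend_size / 1024 ** 3:.2f} GiB")        # ~1.11 GiB "allocation" attempt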

You can confirm whether this is the case by checking the stdout of your previously failed Thrift Server processes. If you cannot spot that in the UI, visit the host that runs it; there should be lower-numbered directories for the THRIFTSERVER role type under /var/run/cloudera-scm-agent/process/ which should still have the past logs/stdout.log files within them. In the log file you should see a message such as the one below, which can help confirm this theory:

#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="/usr/lib64/cmf/service/common/killparent.sh"
# Executing /bin/sh -c "/usr/lib64/cmf/service/common/killparent.sh"...
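
If you want to check all of the past runs at once, a minimal sketch along these lines works (my own snippet, not Cloudera tooling, assuming the usual CM agent process directory layout):

from pathlib import Path

# Assumes one numbered directory per role instance, e.g.
# /var/run/cloudera-scm-agent/process/1234-hbase-HBASETHRIFTSERVER
PROCESS_ROOT = Path("/var/run/cloudera-scm-agent/process")

for proc_dir in sorted(PROCESS_ROOT.glob("*-hbase-HBASETHRIFTSERVER")):
    stdout_log = proc_dir / "logs" / "stdout.log"
    if not stdout_log.is_file():
        continue
    if "OutOfMemoryError" in stdout_log.read_text(errors="replace"):
        print(f"{proc_dir.name}: OutOfMemoryError found in logs/stdout.log")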

One way to prevent this from recurring is to switch on framed transport mode. This may break some clients if you do have active users of the HBase Thrift Server. To enable it, turn on the flag under HBase - Configuration - "Enable HBase Thrift Server Framed Transport".
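
If you do enable it, remember that clients must be switched to match. For example, if your users happen to connect with the Python happybase library (purely an illustration; adjust for whatever client you actually use), the connection would look roughly like this:

import happybase  # third-party Thrift client, used here only as an example

# transport="framed" must match the server-side "Framed Transport" setting;
# the default "buffered" transport stops working once the server expects frames.
connection = happybase.Connection(
    host="thrift-server.example.com",  # hypothetical hostname
    port=9090,                         # default HBase Thrift port, as in the log above
    transport="framed",
    protocol="binary",
)
print(connection.tables())
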
Explorer
Posts: 23
Registered: ‎02-19-2019

Re: HBase Thrift Server failed

Hello, thanks for a great explanation. I think what you mentioned is actually what is happening in my cluster. The weird thing is that this error started with the HBase Thrift Server, and then the same error appeared on HDFS, Failover Controller, and Spark. It stayed for a while and I started troubleshooting it. Surprisingly, yesterday I saw my cluster was running fine. Any reason for this abnormal behavior?

 

My cluster is a DEV cluster with no data in it. We are just about to enable Kerberos in a few days; TLS/SSL is already enabled. What would be the best solution?

As of now, I can see the stdout.log under /var/run/cloudera-scm-agent/process/:

 

JAVA_HOME=/usr/java/jdk1.8.0_162/
using /usr/java/jdk1.8.0_162/ as JAVA_HOME
using 5 as CDH_VERSION
using  as HBASE_HOME
using /run/cloudera-scm-agent/process/2813-hbase-HBASETHRIFTSERVER as HBASE_CONF_DIR
using /run/cloudera-scm-agent/process/2813-hbase-HBASETHRIFTSERVER as HADOOP_CONF_DIR
using  as HADOOP_HOME
CONF_DIR=/run/cloudera-scm-agent/process/2813-hbase-HBASETHRIFTSERVER
CMF_CONF_DIR=/etc/cloudera-scm-agent
java.lang.OutOfMemoryError: Java heap space
Dumping heap to /tmp/hbase_hbase-HBASETHRIFTSERVER-8b300b6e4f58408d2e4b5170d308788e_pid24899.hprof ...
Heap dump file created [18196434 bytes in 0.061 secs]
#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p
/usr/lib64/cmf/service/common/killparent.sh"
#   Executing /bin/sh -c "kill -9 24899
/usr/lib64/cmf/service/common/killparent.sh"...

 

 

Posts: 1,896
Kudos: 433
Solutions: 303
Registered: ‎07-31-2013

Re: HBase Thrift Server failed

As far as the Thrift Server role goes, it will likely resolve itself when you enable Kerberos, as that will introduce an auth negotiation protocol layer which will reject badly formed requests automatically. That is assuming bad requests are what is causing the frequent OOMEs despite heap raises, and not actual usage of the Thrift service.

For the Failover Controllers, NameNodes, and other roles, this theory does not apply directly. Those may be a result of some other ongoing or one-off issues, worth investigating separately (on a different thread here, if needed).