
CDH 5.9 client kills all CDH 5.3.2 RegionServers

Rising Star

Hi,

We have observed unsafe behaviour when mixing CDH 5.3.2 and CDH 5.9 libraries. I am sharing our observations in the hope that the issue is, or will be, fixed in later releases.

We have a CDH 5.3.2 cluster that had been running fine for months. Yesterday, out of the blue, RegionServers started dropping like flies. There were no error messages in the logs, just an abrupt startup entry with all the classpath info and so on. It took me a good hour to narrow down the source of the problem.

Apparently, one of our colleagues tried to fetch some "fresh data" from the cluster using the newer CDH 5.9 client libraries. That's it! Whenever he connected to the CDH 5.3.2 cluster and attempted to query a table, all of the cluster's RegionServers crashed without an error message.

It is really worrying that an accidental connection using newer libraries (5.9) can bring the whole cluster (5.3.2) offline. So I wonder: does the Hadoop/HBase architecture have any safety mechanism for library incompatibility? Has such a mechanism simply not been implemented, or does it not exist at all?

Thanks,

Gin

1 ACCEPTED SOLUTION

Mentor
I can reproduce this easily with some newer versions' shells.

The problem seems to be that the newer clients send a default scanner caching value of Integer.MAX_VALUE, which the older servers have no guard against. When a server receives such a request, it tries to construct a normal ArrayList of that size in memory, which Java cannot do because it exceeds the JVM's maximum array size. It therefore throws an OutOfMemoryError (OOME), which, combined with the 'Kill when OutOfMemory' configuration flag, destroys the RS with a kill -9 and leaves an OOME message in the RS's stdout file.
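
For illustration, here is a minimal sketch of the failure mode (this is not the actual RegionServer code, just a demonstration of why an unchecked, client-supplied capacity blows up the JVM):

```java
import java.util.ArrayList;
import java.util.List;

public class ScannerCachingOome {
    public static void main(String[] args) {
        // A scanner caching value taken at face value from the client,
        // as the pre-HBASE-11544 server code effectively did.
        int requestedCaching = Integer.MAX_VALUE;

        // Pre-sizing the result list asks the JVM for a backing array of
        // roughly 2^31 references, which exceeds the maximum array size
        // (and any realistic heap), so this line throws
        // java.lang.OutOfMemoryError.
        List<Object> results = new ArrayList<>(requestedCaching);
        System.out.println(results.size()); // never reached
    }
}
```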

HBASE-11544 reworked this area, adding guards and better ways of handling scanner requests. So if your server runs CDH 5.5.0 or later, the earliest release to include this JIRA, it will not crash even when such a request comes in.
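
Conceptually, the fix boils down to bounding any client-supplied parameter before allocating for it. A rough sketch of such a guard (the names and the cap are illustrative, not the actual HBASE-11544 patch, which bounds work by result size rather than a fixed row count):

```java
import java.util.ArrayList;
import java.util.List;

public class ScannerGuard {
    // Illustrative cap only; pick whatever bound fits the deployment.
    private static final int MAX_ROWS_PER_CALL = 10_000;

    // Clamp the client-supplied caching value before allocating anything,
    // instead of trusting it blindly.
    static <T> List<T> newResultList(int requestedCaching) {
        int safe = Math.max(1, Math.min(requestedCaching, MAX_ROWS_PER_CALL));
        return new ArrayList<>(safe);
    }
}
```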

In your current situation, aside from a CDH upgrade, I think you can guard against a cascading automatic kill -9 from a rogue client request by disabling the setting under CM -> HBase -> Configuration called "Kill When Out of Memory" for the RegionServers group. Save and restart all RegionServers after making this change.

If a fatal OOME occurs in a critical thread within the RS, it is already designed to go down on its own; with this flag unset, an OOME triggered by a client request can be safely ignored.
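
As an additional stopgap on the client side, the newer clients can set an explicit, modest caching value on each scan so the pathological default is never sent. A minimal sketch using the standard HBase 1.x client API ("mytable" is a placeholder table name):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;

public class SafeScan {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("mytable"))) {
            Scan scan = new Scan();
            // Send a small, explicit caching value so the client never
            // falls back on its (potentially huge) version default.
            scan.setCaching(100);
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    System.out.println(r); // process each row
                }
            }
        }
    }
}
```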

Once you upgrade, you can revert it to the default (enabled), as the issue does not occur on newer HBase versions.

Does this help?


2 REPLIES


Rising Star
We are about to perform a planned upgrade of the cluster to 5.9, so the problem will be solved. It is great to know that the issue will become irrelevant.

Thanks, Harsh. A fantastic clarification!