
HBase region server is stopping frequently without any error log

Expert Contributor

We have a five-node Hadoop cluster powered by HDP 2.1 and Ambari 1.6, with one HBase (0.98) master and three data nodes running three region servers. Our HBase application runs on this cluster.

For the last couple of weeks, the region server on data node 2 has been stopping arbitrarily without any error logs; we observed it going down roughly once a week. For the last couple of days, the region server on data node 3 has also been going down without any error logs.

The region server logs are as follows.

Log list:

-rw-r--r-- 1 hbase hadoop 191 Dec 29 18:17 hbase-hbase-regionserver-fsdata2c.corp.arc.com.out.2
-rw-r--r-- 1 hbase hadoop 814M Dec 29 18:17 gc.log-201511240826
-rw-r--r-- 1 hbase hadoop 191 Jan 4 18:27 hbase-hbase-regionserver-fsdata2c.corp.arc.com.out.1
-rw-r--r-- 1 hbase hadoop 186M Jan 4 18:27 gc.log-201512300433

[root@fsdata2c hbase]# more hbase-hbase-regionserver-fsdata2c.corp.arc.com.out.1
/usr/lib/hbase/bin/hbase-daemon.sh: line 197: 19217 Killed nice -n $HBASE_NICENESS "$HBASE_HOME"/bin/hbase --config "${HBASE_CONF_DIR}" $command "$@" start >> "$logout" 2>&1
[root@fsdata2c hbase]#
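
For reference, the bare "Killed" in the .out file means the process received a SIGKILL from outside the JVM, which is why the HBase logs themselves show no error. A quick, hedged check for kernel OOM-killer activity, assuming standard RHEL system logging (paths may differ on your hosts):

dmesg | grep -iE "killed process|out of memory"    # recent kernel OOM-killer messages
grep -i "killed process" /var/log/messages         # same, from the persisted syslog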

For now, we are starting the region server manually, which solves the problem temporarily until the region server stops again.

We need a permanent solution. Can anyone please help with this issue?

1 ACCEPTED SOLUTION

Master Mentor

@Raja Ray, are all the standard requirements set, i.e. ulimit and swappiness? Can you also check the disk health? Also, what OS are you running, and if it is RPM-based, do you have Transparent Huge Pages turned off?


10 REPLIES


Can you please look for JVM pauses in the regionserver logs?
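
For reference, JvmPauseMonitor logs warnings such as "Detected pause in JVM or host machine (eg GC): pause of approximately ...ms"; a minimal way to search for them, assuming the default HDP log location of /var/log/hbase (adjust the path and file pattern for your install):

grep -i "Detected pause" /var/log/hbase/hbase-hbase-regionserver-*.log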

Expert Contributor

There are no errors in the GC log or in the regionserver log corresponding to JVM pauses (org.apache.hadoop.hbase.util.JvmPauseMonitor).

We are getting the following error in the logs:

[main-SendThread(fsdata2c.corp.arc.com:2181)] zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
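
For reference, that ClientCnxn message means the ZooKeeper server on fsdata2c closed the client connection, so the region server keeps reconnecting. A quick way to check that ZooKeeper instance directly, using ZooKeeper's built-in four-letter-word commands (assuming nc is available on the host):

echo ruok | nc fsdata2c.corp.arc.com 2181    # a healthy server answers "imok"
echo stat | nc fsdata2c.corp.arc.com 2181    # shows mode, client connections and latency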

I have attached the logs.

Also, we have increased the HBase region server heap to 4096 MB and the ZooKeeper session timeout to 40 seconds.
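
For reference, the heap normally comes from hbase-env.sh and the session timeout from zookeeper.session.timeout in hbase-site.xml, which is in milliseconds (40 seconds = 40000). A quick way to verify what a region server host actually picked up, assuming the usual /etc/hbase/conf location on HDP:

grep -Ei "HBASE_HEAPSIZE|Xmx" /etc/hbase/conf/hbase-env.sh            # heap settings rendered by Ambari
grep -A1 "zookeeper.session.timeout" /etc/hbase/conf/hbase-site.xml   # effective session timeout in ms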

Please share your thoughts. Attachments: hbase-region-server-log.txt (4.5 kB), hbase-gc-log.txt (3.7 kB)


I am not able to access the logs you attached. Is it possible for you to share them through a shared FTP link?

I would be interested in the ZooKeeper logs, GC logs, datanode logs, HBase regionserver logs, and HBase master logs as well.

Master Mentor

@Raja Ray, are all the standard requirements set, i.e. ulimit and swappiness? Can you also check the disk health? Also, what OS are you running, and if it is RPM-based, do you have Transparent Huge Pages turned off?
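
For reference, a minimal sketch of checking these items on each region server host. Run the ulimit checks as the hbase user, since limits are per user; the smartctl device name is only an assumption, so substitute your actual data disks:

ulimit -n                                          # open-file limit for the current user
ulimit -u                                          # max user processes
cat /proc/sys/vm/swappiness                        # current swappiness value
cat /sys/kernel/mm/transparent_hugepage/enabled    # [never] means THP is disabled on RHEL/CentOS
smartctl -H /dev/sda                               # basic disk health, assuming smartmontools is installed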

Expert Contributor

Hi @Artem Ervits: ulimit is unlimited, swappiness is 30, and the OS is RHEL 7. Transparent Huge Pages was enabled, but we have now turned THP off. Please suggest how to tune these parameters. Thanks in advance.

Master Mentor

THP on CentOS 7 is not a big deal, but you should turn swappiness down to 0. I have a script with some important parameters; take a look.

https://github.com/dbist/scripts/blob/master/administration/hbase.sh

Also check the number of open files.
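
For reference, a minimal sketch of making these changes persistent on RHEL/CentOS. The limit values below are common HBase recommendations, not values taken from the script above, so cross-check them against it:

sysctl -w vm.swappiness=0                                   # apply immediately
echo "vm.swappiness=0" >> /etc/sysctl.conf                  # persist across reboots
echo "hbase - nofile 32768" >> /etc/security/limits.conf    # open-file limit for the hbase user (example value)
echo "hbase - nproc 16384" >> /etc/security/limits.conf     # process limit for the hbase user (example value)

The new limits only apply to new sessions, so the region servers need a restart to pick them up.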

Expert Contributor

Hi @Artem Ervits, thanks for your prompt response. We have set swappiness to 0 and executed the hbase.sh script you shared on all region servers and the HBase master node. Here are the open-file counts:

HMaster: cat /proc/sys/fs/file-nr

2208 0 1557484

Regionserver #1: cat /proc/sys/fs/file-nr

2976 0 3126194

Regionserver #2: cat /proc/sys/fs/file-nr

3008 0 3126194

Regionserver #3: cat /proc/sys/fs/file-nr

2752 0 3126194
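
For reference on reading /proc/sys/fs/file-nr: the three numbers are allocated file handles, allocated-but-unused handles, and the system-wide maximum, so all of these nodes are far from the kernel limit. The per-process limit of the running region server matters as well; a quick check (the pgrep pattern is an assumption, adjust it to your process name):

pgrep -f HRegionServer                                      # find the region server PID
grep "open files" /proc/$(pgrep -f HRegionServer)/limits    # per-process "Max open files" limit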

We will observe the HBase components for the next couple of weeks and let you know. Thanks for your expert help!

Master Mentor

Can you still provide the regionserver logs? @Raja Ray, what I suggested are just common practices and not necessarily a solution to your problem.

Expert Contributor

@Artem Ervits After two weeks of observation, it seems the issue is no longer occurring. Thanks for your expert advice, help, and solution. One more thing I am observing is the alert "Blocks health CRIT for about a minute CRITICAL: corrupt_blocks:<1>, missing_blocks:<0>, total_blocks:<1765>", although there are no missing blocks. I will create a separate thread for that.