Created 01-06-2016 07:37 AM
We have a five-node Hadoop cluster powered by HDP (version 2.1) and Ambari (version 1.6), with one HBase (version 0.98) master and three data nodes, each running a region server. An HBase application runs on this cluster.
For the last couple of weeks, the region server on data node 2 was stopping arbitrarily without any error logs; we observed it going down roughly once a week. In the last couple of days, the region server on data node 3 has also started going down without any error logs.
The region server logs are as follows.
Log list:
-rw-r--r-- 1 hbase hadoop 191 Dec 29 18:17 hbase-hbase-regionserver-fsdata2c.corp.arc.com.out.2
-rw-r--r-- 1 hbase hadoop 814M Dec 29 18:17 gc.log-201511240826
-rw-r--r-- 1 hbase hadoop 191 Jan 4 18:27 hbase-hbase-regionserver-fsdata2c.corp.arc.com.out.1
-rw-r--r-- 1 hbase hadoop 186M Jan 4 18:27 gc.log-201512300433
[root@fsdata2c hbase]# more hbase-hbase-regionserver-fsdata2c.corp.arc.com.out.1
/usr/lib/hbase/bin/hbase-daemon.sh: line 197: 19217 Killed nice -n $HBASE_NICENESS "$HBASE_HOME"/bin/hbase --config "${HBASE_CONF_DIR}" $command "$@" start >> "$logout" 2>&1
[root@fsdata2c hbase]#
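A bare "Killed" in the .out file with no exception in the HBase logs usually means the JVM was terminated from outside, most often by the Linux OOM killer. A quick way to check this on the affected node (a sketch; the exact kernel message wording varies by kernel version):

dmesg -T | grep -i -E "killed process|out of memory"
grep -i "out of memory" /var/log/messages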
For now, we are starting the region server manually, which fixes the problem temporarily until it stops again.
We need a permanent solution. Can anyone please help with this issue?
Created 01-06-2016 07:42 AM
Can you please look for JVM pauses in the regionserver logs?
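For example, something like the following should surface them (a sketch; the log path assumes the default HDP location under /var/log/hbase):

grep "JvmPauseMonitor" /var/log/hbase/hbase-hbase-regionserver-fsdata2c.corp.arc.com.log
grep "Detected pause in JVM or host machine" /var/log/hbase/hbase-hbase-regionserver-fsdata2c.corp.arc.com.log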
Created 01-06-2016 10:09 AM
There is no error in the GC log, and nothing in the regionserver log corresponding to JVM pauses (org.apache.hadoop.hbase.util.JvmPauseMonitor).
We are getting the following error in the logs:
[main-SendThread(fsdata2c.corp.arc.com:2181)] zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
I have attached logs.
Also, we have increased the HBase region server heap size to 4096 MB and the ZooKeeper session timeout to 40 seconds.
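In terms of the underlying settings, these correspond roughly to the following (a sketch, assuming the standard hbase-env.sh and hbase-site.xml files managed by Ambari):

# hbase-env.sh: maximum heap size in MB for the HBase daemons
export HBASE_HEAPSIZE=4096

In hbase-site.xml, zookeeper.session.timeout = 40000 (the value is in milliseconds).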
Please share your thoughts. hbase-region-server-log.txt (4.5 kB) hbase-gc-log.txt (3.7 kB)
Created 01-06-2016 05:07 PM
I am not able to access the logs you attached. Is it possible for you to share them through a shared FTP link?
I would be interested in the ZooKeeper logs, GC logs, datanode logs, HBase regionserver logs, and HBase master logs as well.
Created 01-06-2016 01:41 PM
@Raja Ray are all the standard requirements set, i.e. ulimit and swappiness? Also, can you check the disk health? And what OS are you running? If it is RPM-based, do you have Transparent Huge Pages (THP) turned off?
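A quick way to check these on each node (a sketch; the THP path assumes RHEL/CentOS 7):

ulimit -n                                          # open-file limit for the current user
cat /proc/sys/vm/swappiness                        # current swappiness
cat /sys/kernel/mm/transparent_hugepage/enabled    # THP state, e.g. [always] madvise never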
Created 01-06-2016 03:02 PM
Hi @Artem Ervits: ulimit is unlimited, swappiness is 30, the OS is RHEL 7, and Transparent Huge Pages was enabled but we have now turned THP off. Please suggest how to tune these parameters. Thanks in advance.
Created 01-06-2016 03:08 PM
THP on CentOS 7 is not a big deal, but you should turn swappiness down to 0. I have a script with some important parameters, take a look.
https://github.com/dbist/scripts/blob/master/administration/hbase.sh
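To set swappiness to 0 persistently, something like this should work (a sketch; the sysctl.conf line is only needed if you want the setting to survive reboots):

sysctl -w vm.swappiness=0
echo "vm.swappiness = 0" >> /etc/sysctl.conf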
also check for number of open files.
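For example (a sketch; assumes the region server runs as the hbase user):

cat /proc/sys/fs/file-nr    # allocated / free / maximum file handles, system-wide
lsof -u hbase | wc -l       # files currently open by the hbase user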
Created 01-07-2016 06:54 AM
Hi @Artem Ervits, thanks for your prompt response. We have set swappiness to 0 and executed the hbase.sh script you provided on all region servers and the HBase master node. Here is the number of files that are open:
Hmaster: cat /proc/sys/fs/file-nr
2208 0 1557484
Regionserver #1: cat /proc/sys/fs/file-nr
2976 0 3126194
Regionserver #2: cat /proc/sys/fs/file-nr
3008 0 3126194
Regionserver #3: cat /proc/sys/fs/file-nr
2752 0 3126194
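For completeness, the limits of the running region server process itself can also be checked (a sketch; assumes a single HRegionServer process per node):

cat /proc/$(pgrep -f HRegionServer)/limits | grep "open files"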
We will observe the HBase components for the next couple of weeks and let you know. Thanks for your expert help!
Created 01-07-2016 01:52 PM
Can you still provide the regionserver logs? @Raja Ray, what I suggested are just common practices and not necessarily a solution for your problem.
Created 01-19-2016 06:17 AM
@Artem Ervits After two weeks of observation, it seems the issue is not occurring anymore. Thanks for your expert advice, help, and solution on the issue. One more thing I am observing: "Blocks health CRIT for about a minute CRITICAL: corrupt_blocks:<1>, missing_blocks:<0>, total_blocks:<1765>", although there are no missing blocks. I will create a separate thread for that.