New Contributor
Posts: 6
Registered: ‎11-24-2017

HBase Region Server Garbage Collection Issue


Hello everyone!

 

I have been working as a Big Data Software Engineer for almost a year, and I have encountered a very serious problem on our production platform.

 

We are using Cloudera Express 5.4.6. Our Hadoop cluster has 2 NameNodes and 9 DataNodes, and each DataNode also hosts an HBase Region Server. We have around 14 TB of compressed data.

 

We are using OpenTSDB on top of HBase, which in turn runs on HDFS. There is an OpenTSDB instance on every DataNode, and those instances are used to write data to HBase.

 

We are collecting data at different polling intervals (in minutes): 1, 5, 15, 30, 60, and 120.

MapReduce jobs also run every two hours.

 

The data is presented in our own web application, which sends queries to OpenTSDB.

 

Recently, we started experiencing problems with HBase Region Server Garbage Collection.

 

One of our Region Servers falls into long GC pauses, which slow down the flow of data to our web application, and eventually the Region Server restarts.

 

We haven't added anything new or increased our data load.

 

This GC issue happens every hour, and as of today it has moved to another Region Server.

 

Does anybody have any recommendations? Could this be a bug?

 

Thank you in advance. 

 

 

New Contributor
Posts: 6
Registered: ‎11-24-2017

Re: HBase Region Server Garbage Collection Issue

Additionally, each Region Server has 8 GB of heap, and this is the Java heap configuration:
-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=60 -XX:MaxGCPauseMillis=200
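
For anyone hitting something similar, here is a rough sketch of the GC logging flags that could be added next to the ones above to measure the pauses. These are standard HotSpot options for Java 7/8; the log path and rotation sizes below are placeholders, not our actual settings:

-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime
-Xloggc:/var/log/hbase/gc-regionserver.log
-XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=20M

That should at least show whether the long pauses are full GCs or unusually long young/mixed collections.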
New Contributor
Posts: 5
Registered: ‎08-04-2017

Re: HBase Region Server Garbage Collection Issue

What does your tsdb config look like?

Do you have salting enabled?

Did you pre-split the regions before writing data?
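
For reference, this is roughly what the salting settings look like in opentsdb.conf (available from OpenTSDB 2.2; the values here are only illustrative, and the decision has to be made before any data is written to the table):

# Illustrative values only - do NOT change these on a table that already holds data
tsd.storage.salt.width = 1
tsd.storage.salt.buckets = 20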

 

 

New Contributor
Posts: 6
Registered: ‎11-24-2017

Re: HBase Region Server Garbage Collection Issue

Salting is disabled.

 

This is the OpenTSDB configuration from /etc/opentsdb.conf.

 

# --------- NETWORK ----------
# The TCP port TSD should use for communications
# *** REQUIRED ***
tsd.network.port = 4243

# The IPv4 network address to bind to, defaults to all addresses
# tsd.network.bind = 0.0.0.0

# Enables Nagel's algorithm to reduce the number of packets sent over the
# network, default is True
#tsd.network.tcpnodelay = true

# Determines whether or not to send keepalive packets to peers, default
# is True
#tsd.network.keepalive = true

# Determines if the same socket should be used for new connections, default
# is True
#tsd.network.reuseaddress = true

# Number of worker threads dedicated to Netty, defaults to # of CPUs * 2
#tsd.network.worker_threads = 8

# Whether or not to use NIO or tradditional blocking IO, defaults to True
#tsd.network.async_io = true

# ----------- HTTP -----------
# The location of static files for the HTTP GUI interface.
# *** REQUIRED ***
tsd.http.staticroot = /opt/opentsdb-2.2.1/build/staticroot/

# Where TSD should write it's cache files to
# *** REQUIRED ***
tsd.http.cachedir = /opt/opentsdb-2.2.1/build/CACHE

# --------- CORE ----------
# Whether or not to automatically create UIDs for new metric types, default
# is False
tsd.core.auto_create_metrics = true

# --------- STORAGE ----------
# Whether or not to enable data compaction in HBase, default is True
#tsd.storage.enable_compaction = true

# How often, in milliseconds, to flush the data point queue to storage,
# default is 1,000
# tsd.storage.flush_interval = 1000

# Name of the HBase table where data points are stored, default is "tsdb"
tsd.storage.hbase.data_table = tsdb_new

# Name of the HBase table where UID information is stored, default is "tsdb-uid"
tsd.storage.hbase.uid_table = tsdb-uid_new

# Path under which the znode for the -ROOT- region is located, default is "/hbase"
#tsd.storage.hbase.zk_basedir = /hbase

# A comma separated list of Zookeeper hosts to connect to, with or without
# port specifiers, default is "localhost"
#tsd.storage.hbase.zk_quorum = localhost
tsd.storage.hbase.zk_quorum = namenode1,namenode2,datanode1

#Added in order to optimize OpenTSDB
tsd.http.request.enable_chunked = true
tsd.http.request.max_chunk = 16000

#Added fix duplicate data in HBase
tsd.storage.fix_duplicates = true

#Added in order to increase max tags num. Available from TSDB 2.2.
tsd.storage.max_tags = 45

#Added in order to increase max metric number. By default for ID 3B is used (2^24 = 16777216). Available from TSDB 2.2.
tsd.storage.uid.width.metric = 4
tsd.storage.uid.width.tagk = 4
tsd.storage.uid.width.tagv = 4

#Added in order to randomly assign UIDs to new metrics as they are created. Better distribution. Available from TSDB 2.2.
tsd.core.uid.random_metrics = true

New Contributor
Posts: 5
Registered: ‎08-04-2017

Re: HBase Region Server Garbage Collection Issue

How write-heavy is your workload?

 

We've seen issues with hot-spotting Region Servers before we had salting enabled.

 

If you have metrics which are being logged with very different periods, then some regions can experience more write load than others, so those Region Servers will have a higher compaction load.

 

Have you looked at the write_requests_rate and read_requests_rate metrics for your Region Servers?
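
If you are charting them in Cloudera Manager, a tsquery along these lines should plot both rates per Region Server (written from memory, so please verify it against your CM version):

SELECT write_requests_rate, read_requests_rate WHERE roleType = REGIONSERVER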

New Contributor
Posts: 6
Registered: ‎11-24-2017

Re: HBase Region Server Garbage Collection Issue

We have looked at our write_requests_rate and read_requests_rate metrics.

All Region Servers have approximately the same number of regions assigned to them, but only two fall into long GC pauses, and they do not do so at the same time.

 

When it comes to read and write request load metrics, the situation is confusing.

On a weekly aggregation graph, the results are the following:

 

RS1 has around 4K read requests and 7K write requests.

RS2 has around 5K read requests and 4K write requests.

RS7 and RS8 have around 9K and 12K read requests, respectively, and around 9K write requests.

The other Region Servers report a lot less, around 2K, for both read and write requests.

 

The confusing part is that only RS1 and RS2 fall into long GCs.

 

Also, I have investigated salting a little bit and found this in the OpenTSDB configuration documentation:

 

tsd.storage.salt.width (2.2) | Integer | Optional
The width, in bytes, of the salt prefix used to indicate which bucket a time series belongs in. A value of 0 means salting is disabled. WARNING: Do not change this after writing data to HBase or you will corrupt your tables and not be able to query any more.

 

Since this is our production environment, we cannot allow ourselves to corrupt or lose data.
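
If we ever go down the salting route, my understanding is that it could only be done on a brand-new, pre-split table with a fresh OpenTSDB instance pointed at it, leaving the current tables untouched. A very rough, untested sketch of what that might look like (the table name, bucket count and split points are hypothetical):

# In the HBase shell, create and pre-split the new table (one split per salt bucket):
#   create 'tsdb_salted', {NAME => 't'}, {SPLITS => ["\x01", "\x02", "\x03"]}

# In opentsdb.conf for the new writers only:
tsd.storage.hbase.data_table = tsdb_salted
tsd.storage.salt.width = 1
tsd.storage.salt.buckets = 4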

 

Do you have any suggestions or experience that would help us overcome this?

 

 

Thank you in advance.
