vrodionov · 03-31-2017

General optimizations

Do not run the HDFS balancer. It breaks data locality, which is important for latency-sensitive applications.
For the same reason, disable HBase automatic region balancing: balance_switch false
Disable periodic automatic major compactions for time-series data. Time-series data is immutable (there are usually no updates or deletes). The only remaining reason for major compaction is to reduce the number of store files, but we will apply a different compaction policy that limits the number of files and does not require major compaction (see below).
Presplit table(s) with time-series data in advance.
Disable region splits completely (set DisabledRegionSplitPolicy). Region splitting results in major compaction, and we do not run major compactions because they usually decrease performance and stability and increase operation latencies.
Enable WAL compression to decrease write IO.
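Because splits are disabled, the presplit points have to be chosen up front. The following is a minimal sketch of one way to compute them, assuming row keys start with a one-byte salt (bucket) prefix; the salt prefix, the class name `PresplitKeys`, and the method `splitKeys` are illustrative assumptions, not something the article prescribes.

```java
import java.util.ArrayList;
import java.util.List;

public class PresplitKeys {
    /**
     * Returns numRegions - 1 one-byte split points that divide the salt
     * range [0, saltBuckets) into numRegions contiguous slices.
     * Assumption: every row key begins with a one-byte salt value.
     */
    public static List<byte[]> splitKeys(int saltBuckets, int numRegions) {
        if (numRegions < 1 || numRegions > saltBuckets || saltBuckets > 256) {
            throw new IllegalArgumentException(
                "need 1 <= numRegions <= saltBuckets <= 256");
        }
        List<byte[]> keys = new ArrayList<>();
        for (int i = 1; i < numRegions; i++) {
            // First salt value that belongs to region i.
            int boundary = (int) ((long) i * saltBuckets / numRegions);
            keys.add(new byte[] { (byte) boundary });
        }
        return keys;
    }
}
```

The resulting list can be converted to a `byte[][]` and supplied as the split keys when the table is created. With a different key layout (for example, a metric-id prefix) the boundaries would have to be derived from the expected key distribution instead.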

Table design

Do not store data in a raw format; use time-series-specific compression (refer to the OpenTSDB row key design)
Create a coprocessor that runs periodically and compresses raw data
Have separate column families for raw and compressed data
Increase hbase.hstore.blockingStoreFiles for both column families
Use FIFOCompactionPolicy for raw data (see below)
Use standard exploring compaction with a limit on the maximum selection size for compressed data (see below)
Use gzip (GZ) block compression for raw data to decrease write IO.
Disable block cache for raw data (you will reduce block cache churn significantly)
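The row key design mentioned above can be illustrated with a simplified sketch in the spirit of OpenTSDB: each row covers one hour for one metric, and each data point is stored under its offset within that hour. The 4-byte metric id, the class name `RowKeySketch`, and the omission of tags are simplifying assumptions made here for illustration; this is not OpenTSDB's exact byte layout.

```java
import java.nio.ByteBuffer;

public class RowKeySketch {
    // One row per metric per hour (assumption for this sketch).
    static final int ROW_SPAN_SECONDS = 3600;

    /** Start of the hour-aligned row that contains the given epoch second. */
    public static long baseTime(long epochSeconds) {
        return epochSeconds - Math.floorMod(epochSeconds, (long) ROW_SPAN_SECONDS);
    }

    /** Offset of the data point inside its row, in the range 0..3599. */
    public static int offset(long epochSeconds) {
        return (int) Math.floorMod(epochSeconds, (long) ROW_SPAN_SECONDS);
    }

    /** Row key: 4-byte metric id followed by the 4-byte base time, big-endian. */
    public static byte[] rowKey(int metricId, long epochSeconds) {
        return ByteBuffer.allocate(8)
                .putInt(metricId)
                .putInt((int) baseTime(epochSeconds))
                .array();
    }
}
```

Storing only a small offset per data point instead of a full timestamp is what makes the layout compact, and all points of one metric and hour end up in the same row.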

FIFO compaction

First-In-First-Out
No compaction at all
TTL-expired data simply gets archived
Ideal for raw data storage (minimum IO overhead)
No compaction, so no block cache thrashing
Sustains hundreds of MB/s of write throughput per RegionServer (RS)
Available in HBase 0.98.17, 1.2+, and HDP-2.4+
Refer to https://issues.apache.org/jira/browse/HBASE-14468 for usage and configuration
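The idea behind FIFO compaction can be illustrated with a small simulation: a store file becomes eligible for removal once its newest cell is older than the TTL, and nothing is ever rewritten. This is a sketch of the principle only, not the code of HBase's FIFOCompactionPolicy; the `StoreFile` class and the `expired` method are hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.List;

public class FifoSketch {
    /** Minimal stand-in for a store file: a name and its newest cell timestamp. */
    public static final class StoreFile {
        final String name;
        final long maxTimestampMs;

        public StoreFile(String name, long maxTimestampMs) {
            this.name = name;
            this.maxTimestampMs = maxTimestampMs;
        }
    }

    /**
     * Returns the names of files whose newest cell is already past the TTL.
     * Such files contain only expired data and can be dropped whole,
     * without reading or rewriting any other file.
     */
    public static List<String> expired(List<StoreFile> files, long nowMs, long ttlMs) {
        long cutoff = nowMs - ttlMs;
        List<String> out = new ArrayList<>();
        for (StoreFile f : files) {
            if (f.maxTimestampMs < cutoff) {
                out.add(f.name);
            }
        }
        return out;
    }
}
```

Since files are only ever dropped and never merged, the IO cost is limited to the original flush, which is why this approach suits write-heavy raw data with a TTL.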

Exploring Compaction + Max Size

Set hbase.hstore.compaction.max.size to some appropriate value (say 500MB). With the default region size of 10GB, this results in a maximum of 20 store files per region.
This helps preserve the temporal locality of data: data points that are close in time are stored in the same file, and distant ones in separate files.
This compaction works better with the block cache
More efficient caching of recent data is possible
Good for the most-recent-most-valuable data access pattern.
Use it for compressed and aggregated data
Helps to keep recent data in the block cache.
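The store file estimate above is simple arithmetic, sketched here as a back-of-the-envelope helper. It treats 10GB as 10,000MB, as the round numbers in the text do, and assumes files stop growing once they reach the configured maximum size; the class and method names are illustrative.

```java
public class StoreFileEstimate {
    /**
     * Rough upper bound on store files per region when files stop being
     * compacted at maxCompactionSizeMb. Uses ceiling division so that a
     * partially filled last file is counted.
     */
    public static long maxStoreFiles(long regionSizeMb, long maxCompactionSizeMb) {
        if (regionSizeMb < 0 || maxCompactionSizeMb <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        return (regionSizeMb + maxCompactionSizeMb - 1) / maxCompactionSizeMb;
    }
}
```

For a 10,000MB region and a 500MB limit this gives 20 files. Raising the limit lowers the file count at the cost of coarser temporal separation between files.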

Efficient time-series applications in HBase