
Why did HDFS disk utilization skyrocket from a few hundred GB to over 7 TB in a matter of hours while ingesting data into HBase via OpenTSDB?

Contributor

Background:

Customer has an 8-node cluster on AWS with ephemeral storage, 5 of the nodes running HBase.

OpenTSDB and Grafana were installed on the cluster as well.

Customer was ingesting time-series data with OpenTSDB at a rate of ~50k records/second.

Symptom:

In the span of a couple of hours, HDFS disk utilization skyrocketed from a few hundred GB to over 6 TB, all of it in HBase/OpenTSDB.

While troubleshooting, turning off all data ingest and stopping OpenTSDB so that only HBase was running did not help: disk utilization continued to grow unabated and out of control, dozens of GB per minute, even with OpenTSDB completely shut down.
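For anyone tracking down similar growth, overall HDFS usage over time can be watched with a standard command (a generic suggestion, not something taken from the original report):

hdfs dfsadmin -report | grep "DFS Used"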

1 ACCEPTED SOLUTION

Contributor

Root Cause -

1. Millisecond precision was used for the OpenTSDB metrics, which can generate over 32,000 data points in one hour. Each millisecond-precision column qualifier uses 4 bytes, so when OpenTSDB compacts the row, the merged column size can exceed 128 KB (hfile.index.block.max.size).

2. If the size of (rowkey + columnfamily:qualifier) is greater than hfile.index.block.max.size, the memstore flush can enter an infinite loop while writing the HFile index.

That is why compaction hangs, the .tmp folder of the affected regions on HDFS keeps growing, and the region server eventually goes down.

When HBase starts, it recreates that huge file in a .tmp directory in one of the region subdirectories under the tsdb directory.
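As a quick sanity check on those numbers (simple arithmetic, not taken from the original post), the 128 KB default is crossed at roughly 32K millisecond-precision data points in a single row:

# 4-byte qualifiers per ms-precision data point vs. the 128 KB default index block size
echo $(( 128 * 1024 / 4 ))    # 32768 data points is where the merged qualifier reaches 128 KB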

Solution -

  • 1. Shut down HBase.

  • 2. Found the .tmp disk usage by HBase under the OpenTSDB tables and deleted those .tmp directories completely.

  • 3. If the parent directory contains a directory called recovered.edits, delete the recovered.edits directory or rename it to something like recovered.edits.bak.

  • 4. Modified hbase-site.xml in Ambari and increased hfile.index.block.max.size to 1024 KB (from the default of 128 KB).

  • 5. Restarted HBase, followed by OpenTSDB.

The cluster was immediately stable (no more runaway disk usage), and the old data could be seen in OpenTSDB and viewed in Grafana without issue.

The data flow was turned back on and everything appears to be working normally again.

Further Solution Details -

  • Here is what the OpenTSDB table structure on HDFS looked like before the solution was applied:

5.7 G /apps/hbase/data/data/default/tsdb/08bfcc080d15d1127a0ebe664fdb1d80

5.0 G /apps/hbase/data/data/default/tsdb/0a55f4589b4e4bc9d7f71957a5795b4f

7.9 G /apps/hbase/data/data/default/tsdb/10066f9f83ac300e955ab9d0129ebf22

3.6 G /apps/hbase/data/data/default/tsdb/1cbf2fbf04b1e276b3de3615b95dc68f

5.5 G /apps/hbase/data/data/default/tsdb/1ed939038b8902a9c391d6d6d5a519f4

9.2 G /apps/hbase/data/data/default/tsdb/25b4dadf6621b09a63d2b1b9401203b9

6.5 G /apps/hbase/data/data/default/tsdb/2919b649b3b4a027ce8aece9a3e5ffd9

967.2 G /apps/hbase/data/data/default/tsdb/2bb4cdaaf052abe6eef753470303f099

4.6 G /apps/hbase/data/data/default/tsdb/39ab2524f8aaf2ee5685fc85a8dc1543

1022.7 M /apps/hbase/data/data/default/tsdb/39e3ce75e7225805534595c8c7e03305

4.1 G /apps/hbase/data/data/default/tsdb/4356d552eacfa526df24b400fd8007c7

9.9 G /apps/hbase/data/data/default/tsdb/4f2c1cc6d7c650c3f822136614921076

5.9 G /apps/hbase/data/data/default/tsdb/57eb8cbb2099bd6e1746cd4c8e007207

6.8 G /apps/hbase/data/data/default/tsdb/5e26da2eacba074a132edefca38017a3

1.2 T /apps/hbase/data/data/default/tsdb/69fc543c6f94891d3071005294c3c116

4.1 G /apps/hbase/data/data/default/tsdb/6ccc256f9721216bc9afa29e7d056bd4

7.5 G /apps/hbase/data/data/default/tsdb/6d43c524d221f3f54356e716c8f8849d

6.1 G /apps/hbase/data/data/default/tsdb/70f9f9ee045cde2823cf8ab485662a63

3.2 G /apps/hbase/data/data/default/tsdb/75f232ce81d5de2efb3f763d09d9c76f

8.7 G /apps/hbase/data/data/default/tsdb/7b4f9c05d64151d3f54558c70e5e9811

5.5 G /apps/hbase/data/data/default/tsdb/7fa9d913fd9a059733e9bb7a31b03e22

8.5 G /apps/hbase/data/data/default/tsdb/840f9f977262f1fbf9f4c06a8014c44b

3.0 G /apps/hbase/data/data/default/tsdb/9a3e7dbad294134eae934af980dd8c1c

6.5 G /apps/hbase/data/data/default/tsdb/a161383c9ae7df3c0cb15da093312908

4.0 G /apps/hbase/data/data/default/tsdb/b5e22e68e8f93ac50c9d9d3a3ca3f029

7.1 G /apps/hbase/data/data/default/tsdb/c9a9f105e8c4a44e8fd9172131b929d0

4.7 G /apps/hbase/data/data/default/tsdb/cc27e8c5a020291b7b3ac010dc50e25b

5.6 G /apps/hbase/data/data/default/tsdb/cc7f6645bac92ec514f8545fdb39b617

9.3 G /apps/hbase/data/data/default/tsdb/ce6426bfac53fb06fe8d320f3de150ee

1.8 G /apps/hbase/data/data/default/tsdb/d2ee226094556e6a90599e91bcba70f4

6.8 G /apps/hbase/data/data/default/tsdb/df6e88e27be5d3f0759d477812ab9277

3.0 G /apps/hbase/data/data/default/tsdb/efde9cca49c0f23a4e39e80e4040ac5a

5.9 G /apps/hbase/data/data/default/tsdb/f82790d90791b30aacd2bd990a1d4655

7.5 G /apps/hbase/data/data/default/tsdb/fcbb1b8f04f4e74a80882ef074244173

4.8 G /apps/hbase/data/data/default/tsdb/fece7565715c791028581022b70672e7
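For reference, a per-region listing like the one above can be produced with the standard HDFS disk-usage command (the exact command used is not shown in the original post):

hadoop fs -du -h /apps/hbase/data/data/default/tsdb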

  • Attacked the two largest offenders and found that all of the space was in the .tmp folder; deleting it recovered all of the lost disk space.

hadoop fs -rm -R -skipTrash /apps/hbase/data/data/default/tsdb/69fc543c6f94891d3071005294c3c116/.tmp

hadoop fs -rm -R -skipTrash /apps/hbase/data/data/default/tsdb/69fc543c6f94891d3071005294c3c116/recovered.edits

hadoop fs -rm -R -skipTrash /apps/hbase/data/data/default/tsdb/2bb4cdaaf052abe6eef753470303f099/.tmp

hadoop fs -rm -R -skipTrash /apps/hbase/data/data/default/tsdb/2bb4cdaaf052abe6eef753470303f099/recovered.edits
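A quick way to confirm that the space actually came back (standard HDFS commands, not part of the original write-up) is to compare the table's total usage and the datanode report before and after the cleanup:

hadoop fs -du -s -h /apps/hbase/data/data/default/tsdb

hdfs dfsadmin -report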

  • Then went into the Ambari configuration for HBase and added a custom hbase-site setting that increases hfile.index.block.max.size from the default of 128 KB to 1024 KB.
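For reference, this is a sketch of what the equivalent hbase-site.xml property looks like once Ambari pushes the change out. The value is specified in bytes; 1048576 (1024 KB) is an assumption based on the "1024kb" figure above, since the screenshot of the actual Ambari entry did not survive:

<property>
  <name>hfile.index.block.max.size</name>
  <!-- default is 131072 bytes (128 KB); raised here to 1024 KB (assumed value) -->
  <value>1048576</value>
</property>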

  • Will keep monitoring and will report any further anomalies.


