Created 12-17-2015 01:41 PM
Background:
The customer has an 8-node cluster on AWS with ephemeral storage; 5 of the nodes run HBase.
OpenTSDB and Grafana were installed on the cluster as well.
The customer was ingesting time series data with OpenTSDB at a rate of ~50k records/second.
Symptom:
Within a couple of hours, HDFS disk utilization skyrocketed from a few hundred GB to over 6 TB, all of it in HBase / OpenTSDB.
During troubleshooting, all data ingest was turned off and OpenTSDB was stopped, leaving only HBase running. Disk utilization continued to grow unabated at dozens of GB per minute, even with OpenTSDB completely shut down.
Created 12-17-2015 04:22 PM
Root Cause -
1. Millisecond timestamps were used in the OpenTSDB metrics, which may generate over 32000 data points (columns) in one hour. Each column of a millisecond metric uses 4 bytes, so when the columns are merged during compaction, the combined column size may exceed 128 KB (hfile.index.block.max.size).
2. If the size of (rowkey + columnfamily:qualifier) is greater than hfile.index.block.max.size, the memstore flush may loop infinitely while writing the HFile index.
…
That is why the compaction hangs, the .tmp folder of the affected regions on HDFS keeps growing, and the region server eventually goes down.
When HBase starts, it creates that huge file in a .tmp directory in one of the subdirectories under the tsdb directory.
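The arithmetic behind point 1 can be sketched in a few lines of shell. This is only a rough illustration: the 4 bytes per column and the 128 KB limit are the figures quoted above, the function names are made up for this example, and the row key and column family overhead from point 2 is ignored.

```shell
#!/usr/bin/env bash
# Rough arithmetic only: 4 bytes per column is the figure quoted above,
# and 131072 bytes is the 128 KB default of hfile.index.block.max.size.
BYTES_PER_COLUMN=4
INDEX_BLOCK_MAX=131072

# Print the combined size in bytes of N compacted columns.
compacted_bytes() {
  echo $(( $1 * BYTES_PER_COLUMN ))
}

# Succeed (exit 0) when N columns would exceed the index block limit.
exceeds_index_block() {
  [ "$(compacted_bytes "$1")" -gt "$INDEX_BLOCK_MAX" ]
}

for n in 3600 32768 33000; do
  if exceeds_index_block "$n"; then
    echo "$n columns -> $(compacted_bytes "$n") bytes: over the limit"
  else
    echo "$n columns -> $(compacted_bytes "$n") bytes: within the limit"
  fi
done
```

With one point per second (3600 columns per hour) the row stays far below the limit, while anything beyond 32768 columns crosses it, which matches the "over 32000" figure in the explanation.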
Solution -
The cluster was immediately stable (disk usage stopped growing), and old data could be seen in OpenTSDB and viewed in Grafana without issue.
Data flow was turned back on and everything appears to be working normally again.
Further Solution Details -
5.7 G /apps/hbase/data/data/default/tsdb/08bfcc080d15d1127a0ebe664fdb1d80
5.0 G /apps/hbase/data/data/default/tsdb/0a55f4589b4e4bc9d7f71957a5795b4f
7.9 G /apps/hbase/data/data/default/tsdb/10066f9f83ac300e955ab9d0129ebf22
3.6 G /apps/hbase/data/data/default/tsdb/1cbf2fbf04b1e276b3de3615b95dc68f
5.5 G /apps/hbase/data/data/default/tsdb/1ed939038b8902a9c391d6d6d5a519f4
9.2 G /apps/hbase/data/data/default/tsdb/25b4dadf6621b09a63d2b1b9401203b9
6.5 G /apps/hbase/data/data/default/tsdb/2919b649b3b4a027ce8aece9a3e5ffd9
967.2 G /apps/hbase/data/data/default/tsdb/2bb4cdaaf052abe6eef753470303f099
4.6 G /apps/hbase/data/data/default/tsdb/39ab2524f8aaf2ee5685fc85a8dc1543
1022.7 M /apps/hbase/data/data/default/tsdb/39e3ce75e7225805534595c8c7e03305
4.1 G /apps/hbase/data/data/default/tsdb/4356d552eacfa526df24b400fd8007c7
9.9 G /apps/hbase/data/data/default/tsdb/4f2c1cc6d7c650c3f822136614921076
5.9 G /apps/hbase/data/data/default/tsdb/57eb8cbb2099bd6e1746cd4c8e007207
6.8 G /apps/hbase/data/data/default/tsdb/5e26da2eacba074a132edefca38017a3
1.2 T /apps/hbase/data/data/default/tsdb/69fc543c6f94891d3071005294c3c116
4.1 G /apps/hbase/data/data/default/tsdb/6ccc256f9721216bc9afa29e7d056bd4
7.5 G /apps/hbase/data/data/default/tsdb/6d43c524d221f3f54356e716c8f8849d
6.1 G /apps/hbase/data/data/default/tsdb/70f9f9ee045cde2823cf8ab485662a63
3.2 G /apps/hbase/data/data/default/tsdb/75f232ce81d5de2efb3f763d09d9c76f
8.7 G /apps/hbase/data/data/default/tsdb/7b4f9c05d64151d3f54558c70e5e9811
5.5 G /apps/hbase/data/data/default/tsdb/7fa9d913fd9a059733e9bb7a31b03e22
8.5 G /apps/hbase/data/data/default/tsdb/840f9f977262f1fbf9f4c06a8014c44b
3.0 G /apps/hbase/data/data/default/tsdb/9a3e7dbad294134eae934af980dd8c1c
6.5 G /apps/hbase/data/data/default/tsdb/a161383c9ae7df3c0cb15da093312908
4.0 G /apps/hbase/data/data/default/tsdb/b5e22e68e8f93ac50c9d9d3a3ca3f029
7.1 G /apps/hbase/data/data/default/tsdb/c9a9f105e8c4a44e8fd9172131b929d0
4.7 G /apps/hbase/data/data/default/tsdb/cc27e8c5a020291b7b3ac010dc50e25b
5.6 G /apps/hbase/data/data/default/tsdb/cc7f6645bac92ec514f8545fdb39b617
9.3 G /apps/hbase/data/data/default/tsdb/ce6426bfac53fb06fe8d320f3de150ee
1.8 G /apps/hbase/data/data/default/tsdb/d2ee226094556e6a90599e91bcba70f4
6.8 G /apps/hbase/data/data/default/tsdb/df6e88e27be5d3f0759d477812ab9277
3.0 G /apps/hbase/data/data/default/tsdb/efde9cca49c0f23a4e39e80e4040ac5a
5.9 G /apps/hbase/data/data/default/tsdb/f82790d90791b30aacd2bd990a1d4655
7.5 G /apps/hbase/data/data/default/tsdb/fcbb1b8f04f4e74a80882ef074244173
4.8 G /apps/hbase/data/data/default/tsdb/fece7565715c791028581022b70672e7
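In the listing above, two regions (967.2 G and 1.2 T) stand far apart from the rest, which are all under 10 G. A minimal sketch for spotting such outliers automatically follows. It assumes the plain byte output of `hadoop fs -du` (without `-h`), with the size in the first field and the path in the last; the function name and the 100 GiB cutoff are hypothetical choices for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `hadoop fs -du <table dir>` output on stdin
# (size in bytes first, path last) and prints regions above a threshold.
THRESHOLD_BYTES=$(( 100 * 1024 * 1024 * 1024 ))  # assumed cutoff: 100 GiB

flag_large_regions() {
  awk -v limit="$THRESHOLD_BYTES" '
    $1 ~ /^[0-9]+$/ && $1 + 0 > limit + 0 {
      printf "%.1f GiB\t%s\n", $1 / (1024 * 1024 * 1024), $NF
    }'
}

# Example usage on a live cluster (not run here):
#   hadoop fs -du /apps/hbase/data/data/default/tsdb | flag_large_regions
```

Reading from stdin keeps the helper independent of the cluster, so it can be tried against a saved copy of the `du` output before being pointed at HDFS.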
hadoop fs -rm -R -skipTrash /apps/hbase/data/data/default/tsdb/69fc543c6f94891d3071005294c3c116/.tmp
hadoop fs -rm -R -skipTrash /apps/hbase/data/data/default/tsdb/69fc543c6f94891d3071005294c3c116/recovered.edits
hadoop fs -rm -R -skipTrash /apps/hbase/data/data/default/tsdb/2bb4cdaaf052abe6eef753470303f099/.tmp
hadoop fs -rm -R -skipTrash /apps/hbase/data/data/default/tsdb/2bb4cdaaf052abe6eef753470303f099/recovered.edits
The configuration change increases hfile.index.block.max.size from its default of 128 KB to 1024 KB.
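For reference, here is a sketch of what the corresponding hbase-site.xml entry might look like, generated from shell. The post does not show how the setting was applied, so the helper names and the choice of hbase-site.xml as the target are assumptions; the value is written in bytes, the same unit as the 131072 (128 KB) default.

```shell
#!/usr/bin/env bash
# Convert a size in KB to bytes, the unit used by the 128 KB default (131072).
kb_to_bytes() {
  echo $(( $1 * 1024 ))
}

# Print an hbase-site.xml <property> element for the given name and value.
render_property() {
  printf '<property>\n  <name>%s</name>\n  <value>%s</value>\n</property>\n' "$1" "$2"
}

# 1024 KB expressed in bytes is 1048576.
render_property hfile.index.block.max.size "$(kb_to_bytes 1024)"
```

On a managed cluster the same name and value would more likely be entered through the management UI than pasted into the file by hand.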
Created 07-26-2016 08:21 PM
This issue is tracked on https://issues.apache.org/jira/browse/HBASE-16288