Community Articles

vsundaram · ‎05-11-2017

The logs and HDFS usage output indicate that the growth is related to HBase snapshots. We can confirm this by checking whether all the snapshots have the same timestamp. The "list_snapshots" command in the HBase shell produces output like the following.

 hbase> list_snapshots 
 SYSTEM.CATALOG-ru-20160512 SYSTEM.CATALOG (Thu May 12 01:47:24 +0000 2016)
 SYSTEM.FUNCTION-ru-20160512 SYSTEM.FUNCTION (Thu May 12 01:47:24 +0000 2016)
 SYSTEM.SEQUENCE-ru-20160512 SYSTEM.SEQUENCE (Thu May 12 01:47:24 +0000 2016)
 SYSTEM.STATS-ru-20160512 SYSTEM.STATS (Thu May 12 01:47:32 +0000 2016)
 US_1-ru-20160512 US_1 (Thu May 12 01:47:32 +0000 2016)
 ambarismoketest-ru-20160512 ambarismoketest (Thu May 12 01:47:32 +0000 2016)
 dev.hadoop-ru-20160512 dev.hadoop (Thu May 12 01:47:33 +0000 2016)
 prod.hadoop-ru-20160512 prod.hadoop (Thu May 12 01:47:35 +0000 2016)
 compact.daily-ru-20160512 compact.daily (Thu May 12 01:47:43 +0000 2016)
 compact.hourly-ru-20160512 compact.hourly (Thu May 12 01:47:43 +0000 2016)
 test-ru-20160512 test (Thu May 12 01:47:43 +0000 2016)
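
With many snapshots it can be easier to count them per creation minute than to compare timestamps by eye. The following is a minimal sketch, assuming the "list_snapshots" output has the same layout as the listing above (a parenthesised timestamp at the end of each line); the function name snapshots_per_minute is hypothetical and not part of HBase.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads list_snapshots output on stdin and prints
# "<count> <year> <month> <day> <HH:MM>" for each creation minute.
snapshots_per_minute() {
  awk '
    # Only lines that contain a "(...)" timestamp group are snapshot rows.
    match($0, /\(.*\)/) {
      ts = substr($0, RSTART + 1, RLENGTH - 2)   # text between the parentheses
      split(ts, f, " ")                          # weekday month day time zone year
      split(f[4], t, ":")                        # HH MM SS
      key = f[6] " " f[2] " " f[3] " " t[1] ":" t[2]
      count[key]++
    }
    END { for (k in count) print count[k], k }
  ' | sort -rn
}
```

Assuming the shell on your version supports non-interactive mode, it could be fed with something like: echo "list_snapshots" | hbase shell -n | snapshots_per_minute. A large count concentrated in one or two adjacent minutes points to snapshots taken in a single batch.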

We can also confirm the timestamps of these snapshots from the output of "hdfs dfs -ls -R /apps/hbase/".

drwxr-xr-x   - hbase      hdfs            0 2016-05-12 01:58 /apps/hbase/data/.hbase-snapshot
drwxr-xr-x   - hbase      hdfs            0 2016-05-12 01:47 /apps/hbase/data/.hbase-snapshot/SYSTEM.CATALOG-ru-20160512
-rw-r--r--   3 hbase      hdfs           55 2016-05-12 01:47 /apps/hbase/data/.hbase-snapshot/SYSTEM.CATALOG-ru-20160512/.snapshotinfo
-rw-r--r--   3 hbase      hdfs          972 2016-05-12 01:47 /apps/hbase/data/.hbase-snapshot/SYSTEM.CATALOG-ru-20160512/data.manifest
drwxr-xr-x   - hbase      hdfs            0 2016-05-12 01:47 /apps/hbase/data/.hbase-snapshot/SYSTEM.FUNCTION-ru-20160512
-rw-r--r--   3 hbase      hdfs           57 2016-05-12 01:47 /apps/hbase/data/.hbase-snapshot/SYSTEM.FUNCTION-ru-20160512/.snapshotinfo
-rw-r--r--   3 hbase      hdfs         1064 2016-05-12 01:47 /apps/hbase/data/.hbase-snapshot/SYSTEM.FUNCTION-ru-20160512/data.manifest
drwxr-xr-x   - hbase      hdfs            0 2016-05-12 01:47 /apps/hbase/data/.hbase-snapshot/SYSTEM.SEQUENCE-ru-20160512
-rw-r--r--   3 hbase      hdfs           57 2016-05-12 01:47 /apps/hbase/data/.hbase-snapshot/SYSTEM.SEQUENCE-ru-20160512/.snapshotinfo
-rw-r--r--   3 hbase      hdfs        16813 2016-05-12 01:47 /apps/hbase/data/.hbase-snapshot/SYSTEM.SEQUENCE-ru-20160512/data.manifest
drwxr-xr-x   - hbase      hdfs            0 2016-05-12 01:47 /apps/hbase/data/.hbase-snapshot/SYSTEM.STATS-ru-20160512
-rw-r--r--   3 hbase      hdfs           51 2016-05-12 01:47 /apps/hbase/data/.hbase-snapshot/SYSTEM.STATS-ru-20160512/.snapshotinfo
-rw-r--r--   3 hbase      hdfs          928 2016-05-12 01:47 /apps/hbase/data/.hbase-snapshot/SYSTEM.STATS-ru-20160512/data.manifest
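
The recursive listing is long, so it can help to reduce it to one line per snapshot directory. This is a minimal sketch, assuming the standard eight-column "hdfs dfs -ls" layout shown above and the /apps/hbase/data/.hbase-snapshot location from this cluster; the function name snapshot_dirs is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads "hdfs dfs -ls -R" output on stdin and prints
# "<date> <time> <snapshot name>" for each snapshot directory.
snapshot_dirs() {
  awk '
    # Directories only (mode starts with "d"), exactly one level below .hbase-snapshot.
    $1 ~ /^d/ && $NF ~ "/\\.hbase-snapshot/[^/]+$" {
      n = split($NF, parts, "/")
      print $6, $7, parts[n]
    }'
}
```

For example: hdfs dfs -ls -R /apps/hbase/data/.hbase-snapshot | snapshot_dirs. If every line shows the same date and nearly the same time, the snapshots were created together.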

HBase snapshots are created as part of the HDP upgrade process: during the upgrade, the "hbase_upgrade.py" script triggers the "snapshot_all" command, which is why all the snapshots have the same timestamp. Initially each snapshot is only a reference to the files of the original table. As jobs run after the upgrade or data is inserted into HBase tables, the tables change, and the files the snapshots still reference have to be retained so that each snapshot preserves its original state. This causes a gradual increase in the space held by snapshots, and hence in HDFS usage. It is safe to delete these HBase snapshots, since they are only references to the original HBase tables. Deleting the snapshots also clears up the respective archive files. Do not delete files from the archive directly, or the snapshots will be corrupted.
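
One cautious way to remove the upgrade snapshots is to generate the "delete_snapshot" commands first and review them before running anything. The sketch below assumes the upgrade snapshots can be recognised by the "-ru-20160512" style name suffix seen in the listing above; the function name gen_delete_commands is hypothetical, and it only prints commands without deleting anything.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads list_snapshots output on stdin and prints one
# delete_snapshot command per snapshot whose name ends with the given suffix.
# Usage: gen_delete_commands -ru-20160512
gen_delete_commands() {
  awk -v suffix="$1" '
    # The first field is the snapshot name; keep names that end with the suffix.
    NF >= 2 && length($1) >= length(suffix) &&
    substr($1, length($1) - length(suffix) + 1) == suffix {
      printf "delete_snapshot \047%s\047\n", $1   # \047 is a single quote
    }'
}
```

The printed commands can be saved to a file, checked, and then pasted into the HBase shell. Snapshots that do not carry the upgrade suffix, such as ones taken manually, are left out of the generated list.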

Cloudera Community

HDFS capacity usage doubled after HDP upgrade

Apache Hadoop

Apache HBase