08-02-2017 06:58 AM
We noticed a ridiculous improvement in a table's size after its snapshot was exported to another cluster.
old-cluster - table is 145.8 GB
new-cluster - table is 54.8 MB
The two clusters are configured similarly: same number of region servers, etc. The numbers above reflect table size not including replication.
One of the reasons we are moving to the new-cluster is for additional storage, because our data is growing so quickly and we have found that HBase requires 50% free space to do any compaction. I assume some, if not all, of the difference in table size is due to compaction, but I'm surprised at the huge difference in sizes and am wondering whether there is something we could do on the working cluster to avoid such wasted space. Note that we load data in bulk once a day; it is not updated in real time.
I realize this post is a little obtuse. My apologies.
08-02-2017 09:53 AM
Cloudera said I had a new ranking "idiot". My apologies for not realizing this sooner. I was doing a
hdfs dfs -du -h -s hdfs://server:8020/hbase/data/default/*
and expected to see the size of the "restored" snapshot, when apparently that directory only holds information pointing to /hbase/archive/data/default. So, for anyone who is interested: if you are using snapshots, it looks like you need to look in BOTH directories and add them together to find the REAL table size.
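For what it's worth, here is a rough sketch of adding the two directories together. It assumes the first column of `hdfs dfs -du -s` output (run without `-h`) is the size in bytes; the helper name `sum_du_bytes` and the `TABLE` variable are made up for illustration, and the archive directory may also hold files that no live table or snapshot still references, so treat the total as an upper bound.

```shell
#!/bin/sh
# Hypothetical helper: sum the first column (bytes) of
# `hdfs dfs -du -s` output read from stdin.
sum_du_bytes() {
    # %.0f avoids integer overflow in awk implementations with 32-bit %d
    awk '{ total += $1 } END { printf "%.0f\n", total }'
}

# Example usage (assumption: TABLE holds the table name, and the
# namenode address matches the command above):
#
#   hdfs dfs -du -s \
#       "hdfs://server:8020/hbase/data/default/$TABLE" \
#       "hdfs://server:8020/hbase/archive/data/default/$TABLE" \
#       | sum_du_bytes
```

The result is in bytes, without replication, so it should be comparable to the figures quoted at the top of the thread.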
Cloudera didn't really call me that :)
08-02-2017 12:47 PM
Thanks for the laugh @keeblerh. Don't be so hard on yourself and thanks for sharing the solution. :)