Created on 09-08-2016 07:24 PM
HBase supports several compression algorithms that can be enabled on a ColumnFamily. Data block encoding attempts to limit duplication of information in keys, taking advantage of some of the fundamental designs and patterns of HBase, such as sorted row keys and the schema of a given table. Compressors reduce the size of large, opaque byte arrays in cells and can significantly reduce the storage space needed compared to storing the data uncompressed.
Compressors and data block encoding can be used together on the same ColumnFamily.
Changes take effect upon compaction: if you change compression or encoding for a ColumnFamily, the change is applied to existing data during the next compaction.
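For example, a minimal HBase shell sketch (the table and family names here are made up) that changes a ColumnFamily's compression and then forces a major compaction so existing data is rewritten with the new codec:
hbase> alter 'mytable', { NAME => 'cf1', COMPRESSION => 'GZ' }
hbase> major_compact 'mytable'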
Some codecs take advantage of capabilities built into Java, such as GZip compression. Others rely on native libraries. Native libraries may be available as part of Hadoop, such as LZ4. In this case, HBase only needs access to the appropriate shared library. Other codecs, such as Google Snappy, need to be installed first. Some codecs are licensed in ways that conflict with HBase's license and cannot be shipped as part of HBase.
This article discusses common codecs that are used and tested with HBase. Whatever codec you use, be sure to test that it is installed correctly and is available on all nodes in your cluster.
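One way to script that per-node check, assuming passwordless SSH and a region_servers.txt file listing your nodes (both are assumptions here; the CompressionTest utility itself is shown later in this article):
$ for h in $(cat region_servers.txt); do ssh "$h" "hbase org.apache.hadoop.hbase.util.CompressionTest file:///tmp/snappy-check snappy"; done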
Block Compressors
Unfortunately, HBase cannot ship with LZO because of licensing issues: HBase is Apache-licensed, while LZO is GPL. Therefore, LZO must be installed after HBase is installed. See the Using LZO Compression wiki page for how to make LZO work with HBase.
A common problem users run into with LZO is that while the initial cluster setup goes smoothly, a month later a sysadmin adds a machine to the cluster and forgets to do the LZO fixup on the new machine. In versions since HBase 0.90.0, HBase should fail in a way that makes the problem plain, but that is not guaranteed.
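One way to fail fast in that situation is to list the codecs a RegionServer must be able to load in hbase-site.xml via the hbase.regionserver.codecs property; if any codec in the list cannot be loaded, the RegionServer refuses to start (the codec list below is just an example):
<property>
  <name>hbase.regionserver.codecs</name>
  <value>lzo,snappy</value>
</property>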
GZIP will generally compress better than LZO, though more slowly. For some setups, better compression may be preferred. HBase will use Java's built-in GZIP unless the native Hadoop libraries are available on the CLASSPATH, in which case it will use the native compressors instead. (If the native libs are NOT present, you will see lots of "Got brand-new compressor" reports in your logs.)
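If your Hadoop build ships the native libraries, you can check which native codecs are loadable with:
$ hadoop checknative -a
which reports the availability of the hadoop native library along with zlib, snappy, lz4, and bzip2, among others.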
If Snappy is installed, HBase can make use of it (courtesy of hadoop-snappy).
$ hbase org.apache.hadoop.hbase.util.CompressionTest hdfs://host/path/to/hbase snappy
hbase> create 't1', { NAME => 'cf1', COMPRESSION => 'SNAPPY' }
hbase> describe 't1'
In the output of the describe command, ensure that it lists COMPRESSION => 'SNAPPY'.
You will find the Snappy library file under the .libs directory of your Snappy build (for example, /home/hbase/snappy-1.0.5/.libs/). The file is called libsnappy.so.1.x.x, where 1.x.x is the version of the Snappy code you built. You can either copy this file into your HBase directory under the name libsnappy.so, or simply create a symbolic link to it.
The second file you need is the Hadoop native library. You will find it in your Hadoop installation directory under lib/native/Linux-amd64-64/ or lib/native/Linux-i386-32/. The file you are looking for is libhadoop.so.1.x.x. Again, you can simply copy this file or link to it under the name libhadoop.so.
At the end of the installation, you should have both libsnappy.so and libhadoop.so (as links or files) present in lib/native/Linux-amd64-64 or lib/native/Linux-i386-32.
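A sketch of that linking step on a 64-bit Linux box (the Snappy build path is the example path from above, and 1.x.x is a placeholder for your actual versions):
$ cd /pathtoyourhadoop/lib/native/Linux-amd64-64
$ ln -s /home/hbase/snappy-1.0.5/.libs/libsnappy.so.1.x.x libsnappy.so
$ ln -s libhadoop.so.1.x.x libhadoop.so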
To point HBase at Snappy support, set the following in hbase-env.sh:
export HBASE_LIBRARY_PATH=/pathtoyourhadoop/lib/native/Linux-amd64-64
In /pathtoyourhadoop/lib/native/Linux-amd64-64 you should have something like:
libsnappy.a libsnappy.so libsnappy.so.1 libsnappy.so.1.1.2
Data Block Encoders
Often, keys in HBase are very similar to one another. Specifically, keys often share a common prefix and only differ near the end. For instance, one key might be RowKey:Family:Qualifier0 and the next key might be RowKey:Family:Qualifier1. In Prefix encoding, an extra column is added which holds the length of the prefix shared between the current key and the previous key. Assuming the first key here is totally different from the key before it, its prefix length is 0. The second key's prefix length is 23, since the two keys have their first 23 characters in common.
Obviously, if the keys tend to have nothing in common, Prefix encoding will not provide much benefit.
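For example, a minimal HBase shell sketch (hypothetical table and family names) that enables Prefix encoding on a ColumnFamily:
hbase> create 't2', { NAME => 'cf1', DATA_BLOCK_ENCODING => 'PREFIX' }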
If you have long keys (compared to the values) or many columns, use a prefix encoder; FAST_DIFF is recommended.
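A sketch of enabling it on an existing table (hypothetical names again):
hbase> alter 't1', { NAME => 'cf1', DATA_BLOCK_ENCODING => 'FAST_DIFF' }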
Sorry, this post is a few months old, but the above sentence means it is recommended to use FAST_DIFF over PREFIX (not PREFIX_TREE), right?