Created on 07-07-2016 06:03 PM
We have a Hive table over an HBase table, and let's say a few of its columns have the INT datatype, with data loaded from Hive. Deleting data based on the values in one of those INT columns is not directly possible. This is because the values are stored in binary form; even an HBase API filter (SingleColumnValueFilter) returns wrong results if we query that column's values from HBase.
Problem to solve: how to purge Hive INT-datatype column data from HBase?
This is the first, textual part of the series, containing the resolution of the above problem. In the next part I'll create a small video of the code running and cover other datatypes too.
In such a scenario we can't use the standard API and are unable to apply filters on binary column values.
Solution: the JRuby program below.
You have probably already heard the many advantages of storing data in HBase (especially in binary block format) and creating a Hive table on top of it to query your data. I am not going to explain the use case for HBase under Hive here; the simple reason is better visibility/representation of the data in tabular format.
I came across this problem a few days back, when we needed to purge HBase data at the end of its retention period and got stuck deleting data from the HBase table using the HBase APIs and filters, because a particular column (or columns) is of INT datatype in Hive. Below is a sample use case.
There are two storage formats for Hive data in HBase:
1. Binary
2. String
Storing data in binary blocks in HBase has its own advantages. The #b/#s suffixes in the hbase.columns.mapping of the Hive DDL below control which format each mapped column uses. Here are scripts to create sample tables in both HBase and Hive:
HBase:
create 'tiny_hbase_table1', 'ck', 'o', {NUMREGIONS => 16, SPLITALGO => 'UniformSplit'}
Hive:
CREATE EXTERNAL TABLE orgdata (
  key INT,
  kingdom STRING,
  kingdomkey INT,
  kongo STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key#b,o:kingdom#s,o:kingdomKey#b,o:kongo#b")
TBLPROPERTIES (
  "hbase.table.name" = "tiny_hbase_table1",
  "hbase.table.default.storage.type" = "binary"
);

insert into orgdata values(1,'London',1001,'victoria secret');
insert into orgdata values(2,'India',1001,'Indira secret');
insert into orgdata values(3,'Saudi Arabia',1001,'Muqrin');
insert into orgdata values(4,'Swaziland',1001,'King Mswati');
hbase(main):080:0> scan 'tiny_hbase_table1'
ROW                COLUMN+CELL
 \x00\x00\x00\x01  column=o:kingdom, timestamp=1467806798430, value=Swaziland
 \x00\x00\x00\x01  column=o:kingdomKey, timestamp=1467806798430, value=\x00\x00\x03\xE9
 \x00\x00\x00\x02  column=o:kingdom, timestamp=1467806928329, value=India
 \x00\x00\x00\x02  column=o:kingdomKey, timestamp=1467806928329, value=\x00\x00\x03\xE9
 \x00\x00\x00\x03  column=o:kingdom, timestamp=1467806933574, value=Saudi Arabia
 \x00\x00\x00\x03  column=o:kingdomKey, timestamp=1467806933574, value=\x00\x00\x03\xE9
 \x00\x00\x00\x04  column=o:kingdom, timestamp=1467807030737, value=Swaziland
 \x00\x00\x00\x04  column=o:kingdomKey, timestamp=1467807030737, value=\x00\x00\x03\xE9
4 row(s) in 0.0690 seconds
Now let's apply an HBase filter; we get no result:
hbase(main):001:0> scan 'tiny_hbase_table1', {FILTER => "PrefixFilter ('\x00\x00\x00\x01')"}
hbase(main):002:0> scan 'tiny_hbase_table1', {FILTER => "PrefixFilter ('1')"}
If we don't know the binary equivalent of an INT column value like kingdomkey, it is not possible to apply a filter.
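One workaround is to compute the binary equivalent yourself. Below is a minimal sketch, run from the HBase shell (which is a JRuby interpreter); the to_java(:int) cast is my assumption, used to force the 4-byte Bytes.toBytes(int) overload rather than the 8-byte long one:

import org.apache.hadoop.hbase.util.Bytes
# Hive stores the INT 1001 as 4 big-endian bytes: \x00\x00\x03\xE9
puts Bytes.toStringBinary(Bytes.toBytes(1001.to_java(:int)))
# => \x00\x00\x03\xE9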
Now you can see that we get wrong results; SingleColumnValueFilter also fails in this scenario, see below:
import org.apache.hadoop.hbase.filter.CompareFilter
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter
import org.apache.hadoop.hbase.filter.SubstringComparator
import org.apache.hadoop.hbase.util.Bytes

scan 'tiny_hbase_table1', {LIMIT => 10,
  FILTER => SingleColumnValueFilter.new(
    Bytes.toBytes('o'),
    Bytes.toBytes('kingdomKey'),
    CompareFilter::CompareOp.valueOf('EQUAL'),
    Bytes.toBytes('1001')),
  COLUMNS => 'o:kingdom'}

ROW                COLUMN+CELL
 \x00\x00\x00\x01  column=o:kingdom, timestamp=1467806798430, value=Swaziland
 \x00\x00\x00\x02  column=o:kingdom, timestamp=1467806928329, value=India
 \x00\x00\x00\x03  column=o:kingdom, timestamp=1467806933574, value=Saudi Arabia
 \x00\x00\x00\x04  column=o:kingdom, timestamp=1467807030737, value=Swaziland
4 row(s) in 0.3640 seconds
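For completeness: once the binary encoding is known, the same filter can match correctly by comparing against the raw 4-byte value instead of the string '1001'. A minimal sketch, assuming a BinaryComparator over the literal bytes (to_java_bytes is JRuby's standard String-to-byte[] helper):

import org.apache.hadoop.hbase.filter.BinaryComparator
# compare against the raw 4-byte encoding of 1001, not the string '1001'
scan 'tiny_hbase_table1', {LIMIT => 10,
  FILTER => SingleColumnValueFilter.new(
    Bytes.toBytes('o'),
    Bytes.toBytes('kingdomKey'),
    CompareFilter::CompareOp.valueOf('EQUAL'),
    BinaryComparator.new("\x00\x00\x03\xE9".to_java_bytes)),
  COLUMNS => 'o:kingdom'}

This only helps when the exact value is known up front; for conditions on readable values, the program below is still the way to go.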
The solution is the JRuby program below. Using it you get proper results, and inside the program you can call the deleteall HBase shell command to delete a candidate record as soon as you find it in the loop:
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.hbase.client.Get
import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.client.ResultScanner
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.util.Bytes
import java.util.ArrayList

def delete_get_some()
  var_table = "tiny_hbase_table1"
  htable = HTable.new(HBaseConfiguration.new, var_table)
  # scan only the o:kingdomKey column
  rs = htable.getScanner(Bytes.toBytes("o"), Bytes.toBytes("kingdomKey"))
  output = ArrayList.new
  output.add "ROW\t\t\t\t\t\tCOLUMN+CELL"
  rs.each { |r|
    r.raw.each { |kv|
      # decode the binary row key and cell value back into readable INTs
      row = Bytes.toInt(kv.getRow)
      ql = Bytes.toString(kv.getQualifier)
      ts = kv.getTimestamp
      val = Bytes.toInt(kv.getValue)
      output.add "#{row} #{ql} #{val}"
    }
  }
  output.each { |line| puts "#{line}\n" }
end

delete_get_some
ROW    COLUMN+CELL
1 kingdomKey 1001
2 kingdomKey 1001
3 kingdomKey 1001
4 kingdomKey 1001
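If you save the program as a .rb file, you can run it non-interactively through the HBase shell; the path here is just an example:

hbase shell /tmp/delete_get_some.rb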
You can declare a variable and apply a custom filter on the decoded values, then delete the row key based on readable values:
myVal = 1001  # example threshold to compare against
if val <= myVal and row.to_s.include? 'likeme^'  # to_s since row was decoded to an INT
  output.add "#{val} #{row} <<<<<<<<<<<<<<<<<<<<<<<<<<- Candidate for deletion"
  deleteall var_table, row
end
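Note that deleteall is an HBase shell command, so the snippet above assumes the script runs inside the shell. If you run it as a plain JRuby program instead, here is a sketch of the equivalent client-API call, placed inside the scan loop where kv and htable are still in scope (the placement is my assumption):

import org.apache.hadoop.hbase.client.Delete
# kv.getRow is the raw binary row key, so the INT key
# never has to be re-encoded by hand before deleting
htable.delete(Delete.new(kv.getRow))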
Hope this solves a problem you are facing too. Let me know if you have any queries or suggestions...