We have a Hive table on top of HBase, and let's say a few of its columns are of INT datatype, with the data loaded from Hive. If we now want to delete data based on the values in one of those INT columns, it is not possible out of the box: the values are converted to binary, and even an HBase API filter (SingleColumnValueFilter) returns wrong results if we query that column's values from HBase.

Problem to solve: How to purge a Hive INT datatype column's data from HBase?

This is the first, textual part of the series, containing the resolution of the above problem. In the next part I'll create a small video running the code and cover other datatypes too.

In such a scenario we can't use the standard API as-is, because filters cannot be applied to binary column values.

Solution :- the JRuby program below.

You have probably already heard the many advantages of storing data in HBase (especially in binary block format) and creating a Hive table on top of it to query the data. I am not going to explain the use case for why we need Hive on top of HBase here; the simple reason is better visibility and representation of the data in tabular format.

I came across this problem a few days back: we needed to purge HBase data once its retention period was over, and we got stuck deleting data from the HBase table using the HBase APIs and filters, because the column(s) in question are of INT data type in Hive. Below is a sample use case:-

There are two storage formats for Hive data in HBase:-

1. Binary

2. String

Storing data as binary blocks in HBase has its own advantages. Below are the scripts to create sample tables in both HBase and Hive:-

HBase:-

create 'tiny_hbase_table1', 'ck', 'o', {NUMREGIONS => 16, SPLITALGO => 'UniformSplit'}

Hive:-

CREATE EXTERNAL TABLE orgdata (
key INT,
kingdom STRING,
kingdomkey INT,
kongo STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key#b,o:kingdom#s,o:kingdomKey#b,o:kongo#b")
TBLPROPERTIES(
"hbase.table.name" = "tiny_hbase_table1",
"hbase.table.default.storage.type" = "binary"
);

insert into orgdata values(1,'London',1001,'victoria secret');
insert into orgdata values(2,'India',1001,'Indira secret');
insert into orgdata values(3,'Saudi Arabia',1001,'Muqrin');
insert into orgdata values(4,'Swaziland',1001,'King Mswati');
hbase(main):080:0> scan 'tiny_hbase_table1' 

ROW  COLUMN+CELL  
\x00\x00\x00\x01        column=o:kingdom, timestamp=1467806798430, value=Swaziland                    
\x00\x00\x00\x01        column=o:kingdomKey, timestamp=1467806798430, value=\x00\x00\x03\xE9            
\x00\x00\x00\x02        column=o:kingdom, timestamp=1467806928329, value=India                          
\x00\x00\x00\x02        column=o:kingdomKey, timestamp=1467806928329, value=\x00\x00\x03\xE9            
\x00\x00\x00\x03        column=o:kingdom, timestamp=1467806933574, value=Saudi Arabia                  
\x00\x00\x00\x03        column=o:kingdomKey, timestamp=1467806933574, value=\x00\x00\x03\xE9            
\x00\x00\x00\x04        column=o:kingdom, timestamp=1467807030737, value=Swaziland                      
\x00\x00\x00\x04        column=o:kingdomKey, timestamp=1467807030737, value=\x00\x00\x03\xE9  
4 row(s) in 0.0690 seconds

Now, when we apply an HBase filter, we get no result:-

hbase(main):001:0> scan 'tiny_hbase_table1', {FILTER => "(PrefixFilter ('\x00\x00\x00\x01'))"}
hbase(main):002:0> scan 'tiny_hbase_table1', {FILTER => "(PrefixFilter ('1'))"}

If we don't know the binary equivalent of the value in an INT column like kingdomkey, it is not possible to apply such a filter.
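For reference, here is a minimal sketch (plain JRuby you can paste into the HBase shell; the helper name int_to_hbase_hex is only illustrative) that shows the 4-byte big-endian encoding an INT gets from Bytes.toBytes(int), i.e. the value you would have to type into such a filter by hand:

# Minimal sketch: print the 4-byte big-endian encoding an INT gets in HBase
# (the same layout Bytes.toBytes(int) / Hive binary storage produce).
# int_to_hbase_hex is a hypothetical helper name, used only for illustration.
def int_to_hbase_hex(value)
  [value].pack("N").unpack("C*").map { |b| format("\\x%02X", b) }.join
end

puts int_to_hbase_hex(1001)   # => \x00\x00\x03\xE9  (matches kingdomKey in the scan above)
puts int_to_hbase_hex(1)      # => \x00\x00\x00\x01  (matches the first rowkey)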

As you can see below, we get wrong results; SingleColumnValueFilter also fails in this scenario:-

import org.apache.hadoop.hbase.filter.CompareFilter
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter
import org.apache.hadoop.hbase.filter.SubstringComparator
import org.apache.hadoop.hbase.util.Bytes
scan 'tiny_hbase_table1', {LIMIT => 10, FILTER => SingleColumnValueFilter.new(Bytes.toBytes('o'), Bytes.toBytes('kingdomKey'), CompareFilter::CompareOp.valueOf('EQUAL'), Bytes.toBytes('1001')), COLUMNS => 'o:kingdom' }
ROW                                        COLUMN+CELL                                                  
\x00\x00\x00\x01                          column=o:kingdom, timestamp=1467806798430, value=Swaziland   
\x00\x00\x00\x02                          column=o:kingdom, timestamp=1467806928329, value=India       
\x00\x00\x00\x03                          column=o:kingdom, timestamp=1467806933574, value=Saudi Arabia
\x00\x00\x00\x04                          column=o:kingdom, timestamp=1467807030737, value=Swaziland   
4 row(s) in 0.3640 seconds

The solution is the JRuby program below. With it you get readable results, and inside the program you can apply the HBase deleteall command to delete a candidate record as soon as you find it in the loop:-

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.hbase.util.Bytes
import java.util.ArrayList

def delete_get_some()
  var_table = "tiny_hbase_table1"
  htable = HTable.new(HBaseConfiguration.new, var_table)
  # Scan only the o:kingdomKey column
  rs = htable.getScanner(Bytes.toBytes("o"), Bytes.toBytes("kingdomKey"))
  output = ArrayList.new
  output.add "ROW\t\t\t\t\t\tCOLUMN+CELL"
  rs.each { |r| r.raw.each { |kv|
      # Decode the binary rowkey, qualifier and value into readable form
      row = Bytes.toInt(kv.getRow)
      ql  = Bytes.toString(kv.getQualifier)
      ts  = kv.getTimestamp        # cell timestamp, handy for retention checks
      val = Bytes.toInt(kv.getValue)
      output.add "#{row} #{ql} #{val}"
    }
  }
  output.each { |line| puts "#{line}\n" }
end
 
delete_get_some
ROW						COLUMN+CELL
1 kingdomKey 1001
2 kingdomKey 1001
3 kingdomKey 1001
4 kingdomKey 1001
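To run it yourself, save the program to a file (the name purge_orgdata.rb is just an example) and pass it to the HBase shell; running it through the shell is also what keeps shell commands such as deleteall available to the script:

hbase shell purge_orgdata.rb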

You can also declare a variable and apply a custom check on the decoded values inside the loop, deleting the rowkey based on the readable values:-

if val <= myVal and row.to_s.include? 'likeme^'   # myVal and 'likeme^' stand in for your own criteria
  output.add "#{val} #{row} <<<<<<<<<<<<<<<<<<<<<<<<<<- Candidate for deletion"
  deleteall var_table, row
end
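If you would rather not depend on the shell's deleteall, here is a minimal sketch of the same idea using the HBase client API directly; it reuses the htable handle and the kv/val variables from the loop above and assumes myVal holds your own threshold (illustrative names only). Deleting through a Delete object lets you pass the original binary rowkey bytes without any manual \x escaping:

import org.apache.hadoop.hbase.client.Delete

# Inside the rs.each / r.raw.each loop of the program above
# (val and kv come from the loop; myVal is an illustrative threshold):
if val <= myVal
  # Delete by the original binary rowkey bytes returned by kv.getRow
  htable.delete(Delete.new(kv.getRow))
  output.add "#{val} #{row} <<<- deleted"
end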
Hope this solves a problem you are facing too. Let me know if you have any queries or suggestions.