HBase to Hive mapped table is not showing complete data

New Contributor

We have an HBase table with one column family and 1.5 billion records in it.

The HBase row count was retrieved using the command:

count '<tablename>', {CACHE => 1000000}

And the HBase to Hive mapping was done with the command below.

create external table stagingdata(
rowkey String,
col1 String,
col2 String
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
'hbase.columns.mapping' = ':key,n:col1,n:col2'
)
TBLPROPERTIES('hbase.table.name' = 'hbase_staging_data');
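
One quick way to confirm the mapping that was actually registered for the table is to look at the table definition; the hbase.columns.mapping value appears in the storage properties and should list exactly one HBase column per Hive column, in order:

describe formatted stagingdata;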

But when we retrieve the Hive row count using the command below,

select count(*) from stagingdata;

it only shows about 140 million rows in the Hive mapped table.

We tried the same approach with a smaller HBase table of 100 million records, and all of the records showed up in the Hive mapped table.

My question is: why are the complete 1.5 billion records not showing up in Hive?

Are we missing anything here?

Your immediate answer would be highly appreciated. Thanks, Madhu.

3 REPLIES

Mentor

It's not a good approach to count rows via Hive. Please use the HBase-native utility instead. The Hive implementation relies on the HBase SerDe, and I don't know how robust it is.

hbase org.apache.hadoop.hbase.mapreduce.RowCounter $TABLE
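
For the table in question (assuming the HBase table name is hbase_staging_data, as in the DDL above), that would be:

hbase org.apache.hadoop.hbase.mapreduce.RowCounter hbase_staging_data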

Mentor

At a minimum, you can use a similar Pig script to count rows as well:

-- Sample script to count rows in an HBase table
SET default_parallel 20;
-- The 'cf:*' wildcard returns the column family as a map, so declare both the row key and the map
A = LOAD 'hbase://table_name' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:*', '-loadKey true') AS (rowkey:bytearray, columns:map[]);
B = GROUP A ALL;
C = FOREACH B GENERATE COUNT(A);
DUMP C;
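
To run it, replace table_name with your HBase table name, save the script to a file, for example count_rows.pig (the file name is just an example), and submit it with:

pig count_rows.pig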

"Hive implementation relies on HBase Serde and I don't know how robust it is"

Just to shed some more light on this comment: if a row's values are malformed relative to what the Hive type serialization expects, the row will likely be quietly skipped. There should be a log message on the Hive execution side whenever a row is skipped.
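
As a rough sanity check (just a suggestion, and it only helps if bad values surface as NULLs rather than the whole row being dropped), comparing the total count against per-column counts in Hive can hint at which column's values fail to deserialize as the declared type; if the rows are skipped outright, the job logs for the Hive query are the place to look:

select count(*), count(col1), count(col2) from stagingdata;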