Reply
Contributor
Posts: 32
Registered: ‎03-17-2017

Loading Hbase table from Hive loads only a small portion of the total records

I am trying create/load an Hbase tables from existing hive tables, when I do the count on the Hbase tables, some tables are showing the exact record count and some tables are only showing a very small portion of the total records.  Below is one of the table that is showing only a portion of records:

 

1) CREATE TABLE IF NOT EXISTS cvja_transcrpt_t17(
    cnty_num CHAR(3),
    case_year SMALLINT,
    case_court_type CHAR(3),
    case_seq_num INT,
    key STRING,
    abstr_code CHAR(1),
    abstr_num CHAR(3),
    dckt_loc_book_num CHAR(3),
    dckt_loc_page_num CHAR(3),
    trans_cnty_num CHAR(3),
    trans_case_year SMALLINT,
    trans_case_type CHAR(1),
    trans_seq_num INT,
    exempt_ind CHAR(1),
    stat CHAR(1),
    last_upd_ts CHAR(26),
    last_upd_user CHAR(8),
    last_upd_pgm CHAR(8),
    iss_ent_ts CHAR(26) )
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,
dtl:cnty_num,
dtl:case_year,
dtl:case_court_type,
dtl:case_seq_num,
dtl:abstr_code,
dtl:abstr_num,
dtl:dckt_loc_book_num,
dtl:dckt_loc_page_num,
dtl:trans_cnty_num,
dtl:trans_case_year,
dtl:trans_case_type,
dtl:trans_seq_num,
dtl:exempt_ind,
dtl:stat,
dtl:last_upd_ts,
dtl:last_upd_user,
dtl:last_upd_pgm,
dtl:iss_ent_ts")
TBLPROPERTIES ("hbase.table.name" = "cvja_transcrpt_t17");

2) INSERT OVERWRITE TABLE cvprod_hb.cvja_transcrpt_t17 select * from civil_prq.cvja_transcrpt_t17;

 

Now when I check the count using either Hive via Hue: select count(*) from cvprod_hb.cvja_transcrpt_t17; or using hbase shell command count, I am getting only a 100 records out of a total of 137,110 records.

 

I am using the same process on other tables with much larger number of records and I am getting the correct results.  In fact, I have a total of 8 tables with following sizes:

TableCount HBaseCount Hive
CVJA_TRANSCRPT_T17100137,110
CV00_ISS_PRTY_T22 10552,839,401
CV00_PRTY_ALIS_T461032,177,394
CV00_PARTY_T48    10552,839,401
CV00_ORDER_T50    55,317,38655,317,386
CV00_NAME_HST_T65 2,325,1992,325,199
CV00_ISSUE_T23    43,172,77543,172,775
CV00_IDX_HST_T24  15,809,98815,809,988

As you can see from the table above, the process is working for 4 tables and is not working for 4 tables.

 

The confusing part, there are no error messages and the job browser shows that the tasks completed successfully.

I have 2 sets of hive tables, one loaded from cvs files and 1 created with parquet format.  Both, cvs and parquet formats are producing hbase tables with a small subset of the total records.

 

 I tried the above process both from Hue and by running hive against an hql file.  Got same results

 

Any ideas or pointers as why this is happening are truly appreciated.

I am using CDH 5.10

Highlighted
Contributor
Posts: 32
Registered: ‎03-17-2017

Re: Loading Hbase table from Hive loads only a small portion of the total records

I changed the column mapping, where I moved the key to be the 5th column.  I was able to get exact results. 

Please change below.

CREATE TABLE IF NOT EXISTS cvja_transcrpt_t17(
    cnty_num CHAR(3),
    case_year SMALLINT,
    case_court_type CHAR(3),
    case_seq_num INT,
    key STRING,
    abstr_code CHAR(1),
    abstr_num CHAR(3),
    dckt_loc_book_num CHAR(3),
    dckt_loc_page_num CHAR(3),
    trans_cnty_num CHAR(3),
    trans_case_year SMALLINT,
    trans_case_type CHAR(1),
    trans_seq_num INT,
    exempt_ind CHAR(1),
    stat CHAR(1),
    last_upd_ts CHAR(26),
    last_upd_user CHAR(8),
    last_upd_pgm CHAR(8),
    iss_ent_ts CHAR(26) )
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = "
dtl:cnty_num,
dtl:case_year,
dtl:case_court_type,
dtl:case_seq_num,

:key,                    **** this the change I made
dtl:abstr_code,
dtl:abstr_num,
dtl:dckt_loc_book_num,
dtl:dckt_loc_page_num,
dtl:trans_cnty_num,
dtl:trans_case_year,
dtl:trans_case_type,
dtl:trans_seq_num,
dtl:exempt_ind,
dtl:stat,
dtl:last_upd_ts,
dtl:last_upd_user,
dtl:last_upd_pgm,
dtl:iss_ent_ts")
TBLPROPERTIES ("hbase.table.name" = "cvja_transcrpt_t17");

 

However, when I made the change for table: cv00_prty_alis_t46, I got 2,175,326 records which is short by 2,068 records from the sourced hive table.  It is even worse for table cv00_iss_prty_t22, where I got 43,159,505 records which short by 9,679,896 records.  That a lot of missing records.  

Any ideas why I am getting so many records missing.

Thanks