
Problems with creating HFiles from Hive Table

Expert Contributor

I have a Hive ORC Table with the following schema:

CREATE TABLE mydb.orc_table (tester STRING, counter INT, hashvalue INT, valuelist STRING) STORED AS ORC;

Now I found the following tutorial on how to generate (and import) HFiles from a Hive table in order to transfer the data into HBase:

In this article they simply create a new Hive table that is linked to an HBase table. That also works fine for me:

CREATE TABLE mydb.hbase_table (rowkey STRING, hashvalue INT, valuelist STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:hashValue,cf:valueList");

In the next step I need to create the HFiles by running the following query:

set hive.hbase.generatehfiles=true;

INSERT OVERWRITE TABLE mydb.hbase_table SELECT concat_ws("_", tester, CAST(counter AS STRING)) AS key, hashvalue, valuelist FROM (SELECT * FROM mydb.orc_table LIMIT 1000000) a ORDER BY key, hashvalue;

In the example, that's all there is to do. But when I run this query, the Tez task runs into the well-known "Added a key not lexically larger than previous [...]" error:

Added a key not lexically larger than previous. Current cell = tester1_1/cf:hashValue/1533284948384/Put/vlen=3/seqid=0, lastCell = tester1_1/cf:valueList/1533284948384/Put/vlen=15231/seqid=0

I know that this exception occurs because Tez tried to write the value of the "hashValue" column after writing the "valueList" column value. Since the data in an HFile must be ordered by <Key>/<CF>:<CQ>, I must somehow ensure that hashValue is always written before valueList for a given row (as "hashValue" sorts lexically before "valueList").
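To make the required ordering concrete: taking the two cells from the error message above, the HFile writer would have to receive them in this order (sorted by row key, then column family, then column qualifier):

```
tester1_1/cf:hashValue/1533284948384/Put/...   <-- "hashValue" sorts before "valueList"
tester1_1/cf:valueList/1533284948384/Put/...
```

Note that both cells share the row key tester1_1 and the same timestamp, which is exactly the situation the exception complains about when they arrive in the reverse order.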

But how can I do this? And why does the example work? Another idea: is it somehow possible to split the "hashValue" and "valueList" columns into two different Tez tasks? (That hasn't worked for me so far either, and it wouldn't be a good solution anyway.)

Thank you for your help!


Cloudera Employee

It seems that this is happening because the columns are not sorted in your query. Kindly refer to the link below, which includes an example, for a better understanding:
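As a sketch of what a fully sorted insert might look like: since the error shows two cells with the same row key (tester1_1) and timestamp, one plausible trigger is duplicate row keys produced by concat_ws("_", tester, counter). The GROUP BY deduplication and the staging path below are my assumptions for illustration, not a confirmed fix from this thread (`hfile.family.path` must point at a directory whose last component matches the column family name):

```sql
SET hive.hbase.generatehfiles=true;
SET hfile.family.path=/tmp/hbase_hfiles/cf;  -- staging dir for generated HFiles (illustrative path)

INSERT OVERWRITE TABLE mydb.hbase_table
SELECT concat_ws("_", tester, CAST(counter AS STRING)) AS rowkey,
       max(hashvalue)  AS hashvalue,   -- collapse duplicate row keys so no key repeats
       max(valuelist)  AS valuelist    -- (an arbitrary but deterministic pick per key)
FROM mydb.orc_table
GROUP BY concat_ws("_", tester, CAST(counter AS STRING))
ORDER BY rowkey;                        -- total order on the row key only
```

With unique, totally ordered row keys, the cells within each row are emitted in the Hive column order (hashValue before valueList), which here matches the lexical qualifier order the HFile writer expects.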