
Problems with creating HFiles from Hive Table

Expert Contributor

I have a Hive ORC table with the following schema:

CREATE TABLE mydb.orc_table (tester STRING, counter INT, hashvalue INT, valuelist STRING) STORED AS ORC;

Now I found the following tutorial on how to generate (and import) HFiles from a Hive table, to transfer the data into HBase: https://community.hortonworks.com/articles/2745/creating-hbase-hfiles-from-an-existing-hive-table.ht...

In this article they simply create a new Hive table that is linked to an HBase table. That also works fine for me:

CREATE TABLE mydb.hbase_table (rowkey STRING, hashvalue INT, valuelist STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:hashValue,cf:valueList");

In the next step I need to create the HFiles by running the following query:

set hfile.family.path=/tmp/stdf_hbase/cf;
set hive.hbase.generatehfiles=true;

INSERT OVERWRITE TABLE mydb.hbase_table
SELECT concat_ws("_", tester, counter) AS key, hashvalue, valuelist
FROM (SELECT * FROM mydb.orc_table LIMIT 1000000) a
ORDER BY key, hashvalue;

In the example, that is all there is to do. But when I run this query, the Tez task fails with the well-known "Added a key not lexically larger than previous [...]" error:

Added a key not lexically larger than previous. Current cell = tester1_1/cf:hashValue/1533284948384/Put/vlen=3/seqid=0, lastCell = tester1_1/cf:valueList/1533284948384/Put/vlen=15231/seqid=0

I know that this exception occurs because Tez tried to write the value of the "hashValue" column after writing the "valueList" column value. Since the data in an HFile must be ordered by <Key>/<CF>:<CQ>, I somehow have to make sure that hashValue is always written before valueList for a given row (as "hashValue" is lexically smaller than "valueList").
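
For illustration, the writer expects the cells in strict <Key>/<CF>:<CQ> order, so for my two columns and my column family "cf" something like this (simplified sketch):

tester1_1/cf:hashValue
tester1_1/cf:valueList
tester1_2/cf:hashValue
tester1_2/cf:valueList

What strikes me in the error message is that both cells carry the same row key "tester1_1": a second row with that key apparently follows the first one, so its cf:hashValue cell arrives after the previous row's cf:valueList cell and violates the ordering. So it could also be that my row keys are not unique.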

But how can I do this? Why does the example work? Or, another idea: is it somehow possible to split the "hashValue" and "valueList" columns into two different Tez tasks? (That has not worked for me so far either, and it would not be a good solution anyway.)

Thank you for the help

1 REPLY

Cloudera Employee

It seems that this is happening because the cells are not fully sorted in your query output. Kindly refer to the link below, which walks through an example of the same error, for better understanding: https://stackoverflow.com/questions/46325233/spark-issues-in-creating-hfiles-added-a-key-not-lexical...
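
Since both cells in your error message share the row key "tester1_1", it is also worth checking whether your row keys are actually unique. A rough sketch of such a check (adjust the names to your schema):

SELECT concat_ws("_", tester, counter) AS key, count(*) AS cnt
FROM mydb.orc_table
GROUP BY concat_ws("_", tester, counter)
HAVING count(*) > 1
LIMIT 10;

If duplicates show up, one option is to collapse them to a single row per key before the insert. This is only a sketch; max() is an arbitrary way to pick one value per key, you may need different logic:

INSERT OVERWRITE TABLE mydb.hbase_table
SELECT key, max(hashvalue) AS hashvalue, max(valuelist) AS valuelist
FROM (SELECT concat_ws("_", tester, counter) AS key, hashvalue, valuelist
      FROM mydb.orc_table LIMIT 1000000) t
GROUP BY key
ORDER BY key;

With exactly one row per key, sorted by key, the storage handler writes the cf:hashValue and cf:valueList cells of each row in mapping order, so the strict cell ordering should hold.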