Created on 11-03-2015 11:53 PM
Demo scripts available at: https://github.com/sakserv/hive-hbase-generatehfiles
Below is an example of leveraging the Hive HBaseStorageHandler for HFile generation. This pattern provides a means of taking data already stored in Hive, exporting it as HFiles, and bulk loading the HBase table from those HFiles.
The HFile generation feature was added in HIVE-6473.
It adds two properties, hive.hbase.generatehfiles and hfile.family.path, that are then leveraged by the Hive HBaseStorageHandler.
Note that for hfile.family.path, the final subdirectory MUST MATCH the column family name.
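For example, with the column family "passwd" used later in this article, the property would be set as follows:

-- The final path component ("passwd") matches the column family name
SET hfile.family.path=/tmp/passwd_hfiles/passwd;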
The scripts in the repo called out above can be used with the Hortonworks Sandbox to test and demo this feature.
The following is an example of how to use this feature. The scripts in the repo above implement the steps below.
It is assumed that the user already has data stored in a Hive table. For the sake of this example, the following table was used:
CREATE EXTERNAL TABLE passwd_orc(userid STRING, uid INT, shell STRING) STORED AS ORC LOCATION '/tmp/passwd_orc';
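If you need a starting point for populating passwd_orc, here is a minimal sketch; it assumes a hypothetical colon-delimited text file with the same three fields staged at /tmp/passwd_txt:

-- Hypothetical staging table over a colon-delimited text file
CREATE EXTERNAL TABLE passwd_txt(userid STRING, uid INT, shell STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ':'
LOCATION '/tmp/passwd_txt';

-- Copy the staged rows into the ORC-backed table
INSERT OVERWRITE TABLE passwd_orc SELECT userid, uid, shell FROM passwd_txt;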
First, decide on the HBase table and column family names. We want to use a single column family. For the example below, the HBase table name is "passwd_hbase" and the column family name is "passwd".
Below is the DDL for the HBase table, created through Hive. Note that the hbase.columns.mapping SerDe property maps the Hive columns to the HBase row key and the passwd column family:
CREATE TABLE passwd_hbase(userid STRING, uid INT, shell STRING) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,passwd:uid,passwd:shell');
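To confirm the table was created on the HBase side, it can be described from the HBase shell (a quick optional check, assuming the default namespace):

# From the HBase shell
describe 'passwd_hbase'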
Next, generate the HFiles for the table. A couple of notes: hive.hbase.generatehfiles must be set to true, and hfile.family.path controls where the HFiles are written in HDFS, with the final subdirectory matching the column family name, as noted above. CLUSTER BY is used because HFiles must be written in sorted row key order:
SET hive.hbase.generatehfiles=true;
SET hfile.family.path=/tmp/passwd_hfiles/passwd;

INSERT OVERWRITE TABLE passwd_hbase
SELECT DISTINCT userid, uid, shell
FROM passwd_orc
CLUSTER BY userid;
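After the INSERT completes, the generated HFiles can be inspected under the family path as an optional sanity check:

hdfs dfs -ls /tmp/passwd_hfiles/passwd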
Finally, load the HFiles into the HBase table:
export HADOOP_CLASSPATH=`hbase classpath`
yarn jar /usr/hdp/current/hbase-client/lib/hbase-server.jar completebulkload /tmp/passwd_hfiles passwd_hbase
The data can now be queried from Hive or HBase.
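For a quick verification, the same rows can be read from either side (a minimal check using the example table above):

-- From Hive
SELECT * FROM passwd_hbase LIMIT 10;

# From the HBase shell
scan 'passwd_hbase', {LIMIT => 10}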
Created on 11-16-2016 09:58 AM
Hi Shane,
Thanks for the example. This is very useful.
I have a question - how will this work with multiple column families in an HBase table?
Do you have any tips on it?
Thanks & Regards,
Dhaval
Created on 05-24-2017 05:57 PM
I am on HDP 2.3.4, and when I run the above steps with a small data set it works fine, but when I try to load 15 million rows, I get this error:
Vertex failed, vertexName=Reducer 2, vertexId=vertex_1492812474981_9078_1_01, diagnostics=[Task failed, taskId=task_1492812474981_9078_1_01_000000, diagnostics=[TaskAttempt 0 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.lang.RuntimeException: Hive Runtime Error while closing operators: java.io.IOException: Multiple family directories found in hdfs://apps/hive/warehouse/somedb.db/hive_cdi3/_temporary/0/_temporary
I googled it, and it appears this is a known bug: https://issues.apache.org/jira/browse/HIVE-13539
Please let me know if there is a workaround.
Created on 06-05-2018 08:01 PM
It throws the following error if I have multiple column families in my HBase table.
Does this approach work only for a single column family?
java.lang.RuntimeException: Hive Runtime Error while closing operators: java.io.IOException: Multiple family directories found in hdfs://hadoopdev/apps/hive/warehouse/temp.db/employee_details/_temporary/0/_temporary/attempt_1527799542731_1180_r_000000_0