Community Articles

Find and share helpful community-sourced technical articles.
Labels (2)
avatar
Rising Star

Hive HBase Generate HFiles

Demo scripts available at: https://github.com/sakserv/hive-hbase-generatehfiles

Below contains an example of leveraging the Hive HBaseStorageHandler for HFile generation. This pattern provides a means of taking data already stored in Hive, exporting it as HFiles, and bulk loading the HBase table from those HFiles.

Overview

The HFile generation feature was added in HIVE-6473.

It adds the following properties that are then leveraged by the Hive HBaseStorageHandler.

  • hive.hbase.generatehfiles - true to generate HFiles
  • hfile.family.path - path in HDFS to put the HFiles.

Note that for hfile.family.path, the final sudirectory MUST MATCH the column family name.

The scripts in the repo called out above can be used with the Hortonworks Sandbox to test and demo this feature.

Example

The following is an example of how to use this feature. The scripts in the repo above implement the steps below.

It is assumed that the user already has data stored in a hive table, for the sake of this example, the following table was used.

CREATE EXTERNAL TABLE passwd_orc(userid STRING, uid INT, shell STRING)
STORED AS ORC
LOCATION '/tmp/passwd_orc';

First, decide on the HBase table and column family name. We want to use a single column family. For the example below, the HBase table name is "passwd_hbase", the column family name is "passwd".

Below is the DDL for the HBase table created through Hive. Couple of notes:

  • userid as my row key.
  • :key is special syntax in the hbase.columns.mapping
  • each column (qualifier) is in the form column family:column (qualifier)
CREATE TABLE passwd_hbase(userid STRING, uid INT, shell STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,passwd:uid,passwd:shell');

Next, generate the HFiles for the table. Couple of notes again:

  • The hfile.family.path is where the hfiles will be generated.
  • The final subdirectory name MUST match the column family name.
SET hive.hbase.generatehfiles=true;
SET hfile.family.path=/tmp/passwd_hfiles/passwd;
INSERT OVERWRITE TABLE passwd_hbase SELECT DISTINCT userid,uid,shell FROM passwd_orc CLUSTER BY userid;

Finally, load the HFiles into the HBase table:

export HADOOP_CLASSPATH=`hbase classpath` 
yarn jar /usr/hdp/current/hbase-client/lib/hbase-server.jar completebulkload /tmp/passwd_hfiles passwd_hbase

The data can now be queried from Hive or HBase.

13,264 Views
Comments
avatar
Contributor

Hi Shane,

Thanks for the example. This is very useful.

I have question - How will this work with multiple column families of HBase table?

Do you have any tips on it?

Thanks & Regards,

Dhaval

avatar
New Contributor

I am on HDP 2.3.4 and when I am running the above step with small data set it works fine but when I am trying to load 15 million rows, I am getting this error:

Vertex failed, vertexName=Reducer 2, vertexId=vertex_1492812474981_9078_1_01, diagnostics=[Task failed, taskId=task_1492812474981_9078_1_01_000000, diagnostics=[TaskAttempt 0 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.lang.RuntimeException: Hive Runtime Error while closing operators: java.io.IOException: Multiple family directories found in hdfs://apps/hive/warehouse/somedb.db/hive_cdi3/_temporary/0/_temporary

I google it and it says that this is known bug: https://issues.apache.org/jira/browse/HIVE-13539

Please let me know if there is any work around.

avatar
Rising Star

it's throwing me following error if I have multiple column family in my Hbase table.

Does this approach works only for the single column family ?

	java.lang.RuntimeException: Hive Runtime Error while closing operators: java.io.IOException: Multiple family directories found in hdfs://hadoopdev/apps/hive/warehouse/temp.db/employee_details/_temporary/0/_temporary/attempt_1527799542731_1180_r_000000_0