
Best way to move Hive table data into an HBase table


I have a huge Hive table, which works fine so far. Now I want to play around with HBase, so I'm looking for a way to get my Hive table data into a (new) HBase table. I already found some solutions for that, but I'm not sure which way is the best one. By the way, I'm familiar with Spark, so working with RDDs / Datasets is not a problem.

I'm using the Hortonworks Data Platform 2.6.5.

  • SHC (Spark HBase Connector)
    • Reading the Hive data into a Dataset by SparkSQL
    • creating HBase table via HBase Shell
    • defining a Catalog object that maps the Hive Columns to HBase ColumnFamilies and Qualifiers
    • writing the data of the Dataset via df.write.options(...).format("org.apache.spark.sql.execution.datasources.hbase").save()
  • Phoenix
    • creating Phoenix table via JDBC
    • reading Hive table data into Dataset via SparkSQL
    • writing the Dataset via df.write.options(...).format("org.apache.phoenix.spark").save()
  • Hive-HBase Integration
    • create HBase table
    • create external Hive Table (as template for the HFile creation)
    • set Hive properties for HFile creation
      • set hfile.family.path=/tmp/<table name>/<column family>
      • set hive.hbase.generatehfiles=true
    • move data from Hive to HBase by INSERT OVERWRITE TABLE ... SELECT FROM ... statement
    • insert generated HFiles into HBase by hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles tool
  • Spark Native
    • Reading the Hive table data into Dataset via SparkSQL
    • Transforming the Dataset into PairRDD<ImmutableBytesWritable, KeyValue>
    • save this RDD into HFiles by calling rdd.saveAsNewAPIHadoopFile
    • importing HFiles into HBase by hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles tool
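
The SHC steps above could look roughly like this. This is only a sketch: the table name, column names, and catalog values are placeholders, not taken from an actual schema.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("HiveToHBaseViaSHC")
  .enableHiveSupport()
  .getOrCreate()

// SHC catalog: maps Dataset columns to the HBase row key and
// column family / qualifier pairs. All names here are placeholders.
val catalog =
  s"""{
     |  "table": {"namespace": "default", "name": "my_hbase_table"},
     |  "rowkey": "key",
     |  "columns": {
     |    "id":    {"cf": "rowkey", "col": "key",   "type": "string"},
     |    "name":  {"cf": "cf",     "col": "name",  "type": "string"},
     |    "value": {"cf": "cf",     "col": "value", "type": "double"}
     |  }
     |}""".stripMargin

// Read the Hive data via SparkSQL
val df = spark.sql("SELECT id, name, value FROM my_hive_table")

// Write via SHC; "newtable" -> "5" would let SHC create the table
// with 5 regions instead of creating it in the HBase shell first
df.write
  .options(Map("catalog" -> catalog, "newtable" -> "5"))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .save()
```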
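The Phoenix variant is similar; again a sketch with placeholder table name and ZooKeeper quorum (phoenix-spark requires SaveMode.Overwrite for writes):

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("HiveToHBaseViaPhoenix")
  .enableHiveSupport()
  .getOrCreate()

// Read the Hive data via SparkSQL
val df = spark.sql("SELECT id, name, value FROM my_hive_table")

// Write into an existing Phoenix table (created beforehand via JDBC);
// "MY_PHOENIX_TABLE" and "zk-host:2181" are placeholders
df.write
  .format("org.apache.phoenix.spark")
  .mode(SaveMode.Overwrite)
  .option("table", "MY_PHOENIX_TABLE")
  .option("zkUrl", "zk-host:2181")
  .save()
```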
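The Hive-HBase integration needs no Spark code at all; a sketch with placeholder table, family, and path names:

```sql
-- In the HBase shell, create the target table first (family "cf"):
--   create 'my_hbase_table', 'cf'

-- External Hive table backed by the HBase table
-- (the column mapping is a placeholder, adjust to your schema)
CREATE EXTERNAL TABLE my_hbase_hive_table (id string, name string, value double)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:name,cf:value')
TBLPROPERTIES ('hbase.table.name' = 'my_hbase_table');

-- Properties for HFile generation (the path is a placeholder;
-- its last component must match the column family name)
SET hive.hbase.generatehfiles=true;
SET hfile.family.path=/tmp/my_hbase_table/cf;

-- HFiles must be written in ascending row-key order, hence the ORDER BY
INSERT OVERWRITE TABLE my_hbase_hive_table
SELECT id, name, value FROM my_hive_table ORDER BY id;

-- Then, from the command line:
--   hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles \
--     /tmp/my_hbase_table my_hbase_table
```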
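Finally, the Spark-native bulk load could be sketched like this (single column family "cf", one qualifier "name", all names placeholders; tested signatures are from the HBase 1.x API shipped with HDP 2.6.5):

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, KeyValue, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("HiveToHBaseNative")
  .enableHiveSupport()
  .getOrCreate()

val conf = HBaseConfiguration.create()
val conn = ConnectionFactory.createConnection(conf)
val tableName = TableName.valueOf("my_hbase_table")

// Let HFileOutputFormat2 pick up compression, block size and
// region boundaries from the existing (pre-created) table
val job = Job.getInstance(conf)
HFileOutputFormat2.configureIncrementalLoad(job,
  conn.getTable(tableName), conn.getRegionLocator(tableName))

// Read the Hive data, sort by row key (HFiles require sorted order),
// then transform into (ImmutableBytesWritable, KeyValue) pairs
val kvRdd = spark.sql("SELECT id, name FROM my_hive_table").rdd
  .map(row => (row.getString(0), row.getString(1)))
  .sortByKey()
  .map { case (id, name) =>
    val rowKey = Bytes.toBytes(id)
    val kv = new KeyValue(rowKey, Bytes.toBytes("cf"),
                          Bytes.toBytes("name"), Bytes.toBytes(name))
    (new ImmutableBytesWritable(rowKey), kv)
  }

// Write the HFiles; afterwards load them with
//   hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles \
--     /tmp/hfiles my_hbase_table
kvRdd.saveAsNewAPIHadoopFile(
  "/tmp/hfiles",
  classOf[ImmutableBytesWritable],
  classOf[KeyValue],
  classOf[HFileOutputFormat2],
  job.getConfiguration)
```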

Are there other interesting ways to bulk load Hive data into HBase? Which of the approaches above is the most "common" one?

Thank you for your help!



Hi Daniel,

AFAIK, bulk loading via HFiles is the most efficient way to get data into HBase, so either the 3rd or the 4th approach should work well. Personally, I would prefer the 3rd approach (Hive-HBase integration), as it is a completely native and simple approach that also avoids writing any code.