
Best way to move Hive table data into an HBase table


I have a huge Hive table, which works fine so far. Now I want to play around with HBase, so I'm looking for a way to get my Hive table data into a (new) HBase table. I already found some solutions for that, but I'm not sure which way is the best one. By the way, I'm familiar with Spark, so working with RDDs / Datasets is not a problem.

I'm using the Hortonworks Data Platform 2.6.5.

  • SHC (Spark HBase Connector)
    • Reading the Hive data into a Dataset by SparkSQL
    • creating HBase table via HBase Shell
    • defining a Catalog object that maps the Hive Columns to HBase ColumnFamilies and Qualifiers
    • writing the data of the Dataset via df.write.options(...).format("org.apache.spark.sql.execution.datasources.hbase").save()
  • Phoenix
    • creating Phoenix table via JDBC
    • reading Hive table data into Dataset via SparkSQL
    • writing the Dataset via df.write.options(...).format("org.apache.phoenix.spark").save()
  • Hive-HBase Integration
    • create HBase table
    • create external Hive Table (as template for the HFile creation)
    • set Hive properties for HFile creation
      • set hfile.family.path=/tmp/<table name>/<column family>
      • set hive.hbase.generatehfiles=true
    • move data from Hive to HBase by INSERT OVERWRITE TABLE ... SELECT FROM ... statement
    • insert generated HFiles into HBase by hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles tool
  • Spark Native
    • Reading the Hive table data into Dataset via SparkSQL
    • Transforming the Dataset into PairRDD<ImmutableBytesWritable, KeyValue>
    • save this RDD into HFiles by calling rdd.saveAsNewAPIHadoopFile
    • importing HFiles into HBase by hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles tool
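
The SHC steps above could look roughly like this. This is only a sketch: the table name, column names, and catalog values are placeholders, not taken from an actual schema.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("HiveToHBaseViaSHC")
  .enableHiveSupport()
  .getOrCreate()

// SHC catalog: maps Dataset columns to the HBase row key and
// column family / qualifier pairs. All names here are placeholders.
val catalog =
  s"""{
     |  "table": {"namespace": "default", "name": "my_hbase_table"},
     |  "rowkey": "key",
     |  "columns": {
     |    "id":    {"cf": "rowkey", "col": "key",   "type": "string"},
     |    "name":  {"cf": "cf",     "col": "name",  "type": "string"},
     |    "value": {"cf": "cf",     "col": "value", "type": "double"}
     |  }
     |}""".stripMargin

// Read the Hive data via SparkSQL
val df = spark.sql("SELECT id, name, value FROM my_hive_table")

// Write via SHC; "newtable" -> "5" would let SHC create the table
// with 5 regions instead of creating it in the HBase shell first
df.write
  .options(Map("catalog" -> catalog, "newtable" -> "5"))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .save()
```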
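The Phoenix variant is similar; again a sketch with placeholder table name and ZooKeeper quorum (phoenix-spark requires SaveMode.Overwrite for writes):

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("HiveToHBaseViaPhoenix")
  .enableHiveSupport()
  .getOrCreate()

// Read the Hive data via SparkSQL
val df = spark.sql("SELECT id, name, value FROM my_hive_table")

// Write into an existing Phoenix table (created beforehand via JDBC);
// "MY_PHOENIX_TABLE" and "zk-host:2181" are placeholders
df.write
  .format("org.apache.phoenix.spark")
  .mode(SaveMode.Overwrite)
  .option("table", "MY_PHOENIX_TABLE")
  .option("zkUrl", "zk-host:2181")
  .save()
```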
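The Hive-HBase integration needs no Spark code at all; a sketch with placeholder table, family, and path names:

```sql
-- In the HBase shell, create the target table first (family "cf"):
--   create 'my_hbase_table', 'cf'

-- External Hive table backed by the HBase table
-- (the column mapping is a placeholder, adjust to your schema)
CREATE EXTERNAL TABLE my_hbase_hive_table (id string, name string, value double)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:name,cf:value')
TBLPROPERTIES ('hbase.table.name' = 'my_hbase_table');

-- Properties for HFile generation (the path is a placeholder;
-- its last component must match the column family name)
SET hive.hbase.generatehfiles=true;
SET hfile.family.path=/tmp/my_hbase_table/cf;

-- HFiles must be written in ascending row-key order, hence the ORDER BY
INSERT OVERWRITE TABLE my_hbase_hive_table
SELECT id, name, value FROM my_hive_table ORDER BY id;

-- Then, from the command line:
--   hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles \
--     /tmp/my_hbase_table my_hbase_table
```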
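Finally, the Spark-native bulk load could be sketched like this (single column family "cf", one qualifier "name", all names placeholders; tested signatures are from the HBase 1.x API shipped with HDP 2.6.5):

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, KeyValue, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("HiveToHBaseNative")
  .enableHiveSupport()
  .getOrCreate()

val conf = HBaseConfiguration.create()
val conn = ConnectionFactory.createConnection(conf)
val tableName = TableName.valueOf("my_hbase_table")

// Let HFileOutputFormat2 pick up compression, block size and
// region boundaries from the existing (pre-created) table
val job = Job.getInstance(conf)
HFileOutputFormat2.configureIncrementalLoad(job,
  conn.getTable(tableName), conn.getRegionLocator(tableName))

// Read the Hive data, sort by row key (HFiles require sorted order),
// then transform into (ImmutableBytesWritable, KeyValue) pairs
val kvRdd = spark.sql("SELECT id, name FROM my_hive_table").rdd
  .map(row => (row.getString(0), row.getString(1)))
  .sortByKey()
  .map { case (id, name) =>
    val rowKey = Bytes.toBytes(id)
    val kv = new KeyValue(rowKey, Bytes.toBytes("cf"),
                          Bytes.toBytes("name"), Bytes.toBytes(name))
    (new ImmutableBytesWritable(rowKey), kv)
  }

// Write the HFiles; afterwards load them with
//   hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles \
--     /tmp/hfiles my_hbase_table
kvRdd.saveAsNewAPIHadoopFile(
  "/tmp/hfiles",
  classOf[ImmutableBytesWritable],
  classOf[KeyValue],
  classOf[HFileOutputFormat2],
  job.getConfiguration)
```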

Are there other interesting ways to bulk load Hive data into HBase? Which of the approaches above is the most "common" one?

Thank you for your help!



Hi Daniel,

AFAIK, bulk loading via HFiles is the most efficient way to get data into HBase, so either the 3rd or the 4th approach should work well. Personally, I would prefer the 3rd approach (Hive-HBase integration), as it is a completely native and simple approach that also avoids writing any code.