Best way for moving Hive table data into HBase table

Expert Contributor

I have a huge Hive table, which works fine so far. Now I want to play around with HBase, so I'm looking for a way to move my Hive table data into a (new) HBase table. I already found some solutions for that, but I'm not sure which one is the best. By the way, I'm familiar with Spark, so working with RDDs / Datasets is not a problem.

I'm using the Hortonworks Data Platform 2.6.5.

  • SHC (Spark HBase Connector)
    • reading the Hive data into a Dataset via SparkSQL
    • creating the HBase table via the HBase shell
    • defining a Catalog object that maps the Hive columns to HBase column families and qualifiers
    • writing the data of the Dataset via df.write.options(...).format("org.apache.spark.sql.execution.datasources.hbase").save()
  • Phoenix
    • creating the Phoenix table via JDBC
    • reading the Hive table data into a Dataset via SparkSQL
    • writing the Dataset via df.write.options(...).format("org.apache.phoenix.spark").save()
  • Hive-HBase integration (https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration)
    • creating the HBase table
    • creating an external Hive table (as template for the HFile creation)
    • setting the Hive properties for HFile creation:
      • set hfile.family.path=/tmp/my_test_table/cf
      • set hive.hbase.generatehfiles=true
    • moving the data from Hive to HBase via an INSERT OVERWRITE TABLE ... SELECT ... FROM ... statement
    • loading the generated HFiles into HBase via the hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles tool
  • Spark native (https://www.opencore.com/de/blog/2016/10/efficient-bulk-load-of-hbase-using-spark/)
    • reading the Hive table data into a Dataset via SparkSQL
    • transforming the Dataset into a PairRDD<ImmutableBytesWritable, KeyValue>
    • saving this RDD as HFiles by calling rdd.saveAsNewAPIHadoopFile
    • loading the HFiles into HBase via the hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles tool
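To make the first (SHC) option more concrete, a minimal Spark Scala sketch could look like the following. All table, column, and column-family names here are hypothetical placeholders, and the JSON catalog follows the format described in the SHC documentation; the HBase table itself is assumed to have been created beforehand in the HBase shell:

```scala
import org.apache.spark.sql.SparkSession

object HiveToHBase {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("HiveToHBase")
      .enableHiveSupport() // needed so SparkSQL can read from the Hive metastore
      .getOrCreate()

    // 1. Read the Hive table into a Dataset via SparkSQL (names are placeholders)
    val df = spark.sql("SELECT id, name, amount FROM my_db.my_hive_table")

    // 2. Catalog mapping Hive columns to the HBase row key and cf:qualifier pairs
    val catalog =
      """{
        |  "table": {"namespace": "default", "name": "my_hbase_table"},
        |  "rowkey": "key",
        |  "columns": {
        |    "id":     {"cf": "rowkey", "col": "key",    "type": "string"},
        |    "name":   {"cf": "cf",     "col": "name",   "type": "string"},
        |    "amount": {"cf": "cf",     "col": "amount", "type": "double"}
        |  }
        |}""".stripMargin

    // 3. Write the Dataset via the SHC data source
    df.write
      .options(Map("catalog" -> catalog))
      .format("org.apache.spark.sql.execution.datasources.hbase")
      .save()
  }
}
```

Note that this goes through the regular HBase write path (Puts), not HFiles, so for a one-off bulk load of a huge table the HFile-based options may be more efficient.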

Are there other interesting ways to bulk load HBase with Hive data? Which of the ways above is the most "common" one?

Thank you for your help!


Re: Best way for moving Hive table data into HBase table

New Contributor

Hi Daniel,

AFAIK, HFiles are the most efficient way to bulk load data into HBase, so either the 3rd or the 4th approach seems good. Personally, I would prefer the 3rd approach (Hive-HBase integration), as it is a completely native and simple approach that also avoids writing any code.
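As a rough sketch, the Hive-HBase integration route from the question could be expressed in HiveQL like this. All table, column, and path names below are illustrative placeholders; note that the last path component of hfile.family.path must match the HBase column family name:

```sql
-- Hypothetical example: table/column names are placeholders, not from the post.
-- External Hive table backed by the HBase storage handler (template for the HFiles)
CREATE EXTERNAL TABLE hbase_template_table (id STRING, name STRING, amount DOUBLE)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:name,cf:amount')
TBLPROPERTIES ('hbase.table.name' = 'my_test_table');

-- Switch Hive into HFile-generation mode; the family path's last component
-- must equal the HBase column family name ("cf" here)
SET hive.hbase.generatehfiles=true;
SET hfile.family.path=/tmp/my_test_table/cf;

-- Generate the HFiles from the source Hive table
INSERT OVERWRITE TABLE hbase_template_table
SELECT id, name, amount FROM my_db.my_hive_table;
```

Afterwards the generated HFiles are loaded into HBase with the tool mentioned in the question, e.g. hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /tmp/my_test_table my_test_table.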
