I have a huge Hive table, which works fine so far. Now I want to play around with HBase, so I'm looking for a way to get my Hive table data into a (new) HBase table. I already found some solutions for that, but I'm not sure which one is the best. By the way, I'm familiar with Spark, so working with RDDs / Datasets is not a problem.
I'm using the Hortonworks Data Platform 2.6.5. These are the approaches I have found so far:
- SHC (Spark HBase Connector)
  - reading the Hive data into a Dataset via Spark SQL
  - creating the HBase table via the HBase shell
  - defining a catalog that maps the Hive columns to HBase column families and qualifiers
  - writing the Dataset via df.write.options(...).format("org.apache.spark.sql.execution.datasources.hbase").save() (see the sketch below)
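A minimal sketch of how I imagine the SHC variant. The Hive table my_db.my_table with columns id, name, value, the HBase table my_test_table with column family cf, and the shc-core dependency on the classpath are just assumptions for the example:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

object HiveToHBaseViaShc {

  // maps the Hive columns to the HBase row key and to column family "cf"
  // (table, column and family names are placeholders)
  val catalog: String =
    s"""{
       |  "table": {"namespace": "default", "name": "my_test_table"},
       |  "rowkey": "key",
       |  "columns": {
       |    "id":    {"cf": "rowkey", "col": "key",   "type": "string"},
       |    "name":  {"cf": "cf",     "col": "name",  "type": "string"},
       |    "value": {"cf": "cf",     "col": "value", "type": "string"}
       |  }
       |}""".stripMargin

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("HiveToHBaseViaShc")
      .enableHiveSupport()
      .getOrCreate()

    // read the Hive table via Spark SQL
    val df = spark.sql("SELECT id, name, value FROM my_db.my_table")

    // write into the pre-created HBase table via SHC
    df.write
      .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
      .format("org.apache.spark.sql.execution.datasources.hbase")
      .save()
  }
}
```

If I understood the SHC docs correctly, the table would not even have to be created in the HBase shell first, since SHC can create it itself when the HBaseTableCatalog.newTable option is passed.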
- Phoenix
  - creating the Phoenix table via JDBC
  - reading the Hive table data into a Dataset via Spark SQL
  - writing the Dataset via df.write.options(...).format("org.apache.phoenix.spark").save() (see the sketch below)
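Again only a sketch; the Phoenix table name MY_TEST_TABLE, the ZooKeeper URL and the column names are assumptions, and phoenix-spark would have to be on the classpath:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object HiveToHBaseViaPhoenix {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("HiveToHBaseViaPhoenix")
      .enableHiveSupport()
      .getOrCreate()

    // read the Hive table; the aliases match the upper-case columns of the Phoenix table
    // created beforehand, e.g.
    //   CREATE TABLE MY_TEST_TABLE (ID VARCHAR PRIMARY KEY, NAME VARCHAR, VALUE VARCHAR)
    val df = spark.sql("SELECT id AS ID, name AS NAME, value AS VALUE FROM my_db.my_table")

    // write via the phoenix-spark data source; the rows are upserted,
    // SaveMode.Overwrite is simply what the connector expects
    df.write
      .format("org.apache.phoenix.spark")
      .mode(SaveMode.Overwrite)
      .option("table", "MY_TEST_TABLE")
      .option("zkUrl", "zk-host:2181") // ZooKeeper quorum of the HBase cluster
      .save()
  }
}
```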
- Hive-HBase Integration (https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration)
  - creating the HBase table
  - creating an external Hive table (as a template for the HFile creation)
  - setting the Hive properties for HFile creation:
    - set hfile.family.path=/tmp/my_test_table/cf
    - set hive.hbase.generatehfiles=true
  - generating the HFiles from the Hive data via an INSERT OVERWRITE TABLE ... SELECT ... FROM ... statement
  - loading the generated HFiles into HBase with the hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles tool (see the sketch below)
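Since I'm on the JVM anyway, those Hive statements could also be scripted over a HiveServer2 JDBC connection instead of typing them into beeline; host, database and table names below are just placeholders:

```scala
import java.sql.DriverManager

object HiveHBaseHFileGeneration {
  def main(args: Array[String]): Unit = {
    // HiveServer2 URL, user and table names are placeholders
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://hive-host:10000/my_db", "hive", "")
    val stmt = conn.createStatement()

    try {
      // switch the HBase-backed table into HFile generation mode for this session
      stmt.execute("SET hive.hbase.generatehfiles=true")
      stmt.execute("SET hfile.family.path=/tmp/my_test_table/cf")

      // the INSERT now writes HFiles below /tmp/my_test_table/cf
      // instead of going through the HBase API
      stmt.execute("INSERT OVERWRITE TABLE my_hbase_backed_table SELECT id, name, value FROM my_table")
    } finally {
      stmt.close()
      conn.close()
    }
    // afterwards: hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /tmp/my_test_table my_test_table
  }
}
```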
- Spark Native (https://www.opencore.com/de/blog/2016/10/efficient-bulk-load-of-hbase-using-spark/)
  - reading the Hive table data into a Dataset via Spark SQL
  - transforming the Dataset into a PairRDD<ImmutableBytesWritable, KeyValue>
  - saving this RDD as HFiles by calling rdd.saveAsNewAPIHadoopFile
  - loading the HFiles into HBase with the hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles tool (see the sketch below)
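This is roughly how I picture the Spark-native variant, following the linked blog post. Table, column family and path names are placeholders, and I'm not sure I'm handling the row-key sorting in the best way:

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, KeyValue, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.sql.SparkSession

object HiveToHBaseBulkLoad {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("HiveToHBaseBulkLoad")
      .enableHiveSupport()
      .getOrCreate()

    // read the Hive table via Spark SQL
    val df = spark.sql("SELECT id, name FROM my_db.my_table")

    val cf = Bytes.toBytes("cf")
    val nameQualifier = Bytes.toBytes("name")

    // HFiles have to be written in row-key order, hence the sortBy before
    // building the (ImmutableBytesWritable, KeyValue) pairs
    val kvRdd = df.rdd
      .map(row => (row.getString(0), row.getString(1)))
      .sortBy(_._1)
      .map { case (id, name) =>
        val rowKey = Bytes.toBytes(id)
        (new ImmutableBytesWritable(rowKey), new KeyValue(rowKey, cf, nameQualifier, Bytes.toBytes(name)))
      }

    // let HFileOutputFormat2 pick up region boundaries and compression of the existing table
    val conf = HBaseConfiguration.create()
    val connection = ConnectionFactory.createConnection(conf)
    val tableName = TableName.valueOf("my_test_table")
    val job = Job.getInstance(conf)
    HFileOutputFormat2.configureIncrementalLoad(
      job,
      connection.getTable(tableName),
      connection.getRegionLocator(tableName))

    // write the HFiles; afterwards they are loaded with
    //   hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /tmp/my_test_table_hfiles my_test_table
    kvRdd.saveAsNewAPIHadoopFile(
      "/tmp/my_test_table_hfiles",
      classOf[ImmutableBytesWritable],
      classOf[KeyValue],
      classOf[HFileOutputFormat2],
      job.getConfiguration)

    connection.close()
  }
}
```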
Are there other interesting ways to bulk load Hive data into HBase? Which of the approaches above is the most "common" one?
Thank you for your help!