Created 11-02-2015 08:49 PM
Looking for approaches for loading HBase tables if all I have is the data in an ORC backed Hive table.
I would prefer a bulk load approach, given there are several hundred million rows in the ORC backed Hive table.
I found the following, anyone have experience with Hive's HBase bulk load feature? Would it be better to create a CSV table and CTAS from ORC into the CSV table, and then use ImportTsv on the HBase side?
Any experiences here would be appreciated.
Created 11-02-2015 09:02 PM
Hey
You can bulk load into HBase in several different ways. The importTsv tool has been around for a while; however, if your data is in ORC with a Hive table on top, the Hive bulk load feature is an easier option with fewer moving parts.
This slide deck from Nick has a lot of info: http://fr.slideshare.net/HBaseCon/ecosystem-session-3a (slide 12 is the one you want to look at).
Essentially:
set hive.hbase.generatehfiles=true;
set hfile.family.path=/tmp/somewhere; (this can also be set as a table property)
This allows you to do an INSERT with the result of a SQL statement, which is a little more agile than having to go down the CSV route. Be careful: the HBase user will be the one picking up the generated files, so make sure it has access to them.
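To make the flow concrete, here is a rough sketch of the Hive side. The table and path names (orc_source, hbase_target, the cf column family, /tmp/hfiles) are invented for illustration; substitute your own.

```sql
-- Enable HFile generation instead of writing through the HBase API
SET hive.hbase.generatehfiles=true;
-- Directory where the HFiles for the column family will land
SET hfile.family.path=/tmp/hfiles/cf;

-- hbase_target is a hypothetical HBase-backed Hive table;
-- orc_source is the existing ORC table. The first selected
-- column becomes the HBase row key.
INSERT OVERWRITE TABLE hbase_target
SELECT rowkey_col, col1, col2
FROM orc_source;
```

After the insert completes, the generated HFiles under the output directory still need to be handed to HBase (e.g. with the completebulkload tool) as the HBase user, which is where the file-ownership caveat above comes in.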
Created 11-03-2015 11:47 PM
While I've yet to use this on the large table, it worked very well on a small sample. There were some gotchas that aren't explicitly called out anywhere. I will put together a guide and post it to AH, and link it back here when ready.
I've scripted out an example of using this feature here:
https://github.com/sakserv/hive-hbase-generatehfiles
Thanks!
Created 11-03-2015 11:56 PM
Demo article has been added here:
Created 11-02-2015 09:09 PM
HADOOP_CLASSPATH=/usr/hdp/current/hbase-client/lib/hbase-protocol.jar:/etc/hbase/conf hadoop jar /usr/hdp/current/phoenix-client/phoenix-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool --table test --input /user/root/test --zookeeper localhost:2181:/hbase-unsecure
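For context, CsvBulkLoadTool assumes the target table already exists in Phoenix and that the CSV columns in the input path line up with its schema. A minimal, hypothetical DDL matching the --table test flag above (column names invented for illustration):

```sql
-- Hypothetical Phoenix table for the bulk load above;
-- the primary key column becomes the HBase row key
CREATE TABLE test (
    id   BIGINT NOT NULL PRIMARY KEY,
    name VARCHAR
);
```

The files under the --input path (/user/root/test) would then be CSVs whose column order matches this definition.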
Created 11-03-2015 11:48 PM
This shows promise as well, and I plan to give it a try soon. However, the accepted answer avoids needing to go from ORC back to CSV, so it gets the win. 🙂