
Spark DF: Reading Petabytes of Data from Hive and Persisting Petabytes of Data to HBase?


I'm looking to read really large (petabyte-scale) Hive tables into Spark. Which is the better option in terms of performance: reading directly from the Hive table with a SELECT statement, or giving Spark the HDFS path of the table's ORC files? Looking for best practices as well as the best performance.

I'm also looking to save petabyte-scale data to HBase tables. Hortonworks has a Spark-HBase connector, but I'm not sure how it handles petabyte volumes of data, or how it performs. Should I consider writing HFiles or using the Spark-HBase connector?
Any suggestions will be really appreciated. Thank you!


Super Collaborator

If you can read and process the ORC files directly, I would go that way. There is no need to pass them through Hive unless you need the schema.
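A minimal sketch of the two read routes, assuming an ORC-backed, partitioned Hive table; the table name `db.events`, the warehouse path `/warehouse/db.db/events`, and the partition column `dt` are placeholders:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("orc-read")
  .enableHiveSupport()   // only needed for the metastore route
  .getOrCreate()

// Route 1: through the Hive metastore -- Spark gets the schema and
// partition pruning for free.
val viaHive = spark.sql("SELECT * FROM db.events WHERE dt = '2023-01-01'")

// Route 2: straight from the ORC files on HDFS -- no metastore round trip,
// but you must know the layout yourself; partition columns are inferred
// from the directory names (dt=.../).
val viaFiles = spark.read.orc("/warehouse/db.db/events")
  .where("dt = '2023-01-01'")
```

Both routes end up reading the same ORC readers underneath; the metastore route mainly buys you the authoritative schema and partition metadata.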

The same is valid for the bulk load into HBase: write HFiles and then load them into HBase. You will have to make sure the region splitting of the HBase table is valid, though, otherwise HBase may immediately start rebalancing. You can check the details in this blog post:

+1, you should definitely use HBase bulk loading of HFiles if you're dealing with data at such a large scale, as it will greatly reduce the amount of hardware necessary to process it.
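The write-HFiles-then-load path can be sketched with the standard HBase client and MapReduce APIs (HBase 1.x/2.x); the table name `my_table`, column family `cf`, column `col`, output path `/tmp/hfiles`, and the toy data are all placeholders, and the target table must already exist and be pre-split so the generated HFiles line up with its region boundaries:

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.{HBaseConfiguration, KeyValue, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{HFileOutputFormat2, LoadIncrementalHFiles}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("hbase-bulk-load").getOrCreate()

val conf = HBaseConfiguration.create()
val conn = ConnectionFactory.createConnection(conf)
val tableName = TableName.valueOf("my_table")       // pre-split table
val table = conn.getTable(tableName)
val regionLocator = conn.getRegionLocator(tableName)

// configureIncrementalLoad wires in a total-order partitioner so each
// output HFile falls entirely inside one region of the pre-split table.
val job = Job.getInstance(conf)
HFileOutputFormat2.configureIncrementalLoad(job, table, regionLocator)

// Toy data standing in for the petabyte-scale RDD of (rowkey, value) pairs.
val rdd = spark.sparkContext.parallelize(Seq("a" -> "1", "b" -> "2"))

// HFiles must be written in strict rowkey order, hence the sortByKey.
val kvs = rdd
  .map { case (rowKey, value) =>
    val kv = new KeyValue(Bytes.toBytes(rowKey), Bytes.toBytes("cf"),
                          Bytes.toBytes("col"), Bytes.toBytes(value))
    (new ImmutableBytesWritable(Bytes.toBytes(rowKey)), kv)
  }
  .sortByKey()

kvs.saveAsNewAPIHadoopFile("/tmp/hfiles",
  classOf[ImmutableBytesWritable], classOf[KeyValue],
  classOf[HFileOutputFormat2], job.getConfiguration)

// Handing the finished HFiles to the region servers is essentially a
// metadata operation, not a second copy of the data.
new LoadIncrementalHFiles(conf).doBulkLoad(new Path("/tmp/hfiles"),
  conn.getAdmin, table, regionLocator)
```

This is why pre-splitting matters: if the HFiles straddle region boundaries, the load degenerates into splits and compactions instead of a cheap handoff.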


@Harald Berghoff & @Josh Elser, thanks for your reply. OK, I will try to use HFiles according to the link, which I knew existed. However, that link says the Spark-HBase connector will support bulk loading in the future. Now that the JIRA says it's resolved, does that mean we can use the Spark-HBase connector?

Super Collaborator

Given the timing of the closure it should be available, but check the version of your spark-hbase-connector to be sure (the JIRA says it was fixed in version 3.0.0).
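One way to pin that down is at build time. The coordinates and version below are assumptions for illustration (the Apache hbase-connectors artifact); verify against what your distribution actually ships:

```scala
// build.sbt -- coordinates and version are assumptions; check your distro's
// repository for the exact spark-hbase-connector artifact it provides.
libraryDependencies += "org.apache.hbase.connectors.spark" % "hbase-spark" % "1.0.0"
```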


@Josh Elser @Harald Berghoff If the JIRA patch has been merged, then I believe we can use this Cloudera connector and don't need to rewrite the HBase salting and Spark partitioner from the post below, right? We can just use the connector API instead, can't we?

We also have another connector, from Hortonworks; I'm not sure if it supports bulk loading. The previous one is from Cloudera and is RDD-based, whereas the one below supports the DataFrame API of Spark.
It would be good if this connector supported DataFrame bulk loading. It supports the standard write, but I'm not sure about bulk load.
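For reference, the DataFrame-based Hortonworks connector (shc) is driven by a JSON catalog; a hedged sketch of its standard write path, where the table name `table1`, column family `cf1`, and column names are placeholders and `df` is an existing DataFrame with matching columns:

```scala
import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

// Catalog mapping DataFrame columns to HBase rowkey/column-family cells.
val catalog =
  s"""{
     |  "table": {"namespace": "default", "name": "table1"},
     |  "rowkey": "key",
     |  "columns": {
     |    "key":  {"cf": "rowkey", "col": "key",  "type": "string"},
     |    "col1": {"cf": "cf1",    "col": "col1", "type": "string"}
     |  }
     |}""".stripMargin

df.write
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog,
               HBaseTableCatalog.newTable -> "5"))  // create table with 5 regions
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .save()
```

Note that this path issues regular Puts through the region servers rather than generating HFiles, which is exactly the standard-write-versus-bulk-load distinction being asked about.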