Expert Contributor
Posts: 61
Registered: ‎02-03-2016

HBase Spark Module work with DataFrames

Have there been any advances in the HBase Spark module included with CDH? So far, I see that it works with RDDs, but only in a very cumbersome way. Is DataFrame support coming, or is it already available somewhere? Working with DataFrames would make reading and writing data to HBase much easier and faster to code than using RDDs.

Thanks,

Ben

Cloudera Employee
Posts: 94
Registered: ‎05-10-2016

Re: HBase Spark Module work with DataFrames

It looks like read support was added with this Jira: https://issues.apache.org/jira/browse/HBASE-14181, which has been available since CDH 5.7. Write support is still a work in progress: https://issues.apache.org/jira/browse/HBASE-15336
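
For illustration, here is a rough PySpark sketch of what a DataFrame read through the module might look like. The data source name `org.apache.hadoop.hbase.spark` and the options `hbase.table` and `hbase.columns.mapping` follow my understanding of the HBASE-14181 era of the module; later versions switched to a JSON catalog, so check this against the version you have. The table `t1`, column family `c`, and the helper names are hypothetical, and the module may also need an HBaseContext to have been created on the JVM side first.

```python
def build_columns_mapping(columns):
    """Build the 'hbase.columns.mapping' option string.

    columns: list of (sql_name, sql_type, hbase_column) tuples, where
    hbase_column is ':key' for the row key or 'family:qualifier'.
    The entry format is an assumption based on the HBASE-14181 docs.
    """
    return ", ".join(
        f"{name} {sql_type} {hbase_col}" for name, sql_type, hbase_col in columns
    )


def read_hbase_table(sql_context, table, columns):
    """Load an HBase table as a DataFrame (hypothetical helper).

    Assumes the hbase-spark jar and hbase-site.xml are on the classpath.
    """
    return (
        sql_context.read
        .format("org.apache.hadoop.hbase.spark")  # data source name (assumed)
        .option("hbase.table", table)
        .option("hbase.columns.mapping", build_columns_mapping(columns))
        .load()
    )


# Hypothetical table layout: row key plus two columns in family 'c'.
EXAMPLE_COLUMNS = [
    ("KEY_FIELD", "STRING", ":key"),
    ("A_FIELD", "STRING", "c:a"),
    ("B_FIELD", "STRING", "c:b"),
]
```

With a live `sqlContext`, the call would be something like `read_hbase_table(sqlContext, "t1", EXAMPLE_COLUMNS)`, after which the result can be registered as a temp table and queried with SQL.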

Contributor
Posts: 25
Registered: ‎06-13-2017

Re: HBase Spark Module work with DataFrames

I wouldn't recommend this in production environments where performance is important, but you can create a Hive external table on top of an HBase table and use Spark JDBC to create a DataFrame on top of the Hive table via Impala.

Spark SQL/Spark JDBC (selects and inserts) works against the Hive external table, and so do Impala selects and inserts (including updates performed via inserts).
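
A minimal sketch of that workaround, assuming the standard Hive HBase storage handler and Spark's generic `jdbc` data source. The table names, column layout, JDBC URL, and driver class are placeholders I made up for illustration; use whatever your Impala JDBC driver documentation specifies.

```python
def hive_external_table_ddl(hive_table, hbase_table, columns):
    """Build Hive DDL for an external table backed by an HBase table.

    columns: list of (hive_name, hive_type, hbase_column) tuples, where
    hbase_column is ':key' for the row key or 'family:qualifier'.
    """
    col_defs = ", ".join(f"{name} {hive_type}" for name, hive_type, _ in columns)
    # The mapping is joined without spaces, since Hive's HBase
    # integration does not tolerate whitespace between entries.
    mapping = ",".join(hbase_col for _, _, hbase_col in columns)
    return (
        f"CREATE EXTERNAL TABLE {hive_table} ({col_defs}) "
        "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' "
        f"WITH SERDEPROPERTIES ('hbase.columns.mapping' = '{mapping}') "
        f"TBLPROPERTIES ('hbase.table.name' = '{hbase_table}')"
    )


def read_via_jdbc(sql_context, jdbc_url, table, driver):
    """Create a DataFrame over the Hive table through a JDBC endpoint."""
    return (
        sql_context.read
        .format("jdbc")
        .option("url", jdbc_url)    # e.g. an Impala JDBC URL (placeholder)
        .option("dbtable", table)
        .option("driver", driver)   # driver class name (placeholder)
        .load()
    )


# Hypothetical layout: row key plus one column in family 'c'.
DDL = hive_external_table_ddl(
    "hbase_t1", "t1", [("key_field", "STRING", ":key"), ("a_field", "STRING", "c:a")]
)
```

The DDL would be run once in Hive (followed by a metadata refresh in Impala), and `read_via_jdbc` would then be pointed at the Impala endpoint with the table name `hbase_t1`.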
