I was recently tasked with building something that can read data from HBase into a Spark DataFrame and, once the transformation/enrichment is done, write the DataFrame back into HBase.
What is the best way of doing this? I can see that Cloudera has a SparkOnHBase package (but I believe they have contributed that code to HBase, and the Maven modules carry version 0.0.x-clabs-SNAPSHOT, which doesn't sound reassuring). There is also an hbase-spark module in Apache HBase, but it seems it has not been released yet.
Ideally it would be something similar to these:
```java
// using spark-csv from databricks
DataFrame csvDF = sqlContext.read()
    .format("csv")
    .options(options)
    .load(hdfs.getURI("hdfs://sandbox:8020"));

// using spark-solr from lucidworks
DataFrame solrDF = sqlContext.read()
    .format("solr")
    .options(options)
    .load();
```
Is there something similar to these in the HBase world?
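If the hbase-spark module ends up following the same DataSource pattern, I would imagine a read looking roughly like this. This is purely a sketch: the format name and the option keys (`hbase.table`, `hbase.columns.mapping`) are my assumptions based on how the in-progress module seems to be shaped, not a released API.

```java
// Speculative sketch: format name and option keys are assumptions
// about the unreleased hbase-spark module and may well change.
DataFrame hbaseDF = sqlContext.read()
    .format("org.apache.hadoop.hbase.spark")
    .option("hbase.table", "my_table")                  // assumed table option
    .option("hbase.columns.mapping",
            "key STRING :key, name STRING cf:name")     // assumed mapping syntax
    .load();
```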
I have also seen this thread with the experimental connector, but I would really prefer something more mature.
Thanks in advance!
Thanks all for the input. The phoenix-spark example looks very close to what we need, but I am not sure whether people on my team would be happy with Phoenix; I will bring it up and see. Meanwhile, I will also follow the HBase JIRA and hope that the module is released soon.
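For anyone comparing, the phoenix-spark route would look something like the sketch below, mirroring the DataSource style above. The table names and ZooKeeper URL are placeholders for illustration; this assumes the tables already exist in Phoenix and won't run outside a cluster with Phoenix and HBase available.

```java
// Sketch of the phoenix-spark DataSource; "MY_TABLE", "MY_OUTPUT_TABLE",
// and the zkUrl value are placeholders, not real cluster settings.
DataFrame phoenixDF = sqlContext.read()
    .format("org.apache.phoenix.spark")
    .option("table", "MY_TABLE")        // Phoenix table to read from
    .option("zkUrl", "sandbox:2181")    // ZooKeeper quorum for the cluster
    .load();

// ... transformation / enrichment on phoenixDF ...

phoenixDF.write()
    .format("org.apache.phoenix.spark")
    .mode(SaveMode.Overwrite)
    .option("table", "MY_OUTPUT_TABLE") // Phoenix table to write back to
    .option("zkUrl", "sandbox:2181")
    .save();
```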