Support Questions

David_Tam · ‎02-17-2016

Hello,

I have recently been tasked with building something that reads data from HBase into a Spark DataFrame and, once the transformation / enrichment is done, writes the DataFrame back to HBase.

What is the best way to do this? Cloudera has a SparkOnHBase package, but I think they have contributed the code to HBase, and the Maven modules are versioned 0.0.x-clabs-SNAPSHOT, which doesn't sound reassuring. There is also an HBase-Spark module in Apache HBase, but it does not seem to have been released yet.

Ideally it would be something similar to these:

// using spark-csv from databricks
DataFrame csvDF = sqlContext.read()
        .format("csv")
        .options(options)
        .load(hdfs.getURI("hdfs://sandbox:8020"));

// using spark-solr from lucidworks
DataFrame solrDF = sqlContext.read()
        .format("solr")
        .options(options)
        .load();

Is there something similar to these in the HBase world?

I have also seen this thread about the experimental connector, but I would really prefer something more mature.

Thanks in advance!

rgelhausen · ‎02-17-2016

Hi @David Tam, for a working example that uses phoenix-spark to read and write HBase data as DataFrames, check out https://github.com/randerzander/HiveToPhoenix

aervits · ‎02-17-2016

@David Tam

Right now the only definite answer is phoenix-spark: https://phoenix.apache.org/phoenix_spark.html

HBase-Spark has not been released yet. It is expected very soon, but no timeline has been announced.
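
For reference, a minimal sketch of what the phoenix-spark route could look like in the same style as the examples in the question. The data source name "org.apache.phoenix.spark" and the option keys "table" and "zkUrl" follow the phoenix-spark page linked above; the table names INPUT_TABLE and OUTPUT_TABLE, the ZooKeeper address sandbox:2181, and the helper class itself are hypothetical placeholders. The Spark calls are left as comments because they need Spark 1.x and the phoenix-spark jar on the classpath, so only the option-building part runs on its own.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of phoenix-spark: it only gathers the options
// that the phoenix-spark data source is documented to read.
final class PhoenixSparkOptions {
    // Data source name as given on the phoenix-spark documentation page.
    static final String FORMAT = "org.apache.phoenix.spark";

    // Builds the "table" and "zkUrl" options for a read or a write.
    static Map<String, String> options(String table, String zkUrl) {
        Map<String, String> opts = new LinkedHashMap<>();
        opts.put("table", table);
        opts.put("zkUrl", zkUrl);
        return opts;
    }

    public static void main(String[] args) {
        // Assumed table names and ZooKeeper quorum; replace with your own.
        Map<String, String> readOpts = options("INPUT_TABLE", "sandbox:2181");
        Map<String, String> writeOpts = options("OUTPUT_TABLE", "sandbox:2181");
        System.out.println(FORMAT + " read options:  " + readOpts);
        System.out.println(FORMAT + " write options: " + writeOpts);

        // With Spark 1.x and phoenix-spark available, the calls would look like:
        //
        // DataFrame df = sqlContext.read()
        //         .format(PhoenixSparkOptions.FORMAT)
        //         .options(readOpts)
        //         .load();
        //
        // // ... transform / enrich df ...
        //
        // df.write()
        //         .format(PhoenixSparkOptions.FORMAT)
        //         .mode(SaveMode.Overwrite)   // the mode the phoenix-spark docs use for saving
        //         .options(writeOpts)
        //         .save();
    }
}
```

Note that this goes through Phoenix tables rather than raw HBase tables, so the target table has to exist in Phoenix before the write.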

nsabharwal · ‎02-17-2016

@David Tam

See this jira https://issues.apache.org/jira/browse/HBASE-13992

nsabharwal · ‎02-17-2016

@David Tam Amazing to see all the JIRAs on the same topic: https://issues.apache.org/jira/browse/HBASE-14181

David_Tam · ‎02-18-2016

Thanks all for the input. The phoenix-spark example looks very close to what we need, but I am not sure my team would be happy with Phoenix. I will bring it up and see. Meanwhile, I will also follow the HBase JIRA and hope the module is released soon.

Thank you!

Cloudera Community

Reading from and Writing to HBase with a spark DataFrame