Spark HBase Connector

I need a solution for the following scenario. Suppose 3 million historical records are stored in HBase. A streaming job pulls in about 10k records, and for each of those 10k records we need to fetch the matching record from HBase by key. The whole lookup should complete in less than half a minute. We are using the Spark HBase connector.

Re: Spark HBase Connector

We are working on a 5-node system.

Re: Spark HBase Connector

Hi Balakumar,

This documentation shows how to perform a select using the Spark HBase connector.

For example using the SQL syntax:

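// Load the DataFrame (withCatalog and catalog are defined as in the connector documentation)
val df = withCatalog(catalog)
// SQL example: register the DataFrame as a temporary table, then query it
df.registerTempTable("table1")
sqlContext.sql("select count(col1) from table1").show()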
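
The snippet above uses `catalog` and `withCatalog` without defining them. As a minimal sketch, assuming the Hortonworks connector (shc) is on the classpath, they might look roughly like this. The table name, column family and column names are hypothetical, and unlike the one-argument call above this version takes the `SQLContext` explicitly:

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

object CatalogSketch {
  // Hypothetical layout: the HBase row key is exposed as col0,
  // plus one string column col1 in column family cf1.
  val catalog: String =
    """{
      |  "table":  {"namespace": "default", "name": "table1"},
      |  "rowkey": "key",
      |  "columns": {
      |    "col0": {"cf": "rowkey", "col": "key",  "type": "string"},
      |    "col1": {"cf": "cf1",    "col": "col1", "type": "string"}
      |  }
      |}""".stripMargin

  // Load the HBase table described by the catalog as a DataFrame.
  def withCatalog(sqlContext: SQLContext, cat: String): DataFrame =
    sqlContext.read
      .options(Map(HBaseTableCatalog.tableCatalog -> cat))
      .format("org.apache.spark.sql.execution.datasources.hbase")
      .load()
}
```

With these in place, `CatalogSketch.withCatalog(sqlContext, CatalogSketch.catalog)` would return the DataFrame that the SQL example queries.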

Re: Spark HBase Connector

I want to improve the performance of the HBase read operation. Is there any option in the Spark HBase connector to scale up HBase reads?

Re: Spark HBase Connector

Hi @Balakumar Balasundaram, you could use the "predicate pushdown" approach on the row keys: pass the smaller set of records (the streamed keys) as a broadcast variable so that only the required records are pulled from HBase.

I recommend reading this article:

https://hortonworks.com/blog/spark-hbase-dataframe-based-hbase-connector
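
One way the row-key pushdown idea might be sketched: collect the distinct keys of the streamed batch on the driver and filter the HBase-backed DataFrame on the row-key column. This assumes `col0` is the catalog column mapped to the row key and that the batch has a `key` column; both names are hypothetical. Whether the connector turns an `isin` filter into HBase gets or range scans depends on its version, so the physical plan is printed to check.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object RowKeyLookup {
  // hbaseDf: DataFrame loaded through the connector (assumed: col0 = row key).
  // batchDf: the streamed micro-batch (assumed: a string column named "key").
  def lookup(hbaseDf: DataFrame, batchDf: DataFrame): DataFrame = {
    // Bring the batch's distinct keys to the driver (about 10k in the question).
    val keys: Seq[String] =
      batchDf.select("key").distinct().collect().map(_.getString(0)).toSeq

    // Filter on the row-key column so the data source can prune by key.
    val matched = hbaseDf.filter(col("col0").isin(keys: _*))

    // Print the physical plan to see whether the filter reached the source.
    matched.explain()
    matched
  }
}
```

If the plan shows the filter applied only inside Spark, the lookup is still scanning the full table and a different key strategy may be needed.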

Alternatively, adjust "spark.sql.autoBroadcastJoinThreshold" to the required lookup size. In Spark, the lookup can then be pushed down to HBase rather than pulling all the data into Spark and computing there, which is not optimal.
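
A rough sketch of the broadcast-join variant, assuming Spark 2.x or later. The 50 MB threshold and the column names `col0` and `key` are illustrative assumptions. As far as I know, a broadcast join by itself does not turn the join keys into an HBase filter, so it is worth comparing the plan from `explain()` against the row-key filter approach before relying on it.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.broadcast

object BroadcastJoinLookup {
  def lookup(spark: SparkSession, hbaseDf: DataFrame, batchDf: DataFrame): DataFrame = {
    // Raise the automatic broadcast threshold (in bytes); 50 MB is illustrative.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", (50L * 1024 * 1024).toString)

    // Explicitly hint that the small streamed batch should be broadcast,
    // then join it to the HBase-backed DataFrame on the row key.
    hbaseDf.join(broadcast(batchDf), hbaseDf("col0") === batchDf("key"))
  }
}
```

Calling `.explain()` on the result shows whether a broadcast hash join was chosen and which filters, if any, reached the HBase relation.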