Spark HBase Connector

I need a solution for the following scenario. Say 3 million records of historical data are stored in HBase. A streaming job then pulls in, say, 10k records, and for those 10k records we need to fetch the matching records from HBase by key. The whole operation should complete in less than half a minute. We are using the Spark HBase connector.

We are working on a 5-node system.

Hi Balakumar,

This documentation shows how to perform a select using the Spark HBase connector.

For example using the SQL syntax:

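// Load the DataFrame (withCatalog and catalog are defined as in the connector documentation)
val df = withCatalog(catalog)
// SQL example: register the DataFrame as a temporary table, then query it
df.registerTempTable("table")
sqlContext.sql("select count(col1) from table").show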
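For readers without the documentation at hand: `withCatalog` and `catalog` are not part of Spark itself. In the connector's examples they are defined roughly as in the sketch below. The table name, column family, and column names are placeholders, and the code assumes a `sqlContext` is already in scope (as in spark-shell) with the connector on the classpath.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

// Placeholder catalog: maps an HBase table to DataFrame columns.
// "table1", "cf1", "col0", and "col1" are illustrative names, not from the question.
val catalog: String =
  """{
    |"table":{"namespace":"default", "name":"table1"},
    |"rowkey":"key",
    |"columns":{
    |"col0":{"cf":"rowkey", "col":"key", "type":"string"},
    |"col1":{"cf":"cf1", "col":"col1", "type":"string"}
    |}
    |}""".stripMargin

// Builds a DataFrame backed by the HBase table described in the catalog.
// Assumes `sqlContext` is already defined, e.g. by spark-shell.
def withCatalog(cat: String): DataFrame =
  sqlContext.read
    .options(Map(HBaseTableCatalog.tableCatalog -> cat))
    .format("org.apache.spark.sql.execution.datasources.hbase")
    .load()
```

Mapping the row key to a named column (`col0` here) is what lets later filters refer to the key directly.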

I want to improve the performance of the HBase read operation. Is there any option in the Spark HBase connector to scale up HBase reads?

Hi @Balakumar Balasundaram, you could use the "predicate pushdown" approach on the row keys: pass the smaller set of records (the incoming keys) as a broadcast variable so that only the required records are pulled from HBase.

I recommend going through this article:

https://hortonworks.com/blog/spark-hbase-dataframe-based-hbase-connector

Alternatively, adjust "spark.sql.autoBroadcastJoinThreshold" to the required lookup size so that Spark broadcasts the small lookup side of the join. The goal is for the lookup to be pushed down to HBase rather than pulling all the data into Spark and computing there, which is not optimal.
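The row-key filter idea from this reply might look something like the sketch below. It is only an illustration: `lookupBatch` is a hypothetical helper, `"col0"` is assumed to be the column mapped to the HBase row key in the catalog, and `df` is assumed to be the connector-backed DataFrame loaded earlier in the thread. Whether an `isin` predicate actually reaches the HBase scan depends on the connector version, so it is worth checking the physical plan.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper: keep only the HBase rows whose row key appears
// in the incoming batch of keys.
// Assumes "col0" is the catalog column mapped to the row key.
def lookupBatch(hbaseDf: DataFrame, batchKeys: Seq[String]): DataFrame =
  hbaseDf.filter(col("col0").isin(batchKeys: _*))

// Usage with illustrative keys; `df` is the DataFrame loaded via the connector.
val matched = lookupBatch(df, Seq("row001", "row002"))
matched.explain() // inspect the plan to see whether the filter is applied at the HBase scan
matched.show()
```

With around 10k keys per batch, it may be worth splitting the keys into smaller groups and comparing timings, since a very large `isin` list can itself become costly.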
