Need a solution for the following scenario: say 3 million historical records are stored in HBase. On the streaming side, say 10k records have been pulled, and for those 10k records we need to fetch their matching records from HBase by key; the whole operation should complete in under half a minute. We are using the Spark HBase connector.
For example, using the SQL syntax:

```scala
// Load the DataFrame
val df = withCatalog(catalog)

// SQL example
df.registerTempTable("table")
sqlContext.sql("select count(col1) from table").show
```
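For context, `withCatalog` here is the usual shc-style helper that maps an HBase table to a DataFrame via a JSON catalog. A minimal sketch, assuming a table named `table1` with a string rowkey and one column family `cf1` (the names are placeholders, not from the original question):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

// Hypothetical catalog: maps HBase table "table1" to DataFrame columns.
// "col0" is backed by the rowkey, "col1" by cf1:col1.
def catalog = s"""{
  "table":{"namespace":"default", "name":"table1"},
  "rowkey":"key",
  "columns":{
    "col0":{"cf":"rowkey", "col":"key", "type":"string"},
    "col1":{"cf":"cf1", "col":"col1", "type":"string"}
  }
}"""

// Load an HBase table as a DataFrame through the shc connector.
def withCatalog(cat: String): DataFrame =
  sqlContext.read
    .options(Map(HBaseTableCatalog.tableCatalog -> cat))
    .format("org.apache.spark.sql.execution.datasources.hbase")
    .load()
```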
I want to improve the performance of the HBase read operation. Is there any option in the Spark HBase connector to scale up HBase reads?
Hi @Balakumar Balasundaram, you may go with the "predicate pushdown" approach: pass the rowkeys of the smaller (streaming) dataset as a broadcast variable so that only the matching rows are pulled out of HBase.
I recommend going through the article.
Alternatively, adjust "spark.sql.autoBroadcastJoinThreshold" to cover the required lookup size; Spark can then broadcast the small lookup side and push the predicate down to HBase, rather than pulling all the data into Spark and computing the join there (which is not optimal).
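The two approaches above can be sketched as follows. This assumes the streaming micro-batch is available as a DataFrame `streamingDF` with a rowkey column `col0` (both names are hypothetical, matching the catalog sketch earlier, not from the original thread); with the shc connector, an `isin` filter on the rowkey column is pushed down to HBase as gets/range scans instead of a full table scan:

```scala
import org.apache.spark.sql.functions.broadcast
import spark.implicits._

val hbaseDF = withCatalog(catalog)

// Approach 1: collect the ~10k rowkeys on the driver and filter.
// The rowkey predicate is pushed down to HBase, so only matching
// rows are read, not all 3 million.
val keys: Array[String] = streamingDF.select("col0").as[String].collect()
val matched = hbaseDF.filter($"col0".isin(keys: _*))

// Approach 2: a broadcast join. Raising the threshold (bytes) lets
// the 10k-row side qualify for auto-broadcast; broadcast() forces it.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10 * 1024 * 1024)
val joined = hbaseDF.join(broadcast(streamingDF), Seq("col0"))
```

Note that with the plain broadcast join Spark avoids a shuffle, but whether the key filter actually reaches HBase depends on the connector's pushdown support, which is why the explicit `isin` filter on the rowkey is often the safer bet for a strict sub-30-second budget.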