Created 04-25-2017 06:48 AM
Need a solution for the following scenario: about 3 million historical records are stored in HBase. On each streaming batch, roughly 10k records are pulled in, and for those 10k records we need to fetch the matching records from HBase by key, with the whole operation completing in under 30 seconds. We are using the Spark HBase connector.
Created 04-25-2017 06:49 AM
We are working on 5 node system.
Created 04-25-2017 07:46 AM
Hi Balakumar,
This documentation shows you how to perform a select using the Spark HBase connector.
For example using the SQL syntax:
// Load the DataFrame
val df = withCatalog(catalog)

// SQL example
df.registerTempTable("table")
sqlContext.sql("select count(col1) from table").show
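For context, the `withCatalog(catalog)` call in the snippet above assumes a catalog JSON that maps the HBase table to DataFrame columns, plus a small helper that loads the DataFrame through the connector. A minimal sketch (the table name "records", column family "cf", and column names here are illustrative, not from the thread):

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

// Illustrative catalog: maps HBase table "records" (column family "cf")
// to a DataFrame with a string rowkey column "col0" and one value column "col1".
val catalog = s"""{
  |"table":{"namespace":"default", "name":"records"},
  |"rowkey":"key",
  |"columns":{
    |"col0":{"cf":"rowkey", "col":"key", "type":"string"},
    |"col1":{"cf":"cf", "col":"col1", "type":"string"}
  |}
}""".stripMargin

// Helper used in the snippet above: loads a DataFrame via the SHC data source.
def withCatalog(cat: String)(implicit sqlContext: SQLContext): DataFrame =
  sqlContext.read
    .options(Map(HBaseTableCatalog.tableCatalog -> cat))
    .format("org.apache.spark.sql.execution.datasources.hbase")
    .load()
```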
Created 04-25-2017 10:14 AM
I want to improve the performance of the HBase read operation. Is there any option available in the Spark HBase connector to scale up HBase reads?
Created 05-13-2017 11:52 AM
Hi @Balakumar Balasundaram, you may go with the "predicate pushdown" approach for the rowkeys: pass the small set of incoming records as a broadcast variable, so that only the matching rows are pulled out of HBase.
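As a rough sketch of that idea (the DataFrame and column names `streamingDF`, `hbaseDF`, `key`, and `col0` are hypothetical placeholders): filtering the HBase-backed DataFrame on its rowkey column with the small key set lets the connector push the predicate down to HBase as gets/scans, instead of scanning all 3 million rows into Spark:

```scala
// streamingDF holds the ~10k incoming records with a "key" column;
// hbaseDF is the HBase-backed DataFrame loaded through the connector catalog.
// Collect the small key set on the driver and filter on the rowkey column
// ("col0" here), so the predicate can be pushed down to HBase.
val keys = streamingDF.select("key").rdd.map(_.getString(0)).collect()
val matched = hbaseDF.filter(hbaseDF("col0").isin(keys: _*))
```

Collecting the keys is only reasonable because the streaming side is small (10k keys); the large HBase side is never materialized in Spark.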
I recommend going through this article:
https://hortonworks.com/blog/spark-hbase-dataframe-based-hbase-connector
Alternatively, by adjusting "spark.sql.autoBroadcastJoinThreshold" to the size of your lookup set, Spark can push the work down to HBase rather than pulling all the data into Spark and computing there (which is not optimal).
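Concretely, the setting can be applied like this (the 50 MB threshold and the DataFrame/column names are illustrative; the streaming side must stay below the threshold to be broadcast):

```scala
// Allow tables up to ~50 MB to be broadcast to executors
// (the Spark 1.x/2.x default is 10 MB).
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold",
  (50 * 1024 * 1024).toString)

// Join the small streaming batch against the HBase-backed DataFrame.
// Spark broadcasts the small side, avoiding a shuffle of the large side.
val result = hbaseDF.join(streamingDF, hbaseDF("col0") === streamingDF("key"))
```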