I have a spark Application, where I am bringing HBase data and keepnig it as a DataFrame and filtering the dataframe. Since I am dealing with 4 million records, this is taking around 3 mins. I want to bring it down to seconds. Is there a way I can optimize dataframe operations? Is it possible to use somethign like index in with Spark DataFrames?\
The problem is the "bringing data from hbase" bit. The current HBase connector does a remote scan which averages 10K rows per second at the high end.
I would suggest you checkout Splice Machine (Open Source, Plugin parcels for Cloudera). Splice Machine wrote an optimization that can read data into Spark reading directly from the Hbase store file. The data transfer rate will be 100X what you get from remote scans.
It uses Spark to process SQL queries and would be able to query 4 million records in a few seconds easily.