Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Spark SQL?

Highlighted

Spark SQL?

Explorer

I have a spark Application, where I am bringing HBase data and keepnig it as a DataFrame and filtering the dataframe. Since I am dealing with 4 million records, this is taking around 3 mins. I want to bring it down to seconds. Is there a way I can optimize dataframe operations? Is it possible to use somethign like index in with Spark DataFrames?\

 

Thank You

uzi

1 REPLY 1
Highlighted

Re: Spark SQL?

Explorer

Uzi,

 

The problem is the "bringing data from hbase" bit.  The current HBase connector does a remote scan which averages 10K rows per second at the high end.    

 

I would suggest you checkout Splice Machine (Open Source, Plugin parcels for Cloudera).  Splice Machine wrote an optimization that can read data into Spark reading directly from the Hbase store file.  The data transfer rate will be 100X what you get from remote scans.

 

It uses Spark to process SQL queries and would be able to query 4 million records in a few seconds easily.

 

 

 

Cheers,

John

 

Don't have an account?
Coming from Hortonworks? Activate your account here