
SparkSQL performance issue


Dear pals,

 

I use SparkSQL to filter data on HDFS, feeding it about 700 files (about 350GB in total). The query returns about 60K records, but performance is very poor: execution takes about 700 secs. Could anyone tell me what the likely cause of such poor performance is? Thanks a lot!

 

Detailed steps of the SparkSQL query on the HDFS data:
step1. Get the list of HDFS file paths matching the given query pattern (I store the file paths as an index in HBase to speed up the lookup) -> execution time: 1 sec
step2. Pass the file paths from step1 (700 files, 350GB total) to SparkSQL and map every record to an object of the defined schema, then run the SparkSQL query to filter the rows -> execution time: 700 secs
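
To put the figures above in perspective, here is a rough back-of-the-envelope calculation. It is only a sketch: it assumes the 350GB is the size as stored on HDFS and that 1 GB = 1024 MB, neither of which is stated above, and the variable names are just for illustration.

```python
# Approximate figures quoted in this post.
total_gb = 350             # total input size (assumed to be the on-disk size)
num_files = 700            # number of input files
elapsed_s = 700            # execution time of step2, in seconds
returned_records = 60_000  # records returned by the query

# Derived values; 1 GB = 1024 MB is an assumption.
avg_file_gb = total_gb / num_files
throughput_gb_s = total_gb / elapsed_s
throughput_mb_s = throughput_gb_s * 1024
records_per_s = returned_records / elapsed_s

print(f"average file size:    {avg_file_gb:.2f} GB")
print(f"aggregate throughput: {throughput_gb_s:.2f} GB/s ({throughput_mb_s:.0f} MB/s)")
print(f"records returned/sec: {records_per_s:.1f}")
```

So step2 scans on average about 0.5 GB per file and about 0.5 GB per second across the whole job, while only a small number of rows survive the filter.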

 

P.S. The files stored on HDFS are bz2-compressed (.bz2).