I use Spark SQL to filter data on HDFS. The input is about 700 files (about 350 GB in total), and the query returns about 60K records, but performance is very poor: the execution time is about 700 seconds. Can anyone tell me what is most likely causing such bad performance? Thanks a lot!
Here are the detailed steps I use to query data on HDFS with Spark SQL:

Step 1. Get the list of HDFS file paths that match the given query pattern (I store the file paths as an index in HBase to speed up the lookup). -> execution time: 1 sec

Step 2. Pass the file paths from step 1 (700 files, 350 GB in total) to Spark SQL and map all records to objects of a defined schema. Then run a Spark SQL query to filter the rows. -> execution time: 700 secs
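For reference, here is a quick back-of-the-envelope calculation of what those numbers imply for step 2. It is only a rough sketch: it assumes 1 GB = 1024 MB and that the 350 GB is spread evenly across the 700 files, and the function name `scan_stats` is just for illustration.

```python
def scan_stats(total_gb, num_files, seconds):
    """Rough average file size and aggregate scan throughput."""
    total_mb = total_gb * 1024  # assumption: 1 GB = 1024 MB
    return {
        "avg_file_mb": total_mb / num_files,       # assumes evenly sized files
        "throughput_mb_per_s": total_mb / seconds,  # aggregate over the whole job
    }

# Numbers taken from step 2 above: 350 GB, 700 files, 700 seconds.
stats = scan_stats(total_gb=350, num_files=700, seconds=700)
print(stats)
```

So each file averages about 512 MB, and the whole job scans about 512 MB per second in aggregate to return the 60K matching records.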