Is there any benefit to querying HDFS data using Spark SQL instead of Impala in PySpark code?
It all depends on what you are looking for: throughput or latency, and whether you need query fault tolerance.
Spark SQL is fault tolerant (failed tasks are retried rather than the whole query), while Impala is known for low latency.
Use Impala for exploratory analytics on large data sets.
Impala is not fault tolerant, meaning that if a machine running the query goes down, the query fails and has to be re-run. However, in our environment, which is a large cluster, we hardly ever have this issue.
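
If you do go with Impala and want to cover the rare node failure, one option is to re-run the query from the client side. Here is a minimal sketch of that idea, not anything Impala itself provides: `run_with_retry` and its parameters are hypothetical names, and `run_query` stands for whatever function submits your query (through your Impala client of choice) and returns the rows.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retry(run_query: Callable[[], T],
                   attempts: int = 3,
                   delay_seconds: float = 0.0) -> T:
    """Call run_query, re-running it from scratch each time it raises.

    run_query is a hypothetical zero-argument callable that submits the
    query and returns its result; the whole query is repeated on failure.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return run_query()
        except Exception as exc:  # in real code, catch your client's error type
            last_error = exc
            if attempt < attempts:
                time.sleep(delay_seconds)  # wait before re-submitting
    raise last_error
```

You would call it as `run_with_retry(lambda: my_impala_query(sql))`, where `my_impala_query` is your own function. This only makes sense for read-only queries that are safe to repeat.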