Created on 12-05-2017 12:12 AM - last edited on 12-05-2017 05:40 AM by cjervis
Is there any benefit to querying HDFS data with Spark SQL instead of Impala from PySpark code?
Created on 12-05-2017 08:19 AM - edited 12-05-2017 08:34 AM
It all depends on what you are looking for: throughput, low latency, or query fault tolerance.
Spark SQL is fault tolerant; Impala is known for low latency.
Use Impala for exploratory analytics on large data sets.
Impala is not fault tolerant, meaning that if a machine running the query goes down, the query has to be re-run. However, in our environment (a large cluster) we hardly ever have this issue.