We have a project that currently uses shell scripts and Hive, with Tez as the execution engine. As a POC we tried replacing the shell scripts with a Spark application and executed our HQL through Spark. One of the clients came back asking why we would need a Spark application at all, since we could simply set Spark as Hive's execution engine and keep running our regular shell scripts and Oozie workflows. Which is the better option:
set hive.execution.engine=spark; OR build a Spark application and execute the HQL through the Spark APIs? If the performance is the same for both, why do we need to write code in Spark? What is the advantage of writing a Spark application using Spark SQL?
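For concreteness, the first option changes nothing in the existing workflow except the engine setting. A minimal sketch of such a Hive script (table and column names are made up for illustration):

```sql
-- Option 1: the existing Hive script, unchanged apart from the
-- engine setting; still launched from a shell script or Oozie.
set hive.execution.engine=spark;

-- Hive parses, plans, and optimizes the query; Spark only
-- executes the resulting stages in place of Tez.
SELECT region, SUM(amount) AS total
FROM sales
GROUP BY region;
```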
Spark SQL can use the Hive metastore to get the metadata of the data stored in HDFS. This metadata enables Spark SQL to better optimize the queries it executes. Here Spark is the query processor.
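A minimal sketch of the second option, assuming Spark 2.x and an existing Hive metastore (the application, table, and column names are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

object HqlThroughSpark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hql-through-spark")
      .enableHiveSupport() // read table metadata from the Hive metastore
      .getOrCreate()

    // The same HQL a Hive script would run, but here Spark SQL's
    // Catalyst optimizer plans the query and Spark executes it.
    val totals = spark.sql(
      "SELECT region, SUM(amount) AS total FROM sales GROUP BY region")

    // Unlike a plain Hive script, the result is a DataFrame that can
    // be cached, joined, or processed further with the Spark APIs
    // before being written back.
    totals.cache()
    totals.write.mode("overwrite").saveAsTable("sales_totals")

    spark.stop()
  }
}
```

This is the practical difference: with hive.execution.engine=spark you are limited to what HQL can express, while a Spark application lets you mix SQL with the DataFrame API, caching, and custom logic in one job.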
When Hive uses Spark as its execution engine, Hive remains the query processor and Spark only runs the resulting stages. See the JIRA entry: HIVE-7292