I deployed Hortonworks HDP on a 4-node cluster in order to run some benchmarks between tools like Hive and Spark (2.0). Since I started with Hive, I did some research and found that Beeline can be used to query Hive data with Spark (using the command beeline -u "jdbc:hive2://hadoop-1:10001/;transportMode=http;httpPath=cliservice" -n spark --force=true -f tpch_query1.sql). I verified that this actually works, but the performance is surprisingly slower than Hive. Is this a valid comparison between Spark and Hive performance? If not, how can I query the data that I have in Hive without losing performance?
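As a sketch, one way to query the same Hive tables through Spark's own engine instead of going through HiveServer2 is the spark-sql CLI (the file name tpch_query1.sql comes from my setup above; the --master yarn flag is an assumption about a typical HDP deployment):

```shell
# Run the query with Spark SQL directly: Spark reads the table definitions
# from the Hive metastore and executes the query with its own engine.
spark-sql --master yarn -f tpch_query1.sql
```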
Another aspect: I read that Spark uses in-memory processing, following the same logic as tools like Presto, HAWQ, or Cloudera Impala. But when I execute a query using the command written above, it seems the processing is done by MapReduce jobs. Can you shed some light on these subjects?
To compare Spark vs. Hive on a level playing field, ensure that the number of executors (containers) and their resources are identical in both cases. Spark has settings for executor count and memory size per container, plus dynamic resource allocation, so pin these explicitly. With Hive you should use Tez instead of MapReduce for a fair comparison. Also note that your beeline URL (port 10001, transportMode=http) points at HiveServer2, so the query is executed by Hive's configured engine rather than by Spark, which is why you are seeing MapReduce jobs.
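A minimal sketch of pinning comparable resources on both sides; all of the numbers below are placeholder assumptions for illustration, not tuned recommendations:

```shell
# Spark side: fixed executor count and size, dynamic allocation disabled
# so the resource footprint stays constant for the whole benchmark run.
spark-sql --master yarn \
  --num-executors 8 \
  --executor-memory 4g \
  --executor-cores 2 \
  --conf spark.dynamicAllocation.enabled=false \
  -f tpch_query1.sql

# Hive side: Tez as the execution engine, with a container size (in MB)
# roughly matching the Spark executor memory above.
hive --hiveconf hive.execution.engine=tez \
     --hiveconf hive.tez.container.size=4096 \
     -f tpch_query1.sql
```

With dynamic allocation left on, Spark may grab a varying number of containers per query, which makes run-to-run timings hard to compare against a fixed-size Tez session.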