Does Spark use in-memory processing or MapReduce jobs? Slow performance compared to Hive

Hello,

I deployed Hortonworks HDP on a 4-node cluster in order to run some benchmarks between tools like Hive and Spark (2.0). Since I started with Hive, I did some research and found that beeline can be used to query Hive data with Spark, using the command beeline -u "jdbc:hive2://hadoop-1:10001/;transportMode=http;httpPath=cliservice" -n spark --force=true -f tpch_query1.sql. I verified that this actually works, but the performance is surprisingly worse than Hive's. Is this a valid comparison between Spark and Hive performance? If not, how can I query the data that I have in Hive without losing performance?

Another point: I read that Spark uses in-memory processing, the same approach as tools like Presto, HAWQ, or Cloudera Impala. But when I execute a query with the command given above, the processing seems to be done by MapReduce jobs. Can you shed some light on these subjects?

1 REPLY

Re: Does Spark use in-memory processing or MapReduce jobs? Slow performance compared to Hive

To compare Spark and Hive on a level playing field, ensure that the number of executors (containers) and their resources are identical in both cases. Spark lets you set the executor count and the memory size per container, and it also supports dynamic resource allocation. With Hive, you should use Tez instead of MR for a fair comparison.
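
As a rough illustration of matching resources, here is a minimal Python sketch that derives one set of Spark settings and one set of Hive-on-Tez settings from the same container size. The function names and the example sizes (4 executors, 4096 MB, 2 cores) are assumptions made up for this sketch, not values from the cluster above. It also assumes Spark's default YARN memory overhead of max(384 MB, 10% of the executor heap), so check that against your Spark version.

```python
# Sketch: derive comparable resource settings for Spark on YARN and Hive on Tez.
# All sizes used here are illustrative assumptions, not tuning recommendations.


def spark_conf(executors, container_mb, cores):
    """Spark settings for a fixed pool of equally sized executors.

    YARN sizes an executor container as heap + memory overhead. Assuming the
    default overhead of max(384 MB, 10% of heap), pick the largest heap that
    still fits inside container_mb.
    """
    heap_mb = min(container_mb - 384, int(container_mb / 1.1))
    return {
        # Pin the executor count so it cannot grow or shrink mid-benchmark.
        "spark.dynamicAllocation.enabled": "false",
        "spark.executor.instances": str(executors),
        "spark.executor.cores": str(cores),
        "spark.executor.memory": f"{heap_mb}m",
    }


def hive_tez_conf(container_mb, cores):
    """Hive settings that select Tez and match the per-container size.

    Tez requests containers per task, so the container *count* is not fixed
    by these properties; only the per-container resources are matched.
    """
    return {
        "hive.execution.engine": "tez",
        "hive.tez.container.size": str(container_mb),
        "hive.tez.cpu.vcores": str(cores),
    }


def as_flags(flag, conf):
    """Render a settings dict as repeated command-line flags."""
    args = []
    for key, value in conf.items():
        args += [flag, f"{key}={value}"]
    return args


# Hypothetical sizing: 4 containers of 4096 MB and 2 cores each.
spark = spark_conf(executors=4, container_mb=4096, cores=2)
hive = hive_tez_conf(container_mb=4096, cores=2)

# spark-submit and the Spark Thrift Server start script take --conf key=value;
# beeline takes --hiveconf key=value.
print(" ".join(as_flags("--conf", spark)))
print(" ".join(as_flags("--hiveconf", hive)))
```

The first printed line can be appended to the command that starts the Spark application or Thrift Server, and the second to the beeline call that runs the Hive query, so both runs request containers of the same size.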

Please accept this answer if it helped you.
