
Why is Hive with a UDF faster than Spark?

New Contributor

We are running a comparison of Hive with a UDF versus Spark.

The aim is to choose the faster solution for encrypting/decrypting data.

The tech stack we are using is as follows:

  • HDP 2.6.5
  • Hive 1.2.1000
  • Spark2 2.x
  • YARN + MapReduce2 2.7.3

The data are stored on HDFS as CSV files:

  • Data set 1 (big data set): 1M rows, 44 columns, 19,000 unique customers;
  • Data set 2 (small data set): 25k rows, 44 columns, 494 unique customers;
  • 22 columns are encrypted, using a unique key per customer.

The Hive UDF is written in Java; the Spark function is written in Scala. The encryption/decryption functions are essentially the same.

To compare the two frameworks, we run a count query that decrypts the whole table but returns a single value as output.

Hive is run from the beeline command line. Spark is run from Zeppelin.

Keys are stored in a file on HDFS.
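For context, here is a minimal sketch of what the Spark side of such a benchmark might look like (the body of decrypt, the file path, and the table and column names are hypothetical placeholders, not the actual code from this post):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("decrypt-benchmark").getOrCreate()

// Placeholder for the real routine, which would look up the per-customer
// key (loaded from the key file on HDFS) and decrypt the value with it.
def decrypt(value: String, customerId: String): String = value

// Register the Scala function as a UDF so it is callable from SQL,
// mirroring how the Java UDF is registered in Hive.
spark.udf.register("decrypt", decrypt _)

// Load the encrypted CSV from HDFS and expose it to SQL.
spark.read.option("header", "true")
  .csv("/data/encrypted_big.csv")
  .createOrReplaceTempView("encrypted_big")

// The comparison query: decrypt a column for every row but return a
// single value, so output size does not dominate the measurement.
spark.sql("SELECT COUNT(decrypt(col1, customer_id)) FROM encrypted_big").show()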

Spark (results in Zeppelin), average query times:

  • Encrypted small data set, with decryption: 11.8 s
  • Encrypted big data set, with decryption: 54.8 s
  • Unencrypted small data set, no decryption: 9.6 s
  • Unencrypted big data set, no decryption: 12 s

Hive (results from beeline), average query times:

  • Encrypted small data set, with decryption: 5.7 s
  • Encrypted big data set, with decryption: 12.3 s
  • Unencrypted small data set, no decryption: 4.8 s
  • Unencrypted big data set, no decryption: 10.4 s

Additionally, we use a custom function to measure Spark execution time (in addition to the execution time Zeppelin displays):

def time[A](f: => A): A = {
  val s = System.nanoTime
  val ret = f // evaluating the by-name parameter runs the measured code
  println("time: " + (System.nanoTime - s) / 1e6 + " ms")
  ret
}
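For example (table and column names are hypothetical), the benchmark query can be wrapped like this:

val result = time {
  spark.sql("SELECT COUNT(decrypt(col1, customer_id)) FROM encrypted_small").collect()
}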

The times reported by this function are on average about 10 seconds shorter than the times Zeppelin displays. So, for example, the average time for querying the encrypted small data set with decryption is 1.8 s (instead of 11.8 s).

The question: why is Hive faster than Spark?

1 ACCEPTED SOLUTION


There are a TON of variables at play here. First up, the "big" dataset isn't really all that big for Hive or Spark, and that always factors into the results. My *hunch* (just a hunch) is that your Hive query from beeline is able to use an existing session and can get access to as many containers as it would like. Conversely, Zeppelin may have a SparkContext with a smaller number of executors than your Hive query can get access to. Of course, the "flaw in my slaw" is that these datasets are relatively small anyway.
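One way to check that hunch, assuming Spark 2.x in the Zeppelin notebook (where sc is the interpreter's SparkContext), is to inspect what the context actually holds, e.g.:

// One entry per executor (the driver appears in this list too), which
// shows how many executors the Zeppelin SparkContext really received.
sc.getExecutorMemoryStatus.keys.foreach(println)

// Executor sizing is usually set in Zeppelin's Spark interpreter settings:
// spark.executor.instances, spark.executor.cores, spark.executor.memory
// (or spark.dynamicAllocation.* when dynamic allocation is enabled).
println(sc.getConf.get("spark.executor.instances", "not set"))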

Spark's "100x improvement" line always refers to iterative (i.e. ML/AI) processing. For traditional querying and data pipelining, Spark runs faster when there are a bunch of tasks (mappers and reducers) to run: it can transition between them in milliseconds within its pre-allocated executor containers, instead of the seconds Hive has to burn talking to YARN's ResourceManager to get the needed containers.

I realize, now that I review this before hitting "post answer", that it's less the answer you were looking for and more an opinion piece. 🙂 Either way, good luck and happy Hadooping/Sparking!


2 REPLIES

New Contributor

Hi Lester, thank you very much for your answer.

I agree that there are a "TON of variables at play". It is true that these tests were not performed on really big data, and that can definitely have an impact. I named the data sets that way to distinguish the smaller from the bigger one, rather than to suggest I am dealing with "really big data sets" 🙂

I will check the number of executors. Thank you for the good tip.

I like your reasoning. Even though I realize there are a "TON" of parameters, the tests were prepared reasonably carefully. The results were so surprising that I decided to post this question, especially since everywhere I look, Spark (of which, by the way, I am a big fan) is said to be "always" faster 😉