About michal_zbikowsk

michal_zbikowsk · 02-25-2019

Hi Lester, thank you very much for your answer. I agree that there are a “TON of variables at play”. It is true that these tests were not performed on really big data, and that can definitely have an impact. I named the data sets “big” and “small” only to distinguish the larger one from the smaller one, not to claim that I am dealing with “really big data sets” 🙂 I will check the number of executors; thank you for the good tip. I like your logical reasoning. Even though I realize that there are a “TON” of parameters, the tests were prepared reasonably carefully. The results were so surprising that I decided to post this question, especially since everywhere I look I read that Spark (which, by the way, I am a big fan of) is “always” faster 😉

michal_zbikowsk · 02-13-2019

We are running a comparison of Hive with a UDF vs Spark. The aim is to choose the faster solution for encrypting/decrypting data.

The tech stack we are using is as follows:
HDP 2.6.5
Hive 1.2.1000
Spark2 2.x
YARN + MapReduce2 2.7.3

Data are stored on HDFS as csv files:
Data set 1 (big data set): 1M rows, 44 columns, 19000 unique customers
Data set 2 (small data set): 25k rows, 44 columns, 494 unique customers

22 columns are encrypted, using a unique key for each unique customer. The UDF on Hive is written in Java. The function on Spark is written in Scala. The encrypting/decrypting functions are basically the same. To compare the two frameworks we run a count (to decrypt the whole table and get one value as output). Hive is run from the beeline command line. Spark is run from Zeppelin. Keys are stored in a file on HDFS.

Spark (results in Zeppelin):
The average time for querying the encrypted small data set with decryption: 11.8 s
The average time for querying the encrypted big data set with decryption: 54.8 s
---
The average time for querying the unencrypted small data set (no decryption): 9.6 s
The average time for querying the unencrypted big data set (no decryption): 12 s

Hive:
The average time for querying the encrypted small data set with decryption: 5.7 s
The average time for querying the encrypted big data set with decryption: 12.3 s
---
The average time for querying the unencrypted small data set (no decryption): 4.8 s
The average time for querying the unencrypted big data set (no decryption): 10.4 s

Additionally, we use a custom function to measure Spark execution time (in addition to the execution time Zeppelin displays):

    def time[A](f: => A) = {
      val s = System.nanoTime
      val ret = f
      println("time: " + (System.nanoTime - s) / 1e6 + "ms")
      ret
    }

The times measured by this function are on average 10 seconds shorter than the times Zeppelin displays. For example, the average time for querying the encrypted small data set with decryption is 1.8 s (instead of 11.8 s).

The question: Why is Hive faster than Spark?
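For anyone who wants to reproduce averages like the ones above, here is a minimal sketch of a helper in the same spirit as `time`. It is only an illustration, not the code used for the numbers in this post: the names `TimingSketch`, `timed` and `averageMillis` are hypothetical, and discarding a warm-up run is an assumption about how one might reduce first-run effects.

```scala
object TimingSketch {
  // Runs the block once and returns its result together with the
  // elapsed wall-clock time in milliseconds.
  def timed[A](f: => A): (A, Double) = {
    val start = System.nanoTime
    val result = f
    (result, (System.nanoTime - start) / 1e6)
  }

  // Runs `warmup` untimed iterations (an assumption, to reduce first-run
  // effects), then `runs` timed iterations, and returns the mean time in ms.
  def averageMillis[A](runs: Int, warmup: Int = 1)(f: => A): Double = {
    require(runs > 0, "runs must be positive")
    (1 to warmup).foreach(_ => f)
    val samples = (1 to runs).map(_ => timed(f)._2)
    samples.sum / samples.size
  }
}
```

In a Zeppelin paragraph this could be called as `TimingSketch.averageMillis(runs = 5) { spark.table("encrypted_small").count() }`, where the table name is made up for the example. Because the block is passed by name, it is re-evaluated on every run, so each sample covers a full execution of the query.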

Member Since: 02-12-2019 04:42 PM
Last Visited: 02-25-2019 09:28 PM
Posts: 2
Kudos received: 1

Cloudera Community

Re: Why Hive with UDF is faster than Spark ?

Why Hive with UDF is faster than Spark ?