
Tested Spark in CDH 5.1.0 along with Hive & Impala, but it turned out that Spark was slower?


New Contributor

Several tests were run, with these results:
1. count(distinct id) on tables with 40 million and 200 million records
execution time: Impala < Hive < Spark < Spark SQL

2. count(id) on a table with 300 million records
execution time: Impala (60s) < Hive (90s) < Spark (170s) < Spark SQL (170s+)

3. count(id) grouped by city, ordered by count, on a table with 300 million records
execution time: Impala (60s) << Hive (150s) < Spark (200s) < Spark SQL (200s+)

4. join of A (1 million records) and B (55 million records)
execution time: Impala (130s) << Hive (200s) << Spark (400s) < Spark SQL (400s+)

 

5. PageRank: Hadoop MR streaming vs. Spark, 1.2 million records
execution time: Hadoop 700s, Spark 400s (ranks.collect() took a long time; the iterations themselves took little)


Cluster setup:

12 nodes, 10 of which run DataNode, Hive Gateway, Spark Worker, and Impalad.

Memory: 32 GB × 3, 16 GB × 7, 8 GB × 2

CPU: 2 × 4-core Xeon

Network: 1000 Mb/s

 

Cluster configuration:

YARN container min/max memory: 1 GB / 8 GB

Spark total executors: 40-60

 

Swapping occurred while jobs were running, on anywhere from one node up to about half of them.

HDFS read I/O was high while jobs ran, with very little write.

Overall cluster disk I/O was low.
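For reference, a quick way to confirm whether a node is actually swapping during a run is to look at the kernel counters directly (these are standard Linux tools, nothing CDH-specific):

```shell
# Check swap usage on a worker node while a job runs (pure /proc, no extra tools)
grep -E 'SwapTotal|SwapFree' /proc/meminfo
# Or watch paging activity live: non-zero "si"/"so" columns mean active swapping
vmstat 5
```

If SwapFree is well below SwapTotal, or si/so stay non-zero while the job runs, the node is paging and timing numbers from that run are suspect.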

 

Questions:

1. When Impala was running, disk I/O was very high compared with HDFS I/O. Why is that?

2. Where might the bottlenecks be?

3. Any advice for tuning?

 

3 REPLIES

Re: Tested Spark in CDH 5.1.0 along with Hive & Impala, but it turned out that Spark was slower?

Master Collaborator

I think the big problem here is swapping. You never want to run such that you swap, especially not for a performance test. It means apps are being told a certain amount of memory is available to take advantage of, when using that memory would be very slow.

 

You probably want to adjust your Spark config (and others' config) to ensure that it's not using so much memory that swapping occurs. All bets are off then. For example, you don't want 8GB YARN containers if your node has 8GB RAM!

 

Also, make sure that in all cases the resource manager is aware of the differences in the nodes' sizes.
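As a rough sketch of that sizing rule (the ~25% headroom figure is a common rule of thumb, not an official recommendation; `yarn.nodemanager.resource.memory-mb` is the YARN property that caps total container memory per node):

```shell
# Size YARN's per-node container memory below physical RAM so the OS and the
# HDFS/Impala daemons never get pushed into swap.
total_mb=$(awk '/MemTotal/ {print int($2/1024)}' /proc/meminfo)
yarn_mb=$(( total_mb * 3 / 4 ))   # reserve ~25% headroom (rule of thumb)
echo "yarn.nodemanager.resource.memory-mb=${yarn_mb}"
```

On a heterogeneous cluster like this one, the value should be computed per node, not set cluster-wide from the largest machine.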


Re: Tested Spark in CDH 5.1.0 along with Hive & Impala, but it turned out that Spark was slower?

New Contributor
I tried using smaller container min/max memory (256 MB/1 GB, 512 MB/1 GB, and 2 GB); swapping can be reduced (with 256 MB, no swap at all), but compared with Hive under the same configuration, Spark was still slower.
I also tried stopping the workers on the small-memory nodes, but got the same result.
Another thing: when the memory per container increases, Hive gets a big performance boost, say 2x.
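For what it's worth, executor sizing on Spark-on-YARN is usually controlled directly at submit time rather than through the container minimum. The flag names below are real spark-submit options; the values are only guesses for this cluster and must fit within the YARN maximum allocation:

```shell
# Illustrative Spark-on-YARN sizing (values are guesses, tune per cluster;
# the jar name is a placeholder for the actual job)
spark-submit --master yarn \
  --num-executors 40 \
  --executor-memory 2g \
  --executor-cores 2 \
  app.jar
```

Very small containers (256 MB) leave almost nothing for Spark after JVM overhead, which by itself can explain poor Spark times even when swapping stops.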

Re: Tested Spark in CDH 5.1.0 along with Hive & Impala, but it turned out that Spark was slower?

Master Collaborator

By container memory, do you mean YARN memory? You're making 256 MB YARN containers? That's far too small. Are you just running more containers then? If you're swapping, something is still wrong and the test isn't really valid. What is causing you to swap? How about turning it off for the test: "sudo swapoff -a"?
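To expand on that suggestion: both commands below are standard Linux administration, run as root on each worker; lowering `vm.swappiness` is a common, less drastic alternative when you don't want to disable swap outright:

```shell
# Option 1: disable swap entirely for the duration of the test (needs root)
#   sudo swapoff -a
# Option 2: keep swap but tell the kernel to avoid it except under real pressure
#   sudo sysctl vm.swappiness=1
# Check the current setting (lower = less eager to swap anonymous pages):
cat /proc/sys/vm/swappiness
```

Either way, rerun the benchmarks afterwards; numbers taken while nodes were paging don't say much about the engines themselves.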
