Several tests were run, with these results:
1. count(distinct id) from tables with 40 million and 200 million records
execution time: Impala < Hive < Spark < Spark SQL
2. count(id) from a table with 300 million records
execution time: Impala (60s) < Hive (90s) < Spark (170s) < Spark SQL (170+)
3. count(id) group by city, ordered by count, on a table with 300 million records
execution time: Impala (60s) << Hive (150s) < Spark (200s) < Spark SQL (200+)
4. join A (1 million) & B (55 million)
execution time: Impala (130s) << Hive (200s) << Spark (400s) < Spark SQL (400+)
5. PageRank: Hadoop MR streaming vs Spark, 1.2 million records
execution time: Hadoop 700s, Spark 400s (ranks.collect() took a long time; the iterations took little)
12 nodes, 10 of them running DN, Hive GW, Spark Worker, and Impalad.
Memory: 32G * 3, 16G * 7, 8G * 2
CPU: 2 * 4 cores Xeon
Network: 1000 Mb/s
YARN container min/max memory: 1G/8G
Spark total executors: 40 to 60
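As a rough sanity check on these numbers, the sketch below adds up the physical memory listed above and divides it by the executor counts. It assumes, unrealistically, that every node's full RAM is available to Spark executors; the function name is made up for illustration.

```python
# Node memory sizes in GB, taken from the "Memory" line: 32G * 3, 16G * 7, 8G * 2.
NODE_MEMORY_GB = [32] * 3 + [16] * 7 + [8] * 2


def memory_per_executor_gb(total_gb, executors):
    """Upper bound on average memory per executor if all RAM went to Spark."""
    return total_gb / executors


total_gb = sum(NODE_MEMORY_GB)  # 224 GB of physical memory in total
for executors in (40, 60):
    share = memory_per_executor_gb(total_gb, executors)
    print(f"{executors} executors: at most {share:.1f} GB each")
```

With nothing reserved for the OS, HDFS, Hive, or Impala, this gives at most 5.6 GB per executor at 40 executors and about 3.7 GB at 60, so the real budget per executor is smaller than that.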
Swapping occurred while jobs were running, on anywhere from one node to at most half of them.
HDFS read IO was high while jobs were running, with very little write.
Cluster disk IO was low.
1. When Impala was running, disk IO was very high compared with HDFS IO. Why is that?
2. Where might the bottlenecks be?
3. Any advice for tuning?
I think the big problem here is swapping. You never want to run such that you swap, especially not for a performance test. It means apps are being told a certain amount of memory is available to take advantage of, when using that memory would be very slow.
You probably want to adjust your Spark config (and the other engines' configs) to ensure that nothing uses so much memory that swapping occurs. Once it does, all bets are off. For example, you don't want 8GB YARN containers if your node has 8GB RAM!
Also make sure in all cases the resource manager is aware of the difference in the nodes' size.
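To make the sizing point concrete, here is a minimal sketch that computes a per-node memory budget for YARN and checks whether the 8G maximum container fits on each node size. The 4 GB reserve for the OS and other daemons is a made-up figure, not something from this thread, and the function names are hypothetical. The resulting value is the kind of number one would set per node in `yarn.nodemanager.resource.memory-mb`.

```python
# Hypothetical headroom for the OS, DataNode, Impalad, etc.
# This value is an assumption for illustration, not a recommendation.
RESERVED_GB = 4


def yarn_memory_mb(node_ram_gb, reserved_gb=RESERVED_GB):
    """Memory (in MB) to offer YARN on one node after leaving headroom."""
    usable_gb = max(node_ram_gb - reserved_gb, 0)
    return usable_gb * 1024


def max_container_fits(node_ram_gb, max_container_gb, reserved_gb=RESERVED_GB):
    """True if a single maximum-size container fits in the node's YARN budget."""
    return max_container_gb * 1024 <= yarn_memory_mb(node_ram_gb, reserved_gb)


for ram_gb in (32, 16, 8):  # node sizes from the cluster description
    print(ram_gb, yarn_memory_mb(ram_gb), max_container_fits(ram_gb, 8))
```

Under that assumed reserve, the 32G and 16G nodes can host an 8G container, while the 8G nodes cannot, which is why each node size needs its own setting.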
Container memory means YARN memory? You're making 256m YARN containers? That's far too small. So are you just running more containers? If you're swapping, something is still wrong and the test isn't really valid. What is causing you to swap? How about turning it off for the test? "sudo swapoff -a"
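One way to confirm whether a node is swapping before and after a test run is to read the SwapTotal and SwapFree fields of /proc/meminfo on Linux. The sketch below parses that format; the function name and the sample text are made up for illustration, and on a real node you would pass in the contents of /proc/meminfo instead.

```python
def swap_used_kb(meminfo_text):
    """Return swap in use (kB) from text in /proc/meminfo format."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            fields[key.strip()] = int(parts[0])  # first token is the number
    return fields["SwapTotal"] - fields["SwapFree"]


# Hypothetical sample; on a node, use open("/proc/meminfo").read() instead.
sample = """MemTotal:        8167848 kB
MemFree:          215320 kB
SwapTotal:       2097148 kB
SwapFree:        1048576 kB
"""
print(swap_used_kb(sample), "kB of swap in use")
```

A nonzero and growing value during a run means the measured times include swap traffic. After "sudo swapoff -a", both fields should read 0.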