Spark two nodes cluster performance improvement


Hello members,

I ran the following program (a k-mer counter) on a two-node Spark cluster with a 10.2 GB data set. It took around 35 minutes. In another experiment, I ran a Java program for the same k-mer counting problem on a two-node Hadoop 2 cluster, and it also took around 35 minutes. Spark is generally expected to outperform Hadoop MapReduce, but in my case the running time on both clusters is the same.

I would like to know whether my Spark job is using all the available resources of the two-node cluster efficiently. I would also like to know whether my Scala program for the k-mer counter can be improved.

My hardware is as follows:

CPU: Intel i7 processor with 8 cores on both machines in the cluster

RAM: 8 GB on both machines in the cluster

OS: Ubuntu 16.04

Hard disk capacity: master node 1 TB, slave node 500 GB

The nodes are connected through a network switch.

Ip address of master:, Ip address of slave:

If you need an explanation of the k-mer counting problem, please see the following.

If the given string is ATCGATGATT and the k-mer size is 5, then the following k-mers are generated:

ATCGA
TCGAT
CGATG
GATGA
ATGAT
TGATT

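The k-mers of a read are every window of length k, so a read of length n yields n - k + 1 of them (6 for the 10-character example string when k is 5). Here is a minimal plain-Scala sketch using the same `sliding` call as the Spark program. The object and method names are only illustrative.

```scala
object KmerExample {
  // All overlapping substrings of length k, left to right.
  // The length filter drops the single short window that `sliding`
  // returns when the read is shorter than k.
  def kmers(read: String, k: Int): List[String] =
    read.sliding(k, 1).filter(_.length == k).toList

  def main(args: Array[String]): Unit =
    kmers("ATCGATGATT", 5).foreach(println)
}
```

Note that without the length filter, a read shorter than k would contribute itself as a "k-mer", which may or may not be what is wanted in the counting job.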
My execution command is as follows

spark-submit --class Kmer1 --master spark://saravanan:7077 --executor-memory 5g /home/hduser/sparkapp/target/scala-2.11/sparkapp_2.11-0.1.jar hdfs:// hdfs://

Sample input


My code filters out the 1st, 3rd, and 4th lines of each four-line record, keeping only the sequence line.
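Assuming the input really is strictly four lines per record, the sequence line can be selected by position alone. The sketch below shows the idea on a local collection. The record contents are made up for illustration, and the object name is hypothetical.

```scala
object SequenceLines {
  // Keep only the 2nd line of every 4-line record (indices 1, 5, 9, ...).
  def sequences(lines: Seq[String]): Seq[String] =
    lines.zipWithIndex.collect { case (line, i) if i % 4 == 1 => line }

  def main(args: Array[String]): Unit = {
    // Hypothetical record, not taken from the original data set.
    val record = Seq("@read1", "ATCGATGATT", "+", "IIIIIIIIII")
    sequences(record).foreach(println)
  }
}
```

On an RDD the same predicate could be applied after `zipWithIndex`, for example `records.zipWithIndex.filter { case (_, i) => i % 4 == 1 }.map(_._1)`, which would do the work of both filters in one pass. This only holds if no record is wrapped over more than four lines.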

Number of executors created is 2

Number of partitions created is 78

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object Kmer1 {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("Kmer1")
    val sc = new SparkContext(sparkConf)
    val input = args(0)
    val K = 25
    val broadcastK = sc.broadcast(K)
    val records = sc.textFile(input)
    // drop every 4th line of the input
    val only_three = records.zipWithIndex.filter { case (_, i) => (i + 1) % 4 != 0 }.map { case (e, _) => e }
    // remove the records that are not actual sequence data
    val filteredRDD = only_three.filter(line => {
      !(
        line.startsWith("@") ||
        line.startsWith("+") ||
        line.startsWith(";") ||
        line.startsWith("!") ||
        line.startsWith("~")
      )
    })
    // emit (k-mer, 1) for every window of length K
    val kmers = filteredRDD.flatMap(_.sliding(broadcastK.value, 1).map((_, 1)))
    // find frequencies of kmers
    val kmersGrouped = kmers.reduceByKey(_ + _)
    // assumption: the second command-line argument is the output path
    kmersGrouped.saveAsTextFile(args(1))
    // done!
    sc.stop()
  }
}

Re: Spark two nodes cluster performance improvement

Cloudera Employee

Did you see any spilling to disk? Or were the 2 executors able to use all 10 GB of memory without spilling? The main thing to check is whether Spark is using memory effectively.

You may be using Spark's dynamic allocation of executors. If so, please manually increase the number of reducers, and check the GC logs for memory utilization.

This will give you insight into how much you can improve.
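One way to act on the advice about reducers: `reduceByKey` accepts an explicit partition count as a second argument, so the number of reduce tasks does not have to follow the input's partitioning. The helper below is a rough sketch for picking that count from the shuffle size. The helper name and the idea of targeting a fixed size per partition are assumptions for illustration, not something stated in this thread.

```scala
object ReducerCount {
  // Number of reduce-side partitions so that each one holds roughly
  // `targetMb` megabytes of shuffle data. Always returns at least 1.
  def partitionsFor(shuffleMb: Long, targetMb: Long): Int = {
    require(shuffleMb >= 0 && targetMb > 0, "sizes must be non-negative / positive")
    math.max(1, math.ceil(shuffleMb.toDouble / targetMb).toInt)
  }
}
```

In the job this would be used as `kmers.reduceByKey(_ + _, n)`, where `n` comes from `ReducerCount.partitionsFor` and the shuffle size is read from the Spark UI after a first run. The target size per partition is a tuning knob to experiment with rather than a fixed rule.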

Re: Spark two nodes cluster performance improvement


Thanks for your reply. My input data set is 10.2 GB, but after the map function in the program, the amount of intermediate data generated is around 40 GB. The number of unique keys is around 45,555,500. I think Spark cannot keep this much intermediate data in memory and keeps it on the hard disk instead. What is the best way to tune Spark when we have to reduce a massive number of unique keys?
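One option that might shrink the intermediate data, offered as a sketch rather than a tested fix: since each base needs only 2 bits, a 25-mer fits in 50 bits, so every key can be a `Long` instead of a 25-character `String`. The object and method names below are hypothetical. Unlike the posted program, this version skips any window containing a character other than uppercase A, C, G, or T (for example N), which changes the counts for such reads.

```scala
object KmerEncoding {
  // Map a base to its 2-bit code; -1 marks anything that is not A, C, G or T.
  private def code(c: Char): Int = c match {
    case 'A' => 0
    case 'C' => 1
    case 'G' => 2
    case 'T' => 3
    case _   => -1
  }

  // Encode every k-mer of `read` as a Long using a rolling 2-bit window.
  // Windows that contain a non-ACGT character are skipped. Requires k <= 31.
  def encodedKmers(read: String, k: Int): Iterator[Long] = {
    require(k >= 1 && k <= 31, "k must be between 1 and 31")
    val mask = (1L << (2 * k)) - 1
    var current = 0L // bits of the most recent bases
    var valid = 0    // number of consecutive valid bases seen so far
    read.iterator.flatMap { ch =>
      val c = code(ch)
      if (c < 0) {
        valid = 0
        current = 0L
        Iterator.empty
      } else {
        current = ((current << 2) | c) & mask
        valid += 1
        if (valid >= k) Iterator.single(current) else Iterator.empty
      }
    }
  }

  // Turn an encoded k-mer back into its string form, e.g. for the final output.
  def decode(encoded: Long, k: Int): String = {
    val sb = new StringBuilder
    var i = k - 1
    while (i >= 0) {
      sb.append("ACGT".charAt(((encoded >> (2 * i)) & 3L).toInt))
      i -= 1
    }
    sb.toString
  }
}
```

In the Spark job, the `flatMap` step would then become something like `filteredRDD.flatMap(KmerEncoding.encodedKmers(_, 25).map((_, 1)))`, with `decode` applied only when writing the results. Whether this actually reduces the running time on this cluster would need to be measured.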