
Is something wrong with my cluster settings: Spark job running for 8 hours on 1TB of data

New Contributor

Recently I set up an 8-node Hadoop and Spark cluster using Ambari on the Microsoft Azure platform. Each node in the cluster is a Standard DS2 v2 instance with the following configuration.

Memory : 7GB 

HDD : 1TB 

CPU : 2 cores 

OS : Ubuntu 14.04

I am trying to run some benchmarks using the Intel HiBench suite, but the wordcount workload takes far longer to execute than I expected. I am not sure whether this has something to do with my configuration or whether it is normal.
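For reference, the wordcount workload just counts how often each word appears in the input. A local Python sketch of the same computation (the sample input here is hypothetical; this is not the HiBench implementation, which runs over a generated 1TB dataset):

```python
from collections import Counter

# Hypothetical sample input; HiBench generates its own dataset.
lines = ["the quick brown fox", "the lazy dog", "the fox"]

# Map: split each line into words. Reduce: count occurrences per word.
counts = Counter(word for line in lines for word in line.split())

print(counts["the"])  # 3
print(counts["fox"])  # 2
```

The distributed version does the same thing, except the map and reduce phases are split across partitions and shuffled between nodes.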

My data and job configurations specified in conf/hibench.conf are presented below.

Data: 1TB (bigdata)

# Mapper number in hadoop, partition number in Spark
hibench.default.map.parallelism         8

# Reducer number in hadoop, shuffle partition number in Spark
hibench.default.shuffle.parallelism     4
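One thing worth checking in these numbers: with only 8 map partitions over 1TB, each task covers a huge slice of data relative to the 7GB of RAM per node, and there are fewer partitions than cores. A back-of-the-envelope check in plain Python (assuming the 1TB figure is the raw input size and the parallelism settings apply to the whole job):

```python
data_gb = 1024            # 1 TB of input data
map_partitions = 8        # hibench.default.map.parallelism
node_ram_gb = 7           # DS2 v2 memory per node
total_cores = 8 * 2       # 8 nodes x 2 cores each

# Data handled per map task, versus RAM available on one node.
gb_per_partition = data_gb / map_partitions
print(gb_per_partition)               # 128.0 GB per task vs 7 GB RAM

# With 16 cores but only 8 partitions, half the cores sit idle.
print(map_partitions < total_cores)   # True
```

Numbers like these usually mean heavy spilling to disk and underused cores, both of which inflate run time.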

The time it took Spark to run the wordcount workload was 8 hours 22 minutes. Is this normal, or is there something wrong with my cluster configuration?

More information about the benchmark tool I am using can be obtained from


Expert Contributor

Hi @SC SC,

DS2 v2 is too small for data nodes and not properly sized for benchmarking. Please see the Azure sizing recommendations for HDP.

/Best regards, Mats

New Contributor

Hi @Mats Johansson,

Thank you for the reply. The link is very useful. However, the recommended instances seem to be very beefy and expensive. Isn't this against the design principles of such distributed systems, i.e. scaling out rather than adding more CPU and memory per node?

I will be happy to know your take on this.

@Ancil McBarnett, you might also want to comment on the link Mats provided.

Thank you.

The main principles are "use more than one machine" for workload scalability, "commodity parts" for financial scalability, and a filesystem and execution platform that handle the failures which follow from both scale and hardware choices.

It doesn't mean you should rush to a 20-node cluster over a 10-node one, not if the 10-node cluster can get your work done faster. In the cloud, if you can finish your work more rapidly, or rent less machine time, you get a good outcome. Spark, in particular, loves having lots of RAM, so it can cache the generated results of RDDs. If it has to discard that work because it runs out of memory, then, if it needs the data later, it has to be calculated again, costing time.
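The recomputation cost is easy to illustrate outside Spark. A plain-Python analogy (not Spark itself): caching an expensive result means paying for it once, while discarding it means paying again on every reuse:

```python
import functools

calls = {"n": 0}

def expensive(x):
    # Stand-in for recomputing an RDD partition.
    calls["n"] += 1
    return x * x

# Memoized version: the analogue of a cached RDD that fits in memory.
cached = functools.lru_cache(maxsize=None)(expensive)

# Without caching: every reuse recomputes.
for _ in range(3):
    expensive(10)
print(calls["n"])  # 3

# With caching: computed once, served from memory afterwards.
calls["n"] = 0
for _ in range(3):
    cached(10)
print(calls["n"])  # 1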

Without recommending any specific machines, then: look for more RAM when you use Spark. 7GB isn't much these days; it's less than consumer laptops ship with.
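If you do stay on small nodes, at least make sure Spark is given the RAM that is there. A hypothetical spark-submit invocation as a starting point (the flag values are illustrative, not tuned recommendations, and `wordcount.py` is a placeholder for your job):

```shell
# Give each executor most of the node's 7GB, leaving headroom for the
# OS and other daemons, and raise parallelism closer to the total
# core count (8 nodes x 2 cores = 16) so no cores sit idle.
spark-submit \
  --executor-memory 4g \
  --executor-cores 2 \
  --conf spark.default.parallelism=32 \
  wordcount.py
```

In HiBench itself, the equivalent knobs are the parallelism settings in conf/hibench.conf and the Spark memory settings in its Spark configuration file.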