To get an idea of the write performance of a Spark cluster, I've created a Spark version of the standard TestDFSIO tool, which measures the I/O performance of HDFS in your cluster. Lies, damned lies, and benchmarks: the goal of this tool is to provide a sanity check of your Spark setup, focused on HDFS write performance rather than compute performance. Think the tool can be improved? Feel free to submit a pull request or raise a GitHub issue.
Parameters:

<files/partitions>: should ideally equal the recommended spark.default.parallelism, i.e. executor cores x executor instances (see the example after this list).
<bytes_per_file>: should fit in memory, for example 90000000.
<write_repetitions>: number of times the test RDD is re-written to disk; the benchmark result is averaged over these writes.
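For example, on a hypothetical cluster running 12 executors with 4 cores each, spark.default.parallelism would be 12 x 4 = 48, so 48 is a sensible value for <files/partitions>.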
Usage:

spark-submit --class org.ward.Benchmark --master yarn --deploy-mode cluster --num-executors X --executor-cores Y --executor-memory Z target/benchmark-1.0-SNAPSHOT-jar-with-dependencies.jar <files/partitions> <bytes_per_file> <write_repetitions>
It's important to get the number of executors and cores right: you want the maximum amount of parallelism without exceeding the capacity of the cluster.

CLI example for 12 workers with 30 GB of memory per node:
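The exact figures below are illustrative (one executor per worker, 4 cores per executor, and some memory left as headroom for the OS and YARN overhead) and should be tuned to your own cluster:

spark-submit --class org.ward.Benchmark --master yarn --deploy-mode cluster --num-executors 12 --executor-cores 4 --executor-memory 25G target/benchmark-1.0-SNAPSHOT-jar-with-dependencies.jar 48 90000000 10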
This command will write out the generated RDD 10 times, and will calculate an aggregate throughput over those 10 writes.
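Assuming the throughput is computed as total bytes written divided by total write time, the example above writes 48 x 90000000 x 10 bytes ≈ 43.2 GB in total, and the reported figure is that volume divided by the combined duration of the 10 writes.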