Created on 05-09-2017 09:22 AM
To get an idea of the write performance of a Spark cluster, I've created a Spark version of the standard TestDFSIO tool, which measures the I/O performance of HDFS in your cluster. Lies, damn lies and benchmarks: the goal of this tool is to provide a sanity check of your Spark setup, focusing on HDFS write performance, not on compute performance. Think the tool can be improved? Feel free to submit a pull request or raise a GitHub issue.
Getting the Spark Jar
Download the Spark Jar from here: https://github.com/wardbekker/benchmark/releases/download/v0.1/benchmark-1.0-SNAPSHOT-jar-with-depen...
It's built for Spark 1.6.2 / Scala 2.10.5.
Or build from source:
$ git clone https://github.com/wardbekker/benchmark
$ cd benchmark && mvn clean package
Submit args explained
<files/partitions>: should ideally be equal to the recommended spark.default.parallelism (cores x instances).
<bytes_per_file>: should fit in memory, for example 90000000.
<write_repetitions>: number of times the test RDD is re-written to disk. The benchmark is averaged over these repetitions.
spark-submit --class org.ward.Benchmark --master yarn --deploy-mode cluster --num-executors X --executor-cores Y --executor-memory Z target/benchmark-1.0-SNAPSHOT-jar-with-dependencies.jar <files/partitions> <bytes_per_file> <write_repetitions>
CLI Example for 12 workers with 30GB mem per node:
It's important to get the number of executors and cores right: you want the maximum parallelism without going over the maximum capacity of the cluster.
This command will write out the generated RDD 10 times, and will calculate an aggregate throughput over it.
spark-submit --class org.ward.Benchmark --master yarn --deploy-mode cluster --num-executors 60 --executor-cores 3 --executor-memory 4G target/benchmark-1.0-SNAPSHOT-jar-with-dependencies.jar 180 90000000 10
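The numbers in the command above can be sanity-checked with a little arithmetic. The following is only a rough sketch: it assumes executors are spread evenly over the 12 workers, and it ignores the extra memory YARN reserves per executor on top of the requested heap, so the real per-node footprint will be somewhat higher.

```python
# Cluster facts from the example: 12 workers, 30 GB memory per node.
workers = 12
node_memory_gb = 30

# Values passed to spark-submit in the example command.
num_executors = 60
executor_cores = 3
executor_memory_gb = 4

# Total parallel task slots (cores x instances), used as <files/partitions>.
partitions = num_executors * executor_cores

# Assumption: executors are distributed evenly across the workers.
executors_per_node = num_executors // workers

# Requested executor heap per node; YARN overhead is NOT included here.
heap_per_node_gb = executors_per_node * executor_memory_gb

print(partitions, executors_per_node, heap_per_node_gb)  # 180 5 20
```

With these settings, 180 partitions matches 60 executors x 3 cores, and the requested heap of roughly 20 GB per node leaves headroom under the 30 GB available.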
Retrieving benchmark results:
You can retrieve the benchmark results by running yarn logs in this way:
yarn logs -applicationId <application_id> | grep 'Benchmark'
for example:
Benchmark: Total volume : 81000000000 Bytes
Benchmark: Total write time : 74.979 s
Benchmark: Aggregate Throughput : 1.08030246E9 Bytes per second
So that's about 1 GB written per second for this run.
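To double-check the reported figure, the throughput can be recomputed from the volume and write time. The sketch below is not part of the benchmark tool: the `parse_benchmark` helper and its regular expression are hypothetical, written only against the sample output shown above, and "GB" here means 10^9 bytes.

```python
import re

# Sample output copied from the example run above.
sample = (
    "Benchmark: Total volume : 81000000000 Bytes\n"
    "Benchmark: Total write time : 74.979 s\n"
    "Benchmark: Aggregate Throughput : 1.08030246E9 Bytes per second\n"
)

def parse_benchmark(text):
    """Hypothetical helper: map each 'Benchmark:' label to its numeric value."""
    pairs = re.findall(r"Benchmark:\s*([A-Za-z ]+?)\s*:\s*([0-9.Ee+]+)", text)
    return {label: float(number) for label, number in pairs}

values = parse_benchmark(sample)

# Throughput = bytes written / seconds spent writing.
computed = values["Total volume"] / values["Total write time"]
gb_per_s = computed / 1e9  # decimal gigabytes per second

print(f"{gb_per_s:.2f} GB/s")  # 1.08 GB/s
```

The recomputed value agrees with the reported aggregate throughput to within rounding.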