To get an idea of the write performance of a Spark cluster, I've created a Spark version of the standard TestDFSIO tool, which measures the I/O performance of HDFS in your cluster. Lies, damned lies, and benchmarks: the goal of this tool is to provide a sanity check of your Spark setup, focused on HDFS write performance rather than compute performance. Think the tool can be improved? Feel free to submit a pull request or raise a GitHub issue.
Parameters:

<files/partitions>: should ideally equal the recommended spark.default.parallelism, i.e. executor cores x executor instances (see the example after this list).
<bytes_per_file>: should fit in memory, for example 90000000.
<write_repetitions>: number of times the test RDD is re-written to disk; the benchmark result is averaged over these writes.
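For example, on a hypothetical cluster running 12 executors with 4 cores each, spark.default.parallelism would be 12 x 4 = 48, so 48 is a sensible value for <files/partitions>.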
Usage:

spark-submit --class org.ward.Benchmark --master yarn --deploy-mode cluster --num-executors X --executor-cores Y --executor-memory Z target/benchmark-1.0-SNAPSHOT-jar-with-dependencies.jar <files/partitions> <bytes_per_file> <write_repetitions>
It's important to get the number of executors and cores right: you want the maximum amount of parallelism without exceeding the capacity of the cluster.

CLI example for 12 workers with 30 GB of memory per node:
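The exact figures below are illustrative (one executor per worker, 4 cores per executor, and some memory left as headroom for the OS and YARN overhead) and should be tuned to your own cluster:

spark-submit --class org.ward.Benchmark --master yarn --deploy-mode cluster --num-executors 12 --executor-cores 4 --executor-memory 25G target/benchmark-1.0-SNAPSHOT-jar-with-dependencies.jar 48 90000000 10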
This command will write out the generated RDD 10 times, and will calculate an aggregate throughput over those 10 writes.
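Assuming the throughput is computed as total bytes written divided by total write time, the example above writes 48 x 90000000 x 10 bytes ≈ 43.2 GB in total, and the reported figure is that volume divided by the combined duration of the 10 writes.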