Created on 09-13-2016 08:52 AM - edited 08-17-2019 10:07 AM
I often hear from people who want faster performance from Hadoop and Spark without knowing the basic performance characteristics of their environment. One of the first questions I ask is whether the hardware can perform at the level being expected. The software is still bound to the physics of the hardware.
If your disk IO speed is 10MB per second, neither Hadoop/Spark nor any other software will magically make that disk faster. Again, we are bound to the physical limits of the hardware we choose. What makes Hadoop and other distributed processing engines amazing is their ability to add more "cheap" nodes to the cluster to increase performance. However, we should be aware of the maximum throughput per node. Knowing it helps level-set expectations before committing to any performance-bound SLA.
Typically I like to use the sysbench tool. SysBench is a modular, multi-threaded benchmark tool for evaluating OS parameters, i.e. CPU, RAM, IO, and mutex performance. I run sysbench before installing any software outside the kernel, and again pre/post Hadoop/Spark upgrades. Upgrades should not have any impact on your OS benchmarks, but my neck is on the line when I commit to an SLA, so I'd rather play it safe.
I generally wrap the tests below in a shell script for ease of execution. For this article I call out each test individually for clarity.
I start by testing RAM performance. This test benchmarks sequential memory reads or writes; I test both.
To test read performance, I set the memory block size to the HDFS block size, --num-threads to the approximate concurrency you expect on your cluster, and the memory total size to the average size of each workload.
sysbench --test=memory --memory-block-size=128M --memory-oper=read --num-threads=4 --memory-total-size=10G run
To test write performance, I set the memory block size to the HDFS block size, --num-threads to the approximate concurrency you expect on your cluster, and the memory total size to the average size of each workload.
sysbench --test=memory --memory-block-size=128M --memory-oper=write --num-threads=4 --memory-total-size=10G run
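Since I always run both directions, the two commands above can be driven from one small loop. This is just a convenience sketch using the example values from this article; it prints the commands rather than executing them:

```shell
#!/bin/sh
# Run the sequential-read and sequential-write memory tests back to back.
# Values (128M block, 4 threads, 10G total) are the article's examples.
for oper in read write; do
  cmd="sysbench --test=memory --memory-block-size=128M --memory-oper=$oper --num-threads=4 --memory-total-size=10G run"
  echo "$cmd"   # swap echo for eval "$cmd" to actually execute each test
done
```

Swap the echo for an eval (or call sysbench directly) when you run it for real.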
Next I grab the CPU performance numbers. This test consists of calculating prime numbers up to the value specified by the --cpu-max-prime option. I set --num-threads to the approximate concurrency you expect on your cluster.
sysbench --test=cpu --cpu-max-prime=20000 --num-threads=2 run
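One convenient way to pick --num-threads (my suggestion, not from the article) is to cap it at the node's core count, which the coreutils nproc command reports:

```shell
#!/bin/sh
# Derive a thread count from the machine's core count (coreutils nproc).
# Using core count as the concurrency ceiling is my assumption, not the
# article's; tune it to the concurrency you actually expect per node.
THREADS=$(nproc)
echo "sysbench --test=cpu --cpu-max-prime=20000 --num-threads=$THREADS run"
```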
Lastly I fetch the IO performance numbers. When using fileio, you will need to create a set of test files to work on. It is recommended that their total size be larger than the available memory, so that file caching does not influence the workload too much (see https://wiki.gentoo.org/wiki/Sysbench#Using_the_fileio_workload).
Run this command to prepare test files whose total size is larger than the available memory (RAM) on the box. In this example my box has 128GB of RAM, so I set the file size to 150G. I named the file here fileio.
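As a sketch of what the full fileio cycle (prepare, run, cleanup) looks like with the same 0.4-style --test= flags used above: the 150G size matches the 128GB-RAM example, and the rndrw test mode is my example choice. The commands are echoed here so the sketch does not actually create 150GB of files:

```shell
#!/bin/sh
# Full fileio cycle: prepare builds the test files, run benchmarks them,
# cleanup deletes them. SIZE should exceed installed RAM (150G > 128GB here).
SIZE=150G

prepare_cmd="sysbench --test=fileio --file-total-size=$SIZE prepare"
run_cmd="sysbench --test=fileio --file-total-size=$SIZE --file-test-mode=rndrw run"
cleanup_cmd="sysbench --test=fileio --file-total-size=$SIZE cleanup"

# Printed rather than executed; run each command directly on a real box.
echo "$prepare_cmd"
echo "$run_cmd"
echo "$cleanup_cmd"
```

Always finish with cleanup, otherwise the test files stay on disk.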