I have written articles in the past benchmarking Hadoop cloud environments such as BigStep and AWS. What I didn't dive into in those articles is how I ran the scripts. I built scripts to rapidly launch TeraGen, TeraSort, and TeraValidate. Why? I found myself running the same script over and over again, so why not make it easier by simply executing a shell script?
To run TeraGen, TeraSort, and TeraValidate, you need to decide on the volume of data and the number of records. For example, you can generate 500GB of data with 5000000000 rows.
The script comes with the following predefined sets:
#SIZE=500G
#ROWS=5000000000
#SIZE=100G
#ROWS=1000000000
#This set will be used, as it is the only one left uncommented
SIZE=1T
ROWS=10000000000
#SIZE=10G
#ROWS=100000000
#SIZE=1G
#ROWS=10000000
Above, SIZE=1T (for terabyte) and ROWS=10000000000 are uncommented, meaning this script will generate 1TB of data with 10000000000 rows. If you want to use a different dataset size and row count, uncomment the SIZE and ROWS pair you want and comment out all the others. Only one SIZE and ROWS pair should be set (uncommented). This applies to all scripts (teragen.sh, terasort.sh, validate.sh), and all scripts must have the same SIZE and ROWS settings.
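The presets above all follow the same arithmetic: TeraGen writes fixed 100-byte rows, so the row count is the size in bytes divided by 100. Here is a minimal sketch of that calculation, assuming decimal units (1G = 1,000,000,000 bytes), which is what the listed presets imply. The function name rows_for_size is my own and is not part of the scripts.

```shell
# Hypothetical helper: derive ROWS from a SIZE string such as 500G or 1T.
# Assumes decimal units and TeraGen's fixed 100-byte row size.
rows_for_size() {
  case "$1" in
    *G) echo $(( ${1%G} * 1000000000 / 100 )) ;;
    *T) echo $(( ${1%T} * 1000000000000 / 100 )) ;;
    *)  echo "unsupported size: $1" >&2; return 1 ;;
  esac
}
```

For example, rows_for_size 1T prints 10000000000, which matches the ROWS value paired with SIZE=1T.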
Logs
A log directory is created based on where you run the script. Run output and stats are stored in the logs directory.
For example, if you run /home/sunile/teragen.sh, it will create the logs directory at /home/sunile/logs. All the logs from teragen, sort, and validate will reside here.
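One way a script can place its logs directory next to itself is sketched below. This is an illustration, not necessarily how the scripts do it: I am assuming the directory is derived from the script's own location, and the function name log_dir_for and the log file name in the comments are hypothetical.

```shell
# Hypothetical helper: print the logs directory that sits next to a script.
# $1 is the path to the script (pass "$0" from inside the script itself).
log_dir_for() {
  script_dir="$(cd "$(dirname "$1")" && pwd)"
  echo "$script_dir/logs"
}

# Possible usage inside teragen.sh (file name is an assumption):
#   LOGDIR="$(log_dir_for "$0")"
#   mkdir -p "$LOGDIR"
#   LOGFILE="$LOGDIR/teragen_$(date +%Y%m%d_%H%M%S).log"
```

With this approach, running /home/sunile/teragen.sh resolves to /home/sunile/logs regardless of the current working directory.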
Parameters
This is an important piece for tuning. To benchmark your environment, parameters should be configured. Much of this is trial and error, and I would say experience is required here, i.e. knowing how each parameter impacts a MapReduce job. Get help here.
For tuning change/add parameters here:
For ease of first-time execution, use the ones set in the script. Run it as is and grab your stats. If the stats are acceptable, then move on. What is acceptable? Take a look at the articles I published on BigStep and AWS. If the stats are not acceptable, start tuning.
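To make the idea of tuning parameters concrete, here is a rough sketch of how MapReduce properties can be passed to TeraGen as -D options. The property names mapreduce.job.maps and mapreduce.map.memory.mb are standard Hadoop settings, but the values, the variable names, and the function teragen_command are my own placeholders, not the ones set in the scripts. The path to the examples jar varies by distribution, so it is passed in as an argument.

```shell
# Placeholder values for illustration only; tune these for your cluster.
MAPS=100
MAP_MEMORY_MB=2048

# Hypothetical helper: print the TeraGen command line instead of running it.
# $1 = number of rows, $2 = HDFS output directory, $3 = path to the examples jar
teragen_command() {
  echo "hadoop jar $3 teragen -Dmapreduce.job.maps=$MAPS -Dmapreduce.map.memory.mb=$MAP_MEMORY_MB $1 $2"
}
```

Printing the command first makes it easy to review a parameter change before committing cluster time to a run.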
Run the jobs in the following order:
TeraGen (teragen.sh)
TeraSort (terasort.sh)
TeraValidate (validate.sh)
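If you want to run all three steps in one go, a small wrapper along these lines could enforce the order and stop at the first failure. This is a sketch under the assumption that the three scripts live in the same directory and are executable. The function name run_benchmark is hypothetical.

```shell
# Hypothetical wrapper: run the three scripts in order, stopping on failure.
# $1 = directory containing teragen.sh, terasort.sh, and validate.sh
run_benchmark() {
  for step in teragen.sh terasort.sh validate.sh; do
    echo "Running $step"
    "$1/$step" || { echo "$step failed; stopping" >&2; return 1; }
  done
}
```

Stopping early matters here because TeraSort needs the TeraGen output, and TeraValidate needs the TeraSort output.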
Hope these scripts help you quickly benchmark your environment. Now go build some cool stuff!