Created on 12-13-2016 03:50 AM - edited 08-17-2019 07:22 AM
I have written articles in the past benchmarking Hadoop cloud environments such as BigStep and AWS. What I didn't dive into in those articles is how I ran the benchmarks. I built scripts to rapidly launch TeraGen, TeraSort, and TeraValidate. Why? I found myself running the same commands over and over again. Why not make it easier by simply executing a shell script?
To run TeraGen, TeraSort, and TeraValidate, you first need to decide on the volume of data and the number of records. For example, you can generate 500GB of data with 5000000000 rows.
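TeraGen writes fixed 100-byte rows, so the row count follows directly from the target size. Here is a minimal sketch of that arithmetic, assuming decimal units (1 GB = 10^9 bytes), which is what the 500GB / 5000000000 pairing implies. The helper name `rows_for_gb` is hypothetical and not part of the scripts.

```shell
# TeraGen rows are 100 bytes each, so rows = bytes / 100.
# Sizes are treated as decimal (1 GB = 10^9 bytes).
# rows_for_gb is a hypothetical helper for illustration only.
rows_for_gb() {
  echo $(( $1 * 1000000000 / 100 ))
}

rows_for_gb 500    # prints 5000000000
rows_for_gb 1000   # prints 10000000000 (1 TB)
```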
The script comes with the following predefined SIZE and ROWS sets:
#This will be used as it is the only value not commented out
Above, SIZE 1T (for terabyte) and ROWS 10000000000 are left uncommented, meaning this script will generate 1TB of data with 10000000000 rows. If you want to use a different dataset size and row count, comment out all the other SIZE and ROWS lines so that only the pair you want remains active. Only one SIZE and one ROWS should be set (uncommented). This applies to all scripts (teragen.sh, terasort.sh, validate.sh). All scripts must have the same SIZE and ROWS settings.
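Since all three scripts have to agree, a quick check before a run can catch a mismatch. The sketch below assumes each script sets its values on lines beginning with `SIZE=` or `ROWS=` and that unused presets are commented out with `#`. The function names are hypothetical, so adjust the pattern if your copies of the scripts look different.

```shell
# Assumption: active values sit on lines that start with SIZE= or ROWS=,
# and unused presets are commented out with '#'.

# Print the active (uncommented) SIZE and ROWS lines of one script.
active_settings() {
  grep -E '^(SIZE|ROWS)=' "$1"
}

# Succeed only if every script has exactly one SIZE and one ROWS line
# and all scripts agree on them.
check_settings() {
  first=""
  for f in "$@"; do
    current=$(active_settings "$f" | sort)
    if [ "$(printf '%s\n' "$current" | grep -c '^SIZE=')" -ne 1 ] ||
       [ "$(printf '%s\n' "$current" | grep -c '^ROWS=')" -ne 1 ]; then
      echo "$f: expected exactly one SIZE and one ROWS" >&2
      return 1
    fi
    if [ -z "$first" ]; then
      first=$current
    elif [ "$current" != "$first" ]; then
      echo "$f: SIZE/ROWS differ from $1" >&2
      return 1
    fi
  done
}

# Usage: check_settings teragen.sh terasort.sh validate.sh
```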
A log directory is created based on where you run the script. Run output and stats are stored in the logs directory.
For example, if you run /home/sunile/teragen.sh, it will create the logs directory at /home/sunile/logs. All the logs from teragen, sort, and validate will reside here.
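One way a script could derive that location is sketched below. It assumes "where you run the script" means the directory that contains the script, as in the /home/sunile example. The helper `logs_dir_for` is hypothetical, and the actual scripts may compute the path differently.

```shell
# Hypothetical helper: derive the logs directory from a script's path,
# e.g. /home/sunile/teragen.sh -> /home/sunile/logs.
logs_dir_for() {
  echo "$(dirname "$1")/logs"
}

logs_dir_for /home/sunile/teragen.sh   # prints /home/sunile/logs

# Inside a script this might be used as:
#   LOGDIR=$(logs_dir_for "$0")
#   mkdir -p "$LOGDIR"
```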
This is an important piece for tuning. To benchmark your environment, parameters should be configured. Much of this is trial and error. I would say experience is required here, i.e. knowing how each parameter impacts a MapReduce job. Get help here.
For tuning change/add parameters here:
For ease of first-time execution, use the parameters already set in the script. Run it as is and grab your stats. If the stats are acceptable, then move on. What is acceptable? Take a look at the articles I published on BigStep and AWS. If the stats are not acceptable, start tuning.
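To give a feel for what tuning looks like on the command line, here is a rough sketch that passes MapReduce properties to TeraGen as `-D` options. The property names are standard MapReduce settings, but the jar path, output directory, and every value shown are placeholders I picked for illustration, not the ones from the scripts. The function only prints the command so you can review it before running anything.

```shell
# Assumptions: the jar path, output directory, and all values below are
# placeholders for illustration, not the settings from the original scripts.
EXAMPLES_JAR=${EXAMPLES_JAR:-/path/to/hadoop-mapreduce-examples.jar}
ROWS=10000000000
OUTPUT=/benchmarks/teragen-1T

# Standard MapReduce properties passed as -D options; edit or extend this list.
TUNING="-Dmapreduce.job.maps=100 -Dmapreduce.map.memory.mb=2048"

# Print the command rather than running it, so it can be reviewed first.
teragen_command() {
  # $TUNING is left unquoted on purpose so each -D option stays a separate word.
  echo hadoop jar "$EXAMPLES_JAR" teragen $TUNING "$ROWS" "$OUTPUT"
}

teragen_command
```

Change one parameter at a time between runs so that any difference in the stats can be attributed to it.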
Run the jobs in the following order: teragen.sh, then terasort.sh, then validate.sh.
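The order matters because TeraSort reads what TeraGen wrote, and TeraValidate checks what TeraSort produced. A small wrapper that stops at the first failure might look like the sketch below. `run_in_order` is a hypothetical helper, and the usage line assumes the three scripts sit in the current directory.

```shell
# Run each script in turn and stop at the first failure, since each
# stage depends on the output of the one before it.
run_in_order() {
  for script in "$@"; do
    echo "Running $script"
    "$script" || { echo "$script failed" >&2; return 1; }
  done
}

# Usage (script names from the text, assumed to be in the current directory):
#   run_in_order ./teragen.sh ./terasort.sh ./validate.sh
```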
Hope these scripts help you quickly benchmark your environment. Now go build some cool stuff!