This article continues my IaaS Hadoop performance testing. My previous performance test was on BigStep.

Objective

Test 1 terabyte of data using the Tera suite (TeraGen, TeraSort, and TeraValidate) on similar hardware profiles, using core baseline settings, across multiple IaaS providers and Hadoop as a Service offerings. Here we will capture EMR performance statistics using EMRFS (S3), an object store.
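As a rough illustration of what a 1 terabyte Tera suite run involves, the sketch below builds the three command lines in order. It assumes 1 TB means 10^12 bytes and relies on TeraGen writing fixed 100-byte rows; the jar name and the S3 bucket path are hypothetical placeholders, not the values used in my scripts.

```python
# Sketch: build the three Tera suite command lines for a 1 TB run.
# JAR and BASE are hypothetical placeholders, not the article's actual values.
ROW_BYTES = 100                      # TeraGen writes fixed 100-byte rows
TARGET_BYTES = 10**12                # 1 terabyte, assuming decimal units
ROWS = TARGET_BYTES // ROW_BYTES     # row count to pass to TeraGen

JAR = "hadoop-mapreduce-examples.jar"    # hypothetical jar path
BASE = "s3://example-bucket/tera"        # hypothetical EMRFS location


def tera_commands(jar=JAR, base=BASE, rows=ROWS):
    """Return the teragen, terasort and teravalidate commands as argument lists."""
    gen_dir = f"{base}/gen"
    sort_dir = f"{base}/sort"
    report_dir = f"{base}/validate"
    return [
        # Step 1: generate the input rows.
        ["hadoop", "jar", jar, "teragen", str(rows), gen_dir],
        # Step 2: sort the generated data.
        ["hadoop", "jar", jar, "terasort", gen_dir, sort_dir],
        # Step 3: check that the sorted output is globally ordered.
        ["hadoop", "jar", jar, "teravalidate", sort_dir, report_dir],
    ]


if __name__ == "__main__":
    for cmd in tera_commands():
        print(" ".join(cmd))
```

Each step reads the output directory of the previous one, which is why the same three commands can be timed independently on every platform.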

AWS EMR

The natural next step is to test the Tera suite on AWS EMR, which is Amazon's Hadoop as a Service offering. I used EMR with "EMRFS which is an implementation of HDFS which allows EMR clusters to store data on Amazon S3". Object storage with Hadoop has not traditionally performed well, so I was very interested in testing the new EMRFS. EMRFS/S3 was chosen as the storage layer for this test because much of the allure of S3 with EMR comes from EMR's ability to store and process data directly on S3. Using EMR's local storage (not S3) may increase performance.

Hardware

Instance Type | vCPU | RAM (GB) | Disk
i2.4xlarge | 16 | 122 | 4 x 800 GB SSD

1 master and 3 data nodes

Observation

I have run the same core scripts on other platforms (hundreds of times) without any modification. That is the objective of these tests: run the same job/script on similar hardware profiles with the same number of nodes. With EMR that was not possible. I had to change various script settings, the MR jar file, and timeout settings for the scripts to work on EMR. Jobs on EMR failed using 1 terabyte of data. Issue posted here on AWS forum. I set mapred.task.timeout=12000000 to get around the EMRFS connection reset issue. This issue did not occur with smaller datasets.
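To put that timeout in perspective: mapred.task.timeout is expressed in milliseconds, so 12000000 works out to 200 minutes. The sketch below does the conversion and shows one way the property could be passed as a -D generic option, which has to appear before the program's own arguments. The with_property helper and the example command are hypothetical illustrations, not taken from my scripts.

```python
# Sketch: convert the timeout to minutes and add it to a Hadoop command line.
TIMEOUT_MS = 12_000_000   # value from the article; the property is in milliseconds


def timeout_minutes(ms):
    """Convert a millisecond timeout to minutes."""
    return ms / 1000 / 60


def with_property(cmd, key, value):
    """Return a copy of cmd with -Dkey=value placed right after the program name.

    Assumes cmd has the shape ["hadoop", "jar", <jar>, <program>, ...args].
    """
    return cmd[:4] + [f"-D{key}={value}"] + cmd[4:]


if __name__ == "__main__":
    # Hypothetical jar name and paths, for illustration only.
    base_cmd = ["hadoop", "jar", "hadoop-mapreduce-examples.jar", "terasort",
                "s3://example-bucket/tera/gen", "s3://example-bucket/tera/sort"]
    print(timeout_minutes(TIMEOUT_MS), "minutes")
    print(" ".join(with_property(base_cmd, "mapred.task.timeout", TIMEOUT_MS)))
```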

TeraGen

Results: 26 Minutes, 45 Seconds

10360-teragen-screenshot.jpg

TeraSort

Results: 2 Hours, 57 Minutes, 49 seconds

10371-terasort-screenshot.jpg

TeraValidate

Results: 23 minutes, 55 Seconds

10372-teravalidate.jpg

Performance Numbers

IaaS | TeraGen | TeraSort | TeraValidate
AWS EMR (EMRFS/S3) | 26 min 45 sec | 2 hr 57 min 49 sec | 23 min 55 sec
BigStep/HDP (DAS) | 11 min 49 sec | 51 min 12 sec | 4 min 42 sec

Note - The BigStep test used local disk. The EMR test used EMRFS. These numbers show the difference in performance between local storage and EMRFS (S3). Statistics on EMR processing performance using local storage (non-S3/EMRFS) are not provided here.
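To make the gap easier to read, the reported times can be converted to seconds and divided. The short sketch below uses only the numbers from the table; the helper names are mine. With these inputs, EMRFS/S3 comes out roughly 2.3x slower for TeraGen, 3.5x for TeraSort, and 5.1x for TeraValidate.

```python
# Sketch: compare the reported run times as ratios (EMR time / BigStep time).
def seconds(hours=0, minutes=0, secs=0):
    """Convert an h/m/s duration to total seconds."""
    return hours * 3600 + minutes * 60 + secs


# (EMR on EMRFS/S3, BigStep on local disk), taken from the table above.
RESULTS = {
    "TeraGen": (seconds(0, 26, 45), seconds(0, 11, 49)),
    "TeraSort": (seconds(2, 57, 49), seconds(0, 51, 12)),
    "TeraValidate": (seconds(0, 23, 55), seconds(0, 4, 42)),
}


def slowdown(emr_secs, bigstep_secs):
    """How many times longer the EMR run took than the BigStep run."""
    return emr_secs / bigstep_secs


if __name__ == "__main__":
    for job, (emr, bigstep) in RESULTS.items():
        print(f"{job}: {emr}s vs {bigstep}s, {slowdown(emr, bigstep):.1f}x")
```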

The objective of the test was to capture performance statistics using the same jobs/scripts with the same configuration on similar hardware, and to document the results. That's it. Keep it simple. This is not a reflection of the capabilities of any specific IaaS provider.

All my scripts are located here. The EMR specific scripts are here.

Comments

Good reference.

Hi @sunile_manjee .

Thank you very much for this excellent writeup and testing.

 

I recently got the chance to run TeraGen, TeraSort and TeraValidate on an environment with 40+ worker nodes and 3 master nodes. Compared to your results, my TeraGen time seems quite bad.

 

My results:

TeraGen: 57 min 44 sec

TeraSort: 49 min 01 sec

TeraValidate: 4 min 16 sec

 

My spec:

Master node: CPU: 16 cores, RAM: 384 GB, Storage: 12 x 2 TB SATA 6 Gb 7.2K RPM

Worker node: CPU: 24 cores, RAM: 384 GB, Storage: 12 x 2 TB NL SAS 12 Gb 7.2K RPM

Network bandwidth: 20 Gbps

 

My observation

I believe the speed of the disks is very important here. Since my disks are SATA and SAS instead of SSD, I'm assuming this is the core reason why my TeraGen result is bad even though I have more nodes. Please correct me if this is not the case. But one thing I noticed in your BigStep test is that the worker nodes use local HDD, not SSD like the AWS nodes. Shouldn't this make the TeraGen result worse? Yet it was the fastest one, at 11 min 49 sec.

 

I also read your test here https://community.cloudera.com/t5/Community-Articles/More-Hadoop-nodes-faster-IO-and-processing-time.... The performance is way better using 5 nodes versus 3 nodes. I was hoping my 40+ nodes would top this, but they did not. Any advice is very much appreciated.

 

P.S. I played with the YARN configuration to try to get a better result. I set the map and reduce cores and memory to 8 cores and 64 GB. Do you think this helps? The default settings (1 core and 1 GB RAM) make my TeraSort crash.

Hi, I went through my testing again.

Unfortunately, I had missed the step where I needed to change/add the parameters on my command line. I changed my TeraGen, TeraSort and TeraValidate parameters and got better results.

 

TeraGen: 1 min 57 sec

TeraSort: 22 min 55 sec

TeraValidate: 1 min 23 sec

 

Thank you very much for your writeup again.