Created on 12-15-201611:20 PM - edited 08-17-201907:24 AM
A continuation article from my IaaS Hadoop performance testing. My previous performance test was on BigStep.
Test 1 Terabyte of data using the Tera Suite (TeraGen, TeraSort, and TeraValidate) on similar hardware profiles using core baseline settings across multiple IaaS providing and Hadoop as a Service offerings. Here we will capture EMR performance statistics using EMRFS(s3), an object storage.
The natural next step is to test the Tera suite on AWS EMR which is Amazon's Hadoop as a Service offering. I used EMR with "EMRFS which is an implementation of HDFS which allows EMR clusters to store data on Amazon S3". Object storage with Hadoop has not traditionally performed well. I was very interested in testing the new EMRFS. EMRFS/S3 was chosen as the storage device for this test due to fact much of the s3's (with EMR) allure is around EMR's ability to store and process data directly off s3. Using EMR's local storage (not s3) mayincrease performance.
1 master and 3 data nodes
I have run the same core scripts on other platforms (100s of times) without any modification. That is the objective of these test. Test the same job/script on similar hardware profiles and number of nodes. With EMR that was not the case. I had to change various script settings, MR jar file, and timeout settings for the scripts to work on EMR. Jobs on EMR failed using 1 terabyte of data. Issue posted here on AWS forum. I set mapred.task.timeout=12000000 to get around the EMRFS connection reset issue. This issue did not occur for smaller dataset.
Results: 26 Minutes, 45 Seconds
Results: 2 Hours, 57 Minutes, 49 seconds
Results: 23 minutes, 55 Seconds
AWS EMR (EMRFS/s3)
26 Minutes, 45 Seconds
2 Hours, 57 Minutes, 49 seconds
23 minutes, 55 Seconds
11 Mins 49 Secs
51 Mins 12 secs
4 mins 42 seconds
Note - Bigstep test used local disk. EMR test used EMRFS. These numbers show different in performance between local storage and EMRFS(s3). Performance statistics on EMR processing performance using local storage (non S3/EMRFS) are not provided here.
The objective of the test was to capture performance statistics using same jobs/scripts with same configuration on similar hardware and document results. That's it. Keep it simple. This is not a reflection of the capabilities of a/the specific IaaS provider.
All my scripts are located here. The EMR specific scripts are here.