
This article continues my IaaS Hadoop performance testing series. My previous performance test was on BigStep.


The objective is to test 1 terabyte of data with the Tera suite (TeraGen, TeraSort, and TeraValidate) on similar hardware profiles, using core baseline settings, across multiple IaaS providers and Hadoop as a Service offerings. Here we capture EMR performance statistics using EMRFS (S3), an object store.
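As a rough illustration of how a 1 terabyte run is sized, here is a minimal Python sketch. TeraGen takes a row count rather than a byte count, and each row is 100 bytes. The jar name and the S3 base path below are hypothetical placeholders, not the values used in this test, and the commands are only built and printed, not executed.

```python
ROW_BYTES = 100  # TeraGen writes fixed-size 100-byte rows


def teragen_rows(target_bytes):
    """Return how many 100-byte rows TeraGen needs for target_bytes of data."""
    return target_bytes // ROW_BYTES


def tera_suite_commands(jar, rows, base_dir):
    """Build the three Tera suite command lines as argument lists (not executed)."""
    gen_dir = f"{base_dir}/teragen"
    sort_dir = f"{base_dir}/terasort"
    report_dir = f"{base_dir}/teravalidate"
    return [
        ["hadoop", "jar", jar, "teragen", str(rows), gen_dir],
        ["hadoop", "jar", jar, "terasort", gen_dir, sort_dir],
        ["hadoop", "jar", jar, "teravalidate", sort_dir, report_dir],
    ]


# Hypothetical values for illustration only; adjust to the cluster at hand.
EXAMPLES_JAR = "hadoop-mapreduce-examples.jar"
BASE_DIR = "s3://example-bucket/tera"

rows = teragen_rows(10**12)  # 1 TB taken as 10^12 bytes
commands = tera_suite_commands(EXAMPLES_JAR, rows, BASE_DIR)
for cmd in commands:
    print(" ".join(cmd))
```

Each stage reads the output directory of the previous one, which is why the same three directories are threaded through the list.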


The natural next step is to test the Tera suite on AWS EMR, Amazon's Hadoop as a Service offering. I used EMR with "EMRFS which is an implementation of HDFS which allows EMR clusters to store data on Amazon S3". Object storage has not traditionally performed well with Hadoop, so I was very interested in testing the new EMRFS. I chose EMRFS/S3 as the storage layer for this test because much of EMR's allure lies in its ability to store and process data directly on S3. Using EMR's local storage (not S3) may increase performance.


Instance Type | vCPU | RAM | Disk

1 master and 3 data nodes


I have run the same core scripts on other platforms hundreds of times without any modification. That is the objective of these tests: run the same job/script on similar hardware profiles with the same number of nodes. With EMR that was not possible. I had to change various script settings, the MR jar file, and timeout settings for the scripts to work on EMR. Jobs on EMR failed using 1 terabyte of data. The issue is posted here on the AWS forum. I set mapred.task.timeout=12000000 to get around the EMRFS connection reset issue. This issue did not occur for smaller datasets.
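To put the timeout workaround in perspective, the sketch below converts the value to hours and minutes and shows one way the setting could be passed as a generic `-D` option. This is an assumption about how the flag is supplied, not a copy of the actual EMR scripts; the jar name and directory placeholders are hypothetical. The value is in milliseconds, and newer Hadoop releases name the same property `mapreduce.task.timeout`.

```python
TASK_TIMEOUT_MS = 12_000_000  # value quoted in the article, in milliseconds


def ms_to_hms(ms):
    """Convert milliseconds to an (hours, minutes, seconds) tuple."""
    seconds = ms // 1000
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return hours, minutes, secs


def with_task_timeout(cmd, timeout_ms, program_index=3):
    """Return a copy of cmd with the -D flag placed right after the program name.

    Generic options such as -D need to come before the program's own arguments.
    """
    flag = f"-Dmapred.task.timeout={timeout_ms}"
    return cmd[: program_index + 1] + [flag] + cmd[program_index + 1:]


# Hypothetical command line; jar name and directories are placeholders.
base_cmd = ["hadoop", "jar", "hadoop-mapreduce-examples.jar",
            "terasort", "<input-dir>", "<output-dir>"]

patched_cmd = with_task_timeout(base_cmd, TASK_TIMEOUT_MS)
hours, minutes, secs = ms_to_hms(TASK_TIMEOUT_MS)
print(" ".join(patched_cmd))
print(f"timeout = {hours}h {minutes}m {secs}s")
```

At 3 hours 20 minutes, the timeout is longer than the slowest stage reported in this article, so a task that stalls on a connection reset is effectively never killed by the framework during the run.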


TeraGen results: 26 minutes, 45 seconds



TeraSort results: 2 hours, 57 minutes, 49 seconds



TeraValidate results: 23 minutes, 55 seconds


Performance Numbers

Platform | TeraGen | TeraSort | TeraValidate
AWS EMR (EMRFS/S3) | 26 min 45 sec | 2 hr 57 min 49 sec | 23 min 55 sec
BigStep/HDP (DAS) | 11 min 49 sec | 51 min 12 sec | 4 min 42 sec

Note: The BigStep test used local disk; the EMR test used EMRFS. These numbers show the difference in performance between local storage and EMRFS (S3). Statistics for EMR processing using local storage (non-S3/EMRFS) are not provided here.
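The gap between the two platforms is easier to see as ratios. The following sketch does that arithmetic on the durations reported above. It assumes the three durations per platform are in TeraGen, TeraSort, TeraValidate order, matching the order in which the suite is listed in this article.

```python
def to_seconds(hours=0, minutes=0, secs=0):
    """Convert a duration to whole seconds."""
    return hours * 3600 + minutes * 60 + secs


# Assumed stage order, following the order the Tera suite is listed in the article.
STAGES = ("TeraGen", "TeraSort", "TeraValidate")

emr = dict(zip(STAGES, (
    to_seconds(minutes=26, secs=45),
    to_seconds(hours=2, minutes=57, secs=49),
    to_seconds(minutes=23, secs=55),
)))
bigstep = dict(zip(STAGES, (
    to_seconds(minutes=11, secs=49),
    to_seconds(minutes=51, secs=12),
    to_seconds(minutes=4, secs=42),
)))

# How many times longer each stage took on EMR (EMRFS/S3) than on BigStep (DAS).
slowdown = {stage: emr[stage] / bigstep[stage] for stage in STAGES}
total_slowdown = sum(emr.values()) / sum(bigstep.values())

for stage in STAGES:
    print(f"{stage}: {slowdown[stage]:.2f}x")
print(f"All stages: {total_slowdown:.2f}x")
```

Under that assumption, the ratios come out at roughly 2.3x for TeraGen, 3.5x for TeraSort, and 5.1x for TeraValidate, or about 3.4x across the whole suite. As the note above says, this compares EMRFS (S3) against local disk, not the two providers on equal storage.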

The objective of the test was to capture performance statistics using the same jobs/scripts with the same configuration on similar hardware and document the results. That's it. Keep it simple. This is not a reflection of the capabilities of any specific IaaS provider.

All my scripts are located here. The EMR specific scripts are here.


Good reference.

Last update: 08-17-2019 07:24 AM