This article continues my IaaS Hadoop performance testing series. My previous performance test was on BigStep.

Objective

Test 1 terabyte of data using the Tera suite (TeraGen, TeraSort, and TeraValidate) on similar hardware profiles, using core baseline settings, across multiple IaaS providers and Hadoop as a Service offerings. Here we capture EMR performance statistics using EMRFS (S3), an object store.
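For readers unfamiliar with the Tera suite, the sketch below shows roughly how the three jobs chain together for a 1 TB run. It is not the script used in this test. TeraGen takes a row count rather than a byte size, and each row is 100 bytes. The jar path, the bucket name, and the choice of 1 TB = 10^12 bytes are assumptions made for illustration.

```python
# Sketch: build the three Tera suite command lines for a ~1 TB run.
# Assumptions (not from the article): the jar path, the S3 bucket,
# and treating 1 TB as 10**12 bytes.

ROW_BYTES = 100            # TeraGen writes fixed-size 100-byte rows
TARGET_BYTES = 10**12      # assumed definition of "1 terabyte"

EXAMPLES_JAR = "/path/to/hadoop-mapreduce-examples.jar"  # hypothetical path
BASE_DIR = "s3://example-bucket/tera"                    # hypothetical bucket


def tera_commands(target_bytes=TARGET_BYTES, jar=EXAMPLES_JAR, base=BASE_DIR):
    """Return the teragen, terasort and teravalidate commands as argument lists."""
    rows = target_bytes // ROW_BYTES
    gen_dir = f"{base}/gen"
    sort_dir = f"{base}/sort"
    validate_dir = f"{base}/validate"
    prefix = ["hadoop", "jar", jar]
    return [
        prefix + ["teragen", str(rows), gen_dir],           # write the input data
        prefix + ["terasort", gen_dir, sort_dir],           # sort it
        prefix + ["teravalidate", sort_dir, validate_dir],  # check the sort order
    ]


if __name__ == "__main__":
    for cmd in tera_commands():
        print(" ".join(cmd))
```

Each job reads the output directory of the previous one, so the storage layer (here EMRFS/S3) is exercised for both reads and writes.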

AWS EMR

The natural next step is to test the Tera suite on AWS EMR, which is Amazon's Hadoop as a Service offering. I used EMR with "EMRFS which is an implementation of HDFS which allows EMR clusters to store data on Amazon S3". Object storage with Hadoop has not traditionally performed well, so I was very interested in testing the new EMRFS. EMRFS/S3 was chosen as the storage layer for this test because much of EMR's allure lies in its ability to store and process data directly on S3. Using EMR's local storage (not S3) may increase performance.

Hardware

Instance Type    vCPU    RAM (GiB)    Disk
i2.4xlarge       16      122          4 x 800 GB SSD

1 master and 3 data nodes

Observation

I have run the same core scripts on other platforms hundreds of times without any modification. That is the objective of these tests: run the same job/script on similar hardware profiles with the same number of nodes. With EMR that was not possible. I had to change various script settings, the MR jar file, and timeout settings for the scripts to work on EMR. Jobs on EMR failed using 1 terabyte of data. Issue posted here on AWS forum. I set mapred.task.timeout=12000000 (the value is in milliseconds, so 200 minutes) to get around the EMRFS connection reset issue. This issue did not occur with smaller datasets.
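To put the timeout value in perspective, here is a small worked conversion. mapred.task.timeout is expressed in milliseconds. The 600,000 ms figure is the stock Apache Hadoop default and is included only for comparison; EMR may ship a different default, so treat it as an assumption.

```python
from datetime import timedelta

# Value used in this test (from the article), in milliseconds.
TASK_TIMEOUT_MS = 12_000_000
# Stock Apache Hadoop default; an assumption here, EMR may differ.
DEFAULT_TIMEOUT_MS = 600_000

timeout = timedelta(milliseconds=TASK_TIMEOUT_MS)
default_timeout = timedelta(milliseconds=DEFAULT_TIMEOUT_MS)

# TeraSort wall-clock time reported in the results below.
terasort_runtime = timedelta(hours=2, minutes=57, seconds=49)

print("configured timeout:", timeout)            # 3:20:00
print("stock default:     ", default_timeout)    # 0:10:00
print("longer than the whole TeraSort run:", timeout > terasort_runtime)  # True
```

In other words, the timeout was raised high enough that an unresponsive task would not be killed at any point during the longest job in the suite.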

TeraGen

Results: 26 Minutes, 45 Seconds

[Screenshot: TeraGen results (10360-teragen-screenshot.jpg)]

TeraSort

Results: 2 Hours, 57 Minutes, 49 Seconds

[Screenshot: TeraSort results (10371-terasort-screenshot.jpg)]

TeraValidate

Results: 23 Minutes, 55 Seconds

[Screenshot: TeraValidate results (10372-teravalidate.jpg)]

Performance Numbers

IaaS                  TeraGen          TeraSort              TeraValidate
AWS EMR (EMRFS/S3)    26 min 45 sec    2 hr 57 min 49 sec    23 min 55 sec
BigStep/HDP (DAS)     11 min 49 sec    51 min 12 sec         4 min 42 sec

Note - The BigStep test used local disk. The EMR test used EMRFS. These numbers show the difference in performance between local storage and EMRFS (S3). Performance statistics for EMR using local storage (non-S3/EMRFS) are not provided here.
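The reported timings can be reduced to a slowdown ratio and an approximate throughput per job. The sketch below is plain arithmetic on the numbers in the table; the only assumption is that 1 terabyte means 10^12 bytes, which affects the MB/s figures but not the ratios.

```python
# Sketch: compare the reported wall-clock times for EMR (EMRFS/S3) and BigStep (DAS).
# Timings come from the table above; DATA_BYTES is an assumption (1 TB = 10**12 bytes).

DATA_BYTES = 10**12


def to_seconds(hours=0, minutes=0, seconds=0):
    """Convert an h/m/s duration to seconds."""
    return hours * 3600 + minutes * 60 + seconds


# job -> (EMR seconds, BigStep seconds)
RESULTS = {
    "TeraGen": (to_seconds(minutes=26, seconds=45), to_seconds(minutes=11, seconds=49)),
    "TeraSort": (to_seconds(hours=2, minutes=57, seconds=49), to_seconds(minutes=51, seconds=12)),
    "TeraValidate": (to_seconds(minutes=23, seconds=55), to_seconds(minutes=4, seconds=42)),
}


def summarize(results=RESULTS, data_bytes=DATA_BYTES):
    """Return per-job throughput (MB/s) and the EMR-to-BigStep time ratio."""
    summary = {}
    for job, (emr_s, bigstep_s) in results.items():
        summary[job] = {
            "emr_mb_s": data_bytes / emr_s / 1e6,
            "bigstep_mb_s": data_bytes / bigstep_s / 1e6,
            "slowdown": emr_s / bigstep_s,
        }
    return summary


if __name__ == "__main__":
    for job, row in summarize().items():
        print(f"{job:13s} EMR {row['emr_mb_s']:7.0f} MB/s   "
              f"BigStep {row['bigstep_mb_s']:7.0f} MB/s   "
              f"EMR took {row['slowdown']:.2f}x as long")
```

On these numbers the EMRFS run took roughly 2.3x as long for TeraGen, 3.5x for TeraSort, and 5.1x for TeraValidate.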

The objective of the test was to capture performance statistics using the same jobs/scripts with the same configuration on similar hardware, and to document the results. That's it. Keep it simple. This is not a reflection of the capabilities of any specific IaaS provider.

All my scripts are located here. The EMR specific scripts are here.

Comments
Good reference.

Last update: 08-17-2019 07:24 AM (revision 2 of 2)