SYNOPSIS

Which is faster when analyzing data using Spark 1.6.1:

HDP with HDFS for storage, or EMR?

  • Testing shows that HDP using HDFS has performance gains over using EMR.
  • HDP/HDFS outperforms EMR by 46% when tested against one full day (37 GB) of clickstream (web) data: the EMR run took about 46% longer than the HDP/HDFS run.
                HDP/HDFS          EMR
Time Elapsed    3 mins, 29 sec    5 mins, 5 sec

* See the TESTING VALIDATION section at the end of this article for screenshots of the test data, output, and Resource Manager logs.

HDP

  • Hortonworks Data Platform (HDP) is the industry's only true secure, enterprise-ready open source Apache Hadoop distribution.
  • The Hadoop Distributed File System (HDFS) is a Java-based distributed block storage file system that is used to store all data in HDP.

EMR

  • Amazon Elastic MapReduce (Amazon EMR) is a managed Hadoop framework to distribute and process vast amounts of data across dynamically scalable Amazon EC2 instances.
  • S3 is an inexpensive object store that can theoretically scale out infinitely without the limitations inherent to a hierarchical block storage file system.
  • Objects are not stored in file systems; instead, users create objects and associate keys with them.
  • Object storage also has the option of tagging metadata with your data.

TEST

  • A Spark (PySpark) job uses DataFrames to count page views by operating system (desktop and mobile OS types) across a full day (24 hours) of clickstream data, then lists the 20 most-used operating systems.
  • The same Spark code was run against an AWS HDP cluster on EC2 instances with the data stored in HDFS, and against an AWS EMR cluster with the data stored in S3.
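
The job described above is a group-by-and-count followed by a descending sort. The original code is not included in this article, so the following is only a minimal sketch of what such a job could look like with the Spark 1.6 DataFrame API. The column name "os", the application name, and the example paths are hypothetical; the real field name appears only in the sample record screenshot.

```python
import sys


def top_os_page_views(page_views, os_column="os", n=20):
    """Return the n operating systems with the most page views.

    page_views is a Spark DataFrame with one row per page view.
    os_column is a hypothetical field name, not taken from the article.
    """
    return (page_views
            .groupBy(os_column)                 # one group per operating system
            .count()                            # adds a "count" column
            .orderBy("count", ascending=False)  # most page views first
            .limit(n))


def main(input_path):
    # Imported here so the helper above can be loaded without a Spark install.
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="PageViewsByOS")
    sql_context = SQLContext(sc)
    try:
        # Each input line is expected to hold one JSON page view record.
        page_views = sql_context.read.json(input_path)
        top_os_page_views(page_views).show(20, truncate=False)
    finally:
        sc.stop()


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical inputs: hdfs:///data/clickstream/ on HDP,
    # s3://example-bucket/clickstream/ on EMR.
    main(sys.argv[1])
```

Because only the input path changes between the two clusters, a script shaped like this could be submitted unchanged with spark-submit on both HDP and EMR, which is what makes the elapsed times comparable.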

Test Data

  • COMPANY X is a global online marketplace connecting consumers with merchants.
  • Total data size is 37 GB: one full day of page view data (24 hours of clickstream logs).
  • 22.3 million page view records from 13 countries in North America, Latin America, and Asia.
  • Data is in JSON format and uncompressed.
  • 143 files totaling 37 GB. Each file averages 256 MB.
  • All 143 source JSON files were placed into HDFS on HDP and into S3 on EMR.
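
As a rough consistency check on the figures above, the average file size and average record size can be derived from the stated totals. This sketch assumes the 37 GB figure is in binary units (1 GB = 1024 MB), which is how HDFS disk usage reports are typically expressed; that interpretation is an assumption, not something the article states.

```python
# Figures as stated in the Test Data list.
total_gb = 37.0
file_count = 143
record_count = 22.3e6

# Assumption: binary units (1 GB = 1024 MB = 1024**3 bytes).
avg_file_mb = total_gb * 1024 / file_count
avg_record_bytes = total_gb * 1024 ** 3 / record_count

print("Average file size:   %.0f MB" % avg_file_mb)
print("Average record size: %.0f bytes" % avg_record_bytes)
```

The derived average file size lands close to the stated 256 MB, and each JSON page view record works out to well under 2 KB.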

Platform Versions

  • HDP 2.3.0 - Hadoop version 2.7.1
  • EMR 4.5.0 - Hadoop version 2.7.2

AWS HDP and EMR Clusters were sized/configured similarly

  1. m4.2xlarge Instances
  2. 1 master and 4 worker nodes

TEST RESULTS

  • Spark 1.6.1 on HDP/HDFS outperformed Spark 1.6.1 on EMR by 46% (the EMR run took about 46% longer)
  • Total elapsed time for HDP/HDFS: 3 minutes 29 seconds
  • Total elapsed time for EMR: 5 minutes 5 seconds
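
The 46% figure follows from the two elapsed times. The short calculation below shows how it can be derived, along with two related views of the same numbers: the share of time HDP/HDFS saved relative to EMR, and the approximate records processed per second. The throughput figures are derived here from the stated 22.3 million records and are not reported in the original test.

```python
def to_seconds(minutes, seconds):
    """Convert a minutes/seconds duration to total seconds."""
    return minutes * 60 + seconds


hdp_seconds = to_seconds(3, 29)  # 209 seconds
emr_seconds = to_seconds(5, 5)   # 305 seconds

# How much longer EMR took, relative to the HDP/HDFS run.
emr_extra_time_pct = 100.0 * (emr_seconds - hdp_seconds) / hdp_seconds

# How much time HDP/HDFS saved, relative to the EMR run.
hdp_time_saved_pct = 100.0 * (emr_seconds - hdp_seconds) / emr_seconds

# Approximate throughput, using the stated 22.3 million page view records.
record_count = 22.3e6
hdp_records_per_sec = record_count / hdp_seconds
emr_records_per_sec = record_count / emr_seconds

print("EMR took %.1f%% longer than HDP/HDFS" % emr_extra_time_pct)    # 45.9%
print("HDP/HDFS used %.1f%% less time than EMR" % hdp_time_saved_pct)  # 31.5%
print("HDP/HDFS: %.0f records/sec" % hdp_records_per_sec)
print("EMR:      %.0f records/sec" % emr_records_per_sec)
```

Note that the two percentages describe the same gap from different baselines: 46% is measured against the HDP/HDFS time, while the time saved measured against the EMR time is closer to 31%.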

TESTING VALIDATION

Sample JSON record

Screen Shot 2016-04-04 at 9.24.25 PM.png

Total disk space consumed in HDFS by all files is 37 GB

Screen Shot 2016-04-03 at 12.05.29 PM.png

Source data consists of 143 JSON files. Each file averages 256 MB for a total data volume of 37 GB

Screen Shot 2016-04-03 at 12.07.28 PM.png

Output produced. Operating system and total page view count:

Screen Shot 2016-04-04 at 7.22.15 PM.png

HDP Resource Manager log

Screen Shot 2016-04-04 at 7.23.28 PM.png

EMR Resource Manager log

Screen Shot 2016-04-04 at 8.51.24 PM.png

Last update: 08-25-2016 02:08 PM