SYNOPSIS

Which is faster when analyzing data with Spark 1.6.1: HDP with HDFS for storage, or EMR with S3 for storage?

  • Testing shows that HDP using HDFS has performance gains over using EMR.
  • In a test against one full day (37 GB) of clickstream (web) data, the EMR run took about 46% longer than the HDP/HDFS run (305 seconds versus 209 seconds).

                HDP/HDFS          EMR
Time Elapsed    3 mins, 29 sec    5 mins, 5 sec

* See the TESTING VALIDATION section at the end of the article for screenshots, including the Resource Manager logs.

HDP

  • Hortonworks Data Platform (HDP) is the industry's only true secure, enterprise-ready open source Apache Hadoop distribution.
  • The Hadoop Distributed File System (HDFS) is a Java-based distributed block storage file system that is used to store all data in HDP.

EMR

  • Amazon Elastic MapReduce (Amazon EMR) is a managed Hadoop framework to distribute and process vast amounts of data across dynamically scalable Amazon EC2 instances.
  • In this test, the EMR cluster read its data from Amazon S3, an inexpensive object store that can theoretically scale out infinitely without the limitations inherent to a hierarchical block storage file system.
  • Objects are not stored in a file system; instead, users create objects and associate keys with them.
  • Object storage also lets you attach metadata to each object.

TEST

  • The test uses Spark (PySpark) DataFrames to count page views by operating system (desktop and mobile OS types) across a full day (24 hours) of clickstream data, then lists the 20 most used operating systems.
  • The same Spark code was run against an AWS HDP cluster on EC2 instances with the data stored in HDFS, and against an AWS EMR cluster with the data stored in S3.
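
The original test script is not included in this article, so the following is only a minimal sketch of what such a job could look like in PySpark 1.6. The column name `os`, the application name, and the way the input path is passed in are all assumptions made for illustration; the real clickstream schema may differ.

```python
import sys


def top_operating_systems(page_views, os_column="os", limit=20):
    """Count page views per operating system and keep the most used ones.

    `page_views` is a Spark DataFrame with one row per page view.
    `os_column` is a hypothetical field name, not taken from the real data.
    """
    return (page_views
            .groupBy(os_column)                 # one group per operating system
            .count()                            # adds a "count" column
            .orderBy("count", ascending=False)  # most used first
            .limit(limit))


def main(input_path):
    # Imported here so the helper above can be exercised without Spark installed.
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="PageViewsByOS")  # hypothetical application name
    sql_context = SQLContext(sc)                # DataFrame entry point in Spark 1.6
    page_views = sql_context.read.json(input_path)  # schema inferred from the JSON
    top_operating_systems(page_views).show(20)
    sc.stop()


if __name__ == "__main__" and len(sys.argv) == 2:
    # The single argument is the input location: an HDFS directory on HDP,
    # or an S3 location on EMR.
    main(sys.argv[1])
```

Under these assumptions, the same file could be submitted unchanged with spark-submit on both clusters, with only the input location differing between the HDFS run and the S3 run.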

Test Data

  • COMPANY X is a global online marketplace connecting consumers with merchants.
  • The data set is one full day of page view data (24 hours of clickstream logs).
  • 22.3 million page view records from 13 countries in North America, Latin America, and Asia.
  • Data is in JSON format and uncompressed.
  • 143 files totaling 37 GB. Each file averages 256 MB.
  • All 143 source JSON files were placed into HDFS on HDP and into S3 on EMR.
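
As a rough sanity check on the figures above, the averages can be derived from the three stated totals. This sketch assumes decimal units (1 GB = 10^9 bytes); if the 37 GB figure is actually in binary units, the results shift by a few percent.

```python
# Figures stated in the article.
TOTAL_BYTES = 37 * 10**9      # 37 GB, assuming decimal gigabytes
FILE_COUNT = 143
RECORD_COUNT = 22300000       # 22.3 million page views

avg_file_mb = TOTAL_BYTES / float(FILE_COUNT) / 10**6
avg_record_bytes = TOTAL_BYTES / float(RECORD_COUNT)
avg_records_per_file = RECORD_COUNT / float(FILE_COUNT)

print("Average file size: %.1f MB" % avg_file_mb)
print("Average record size: %.0f bytes" % avg_record_bytes)
print("Average records per file: %.0f" % avg_records_per_file)
```

The average file size comes out close to the stated 256 MB, and each JSON record averages a little over 1.6 KB.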

Platform Versions

  • HDP 2.3.0 - Hadoop version 2.7.1
  • EMR 4.5.0 - Hadoop version 2.7.2

AWS HDP and EMR Clusters were sized/configured similarly

  1. m4.2xlarge Instances
  2. 1 master and 4 worker nodes

TEST RESULTS

  • Spark 1.6.1 on HDP/HDFS outperformed Spark 1.6.1 on EMR by 46%: the EMR run took about 46% longer to complete.
  • Total elapsed time for HDP/HDFS: 3 minutes 29 seconds
  • Total elapsed time for EMR: 5 minutes 5 seconds
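
The 46% figure can be reproduced from the two elapsed times. The short calculation below also expresses the same gap as time saved and as approximate scan throughput; the throughput numbers assume the full 37 GB (taken as decimal gigabytes) was read in each run.

```python
# Elapsed times reported above, converted to seconds.
hdp_seconds = 3 * 60 + 29   # 209
emr_seconds = 5 * 60 + 5    # 305

# How much longer the EMR run took relative to the HDP/HDFS run.
emr_extra_pct = (emr_seconds - hdp_seconds) / float(hdp_seconds) * 100

# The same gap expressed as time saved by HDP/HDFS relative to EMR.
hdp_saved_pct = (emr_seconds - hdp_seconds) / float(emr_seconds) * 100

# Approximate throughput, assuming all 37 GB (decimal) was scanned.
DATA_MB = 37 * 1000
hdp_mb_per_sec = DATA_MB / float(hdp_seconds)
emr_mb_per_sec = DATA_MB / float(emr_seconds)

print("EMR took %.1f%% longer than HDP/HDFS" % emr_extra_pct)
print("HDP/HDFS finished in %.1f%% less time" % hdp_saved_pct)
print("Throughput: HDP/HDFS %.0f MB/s, EMR %.0f MB/s" % (hdp_mb_per_sec, emr_mb_per_sec))
```

Note that the two percentages describe the same result from different baselines: the EMR run is about 46% longer, which is equivalent to the HDP/HDFS run being about 31% shorter.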

TESTING VALIDATION

Sample JSON record

Screen Shot 2016-04-04 at 9.24.25 PM.png

Total disk usage in HDFS consumed by all files is 37 GB

Screen Shot 2016-04-03 at 12.05.29 PM.png

Source data consists of 143 JSON files. Each file averages 256 MB for a total data volume of 37 GB

Screen Shot 2016-04-03 at 12.07.28 PM.png

Output produced: operating system and total page view count

Screen Shot 2016-04-04 at 7.22.15 PM.png

HDP Resource Manager log

Screen Shot 2016-04-04 at 7.23.28 PM.png

EMR Resource Manager log

Screen Shot 2016-04-04 at 8.51.24 PM.png
