
bmathew · 08-25-2016

SYNOPSIS

Which is faster when analyzing data using Spark 1.6.1:

HDP with HDFS for storage, or EMR?

Testing shows that HDP with HDFS outperforms EMR.
Against 1 full day of Clickstream (web) data totaling 37 GB, the job finished in 3 min 29 sec on HDP/HDFS versus 5 min 5 sec on EMR, so EMR took about 46% longer.

	HDP/HDFS	EMR
Time Elapsed	3 mins, 29 sec	5 mins, 5 sec
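
As a quick check, the 46% figure can be reproduced from the elapsed times in the table above. The sketch below is illustrative only; the helper name `to_seconds` is not from the original test code. It also shows the same gap expressed as time saved by HDP/HDFS, since the two percentages are easy to confuse.

```python
# Reproduce the headline percentage from the two elapsed times.
def to_seconds(minutes, seconds):
    """Convert a minutes/seconds pair into total seconds."""
    return minutes * 60 + seconds

hdp_seconds = to_seconds(3, 29)  # HDP/HDFS: 3 mins, 29 sec
emr_seconds = to_seconds(5, 5)   # EMR: 5 mins, 5 sec

# Extra time EMR needed, relative to the HDP/HDFS run.
emr_extra_pct = (emr_seconds - hdp_seconds) / hdp_seconds * 100

# Time HDP/HDFS saved, relative to the EMR run.
hdp_saved_pct = (emr_seconds - hdp_seconds) / emr_seconds * 100

print(f"EMR took {emr_extra_pct:.1f}% longer than HDP/HDFS")
print(f"HDP/HDFS finished in {hdp_saved_pct:.1f}% less time than EMR")
```

The 46% in the synopsis matches the first calculation (EMR's extra time relative to HDP/HDFS).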

* See the end of the article for validation and screenshots of the Resource Manager logs

HDP

Hortonworks Data Platform (HDP) is the industry's only truly secure, enterprise-ready open source Apache Hadoop distribution.
The Hadoop Distributed File System (HDFS) is a Java-based distributed block storage file system that is used to store all data in HDP.

EMR

Amazon Elastic MapReduce (Amazon EMR) is a managed Hadoop framework that distributes and processes vast amounts of data across dynamically scalable Amazon EC2 instances.
In this test, the EMR cluster reads its data from Amazon S3, an inexpensive object store that can theoretically scale out infinitely without the limitations inherent to a hierarchical block storage file system.
Objects are not stored in file systems; instead, users create objects and associate keys with them.
Object storage also lets you attach metadata to your data.

The test uses Spark (PySpark) DataFrames to count page views by operating system (desktop and mobile OS types) across a full day (24 hours) of Clickstream data and to list the top 20 most used operating systems.
The same Spark code was run against an AWS HDP cluster on EC2 instances with data stored in HDFS and against an AWS EMR cluster.
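
The PySpark code itself is not shown in this section. As a rough illustration of the aggregation being timed, here is a plain-Python stand-in for the DataFrame group-by, count, and sort steps. It is not the Spark job. The field name `os`, the one-JSON-object-per-line layout, and the function name `top_operating_systems` are all assumptions made for this sketch.

```python
import json
from collections import Counter

def top_operating_systems(json_lines, os_field="os", limit=20):
    """Count page views per operating system and return the most common ones.

    json_lines: iterable of strings, each holding one JSON page view record
                (assumed layout; the real Clickstream schema is not shown).
    os_field:   hypothetical name of the field holding the OS.
    """
    counts = Counter()
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        counts[record.get(os_field, "unknown")] += 1
    # Sorted by count, descending, limited to the top entries.
    return counts.most_common(limit)

# Tiny made-up sample, only to show the shape of the result.
sample = [
    '{"os": "Windows", "page": "/home"}',
    '{"os": "iOS", "page": "/deals"}',
    '{"os": "Windows", "page": "/cart"}',
]
top = top_operating_systems(sample)
print(top)
```

In Spark the same logic is distributed across the cluster, so the elapsed time depends heavily on how quickly each platform can read the 143 JSON files from its storage layer.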

Test Data

COMPANY X is a global online marketplace connecting consumers with merchants.
Total data size is 37 GB: 1 full day of page view data (24 hours of Clickstream logs).
22.3 million page view records from 13 countries in North America, Latin America, and Asia.
Data is in JSON format and uncompressed.
143 files totaling 37 GB, averaging about 256 MB each.
All 143 source JSON files were placed into HDFS on HDP and into S3 on EMR.
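
The figures above can be cross-checked with some simple arithmetic. This sketch assumes decimal units (1 GB = 1,000 MB), which the article does not specify, and the variable names are illustrative.

```python
# Cross-check the test data figures (assumes decimal units: 1 GB = 1000 MB).
total_bytes = 37 * 1000**3      # 37 GB total
file_count = 143                # source JSON files
record_count = 22.3e6           # page view records

mean_file_mb = total_bytes / file_count / 1000**2
mean_record_bytes = total_bytes / record_count

print(f"Mean file size:   {mean_file_mb:.0f} MB")
print(f"Mean record size: {mean_record_bytes:.0f} bytes")
```

The mean file size comes out close to the stated 256 MB, and each uncompressed JSON page view record averages roughly 1.7 KB.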

Platform Versions