Which is faster when analyzing data using Spark 1.6.1:
HDP with HDFS for storage, or EMR?
Testing shows that HDP using HDFS has performance gains over using EMR.
HDP/HDFS outperforms EMR by 46% when tested against 1 full day of 37 GB Clickstream (Web) data.
HDP/HDFS
EMR
Time Elapsed
3 mins, 29 sec
5 mins, 5 sec
* See below at end of article for validation and screen prints showing Resource Manager logs
HDP
Hortonworks Data Platform (HDP) is the industry's only true secure, enterprise-ready open source Apache Hadoop distribution.
The Hadoop Distributed File System (HDFS) is a Java-based distributed block storage file system that is used to store all data in HDP.
EMR
Amazon Elastic MapReduce (Amazon EMR) is a managed Hadoop framework to distribute and process vast amounts of data across dynamically scalable Amazon EC2 instances.
S3 is an inexpensive object store that can theoretically scale out infinitely without the limitations inherent to a hierarchical block storage file system.
Objects are not stored in file systems; instead, users create objects and associate keys with them.
Object storage also has the option of tagging metadata with your data.
TEST
Spark (PySpark) using DataFrames to get a Count of Page Views by Operating System (Desktop and Mobile OS types) against a full day of Clickstream data (24 hours) and listing the top 20 most used operating systems.
Ran the same Spark code against an AWS HDP cluster on EC2 instances with data stored in HDFS and against an AWS EMR cluster.
Test Data
COMPANY X is a global online marketplace connecting consumers with merchants.
Total data size is 37 GB. 1 Full day of page views data (24 hours of Clickstream logs).
22.3 Million page view records from 13 countries in North America, Latin America, and Asia.
Data is in JSON format and uncompressed.
143 files totaling 37 GB. Each file averages 256 MB.
All 143 source JSON files were placed into HDFS on HDP and into S3 on EMR.
Platform Versions
HDP 2.3.0 - Hadoop version 2.7.1
EMR 4.5.0 - Hadoop version 2.7.2
AWS HDP and EMR Clusters were sized/configured similarly
m4.2xlarge Instances
1 master and 4 worker nodes
TEST RESULTS
Spark 1.6.1 on HDP/HDFS outperformed Spark 1.6.1 on EMR 46%
Total elapsed time for HDP/HDFS: 3 minutes 29 seconds
Total elapsed time for EMR: 5 minutes 5 seconds
TESTING VALIDATION
Sample JSON record
Total disk usage in HDFS consumed by all files is 37 G
Source data consists of 143 JSON files. Each file averages 256 MB for a total data volume of 37 GB
Output produced. Operating system and total page view count: