Created on 08-25-2016 02:08 PM
Which is faster when analyzing data using Spark 1.6.1:
HDP with HDFS for storage, or EMR with S3 for storage?
- Testing shows that HDP using HDFS has performance gains over using EMR.
- HDP/HDFS outperforms EMR by 46% when tested against 1 full day of 37 GB Clickstream (Web) data: EMR took about 46% longer to complete the same job.
Metric | HDP/HDFS | EMR |
Time Elapsed | 3 mins, 29 sec | 5 mins, 5 sec |
* See the end of the article for validation and screenshots of the Resource Manager logs.
- Hortonworks Data Platform (HDP) is the industry's only true secure, enterprise-ready open source Apache Hadoop distribution.
- The Hadoop Distributed File System (HDFS) is a Java-based distributed block storage file system that is used to store all data in HDP.
- Amazon Elastic MapReduce (Amazon EMR) is a managed Hadoop framework to distribute and process vast amounts of data across dynamically scalable Amazon EC2 instances.
- S3 is an inexpensive object store that can theoretically scale out infinitely without the limitations inherent to a hierarchical block storage file system.
- Objects are not stored in file systems; instead, users create objects and associate keys with them.
- Object storage also has the option of tagging metadata with your data.
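The key-and-metadata model described above can be illustrated with a toy in-memory sketch. This is not the S3 API: the class `ToyObjectStore`, its methods, and the example keys are all hypothetical, chosen only to show that keys live in one flat namespace and that "folders" are a naming convention on the key.

```python
# Toy in-memory model of an object store; illustrative only, not the S3 API.
class ToyObjectStore:
    def __init__(self):
        # One flat namespace: key -> (data bytes, metadata dict).
        self._objects = {}

    def put(self, key, data, metadata=None):
        # Store the object under its key, with optional user metadata.
        self._objects[key] = (data, dict(metadata or {}))

    def get(self, key):
        # Return the (data, metadata) pair stored under the key.
        return self._objects[key]

    def list_keys(self, prefix=""):
        # There are no directories: listing just filters keys by prefix.
        return sorted(k for k in self._objects if k.startswith(prefix))


store = ToyObjectStore()
# Hypothetical keys; the "/" characters are part of the key, not a path.
store.put("clickstream/day1/part-000.json", b'{"example": 1}',
          metadata={"source": "web"})
store.put("clickstream/day1/part-001.json", b'{"example": 2}')

data, metadata = store.get("clickstream/day1/part-000.json")
print(metadata)
print(store.list_keys(prefix="clickstream/day1/"))
```

In a block file system such as HDFS, by contrast, the directory tree is real metadata maintained by the NameNode rather than a prefix convention.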
- The test used Spark (PySpark) DataFrames to count page views by operating system (desktop and mobile OS types) across a full day (24 hours) of Clickstream data, listing the 20 most-used operating systems.
- The same Spark code was run against an AWS HDP cluster on EC2 instances with data stored in HDFS, and against an AWS EMR cluster with data stored in S3.
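The job described above might look roughly like the following PySpark sketch, written against the Spark 1.6 `SQLContext` API. It is not the code used in the test: the input path and the name of the operating-system column are assumptions passed on the command line, because the article does not show the JSON schema.

```python
import sys


def top_os_page_views(df, os_column, limit=20):
    """Count rows per operating system and keep the `limit` largest groups."""
    return (
        df.groupBy(os_column)
        .count()                              # adds a column named "count"
        .orderBy("count", ascending=False)    # most page views first
        .limit(limit)
    )


def main(input_path, os_column):
    # Imported here so the helper above can be read without a Spark install.
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="PageViewsByOS")
    sqlContext = SQLContext(sc)
    try:
        # input_path is a placeholder: an HDFS or S3 URI of the JSON files.
        page_views = sqlContext.read.json(input_path)
        top_os_page_views(page_views, os_column).show(20, False)
    finally:
        sc.stop()


# Runs only when both assumed arguments are supplied, e.g. via spark-submit.
if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv[1], sys.argv[2])
```

Because only the input URI changes between the two clusters, the same script can be submitted unchanged to both, which is what makes the elapsed times comparable.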
Test Data
- COMPANY X is a global online marketplace connecting consumers with merchants.
- Total data size is 37 GB: one full day of page-view data (24 hours of Clickstream logs).
- 22.3 million page-view records from 13 countries in North America, Latin America, and Asia.
- Data is in JSON format and uncompressed.
- 143 files totaling 37 GB. Each file averages 256 MB.
- All 143 source JSON files were placed into HDFS on HDP and into S3 on EMR.
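As a quick sanity check on the figures above, the arithmetic can be sketched as follows. It assumes decimal units (1 GB = 1000 MB = 10^9 bytes), which the article does not state, so the results are approximate.

```python
# Figures taken from the test data description above.
FILE_COUNT = 143
AVG_FILE_MB = 256
RECORD_COUNT = 22.3e6
REPORTED_TOTAL_GB = 37

# Assumption: decimal units (1 GB = 1000 MB = 1e9 bytes).
estimated_total_gb = FILE_COUNT * AVG_FILE_MB / 1000.0
avg_record_bytes = REPORTED_TOTAL_GB * 1e9 / RECORD_COUNT

print("Estimated total: %.1f GB" % estimated_total_gb)
print("Average record size: about %.0f bytes" % avg_record_bytes)
```

The file count times the average file size lands close to the reported 37 GB, and each JSON record works out to a little over 1.6 KB on average.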
Platform Versions
- HDP 2.3.0 - Hadoop version 2.7.1
- EMR 4.5.0 - Hadoop version 2.7.2
AWS HDP and EMR Clusters were sized/configured similarly
- m4.2xlarge Instances
- 1 master and 4 worker nodes
- Spark 1.6.1 on HDP/HDFS outperformed Spark 1.6.1 on EMR by 46%
- Total elapsed time for HDP/HDFS: 3 minutes 29 seconds
- Total elapsed time for EMR: 5 minutes 5 seconds
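The 46% figure follows from the two elapsed times, as this short sketch shows. The throughput lines additionally assume the 37 GB total in decimal units (1 GB = 1000 MB), which is an assumption rather than something the article specifies.

```python
# Elapsed times from the results above, in seconds.
HDP_SECONDS = 3 * 60 + 29   # 209
EMR_SECONDS = 5 * 60 + 5    # 305

# EMR's extra run time relative to HDP/HDFS (the basis of the 46% figure).
emr_extra_time_pct = 100.0 * (EMR_SECONDS - HDP_SECONDS) / HDP_SECONDS
# The same gap expressed as time saved by HDP/HDFS relative to EMR.
hdp_time_saved_pct = 100.0 * (EMR_SECONDS - HDP_SECONDS) / EMR_SECONDS

# Assumption: 37 GB treated as 37,000 MB.
TOTAL_MB = 37 * 1000.0
hdp_mb_per_s = TOTAL_MB / HDP_SECONDS
emr_mb_per_s = TOTAL_MB / EMR_SECONDS

print("EMR took %.0f%% longer than HDP/HDFS" % emr_extra_time_pct)
print("HDP/HDFS finished in %.0f%% less time" % hdp_time_saved_pct)
print("Throughput: HDP/HDFS %.0f MB/s, EMR %.0f MB/s" % (hdp_mb_per_s, emr_mb_per_s))
```

Note that the two percentages describe the same gap from different baselines: EMR needed about 46% more time, which is equivalent to HDP/HDFS needing about 31% less.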
Sample JSON record
Total disk usage in HDFS consumed by all files is 37 GB
Source data consists of 143 JSON files. Each file averages 256 MB, for a total data volume of 37 GB.
Output produced. Operating system and total page view count:
HDP Resource Manager log
EMR Resource Manager log