As the Apache Hadoop project completes its first decade, I will start with a fun fact: Hadoop is not an acronym; it was named after a toy elephant belonging to the son of one of its creators. Today, Hadoop is the fastest growing segment of the data management and analytics market. That market is projected to grow from $69.6B in 2015 to $132B in 2020, and the Hadoop segment is forecast to grow 2.7 times faster than the market as a whole (according to a recent analyst report).
Though Apache Hadoop started as a single system for batch workloads, it is becoming a multi-use data platform for batch, interactive, online, and streaming workloads. At the core of the Hadoop ecosystem, we have the Apache Hadoop Distributed File System (HDFS) to reliably store data and Apache Hadoop YARN (Yet Another Resource Negotiator) to arbitrate how applications use Hadoop cluster resources. While this post focuses on Apache HDFS, I recommend Vinod Kumar Vavilapalli's blog covering the latest in Apache Hadoop YARN.
Now, for the many of you attending Hadoop Summit San Jose 2016, I want to capture some of the high-level themes behind the Apache HDFS focused sessions: mixed workload support, enterprise hardening (supportability/multi-tenancy), storage efficiency, geo-dispersed clusters, and cloud. Data volume has been growing at an unprecedented rate as users retain more data, for longer, to derive insights from it. Active archives of data at petabyte scale are becoming common, and users want logical separation for compliance and security reasons. Users expect reliability and ease of supportability from Apache HDFS, just as they would from any enterprise-grade storage platform. There is a great deal of interest in retaining data efficiently, without incurring a large storage overhead. For disaster recovery and various other reasons, many Hadoop clusters are now spread across geographies. In addition to batch-oriented large sequential writes, we are seeing small random files coming to the Apache Hadoop file system. Last but not least, the public cloud is on everyone's mind, and there are activities in the community to seamlessly plug cloud storage into the Hadoop ecosystem as an extension. You might find the following sessions interesting; some are presented by luminaries from the Apache Hadoop community of committers and contributors.
We are evolving HDFS from a distributed file system into a distributed storage system that will support not just the file system, but other storage services as well. We plan to evolve the DataNodes' fault-tolerant block storage layer into a generalized subsystem over which to build storage services such as HDFS and an object store.
At Expedia, multiple business teams run ETL jobs that push their data to HDFS. With such heavy usage, cluster scalability and job performance are crucial. This project aimed to scan the cluster on a weekly basis, cataloging small files along with their locations and the teams they belong to, and tracking their growth over time.
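The session abstract does not include code, but a minimal sketch of such a scan, built on the public FileSystem API, might look like the following; the 64 MB threshold and the /data root are illustrative assumptions, not details from the Expedia project.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

// Walk a directory tree and count files smaller than a threshold.
public class SmallFileScan {
  private static final long SMALL_FILE_THRESHOLD = 64L * 1024 * 1024; // 64 MB

  public static void main(String[] args) throws IOException {
    FileSystem fs = FileSystem.get(new Configuration());

    long smallFiles = 0;
    long totalFiles = 0;
    RemoteIterator<LocatedFileStatus> it =
        fs.listFiles(new Path("/data"), true); // true = recurse
    while (it.hasNext()) {
      LocatedFileStatus status = it.next();
      totalFiles++;
      if (status.getLen() < SMALL_FILE_THRESHOLD) {
        smallFiles++; // candidate for compaction or archiving
      }
    }
    System.out.printf("%d of %d files are under the threshold%n",
        smallFiles, totalFiles);
  }
}
```

At cluster scale, a production version would more likely analyze an fsimage snapshot offline than list the live namespace, to avoid putting extra load on the NameNode.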
Performance and stability of HDFS
are crucial to the correct functioning of applications at higher layers in the
Hadoop stack. This session is a technical deep dive into recent enhancements
committed to HDFS by the entire Apache contributor community.
Hadoop, as a central enterprise data hub, naturally demands multi-tenancy. In this talk, we will explore existing multi-tenancy features, including their use cases and limitations, as well as ongoing work to provide better multi-tenancy support for the Hadoop ecosystem at the HDFS layer, such as effective NameNode throttling and DataNode and YARN QoS.
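As one concrete, existing example of NameNode throttling, Hadoop ships a FairCallQueue that can replace the default FIFO RPC call queue so a single heavy tenant cannot monopolize NameNode handlers. Below is a minimal sketch, assuming the NameNode RPC port is 8020; in a real deployment this property would be set in the NameNode's core-site.xml rather than in client code.

```java
import org.apache.hadoop.conf.Configuration;

public class FairCallQueueConfig {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Swap the default FIFO call queue for FairCallQueue, which
    // prioritizes RPCs from lighter users over heavy ones.
    conf.set("ipc.8020.callqueue.impl",
        "org.apache.hadoop.ipc.FairCallQueue");
    System.out.println(conf.get("ipc.8020.callqueue.impl"));
  }
}
```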
Ever since its creation, HDFS has relied on data replication to shield against most failure scenarios. However, with the explosive growth in data volume, replication is getting quite expensive: 3x replication carries a 200% storage overhead. Erasure coding (EC) uses far less storage space while still providing the same level of fault tolerance. In this talk, we will present the first-ever performance study of the new HDFS erasure coding feature.
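As a back-of-the-envelope illustration of why EC matters, here is the overhead arithmetic for the Reed-Solomon (6,3) schema commonly discussed for HDFS-EC; the 100 TB figure is an assumption for the example.

```java
// Compare the on-disk footprint of 3x replication and RS(6,3) EC.
public class StorageOverhead {
  public static void main(String[] args) {
    double dataTb = 100.0; // illustrative: 100 TB of raw user data

    // 3x replication: every block is stored three times -> 200% overhead,
    // tolerating the loss of any 2 replicas.
    double replicated = dataTb * 3;

    // RS(6,3): each stripe of 6 data cells adds 3 parity cells -> 50%
    // overhead, tolerating the loss of any 3 cells per stripe.
    double erasureCoded = dataTb * (6.0 + 3.0) / 6.0;

    System.out.printf("3x replication: %.0f TB on disk%n", replicated);   // 300
    System.out.printf("RS(6,3) EC:     %.0f TB on disk%n", erasureCoded); // 150
  }
}
```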
Twitter stores hundreds of petabytes in multiple datacenters, on multiple Hadoop clusters. ViewFileSystem makes interacting with our HDFS infrastructure as simple as working with a single namespace spanning all datacenters and clusters.
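To illustrate, here is a minimal client-side mount table sketch using Hadoop's standard ViewFileSystem configuration keys; the cluster name "global" and the NameNode URIs are placeholders, not Twitter's actual layout.

```java
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ViewFsExample {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "viewfs://global/");
    // Map paths in the unified namespace onto physical clusters.
    conf.set("fs.viewfs.mounttable.global.link./logs",
        "hdfs://nn-dc1.example.com:8020/logs");
    conf.set("fs.viewfs.mounttable.global.link./warehouse",
        "hdfs://nn-dc2.example.com:8020/warehouse");

    // Clients see one namespace; ViewFileSystem routes each path to the
    // cluster that actually stores it.
    FileSystem fs = FileSystem.get(URI.create("viewfs://global/"), conf);
    System.out.println(fs.exists(new Path("/logs")));
  }
}
```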
Today's typical Apache Hadoop deployments use HDFS for persistent, fault-tolerant storage of big data files. However, emerging architectural patterns increasingly rely on cloud object stores such as Amazon S3, Azure Blob Storage, and Google Cloud Storage (GCS), which are designed for cost-efficiency, scalability, and geographic distribution. This session explores the challenges around cloud object storage and presents recent work to address them in a comprehensive effort.
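Part of what makes this plug-in story work is that object stores surface through the same FileSystem API as HDFS. Here is a minimal sketch using the s3a connector from the hadoop-aws module; the bucket name and credentials are placeholders.

```java
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ObjectStoreExample {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // Credentials normally come from core-site.xml or the environment;
    // set here only for illustration.
    conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY");
    conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY");

    // The same API that lists an HDFS directory lists a bucket.
    FileSystem fs = FileSystem.get(URI.create("s3a://example-bucket/"), conf);
    for (FileStatus status : fs.listStatus(new Path("/"))) {
      System.out.println(status.getPath());
    }
  }
}
```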
HDFS also stores transient and operational data in cloud offerings such as Azure HDInsight and Amazon EMR. In these settings, but also in more traditional, on-premises deployments, applications often manage data stored in multiple filesystems, each with unique traits. Building on the heterogeneous storage support contributed to Apache Hadoop 2.3 and expanded in 2.7, we embed a tiered storage architecture in HDFS to work with external stores.
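The heterogeneous storage support this builds on is already visible in the HDFS API as storage policies. Below is a minimal sketch, assuming an HDFS deployment with ARCHIVE-tagged media; the /archive path is illustrative.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class StoragePolicyExample {
  public static void main(String[] args) throws IOException {
    FileSystem fs = FileSystem.get(new Configuration());
    if (fs instanceof DistributedFileSystem) {
      DistributedFileSystem dfs = (DistributedFileSystem) fs;
      // New blocks under /archive will target ARCHIVE storage; existing
      // blocks migrate when the HDFS Mover tool is run.
      dfs.setStoragePolicy(new Path("/archive"), "COLD");
    }
  }
}
```

The session describes extending this same policy mechanism so that a tier can live in an external store rather than on local media.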