As the Apache Hadoop project completes a decade, I will start with a fun fact: Hadoop is not an acronym. It was named after a toy elephant belonging to the son of one of its creators, and it is now the fastest-growing segment of the data management and analytics market. That market is projected to grow from $69.6B in 2015 to $132B in 2020, and the Hadoop segment is forecast to grow 2.7 times faster than the data management and analytics market as a whole (according to a recent analyst report).

Though Apache Hadoop started as a single system for batch workloads, it is becoming a multi-use data platform for batch, interactive, online, and streaming workloads. At the core of the Hadoop ecosystem, we have the Apache Hadoop Distributed File System (HDFS) to reliably store data and Apache Hadoop YARN (Yet Another Resource Negotiator) to arbitrate how applications use cluster resources. While this post is focused on Apache HDFS, I recommend Vinod Kumar Vavilapalli’s blog covering the latest in Apache Hadoop YARN.

Now, for many of you attending Hadoop Summit San Jose 2016, I want to capture some of the high-level themes behind the Apache HDFS focused sessions: mixed workload support, enterprise hardening (supportability and multi-tenancy), storage efficiency, geo-dispersed clusters, and cloud. Data volume has been growing at an unprecedented rate as users retain more data, for longer, to extract insights. Active archives at petabyte scale are becoming common, and users want logical separation for compliance and security reasons. Users expect reliability and ease of supportability from Apache HDFS, just as they would from any enterprise-grade storage platform. There is a great deal of interest in retaining data efficiently without incurring a large storage overhead. For disaster recovery and various other reasons, many Hadoop clusters are now spread across geographies. In addition to batch-oriented large sequential writes, we are seeing small, random files coming into HDFS. Last but not least, the public cloud is on everyone’s mind, and there is work in the community to seamlessly plug cloud storage into the Hadoop ecosystem. You might find the following sessions interesting; many are presented by luminaries from the Apache Hadoop community of committers and contributors.

Evolving HDFS to a Generalized Distributed Storage Subsystem By Sanjay Radia, Hortonworks and Jitendra Pandey, Hortonworks

We are evolving HDFS into a distributed storage system that will support not just a distributed file system, but other storage services as well. We plan to evolve the DataNodes’ fault-tolerant block storage layer into a generalized subsystem on which storage services such as HDFS and an object store can be built.

HDFS Analysis for Small Files By Rohit Jangid, Expedia and Raman Goyal, Expedia

At Expedia, multiple business teams run ETL jobs that push their data to HDFS. With such enormous usage, cluster scalability and job performance are crucial. This project aimed at scanning the cluster on a weekly basis and cataloging small files: their location, which team they belong to, and their growth over time.
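For readers curious how such a scan can be built, here is a minimal sketch using the public Hadoop FileSystem API to walk a directory tree and flag files below a size threshold. The root path, the 64 MB cutoff, and the plain-text output are my own assumptions for illustration, not the Expedia implementation.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

// Minimal sketch: recursively walk a directory tree and report files smaller
// than a threshold. The root path and 64 MB cutoff are illustrative assumptions.
public class SmallFileScan {
  public static void main(String[] args) throws IOException {
    final long threshold = 64L * 1024 * 1024;                 // "small" cutoff (assumed)
    Path root = new Path(args.length > 0 ? args[0] : "/data"); // scan root (assumed)
    FileSystem fs = FileSystem.get(new Configuration());

    long smallFiles = 0, totalFiles = 0;
    RemoteIterator<LocatedFileStatus> it = fs.listFiles(root, true); // recursive listing
    while (it.hasNext()) {
      LocatedFileStatus status = it.next();
      totalFiles++;
      if (status.getLen() < threshold) {
        smallFiles++;
        System.out.println(status.getPath() + "\t" + status.getLen());
      }
    }
    System.out.printf("%d of %d files are under %d bytes%n", smallFiles, totalFiles, threshold);
  }
}
```

A production version would likely read the NameNode fsimage offline rather than listing the live namespace, and would join the results against team ownership metadata, but the idea of cataloging by path and size is the same.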

HDFS: Optimization, Stabilization and Supportability By Chris Nauroth, Hortonworks and Arpit Agarwal, Hortonworks

Performance and stability of HDFS are crucial to the correct functioning of applications at higher layers in the Hadoop stack. This session is a technical deep dive into recent enhancements committed to HDFS by the entire Apache contributor community.

Toward Better Multi-Tenancy Support from HDFS By Xiaoyu Yao, Hortonworks

Hadoop, as a central enterprise data hub, naturally demands multi-tenancy. In this talk, we will explore existing multi-tenancy features, including their use cases and limitations, and ongoing work to provide better multi-tenancy support for the Hadoop ecosystem from the HDFS layer, such as effective NameNode throttling and DataNode/YARN QoS integration.

Debunking the Myths of HDFS Erasure Coding Performance By Zhe Zhang, LinkedIn and Uma Maheswara Rao Gangumalla, Intel

Ever since its creation, HDFS has been relying on data replication to shield against most failure scenarios. However, with the explosive growth in data volume, replication is getting quite expensive. Erasure coding (EC) uses far less storage space while still providing the same level of fault tolerance. In this talk, we will present the first-ever performance study of the new HDFS erasure coding feature.
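To put the storage savings in perspective, here is a back-of-the-envelope comparison (my own illustration, not a result from the talk): 3x replication stores three raw bytes per logical byte (200% overhead) and survives the loss of two copies, while a Reed-Solomon (6,3) layout stores 6 data blocks plus 3 parity blocks (1.5x raw, or 50% overhead) and survives the loss of any 3 of the 9 blocks.

```java
// Back-of-the-envelope comparison of raw storage cost per logical byte.
// RS(6,3) is used as the example erasure coding scheme.
public class EcOverhead {
  public static void main(String[] args) {
    double replication = 3.0 / 1.0;   // 3 replicas per block -> 3.0x raw storage
    double rs63 = (6.0 + 3.0) / 6.0;  // 6 data + 3 parity blocks -> 1.5x raw storage
    System.out.printf("3x replication: %.1fx raw (%.0f%% overhead)%n",
        replication, (replication - 1) * 100);
    System.out.printf("RS(6,3) EC:     %.1fx raw (%.0f%% overhead)%n",
        rs63, (rs63 - 1) * 100);
  }
}
```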

Cross-DC Fault-Tolerant ViewFileSystem at Twitter By Gera Shegalov, Twitter and Ming Ma, Twitter

Twitter stores hundreds of petabytes in multiple datacenters, on multiple Hadoop clusters. ViewFileSystem makes the interaction with our HDFS infrastructure as simple as a single namespace spanning all datacenters and clusters.
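For readers unfamiliar with ViewFileSystem, the sketch below shows the general shape of a client-side mount table that maps paths in one viewfs namespace onto different HDFS clusters. The mount table name and NameNode addresses ("global", nn-dc1, nn-dc2) are hypothetical, and a real deployment would carry these keys in core-site.xml rather than setting them in code.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

// Minimal sketch of a ViewFileSystem client mount table. The mount table name
// and NameNode hosts are hypothetical examples.
public class ViewFsExample {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "viewfs://global/");
    // Map /user to a cluster in one datacenter and /logs to another.
    conf.set("fs.viewfs.mounttable.global.link./user", "hdfs://nn-dc1:8020/user");
    conf.set("fs.viewfs.mounttable.global.link./logs", "hdfs://nn-dc2:8020/logs");

    FileSystem fs = FileSystem.get(conf);
    // Applications see a single viewfs:// namespace; reads and writes under
    // /user or /logs are routed by the client to the appropriate cluster.
    System.out.println("Working against " + fs.getUri());
  }
}
```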

Hadoop & Cloud Storage: Object Store Integration in Production By Chris Nauroth, Hortonworks and Rajesh Balamohan, Hortonworks

Today's typical Apache Hadoop deployments use HDFS for persistent, fault-tolerant storage of big data files. However, emerging architectural patterns increasingly rely on cloud object storage such as S3, Azure Blob Storage, and GCS, which are designed for cost-efficiency, scalability, and geographic distribution. This session explores the challenges around cloud object storage and presents recent work to address them in a comprehensive effort.
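As a taste of what this integration looks like from application code, the sketch below lists objects under an s3a:// path through the same FileSystem API used for HDFS. The bucket name and inline credential keys are placeholders; a real deployment supplies credentials via configuration files, environment variables, or instance roles, and needs the hadoop-aws connector on the classpath.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: list an S3 bucket through the Hadoop s3a connector.
// Bucket name and credentials are placeholders, not a recommended setup.
public class S3aListing {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    conf.set("fs.s3a.access.key", "<access-key>");   // placeholder
    conf.set("fs.s3a.secret.key", "<secret-key>");   // placeholder

    Path bucket = new Path("s3a://example-bucket/datasets/");
    FileSystem fs = bucket.getFileSystem(conf);
    for (FileStatus status : fs.listStatus(bucket)) {
      System.out.println(status.getPath() + "\t" + status.getLen());
    }
  }
}
```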

HDFS Tiered Storage By Chris Douglas, Microsoft and Virajith Jalaparti, Microsoft

HDFS also stores transient and operational data in cloud offerings such as Azure HDInsight and Amazon EMR. In these settings, but also in more traditional on-premises deployments, applications often manage data stored in multiple filesystems, each with unique traits. Building on the heterogeneous storage support contributed to Apache Hadoop 2.3 and expanded in 2.7, we embed a tiered storage architecture in HDFS to work with external stores.
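To anchor what "tiered storage" builds on, here is a minimal sketch of the existing heterogeneous storage policies (for example HOT and COLD) that the abstract refers to. The paths are illustrative, the snippet assumes a Hadoop release recent enough to expose setStoragePolicy on FileSystem, and the external-store tier described in the talk goes beyond what this API shows.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch of the existing storage-policy building block: pin an
// archive directory to the COLD tier (ARCHIVE media) and keep an active
// dataset on the default HOT tier (DISK). Paths are illustrative.
public class StoragePolicyExample {
  public static void main(String[] args) throws IOException {
    FileSystem fs = FileSystem.get(new Configuration());
    fs.setStoragePolicy(new Path("/data/archive"), "COLD");
    fs.setStoragePolicy(new Path("/data/active"), "HOT");
    System.out.println("Policy on /data/archive: "
        + fs.getStoragePolicy(new Path("/data/archive")));
  }
}
```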
