Community Articles

sburagohain · ‎02-16-2017

The new year brings new innovation and collaborative efforts. Various teams from the Apache community have been working hard for the last eighteen months to bring the EZ button to Apache Hadoop technology and Data Lake. In the coming months, we will publish a series of blogs introducing our Data Lake 3.0 architecture and highlighting our innovations within Apache Hadoop core and its related technologies.

The What

You have probably heard of the Deep Learning powered cucumber sorter built by Japanese farmer Makoto Koike! On their cucumber farm, Makoto’s mother spends up to eight hours per day classifying cucumbers into different classes. Makoto is a trained embedded systems designer, not a trained “Machine Learning” engineer. He leveraged TensorFlow, a deep learning framework, with minor configuration to automate his mom’s complex art of cucumber sorting so that they can focus more on cucumber farming instead.

This simple yet powerful example mirrors the journey we have embarked on with our valued enterprise customers to reduce the time to deployment and insight (from days to minutes), while reducing the Total Cost of Ownership (TCO) by 2x. Instead of a component-centric approach, we envision an application-centric Data Lake 3.0. If you look back, Data Lake 1.0 was a single-use system for batch applications, and Data Lake 2.0 was a multi-use platform for batch, interactive, online and streaming components. In Data Lake 3.0, we want to deploy pre-packaged applications with minor customizations, so that the focus shifts from managing the platform to solving business problems.

The Why

We begin with a few real-world problems, ranging from simple to complex. The common threads behind the Data Lake 3.0 architecture are: reduce the time to deployment; reduce the time to insight; and reduce the TCO of a Petabyte (PB) scale Hadoop infrastructure, while increasing utilization of the cluster with additional workloads.

The customer wants to empower its DevOps tenants to spin up a logical cluster in minutes instead of days, with the tenants sharing a common set of servers yet each using their own version of Hortonworks Data Platform (HDP). The customer also wants to dynamically allocate compute and memory resources between its globally dispersed tenants, which follow the sun.
The customer has a standard procedure to upgrade the underlying production Hadoop infrastructure infrequently, but wants agility and a faster cadence in the application layer, and possibly wants to run various versions (e.g. dev, test and production) of each application side by side.
The enterprise customer wants to move towards a business-value focused audience (versus selling to an infrastructure focused audience) and wants the ability to sell pre-assembled Big Data applications (such as ones focused on Cyber Security or the Internet of Things) with minor customization effort. Similar to the AppStore of a SmartPhone Operating System, the customer wants a hub where its end consumers can download the Big Data applications.
The customer is deploying expensive hardware resources such as GPUs and FPGAs for deep learning and wants to share them as a cluster-wide resource pool, with network and IO level SLA guidance per tenant, to improve both the performance of the applications and the utilization of the cluster.
The customer has a corporate mandate to archive data for five years instead of one and needs the data lake to provide 2x the present storage efficiency, without sacrificing the ability to query the data in seconds.
The customer is running business critical (aka Tier1) applications on Hadoop infrastructure and requires a Disaster Recovery and Business Continuity strategy so that data is available within minutes or hours, should the production site go down.

The How

Our Data Lake 3.0 vision requires us to execute on a complex set of machinery under the hood. While not an exhaustive list, the following sections provide a high-level overview of the capabilities involved. This introductory blog sets the stage.

Application Assemblies: A baseline set of services running on bare metal facilitates running Dockerized services for a longer duration. We can leverage the benefits of Docker packaging and distribution, along with the isolation it provides. We can cut down the “time to deployment” from days to minutes and enable use cases such as running multiple versions of applications side by side; running multiple Hortonworks Data Platform (HDP) clusters logically on a single data lake; and running use-case focused, data intensive micro-services that we refer to as “Assemblies”.

Storage Enhancements: Naturally, we store the datasets in a single Hadoop data lake to increase the analytics efficacy and reduce silos, while providing multiple logical application-centric services on top. Data needs to be kept for many years in an active archive fashion. Depending on its access pattern and temperature, the data needs to sit on either fast (Solid State Drive) or slow (Hard Drive) media. This is where Reed-Solomon based Erasure Coding plays a pivotal role, reducing the storage footprint by 2x (versus the existing three-replica approach), especially for cold storage. In the future, we intend to provide an “auto-tiering” mechanism to move the data between hot and cold tiers of media automagically. Liberated from the storage overhead and TCO burden, customers can now retain data for many years. Features such as a three-NameNode configuration make sure that the administrator has a large servicing window, just in case the Active NameNode goes down on a Saturday night.
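
To make the 2x figure concrete, here is a small back-of-the-envelope sketch in Python. It assumes a Reed-Solomon layout of 6 data units plus 3 parity units; the post itself does not name a specific layout, so treat the (6, 3) numbers and the function names as illustrative assumptions rather than the actual configuration.

```python
from fractions import Fraction


def replication_raw_factor(replicas: int) -> Fraction:
    """Raw bytes stored per logical byte with n-way replication."""
    return Fraction(replicas, 1)


def erasure_coding_raw_factor(data_units: int, parity_units: int) -> Fraction:
    """Raw bytes stored per logical byte with a Reed-Solomon (data, parity) layout."""
    return Fraction(data_units + parity_units, data_units)


# Existing approach from the text: three full replicas of every block.
replica_factor = replication_raw_factor(3)

# Assumed layout (not stated in the post): 6 data units + 3 parity units.
ec_factor = erasure_coding_raw_factor(6, 3)

# How many times less raw storage erasure coding needs for the same data.
savings_ratio = replica_factor / ec_factor

logical_pb = 1  # one petabyte of logical data
print(f"3 replicas : {float(replica_factor * logical_pb):.1f} PB raw")
print(f"RS(6,3)    : {float(ec_factor * logical_pb):.1f} PB raw")
print(f"reduction  : {float(savings_ratio):.1f}x")
```

Under that assumption, one PB of logical data needs 3 PB of raw disk with three replicas and 1.5 PB with erasure coding, which is where a 2x reduction comes from. Other layouts would give somewhat different ratios.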

Resource Isolation & Sharing: Compute intensive analytics such as deep learning require not only a large compute pool, but also a fast and expensive processing pool made of Graphics Processing Units (GPUs) working in tandem to cut the time to insight from months to days. We intend to provide a resource vector attribute that can be mapped to the cluster-wide GPU resources, so a customer does not have to dedicate a GPU node to a single tenant or workload. In addition to providing CPU and Memory level isolation, we will provide Network and IO level isolation between tenants and facilitate dynamic allocation of the resources.
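
The idea of a resource vector over a shared GPU pool can be pictured with a toy model. The Python sketch below is purely illustrative: `ResourceVector`, `SharedPool`, the tenant names and the capacities are all hypothetical, and this is not the scheduler's actual API. It only shows tenants drawing GPUs from one cluster-wide pool and handing them back, instead of each owning a dedicated GPU node.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class ResourceVector:
    """Hypothetical resource vector: CPU, memory and GPU counted together."""
    vcores: int = 0
    memory_mb: int = 0
    gpus: int = 0

    def __add__(self, other: "ResourceVector") -> "ResourceVector":
        return ResourceVector(self.vcores + other.vcores,
                              self.memory_mb + other.memory_mb,
                              self.gpus + other.gpus)

    def __sub__(self, other: "ResourceVector") -> "ResourceVector":
        return ResourceVector(self.vcores - other.vcores,
                              self.memory_mb - other.memory_mb,
                              self.gpus - other.gpus)

    def fits_in(self, available: "ResourceVector") -> bool:
        """True only if every dimension of the request is available."""
        return (self.vcores <= available.vcores
                and self.memory_mb <= available.memory_mb
                and self.gpus <= available.gpus)


@dataclass
class SharedPool:
    """Hypothetical cluster-wide pool shared by all tenants."""
    capacity: ResourceVector
    allocations: Dict[str, ResourceVector] = field(default_factory=dict)

    def free(self) -> ResourceVector:
        used = ResourceVector()
        for granted in self.allocations.values():
            used = used + granted
        return self.capacity - used

    def allocate(self, tenant: str, request: ResourceVector) -> bool:
        """Grant the request if it fits in what is currently free."""
        if not request.fits_in(self.free()):
            return False
        current = self.allocations.get(tenant, ResourceVector())
        self.allocations[tenant] = current + request
        return True

    def release(self, tenant: str) -> None:
        """Return everything the tenant holds to the pool."""
        self.allocations.pop(tenant, None)


# Made-up capacities and tenants, for illustration only.
pool = SharedPool(capacity=ResourceVector(vcores=64, memory_mb=262144, gpus=8))
london_request = ResourceVector(vcores=16, memory_mb=65536, gpus=4)

granted_tokyo = pool.allocate("tokyo", ResourceVector(vcores=16, memory_mb=65536, gpus=6))
granted_london = pool.allocate("london", london_request)  # refused: only 2 GPUs free
pool.release("tokyo")                                      # tokyo's working day ends
granted_london_retry = pool.allocate("london", london_request)
```

In this toy run the second tenant is refused while the first holds six of the eight GPUs, and succeeds once they are released. That is the kind of dynamic hand-off between tenants in different time zones that the text describes. A real implementation would also need queues, preemption and the network and IO isolation mentioned above, none of which is modeled here.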

The Road Ahead

At Hortonworks, we are incredibly lucky to be guided by many of the world’s advanced analytics users, representing a wide set of verticals, in our customer advisory and briefing meetings. Based on their invaluable input, we are on an exciting journey to supercharge Apache Hadoop. Our trip will have many legs, but 2017 is going to be the exciting year in which we deliver on many of our promises. If you have made it this far, I encourage you to follow this blog series, as we continue to provide more detailed updates from our rockstar technology leaders.

We hope you enjoy the demo video that captures a glimpse of 2017! Please contact us if you are interested in limited early access.

Data Lake 3.0 -Containerization, Erasure Coding, GPU Pooling
