Member since: 05-23-2016
Posts: 11
Kudos Received: 13
Solutions: 2
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 2129 | 02-06-2017 06:06 PM
 | 1409 | 09-06-2016 11:43 PM
04-28-2017
04:10 PM
1 Kudo
This is a good article by our intern James Medel on protecting against accidental deletion:

USING HDFS SNAPSHOTS TO PROTECT IMPORTANT ENTERPRISE DATASETS

Some time back, we introduced the ability to create snapshots to protect important enterprise data sets from user or application errors. HDFS snapshots are read-only, point-in-time copies of the file system. Snapshots can be taken on a subtree of the file system or on the entire file system, and they are:

Performant and reliable: snapshot creation is atomic and instantaneous, no matter the size or depth of the directory subtree.
Scalable: snapshots do not create extra copies of blocks on the file system. They are highly optimized in memory and stored along with the NameNode's file system namespace.

In this blog post we'll walk through how to administer and use HDFS snapshots.

ENABLE SNAPSHOTS

In an example scenario, web server logs are loaded into HDFS a few times a day for processing and long-term storage. The dataset is organized into directories that hold one day's log files each:

/data/weblogs
/data/weblogs/20130901
/data/weblogs/20130902
/data/weblogs/20130903

Since the web server logs are stored only in HDFS, it is imperative that they are protected from deletion. To provide data protection and recovery for the web server log data, snapshots are enabled for the parent directory:

hdfs dfsadmin -allowSnapshot /data/weblogs

Snapshots need to be explicitly enabled per directory. This gives system administrators the granular control they need to manage data in HDP.

TAKE POINT IN TIME SNAPSHOTS

The following command creates a point-in-time snapshot of the /data/weblogs directory and its subtree:

hdfs dfs -createSnapshot /data/weblogs

This creates a snapshot and gives it a default name that matches the timestamp at which the snapshot was created; users can provide an optional snapshot name instead of the default. With the default name, the created snapshot path is /data/weblogs/.snapshot/s20130903-000941.091.

Users can schedule a cron job to create snapshots at regular intervals. For example, the cron entry 30 18 * * * rm /home/someuser/tmp/* tells the system to remove the contents of the tmp folder at 18:30 every day. Along the same lines, the cron entry 30 18 * * * hdfs dfs -createSnapshot /data/weblogs schedules a snapshot of /data/weblogs to be created each day at 18:30.

To view the state of the directory at the recently created snapshot:

hdfs dfs -ls /data/weblogs/.snapshot/s20130903-000941.091
Found 3 items
drwxr-xr-x - web hadoop 0 2013-09-01 23:59 /data/weblogs/.snapshot/s20130903-000941.091/20130901
drwxr-xr-x - web hadoop 0 2013-09-02 00:55 /data/weblogs/.snapshot/s20130903-000941.091/20130902
drwxr-xr-x - web hadoop 0 2013-09-03 23:57 /data/weblogs/.snapshot/s20130903-000941.091/20130903

RECOVER LOST DATA

As new data is loaded into the web logs dataset, a file or directory could be deleted erroneously. For example, an application could delete the set of logs pertaining to September 2nd, 2013, stored in the /data/weblogs/20130902 directory. Since /data/weblogs has a snapshot, the snapshot protects the underlying file blocks from being removed from the file system; the deletion only modifies the metadata to remove /data/weblogs/20130902 from the working directory.

To recover from the deletion, the data is restored by copying it from the snapshot path:

hdfs dfs -cp /data/weblogs/.snapshot/s20130903-000941.091/20130902 /data/weblogs/

This restores the lost set of files to the working dataset:

hdfs dfs -ls /data/weblogs
Found 3 items
drwxr-xr-x - web hadoop 0 2013-09-01 23:59 /data/weblogs/20130901
drwxr-xr-x - web hadoop 0 2013-09-04 12:10 /data/weblogs/20130902
drwxr-xr-x - web hadoop 0 2013-09-03 23:57 /data/weblogs/20130903

Since snapshots are read-only, HDFS also protects against user or application deletion of the snapshot data itself. The following operation will fail:

hdfs dfs -rmdir /data/weblogs/.snapshot/s20130903-000941.091/20130902

NEXT STEPS

With HDP 2.1, you can use snapshots to protect your enterprise data from accidental deletion, corruption and errors. Download HDP to get started.
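As a follow-up to the article, HDFS also ships housekeeping commands for snapshots; a minimal sketch (the snapshot names and the new name below are illustrative):

# list every directory on which snapshots are currently allowed
hdfs lsSnapshottableDir
# show what changed under /data/weblogs between two snapshots
hdfs snapshotDiff /data/weblogs s20130903-000941.091 s20130904-001012.204
# rename a snapshot, then delete one that is no longer needed
hdfs dfs -renameSnapshot /data/weblogs s20130903-000941.091 weblogs-pre-cleanup
hdfs dfs -deleteSnapshot /data/weblogs s20130902-000530.813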
02-16-2017
05:43 AM
3 Kudos
The new year brings new innovation and collaborative efforts. Various teams from the Apache community have been working hard for the last eighteen months to bring the EZ button to Apache Hadoop technology and the Data Lake. In the coming months, we will publish a series of blogs introducing our Data Lake 3.0 architecture and highlighting our innovations within Apache Hadoop core and its related technologies.

The What

You have probably heard of the deep-learning-powered cucumber sorter from Makoto Koike, a Japanese farmer. On their cucumber farm, Makoto's mother spends up to eight hours per day classifying cucumbers into different classes. Makoto is a trained embedded systems designer but not a trained "machine learning" engineer. He leveraged TensorFlow, a deep learning framework, with minor configuration to automate his mom's complex art of cucumber sorting so that they can focus more on cucumber farming instead. This simple yet powerful example mirrors the journey we have embarked on with our valued enterprise customers: reduce the time to deployment and insight (from days to minutes) while reducing the Total Cost of Ownership (TCO) by 2x.

Instead of a component-centric approach, we envision an application-centric Data Lake 3.0. Looking back, Data Lake 1.0 was a single-use system for batch applications, and Data Lake 2.0 was a multi-use platform for batch, interactive, online and streaming components. In Data Lake 3.0, we want to deploy pre-packaged applications with minor customizations, and the focus will shift from platform management to solving the business problems.

The Why

We begin with a few real-world problems, ranging from simple to complex. The common threads behind the Data Lake 3.0 architecture are: reduce the time to deployment; reduce the time to insight; and reduce the TCO of a petabyte (PB) scale Hadoop infrastructure, while increasing utilization of the cluster with additional workloads.

- The customer wants to empower its dev-ops tenants to spin up a logical cluster in minutes instead of days, with the tenants sharing a common set of servers yet using their own version of Hortonworks Data Platform (HDP). The customer also wants to dynamically allocate compute and memory resources between its globally dispersed, follow-the-sun tenants.
- The customer has a standard procedure to upgrade the underlying production Hadoop infrastructure less frequently, but wants agility and a faster cadence in the application layer, possibly running various versions (i.e. dev, test and production) of each application side by side.
- The enterprise customer wants to move towards a business-value focused audience (versus selling to an infrastructure-focused audience) and wants the ability to sell pre-assembled big data applications (such as ones focused on cyber security or the Internet of Things) with minor customization effort. Similar to the app store of a smartphone operating system, the customer wants a hub where its end consumers can download these big data applications.
- The customer is deploying expensive hardware resources like GPUs and FPGAs for deep learning and wants to share them as a cluster-wide resource pool, with network- and IO-level SLA guidance per tenant, improving both application performance and cluster utilization.
- The customer has a corporate mandate to archive data for five years instead of one and needs the data lake to provide 2x the present storage efficiency, without sacrificing the ability to query the data in seconds.
- The customer is running business-critical (aka Tier 1) applications on Hadoop infrastructure and requires a disaster recovery / business continuity strategy so that data is available in minutes or hours should the production site go down.

The How

Our Data Lake 3.0 vision requires us to execute on a complex set of machinery under the hood. While not an exhaustive list, the following sections provide a high-level overview of the capabilities; we are setting the stage with this introductory blog.

Application Assemblies: A baseline set of services running on bare metal facilitates running dockerized services for a longer duration. We can leverage the benefits of Docker packaging and distribution, along with its isolation. We can cut the "time to deployment" from days to minutes and enable use cases such as running multiple versions of applications side by side, running multiple Hortonworks Data Platform (HDP) clusters logically on a single data lake, and running use-case-focused, data-intensive micro-services that we refer to as "Assemblies".
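As a small illustration of the dockerized-services direction (not the Assemblies feature itself), recent Apache Hadoop releases can already launch a YARN container inside a Docker image; a minimal distributed-shell sketch, assuming the NodeManagers have the Docker container runtime enabled in yarn-site.xml and using an illustrative image name:

# run a throwaway command inside a Docker image on YARN
yarn jar hadoop-yarn-applications-distributedshell.jar \
  -shell_env YARN_CONTAINER_RUNTIME_TYPE=docker \
  -shell_env YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=centos:7 \
  -shell_command "cat /etc/os-release" \
  -jar hadoop-yarn-applications-distributedshell.jar \
  -num_containers 1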
Storage Enhancements: Naturally, we store the datasets in a single Hadoop data lake to increase analytics efficacy and reduce silos, while providing multiple logical application-centric services on top. Data needs to be kept for many years in an active-archive fashion. Depending on the access pattern and temperature, the data needs to sit on both fast (solid state drive) and slow (hard drive) media. This is where Reed-Solomon based erasure coding plays a pivotal role in reducing the storage overhead by 2x (versus the existing 3-replica approach), especially for cold storage. In the future, we intend to provide an "auto-tiering" mechanism to move the data between hot and cold tiers of media automagically. Liberated from the storage overhead and TCO burden, customers can now retain data for many years. Features such as a three-NameNode configuration make sure that the administrator has a large servicing window, just in case the active NameNode goes down on a Saturday night.
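For readers who want to try the storage features described above, both erasure coding and storage-policy tiering surface as hdfs subcommands in Apache Hadoop 3; a minimal sketch (the /data/archive path and the RS-6-3-1024k policy choice are illustrative):

# erasure coding: list the built-in policies, enable one, and apply it to a cold directory
hdfs ec -listPolicies
hdfs ec -enablePolicy -policy RS-6-3-1024k
hdfs ec -setPolicy -path /data/archive -policy RS-6-3-1024k

# tiering: pin the directory to the archival (HDD) tier and let the mover relocate its blocks
hdfs storagepolicies -setStoragePolicy -path /data/archive -policy COLD
hdfs mover -p /data/archive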
Resource Isolation & Sharing: Compute-intensive analytics such as deep learning require not only a large compute pool, but also a fast and expensive processing pool made up of Graphics Processing Units (GPUs) working in tandem, to cut the time to insight from months to days. We intend to provide a resource vector attribute that can be mapped to cluster-wide GPU resources, so a customer does not have to dedicate a GPU node to a single tenant or workload. In addition to providing CPU- and memory-level isolation, we will provide network- and IO-level isolation between tenants and facilitate dynamic allocation of the resources.

The Road Ahead

At Hortonworks, we are incredibly lucky to be guided by many of the world's most advanced analytics users, representing a wide set of verticals, in our customer advisory and briefing meetings. Based on their invaluable input, we are on an exciting journey to supercharge Apache Hadoop. Our trip will have many legs, but 2017 is going to be an exciting year for delivering on many of our promises. If you have made it this far, I encourage you to follow this blog series as we continue to provide more detailed updates from our rockstar technology leaders. I hope you enjoy the demo video that captures a glimpse of 2017! Please contact us if you are interested in limited early access.
02-06-2017
06:06 PM
I echo what has been said already. From the Hortonworks Product Mgmt perspective, I'm monitoring this feature (no firm decision yet). The current state is that the Federation code is in our HDP release; however, we don't yet have the support in Ambari or Hive that Hortonworks needs to officially support this feature. We are getting some queries from customers along the lines of the use case you mentioned, and we need a critical mass to move up the priority. If any of your customers need it, please feel free to respond to any of our threads with the use case (plus the desired Ambari+Hive support).
09-20-2016
06:41 AM
3 Kudos
Hi Greg, I will provide a high-level response, assuming you are referring to a cluster hosted in AWS (the intro was covered in the following blog, and more detailed posts will follow). http://hortonworks.com/blog/making-elephant-fly-cloud/ The following is the high-level deployment scenario for a cluster in AWS. The Hortonworks cloud engineering team has made improvements in the S3 connector, the ORC layer and Hive, available in our cloud distro HDC. During a Hive benchmark test, we saw about a 2.5x performance improvement on average, and 2-14x across individual queries (vs. vanilla HDP on AWS). HDFS on EBS will be used for intermediate data, while S3 will be used for persistent storage. We are also enabling LLAP to cache columnar data sitting on S3 in order to further improve query performance on S3. Please stay tuned for the rest of the blog series (do remind us if you don't see them posted soon).
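To make the storage split concrete, here is a minimal sketch of working against S3 through the S3A connector from the Hadoop CLI (the bucket and paths are hypothetical, and it assumes S3 credentials or an IAM role are already configured):

# browse a dataset that lives on S3 via the s3a connector
hdfs dfs -ls s3a://my-datalake-bucket/warehouse/weblogs/
# copy derived results from cluster-local (EBS-backed) HDFS out to S3 for long-term persistence
hadoop distcp /tmp/hive-results s3a://my-datalake-bucket/warehouse/results/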
09-07-2016
12:00 AM
1 Kudo
I see that Sowmya already answered. Yes, we can specify S3 as the source/destination cluster(s) with paths (we support Azure as well). Here is a Falcon screenshot: falcon1.png
09-06-2016
11:43 PM
3 Kudos
Responding from Hortonworks Product Mgmt: currently, we only support native HDFS clusters as the source/destination in Falcon (in addition to S3/Azure). There is no support for Hadoop-compatible file systems (such as EMC ECS), though we are getting requests through various channels. This will be explored as a future item, though we have yet to arrive at a timeline.
07-26-2016
12:17 AM
2 Kudos
As the Apache Hadoop project completes a decade, I will start with a fun fact. Hadoop is not an acronym: it was named after a toy elephant belonging to the son of one of its creators, and it is now the fastest-growing segment of the data management and analytics market! The data management and analytics market is projected to grow from $69.6B in 2015 to $132B in 2020, and the Hadoop segment is forecasted to grow 2.7 times faster than the data management and analytics market overall (according to a recent analyst report).

Though Apache Hadoop started as a single system for batch workloads, it is becoming the multi-use data platform for batch, interactive, online and streaming workloads. At the core of the Hadoop ecosystem, we have the Hadoop Distributed File System (HDFS) to reliably store data and Apache Hadoop YARN (Yet Another Resource Negotiator) to reconcile the way applications use Hadoop system resources. While this post is focused on Apache HDFS, I recommend Vinod Kumar Vavilapalli's blog covering the latest on Apache Hadoop YARN.

Now, for the many of you attending Hadoop Summit San Jose 2016, I want to capture some of the high-level themes behind the Apache HDFS focused sessions: mixed workload support, enterprise hardening (supportability/multi-tenancy), storage efficiency, geo-dispersed clusters, and cloud. Data volume has been growing at an unprecedented rate as users retain more data, for longer, to gain insight. Active archives of data at petabyte scale are becoming common, and users want logical separation for compliance and security reasons. Users expect reliability and ease of supportability from Apache HDFS, just as they would from any enterprise-grade storage platform. There is a great deal of interest in retaining data in an efficient manner without incurring a large storage overhead. For disaster recovery and various other reasons, many Hadoop clusters are now spread across geographies. In addition to batch-oriented large sequential writes, we are seeing small random files coming to the Apache Hadoop file system. Last but not least, the public cloud is on everyone's mind, and there are activities in the community to seamlessly plug in cloud storage as an extension to the Hadoop ecosystem. You might find the following sessions interesting; some are presented by luminaries from the Apache Hadoop community of committers and contributors.

Evolving HDFS to a Generalized Distributed Storage Subsystem
By Sanjay Radia, Hortonworks and Jitendra Pandey, Hortonworks
We are evolving HDFS to a distributed storage system that will support not just a distributed file system, but other storage services. We plan to evolve the DataNodes' fault-tolerant block storage layer into a generalized subsystem over which to build other storage services such as HDFS, an object store, etc.

Hdfs Analysis for Small File
By Rohit Jangid, Expedia and Raman Goyal, Expedia
At Expedia, multiple business teams run ETL jobs which push their data to HDFS. With such enormous usage, cluster scalability and job performance are crucial. This project aimed at scanning the cluster on a weekly basis, cataloging these small files, their location and which team they belong to, and tracking their growth over time.

HDFS: Optimization, Stabilization and Supportability
By Chris Nauroth, Hortonworks and Arpit Agarwal, Hortonworks
Performance and stability of HDFS are crucial to the correct functioning of applications at higher layers in the Hadoop stack. This session is a technical deep dive into recent enhancements committed to HDFS by the entire Apache contributor community.

Toward Better Multi-Tenancy Support from HDFS
By Xiaoyu Yao, Hortonworks
Hadoop, as a central enterprise data hub, naturally demands multi-tenancy. In this talk, we will explore existing multi-tenancy features, including their use cases and limitations, and ongoing work to provide better multi-tenancy support for the Hadoop ecosystem from the HDFS layer, such as effective NameNode throttling and DataNode/YARN QoS integration.

Debunking the Myths of HDFS Erasure Coding Performance
By Zhe Zhang, LinkedIn and Uma Maheswara Rao Gangumalla, Intel
Ever since its creation, HDFS has been relying on data replication to shield against most failure scenarios. However, with the explosive growth in data volume, replication is getting quite expensive. Erasure coding (EC) uses far less storage space while still providing the same level of fault tolerance. In this talk, we will present the first-ever performance study of the new HDFS erasure coding feature.

Cross-DC Fault-Tolerant ViewFileSystem at Twitter
By Gera Shegalov, Twitter and Ming Ma, Twitter
Twitter stores hundreds of petabytes in multiple datacenters, on multiple Hadoop clusters. ViewFileSystem makes the interaction with our HDFS infrastructure as simple as a single namespace spanning all datacenters and clusters.

Hadoop & Cloud Storage: Object Store Integration in Production
By Chris Nauroth, Hortonworks and Rajesh Balamohan, Hortonworks
Today's typical Apache Hadoop deployments use HDFS for persistent, fault-tolerant storage of big data files. However, recent emerging architectural patterns increasingly rely on cloud object storage such as S3, Azure Blob Store and GCS, which are designed for cost-efficiency, scalability and geographic distribution. This session explores the challenges around cloud object storage and presents recent work to address them in a comprehensive effort.

HDFS Tiered Storage
By Chris Douglas, Microsoft and Virajith Jalaparti, Microsoft
HDFS also stores transient and operational data in cloud offerings such as Azure HDInsight and Amazon EMR. In these settings, but also in more traditional on-premises deployments, applications often manage data stored in multiple filesystems, each with unique traits. Building on existing heterogeneous storage support contributed to Apache Hadoop 2.3 and expanded in 2.7, we embed a tiered storage architecture in HDFS to work with external stores.
07-20-2016
09:25 PM
I will give you a qualitative answer: Ambari (UI) uses WebHDFS, and it is designed for scale and performance (vs. HttpFS). In the future, we will also look into enabling WebHDFS to seamlessly handle NameNode failover scenarios, so that apps dependent on WebHDFS do not have to keep track of which NameNode is active.
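For reference, WebHDFS is a REST API served by the NameNode, so you can exercise it directly; a minimal sketch (host, port and paths are illustrative; 50070 is the usual non-HA HDP 2.x NameNode HTTP port):

# list a directory through WebHDFS
curl -i "http://namenode.example.com:50070/webhdfs/v1/data/weblogs?op=LISTSTATUS"
# read a file: the NameNode answers with a 307 redirect to a DataNode, which -L follows
curl -i -L "http://namenode.example.com:50070/webhdfs/v1/data/weblogs/20130901/log.txt?op=OPEN"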