Member since: 05-23-2016
Posts: 11
Kudos Received: 13
Solutions: 2
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 2203 | 02-06-2017 06:06 PM
 | 1475 | 09-06-2016 11:43 PM
04-28-2017
04:10 PM
1 Kudo
This is a good article by our intern James Medel on protecting against accidental deletion:

USING HDFS SNAPSHOTS TO PROTECT IMPORTANT ENTERPRISE DATASETS

Some time back, we introduced the ability to create snapshots to protect important enterprise data sets from user or application errors. HDFS snapshots are read-only, point-in-time copies of the file system. Snapshots can be taken on a subtree of the file system or on the entire file system, and they are:

- Performant and reliable: snapshot creation is atomic and instantaneous, no matter the size or depth of the directory subtree.
- Scalable: snapshots do not create extra copies of blocks on the file system. Snapshots are highly optimized in memory and stored along with the NameNode's file system namespace.

In this blog post we'll walk through how to administer and use HDFS snapshots.

ENABLE SNAPSHOTS

In an example scenario, web server logs are loaded into HDFS a few times a day for processing and long-term storage, and the dataset is organized into directories that each hold one day of log files. Since the web server logs are stored only in HDFS, it is imperative that they are protected from deletion.

/data/weblogs
/data/weblogs/20130901
/data/weblogs/20130902
/data/weblogs/20130903

To provide data protection and recovery for the web server log data, snapshots are enabled for the parent directory:

hdfs dfsadmin -allowSnapshot /data/weblogs

Snapshots need to be explicitly enabled for directories. This gives system administrators the granular control they need to manage data in HDP.

TAKE POINT-IN-TIME SNAPSHOTS

The following command creates a point-in-time snapshot of the /data/weblogs directory and its subtree:

hdfs dfs -createSnapshot /data/weblogs

This creates a snapshot and gives it a default name that matches the timestamp at which the snapshot was created. Users can provide an optional snapshot name instead of the default. With the default name, the created snapshot path will be /data/weblogs/.snapshot/s20130903-000941.091.

Users can schedule a cron job to create snapshots at regular intervals. For example, the cron entry 30 18 * * * rm /home/someuser/tmp/* tells the system to delete the contents of the tmp folder at 18:30 every day. To integrate cron with HDFS snapshots, an entry such as 30 18 * * * hdfs dfs -createSnapshot /data/weblogs schedules a snapshot to be created each day at 18:30 (see the sketch at the end of this post).

To view the state of the directory at the recently created snapshot:

hdfs dfs -ls /data/weblogs/.snapshot/s20130903-000941.091
Found 3 items
drwxr-xr-x - web hadoop 0 2013-09-01 23:59 /data/weblogs/.snapshot/s20130903-000941.091/20130901
drwxr-xr-x - web hadoop 0 2013-09-02 00:55 /data/weblogs/.snapshot/s20130903-000941.091/20130902
drwxr-xr-x - web hadoop 0 2013-09-03 23:57 /data/weblogs/.snapshot/s20130903-000941.091/20130903

RECOVER LOST DATA

As new data is loaded into the web logs dataset, a file or directory could be deleted in error. For example, an application could delete the set of logs for September 2nd, 2013, stored in the /data/weblogs/20130902 directory. Since /data/weblogs has a snapshot, the snapshot protects the file blocks from being removed from the file system; the deletion only modifies the metadata to remove /data/weblogs/20130902 from the working directory. To recover from the deletion, the data is restored by copying it from the snapshot path:

hdfs dfs -cp /data/weblogs/.snapshot/s20130903-000941.091/20130902 /data/weblogs/

This restores the lost set of files to the working data set:

hdfs dfs -ls /data/weblogs
Found 3 items
drwxr-xr-x - web hadoop 0 2013-09-01 23:59 /data/weblogs/20130901
drwxr-xr-x - web hadoop 0 2013-09-04 12:10 /data/weblogs/20130902
drwxr-xr-x - web hadoop 0 2013-09-03 23:57 /data/weblogs/20130903

Since snapshots are read-only, HDFS also protects against user or application deletion of the snapshot data itself. The following operation will fail:

hdfs dfs -rmdir /data/weblogs/.snapshot/s20130903-000941.091/20130902

NEXT STEPS

With HDP 2.1, you can use snapshots to protect your enterprise data from accidental deletion, corruption and errors. Download HDP to get started.
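To round out the cron discussion above, here is a minimal sketch of a scheduled snapshot helper. The script name, retention count and snapshot naming convention are hypothetical; it only assumes the hdfs dfs -createSnapshot and -deleteSnapshot commands shown in this post, plus standard GNU shell utilities.

#!/usr/bin/env bash
# take_weblogs_snapshot.sh -- hypothetical helper invoked from cron, e.g.:
#   30 18 * * * /usr/local/bin/take_weblogs_snapshot.sh
# Creates a snapshot with an explicit date-stamped name and prunes old ones.
set -euo pipefail

SNAP_DIR="/data/weblogs"   # snapshottable directory (already enabled with -allowSnapshot)
KEEP=14                    # hypothetical retention: keep the 14 most recent named snapshots

# Create today's snapshot with a predictable name instead of the default timestamp.
hdfs dfs -createSnapshot "${SNAP_DIR}" "weblogs-$(date +%Y%m%d)"

# Prune named snapshots beyond the retention count (GNU head syntax).
hdfs dfs -ls "${SNAP_DIR}/.snapshot" | awk '{print $NF}' | grep 'weblogs-' \
  | sort | head -n -"${KEEP}" \
  | while read -r snap_path; do
      hdfs dfs -deleteSnapshot "${SNAP_DIR}" "$(basename "${snap_path}")"
    done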
02-16-2017
05:43 AM
3 Kudos
The new year brings new innovation and collaborative efforts. Various teams from the Apache community have been working hard for the last eighteen months to bring the EZ button to Apache Hadoop technology and the Data Lake. In the coming months, we will publish a series of blogs introducing our Data Lake 3.0 architecture and highlighting our innovations within Apache Hadoop core and its related technologies.

The What

You have probably heard of the deep learning powered cucumber sorter from Makoto Koike, a Japanese farmer. On their cucumber farm, Makoto's mother spends up to eight hours per day classifying cucumbers into different classes. Makoto is a trained embedded systems designer but not a trained machine learning engineer. He leveraged TensorFlow, a deep learning framework, with minor configuration to automate his mom's complex art of cucumber sorting so that they can focus more on cucumber farming instead. This simple yet powerful example mirrors the journey we have embarked on with our valued enterprise customers: reduce the time to deployment and insight from days to minutes, while reducing the Total Cost of Ownership (TCO) by 2x.

Instead of a component-centric approach, we envision an application-centric Data Lake 3.0. Looking back, Data Lake 1.0 was a single-use system for batch applications, and Data Lake 2.0 was a multi-use platform for batch, interactive, online and streaming components. In Data Lake 3.0, we want to deploy pre-packaged applications with minor customizations, and the focus will shift from platform management to solving business problems.

The Why

We begin with a few real-world problems, ranging from simple to complex. The common threads behind the Data Lake 3.0 architecture are: reduce the time to deployment; reduce the time to insight; and reduce the TCO of a petabyte (PB) scale Hadoop infrastructure, while increasing utilization of the cluster with additional workloads.

- The customer wants to empower its dev-ops tenants to spin up a logical cluster in minutes instead of days, with the tenants sharing a common set of servers yet using their own version of Hortonworks Data Platform (HDP). The customer also wants to dynamically allocate compute and memory resources between its globally dispersed, follow-the-sun tenants.
- The customer has a standard procedure to upgrade the underlying production Hadoop infrastructure less frequently, but wants agility and a faster cadence in the application layer, possibly running various versions (i.e. dev, test and production) of each application side by side.
- The enterprise customer wants to move towards a business-value focused audience (versus selling to an infrastructure focused audience) and wants the ability to sell pre-assembled big data applications (such as ones focused on cyber security or the Internet of Things) with minor customization effort. Similar to the app store of a smartphone operating system, the customer wants a hub where its end consumers can download the big data applications.
- The customer is deploying expensive hardware resources like GPUs and FPGAs for deep learning and wants to share them as a cluster-wide resource pool, with network and IO level SLA guidance per tenant, to improve both application performance and cluster utilization.
- The customer has a corporate mandate to archive data for five years instead of one and needs the data lake to provide 2x the present storage efficiency, without sacrificing the ability to query the data in seconds.
- The customer is running business critical (aka Tier 1) applications on Hadoop infrastructure and requires a disaster recovery and business continuity strategy so that data is available in minutes or hours, should the production site go down.

The How

Our Data Lake 3.0 vision requires us to execute on a complex set of machinery under the hood. While not an exhaustive list, the following sections provide a high-level overview of the capabilities; we are setting the stage with this introductory blog.

Application Assemblies: A baseline set of services running on bare metal facilitates running dockerized services for a longer duration. We can leverage the benefits of Docker packaging and distribution, along with the isolation. We can cut down the "time to deployment" from days to minutes and enable use cases such as running multiple versions of applications side by side; running multiple Hortonworks Data Platform (HDP) clusters logically on a single data lake; and running use-case focused, data-intensive micro-services that we refer to as "Assemblies".

Storage Enhancements: Naturally, we store the datasets in a single Hadoop data lake to increase analytics efficacy and reduce silos, while providing multiple logical application-centric services on top. Data needs to be kept for many years in an active-archive fashion. Depending on the access pattern and temperature, the data needs to sit on both fast (solid state drive) and slow (hard drive) media. This is where Reed-Solomon based erasure coding plays a pivotal role in reducing the storage overhead by 2x (versus the existing 3-replica approach), especially for cold storage. In the future, we intend to provide an "auto-tiering" mechanism to move the data between hot and cold tiers of media automagically. Liberated from the storage overhead and TCO burden, customers can now retain data for many years. Features such as a three-NameNode configuration make sure that the administrator has a large servicing window just in case the active NameNode goes down on a Saturday night.
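To make the 2x figure concrete: with a Reed-Solomon (6,3) scheme, every 6 data blocks are stored alongside 3 parity blocks, i.e. 1.5x raw storage with comparable fault tolerance, versus 3x for triple replication. As a hedged illustration only, the commands below sketch how a cold directory could be switched to erasure coding using the CLI that later shipped with Apache Hadoop 3.0; the path is hypothetical and exact policy names can differ by release.

# List the erasure coding policies known to the cluster.
hdfs ec -listPolicies

# Apply a Reed-Solomon (6 data, 3 parity) policy to a hypothetical cold-archive directory.
# Files written under this path afterwards are erasure coded instead of 3x replicated.
hdfs ec -setPolicy -path /data/weblogs/archive -policy RS-6-3-1024k

# Confirm which policy is in effect on the directory.
hdfs ec -getPolicy -path /data/weblogs/archive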
Resource Isolation & Sharing: Compute-intensive analytics such as deep learning require not only a large compute pool, but also a fast and expensive processing pool made of Graphics Processing Units (GPUs) in tandem, to cut the time to insight from months to days. We intend to provide a resource vector attribute that can be mapped to cluster-wide GPU resources, so a customer does not have to dedicate a GPU node to a single tenant or workload. In addition to providing CPU and memory level isolation, we will provide network and IO level isolation between tenants and facilitate dynamic allocation of resources.
The Road Ahead

At Hortonworks, we are incredibly lucky to be guided by many of the world's advanced analytics users, representing a wide set of verticals, in our customer advisory and briefing meetings. Based on their invaluable input, we are on an exciting journey to supercharge Apache Hadoop. Our trip will have many legs, but 2017 is going to be an exciting year for delivering on many of our promises. If you have made it this far, I encourage you to follow this blog series as we continue to provide more detailed updates from our rockstar technology leaders. I hope you enjoy the demo video that captures a glimpse of 2017! Please contact us if you are interested in a limited early access.
02-06-2017
06:06 PM
I echo what has been said already. From a Hortonworks Product Mgmt perspective, I'm monitoring this feature (no firm decision yet). The current state is that the Federation code is in our HDP release; however, we don't yet have the Ambari or Hive support needed for Hortonworks to officially support this feature. We are getting some queries from customers along the lines of the use case that you mentioned, and we need critical mass to move the priority. If any of your customers need it, please feel free to respond to any of our threads with the use case (plus the desired Ambari and Hive support).
09-20-2016
06:41 AM
3 Kudos
Hi Greg, I will provide a high-level response, assuming you are referring to a cluster hosted in AWS (the intro was covered in the following blog, and more detailed blogs will follow). http://hortonworks.com/blog/making-elephant-fly-cloud/ Following is the high-level deployment scenario of a cluster in AWS. The Hortonworks cloud engineering team has made improvements in the S3 connector, the ORC layer, and Hive, available in our cloud distribution HDC. During a Hive benchmark test, we saw about a 2.5x performance improvement on average, and 2-14x across individual queries (vs. vanilla HDP on AWS). HDFS on EBS is used for intermediate data, while S3 is used for persistent storage. We are also enabling LLAP to cache columnar data sitting on S3 in order to further improve query performance on S3. Please stay tuned for the rest of the blog series (do remind us if you don't see them posted soon).
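As a purely illustrative sketch of the "S3 for persistent storage, HDFS/EBS for intermediate data" split described above: the table, bucket name and connection string below are hypothetical, and this assumes a working S3A connector with credentials already configured; it is not HDC-specific tuning.

# Hypothetical example: the persistent ORC table lives on S3 via the s3a:// connector,
# while Hive continues to use cluster-local HDFS/EBS for scratch and intermediate data.
beeline -u "jdbc:hive2://localhost:10000/default" -e "
  CREATE EXTERNAL TABLE weblogs_orc (
    ts      TIMESTAMP,
    host    STRING,
    request STRING,
    status  INT
  )
  STORED AS ORC
  LOCATION 's3a://my-example-bucket/warehouse/weblogs_orc';
"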
09-07-2016
12:00 AM
1 Kudo
I see that Sowmya already answered. Yes, we can specify S3 as the source/destination cluster(s) with paths (we support Azure as well). Here is a Falcon screenshot (attachment: falcon1.png).
09-06-2016
11:43 PM
3 Kudos
Responding from Hortonworks Product Mgmt: currently, we only support native HDFS clusters as the source/destination in Falcon (in addition to S3/Azure). There is no support for Hadoop-compatible file systems (such as EMC ECS), though we are getting requests from various channels. This will be explored as a future item, though we have yet to arrive at a timeline.
07-26-2016
12:17 AM
2 Kudos
As the Apache Hadoop project completed a decade, I will start with a fun fact. Hadoop is not an acronym: it was named after a toy elephant belonging to the son of one of its creators, and it is now the fastest growing segment of the data management and analytics market! The data management and analytics market is projected to grow from $69.6B in 2015 to $132B in 2020, and the Hadoop segment is forecast to grow 2.7 times faster than the overall data management and analytics market (according to a recent analyst report).

Though Apache Hadoop started as a single system for batch workloads, it is becoming the multi-use data platform for batch, interactive, online and streaming workloads. At the core of the Hadoop ecosystem, we have the Apache Hadoop Distributed File System (HDFS) to reliably store data and Apache Hadoop YARN (Yet Another Resource Negotiator) to arbitrate the way applications use Hadoop system resources. While this post is focused on Apache HDFS, I recommend Vinod Kumar Vavilapalli's blog covering the latest in Apache Hadoop YARN.

Now, for many of you attending Hadoop Summit San Jose 2016, I want to capture some of the high-level themes behind the Apache HDFS focused sessions: mixed workload support, enterprise hardening (supportability/multi-tenancy), storage efficiency, geo-dispersed clusters, and cloud. Data volume has been growing at an unprecedented rate as users retain more data, for longer, to drive insights. Active archives at petabyte scale are becoming common, and users want logical separation for compliance and security reasons. Users expect reliability and ease of supportability from Apache HDFS, just as they would from any enterprise-grade storage platform. There is a great deal of interest in retaining data efficiently without incurring a large storage overhead. For disaster recovery and various other reasons, many Hadoop clusters are now spread across geographies. In addition to batch oriented large sequential writes, we are seeing small random files coming to the Apache Hadoop file system. Last but not least, the public cloud is on everyone's mind, and there are activities in the community to seamlessly plug in cloud storage as an extension to the Hadoop ecosystem. You might find the following sessions interesting; some are presented by luminaries from the Apache Hadoop community of committers and contributors.

Evolving HDFS to a Generalized Distributed Storage Subsystem
By Sanjay Radia, Hortonworks and Jitendra Pandey, Hortonworks
We are evolving HDFS to a distributed storage system that will support not just a distributed file system, but other storage services. We plan to evolve the Datanodes' fault-tolerant block storage layer to a generalized subsystem over which to build other storage services such as HDFS and an object store.

HDFS Analysis for Small Files
By Rohit Jangid, Expedia and Raman Goyal, Expedia
At Expedia, multiple business teams run ETL jobs which push their data to HDFS. With such enormous usage, cluster scalability and job performance are crucial. This project aimed at scanning the cluster on a weekly basis and cataloging these small files, their location, and which team they belong to, and tracking their growth over time.

HDFS: Optimization, Stabilization and Supportability
By Chris Nauroth, Hortonworks and Arpit Agarwal, Hortonworks
Performance and stability of HDFS are crucial to the correct functioning of applications at higher layers in the Hadoop stack. This session is a technical deep dive into recent enhancements committed to HDFS by the entire Apache contributor community.

Toward Better Multi-Tenancy Support from HDFS
By Xiaoyu Yao, Hortonworks
Hadoop, as a central enterprise data hub, naturally demands multi-tenancy. In this talk, we will explore existing multi-tenancy features, including their use cases and limitations, and ongoing work to provide better multi-tenancy support for the Hadoop ecosystem at the HDFS layer, such as effective NameNode throttling and DataNode and YARN QoS integration.

Debunking the Myths of HDFS Erasure Coding Performance
By Zhe Zhang, LinkedIn and Uma Maheswara Rao Gangumalla, Intel
Ever since its creation, HDFS has been relying on data replication to shield against most failure scenarios. However, with the explosive growth in data volume, replication is getting quite expensive. Erasure coding (EC) uses far less storage space while still providing the same level of fault tolerance. In this talk, we will present the first-ever performance study of the new HDFS erasure coding feature.

Cross-DC Fault-Tolerant ViewFileSystem at Twitter
By Gera Shegalov, Twitter and Ming Ma, Twitter
Twitter stores hundreds of petabytes in multiple datacenters, on multiple Hadoop clusters. ViewFileSystem makes the interaction with our HDFS infrastructure as simple as a single namespace spanning all datacenters and clusters.

Hadoop & Cloud Storage: Object Store Integration in Production
By Chris Nauroth, Hortonworks and Rajesh Balamohan, Hortonworks
Today's typical Apache Hadoop deployments use HDFS for persistent, fault-tolerant storage of big data files. However, recent emerging architectural patterns increasingly rely on cloud object storage such as S3, Azure Blob Store and GCS, which are designed for cost-efficiency, scalability and geographic distribution. This session explores the challenges around cloud object storage and presents recent work to address them in a comprehensive effort.

HDFS Tiered Storage
By Chris Douglas, Microsoft and Virajith Jalaparti, Microsoft
HDFS also stores transient and operational data in cloud offerings such as Azure HDInsight and Amazon EMR. In these settings, but also in more traditional, on-premise deployments, applications often manage data stored in multiple filesystems, each with unique traits. Building on existing heterogeneous storage support contributed to Apache Hadoop 2.3 and expanded in 2.7, we embed a tiered storage architecture in HDFS to work with external stores.
07-20-2016
09:25 PM
I will give you a qualitative answer: Ambari (UI) uses WebHDFS, which is designed for scale and performance (vs. HttpFS). In the future, we will also look into enabling WebHDFS to seamlessly handle NameNode failover scenarios, so that apps depending on WebHDFS do not have to keep track of the active NameNode themselves.
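For readers who have not used it, here is a minimal, hedged illustration of WebHDFS: it is a REST API served by the NameNode (with DataNodes handling the actual data transfer), so a plain HTTP client can drive it. The host, file name and user below are placeholders; the default NameNode HTTP port on HDP 2.x is typically 50070, and secure clusters additionally require SPNEGO or delegation tokens.

# List a directory over WebHDFS (metadata operations are answered by the NameNode itself).
# namenode.example.com, user.name=web are placeholder values.
curl -s "http://namenode.example.com:50070/webhdfs/v1/data/weblogs?op=LISTSTATUS&user.name=web"

# Read a (hypothetical) file; the NameNode replies with a 307 redirect to a DataNode, which -L follows.
curl -s -L "http://namenode.example.com:50070/webhdfs/v1/data/weblogs/20130901/access.log?op=OPEN&user.name=web"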