Community Articles

rbiswas1 · 07-06-2016

Before we move on to the more popular DR/HA strategies for Ambari, Hive, and HBase, and to tools such as Falcon, snapshots, and distcp, let us take a quick tour of tiered storage in Hadoop.

Link to Series 1:

https://community.hortonworks.com/content/kbentry/43525/disaster-recovery-and-backup-best-practices-...

The concept of tiered storage revolves around decoupling the storage of data from the compute that operates on that data.

Let us think about it in a more general way:

You have a data lake storing five years’ worth of data, amounting to a petabyte.

You are observing the following scenarios in your cluster:

Your storage has become heterogeneous over time: you have added newer-generation racks and machines, and some of them perform better than others (less dense, faster-spinning disks, newer processors).
The last 6 months of data is accessed far more frequently (about 90% of the time), while the remaining 4.5 years of data accounts for only about 10% of accesses.

This is where tiered storage comes into play.

What we are saying here is that as your data ages, it goes from hot to warm to cold (labelled in terms of frequency of access). There will be some exceptions to this, for example look-up tables.
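As a rough illustration of that labelling, the sketch below maps the age of a dataset to a tier. The function name and both cut-offs are assumptions made for the example: 180 days loosely mirrors the six-month window in the scenario above, and the 365-day warm boundary is purely hypothetical. Real policies would be driven by observed access patterns, and exceptions such as look-up tables would be pinned to the hot tier regardless of age.

```shell
# Sketch only: label data as HOT, WARM or COLD from its age in days.
# tier_for_age and its default thresholds are illustrative, not part of HDFS.
tier_for_age() {
  age_days="$1"
  hot_days="${2:-180}"    # roughly the six-month window from the scenario
  warm_days="${3:-365}"   # hypothetical cut-off, not taken from the article
  if [ "$age_days" -le "$hot_days" ]; then
    echo HOT
  elif [ "$age_days" -le "$warm_days" ]; then
    echo WARM
  else
    echo COLD
  fi
}

tier_for_age 30     # prints HOT
tier_for_age 300    # prints WARM
tier_for_age 1500   # prints COLD
```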

A simple way to increase the overall performance of your data lake is to store all 3 replicas of the recent data (if possible) on the newer machines, while storing only 1 or 0 replicas of the older data on the older, lower-performing machines. These older machines can also be used to back up your configuration/setup files.

You can tag each DataNode storage location as ARCHIVE, DISK, RAM_DISK, etc. to indicate its storage tier.

The following storage policies can be set up:

Hot - for both storage and compute. The data that is popular and still being used for processing will stay in this policy. When a block is hot, all replicas are stored in DISK.
Cold - only for storage with limited compute. The data that is no longer being used, or data that needs to be archived is moved from hot storage to cold storage. When a block is cold, all replicas are stored in ARCHIVE.
Warm - partially hot and partially cold. When a block is warm, some of its replicas are stored in DISK and the remaining replicas are stored in ARCHIVE.
All_SSD - for storing all replicas in SSD.
One_SSD - for storing one of the replicas in SSD. The remaining replicas are stored in DISK.
Lazy_Persist - for writing blocks with single replica in memory. The replica is first written in RAM_DISK and then it is lazily persisted in DISK.
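To make the policy list concrete, here is a small sketch that prints which storage type each replica of a block would land on. The function name is made up for illustration. For Warm and Lazy_Persist the exact split (one replica on the first-named type, the rest on the second) is an assumption taken from the placement table in the Apache Hadoop archival storage documentation; the list above only says "some" replicas for Warm.

```shell
# Sketch only: print the storage type used for each replica under a policy.
# replica_placement is an illustrative helper, not an HDFS command.
replica_placement() {
  policy="$1"
  replicas="${2:-3}"
  case "$policy" in
    HOT)          first=DISK;     rest=DISK ;;
    COLD)         first=ARCHIVE;  rest=ARCHIVE ;;
    WARM)         first=DISK;     rest=ARCHIVE ;;   # assumed: 1 on DISK, rest on ARCHIVE
    ALL_SSD)      first=SSD;      rest=SSD ;;
    ONE_SSD)      first=SSD;      rest=DISK ;;
    LAZY_PERSIST) first=RAM_DISK; rest=DISK ;;      # assumed: 1 in RAM_DISK, rest on DISK
    *) echo "unknown policy: $policy" >&2; return 1 ;;
  esac
  out=""
  i=1
  while [ "$i" -le "$replicas" ]; do
    if [ "$i" -eq 1 ]; then out="$first"; else out="$out $rest"; fi
    i=$((i + 1))
  done
  echo "$out"
}

replica_placement WARM 3      # prints: DISK ARCHIVE ARCHIVE
replica_placement ONE_SSD 3   # prints: SSD DISK DISK
```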

You need to configure the following properties:

dfs.storage.policy.enabled - for enabling/disabling the storage policy feature. The default value is true.
dfs.datanode.data.dir - on each data node, the comma-separated storage locations should be tagged with their storage types. This allows storage policies to place the blocks on different storage types according to policy.
For example:

A datanode storage location /grid/dn/disk0 on DISK should be configured with [DISK]file:///grid/dn/disk0
A datanode storage location /grid/dn/ssd0 on SSD should be configured with [SSD]file:///grid/dn/ssd0
A datanode storage location /grid/dn/archive0 on ARCHIVE should be configured with [ARCHIVE]file:///grid/dn/archive0
A datanode storage location /grid/dn/ram0 on RAM_DISK should be configured with [RAM_DISK]file:///grid/dn/ram0

The default storage type of a datanode storage location will be DISK if it does not have a storage type tagged explicitly.
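The tagging convention can be checked with a few lines of shell. The sketch below takes a dfs.datanode.data.dir value and prints the storage type of each entry, falling back to DISK for untagged locations as described above. The helper name is hypothetical, and the parsing is deliberately simple (it does not trim whitespace around entries).

```shell
# Sketch only: print "<STORAGE_TYPE> <uri>" for each comma-separated entry
# of a dfs.datanode.data.dir value. list_storage_types is an illustrative name.
list_storage_types() {
  printf '%s\n' "$1" | tr ',' '\n' | while IFS= read -r entry; do
    [ -n "$entry" ] || continue
    case "$entry" in
      \[*\]*)
        type="${entry#?}"      # drop the leading '['
        type="${type%%]*}"     # keep the text before the first ']'
        uri="${entry#*]}"      # keep the text after the first ']'
        ;;
      *)
        type="DISK"            # untagged locations default to DISK
        uri="$entry"
        ;;
    esac
    printf '%s %s\n' "$type" "$uri"
  done
}

list_storage_types '[SSD]file:///grid/dn/ssd0,file:///grid/dn/disk0,[ARCHIVE]file:///grid/dn/archive0'
# prints:
# SSD file:///grid/dn/ssd0
# DISK file:///grid/dn/disk0
# ARCHIVE file:///grid/dn/archive0
```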

Storage Policies can be enforced during file creation, and at any point during the lifetime of the file. For Storage Policies that have changed during the lifetime of the file, HDFS introduces a new tool called Mover that can be run periodically to migrate all files across the cluster to correct Storage Types based on their Storage policies.
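A minimal sketch of that workflow: change the policy on a path, then run Mover over the same path so that existing blocks are relocated. The path, the helper names and the DRY_RUN switch are assumptions for illustration. The sketch uses the dfsadmin form of the command that this article uses; newer Hadoop releases expose the same operation as hdfs storagepolicies -setStoragePolicy -path <path> -policy <policy>, so check which form your version supports.

```shell
# Sketch only: re-tier a path by setting its policy and then running Mover.
# With DRY_RUN=1 the commands are printed instead of executed.
run_hdfs() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "hdfs $*"
  else
    hdfs "$@"
  fi
}

retier_path() {
  path="$1"
  policy="$2"
  # Tag the path with the new storage policy.
  run_hdfs dfsadmin -setStoragePolicy "$path" "$policy" || return 1
  # Ask Mover to migrate existing blocks under the path to matching storage types.
  run_hdfs mover -p "$path"
}

# Dry run against a hypothetical directory; unset DRY_RUN to run for real.
DRY_RUN=1
retier_path /data/logs/2012 COLD
# prints:
# hdfs dfsadmin -setStoragePolicy /data/logs/2012 COLD
# hdfs mover -p /data/logs/2012
```

Restricting Mover to the paths whose policy changed keeps the scan short; in practice it would be scheduled periodically, as noted above.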

If you want to read more about it, please refer to these excellent articles:

http://hortonworks.com/blog/heterogeneous-storages-hdfs/

http://hortonworks.com/blog/heterogeneous-storage-policies-hdp-2-2/

http://www.slideshare.net/Hadoop_Summit/reduce-storage-costs-by-5x-using-the-new-hdfs-tiered-storage...

And the documentation:

https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_hdfs_admin_tools/content/archival_storag...

I want to end this article with some discussion on the In-Memory storage tier:

For applications that need to write data that are temporary or can be regenerated, memory (RAM) can be an alternate storage medium that provides low latency for reads and writes. Since memory is a volatile storage medium, data written to the memory tier will be asynchronously persisted to disk.

HDP introduces the ‘RAM_DISK’ Storage Type and ‘LAZY_PERSIST’ Storage Policy. To set up memory as storage, follow the steps below:

1. Shut Down the Data Node

2. Mount a Portion of Data Node Memory for HDFS

To use Data Node memory as storage, you must first mount a portion of the Data Node memory for use by HDFS.

For example, you would use the following commands to allocate 2GB of memory for HDFS storage:

sudo mkdir -p /mnt/hdfsramdisk
sudo mount -t tmpfs -o size=2048m tmpfs /mnt/hdfsramdisk
sudo mkdir -p /usr/lib/hadoop-hdfs

3. Assign the RAM_DISK Storage Type and Enable Short-Circuit Reads

Edit the following properties in the /etc/hadoop/conf/hdfs-site.xml file to assign the RAM_DISK storage type to DataNodes and enable short-circuit reads.

The dfs.datanode.data.dir property (dfs.data.dir is its older, deprecated name) determines where on the local filesystem a DataNode should store its blocks. To specify a DataNode storage location as RAM_DISK storage, insert [RAM_DISK] at the beginning of the local file system mount path and add it to this property.
To enable short-circuit reads, set the value of dfs.client.read.shortcircuit to true.

For example:

<property>
  <name>dfs.data.dir</name>
  <value>file:///grid/3/aa/hdfs/data/,[RAM_DISK]file:///mnt/hdfsramdisk/</value>
</property>
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<property>
  <name>dfs.domain.socket.path</name>
  <value>/var/lib/hadoop-hdfs/dn_socket</value>
</property>
<property>
  <name>dfs.checksum.type</name>
  <value>NULL</value>
</property>

4. Set the LAZY_PERSIST Storage Policy on Files or Directories

Set a storage policy on a file or a directory.

Example:

hdfs dfsadmin -setStoragePolicy /memory1 LAZY_PERSIST

5. Start the Data Node

More to come in series 3.

gwhiteford · 11-13-2017

Has series 3 been posted?

tom_spiggle · 11-28-2017

dale_preston · 04-10-2018

Will there be a part 3 of this? So far a good appetizer but no meat yet.

Disaster recovery and Backup best practices in a typical Hadoop Cluster: Series 2 Introduction to Tiered Storage

Apache Falcon

Apache Hadoop

Apache HBase

Apache Hive

HDFS

Hortonworks Data Platform (HDP)
