Created on 07-06-2016 01:11 PM
Before we move on to the more popular DR/HA strategies for Ambari, Hive and HBase, and to using Falcon, snapshots, distcp etc., let us take a quick tour of tiered storage in Hadoop.
Link to Series1:
Let us think about it in a more general way:
You have a data lake and you are storing five years’ worth of data, amounting to a petabyte.
You are observing the following scenarios in your cluster:
This is where tiered storage comes into play.
What we are saying here is that as your data ages, it goes from hot to warm to cold (labelled in terms of frequency of access). There will be some exceptions to this, for example look-up tables.
A simple way to increase the overall performance of your data lake would be to store all 3 replicas of the recent, frequently accessed data (if possible) on the newer machines, while storing only 1 or 0 replicas of the older data on the older, lower-performing machines. These older machines can also be used to back up your configuration/setup files.
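To make the hot/warm/cold labelling concrete, here is a minimal sketch that maps the age of a dataset to a tier name. The 90-day and 365-day cut-offs and the function name `tier_for_age` are assumptions for illustration only, not recommendations. HOT, WARM and COLD also happen to be names of built-in HDFS storage policies. Exceptions such as look-up tables would need an explicit override.

```shell
#!/bin/sh
# Hypothetical thresholds: tune these to your own access patterns.
HOT_MAX_DAYS=90
WARM_MAX_DAYS=365

# Print the tier name suggested for data that is age_days old.
tier_for_age() {
  age_days=$1
  if [ "$age_days" -le "$HOT_MAX_DAYS" ]; then
    echo HOT
  elif [ "$age_days" -le "$WARM_MAX_DAYS" ]; then
    echo WARM
  else
    echo COLD
  fi
}

tier_for_age 30     # prints HOT
tier_for_age 200    # prints WARM
tier_for_age 1500   # prints COLD
```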
You can tag each data node storage location as ARCHIVE, DISK, RAM_DISK etc. to indicate its storage tier.
The following storage policies can be set up:
You need to configure the following properties:
The default storage type of a datanode storage location will be DISK if it does not have a storage type tagged explicitly.
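The tagging convention can be illustrated with a small sketch that reads a comma-separated data directory value and reports the storage type of each entry, falling back to DISK when no tag is present. The function names and the `/grid/...` paths are hypothetical and only serve the example.

```shell
#!/bin/sh
# Print the storage type of a single data directory entry.
# An entry looks like "[ARCHIVE]file:///path" or, untagged, "file:///path".
storage_type_of() {
  case "$1" in
    \[*\]*)
      tag=${1%%\]*}                 # text before the first "]", e.g. "[ARCHIVE"
      printf '%s\n' "${tag#\[}"     # drop the leading "["
      ;;
    *)
      printf 'DISK\n'               # untagged locations default to DISK
      ;;
  esac
}

# Print "<type> <entry>" for every entry in a comma-separated list.
list_storage_types() {
  rest=$1
  while [ -n "$rest" ]; do
    entry=${rest%%,*}
    printf '%s %s\n' "$(storage_type_of "$entry")" "$entry"
    case "$rest" in
      *,*) rest=${rest#*,} ;;
      *)   rest= ;;
    esac
  done
}

# Hypothetical value mixing tagged and untagged locations.
list_storage_types '[ARCHIVE]file:///grid/1/hdfs/archive,file:///grid/2/hdfs/data'
```

The second entry carries no tag, so it is reported as DISK.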
If you want to read more about it, please refer to these excellent articles:
And the documentation:
I want to end this article with some discussion on the In-Memory storage tier:
For applications that need to write data that are temporary or can be regenerated, memory (RAM) can be an alternate storage medium that provides low latency for reads and writes. Since memory is a volatile storage medium, data written to the memory tier will be asynchronously persisted to disk.
HDP introduces the ‘RAM_DISK’ storage type and the ‘LAZY_PERSIST’ storage policy. To set up memory as storage, follow the steps below:
1. Shut Down the Data Node
2. Mount a Portion of Data Node Memory for HDFS
To use Data Node memory as storage, you must first mount a portion of the Data Node memory for use by HDFS.
For example, you would use the following commands to allocate 2GB of memory for HDFS storage:
sudo mkdir -p /mnt/hdfsramdisk
sudo mount -t tmpfs -o size=2048m tmpfs /mnt/hdfsramdisk
sudo mkdir -p /usr/lib/hadoop-hdfs
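Before restarting the Data Node it can be worth confirming that the mount really is a tmpfs. The sketch below is one way to check on Linux by reading `/proc/mounts`; the function name `is_tmpfs_mount` and its optional second argument are assumptions made for this example.

```shell
#!/bin/sh
# Succeed if mount_point is listed as a tmpfs mount in mounts_file
# (defaults to /proc/mounts, which exists on Linux).
is_tmpfs_mount() {
  mount_point=$1
  mounts_file=${2:-/proc/mounts}
  while read -r _dev mnt fstype _rest; do
    if [ "$mnt" = "$mount_point" ] && [ "$fstype" = "tmpfs" ]; then
      return 0
    fi
  done < "$mounts_file"
  return 1
}

if is_tmpfs_mount /mnt/hdfsramdisk; then
  echo "/mnt/hdfsramdisk is mounted as tmpfs"
else
  echo "/mnt/hdfsramdisk is not a tmpfs mount"
fi
```

Note that a tmpfs mount made this way does not survive a reboot unless it is also added to `/etc/fstab`.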
3. Assign the RAM_DISK Storage Type and Enable Short-Circuit Reads
Edit the following properties in the /etc/hadoop/conf/hdfs-site.xml file to assign the RAM_DISK storage type to DataNodes and enable short-circuit reads.
<property>
  <name>dfs.data.dir</name>
  <value>file:///grid/3/aa/hdfs/data/,[RAM_DISK]file:///mnt/hdfsramdisk/</value>
</property>
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<property>
  <name>dfs.domain.socket.path</name>
  <value>/var/lib/hadoop-hdfs/dn_socket</value>
</property>
<property>
  <name>dfs.checksum.type</name>
  <value>NULL</value>
</property>
4. Set the LAZY_PERSIST Storage Policy on Files or Directories
Set a storage policy on a file or a directory.
hdfs dfsadmin -setStoragePolicy /memory1 LAZY_PERSIST
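Depending on your Hadoop version, the policy commands may live under `hdfs storagepolicies` rather than `hdfs dfsadmin`. As far as I know this is the case from Apache Hadoop 2.7 onward, so please check `hdfs storagepolicies -help` on your cluster. The sketch below uses a hypothetical `run` helper that only prints the commands unless `DRY_RUN=0` is set, so it is safe to try on a machine without HDFS. The path /memory1 is the one from the step above.

```shell
#!/bin/sh
# Dry run by default: print each command instead of executing it.
# Set DRY_RUN=0 to actually run the commands on a cluster node.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Newer syntax for setting the policy on the directory.
run hdfs storagepolicies -setStoragePolicy -path /memory1 -policy LAZY_PERSIST

# Read the policy back to confirm that it was applied.
run hdfs storagepolicies -getStoragePolicy -path /memory1
```

Files already present in the directory keep their existing block placement; the policy applies to data written after it is set.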
5. Start the Data Node
More to come in series 3.