Created on 10-25-2016 05:29 PM
HDFS Snapshots are read-only, point-in-time copies of the file system. Snapshots can be taken on a subtree of the file system or on the entire file system. Snapshots are very efficient because they only capture data that has changed. We can restore data to the state of any previous snapshot. Common use cases for snapshots are data backup and disaster recovery.
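For example, a snapshot can be taken and restored manually with the standard HDFS commands (a minimal sketch; the directory, snapshot, and file names here are placeholders):

# Mark a directory as snapshottable (requires HDFS superuser, e.g. the hdfs user)
hdfs dfsadmin -allowSnapshot /data/dir
# Take a named snapshot; it appears under /data/dir/.snapshot/snap1
hdfs dfs -createSnapshot /data/dir snap1
# Restore a file by copying it back out of the read-only snapshot
hdfs dfs -cp /data/dir/.snapshot/snap1/file.txt /data/dir/file.txt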
HDFS Snapshot Extension:
Falcon supports HDFS snapshot-based replication through the HDFS Snapshot extension. Note that snapshot replication works only from a single source directory to a single target directory.
To perform HDFS Snapshot replication in Falcon, we need to create the source and target cluster entities, and also create the staging and working directories with the appropriate permissions. Use the following steps to accomplish this.
Source Cluster:
hdfs dfs -rm -r /tmp/fs /tmp/fw
hdfs dfs -mkdir -p /tmp/fs
hdfs dfs -chmod 777 /tmp/fs
hdfs dfs -mkdir -p /tmp/fw
hdfs dfs -chmod 755 /tmp/fw
hdfs dfs -chown falcon /tmp/fs
hdfs dfs -chown falcon /tmp/fw
Target Cluster:
hdfs dfs -rm -r /tmp/fs /tmp/fw
hdfs dfs -mkdir -p /tmp/fs
hdfs dfs -chmod 777 /tmp/fs
hdfs dfs -mkdir -p /tmp/fw
hdfs dfs -chmod 755 /tmp/fw
hdfs dfs -chown falcon /tmp/fs
hdfs dfs -chown falcon /tmp/fw
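Before submitting the cluster entities, it can help to verify the ownership and permissions on both clusters (the exact output will vary by environment):

hdfs dfs -ls -d /tmp/fs /tmp/fw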
Cluster Entities:
primaryCluster.xml
<?xml version="1.0" encoding="UTF-8"?>
<cluster xmlns="uri:falcon:cluster:0.1" colo="USWestOregon" description="oregonHadoopCluster" name="primaryCluster">
    <interfaces>
        <interface type="readonly" endpoint="webhdfs://mycluster1:20070" version="0.20.2" />
        <interface type="write" endpoint="hdfs://mycluster1:8020" version="0.20.2" />
        <interface type="execute" endpoint="primaryCluster-12.openstacklocal:8050" version="0.20.2" />
        <interface type="workflow" endpoint="http://primaryCluster-14.openstacklocal:11000/oozie" version="3.1" />
        <interface type="messaging" endpoint="tcp://primaryCluster-9.openstacklocal:61616?daemon=true" version="5.1.6" />
        <interface type="registry" endpoint="thrift://primaryCluster-14.openstacklocal:9083" version="0.11.0" />
    </interfaces>
    <locations>
        <location name="staging" path="/tmp/fs" />
        <location name="temp" path="/tmp" />
        <location name="working" path="/tmp/fw" />
    </locations>
    <ACL owner="ambari-qa" group="users" permission="0755" />
    <properties>
        <property name="dfs.namenode.kerberos.principal" value="nn/_HOST@EXAMPLE.COM" />
        <property name="hive.metastore.kerberos.principal" value="hive/_HOST@EXAMPLE.COM" />
        <property name="hive.metastore.sasl.enabled" value="true" />
        <property name="hadoop.rpc.protection" value="authentication" />
        <property name="hive.metastore.uris" value="thrift://primaryCluster-14.openstacklocal:9083" />
        <property name="hive.server2.uri" value="hive2://primaryCluster-14.openstacklocal:10000" />
    </properties>
</cluster>
falcon entity -submit -type cluster -file primaryCluster.xml   # registers the primaryCluster entity
backupCluster.xml
<?xml version="1.0" encoding="UTF-8"?>
<cluster xmlns="uri:falcon:cluster:0.1" colo="USWestOregon" description="oregonHadoopCluster" name="backupCluster">
    <interfaces>
        <interface type="readonly" endpoint="webhdfs://mycluster2:20070" version="0.20.2" />
        <interface type="write" endpoint="hdfs://mycluster2:8020" version="0.20.2" />
        <interface type="execute" endpoint="backupCluster-5.openstacklocal:8050" version="0.20.2" />
        <interface type="workflow" endpoint="http://backupCluster-6.openstacklocal:11000/oozie" version="3.1" />
        <interface type="messaging" endpoint="tcp://backupCluster-1.openstacklocal:61616" version="5.1.6" />
        <interface type="registry" endpoint="thrift://backupCluster-6.openstacklocal:9083" version="0.11.0" />
    </interfaces>
    <locations>
        <location name="staging" path="/tmp/fs" />
        <location name="temp" path="/tmp" />
        <location name="working" path="/tmp/fw" />
    </locations>
    <ACL owner="ambari-qa" group="users" permission="0755" />
    <properties>
        <property name="dfs.namenode.kerberos.principal" value="nn/_HOST@EXAMPLE.COM" />
        <property name="hive.metastore.kerberos.principal" value="hive/_HOST@EXAMPLE.COM" />
        <property name="hive.metastore.sasl.enabled" value="true" />
        <property name="hadoop.rpc.protection" value="authentication" />
        <property name="hive.metastore.uris" value="thrift://backupCluster-6.openstacklocal:9083" />
        <property name="hive.server2.uri" value="hive2://backupCluster-6.openstacklocal:10000" />
    </properties>
</cluster>
falcon entity -submit -type cluster -file backupCluster.xml   # registers the backupCluster entity
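Once both entities are submitted, we can confirm they are registered with Falcon, for example:

falcon entity -list -type cluster
falcon entity -definition -type cluster -name primaryCluster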
HDFS Snapshot Replication:
Source Cluster: [Create the directory and copy the data]
hdfs dfs -mkdir -p /tmp/falcon/HDFSSnapshot/source
hdfs dfs -put NYSE-2000-2001.tsv /tmp/falcon/HDFSSnapshot/source
Note: you can download the NYSE-2000-2001.tsv file from
https://s3.amazonaws.com/hw-sandbox/tutorial1/NYSE-2000-2001.tsv.gz
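For example, the sample file can be fetched and unpacked locally before the -put, assuming wget and gunzip are available on the node:

wget https://s3.amazonaws.com/hw-sandbox/tutorial1/NYSE-2000-2001.tsv.gz
gunzip NYSE-2000-2001.tsv.gz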
Allow snapshots on the directory:
hdfs dfsadmin -allowSnapshot /tmp/falcon/HDFSSnapshot/source   # run as the hdfs user
hdfs lsSnapshottableDir                                        # run as the ambari-qa user
Target Cluster:
hdfs dfs -mkdir -p /tmp/falcon/HDFSSnapshot/target
hdfs dfsadmin -allowSnapshot /tmp/falcon/HDFSSnapshot/target
hdfs-snapshot.properties
jobName=HDFSSnapshot
jobClusterName=primaryCluster
jobValidityStart=2016-05-09T06:25Z
jobValidityEnd=2016-05-09T08:00Z
jobFrequency=days(1)
sourceCluster=primaryCluster
sourceSnapshotDir=/tmp/falcon/HDFSSnapshot/source
sourceSnapshotRetentionAgeLimit=days(1)
sourceSnapshotRetentionNumber=3
targetCluster=backupCluster
targetSnapshotDir=/tmp/falcon/HDFSSnapshot/target
targetSnapshotRetentionAgeLimit=days(1)
targetSnapshotRetentionNumber=3
jobAclOwner=ambari-qa
jobAclGroup=users
jobAclPermission="0x755"
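Note that jobValidityStart and jobValidityEnd are UTC timestamps and must cover a window in which you want instances scheduled; the values above are only illustrative. A current timestamp in the expected format can be generated with:

date -u '+%Y-%m-%dT%H:%MZ'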
Submit and schedule the job using the properties file:
falcon extension -extensionName hdfs-snapshot-mirroring -submitAndSchedule -file hdfs-snapshot.properties
Using the jobName, we can find the Oozie job it has launched:
falcon extension -instances -jobName HDFSSnapshot
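Since Falcon drives the replication through Oozie, the underlying job can also be inspected directly with the Oozie CLI (a sketch; the Oozie URL is taken from the workflow interface in the cluster entity):

oozie jobs -oozie http://primaryCluster-14.openstacklocal:11000/oozie -jobtype coordinator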
Once the job completes, a snapshot is automatically created on the source, and the snapshot along with the source content is replicated to the target cluster:
Source Cluster HDFS Content:
hdfs dfs -ls -R hdfs://mycluster1:8020/tmp/falcon/HDFSSnapshot/source/
drwxr-xr-x   - ambari-qa hdfs          0 2016-10-25 02:27 hdfs://mycluster1:8020/tmp/falcon/HDFSSnapshot/source/source
-rw-r--r--   3 ambari-qa hdfs   44005963 2016-10-25 02:27 hdfs://mycluster1:8020/tmp/falcon/HDFSSnapshot/source/source/NYSE-2000-2001.tsv
Target Cluster HDFS Content:
hdfs dfs -ls -R hdfs://mycluster2:8020/tmp/falcon/HDFSSnapshot/target/
drwxr-xr-x   - ambari-qa hdfs          0 2016-10-25 02:28 hdfs://mycluster2:8020/tmp/falcon/HDFSSnapshot/target/source
-rw-r--r--   3 ambari-qa hdfs   44005963 2016-10-25 02:28 hdfs://mycluster2:8020/tmp/falcon/HDFSSnapshot/target/source/NYSE-2000-2001.tsv
We can see that the data has been replicated from the source to the target cluster.
Source Snapshot Directory:
hdfs dfs -ls hdfs://mycluster1:8020/tmp/falcon/HDFSSnapshot/source/.snapshot
Found 1 items
drwxr-xr-x   - ambari-qa hdfs          0 2016-10-25 02:27 hdfs://mycluster1:8020/tmp/falcon/HDFSSnapshot/source/.snapshot/falcon-snapshot-HDFSSnapshot-2016-05-09-06-25-1477362461509
Target Snapshot Directory:
hdfs dfs -ls hdfs://mycluster2:8020/tmp/falcon/HDFSSnapshot/target/.snapshot
Found 1 items
drwxr-xr-x   - ambari-qa hdfs          0 2016-10-25 02:28 hdfs://mycluster2:8020/tmp/falcon/HDFSSnapshot/target/.snapshot/falcon-snapshot-HDFSSnapshot-2016-05-09-06-25-1477362461509
We can see that the snapshot has been automatically created on the source and has also been replicated from the source to the target cluster.
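Because the replicated snapshot on the target is an ordinary read-only HDFS snapshot, data can be recovered from it with a plain copy (a sketch; /tmp/restore is a hypothetical destination directory):

hdfs dfs -mkdir -p /tmp/restore
hdfs dfs -cp hdfs://mycluster2:8020/tmp/falcon/HDFSSnapshot/target/.snapshot/falcon-snapshot-HDFSSnapshot-2016-05-09-06-25-1477362461509/source/NYSE-2000-2001.tsv /tmp/restore/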
Created on 10-31-2016 11:17 PM
Thanks for sharing! Whenever there is a change in the source directory, is there a corresponding update on the target directory? @Murali Ramasami
Created on 11-03-2016 03:52 AM
@zhixun he Yes. Whenever there is a change, a snapshot is created on the source, and a Falcon process instance triggers based on the configured frequency.
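As a side note, the hdfs snapshotDiff command shows what changed between two snapshots on the source, and hence what a subsequent replication run would pick up (a sketch; <snap1> and <snap2> are placeholders for actual snapshot names, and "." denotes the current state):

hdfs snapshotDiff /tmp/falcon/HDFSSnapshot/source <snap1> <snap2>
hdfs snapshotDiff /tmp/falcon/HDFSSnapshot/source <snap1> .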
Created on 12-16-2016 12:04 PM
Quick question, curious about your perspective: does it make sense to use Falcon snapshot support to just manage the snapshots for a single cluster, and not necessarily the DR replication aspects?