
This article compares the time taken to recover accidentally deleted data in HDFS in two scenarios:

1. When trash is enabled.

2. When snapshot is enabled.

Data recovery from trash:

When a file is deleted from HDFS, the NameNode metadata is updated to remove the file from its source directory, but the blocks on the DataNodes are not deleted immediately. Instead, the file is moved into the user's .Trash folder, with its original directory path preserved underneath it. The deleted data can be recovered from the trash folder.


1. Existing data in HDFS.

#hadoop fs -ls /tmp/test1.txt
-rw-r--r--   3 hdfs hdfs          4 2017-08-07 23:47 /tmp/test1.txt

2. Deleted data in HDFS.

#hadoop fs -rm /tmp/test1.txt
17/08/07 23:52:13 INFO fs.TrashPolicyDefault: Moved: 'hdfs://vnn/tmp/test1.txt' to trash at: hdfs://vnn/user/hdfs/.Trash/Current/tmp/test1.txt

3. Recovering the deleted data.

#hadoop fs -cp /user/hdfs/.Trash/Current/tmp/test1.txt /tmp/
#hadoop fs -ls /tmp/test1.txt
-rw-r--r--   3 hdfs hdfs          4 2017-08-07 23:57 /tmp/test1.txt
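Since trash is just another HDFS directory, the recovery could also be done with a move instead of a copy, which renames the file in metadata rather than rewriting its blocks. A minimal sketch, reusing the paths from the example above and assuming the file has not yet been purged from .Trash:

```shell
# Restore from trash with a metadata-only rename instead of a block copy.
# Assumes the file was deleted by user "hdfs" and is still in .Trash/Current.
hadoop fs -mv /user/hdfs/.Trash/Current/tmp/test1.txt /tmp/

# Verify the restore.
hadoop fs -ls /tmp/test1.txt
```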

Data recovery from snapshots:

Snapshots are read-only, point-in-time copies of the HDFS file system. Making a directory snapshottable lets you recover from accidental data loss within it.

1. Enabling snapshot.

#hdfs dfsadmin -allowSnapshot /tmp/snapshotdir
Allowing snaphot on /tmp/snapshotdir succeeded

2. Creating a snapshot for the directory.

#hdfs dfs -createSnapshot /tmp/snapshotdir
Created snapshot /tmp/snapshotdir/.snapshot/s20170807-180139.568

3. Contents of the snapshottable folder and of its snapshot.

#hdfs dfs -ls /tmp/snapshotdir/
Found 3 items
-rw-r--r--   3 hdfs hdfs  1083492818 2017-07-31 19:01 /tmp/snapshotdir/oneGB.csv
-rw-r--r--   3 hdfs hdfs 10722068505 2017-08-02 17:19 /tmp/snapshotdir/tenGB.csv

#hdfs dfs -ls /tmp/snapshotdir/.snapshot/s20170807-180139.568
Found 3 items
-rw-r--r--   3 hdfs hdfs  1083492818 2017-07-31 19:01 /tmp/snapshotdir/.snapshot/s20170807-180139.568/oneGB.csv
-rw-r--r--   3 hdfs hdfs 10722068505 2017-08-02 17:19 /tmp/snapshotdir/.snapshot/s20170807-180139.568/tenGB.csv

4. Deleting and recovering lost data.

#hadoop fs -rm /tmp/snapshotdir/oneGB.csv
17/08/07 19:37:46 INFO fs.TrashPolicyDefault: Moved: 'hdfs://vinodnn/tmp/snapshotdir/oneGB.csv' to trash at: hdfs://vinodnn/user/hdfs/.Trash/Current/tmp/snapshotdir/oneGB.csv1502134666492

#hadoop fs -cp /tmp/snapshotdir/.snapshot/s20170807-180139.568/oneGB.csv  /tmp/snapshotdir/
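To confirm that the restored file matches the snapshot version, the HDFS checksums of the two copies could be compared; a sketch, assuming the snapshot name and paths from the steps above:

```shell
# Compare HDFS checksums of the snapshot copy and the restored file;
# matching output indicates an identical restore.
hadoop fs -checksum /tmp/snapshotdir/.snapshot/s20170807-180139.568/oneGB.csv
hadoop fs -checksum /tmp/snapshotdir/oneGB.csv
```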

Both methods above use the hadoop copy command, "hadoop fs -cp <source> <dest>", to recover the data. However, the time taken by the "cp" operation grows with the size of the lost data. One optimization is to use the move command, "hadoop fs -mv <source> <destination>", in place of the copy, since a move is a metadata-only rename and performs far better than a copy. Because snapshot folders are read-only, however, the only operation supported from a snapshot is "copy" (not move); the move optimization applies only to recovery from trash. The following metrics compare the performance of the "copy" operation against "move" for a one GB and a ten GB data file.
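The comparison can be roughly reproduced by timing both operations on the same file. A sketch, reusing the file names from the examples above; note the trailing timestamp that trash appends to the deleted file's name, and that only copy is possible out of a snapshot:

```shell
# Destination directories for the two recovery paths.
hadoop fs -mkdir -p /tmp/restore_cp /tmp/restore_mv

# Time a copy out of the snapshot (the only operation a read-only
# snapshot supports); this rewrites every block of the file.
time hadoop fs -cp /tmp/snapshotdir/.snapshot/s20170807-180139.568/oneGB.csv /tmp/restore_cp/

# Time a move out of trash; this is a metadata-only rename and should
# complete in roughly constant time regardless of file size.
time hadoop fs -mv '/user/hdfs/.Trash/Current/tmp/snapshotdir/oneGB.csv*' /tmp/restore_mv/
```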

Time to recover a file using copy (cp) operations:



Time to recover a file using move (mv) operations:



Hence, we observe that recovering data from trash with a move operation is the more efficient way to handle accidental deletion, whenever trash recovery is still possible.

NOTE: Recovering data from trash is only possible if the trash interval (fs.trash.interval) is configured to give Hadoop admins enough time to detect the data loss and recover it. If not, snapshots are recommended for eventual recovery.
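For reference, the trash interval is set in core-site.xml (or through the cluster management UI). The value below is an illustrative assumption showing a 24-hour retention window, not a recommendation:

```xml
<!-- core-site.xml: keep deleted files in .Trash for 1440 minutes (24 hours)
     before the trash checkpoint permanently removes them. -->
<property>
  <name>fs.trash.interval</name>
  <value>1440</value>
</property>
```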