This article compares how long it takes to recover accidentally deleted data in HDFS under two scenarios:
1. When trash is enabled.
2. When snapshots are enabled.
Data Recovery from trash:
When a file is deleted from HDFS, the Namenode metadata is updated to remove the file from its source directory; however, the blocks on the datanodes are not deleted immediately. Instead, the file is moved into the user's .Trash folder, preserving the directory path from which it was deleted. The deleted data can then be recovered from the trash folder.
#hadoop fs -rm /tmp/snapshotdir/oneGB.csv
17/08/07 19:37:46 INFO fs.TrashPolicyDefault: Moved: 'hdfs://vinodnn/tmp/snapshotdir/oneGB.csv' to trash at: hdfs://vinodnn/user/hdfs/.Trash/Current/tmp/snapshotdir/oneGB.csv1502134666492
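The file can be restored from trash with a move. A minimal sketch, assuming the trash path shown in the log line above (the timestamp suffix is appended by HDFS to avoid name collisions):
#hadoop fs -mv /user/hdfs/.Trash/Current/tmp/snapshotdir/oneGB.csv1502134666492 /tmp/snapshotdir/oneGB.csv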
Data Recovery from snapshot:
A snapshottable directory keeps read-only, point-in-time copies of its contents under a hidden .snapshot folder. A deleted file can be recovered by copying it back from a snapshot:
#hadoop fs -cp /tmp/snapshotdir/.snapshot/s20170807-180139.568/oneGB.csv /tmp/snapshotdir/
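Note that this only works if the directory was made snapshottable and a snapshot was taken before the deletion. A minimal sketch of those prerequisite steps (when no snapshot name is given, HDFS auto-generates one of the form s20170807-180139.568, as seen above):
#hdfs dfsadmin -allowSnapshot /tmp/snapshotdir
#hdfs dfs -createSnapshot /tmp/snapshotdir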
Both methods above use the Hadoop copy command, "hadoop fs -cp <source> <dest>", to recover the data. However, the time taken by the "cp" operation increases with the size of the lost data, since every block has to be rewritten. One optimization is to use the move command, "hadoop fs -mv <source> <destination>", in place of the copy: within HDFS, a move is a metadata-only rename, so it completes quickly regardless of file size. Since snapshot folders are read-only, the only supported recovery operation from a snapshot is "copy" (not move); recovery from trash, as shown above, can use "move". Following are the metrics comparing the performance of the "copy" operation against "move" for one GB and ten GB data files.
Time to recover a file using copy (cp) vs. move (mv) operations, for 1 GB and 10 GB files: [timing tables not reproduced]
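Such timings can be collected by wrapping each command with the shell's time utility; a sketch using the same files as above:
#time hadoop fs -cp /tmp/snapshotdir/.snapshot/s20170807-180139.568/oneGB.csv /tmp/snapshotdir/
#time hadoop fs -mv /user/hdfs/.Trash/Current/tmp/snapshotdir/oneGB.csv1502134666492 /tmp/snapshotdir/oneGB.csv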
Hence, we observe that recovering data from trash with the move operation is the more efficient way to handle accidental data loss, especially for large files.
NOTE: Recovering data from trash is possible only if the trash interval (fs.trash.interval) is configured to give Hadoop admins enough time to detect the data loss and act on it. If not, snapshots are recommended for eventual recovery.
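The trash interval is set in minutes in core-site.xml; a value of 0 disables trash. A minimal sketch (the 1440-minute, i.e. 24-hour, value is an illustrative choice, not a recommendation from this article):
<property>
  <name>fs.trash.interval</name>
  <!-- minutes a trash checkpoint is retained before permanent deletion; 0 disables trash -->
  <value>1440</value>
</property>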