Restoring a file from snapshot if only permissions change

New Contributor

I have to temporarily change the permissions of several files, so I'm planning to take a snapshot before issuing the chmod command.

I know that restoring a file from a snapshot is done with the cp command.

In this case, what happens? Is just the inode restored?

What happens to unmodified files when running

cp <snaproot>/.snapshot/<name>/* <target dir>/

Are those files skipped?

1 ACCEPTED SOLUTION

Expert Contributor

1. Does cp just restore the inode? No. The cp command does not "re-link" the old metadata or inode. It creates brand-new files.

The cp command reads the data blocks from the snapshot and writes them to the target directory as new blocks.

The resulting files have new inode IDs and, unless you pass -p, new timestamps and your current user as the owner.

It is a heavy operation because it physically duplicates the data. HDFS does not deduplicate or hard-link the copied blocks afterwards, so the new copies consume additional space.


2. What happens to unmodified files? The hdfs dfs -cp command treats them like any other file.

It does not perform a "diff" or a check to see if the content is identical.

By default it refuses to overwrite an existing file in the target directory. With -f it reads the file from the snapshot and overwrites the live file even if nothing changed, resulting in unnecessary I/O and network traffic.

3. Are those files skipped? No. Using cp with a wildcard (*) makes the system attempt to copy everything. If a file with the same name exists in the target, the copy of that file fails with a "File exists" error unless you pass -f or delete the target first.
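
As a rough sketch of the cp behaviour described above, the helper below copies a single file back from a snapshot. The directory /data/project, the snapshot name before-chmod, and the function name restore_file are hypothetical placeholders. -f and -p are standard hdfs dfs -cp options, although preserving ownership may require superuser rights.

```shell
#!/usr/bin/env bash
# Hypothetical placeholders: adjust to your snapshottable directory and snapshot name.
SNAPROOT=/data/project
SNAPNAME=before-chmod

# Copy one file (path relative to SNAPROOT) from the snapshot over the live copy.
#   -f  overwrite the target if it already exists (without it, cp reports "File exists")
#   -p  preserve timestamps, ownership and permissions from the snapshot copy
restore_file() {
  local rel=$1
  hdfs dfs -cp -f -p "$SNAPROOT/.snapshot/$SNAPNAME/$rel" "$SNAPROOT/$rel"
}
```

Calling restore_file with a relative path such as reports/2024.csv (also a made-up name) rewrites that one file in full. Nothing is compared beforehand.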


The "Better" Way (Recommended Fix)
If the goal is to restore only metadata, or to update only the files that actually changed, cp is inefficient. Consider the following instead:

A. Use distcp with the -update flag
distcp is much smarter. It compares the source (snapshot) and the target (live) and only copies files that have different sizes or checksums.

hadoop distcp -update -ptag <snaproot>/.snapshot/<name>/ <target_dir>/

-update: Copies a file only if its size or checksum differs from the target's.
-ptag: Preserves permissions (p), timestamps (t), ACLs (a), and group (g).
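
The whole sequence (snapshot, inspect, selective restore) could look roughly like the sketch below. The three function names are invented for illustration, and both arguments are placeholders for your own directory and snapshot name. The underlying commands (createSnapshot, snapshotDiff, distcp) are standard Hadoop CLI tools, but their exact output and behaviour should be checked against your Hadoop version.

```shell
#!/usr/bin/env bash
# Sketch of a snapshot -> change -> selective-restore workflow.
# Arguments: $1 = snapshottable directory, $2 = snapshot name (both hypothetical).

# Take the snapshot before changing anything. The directory must already be
# snapshottable (an administrator runs: hdfs dfsadmin -allowSnapshot <dir>).
take_snapshot() {
  hdfs dfs -createSnapshot "$1" "$2"
}

# List what differs between the snapshot and the current state (".").
list_changes() {
  hdfs snapshotDiff "$1" "$2" .
}

# Copy back only files whose size or checksum differs, preserving
# permissions (p), timestamps (t), ACLs (a) and group (g).
restore_changed() {
  hadoop distcp -update -ptag "$1/.snapshot/$2/" "$1/"
}
```

Because a chmod changes neither the size nor the checksum of a file, it is worth confirming on a small test directory that the preserved attributes are reapplied to files whose data is skipped.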

B. Manual "Restore" (Metadata only)
If you only changed permissions (chmod) and didn't touch the data, the most efficient "restore" isn't a copy at all—it’s simply running chmod again to set them back.

Snapshots are great insurance, but they are most useful for recovering data, not for undoing metadata changes.
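
If the original modes were not written down, the snapshot listing still records them. The sketch below is one possible way to turn that listing into chmod commands. It assumes the usual eight-column output of hdfs dfs -ls -R, handles regular files only (no directories, no sticky bit), and uses invented function names. It only prints commands, so they can be reviewed before anything is run.

```shell
#!/usr/bin/env bash
# Sketch: print the chmod commands that would put back the modes recorded in a snapshot.

# Convert the 9-character symbolic mode shown by `hdfs dfs -ls` (e.g. "rw-r--r--")
# to octal ("644"). The sticky bit is not handled.
sym_to_octal() {
  local sym=$1 out="" i chunk digit
  for i in 0 3 6; do
    chunk=${sym:i:3}
    digit=0
    if [[ ${chunk:0:1} == r ]]; then digit=$((digit + 4)); fi
    if [[ ${chunk:1:1} == w ]]; then digit=$((digit + 2)); fi
    if [[ ${chunk:2:1} == x ]]; then digit=$((digit + 1)); fi
    out+=$digit
  done
  printf '%s\n' "$out"
}

# $1 = snapshottable directory (no trailing slash), $2 = snapshot name.
# Prints one "hdfs dfs -chmod" command per file; nothing is executed.
print_chmod_commands() {
  local snapdir="$1/.snapshot/$2"
  local perms path
  # ls -R columns: perms, replication, owner, group, size, date, time, path
  hdfs dfs -ls -R "$snapdir" | while read -r perms _ _ _ _ _ _ path; do
    [[ $perms == -* ]] || continue          # regular files only
    printf 'hdfs dfs -chmod %s %q\n' "$(sym_to_octal "${perms:1:9}")" "$1${path#"$snapdir"}"
  done
}
```

One way to use it is to redirect the output of print_chmod_commands into a file, review it, and then run that file with bash. Each chmod starts its own client JVM, so for several thousand files this is slow but moves no data.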

New Contributor

Thanks for the suggestion. I will go with distcp because we have a hundred thousand files and "only" several thousand of them must be restored.

Rising Star

@ganzuoni Restoring from the snapshot with cp copies all the files to the target directory: it reads every file in the snapshot and rewrites it in the target, and none are skipped. So be careful about which directory you restore into. If you don't want to restore over the existing file path, restore to a different path.

Also, it is not just an inode-level operation. It is a complete copy that creates new inodes, because the original file inodes are still referenced by the snapshot.