Can I get a not-updated version of a file from HDFS?

I wrote a command-line script that runs the following loop about 15 times on an edge node (in pseudo code):

loop 15 times:
    hdfs dfs -get myfile
    locally do something on myfile
    hdfs dfs -rm myfile
    hdfs dfs -put myfile    // an updated version of myfile on HDFS
endloop

At the put statement the file will be on one of the datanodes in the HDP cluster, and replication will start later by itself. Can someone confirm my hypothesis that HDFS can serve an old "myfile" on the next get statement, from a different datanode than the one where myfile was previously put?
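One way to examine this hypothesis on a live cluster is to look at the file's replication factor and the datanodes holding its blocks right after the put. The following is a minimal sketch, not a definitive diagnostic: the `HDFS_CMD` variable and the `show_replicas` function name are assumptions introduced here for illustration, and the path in the example is hypothetical.

```shell
#!/bin/sh
# HDFS_CMD is an assumption of this sketch: it defaults to the real CLI
# but can be pointed at a stub for dry runs.
HDFS_CMD="${HDFS_CMD:-hdfs}"

# Print the replication factor and the block locations of one HDFS file.
show_replicas() {
    path="$1"
    # %r is the replication factor recorded for the file.
    echo "replication factor: $($HDFS_CMD dfs -stat %r "$path")"
    # fsck lists each block of the file and the datanodes that store it.
    $HDFS_CMD fsck "$path" -files -blocks -locations
}

# Example (hypothetical path):
#   show_replicas /user/me/myfile
```

Running this immediately after the put shows how many replicas the NameNode already knows about at that moment.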

Re: Can I get a not-updated version of a file from HDFS?

I confirm this will never happen.

The thing to look at is the metadata held on the NameNode.

The delete step (hdfs dfs -rm myfile): this removes the file's metadata from the NameNode. Once the metadata is gone, no client can ever see the old file again.

The put step (hdfs dfs -put myfile, an updated version of myfile on HDFS): a new file is written and the NameNode metadata is updated to point to it.

If you are still not convinced, think about performing the same operations on a Linux filesystem and how Linux handles them. There, too, a file is made up of blocks (tracks and sectors), and they may be non-contiguous.
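The claim that a reader always sees the new file can be checked empirically by writing a file and immediately reading it back. The sketch below is one possible way to do that, under stated assumptions: `HDFS_CMD` and `put_and_verify` are names made up for this example, and the check compares bytes with `cmp` rather than relying on any HDFS-side checksum.

```shell
#!/bin/sh
# HDFS_CMD is an assumption of this sketch; it defaults to the real CLI.
HDFS_CMD="${HDFS_CMD:-hdfs}"

# Upload a local file, read it back, and compare the bytes.
# Returns 0 when the copy read from HDFS matches what was just written.
put_and_verify() {
    local_file="$1"
    remote_path="$2"
    check_copy="$(mktemp)" || return 1
    # -get does not overwrite an existing local file by default,
    # so only the reserved name is kept.
    rm -f "$check_copy"

    $HDFS_CMD dfs -put -f "$local_file" "$remote_path" || return 1
    $HDFS_CMD dfs -get "$remote_path" "$check_copy" || return 1

    cmp -s "$local_file" "$check_copy"
    status=$?
    rm -f "$check_copy"
    return "$status"
}

# Example (hypothetical path):
#   put_and_verify ./myfile /user/me/myfile || echo "stale or corrupt read"
```

Calling this inside the update loop would flag any iteration in which the read-back differs from the version just written.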

Re: Can I get a not-updated version of a file from HDFS?

@kgautam Thank you for your swift answer!

Still, I changed the loop above to the pseudo code below, and now everything works correctly. It was a small change to implement, but I want to know for sure whether I can rely on HDFS always serving me the latest version of an updated file.

hdfs dfs -get myfile
loop 15 times:
    fetch myfile locally
    locally do something on myfile
    save myfile locally
    hdfs dfs -put -f myfile    // force put instead of delete & put -- this change alone didn't resolve the issue
endloop
delete local myfile
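The revised loop can be sketched as a runnable shell function. This is only an illustration under assumptions: `HDFS_CMD`, `update_loop`, and `process_file` are hypothetical names, and `process_file` stands in for whatever local work the real script performs. The file is fetched once, modified locally on each pass, and pushed back with a forced put.

```shell
#!/bin/sh
# HDFS_CMD is an assumption of this sketch; it defaults to the real CLI.
HDFS_CMD="${HDFS_CMD:-hdfs}"

# process_file is a hypothetical placeholder for the local work per round.
process_file() {
    echo "updated at $(date)" >> "$1"
}

# Fetch once, then modify the local copy and force-put it every iteration.
update_loop() {
    remote_path="$1"
    iterations="${2:-15}"
    work_dir="$(mktemp -d)" || return 1
    local_copy="$work_dir/myfile"

    $HDFS_CMD dfs -get "$remote_path" "$local_copy" || return 1

    i=1
    while [ "$i" -le "$iterations" ]; do
        process_file "$local_copy" || return 1
        # -f overwrites the existing HDFS file, so no separate -rm is needed.
        $HDFS_CMD dfs -put -f "$local_copy" "$remote_path" || return 1
        i=$((i + 1))
    done

    # Remove the local working copy once all rounds have completed.
    rm -rf "$work_dir"
}

# Example (hypothetical path):
#   update_loop /user/me/myfile 15
```

Because the local copy is the working version throughout, the loop never depends on reading the file back from HDFS between iterations.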

FYI, I think HDFS differs a lot from, for example, a local ext3 disk, where only one OS and one disk controller are involved. HDFS relies on multiple OS installations, multiple disks per OS, and at least one Java program to manage changes in the filesystem.

For those wondering, I'm talking about HDP 2.4.