Created 11-15-2017 07:26 PM
I wrote a command-line script that runs the following loop about 15 times on an edge node (in pseudocode):
loop 15 times:
hdfs dfs get myfile
locally do something on myfile
hdfs dfs delete myfile
hdfs dfs put myfile // an updated version of myfile on HDFS
endloop
At the put statement, the file will be written to one of the datanodes in the HDP cluster. Replication will start later by itself. Can someone confirm my hypothesis that, on the next get statement, HDFS could serve an old "myfile" from a different datanode than the one where myfile was previously put?
Created 11-15-2017 07:35 PM
I confirm this will never happen.
There is a small catch: what about the metadata held on the NameNode?
hdfs dfs delete myfile: This deletes the file's metadata from the NameNode.
Once that metadata is gone, a client can never see the old file again.
hdfs dfs put myfile // an updated version of myfile on HDFS:
A new file has been placed and the NameNode metadata has been updated to point to it.
If you are still not convinced, think about performing the same operations on a Linux filesystem and how Linux handles them. On a Linux filesystem, too, a file is composed of blocks (tracks and sectors, which may be non-contiguous).
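One way to check this empirically is to upload a file, read it back right away, and compare the two copies byte for byte. The following is only a minimal sketch, not a definitive test: the function name verify_roundtrip and its two arguments are hypothetical, and it assumes the hdfs client is on the PATH of the edge node.

```shell
# Sketch: upload a local file, download it again, and compare the copies.
# Usage: verify_roundtrip LOCAL_FILE HDFS_PATH   (both names are hypothetical)
verify_roundtrip() {
  vr_local="$1"
  vr_remote="$2"

  # -f overwrites the destination if it already exists.
  hdfs dfs -put -f "$vr_local" "$vr_remote" || return 1

  # Download into a fresh temporary directory so nothing is overwritten locally.
  vr_tmp=$(mktemp -d) || return 1
  if ! hdfs dfs -get "$vr_remote" "$vr_tmp/copy"; then
    rm -rf "$vr_tmp"
    return 1
  fi

  # cmp -s is silent and exits 0 only when both files are identical.
  if cmp -s "$vr_local" "$vr_tmp/copy"; then
    vr_status=0
    echo "match"
  else
    vr_status=1
    echo "mismatch"
  fi

  rm -rf "$vr_tmp"
  return "$vr_status"
}
```

Calling `verify_roundtrip myfile /some/hdfs/dir/myfile` after each update (the HDFS path here is a placeholder) prints "match" when the downloaded copy equals the local one, and returns a non-zero status on a mismatch or a failed hdfs command.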
Created 11-16-2017 09:18 AM
@kgautam Thank you for your swift answer!
Still... I changed the loop above to the pseudocode below, and now everything works correctly. It is a small change to implement, but I want to know for sure whether I can rely on HDFS always serving me the latest version of an updated file.
hdfs dfs get myfile
loop 15 times
fetch myfile locally
locally do something on myfile
save myfile locally
hdfs dfs put -f myfile // force put instead of delete & put -- this change alone didn't resolve the issue
endloop
delete local myfile
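For readers who want to try the revised loop, here is a rough shell sketch of it. It is an illustration under assumptions rather than the actual script: run_updates and process_file are hypothetical names, process_file only stands in for the real local processing step, and every hdfs call is checked so that a failed command stops the loop instead of passing unnoticed.

```shell
# process_file is a hypothetical placeholder for the local processing step.
process_file() {
  date >> "$1"
}

# Usage: run_updates HDFS_PATH LOCAL_FILE [ITERATIONS]   (hypothetical helper)
run_updates() {
  ru_remote="$1"
  ru_local="$2"
  ru_count="${3:-15}"

  # Fetch the file once; -get fails if the local file already exists.
  hdfs dfs -get "$ru_remote" "$ru_local" || return 1

  ru_i=1
  while [ "$ru_i" -le "$ru_count" ]; do
    process_file "$ru_local" || return 1
    # -f overwrites the existing HDFS copy; stop on the first failed upload.
    hdfs dfs -put -f "$ru_local" "$ru_remote" || return 1
    ru_i=$((ru_i + 1))
  done

  # Remove the local working copy once all updates are uploaded.
  rm -f "$ru_local"
}
```

A call such as `run_updates /some/hdfs/dir/myfile ./myfile` (both paths are placeholders) downloads the file once, then processes and re-uploads it 15 times.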
FYI, I think HDFS differs a lot from a local EXT3 disk, for example: there, only one OS and one disk controller are involved. HDFS relies on multiple OS installations, multiple disks per OS, AND at least one Java program to manage changes in the HDFS filesystem!
For those wondering, I'm talking about HDP 2.4.