HDFS retention

Expert Contributor

How can I identify HDFS file retention? Say I need to determine whether the file /tmp/abc.txt in HDFS has been accessed in the last 90 days, without using the Ranger audit MySQL database.

Super Guru

@Raja Sekhar Chintalapati I have personally not used it but I think what you are looking for is dfs.namenode.accesstime.precision

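The default value is 3600000 milliseconds (1 hour), which means access times are only precise to within an hour. Setting the value to 0 disables access times. Check this link.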

https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml

@Raja Sekhar Chintalapati As far as I know, HDFS does not track a last accessed time (atime). In fact, it is recommended to disable atime when mounting the disks. This is primarily because files are split into blocks and individual blocks will have varying atime values. Also, the overhead of writing the atime would cause a serious performance hit. HDFS does track the last modified date. You can see this in the Ambari Files view or by executing "hdfs dfs -ls </path/to/files/>"

Guru

I believe WebHDFS can retrieve the access time when dfs.namenode.accesstime.precision is > 0, as @mqureshi referenced, though I have not implemented this myself. I cannot add anything about the performance issues that @Scott Shaw raises.

For a file:

curl -i "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=GETFILESTATUS"

For a directory:

curl -i "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=LISTSTATUS"

https://hadoop.apache.org/docs/r1.0.4/webhdfs.html#GETFILESTATUS

https://hadoop.apache.org/docs/r1.0.4/webhdfs.html#LISTSTATUS
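
As a rough sketch of how the GETFILESTATUS response could answer the 90-day question: the FileStatus JSON object carries an accessTime field in milliseconds since the epoch. The function names below are made up for illustration. The sketch assumes an unsecured cluster (no Kerberos, no user.name parameter), and the value is only meaningful if access times are enabled on the NameNode. I have not run this against a live cluster.

```python
import json
import time
import urllib.parse
import urllib.request

MS_PER_DAY = 24 * 60 * 60 * 1000


def fetch_file_status(host, port, path):
    """Fetch the FileStatus object for an HDFS path through WebHDFS.

    host and port are placeholders for the NameNode HTTP address;
    this assumes no authentication is required.
    """
    quoted = urllib.parse.quote(path)
    url = f"http://{host}:{port}/webhdfs/v1{quoted}?op=GETFILESTATUS"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["FileStatus"]


def accessed_within(file_status, days, now_ms=None):
    """Return True or False, or None when no access time is recorded.

    file_status is the dict found under "FileStatus" in the response.
    An accessTime of 0 (or a missing field) is treated as "unknown",
    since it cannot be told apart from access times being disabled.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    access_ms = file_status.get("accessTime", 0)
    if access_ms <= 0:
        return None
    return now_ms - access_ms <= days * MS_PER_DAY
```

With a reachable NameNode, accessed_within(fetch_file_status("<HOST>", <PORT>, "/tmp/abc.txt"), 90) would give the answer for a single file. Keep in mind that the recorded time is only as fine-grained as dfs.namenode.accesstime.precision allows.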

Expert Contributor

Thank you @gkeys, @Scott Shaw, and @mqureshi. I think this would cause a serious performance issue; I definitely do not want to scan 300 TB of data to get the last access time. I'm still researching. If I find anything interesting, I will drop a note here.

Thanks again.

Guru

I think the only scalable way to do this is to read the audit logs in Ranger.