Created 06-20-2016 08:58 PM
Created 06-21-2016 04:11 AM
Nothing I know of out of the box at this moment will get you this. One option is to enable Ranger and plug into Ranger's audit framework (more info here). Basically, when you enable auditing on HDFS you will see who did what and when, which tells you the data sets that are being used. It does not directly tell you which data sets are not being used; that has to be inferred from the Ranger audit logs.
You can also view HDFS space utilization through Zeppelin here.
Another option is to build a custom job: take the HDFS space utilization report (above) as a feed and store it in a table, cross join it with the Ranger audit logs, and store the results in Phoenix. Run this job every hour, and run reports off the resulting table.
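As a rough illustration of the cross-join step, here is a minimal sketch in plain Java, assuming the utilization report has already been parsed into a map of path to bytes and the Ranger audit log into a set of accessed paths (the class and method names, paths, and data shapes here are all hypothetical):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class UnusedDataReport {

    // Join a parsed HDFS utilization snapshot (path -> bytes) against the set
    // of paths seen in the Ranger audit log for the period; any path with no
    // audit activity is flagged as a candidate "unused" data set.
    public static Map<String, Long> unusedPaths(Map<String, Long> utilization,
                                                Set<String> auditedPaths) {
        Map<String, Long> unused = new TreeMap<>();
        for (Map.Entry<String, Long> e : utilization.entrySet()) {
            if (!auditedPaths.contains(e.getKey())) {
                unused.put(e.getKey(), e.getValue());
            }
        }
        return unused;
    }

    public static void main(String[] args) {
        Map<String, Long> utilization = new HashMap<>();
        utilization.put("/data/sales", 1_000_000L);
        utilization.put("/data/archive", 50_000_000L);

        Set<String> audited = new HashSet<>();
        audited.add("/data/sales"); // appeared in the Ranger audit log this hour

        // /data/archive had no audit activity, so it is flagged as unused.
        System.out.println(unusedPaths(utilization, audited));
    }
}
```

In the real job the inputs would come from the stored utilization table and the audit logs, and the result would be upserted into Phoenix rather than printed.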
In summary, I am not aware of any out-of-the-box solution.
Created 06-21-2016 09:05 AM
@milind pandit
You can get access time using FileStatus#getAccessTime() as follows:
import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsAtime {
    private static SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ssZ");

    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        printFileStatus(fs, new Path(args[0]));
        fs.close();
    }

    // Recursively print access time, modification time, and path for each file.
    private static void printFileStatus(FileSystem fs, Path root) throws IOException {
        FileStatus[] fss = fs.listStatus(root);
        for (FileStatus status : fss) {
            if (status.isDirectory()) {
                printFileStatus(fs, status.getPath());
            } else {
                System.out.println(
                    sdf.format(new Date(status.getAccessTime())) + " "
                    + sdf.format(new Date(status.getModificationTime())) + " "
                    + Path.getPathWithoutSchemeAndAuthority(status.getPath()));
            }
        }
    }
}
And, in order to enable access time, you may need to set dfs.namenode.accesstime.precision to an appropriate value (it is set to 0 by default in HDP, which disables access-time tracking).
<property>
  <name>dfs.namenode.accesstime.precision</name>
  <value>3600000</value>
</property>
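Note that the precision setting means the NameNode only refreshes a file's stored access time when it has fallen more than that many milliseconds behind the clock, so with 3600000 the access times you read back are accurate only to within one hour. A minimal sketch of that rule (the method name and values are illustrative, not a real HDFS API):

```java
public class AtimePrecision {

    // Sketch of the NameNode's access-time rule: a read updates the stored
    // access time only when it is more than precisionMs behind the clock.
    // A precision of 0 disables access-time updates entirely.
    public static boolean shouldUpdateAtime(long storedAtimeMs, long nowMs, long precisionMs) {
        if (precisionMs <= 0) {
            return false; // access-time tracking disabled
        }
        return nowMs - storedAtimeMs > precisionMs;
    }

    public static void main(String[] args) {
        long hour = 3_600_000L;
        long now = 10 * hour;
        // Read 30 minutes after the last recorded access: no update.
        System.out.println(shouldUpdateAtime(now - 30 * 60 * 1000, now, hour));
        // Read two hours after the last recorded access: atime is refreshed.
        System.out.println(shouldUpdateAtime(now - 2 * hour, now, hour));
    }
}
```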