I would like to get a custom report on which datasets have not been accessed by any user or process in the past n days. Is it possible to get this level of detail from Atlas?
Labels: Apache Atlas
Created 06-20-2016 08:58 PM
Created 06-21-2016 04:11 AM
Nothing I know of out of the box at this moment will get you this. An option is to enable Ranger and plug into Ranger's audit framework; more info here. Basically, when you enable auditing on HDFS you will see who did what and when. This tells you which datasets are being used, but not much about which datasets are not being used; that has to be inferred from the Ranger data.
You can also view HDFS space utilization through Zeppelin here.
Another option is to build a custom job that takes the HDFS space utilization report (above), feeds it into a table, cross-joins it with the Ranger audit logs, and stores the results in Phoenix. Run this job every hour and run your report off the end table, as sketched below.
In summary, I am not aware of any out-of-the-box solution.
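Below is a minimal sketch of what the report query off that end table could look like. It assumes the hourly job has already populated two Phoenix tables, HDFS_DATASETS (one row per dataset path) and RANGER_AUDIT (one row per audited HDFS access); the table names, column names, and ZooKeeper quorum are illustrative placeholders rather than an actual Ranger or Phoenix schema, and the anti-join is written in standard SQL, so the exact syntax may need adjusting for Phoenix.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;

public class UnusedDatasetReport {
    public static void main(String[] args) throws Exception {
        // "n days" from the original question
        int days = Integer.parseInt(args[0]);
        Timestamp cutoff = new Timestamp(System.currentTimeMillis() - days * 86400000L);
        // Phoenix JDBC URL; replace zk-host with your ZooKeeper quorum.
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181:/hbase")) {
            // Anti-join: datasets with no audited access newer than the cutoff.
            String sql = "SELECT d.PATH FROM HDFS_DATASETS d "
                + "LEFT JOIN RANGER_AUDIT a "
                + "ON a.RESOURCE_PATH = d.PATH AND a.ACCESS_TIME > ? "
                + "WHERE a.RESOURCE_PATH IS NULL";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setTimestamp(1, cutoff);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString(1));
                    }
                }
            }
        }
    }
}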
Created 06-21-2016 09:05 AM
@milind pandit
You can get access time using FileStatus#getAccessTime() as follows:
import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsAtime {

    private static SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ssZ");

    public static void main(String[] args) throws IOException {
        // args[0] is the HDFS directory to scan recursively
        FileSystem fs = FileSystem.get(new Configuration());
        printFileStatus(fs, new Path(args[0]));
        fs.close();
    }

    // Recursively walk the tree and print access time, modification time and path for each file.
    private static void printFileStatus(FileSystem fs, Path root) throws IOException {
        FileStatus[] fss = fs.listStatus(root);
        for (FileStatus status : fss) {
            if (status.isDirectory()) {
                printFileStatus(fs, status.getPath());
            } else {
                System.out.println(
                    sdf.format(new Date(status.getAccessTime())) + " "
                    + sdf.format(new Date(status.getModificationTime())) + " "
                    + Path.getPathWithoutSchemeAndAuthority(status.getPath()));
            }
        }
    }
}
And, in order to enable access time tracking, you may need to set dfs.namenode.accesstime.precision to an appropriate value (it is set to 0 by default in HDP):
<property>
  <name>dfs.namenode.accesstime.precision</name>
  <value>3600000</value>
</property>