Support Questions

I would like to get a custom report on which datasets have not been accessed by any user or process in the past n days. Is it possible to get this level of detail from Atlas?

1 ACCEPTED SOLUTION

Master Guru

@milind pandit

Nothing I know of out of the box at this moment will get you this. One option is to enable Ranger and plug into Ranger's audit framework; more info here. Basically, when you enable auditing on HDFS you will see who did what and when. This tells you which datasets are being used, but not directly which datasets are not being used, so with Ranger that has to be inferred.
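For reference, a minimal sketch of that inference, assuming Ranger is configured to write its HDFS plugin audit events to an HDFS directory as one JSON object per line with a "resource" field holding the accessed path (the field name, the audit directory, and the class name below are assumptions that vary by Ranger version and setup); it uses Jackson for the JSON parsing, so that jar needs to be on the classpath:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class RangerAuditAccessedPaths {
  public static void main(String[] args) throws Exception {
    // First argument: the HDFS directory Ranger writes its audit JSON to
    // (the xasecure.audit.destination.hdfs.dir value on your cluster).
    Path auditDir = new Path(args[0]);
    FileSystem fs = FileSystem.get(new Configuration());
    ObjectMapper mapper = new ObjectMapper();
    Set<String> accessedPaths = new HashSet<>();

    // Walk every audit log file under the directory.
    RemoteIterator<LocatedFileStatus> files = fs.listFiles(auditDir, true);
    while (files.hasNext()) {
      LocatedFileStatus file = files.next();
      try (BufferedReader reader = new BufferedReader(
          new InputStreamReader(fs.open(file.getPath()), StandardCharsets.UTF_8))) {
        String line;
        while ((line = reader.readLine()) != null) {
          // Each line is one audit event in JSON; "resource" is assumed to
          // hold the accessed path.
          JsonNode event = mapper.readTree(line);
          JsonNode resource = event.get("resource");
          if (resource != null) {
            accessedPaths.add(resource.asText());
          }
        }
      }
    }

    // Any dataset path that never appears in this set was not accessed
    // within the retention window of the audit logs.
    accessedPaths.forEach(System.out::println);
    fs.close();
  }
}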

You can also view HDFS space utilization through Zeppelin here.

Another option is to build a custom job: take the HDFS space utilization report (above) as a feed and store it in a table, join it with the Ranger audit logs, and store the results in Phoenix. Run this job every hour and run the report off the resulting table.
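As a rough illustration of the Phoenix side of such a job, here is a minimal sketch assuming the utilization feed and the audit logs have already been ingested into Phoenix tables; DATASET_ACCESS_SUMMARY, HDFS_UTILIZATION, RANGER_AUDIT and their columns are hypothetical names, and a left join is used so that paths with no matching audit events stand out:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class DatasetAccessReportJob {
  public static void main(String[] args) throws Exception {
    // Phoenix JDBC URLs take the ZooKeeper quorum; adjust for your cluster.
    try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk1,zk2,zk3:2181");
         Statement stmt = conn.createStatement()) {

      // Hypothetical summary table keyed by HDFS path.
      stmt.execute(
          "CREATE TABLE IF NOT EXISTS DATASET_ACCESS_SUMMARY ("
        + " HDFS_PATH VARCHAR PRIMARY KEY,"
        + " SIZE_BYTES BIGINT,"
        + " LAST_SEEN_ACCESS TIMESTAMP)");

      // HDFS_UTILIZATION and RANGER_AUDIT stand in for whatever tables your
      // ingest jobs load the utilization feed and the audit logs into. The
      // left join keeps every known path; paths that never match an audit
      // event end up with a null LAST_SEEN_ACCESS.
      stmt.execute(
          "UPSERT INTO DATASET_ACCESS_SUMMARY (HDFS_PATH, SIZE_BYTES, LAST_SEEN_ACCESS)"
        + " SELECT u.HDFS_PATH, u.SIZE_BYTES, MAX(a.EVT_TIME)"
        + " FROM HDFS_UTILIZATION u"
        + " LEFT JOIN RANGER_AUDIT a ON a.RESOURCE_PATH = u.HDFS_PATH"
        + " GROUP BY u.HDFS_PATH, u.SIZE_BYTES");
      conn.commit();

      // The hourly report is then a query over the summary table, e.g. rows
      // where LAST_SEEN_ACCESS is null or older than n days.
    }
  }
}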

In summary, I am not aware of any out-of-the-box solution.


2 REPLIES


Explorer

@milind pandit

You can get the access time using FileStatus#getAccessTime() as follows:

import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsAtime {
  private static SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ssZ");

  public static void main(String[] args) throws IOException {
    FileSystem fs = FileSystem.get(new Configuration());
    printFileStatus(fs, new Path(args[0]));
    fs.close();
  }

  // Recursively walk the tree under root and print the access time,
  // modification time, and path of every file.
  private static void printFileStatus(FileSystem fs, Path root) throws IOException {
    FileStatus[] fss = fs.listStatus(root);
    for (FileStatus status : fss) {
      if (status.isDirectory()) {
        printFileStatus(fs, status.getPath());
      } else {
        System.out.println(
            sdf.format(new Date(status.getAccessTime())) + " "
            + sdf.format(new Date(status.getModificationTime())) + " "
            + Path.getPathWithoutSchemeAndAuthority(status.getPath()));
      }
    }
  }
}

And, in order to enable access times, you may need to set dfs.namenode.accesstime.precision to an appropriate value (it is set to 0 by HDP default, which disables access time tracking).

    <property>
      <name>dfs.namenode.accesstime.precision</name>
      <value>3600000</value>
    </property>
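Once access times are available, answering the original question is a matter of comparing them against a cutoff. Below is a minimal sketch along the lines of the class above; the threshold handling and the class name are my own additions, not part of the original answer:

import java.io.IOException;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsNotAccessedSince {
  public static void main(String[] args) throws IOException {
    // Usage: HdfsNotAccessedSince <root path> <n days>
    Path root = new Path(args[0]);
    long days = Long.parseLong(args[1]);
    long cutoff = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(days);

    FileSystem fs = FileSystem.get(new Configuration());
    printStale(fs, root, cutoff);
    fs.close();
  }

  // Print every file under root whose last recorded access time is older than
  // the cutoff. With dfs.namenode.accesstime.precision at one hour, access
  // times are only accurate to within that window.
  private static void printStale(FileSystem fs, Path root, long cutoff) throws IOException {
    for (FileStatus status : fs.listStatus(root)) {
      if (status.isDirectory()) {
        printStale(fs, status.getPath(), cutoff);
      } else if (status.getAccessTime() < cutoff) {
        System.out.println(Path.getPathWithoutSchemeAndAuthority(status.getPath()));
      }
    }
  }
}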