I would like to get a custom report on which datasets have not been accessed by any user or process in the past n days. Is it possible to get this level of detail from Atlas?


Accepted Solutions
Re: I would like to get a custom report on which datasets have not been accessed by any user or process in the past n days. Is it possible to get this level of detail from Atlas?

Super Guru

@milind pandit

Nothing I know of out of the box will get you this at the moment. One option is to enable Ranger and plug into Ranger's audit framework (more info here). Basically, when you enable auditing on HDFS you will see who did what and when. This tells you which datasets are being used. It does not directly tell you which datasets are not being used; with Ranger, that has to be inferred from the absence of audit events.

You can also view HDFS space utilization through Zeppelin here.

Another option is to build a custom job that takes the HDFS space utilization report (above) as a feed and stores it in a table. Cross join it with the Ranger audit logs and store the results in Phoenix. Run this job every hour, and run the report off the final table.

In summary, I am not aware of any out-of-the-box solution.
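The inference step described above (datasets with no audit events in the past n days) could be sketched roughly as follows. This is only an illustration in plain Java: the AuditEvent class, the findUnused method, and the sample paths are hypothetical and are not part of Ranger, HDFS, or Phoenix. It assumes the list of known dataset paths and the audit events have already been extracted by some other means.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical sketch: none of these names come from Ranger or HDFS.
class UnusedDatasetReport {

  /** Simplified audit event: which dataset path was touched, and when. */
  static final class AuditEvent {
    final String path;
    final Instant eventTime;

    AuditEvent(String path, Instant eventTime) {
      this.path = path;
      this.eventTime = eventTime;
    }
  }

  /**
   * Returns the known paths with no audit event at or after (now - days),
   * sorted by path. Paths that never appear in the audit data count as unused.
   */
  static List<String> findUnused(Collection<String> knownPaths,
                                 List<AuditEvent> events,
                                 Instant now,
                                 int days) {
    Instant cutoff = now.minus(Duration.ofDays(days));

    // Keep only the most recent event time per path.
    Map<String, Instant> lastAccess = new HashMap<>();
    for (AuditEvent e : events) {
      lastAccess.merge(e.path, e.eventTime, (a, b) -> a.isAfter(b) ? a : b);
    }

    List<String> unused = new ArrayList<>();
    for (String path : new TreeSet<>(knownPaths)) {
      Instant last = lastAccess.get(path);
      if (last == null || last.isBefore(cutoff)) {
        unused.add(path);
      }
    }
    return unused;
  }

  public static void main(String[] args) {
    Instant now = Instant.parse("2017-03-01T00:00:00Z");
    List<String> known = Arrays.asList("/data/sales", "/data/logs", "/data/archive");
    List<AuditEvent> events = Arrays.asList(
        new AuditEvent("/data/sales", Instant.parse("2017-02-27T10:00:00Z")),
        new AuditEvent("/data/logs", Instant.parse("2017-01-01T10:00:00Z")));

    // Prints [/data/archive, /data/logs]
    System.out.println(findUnused(known, events, now, 30));
  }
}
```

In a real job the same comparison would more likely be expressed as a query over the stored tables; the point here is only that "unused" means "known, but absent from recent audit events".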

Re: I would like to get a custom report on which datasets have not been accessed by any user or process in the past n days. Is it possible to get this level of detail from Atlas?

Explorer

@milind pandit

You can get access time using FileStatus#getAccessTime() as follows:

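import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsAtime {
  private static SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ssZ");

  // Usage: pass the root path to scan as the first argument.
  public static void main(String[] args) throws IOException {
    FileSystem fs = FileSystem.get(new Configuration());
    printFileStatus(fs, new Path(args[0]));
    fs.close();
  }

  // Recursively prints "<access time> <modification time> <path>" for every file under root.
  private static void printFileStatus(FileSystem fs, Path root) throws IOException {
    FileStatus[] fss = fs.listStatus(root);
    for (FileStatus status : fss) {
      if (status.isDirectory()) {
        // Descend into subdirectories; only files are printed.
        printFileStatus(fs, status.getPath());
      } else {
        System.out.println(
            sdf.format(new Date(status.getAccessTime())) + " "
            + sdf.format(new Date(status.getModificationTime())) + " "
            + Path.getPathWithoutSchemeAndAuthority(status.getPath()));
      }
    }
  }
}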

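To enable access times, you may need to set dfs.namenode.accesstime.precision to an appropriate value (HDP sets it to 0 by default, which disables access times). The value is in milliseconds, so 3600000 below means access times are precise to one hour.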

    <property>
      <name>dfs.namenode.accesstime.precision</name>
      <value>3600000</value>
    </property>
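
To turn the listing into a "not accessed in the past n days" report, the access times could be compared against a cutoff. The following is a minimal sketch in plain Java with no Hadoop dependency; the class and method names (StaleFileFilter, isStale, stalePaths) are hypothetical, and the map of path to access time is assumed to have been filled from FileStatus#getAccessTime() in a walker like the one above. The result is only meaningful once access times are enabled, and only to within the configured precision.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: these names are illustrative, not part of the Hadoop API.
class StaleFileFilter {

  /** True when the last access is more than `days` days before `nowMillis`. */
  static boolean isStale(long accessTimeMillis, long nowMillis, int days) {
    return nowMillis - accessTimeMillis > TimeUnit.DAYS.toMillis(days);
  }

  /** Returns the paths whose access time is stale, in the map's iteration order. */
  static List<String> stalePaths(Map<String, Long> accessTimes, long nowMillis, int days) {
    List<String> stale = new ArrayList<>();
    for (Map.Entry<String, Long> entry : accessTimes.entrySet()) {
      if (isStale(entry.getValue(), nowMillis, days)) {
        stale.add(entry.getKey());
      }
    }
    return stale;
  }

  public static void main(String[] args) {
    // Fixed "now" so the example is deterministic: day 100 since the epoch.
    long now = TimeUnit.DAYS.toMillis(100);

    // In practice these values would come from status.getAccessTime().
    Map<String, Long> accessTimes = new LinkedHashMap<>();
    accessTimes.put("/data/a", TimeUnit.DAYS.toMillis(95)); // accessed 5 days ago
    accessTimes.put("/data/b", TimeUnit.DAYS.toMillis(10)); // accessed 90 days ago

    // Prints [/data/b]
    System.out.println(stalePaths(accessTimes, now, 30));
  }
}
```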