Created on 08-04-2019 10:06 AM - last edited on 08-05-2019 02:10 AM by VidyaSargur
Hi All,
I am trying to achieve functionality similar to the following Unix command at the HDFS level:
find /temp -name '*.avro' -cnewer sample.avro
or, more generally, to retrieve files modified after a specific timestamp in HDFS.
From the Hadoop documentation I understand that the built-in find command has only limited functionality:
https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/FileSystemShell.html#find
Let me know how this can be achieved at the Hadoop level. Any workarounds are welcome.
Thanks - Muthu
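For reference, one common workaround on plain HDFS (no extra tools) is to filter the output of `hdfs dfs -ls -R` by its date and time columns. This is only a sketch: it assumes the standard 8-column listing format (permissions, replication, owner, group, size, date, time, path), and the sample listing below is illustrative.

```shell
#!/bin/bash
# Keep .avro files modified after this cutoff (lexicographic comparison works
# because the ls output uses the sortable YYYY-MM-DD HH:MM format).
cutoff="2019-08-01 00:00"

# On a real cluster you would pipe the actual listing instead:
#   hdfs dfs -ls -R /temp | awk ...
# Here we use a captured sample listing so the filter can be demonstrated.
listing='-rw-r--r--   3 hdfs hdfs   1024 2019-07-15 09:30 /temp/old.avro
-rw-r--r--   3 hdfs hdfs   2048 2019-08-03 14:12 /temp/new.avro
-rw-r--r--   3 hdfs hdfs    512 2019-08-04 08:00 /temp/latest.avro'

# Column 6 is the date, column 7 the time, column 8 the path.
echo "$listing" | awk -v cutoff="$cutoff" \
  '$8 ~ /\.avro$/ && ($6 " " $7) > cutoff { print $8 }'
```

To emulate `-cnewer sample.avro`, the cutoff could be taken from the listing of the reference file itself rather than hard-coded.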
Created on 08-05-2019 06:38 AM - edited 08-05-2019 06:38 AM
If you are using the CDH distribution, you can use HdfsFindTool to accomplish this.
Sample command to find files modified within the last 3 days in the directory "/user/hive" (-mtime -3 matches files newer than 3 days; use -mtime +3 for files older than 3 days):
hadoop jar /opt/cloudera/parcels/CDH/jars/search-mr-1.0.0-cdh5.15.1.jar org.apache.solr.hadoop.HdfsFindTool -find /user/hive -type f -mtime -3
Please adjust the /opt/cloudera/parcels path in the command to match the version of CDH you are using, and set the target directory as required. More details about HdfsFindTool can be found HERE.
Hope it helps!
Created 08-06-2019 08:23 AM
You can use a script like this to create snapshots of old and new entries, i.e. find entries older than 3 days and entries newer than 3 days. (Note that the script below uses -type d, so it matches directories; use -type f to match files.) Just make sure you use the correct path to the Cloudera jars. In the case of CDH 5.15:
#!/bin/bash
# Timestamp used to tag every row of this snapshot.
now=$(date +"%Y-%m-%dT%H:%M:%S")

# Clear out the previous snapshot (-f suppresses the error on the first run,
# when the files do not exist yet).
hdfs dfs -rm -f /data/cleanup_report/part=older3days/*
hdfs dfs -rm -f /data/cleanup_report/part=newer3days/*

# Prefix each found path with the timestamp and an age label, then stream the
# result straight into the matching partition directory in HDFS.
hadoop jar /opt/cloudera/parcels/CDH/jars/search-mr-1.0.0-cdh5.15.1.jar org.apache.solr.hadoop.HdfsFindTool -find /data -type d -mtime +3 | sed "s/^/${now}\tolder3days\t/" | hadoop fs -put - /data/cleanup_report/part=older3days/data.csv
hadoop jar /opt/cloudera/parcels/CDH/jars/search-mr-1.0.0-cdh5.15.1.jar org.apache.solr.hadoop.HdfsFindTool -find /data -type d -mtime -3 | sed "s/^/${now}\tnewer3days\t/" | hadoop fs -put - /data/cleanup_report/part=newer3days/data.csv
Then create an external table with partitions on top of this HDFS folder.
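The external table might be declared along these lines (a sketch only: the table and column names are illustrative, but the tab-separated three-column layout and the part= partition directories match what the script writes):

```sql
CREATE EXTERNAL TABLE cleanup_report (
  snapshot_ts STRING,  -- the ${now} timestamp prepended by sed
  age_label   STRING,  -- older3days / newer3days (repeated from the partition)
  path        STRING   -- the HDFS path reported by HdfsFindTool
)
PARTITIONED BY (part STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/cleanup_report';

-- Register the partition directories written by the script:
MSCK REPAIR TABLE cleanup_report;
```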