Retrieve HDFS files after a specific timestamp

Explorer

Hi All,

I am trying to achieve functionality similar to the following Unix command, at the HDFS level:

find /temp -name '*.avro' -cnewer sample.avro

That is, I want to retrieve files newer than a specific timestamp from HDFS.

 

From the Hadoop documentation, I see that the built-in find command offers only limited functionality:

https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/FileSystemShell.html#find

Let me know how this can be achieved at the Hadoop level, or any workarounds.

Thanks - Muthu

 

1 ACCEPTED SOLUTION

Expert Contributor

@smkmuthu

If you are using the CDH distribution, you can use HdfsFindTool to accomplish this.

Sample command to find files in the directory /user/hive modified within the last 3 days (-mtime -3; use -mtime +3 for files older than 3 days):

hadoop jar /opt/cloudera/parcels/CDH/jars/search-mr-1.0.0-cdh5.15.1.jar org.apache.solr.hadoop.HdfsFindTool -find /user/hive -type f -mtime -3

 

Please adjust the /opt/cloudera/parcels path in the command to match the version of CDH you are using, and the target directory as per your requirement. More details about HdfsFindTool can be found in the Cloudera Search documentation.
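
If HdfsFindTool is not available to you, a rough workaround is to filter the output of hdfs dfs -ls by modification time. A minimal sketch, assuming the default ls output format (fields 6 and 7 are the modification date and time) and an illustrative cutoff value:

# List .avro files under /temp modified after the cutoff timestamp.
# Note: this breaks on paths containing spaces; fine for a quick report.
cutoff="2018-10-01 00:00"
hdfs dfs -ls -R /temp | grep '\.avro$' | awk -v c="$cutoff" '$6" "$7 > c {print $8}'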

 

Hope it helps!


2 REPLIES


You can use a script like the one below to create snapshots of old and new files, i.e., to find files older than 3 days and files newer than 3 days. Just make sure you use the correct path to the Cloudera jars; in the case of CDH 5.15:

 

#!/bin/bash
# Timestamp for this snapshot run
now=$(date +"%Y-%m-%dT%H:%M:%S")

# Clear out the previous report partitions
hdfs dfs -rm /data/cleanup_report/part=older3days/*
hdfs dfs -rm /data/cleanup_report/part=newer3days/*

# Find paths under /data older than 3 days (-type d matches directories;
# use -type f for individual files), prefix each line with the run
# timestamp and a bucket label, and write the result to the report partition
hadoop jar /opt/cloudera/parcels/CDH/jars/search-mr-1.0.0-cdh5.15.1.jar org.apache.solr.hadoop.HdfsFindTool -find /data -type d -mtime +3 | sed "s/^/${now}\tolder3days\t/" | hadoop fs -put - /data/cleanup_report/part=older3days/data.csv

# Same for paths modified within the last 3 days
hadoop jar /opt/cloudera/parcels/CDH/jars/search-mr-1.0.0-cdh5.15.1.jar org.apache.solr.hadoop.HdfsFindTool -find /data -type d -mtime -3 | sed "s/^/${now}\tnewer3days\t/" | hadoop fs -put - /data/cleanup_report/part=newer3days/data.csv

 

Then create an external table with partitions on top of this HDFS folder.
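
A sketch of what that DDL could look like, assuming the tab-separated layout produced by the script above (the table and column names here are illustrative, not from the original post):

# Hypothetical table: three tab-separated columns written by the script
# (run timestamp, bucket label, path), partitioned by the part= directory.
hive -e "
CREATE EXTERNAL TABLE IF NOT EXISTS cleanup_report (
  snapshot_ts STRING,
  age_bucket  STRING,
  path        STRING
)
PARTITIONED BY (part STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/cleanup_report';
-- Register the two partitions written by the script
ALTER TABLE cleanup_report ADD IF NOT EXISTS PARTITION (part='older3days');
ALTER TABLE cleanup_report ADD IF NOT EXISTS PARTITION (part='newer3days');
"

Since the directories follow the part=<value> naming convention, running MSCK REPAIR TABLE cleanup_report would also discover the partitions automatically.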