Created on 08-04-2019 10:06 AM - last edited on 08-05-2019 02:10 AM by VidyaSargur
Hi All,
I am trying to achieve functionality similar to the following Unix command at the HDFS level:
find /temp -name '*.avro' -cnewer sample.avro
or, more generally, to retrieve files modified after a specific timestamp in HDFS.
From the Hadoop documentation I understand that the built-in find command has only limited functionality:
https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/FileSystemShell.html#find
Let me know how this can be achieved at the Hadoop level. Any workarounds are welcome.
Thanks - Muthu
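For reference, one common workaround on plain HDFS (no extra tools) is to filter the output of `hdfs dfs -ls -R` by its date and time columns. This is only a sketch: it assumes the standard 8-column listing format (permissions, replication, owner, group, size, date, time, path), and the sample listing below is illustrative.

```shell
#!/bin/bash
# Keep .avro files modified after this cutoff (lexicographic comparison works
# because the ls output uses the sortable YYYY-MM-DD HH:MM format).
cutoff="2019-08-01 00:00"

# On a real cluster you would pipe the actual listing instead:
#   hdfs dfs -ls -R /temp | awk ...
# Here we use a captured sample listing so the filter can be demonstrated.
listing='-rw-r--r--   3 hdfs hdfs   1024 2019-07-15 09:30 /temp/old.avro
-rw-r--r--   3 hdfs hdfs   2048 2019-08-03 14:12 /temp/new.avro
-rw-r--r--   3 hdfs hdfs    512 2019-08-04 08:00 /temp/latest.avro'

# Column 6 is the date, column 7 the time, column 8 the path.
echo "$listing" | awk -v cutoff="$cutoff" \
  '$8 ~ /\.avro$/ && ($6 " " $7) > cutoff { print $8 }'
```

To emulate `-cnewer sample.avro`, the cutoff could be taken from the listing of the reference file itself rather than hard-coded.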
Created on 08-05-2019 06:38 AM - edited 08-05-2019 06:38 AM
If you are using the CDH distribution, you can use HdfsFindTool to accomplish this.
Sample command to find files modified within the last 3 days in the directory "/user/hive" (-mtime -3 matches files newer than 3 days; use -mtime +3 for files older than 3 days):
hadoop jar /opt/cloudera/parcels/CDH/jars/search-mr-1.0.0-cdh5.15.1.jar org.apache.solr.hadoop.HdfsFindTool -find /user/hive -type f -mtime -3
Please adjust the /opt/cloudera/parcels path in the command to match the version of CDH you are using, and set the target directory as required. More details about HdfsFindTool can be found HERE.
Hope it helps!
Created 08-06-2019 08:23 AM
You can use a script like this to create snapshots of old and new entries, i.e. find entries older than 3 days and entries newer than 3 days. (Note that the script below uses -type d, so it matches directories; use -type f to match files.) Just make sure you use the correct path to the Cloudera jars. In the case of CDH 5.15:
#!/bin/bash
# Timestamp used to tag every row of this snapshot.
now=$(date +"%Y-%m-%dT%H:%M:%S")

# Clear out the previous snapshot (-f suppresses the error on the first run,
# when the files do not exist yet).
hdfs dfs -rm -f /data/cleanup_report/part=older3days/*
hdfs dfs -rm -f /data/cleanup_report/part=newer3days/*

# Prefix each found path with the timestamp and an age label, then stream the
# result straight into the matching partition directory in HDFS.
hadoop jar /opt/cloudera/parcels/CDH/jars/search-mr-1.0.0-cdh5.15.1.jar org.apache.solr.hadoop.HdfsFindTool -find /data -type d -mtime +3 | sed "s/^/${now}\tolder3days\t/" | hadoop fs -put - /data/cleanup_report/part=older3days/data.csv
hadoop jar /opt/cloudera/parcels/CDH/jars/search-mr-1.0.0-cdh5.15.1.jar org.apache.solr.hadoop.HdfsFindTool -find /data -type d -mtime -3 | sed "s/^/${now}\tnewer3days\t/" | hadoop fs -put - /data/cleanup_report/part=newer3days/data.csv
Then create an external table with partitions on top of this HDFS folder.
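The external table might be declared along these lines (a sketch only: the table and column names are illustrative, but the tab-separated three-column layout and the part= partition directories match what the script writes):

```sql
CREATE EXTERNAL TABLE cleanup_report (
  snapshot_ts STRING,  -- the ${now} timestamp prepended by sed
  age_label   STRING,  -- older3days / newer3days (repeated from the partition)
  path        STRING   -- the HDFS path reported by HdfsFindTool
)
PARTITIONED BY (part STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/cleanup_report';

-- Register the partition directories written by the script:
MSCK REPAIR TABLE cleanup_report;
```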