Retrieve HDFS files after a specific timestamp
Labels: Apache Hadoop
Created on 08-04-2019 10:06 AM - last edited on 08-05-2019 02:10 AM by VidyaSargur
Hi All,
I am trying to achieve functionality similar to the following Unix command at the HDFS level:
find /temp -name '*.avro' -cnewer sample.avro
i.e. to retrieve files newer than a specific timestamp from HDFS.
From the Hadoop documentation I came to know that the built-in find command has only limited functionality (illustrated below):
https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/FileSystemShell.html#find
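For reference, a minimal sketch of what the stock shell find accepts (as of Hadoop 2.7.x it only supports name-based expressions and -print, so there is no equivalent of -cnewer or a modification-time filter):
# Built-in HDFS find: matches by name only, no timestamp predicate.
hdfs dfs -find /temp -name '*.avro' -print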
Please let me know how this can be achieved at the Hadoop level, or whether there are any workarounds.
Thanks - Muthu
Created on 08-05-2019 06:38 AM - edited 08-05-2019 06:38 AM
If you are using the CDH distribution, you can use HdfsFindTool to accomplish this.
Sample command to find files in the directory "/user/hive" that were modified within the last 3 days:
hadoop jar /opt/cloudera/parcels/CDH/jars/search-mr-1.0.0-cdh5.15.1.jar org.apache.solr.hadoop.HdfsFindTool -find /user/hive -type f -mtime -3
Please adjust the /opt/cloudera/parcels path in the command to match the CDH version you are using, and the target directory as per your requirement. More details about HdfsFindTool can be found in the Cloudera Search documentation.
Hope it helps!
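A variant closer to the original question (the path, file pattern, and 3-day window below are illustrative), combining a name pattern with the time filter:
# Avro files under /temp modified within the last 3 days.
hadoop jar /opt/cloudera/parcels/CDH/jars/search-mr-1.0.0-cdh5.15.1.jar org.apache.solr.hadoop.HdfsFindTool -find /temp -type f -name '*.avro' -mtime -3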
Created 08-06-2019 08:23 AM
You can use a script like this to create snapshots of old and new files, i.e. to search for paths that are older than 3 days and for paths that are newer than 3 days. Just make sure you use the correct path to the Cloudera jars. In the case of CDH 5.15:
#!/bin/bash
# Snapshot HDFS paths (directories in this example) older/newer than 3 days into a partitioned report folder.
now=$(date +"%Y-%m-%dT%H:%M:%S")
# Clear the previous report partitions.
hdfs dfs -rm /data/cleanup_report/part=older3days/*
hdfs dfs -rm /data/cleanup_report/part=newer3days/*
# Paths last modified more than 3 days ago, tagged with the snapshot time and written as tab-separated lines.
hadoop jar /opt/cloudera/parcels/CDH/jars/search-mr-1.0.0-cdh5.15.1.jar org.apache.solr.hadoop.HdfsFindTool -find /data -type d -mtime +3 | sed "s/^/${now}\tolder3days\t/" | hadoop fs -put - /data/cleanup_report/part=older3days/data.csv
# Paths last modified within the last 3 days.
hadoop jar /opt/cloudera/parcels/CDH/jars/search-mr-1.0.0-cdh5.15.1.jar org.apache.solr.hadoop.HdfsFindTool -find /data -type d -mtime -3 | sed "s/^/${now}\tnewer3days\t/" | hadoop fs -put - /data/cleanup_report/part=newer3days/data.csv
Then create an external table with partitions on top of this HDFS folder.
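A minimal sketch of that step, assuming a tab-delimited table whose three columns match the lines written by the script (the table and column names here are illustrative, not from the post):
# Hypothetical Hive DDL; adjust names and location to your environment.
hive -e "
CREATE EXTERNAL TABLE IF NOT EXISTS cleanup_report (
  snapshot_ts STRING,
  age_bucket  STRING,
  hdfs_path   STRING
)
PARTITIONED BY (part STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/cleanup_report';
ALTER TABLE cleanup_report ADD IF NOT EXISTS PARTITION (part='older3days');
ALTER TABLE cleanup_report ADD IF NOT EXISTS PARTITION (part='newer3days');
"
Once the partitions are registered, a query such as SELECT * FROM cleanup_report WHERE part='newer3days' returns the newer paths together with the snapshot timestamp.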
