10-13-2023
04:32 AM
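In Hadoop, you can use HDFS shell commands in a shell script to remove files that meet certain criteria, such as being older than a certain number of days or being in excess of a certain number of files in a folder. Here's how you can do it:

To remove all files beyond the first 100 in a folder (keeping the 100 most recently modified):

hadoop fs -ls -t <folder_path> | awk 'NF >= 8 && $1 !~ /^d/ {print $8}' | tail -n +101 | xargs -I {} hadoop fs -rm {}

hadoop fs -ls -t <folder_path>: This command lists the contents of the specified folder sorted by modification time, most recent first. The -t sort option is not available in older Hadoop releases. Note that hadoop fs -count -q <folder_path> only reports totals for the folder (four quota columns, then directory count, file count, content size, and path), so it can tell you whether the folder holds more than 100 files, but it cannot supply the file paths to delete.

awk 'NF >= 8 && $1 !~ /^d/ {print $8}': This awk command skips the "Found N items" header and any directories, and prints the path column of each remaining file.

tail -n +101: This passes through everything from the 101st line onward, that is, every file after the 100 most recent ones.

xargs -I {} hadoop fs -rm {}: This part of the command reads the file paths from the previous step and deletes those files using hadoop fs -rm.

To remove all files older than 10 days in a folder:

hadoop fs -ls <folder_path> | awk -v cutoff="$(date -d "10 days ago" +%Y-%m-%d)" 'NF >= 8 && $1 !~ /^d/ && $6 < cutoff {print $8}' | xargs -I {} hadoop fs -rm {}

hadoop fs -ls <folder_path>: This command lists the files in the specified folder. Each line has eight columns: permissions, replication, owner, group, size, modification date (YYYY-MM-DD), modification time, and path.

awk -v cutoff="$(date -d "10 days ago" +%Y-%m-%d)" '... $6 < cutoff {print $8}': The date command (GNU date syntax) computes the date 10 days ago in the same YYYY-MM-DD format that hadoop fs -ls prints, and awk compares it with the modification date in column 6 as a string. An epoch timestamp (+%s) would not work here, because column 6 is a date string rather than a number. The command selects files whose modification date is earlier than the cutoff date.

xargs -I {} hadoop fs -rm {}: This part of the command reads the file paths provided by awk and deletes those files using hadoop fs -rm.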
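Since a wrong filter here deletes data, it can help to test the selection step on its own before connecting it to hadoop fs -rm. The following is only a sketch under some assumptions: the function names older_than and beyond_first are made up for illustration, the input is the usual eight-column hadoop fs -ls listing with a YYYY-MM-DD date in column 6, and paths contain no spaces.

```shell
# Sketch: pick deletion candidates from `hadoop fs -ls` output read on stdin.
# Assumed columns: perms, replication, owner, group, size, date, time, path.
# older_than and beyond_first are hypothetical helper names, not Hadoop commands.

# Print paths of regular files whose modification date (YYYY-MM-DD)
# sorts before the cutoff date given as the first argument.
older_than() {
  awk -v cutoff="$1" \
    'NF >= 8 && $1 !~ /^d/ && ($6 "") < (cutoff "") { print $8 }'
}

# Print paths of regular files after the first N files in the listing,
# where N is the first argument. The input order decides which files are kept.
beyond_first() {
  awk -v keep="$1" \
    'NF >= 8 && $1 !~ /^d/ { if (++n > keep) print $8 }'
}
```

For a dry run, pipe a real listing through one of the functions and look at the paths it prints, for example hadoop fs -ls <folder_path> | older_than "$(date -d "10 days ago" +%Y-%m-%d)" (GNU date assumed). Only once the list looks right, append | xargs -I {} hadoop fs -rm {}.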