How to remove files in excess of 100 in a folder, or files older than 10 days, in a Hadoop shell script

New Contributor

Hello all,

I am new to both shell scripting and Hadoop.

I would like to remove the files in a folder that satisfy either of the following conditions.

Condition 1: remove all files beyond a count of 100 in the folder.

Condition 2: remove all files older than 10 days in the same folder.

I have a shell script that only removes files older than 10 days. I don't know how to modify it to add the other condition, removing files once the folder holds more than 100 of them.

#!/bin/bash
# Current time in epoch seconds, truncated to the minute
now=$(date -d "$(date '+%Y-%m-%d %H:%M')" +%s)
PATH_ARCH=/folder1/arch
# Print directory count, file count, and content size for the matching paths
hadoop fs -count "$PATH_ARCH"/*.txt
# Sort the listing newest first, take the last 100 lines (the oldest), keep lines containing "wav"
for FILE in $(hdfs dfs -ls "$PATH_ARCH"/File_*.txt | sort -rk6,7 | tail -n 100 | grep wav | awk '{print $8}'); do
    # Modification date and time are columns 6 and 7 of the listing
    fileTimestamp=$(hadoop fs -ls "$FILE" | awk '{print $6,$7}')
    fileTimestampsec=$(date -d "$fileTimestamp" +%s)
    # Age of the file in whole days
    time_difference=$(( (now - fileTimestampsec) / (60*60*24) ))
    if [[ $time_difference -gt 10 ]]; then
        hadoop fs -rm -skipTrash "$FILE"
    fi
done

 

Thank you.

1 REPLY

Master Collaborator

In Hadoop, you can use HDFS shell commands in a shell script to remove files that meet certain criteria, such as being older than a given number of days or exceeding a given file count in a folder. Here's how you can do it:

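To remove all files beyond the newest 100 in a folder (assuming the oldest files are the ones to drop):

  • hadoop fs -ls <folder_path> | grep '^-': This command lists the folder and keeps only regular files. Note that hadoop fs -count reports totals for the folder (directory count, file count, content size) rather than a per-file list, so it can tell you whether the folder holds more than 100 files but cannot select which ones to delete.

  • sort -r -k6,7 | tail -n +101 | awk '{print $8}': This sorts the listing newest first by modification date and time (columns 6 and 7), skips the first 100 entries, and prints the path (column 8) of every remaining, older file.

  • xargs -I {} hadoop fs -rm {}: This part of the command reads the file paths provided by awk and deletes those files using hadoop fs -rm.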
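
A minimal sketch of a count-based cleanup, assuming the goal is to keep the most recently modified files, that paths contain no spaces, and that the listing has the usual eight columns of hadoop fs -ls. The function name select_beyond_newest is made up for illustration and is not part of Hadoop:

```shell
#!/bin/bash
# Hypothetical helper: reads `hadoop fs -ls` output on stdin and prints the
# paths of all regular files except the N most recently modified ones.
# $1 = number of newest files to keep.
select_beyond_newest() {
    # grep '^-'      keeps regular files (drops "Found N items" and directories)
    # sort -r -k6,7  orders newest first by date (col 6) and time (col 7)
    # tail -n +K     prints from line K onward, i.e. skips the newest $1 entries
    # awk            prints column 8, the path
    grep '^-' | sort -r -k6,7 | tail -n +"$(( $1 + 1 ))" | awk '{print $8}'
}

# Example use (not executed here; <folder_path> is a placeholder):
# hadoop fs -ls <folder_path> | select_beyond_newest 100 | xargs -I {} hadoop fs -rm {}
```

It may be worth running the pipeline without the final xargs step first, so the list of paths can be checked before anything is deleted.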


To remove all files older than 10 days in a folder:

hadoop fs -ls <folder_path> | awk -v cutoff="$(date -d "10 days ago" '+%Y-%m-%d %H:%M')" '/^-/ && ($6 " " $7) < cutoff {print $8}' | xargs -I {} hadoop fs -rm {}

  • hadoop fs -ls <folder_path>: This command lists the files in the specified folder.

  • awk -v cutoff="$(date -d "10 days ago" '+%Y-%m-%d %H:%M')" '/^-/ && ($6 " " $7) < cutoff {print $8}': The date command (GNU date) produces the timestamp for 10 days ago in the same YYYY-MM-DD HH:MM format that the listing uses for modification times (columns 6 and 7), so awk can compare the two as strings. The listing shows a date and a time rather than epoch seconds, which is why the cutoff is not generated with +%s. awk prints the path (column 8) of each regular file modified before the cutoff.

  • xargs -I {} hadoop fs -rm {}: This part of the command reads the file paths provided by awk and deletes those files using hadoop fs -rm.
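
Since the question asks for files that meet either condition, both checks can be applied in one pass over the listing. This is a rough sketch rather than a tested production script. It assumes GNU date, paths without spaces, the usual eight-column hadoop fs -ls output, and that "beyond 100 files" means everything except the 100 newest. The function name select_for_cleanup is hypothetical:

```shell
#!/bin/bash
# Hypothetical helper: reads `hadoop fs -ls` output on stdin and prints the
# paths of regular files that fall outside the newest $1 files OR were
# modified before the cutoff $2 (a "YYYY-MM-DD HH:MM" string, which compares
# correctly as plain text).
select_for_cleanup() {
    # After the newest-first sort, NR is the file's rank by recency, so
    # NR > keep marks files beyond the newest $1. The concatenation
    # ($6 " " $7) forces a string comparison against the cutoff.
    grep '^-' | sort -r -k6,7 |
        awk -v keep="$1" -v cutoff="$2" \
            'NR > keep || ($6 " " $7) < cutoff { print $8 }'
}

# Example use (not executed here; <folder_path> is a placeholder):
# cutoff=$(date -d "10 days ago" '+%Y-%m-%d %H:%M')
# hadoop fs -ls <folder_path> | select_for_cleanup 100 "$cutoff" | xargs -I {} hadoop fs -rm {}
```

Each path is printed at most once even when it meets both conditions, because awk evaluates the combined test once per line.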