Created 04-20-2022 05:21 AM
Hi guys,
Please help me build a Python script for cleaning up HDFS directories that are older than 3 days.
Any suggestions are appreciated.
Created 04-25-2022 11:15 PM
Hi,
From the shell, find the files that need to be deleted and save them in a temp file, like below:
#!/bin/sh
today=`date +'%s'`
hdfs dfs -ls /file/Path/ | while read line ; do
  dir_date=$(echo ${line} | awk '{print $6}')
  filePath=$(echo ${line} | awk '{print $8}')
  # Skip the "Found N items" header line that hdfs dfs -ls prints first
  [ -z "${dir_date}" ] && continue
  difference=$(( ( ${today} - $(date -d ${dir_date} +%s) ) / ( 24*60*60 ) ))
  if [ ${difference} -gt 3 ]; then
    echo "$filePath" >> toDelete
  fi
done
Then execute the delete as a shell command from Python, using for example subprocess.call or the sh library, something like below:
import subprocess

with open('toDelete', 'r') as f:
    for each in f:
        path = each.strip()  # drop the trailing newline
        if path:
            subprocess.call(["hadoop", "fs", "-rm", "-f", path])
Also, you can use the Hadoop FileSystem API from PySpark, like below:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('abc').getOrCreate()
def delete_path(spark, path):
    sc = spark.sparkContext
    fs = (sc._jvm.org
          .apache.hadoop
          .fs.FileSystem
          .get(sc._jsc.hadoopConfiguration())
          )
    fs.delete(sc._jvm.org.apache.hadoop.fs.Path(path), True)  # True = recursive delete
delete_path(spark, "Your/hdfs/path")
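The delete_path function above removes a single, known path. To apply the 3-day rule from PySpark as well, a rough sketch along the same lines could use the standard FileSystem.listStatus and FileStatus.getModificationTime methods; the function name delete_older_than and the 3-day default are only illustrative:

import time

def delete_older_than(spark, dir_path, days=3):
    sc = spark.sparkContext
    jvm = sc._jvm
    fs = jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
    cutoff_ms = (time.time() - days * 24 * 60 * 60) * 1000
    # listStatus returns one FileStatus per entry directly under dir_path
    for status in fs.listStatus(jvm.org.apache.hadoop.fs.Path(dir_path)):
        if status.getModificationTime() < cutoff_ms:
            fs.delete(status.getPath(), True)  # True = recursive

delete_older_than(spark, "Your/hdfs/path")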
Created on 04-22-2022 04:24 AM - edited 04-22-2022 04:25 AM
Hi,
The script below removes files that are older than 3 days from the given HDFS path:
#!/bin/sh
today=`date +'%s'`
hdfs dfs -ls /file/Path/ | while read line ; do
  dir_date=$(echo ${line} | awk '{print $6}')
  filePath=$(echo ${line} | awk '{print $8}')
  # Skip the "Found N items" header line that hdfs dfs -ls prints first
  [ -z "${dir_date}" ] && continue
  difference=$(( ( ${today} - $(date -d ${dir_date} +%s) ) / ( 24*60*60 ) ))
  if [ ${difference} -gt 3 ]; then
    hdfs dfs -rm -r "$filePath"
  fi
done
The hdfs dfs -rm -r command moves the data to the trash folder if the trash mechanism is configured.
To skip the trash and delete the data permanently, use the -skipTrash option.
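For example, from Python the flag can simply be added to the command (the path below is just a placeholder):

import subprocess

# Deletes permanently, bypassing the .Trash directory
subprocess.call(["hdfs", "dfs", "-rm", "-r", "-skipTrash", "/file/Path/old_dir"])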
Created 04-29-2022 03:32 AM
Hi ggangadharan,
I followed the approach you suggested, but now I need help with the code to execute that HDFS cleanup shell script from the Python script.
I am doing it like this:
import subprocess
subprocess.call('./home/test.sh/', shell=True)
file = open('toDelete', 'r')
for each in file:
    subprocess.call(["hadoop", "fs", "-rm", "-f", each])
but the shell script is not executing and does not show any output. Please suggest what I should do.
Thanks