HDFS dir cleanup of directories older than 7 days in a Python script

Explorer

Hi guys,

Please help me build a Python script for cleaning up HDFS directories that are older than 3 days.

Any suggestions are appreciated.

1 ACCEPTED SOLUTION

Super Collaborator

Hi,

From the shell, find the files that need to be deleted and save them to a temp file, like below:

 #!/bin/sh
 # Collect HDFS entries older than 3 days into a local file named toDelete
 today=`date +'%s'`
 hdfs dfs -ls /file/Path/ | grep -v '^Found' | while read line ; do   # skip the "Found N items" header
   dir_date=$(echo ${line} | awk '{print $6}')   # modification date (column 6)
   filePath=$(echo ${line} | awk '{print $8}')   # full path (column 8)
   difference=$(( ( ${today} - $(date -d ${dir_date} +%s) ) / ( 24*60*60 ) ))
   if [ ${difference} -gt 3 ]; then
     echo "$filePath" >> toDelete
   fi
 done


Then, from Python, execute the delete command for each path using, for example, subprocess.call or the sh library, something like below:

import subprocess

# Remove each path listed in the toDelete file generated by the shell script above
with open('toDelete', 'r') as paths:
    for each in paths:
        subprocess.call(["hadoop", "fs", "-rm", "-f", each.strip()])
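
If you prefer the sh library mentioned above, a minimal sketch (assuming the third-party sh package is installed and the hadoop command is on the PATH) could look like this:

import sh

# sh exposes commands on the PATH as Python callables;
# each call runs: hadoop fs -rm -f <path>
with open('toDelete', 'r') as paths:
    for path in paths:
        sh.hadoop("fs", "-rm", "-f", path.strip())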


Alternatively, you can use the Hadoop FileSystem API from PySpark, like below:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('abc').getOrCreate()

def delete_path(spark, path):
    # Get the Hadoop FileSystem handle from the JVM behind the Spark session
    sc = spark.sparkContext
    fs = (sc._jvm.org
          .apache.hadoop
          .fs.FileSystem
          .get(sc._jsc.hadoopConfiguration())
          )
    # Recursive delete (the second argument is the "recursive" flag)
    fs.delete(sc._jvm.org.apache.hadoop.fs.Path(path), True)

delete_path(spark, "Your/hdfs/path")
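
Building on the same FileSystem handle, a rough sketch of the whole cleanup in PySpark alone (list the directory, compare each entry's modification time, and delete anything older than 3 days; the path and the 3-day threshold below are placeholders) could be:

import time

def cleanup_old_paths(spark, base_path, days=3):
    sc = spark.sparkContext
    hadoop = sc._jvm.org.apache.hadoop
    fs = hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
    # Cutoff in milliseconds, since getModificationTime() returns epoch millis
    cutoff = (time.time() - days * 24 * 60 * 60) * 1000
    for status in fs.listStatus(hadoop.fs.Path(base_path)):
        if status.getModificationTime() < cutoff:
            fs.delete(status.getPath(), True)   # recursive delete

cleanup_old_paths(spark, "Your/hdfs/path")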





4 REPLIES

Super Collaborator

Hi,

The shell script below removes files that are older than 3 days from the given HDFS path:

 

 #!/bin/sh
 # Delete HDFS entries under the given path that are older than 3 days
 today=`date +'%s'`
 hdfs dfs -ls /file/Path/ | grep -v '^Found' | while read line ; do   # skip the "Found N items" header
   dir_date=$(echo ${line} | awk '{print $6}')   # modification date (column 6)
   filePath=$(echo ${line} | awk '{print $8}')   # full path (column 8)
   difference=$(( ( ${today} - $(date -d ${dir_date} +%s) ) / ( 24*60*60 ) ))
   if [ ${difference} -gt 3 ]; then
     hdfs dfs -rm -r $filePath
   fi
 done

 


The hdfs dfs -rm -r command moves the data to the trash folder if the trash mechanism is configured.

To delete the data permanently instead of moving it to the trash folder, use the -skipTrash option.
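
For example, a permanent delete issued from Python (the path below is just a placeholder) might look like this; note that with -skipTrash the data cannot be restored from trash:

import subprocess

# -skipTrash removes the data immediately instead of moving it to .Trash
subprocess.call(["hdfs", "dfs", "-rm", "-r", "-skipTrash", "/file/Path/old_dir"])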

Explorer
Thanks for the solution, it will help a lot, but I need help building a Python script for this HDFS cleanup.
Please help with that as well.

Explorer

Hi ggangadharan,

I followed the way you suggested, but now I need help with the code to run that HDFS cleanup shell script from the Python script.

I tried it like this:

import subprocess

subprocess.call('./home/test.sh/', shell=True)

file = open('toDelete', 'r')

for each in file:
    subprocess.call(["hadoop", "fs", "-rm", "-f", each])

But the shell script is not executing and not showing any output. Please suggest what I should do.

 

thanks
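
A minimal sketch that wires the two steps into one Python script (assuming the cleanup script is saved as /home/test.sh and writes its toDelete list to the current working directory) could look like this:

import subprocess

# Run the shell script with an explicit interpreter and an absolute path,
# and check the exit status so failures are not silent
ret = subprocess.call(["sh", "/home/test.sh"])
if ret != 0:
    raise SystemExit("cleanup script failed with exit code %d" % ret)

# Then remove every path the script collected
with open("toDelete", "r") as paths:
    for path in paths:
        subprocess.call(["hadoop", "fs", "-rm", "-f", path.strip()])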