Support Questions
Find answers, ask questions, and share your expertise

HDFS dir cleanup for directories older than 7 days in a Python script

New Contributor

Hi guys,

Please help me build a Python script for cleaning up HDFS directories that are older than 3 days.

Any suggestions are welcome.

1 ACCEPTED SOLUTION

Cloudera Employee

Hi,

From the shell, find the files that need to be deleted and save them to a temp file, like below:

 #!/bin/sh
 today=$(date +%s)
 # tail -n +2 skips the "Found N items" header line that hdfs dfs -ls prints
 hdfs dfs -ls /file/Path/ | tail -n +2 | while read -r line; do
     dir_date=$(echo "$line" | awk '{print $6}')
     filePath=$(echo "$line" | awk '{print $8}')
     difference=$(( (today - $(date -d "$dir_date" +%s)) / (24*60*60) ))
     if [ "$difference" -gt 3 ]; then
         echo "$filePath" >> toDelete
     fi
 done


Then, from Python, execute the shell commands using, for example, subprocess.call or the sh library, like below:

import subprocess

# strip() removes the trailing newline from each line read from the file
with open('toDelete', 'r') as f:
    for each in f:
        subprocess.call(["hadoop", "fs", "-rm", "-f", each.strip()])
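The two steps above can also be combined into a single Python script with no temp file. A minimal sketch, assuming the usual `hdfs dfs -ls` column layout (permissions, replication, owner, group, size, date, time, path) and using `/file/Path/` as a placeholder:

```python
import subprocess
import time
from datetime import datetime

def old_paths(ls_output, days, now=None):
    """Parse `hdfs dfs -ls` output and return paths older than `days` days."""
    now = time.time() if now is None else now
    paths = []
    for line in ls_output.splitlines():
        fields = line.split()
        if len(fields) < 8:  # skip the "Found N items" header line
            continue
        # fields[5] is the modification date, fields[7] is the path
        mtime = datetime.strptime(fields[5], "%Y-%m-%d").timestamp()
        if (now - mtime) / (24 * 60 * 60) > days:
            paths.append(fields[7])
    return paths

def cleanup(path="/file/Path/", days=3):
    """List `path` on HDFS and delete entries older than `days` (needs the hdfs CLI)."""
    out = subprocess.check_output(["hdfs", "dfs", "-ls", path], text=True)
    for p in old_paths(out, days):
        subprocess.call(["hdfs", "dfs", "-rm", "-r", p])
```

On a cluster you would just call `cleanup()`; the parsing is kept in a separate function so it can be tested without HDFS access.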


Alternatively, you can use the Hadoop FileSystem API from PySpark, like below:

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('abc').getOrCreate()

def delete_path(spark, path):
    sc = spark.sparkContext
    fs = (sc._jvm.org
          .apache.hadoop
          .fs.FileSystem
          .get(sc._jsc.hadoopConfiguration())
          )
    fs.delete(sc._jvm.org.apache.hadoop.fs.Path(path), True)

delete_path(spark, "Your/hdfs/path")
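If Spark is not available, another option is a pure-Python WebHDFS client such as the `hdfs` PyPI package. A minimal sketch, where the namenode URL, user, and path are all assumptions for illustration:

```python
import time

def is_older_than(modification_ms, days, now_s=None):
    """True if a WebHDFS modificationTime (in milliseconds) is more than `days` days old."""
    now_s = time.time() if now_s is None else now_s
    return (now_s - modification_ms / 1000.0) > days * 24 * 60 * 60

def cleanup_webhdfs(url="http://namenode:9870", base="/file/Path", days=3):
    """Delete entries under `base` older than `days` via WebHDFS (pip install hdfs)."""
    from hdfs import InsecureClient  # hypothetical endpoint; adjust for your cluster
    client = InsecureClient(url, user="hdfs")
    for name, status in client.list(base, status=True):
        if is_older_than(status["modificationTime"], days):
            client.delete(base + "/" + name, recursive=True)
```

This avoids shelling out entirely, at the cost of requiring WebHDFS to be enabled on the cluster.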





4 REPLIES

Cloudera Employee

Hi,

The source code below removes files older than 3 days from the given HDFS path:

 

 #!/bin/sh
 today=$(date +%s)
 # tail -n +2 skips the "Found N items" header line that hdfs dfs -ls prints
 hdfs dfs -ls /file/Path/ | tail -n +2 | while read -r line; do
     dir_date=$(echo "$line" | awk '{print $6}')
     filePath=$(echo "$line" | awk '{print $8}')
     difference=$(( (today - $(date -d "$dir_date" +%s)) / (24*60*60) ))
     if [ "$difference" -gt 3 ]; then
         hdfs dfs -rm -r "$filePath"
     fi
 done

 


The hdfs dfs -rm -r command moves the data to the trash folder if the trash mechanism is configured.

To delete immediately instead of moving files to the trash folder, use the -skipTrash option.
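For completeness, the same deletion, including -skipTrash, can be issued from a Python script. A small sketch; the path and the helper names are placeholders:

```python
import subprocess

def rm_command(path, recursive=True, skip_trash=False):
    """Build the argument list for `hdfs dfs -rm` to pass to subprocess."""
    cmd = ["hdfs", "dfs", "-rm"]
    if recursive:
        cmd.append("-r")
    if skip_trash:
        cmd.append("-skipTrash")  # bypasses the trash folder: deletion is permanent
    cmd.append(path)
    return cmd

def delete_permanently(path):
    """Remove `path` from HDFS, bypassing trash (needs the hdfs CLI on PATH)."""
    subprocess.call(rm_command(path, skip_trash=True))
```

Keeping the command construction in its own function makes it easy to log or dry-run the exact command before deleting anything.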

New Contributor
Thanks for the solution, it will help a lot, but I also need help building a Python script for this HDFS cleanup.
Please help with that as well.

New Contributor

Hi ggangadharan,

I followed the approach you suggested, but now I need help with the code to execute that HDFS-cleanup shell script from the Python script.

I am doing it as:

 

import subprocess

subprocess.call('./home/test.sh/', shell=True)

file = open('toDelete', 'r')

for each in file:
    subprocess.call(["hadoop", "fs", "-rm", "-f", each])

 

but now my shell script is not executing and not showing any output. Please suggest what I should do.

 

thanks
