Created 04-20-2022 05:21 AM
Hi guys,
Please help me build a Python script that cleans up HDFS directories older than 3 days.
Any suggestions are welcome.
Created 04-25-2022 11:15 PM
Hi,
From the shell, find the files that need to be deleted and save them to a temp file, like below:
#!/bin/sh
today=$(date +'%s')
hdfs dfs -ls /file/Path/ | while read line; do
    dir_date=$(echo "${line}" | awk '{print $6}')
    filePath=$(echo "${line}" | awk '{print $8}')
    # Skip the "Found N items" header and any line without a path
    [ -z "${filePath}" ] && continue
    # Age in days, based on the modification date column
    difference=$(( ( ${today} - $(date -d "${dir_date}" +%s) ) / ( 24*60*60 ) ))
    if [ ${difference} -gt 3 ]; then
        echo "$filePath" >> toDelete
    fi
done
Then run the delete command from Python using, for example, subprocess.call or the sh library, something like this:
import subprocess

# Each line in toDelete holds one HDFS path; strip the trailing newline
with open('toDelete', 'r') as file:
    for each in file:
        subprocess.call(["hadoop", "fs", "-rm", "-f", each.strip()])
Alternatively, you can use the Hadoop FileSystem API from PySpark, like below:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('abc').getOrCreate()

def delete_path(spark, path):
    sc = spark.sparkContext
    # Get a handle to the JVM-side Hadoop FileSystem through py4j
    fs = (sc._jvm.org
          .apache.hadoop
          .fs.FileSystem
          .get(sc._jsc.hadoopConfiguration())
          )
    # True makes the delete recursive
    fs.delete(sc._jvm.org.apache.hadoop.fs.Path(path), True)

delete_path(spark, "Your/hdfs/path")
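Building on the same FileSystem handle, a sketch along the lines below could restrict the delete to entries older than 3 days; listStatus and getModificationTime are standard Hadoop FileSystem calls, and the path is still a placeholder:

import time

def delete_older_than(spark, path, days=3):
    sc = spark.sparkContext
    jvm = sc._jvm
    fs = jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
    cutoff_ms = (time.time() - days * 24 * 60 * 60) * 1000
    for status in fs.listStatus(jvm.org.apache.hadoop.fs.Path(path)):
        # getModificationTime() returns milliseconds since the epoch
        if status.getModificationTime() < cutoff_ms:
            fs.delete(status.getPath(), True)

delete_older_than(spark, "Your/hdfs/path")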
Created on 04-22-2022 04:24 AM - edited 04-22-2022 04:25 AM
Hi,
The script below removes files that are older than 3 days from the given HDFS path:
#!/bin/sh
today=$(date +'%s')
hdfs dfs -ls /file/Path/ | while read line; do
    dir_date=$(echo "${line}" | awk '{print $6}')
    filePath=$(echo "${line}" | awk '{print $8}')
    # Skip the "Found N items" header and any line without a path
    [ -z "${filePath}" ] && continue
    # Age in days, based on the modification date column
    difference=$(( ( ${today} - $(date -d "${dir_date}" +%s) ) / ( 24*60*60 ) ))
    if [ ${difference} -gt 3 ]; then
        hdfs dfs -rm -r "$filePath"
    fi
done
The hdfs dfs -rm -r command moves the data to the trash folder if the trash mechanism is configured.
To delete permanently instead of moving the file to the trash folder, use the -skipTrash option.
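For example, calling it from Python with -skipTrash might look like this (the path is illustrative):

import subprocess

# -skipTrash bypasses the trash folder and deletes the path permanently
subprocess.call(["hdfs", "dfs", "-rm", "-r", "-skipTrash", "/file/Path/old_dir"])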
Created 04-29-2022 03:32 AM
Hi ggangadharan,
I followed the approach you suggested, but now I need help with code that executes that HDFS shell script from the Python script.
I tried it like this:
import subprocess

subprocess.call('./home/test.sh/', shell=True)

file = open('toDelete', 'r')
for each in file:
    subprocess.call(["hadoop", "fs", "-rm", "-f", each])
But now my shell script is not executing and not showing any output. Please suggest what I should do.
Thanks
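A minimal debugging sketch, assuming the script actually lives at /home/test.sh (the path here is an assumption, not the one from the post), that runs it and prints the return code and output so any failure becomes visible:

import subprocess

# Assumed location of the shell script from the post above; adjust as needed
script = "/home/test.sh"

# Run the script through sh and capture its output so failures are visible
result = subprocess.run(["sh", script], capture_output=True, text=True)
print("return code:", result.returncode)
print("stdout:", result.stdout)
print("stderr:", result.stderr)

# Read toDelete only after the script has finished writing it
with open("toDelete", "r") as f:
    for line in f:
        subprocess.call(["hadoop", "fs", "-rm", "-f", line.strip()])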