Created 04-20-2022 05:21 AM
Hi guys,
Please help me build a Python script that cleans up HDFS directories older than 3 days.
Any suggestions are welcome.
Created 04-25-2022 11:15 PM
Hi,
From the shell, find the files that need to be deleted and save them to a temp file, like below:
#!/bin/sh
today=$(date +'%s')
hdfs dfs -ls /file/Path/ | while read line; do
    dir_date=$(echo "${line}" | awk '{print $6}')
    filePath=$(echo "${line}" | awk '{print $8}')
    # Skip the "Found N items" header and any line without a path
    [ -z "${filePath}" ] && continue
    # Age in days, based on the modification date column
    difference=$(( ( ${today} - $(date -d "${dir_date}" +%s) ) / ( 24*60*60 ) ))
    if [ ${difference} -gt 3 ]; then
        echo "$filePath" >> toDelete
    fi
done
Then run the delete command from Python using, for example, subprocess.call or the sh library, something like this:
import subprocess

# Each line in toDelete holds one HDFS path; strip the trailing newline
with open('toDelete', 'r') as file:
    for each in file:
        subprocess.call(["hadoop", "fs", "-rm", "-f", each.strip()])
Alternatively, you can use the Hadoop FileSystem API from PySpark, like below:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('abc').getOrCreate()

def delete_path(spark, path):
    sc = spark.sparkContext
    # Get a handle to the JVM-side Hadoop FileSystem through py4j
    fs = (sc._jvm.org
          .apache.hadoop
          .fs.FileSystem
          .get(sc._jsc.hadoopConfiguration())
          )
    # True makes the delete recursive
    fs.delete(sc._jvm.org.apache.hadoop.fs.Path(path), True)

delete_path(spark, "Your/hdfs/path")
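Building on the same FileSystem handle, a sketch along the lines below could restrict the delete to entries older than 3 days; listStatus and getModificationTime are standard Hadoop FileSystem calls, and the path is still a placeholder:

import time

def delete_older_than(spark, path, days=3):
    sc = spark.sparkContext
    jvm = sc._jvm
    fs = jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
    cutoff_ms = (time.time() - days * 24 * 60 * 60) * 1000
    for status in fs.listStatus(jvm.org.apache.hadoop.fs.Path(path)):
        # getModificationTime() returns milliseconds since the epoch
        if status.getModificationTime() < cutoff_ms:
            fs.delete(status.getPath(), True)

delete_older_than(spark, "Your/hdfs/path")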
Created on 04-22-2022 04:24 AM - edited 04-22-2022 04:25 AM
Hi,
The script below removes files that are older than 3 days from the given HDFS path:
#!/bin/sh
today=$(date +'%s')
hdfs dfs -ls /file/Path/ | while read line; do
    dir_date=$(echo "${line}" | awk '{print $6}')
    filePath=$(echo "${line}" | awk '{print $8}')
    # Skip the "Found N items" header and any line without a path
    [ -z "${filePath}" ] && continue
    # Age in days, based on the modification date column
    difference=$(( ( ${today} - $(date -d "${dir_date}" +%s) ) / ( 24*60*60 ) ))
    if [ ${difference} -gt 3 ]; then
        hdfs dfs -rm -r "$filePath"
    fi
done
The hdfs dfs -rm -r command moves the data to the trash folder if the trash mechanism is configured.
To delete permanently instead of moving the file to the trash folder, use the -skipTrash option.
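For example, calling it from Python with -skipTrash might look like this (the path is illustrative):

import subprocess

# -skipTrash bypasses the trash folder and deletes the path permanently
subprocess.call(["hdfs", "dfs", "-rm", "-r", "-skipTrash", "/file/Path/old_dir"])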
Created 04-29-2022 03:32 AM
Hi ggangadharan,
I followed the approach you suggested, but now I need help with code that executes that HDFS shell script from the Python script.
I tried it like this:
import subprocess

subprocess.call('./home/test.sh/', shell=True)

file = open('toDelete', 'r')
for each in file:
    subprocess.call(["hadoop", "fs", "-rm", "-f", each])
But now my shell script is not executing and not showing any output. Please suggest what I should do.
Thanks
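A minimal debugging sketch, assuming the script actually lives at /home/test.sh (the path here is an assumption, not the one from the post), that runs it and prints the return code and output so any failure becomes visible:

import subprocess

# Assumed location of the shell script from the post above; adjust as needed
script = "/home/test.sh"

# Run the script through sh and capture its output so failures are visible
result = subprocess.run(["sh", script], capture_output=True, text=True)
print("return code:", result.returncode)
print("stdout:", result.stdout)
print("stderr:", result.stderr)

# Read toDelete only after the script has finished writing it
with open("toDelete", "r") as f:
    for line in f:
        subprocess.call(["hadoop", "fs", "-rm", "-f", line.strip()])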