HDFS dir cleanup for directories older than 7 days in a Python script
Labels: HDFS
Created ‎04-20-2022 05:21 AM
Hi guys,
Please help me build a Python script for cleaning up HDFS directories that are older than 3 days.
Any suggestions are appreciated.
Created on ‎04-22-2022 04:24 AM - edited ‎04-22-2022 04:25 AM
Hi,
The shell script below removes entries that are older than 3 days from the given HDFS path:
#!/bin/sh
# Current time in seconds since the epoch
today=$(date +'%s')

# Skip the "Found N items" header line, then check each entry's age
hdfs dfs -ls /file/Path/ | tail -n +2 | while read -r line ; do
    dir_date=$(echo "${line}" | awk '{print $6}')
    filePath=$(echo "${line}" | awk '{print $8}')
    # Age in days = (now - modification time) / seconds per day
    difference=$(( ( ${today} - $(date -d "${dir_date}" +%s) ) / ( 24*60*60 ) ))
    if [ "${difference}" -gt 3 ]; then
        hdfs dfs -rm -r "${filePath}"
    fi
done
The hdfs dfs -rm -r command moves the data to the trash folder if the trash mechanism is configured.
To bypass the trash and delete immediately, add the -skipTrash option.
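For example, a minimal sketch of the same permanent delete driven from Python; the path /file/Path/old_dir is only a placeholder for illustration:

import subprocess

# Permanently delete the directory, bypassing the HDFS trash.
# /file/Path/old_dir is a placeholder path, not from the original post.
subprocess.call(["hdfs", "dfs", "-rm", "-r", "-skipTrash", "/file/Path/old_dir"])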
Created ‎04-23-2022 12:33 AM
Can you also share a Python script for this HDFS cleanup?
Please help with that as well.
Created ‎04-25-2022 11:15 PM
Hi,
From the shell, find the files that need to be deleted and save them to a temp file, like below:
#!/bin/sh
# Current time in seconds since the epoch
today=$(date +'%s')

# Skip the "Found N items" header line, then record every path older than 3 days
hdfs dfs -ls /file/Path/ | tail -n +2 | while read -r line ; do
    dir_date=$(echo "${line}" | awk '{print $6}')
    filePath=$(echo "${line}" | awk '{print $8}')
    # Age in days = (now - modification time) / seconds per day
    difference=$(( ( ${today} - $(date -d "${dir_date}" +%s) ) / ( 24*60*60 ) ))
    if [ "${difference}" -gt 3 ]; then
        echo "$filePath" >> toDelete
    fi
done
Then execute the deletions from Python using, for example, subprocess.call or the sh library, something like below:
import subprocess

# Remove every path recorded in the toDelete file by the shell script above
with open('toDelete', 'r') as file:
    for each in file:
        # -r is needed for directories; strip the trailing newline before passing the path
        subprocess.call(["hadoop", "fs", "-rm", "-r", "-f", each.strip()])
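If you would rather keep everything in a single Python script, here is a minimal sketch under the same assumptions as the shell version above (the /file/Path/ directory and the 3-day threshold are placeholders); it parses the hdfs dfs -ls output directly instead of going through a temp file, and it assumes none of the paths contain spaces:

import subprocess
from datetime import datetime, timedelta

# Placeholder directory and threshold, matching the shell script above
HDFS_DIR = "/file/Path/"
cutoff = datetime.now() - timedelta(days=3)

# hdfs dfs -ls prints: permissions replicas owner group size date time path
listing = subprocess.run(["hdfs", "dfs", "-ls", HDFS_DIR],
                         capture_output=True, text=True, check=True)

for line in listing.stdout.splitlines():
    fields = line.split()
    if len(fields) < 8:   # skip the "Found N items" header line
        continue
    modified = datetime.strptime(fields[5] + " " + fields[6], "%Y-%m-%d %H:%M")
    if modified < cutoff:
        subprocess.call(["hdfs", "dfs", "-rm", "-r", fields[7]])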
Also, you can use the Hadoop FileSystem API from PySpark, like below:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('abc').getOrCreate()

def delete_path(spark, path):
    # Get the Hadoop FileSystem bound to the current Hadoop configuration via the JVM gateway
    sc = spark.sparkContext
    fs = (sc._jvm.org
          .apache.hadoop
          .fs.FileSystem
          .get(sc._jsc.hadoopConfiguration())
          )
    # True => delete recursively
    fs.delete(sc._jvm.org.apache.hadoop.fs.Path(path), True)

delete_path(spark, "Your/hdfs/path")
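Building on that, a sketch of the age-based cleanup with the same FileSystem API; the /your/hdfs/dir path, the app name, and the 3-day threshold are placeholders:

import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('hdfs_cleanup').getOrCreate()

def delete_older_than(spark, dir_path, days=3):
    # Delete children of dir_path whose modification time is older than `days` days
    sc = spark.sparkContext
    jvm = sc._jvm
    fs = jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
    cutoff_ms = (time.time() - days * 24 * 60 * 60) * 1000

    # listStatus returns one FileStatus per entry; getModificationTime is in milliseconds
    for status in fs.listStatus(jvm.org.apache.hadoop.fs.Path(dir_path)):
        if status.getModificationTime() < cutoff_ms:
            fs.delete(status.getPath(), True)   # True => recursive delete

delete_older_than(spark, "/your/hdfs/dir", days=3)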
Created ‎04-29-2022 03:32 AM
Hi ggangadharan,
I followed the approach you suggested, but now I need help with the code to execute that HDFS cleanup shell script from a Python script.
I tried something like this:
import subprocess
subprocess.call('./home/test.sh/', shell=True)
file = open('toDelete', 'r')
for each in file:
    subprocess.call(["hadoop", "fs", "-rm", "-f", each])
but my shell script is not executing and not showing any output. Please suggest what I should do.
Thanks
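For reference, a minimal sketch of one way to wire this up, assuming the script actually lives at /home/test.sh (the leading dot and trailing slash in the path above would stop it from running) and that it writes the toDelete file into the current working directory:

import subprocess

# Run the cleanup script through sh and surface its output so failures are visible.
# /home/test.sh is an assumed location; adjust to wherever the script really is.
result = subprocess.run(["sh", "/home/test.sh"], capture_output=True, text=True)
print(result.stdout)
print(result.stderr)

# Then remove every path the script recorded in the toDelete file
with open("toDelete", "r") as paths:
    for path in paths:
        subprocess.call(["hdfs", "dfs", "-rm", "-r", "-f", path.strip()])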
