NiFi processor that deletes the previous day's files in HDFS.

I am planning a flow with a processor that executes a query on Hive and stores the results in HDFS as a CSV file, with a timestamp as the file name. I want to run that job every 24 hours. In parallel, I want a processor that deletes the previous day's files from HDFS every day.

-- For this I need a processor that puts the timestamp in the output file name and a processor that deletes files from HDFS.

@Matt Burgess @Shu

1 ACCEPTED SOLUTION

Super Guru
@Sai Krishna Makineni

You can use either the ListHDFS (or) GetHDFSFileInfo processor. GetHDFSFileInfo does not store state, so you can schedule it to run nightly. Once the files are listed from HDFS, check their age in a RouteOnAttribute processor, using either the hdfs.lastModified attribute (or) the timestamp in the filename, extracted with the substringAfter function.

Once you have filtered out the files that are older than your cutoff, feed them to a DeleteHDFS processor to delete them.
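The age check that RouteOnAttribute performs can be sketched in Python to make the logic concrete. This is only an illustration, not NiFi code: the 24-hour cutoff, the function names, and the filename pattern `result_<yyyyMMddHHmmss>.csv` are assumptions, since the thread does not specify them. It also assumes hdfs.lastModified holds milliseconds since the epoch.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention period; the thread only says "previous day's" files.
MAX_AGE = timedelta(hours=24)


def is_expired_by_mtime(last_modified_ms: str, now: datetime) -> bool:
    """Check age from hdfs.lastModified, assumed to be epoch milliseconds."""
    modified = datetime.fromtimestamp(int(last_modified_ms) / 1000, tz=timezone.utc)
    return now - modified > MAX_AGE


def is_expired_by_filename(filename: str, now: datetime) -> bool:
    """Check age from a timestamp embedded in the filename.

    Assumes a hypothetical naming scheme such as "result_20190301060000.csv",
    with the timestamp in UTC.
    """
    # Drop the extension, then keep what follows the first underscore,
    # which mirrors what substringAfter('_') would return.
    stamp = filename.rsplit(".", 1)[0].split("_", 1)[1]
    created = datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return now - created > MAX_AGE
```

In RouteOnAttribute itself the same comparison is written in NiFi Expression Language. Something along the lines of `${hdfs.lastModified:lt(${now():toNumber():minus(86400000)})}` is a common form, but verify it against your NiFi version before relying on it.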

Note that the ListHDFS processor stores state and lists only incrementally, so files it has already listed will not be listed again. If you want it to list them again, clear the state through the REST API with

/processors/{id}/state/clear-requests

and then run the processor again once the state is cleared.
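A minimal sketch of calling that endpoint from Python, using only the standard library. The base URL `http://localhost:8080/nifi-api`, the function names, and the absence of authentication are assumptions for an unsecured local instance; a secured NiFi would also need credentials or a token, which this sketch does not handle.

```python
import urllib.request

# Assumed base URL of an unsecured local NiFi instance; adjust for your setup.
NIFI_API = "http://localhost:8080/nifi-api"


def build_clear_state_request(
    processor_id: str, base_url: str = NIFI_API
) -> urllib.request.Request:
    """Build the POST request that clears a processor's stored state."""
    url = f"{base_url.rstrip('/')}/processors/{processor_id}/state/clear-requests"
    return urllib.request.Request(url, data=b"", method="POST")


def clear_state(processor_id: str) -> int:
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(build_clear_state_request(processor_id)) as resp:
        return resp.status
```

NiFi generally expects the processor to be stopped before its state can be cleared, so a scripted run would stop ListHDFS, call `clear_state` with its id, and then start it again.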

Flow:

1.ListHDFS
2.RouteOnAttribute //check the filename (or) lastModified time
3.DeleteHDFS //delete the files in HDFS

Flow:

1.GenerateFlowFile
2.GetHDFSFileInfo
3.RouteOnAttribute
4.DeleteHDFS

(or)

You can use the GetHDFS processor (with Keep Source File set to true), which doesn't store state. However, this processor fetches the file contents from HDFS, so if the files are big it puts a lot of load on NiFi.
