NiFi processor that deletes previous days' files in HDFS.
Labels: Apache NiFi
Created 08-15-2018 04:55 PM
I am planning to set up a processor that executes a query on Hive and stores the results in HDFS as a CSV file, with a timestamp as the file name. I want to run the same job every 24 hours. In parallel, I want a processor that deletes the previous days' files from HDFS every day.
-- For this I need a processor that puts the timestamp in the output file name and a processor that deletes files from HDFS.
Created 08-15-2018 09:45 PM
You can use either the ListHDFS or the GetHDFSFileInfo processor (GetHDFSFileInfo does not store state) and schedule it to run nightly. Once the files are listed from HDFS, you can use the hdfs.lastModified attribute, or extract the timestamp from your filename with the substringAfter function, and check the value in a RouteOnAttribute processor.
Once you have filtered out the files that are older than your chosen cutoff, feed them to a DeleteHDFS processor to delete them.
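The age check that RouteOnAttribute performs can be sketched in Python to make the comparison explicit. This is only an illustration of the logic, not NiFi code: the `hive_export_` prefix, the `%Y%m%d%H%M%S` timestamp format, and the one-day cutoff are assumptions, and it relies on hdfs.lastModified being an epoch timestamp in milliseconds. In NiFi itself the first check would be written in Expression Language, something along the lines of `${hdfs.lastModified:toNumber():lt(${now():toNumber():minus(86400000)})}`.

```python
from datetime import datetime, timezone

ONE_DAY_MS = 24 * 60 * 60 * 1000  # assumed cutoff: 24 hours


def is_older_than(last_modified_ms: int, now_ms: int,
                  max_age_ms: int = ONE_DAY_MS) -> bool:
    """Option 1: compare the hdfs.lastModified value (epoch millis) to now."""
    return now_ms - last_modified_ms > max_age_ms


def timestamp_from_filename(filename: str,
                            prefix: str = "hive_export_",   # hypothetical prefix
                            fmt: str = "%Y%m%d%H%M%S") -> datetime:
    """Option 2: take the text after the prefix (like substringAfter) and parse it."""
    stem = filename.rsplit(".", 1)[0]           # drop the .csv extension
    _, found, stamp = stem.partition(prefix)    # text after the prefix
    if not found:
        raise ValueError(f"prefix {prefix!r} not found in {filename!r}")
    return datetime.strptime(stamp, fmt).replace(tzinfo=timezone.utc)


def to_millis(moment: datetime) -> int:
    """Convert an aware datetime to epoch milliseconds."""
    return int(moment.timestamp() * 1000)
```

For example, with "now" fixed at 2018-08-16 00:00 UTC, a file named `hive_export_20180814120000.csv` parses to 2018-08-14 12:00 UTC and counts as older than one day, so it would be routed to DeleteHDFS.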
In addition, the ListHDFS processor stores state and only lists incrementally, so if you want it to list the same files again, clear the state through the REST API with
/processors/{id}/state/clear-requests
and then run the processor once the state has been cleared.
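As a rough sketch, the clear-state call could be prepared from Python like this. The base URL `http://localhost:8080/nifi-api` and the processor id are placeholders, authentication is left out, and the helper name is made up for illustration. The request is a POST to the endpoint above, and the processor most likely needs to be stopped before its state can be cleared.

```python
import urllib.request


def build_clear_state_request(api_base: str,
                              processor_id: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the processor's clear-state endpoint."""
    url = f"{api_base.rstrip('/')}/processors/{processor_id}/state/clear-requests"
    # Empty body; method is set explicitly so the request is sent as POST.
    return urllib.request.Request(url, data=b"", method="POST")


# Placeholder values: replace with your NiFi host and the ListHDFS processor id.
request = build_clear_state_request("http://localhost:8080/nifi-api",
                                    "your-processor-id")
# To actually send it: urllib.request.urlopen(request)
```

Building the request separately from sending it makes it easy to check the URL before anything touches the running flow.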
Flow:
1.ListHDFS
2.RouteOnAttribute //check the filename (or) lastModified time
3.DeleteHDFS //delete the files in HDFS
Flow:
1.GenerateFlowFile
2.GetHDFSFileInfo
3.RouteOnAttribute
4.DeleteHDFS
(or)
You can use the GetHDFS processor (with Keep Source File set to true), which doesn't store state. However, this processor fetches the file contents from HDFS, so if the files are big it puts a lot of load on NiFi.
