Created 08-15-2018 04:55 PM
I am planning to set up a processor that runs a query on Hive and stores the results in HDFS as a CSV file, with a timestamp as the file name. I want to run this job every 24 hours. In parallel, I want a processor that deletes the previous day's records from HDFS every day.
-- For this I need a processor that names the output file with a timestamp and a processor that deletes files from HDFS.
Created 08-15-2018 09:45 PM
You can use either the ListHDFS (or) GetHDFSFileInfo processor (GetHDFSFileInfo does not store state) and schedule it to run nightly. Once the files in HDFS are listed, you can check either the hdfs.lastModified attribute (or) the timestamp in your filename, extracted with the substringAfter function, in a RouteOnAttribute processor.
Once you have filtered out the files that are older than a specific time, feed them to a DeleteHDFS processor to delete them.
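To make the age check concrete, here is a minimal Python sketch of the decision that RouteOnAttribute would make. It is only an illustration of the logic, not NiFi configuration. It assumes hdfs.lastModified is in epoch milliseconds, and the `results_` prefix and `%Y%m%d%H%M%S` timestamp pattern are hypothetical naming choices, so adjust them to however you name your files.

```python
from datetime import datetime, timedelta, timezone


def is_older_than(last_modified_ms, now, max_age=timedelta(hours=24)):
    """Return True when the file was last modified more than max_age before now.

    last_modified_ms: epoch milliseconds (assumed format of hdfs.lastModified).
    now: timezone-aware datetime used as the reference point.
    """
    modified = datetime.fromtimestamp(last_modified_ms / 1000, tz=timezone.utc)
    return now - modified > max_age


def timestamp_from_filename(filename, prefix="results_", fmt="%Y%m%d%H%M%S"):
    """Parse the timestamp that follows a (hypothetical) prefix in the filename.

    Raises IndexError if the prefix is missing and ValueError if the
    remaining text does not match fmt.
    """
    stem = filename.rsplit(".", 1)[0]   # drop the extension, e.g. ".csv"
    stamp = stem.split(prefix, 1)[1]    # same idea as substringAfter(prefix)
    return datetime.strptime(stamp, fmt).replace(tzinfo=timezone.utc)
```

Files for which the check returns True would go to the relationship connected to DeleteHDFS; everything else can be auto-terminated.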
In addition, the ListHDFS processor stores state and runs only incrementally, so if you want it to list the same files again, clear the state using the REST API with
/processors/{id}/state/clear-requests
Then run the processor once the state has been cleared.
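As a rough sketch of that REST call in Python: the example below only builds the request and leaves the actual send commented out. The `http://localhost:8080/nifi-api` base URL and the processor id are placeholders, it assumes an unsecured NiFi instance (a secured one would also need authentication), and as far as I know the processor has to be stopped before its state can be cleared.

```python
import urllib.request


def build_clear_state_request(base_url, processor_id):
    """Build (but do not send) the POST request that clears a processor's state."""
    url = f"{base_url.rstrip('/')}/processors/{processor_id}/state/clear-requests"
    return urllib.request.Request(url, data=b"", method="POST")


if __name__ == "__main__":
    # Placeholder values: use your own NiFi host and the id of the ListHDFS processor.
    req = build_clear_state_request("http://localhost:8080/nifi-api", "PROCESSOR_ID")
    print(req.get_method(), req.full_url)
    # To actually send it (stop the processor first):
    # with urllib.request.urlopen(req) as resp:
    #     print(resp.status)
```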
Flow:
1.ListHDFS
2.RouteOnAttribute //check the filename (or) lastmodified time
3.DeleteHDFS //delete the files in hdfs
Flow:
1.GenerateFlowFile
2.GetHDFSFileInfo
3.RouteOnAttribute
4.DeleteHDFS
(or)
You can use the GetHDFS processor (with Keep Source File set to true), which doesn't store state. However, this processor fetches the file contents from HDFS, so if the files are big it puts a lot of load on NiFi.