Currently, we are trying to separate the files flowing through NiFi into HDFS by minute, using the PutHDFS configurable property Conflict Resolution Strategy: Append. We use ExtractText to retrieve the minute from each event's timestamp, save it into an attribute, and build the filename from that attribute (among others).
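For context, the minute attribute we build looks roughly like the following (the attribute names and the parse pattern are placeholders; the exact format depends on how the timestamp appears in your events):

```
# Hypothetical UpdateAttribute property using NiFi Expression Language:
# event.minute = ${event.timestamp:toDate("yyyy-MM-dd'T'HH:mm:ss"):format('yyyyMMddHHmm')}
#
# PutHDFS filename then includes ${event.minute}, so each minute of
# event time maps to one target file.
```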
We find that, within a margin of 2-6 seconds, events from the next minute end up in the previous minute's file. This will cause problems down the road when searching the data by minute.
Has anyone else run into this issue using this method? Is there a configurable property in ExtractText or UpdateAttribute that would let us deposit the events into the correct minute more reliably?
We also keep getting odd filenames with missing parts, for example:
Although I will say I notice the malformed filenames more often when the data is produced into a Kafka topic that already contains data rather than into a brand-new topic... weird.
Without seeing the full data flow, my initial thought would be to try a MergeContent processor, using a variant of your timestamp attribute as the Correlation Attribute Name. All flow files with the same correlation attribute should be binned together; then just write the resulting set of merged flow files out to HDFS.
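As a sketch, the MergeContent configuration might look like this (the attribute name and the numeric limits are assumptions you would tune for your volume):

```
MergeContent:
  Merge Strategy: Bin-Packing Algorithm
  Correlation Attribute Name: event.minute    # assumed attribute holding the event's minute
  Minimum Number of Entries: 1
  Maximum Number of Entries: 10000
  Max Bin Age: 90 sec    # flush a minute's bin shortly after that minute closes
```

Max Bin Age matters here: it is what forces a partially filled bin out once the minute has passed, so late-arriving events from the next minute start a new bin instead of appending to the old file.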
Correct; the correlation would happen in MergeContent, with the result written out afterwards, although you may be able to use both approaches if your batch sizes are going to be fairly large.
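The binning behaviour MergeContent provides can be illustrated with a short sketch (plain Python, not NiFi): because each event is keyed by the minute of its *own* timestamp rather than by arrival time, events that straggle across a minute boundary still land in the right group.

```python
from collections import defaultdict
from datetime import datetime, timezone

def bin_by_minute(events):
    """Group (epoch_seconds, payload) events by the minute of their
    event timestamp, mirroring MergeContent's correlation-attribute binning."""
    bins = defaultdict(list)
    for ts, payload in events:
        minute_key = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y%m%d%H%M")
        bins[minute_key].append(payload)
    return dict(bins)

# Events near a minute boundary (22:13:59 vs 22:14:01 UTC) are separated
# by event time, even though they arrive back to back.
bins = bin_by_minute([(1700000039, "a"), (1700000041, "b"), (1700000030, "c")])
print(bins)  # → {'202311142213': ['a', 'c'], '202311142214': ['b']}
```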