Created 05-11-2017 03:08 PM
Currently, we are trying to separate files going through NiFi into HDFS by minute using the PutHDFS configurable property Conflict Resolution Strategy: Append. We use ExtractText to retrieve the minute from each event's timestamp, save it into an attribute, and build the filename from that attribute (among others).
We find that, within a margin of 2-6 seconds, events from the next minute end up in the previous minute's file. This will cause problems down the road when searching through the data by minute.
Has anyone else run into this issue using this method? Is there a configurable property in ExtractText or UpdateAttribute that would give us more granular, correct depositing of the events?
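For reference, here is a minimal sketch of the minute-extraction idea. The event timestamp format and the regex are assumptions for illustration, not your actual ExtractText configuration:

```python
import re

# Hypothetical event line; the real timestamp format in your events may differ.
event = '2017-05-11 13:25:07,123 INFO some message'

# Regex analogous to an ExtractText pattern capturing the date/time parts.
m = re.search(r'(\d{4})-(\d{2})-(\d{2}) (\d{2}):(\d{2})', event)
if m:
    year, month, day, hour, minute = m.groups()
    # Build the per-minute filename the way UpdateAttribute might with
    # Expression Language, e.g. ${year}_${month}_${day}_${hour}_${minute}_topic.log
    filename = f'{year}_{month}_{day}_{hour}_{minute}_topic.log'
else:
    filename = None

print(filename)  # 2017_05_11_13_25_topic.log
```

Note that this keys the filename on the event's own timestamp, so the grouping depends entirely on when the regex is applied, not on when the file is written.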
Thanks
Created 05-11-2017 05:26 PM
We also keep getting odd filenames with missing parts. For example:
Correct file:
2017_05_11_13_25_topic.log
Incorrect:
2017_05_11_13__topics.log
Although I will say, I notice the incorrect version of the filename more often when the data is produced to a Kafka topic that already has data in it, rather than to a new topic... weird.
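One possible explanation (an assumption, since the flow isn't visible here): if the ExtractText pattern for the minute fails to match on some events, that attribute is never set, and referencing a missing attribute in Expression Language yields an empty string, which would produce the doubled underscore. A sketch of that failure mode, with hypothetical patterns and events:

```python
import re

def extract(pattern, text):
    """Mimic one ExtractText capture: return the captured group,
    or '' when the pattern does not match (an unset attribute
    evaluates to an empty string in Expression Language)."""
    m = re.search(pattern, text)
    return m.group(1) if m else ''

event_ok  = '2017-05-11T13:25:07 payload'
event_bad = '2017-05-11T13:2 payload'   # truncated timestamp (hypothetical)

for event in (event_ok, event_bad):
    hour   = extract(r'T(\d{2}):', event)
    minute = extract(r':(\d{2}):', event)
    # When the minute capture fails, the empty attribute leaves a
    # doubled underscore, like the bad filenames above.
    print(f'2017_05_11_{hour}_{minute}_topic.log')
```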
Created 05-12-2017 12:53 PM
Without seeing the full data flow, my initial thought would be to try a MergeContent processor, using a variant of your timestamp attribute as the Correlation Attribute Name. All flow files with the same correlation attribute value should be grouped together; then just write the resulting merged flow files out to HDFS.
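The grouping MergeContent performs can be sketched like this. This is a simplified model, not NiFi code; the flow files are represented as dicts and the attribute name `minute` is an assumption for illustration:

```python
from collections import defaultdict

# Simplified model of MergeContent with Correlation Attribute Name = 'minute'.
flowfiles = [
    {'minute': '2017_05_11_13_25', 'content': 'event A\n'},
    {'minute': '2017_05_11_13_26', 'content': 'event B\n'},
    {'minute': '2017_05_11_13_25', 'content': 'event C\n'},
]

# Bin flow files by the correlation attribute value.
bins = defaultdict(list)
for ff in flowfiles:
    bins[ff['minute']].append(ff['content'])

# Each bin becomes one merged flow file, which PutHDFS would then write out.
merged = {minute: ''.join(parts) for minute, parts in bins.items()}
for minute, content in sorted(merged.items()):
    print(f'{minute}_topic.log -> {content!r}')
```

In the real processor you would also tune the bin properties (minimum/maximum entries, max bin age) so a minute's bin is flushed promptly rather than waiting indefinitely.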
Created 05-12-2017 01:15 PM
Thanks, I can try something like this. In place of the PutHDFS Conflict Resolution Strategy of Append, you mean?
Created 05-12-2017 02:29 PM
Correct; the correlation would happen in the MergeContent processor, and the result would then be written out, although you may be able to use both if your batch sizes are going to be pretty large.