We have a recurring issue we haven't been able to solve.
Take a look at our flow. We're tailing logfiles, extracting date and time attributes, and using them as the filename when writing to HDFS.
It mostly works but we keep intermittently getting filenames that are missing the minutes field.
2017_05_16_13__fozziesplunkr.log <-- this is such a file. It contains entries from minutes 55, 56, and 57, so it feels like a "catch-all" file.
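One way to see where the double underscore could come from: if the filename is assembled from separate attributes and the "minute" attribute is ever empty, the pattern collapses exactly as shown. This is a minimal sketch, assuming a hypothetical `{year}_{month}_{day}_{hour}_{minute}_{source}.log` template (our actual UpdateAttribute expression is in the screenshots):

```python
# Hypothetical illustration only: build the HDFS filename from attributes.
# If the "minute" attribute was never populated (e.g. the extraction regex
# did not match), the result has the double underscore seen above.
def build_filename(attrs):
    return "{year}_{month}_{day}_{hour}_{minute}_{source}.log".format(**attrs)

good = {"year": "2017", "month": "05", "day": "16",
        "hour": "13", "minute": "55", "source": "fozziesplunkr"}
bad = dict(good, minute="")  # attribute exists but is empty

print(build_filename(good))  # 2017_05_16_13_55_fozziesplunkr.log
print(build_filename(bad))   # 2017_05_16_13__fozziesplunkr.log
```

If this matches what's happening, the question becomes why the minute attribute is occasionally empty for some flowfiles.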
Our setup: 6 hosts sending to 3 topics, 2 hosts per topic. The test data generates about 10k messages per second, 1 million events in total.
On our sending side, we are doing TailFile -> ControlRate (1 MB) -> PublishKafka (this part seems to work well).
On our receiving side, we've attached screenshots of the flow for one of our topics, showing every processor and its configuration tabs.
We use ConsumeKafka -> ExtractText -> UpdateAttribute (regex for timestamp from log) -> MergeContent -> UpdateAttribute (create filename) -> PutHDFS.
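A hedged sketch of the ExtractText step, since a silently non-matching regex would explain empty timestamp attributes. The log line formats and the pattern below are assumptions for illustration, not our actual ExtractText pattern (that is in the screenshots):

```python
import re

# Hypothetical timestamp pattern; a line whose timestamp is written in a
# slightly different format fails to match, and the capture groups (and
# therefore the downstream attributes) are never populated.
pattern = re.compile(r"(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2})")

def extract(line):
    m = pattern.search(line)
    if not m:
        # In NiFi, ExtractText would route this flowfile to "unmatched";
        # if "unmatched" is auto-terminated or looped back, the attributes
        # simply stay unset.
        return None
    year, month, day, hour, minute = m.groups()
    return {"hour": hour, "minute": minute}

print(extract("2017-05-16T13:55:02 host app: message"))
print(extract("May 16 13:55:02 host app: message"))  # different format -> None
```

It may also be worth checking whether MergeContent is binning flowfiles whose minute attributes differ (e.g. no Correlation Attribute set), since a merged bundle spanning minutes 55-57 would match the "catch-all" behavior.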
All of these have screenshots as shown. If anyone has had this problem and has any idea on a solution, that'd be welcome. We've tried all kinds of performance tweaks without success, and the NiFi logs show no warnings or errors.
Flow Overview and Odd missing minute in Filenames
ConsumeKafka Processor and Update Attribute (Create Filename) Processor
ExtractText (Extract from Syslog - Regex) Processor and UpdateAttribute (assign to attributes) Processor
Sorry, I was having problems uploading; it kept timing out, so I guess all three copies ended up posting. How do I close the duplicates? The delete option is not available (I think that's because this question has a comment on it, namely this comment here...).