Created on 05-03-2017 04:56 PM - edited 08-17-2019 07:37 PM
So here is our setup.
Server 1: TailFile -> PublishKafka
Server 2: ConsumeKafka -> ExtractText -> Update Attribute -> MergeContent -> UpdateAttribute (create filename) -> PutHDFS
We currently parse the timestamp out of each file and save its parts as attributes using the ExtractText processor, so we can build our filename and HDFS directories from those attributes in this format:
Examples: May_03_16_39 (May 3rd at 16:39)
May_03_16_40 (May 3rd at 16:40)
May_03_16_41 (May 3rd at 16:41)
Our directory structure goes down to the minute: 2017/May/03/16/39
What we see is that during file rollover, a few seconds of data from the end of one file and a few seconds from the beginning of the next file end up in a file named May_03_16_ (missing the minute).
Please see the screenshots of the file structure output, the PutHDFS config, and the UpdateAttribute (create filename) config; if anything else would help, let me know.
We are using the append function in PutHDFS to append all files from the same minute into a single file.
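One way the truncated May_03_16_ name could arise: if the minute fails to extract for some record (for example, a partial line caught mid-rollover), joining the timestamp attributes still runs and produces exactly that name. A minimal Python sketch, assuming the filename is built by joining month/day/hour/minute attributes with `_` (the attribute names and the empty-minute scenario are assumptions for illustration):

```python
# Hypothetical mimic of NiFi's allAttributes(...):join("_") filename expression.
# Attribute names and the missing-minute case are assumptions, not taken
# from the actual flow configuration.
def build_filename(attrs):
    parts = [attrs.get(k, "") for k in
             ("syslog_month", "syslog_day", "syslog_hour", "syslog_minute")]
    return "_".join(parts)

# Normal record: every attribute extracted.
print(build_filename({"syslog_month": "May", "syslog_day": "03",
                      "syslog_hour": "16", "syslog_minute": "39"}))
# -> May_03_16_39

# Record where the minute failed to extract:
print(build_filename({"syslog_month": "May", "syslog_day": "03",
                      "syslog_hour": "16"}))
# -> May_03_16_
```

The second result matches the observed filename, which is consistent with the minute attribute being empty on the flowfiles that land there.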
Created 05-03-2017 05:07 PM
It looks like you are using ${now()} rather than the syslog datetime for the rollover. Is there a reason for that? I would expect you to use the source (syslog) timestamp for the rollover, so that each HDFS file contains only matching timestamps.
Created 05-03-2017 07:42 PM
Thanks for the idea. I corrected that but am still seeing the same behavior.
Created 05-03-2017 07:47 PM
Can you post the full date format used?
Created on 05-04-2017 06:24 PM - edited 08-17-2019 07:37 PM
Are you referring to the string used to separate the date in PutHDFS?
/topics/minifitest/${allAttributes("syslog_year", "syslog_month", "syslog_day", "syslog_hour"):join("/")}
Here is our date format example: 2017-05-04 17:15:14,655
We split 2017 into syslog_year, 05 into syslog_month, 04 into syslog_day, 17 into syslog_hour, 15 into syslog_minute, and so on.
Ultimately we use this string to generate the filename:
${allAttributes("syslog_year", "syslog_month", "syslog_day", "syslog_hour", "syslog_minute"):join("_")}
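A hedged Python sketch of the extraction and joins described above (the regex and its group names are assumptions standing in for the actual ExtractText pattern; the directory and filename joins mirror the two expressions quoted in this thread):

```python
import re

# Assumed regex for the '2017-05-04 17:15:14,655' timestamp format;
# the flow's real ExtractText properties may differ.
TS = re.compile(
    r"(?P<syslog_year>\d{4})-(?P<syslog_month>\d{2})-(?P<syslog_day>\d{2}) "
    r"(?P<syslog_hour>\d{2}):(?P<syslog_minute>\d{2}):(?P<syslog_second>\d{2})"
)

attrs = TS.search("2017-05-04 17:15:14,655 INFO some log message").groupdict()

# Equivalent of /topics/minifitest/${allAttributes("syslog_year", ...,
# "syslog_hour"):join("/")}
directory = "/topics/minifitest/" + "/".join(
    attrs[k] for k in ("syslog_year", "syslog_month", "syslog_day", "syslog_hour"))

# Equivalent of the minute-level filename join
filename = "_".join(
    attrs[k] for k in ("syslog_year", "syslog_month", "syslog_day",
                       "syslog_hour", "syslog_minute"))

print(directory)  # /topics/minifitest/2017/05/04/17
print(filename)   # 2017_05_04_17_15
```

With the sample timestamp from this post, the directory resolves to /topics/minifitest/2017/05/04/17 and the filename to 2017_05_04_17_15.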
Everything parses into the directories correctly, but files spanning three minutes end up in three correctly named folders (see screenshot), each with a stray file whose name is missing the minute and contains the displaced chunks of data...
Created on 05-04-2017 06:46 PM - edited 08-17-2019 07:37 PM
I suspect there is a connection between the number of messages being sent and the Run Duration of our ExtractText processor (see screenshot).
Here is why:
At 10,000 messages/second sent to the Kafka topic (1,000,000 total), we always see the odd displaced data in a filename missing the minute, regardless of whether the Run Duration is 500 ms, 1 s, or 2 s. (We also moved it off the lowest value because that was causing intermittent data loss.)
At 1,000 messages/second (100,000 total) with the Run Duration set to 1 s, the files come out perfect, exactly the way we want them.
Our ultimate use case is to send considerably more than 10,000 messages/second, so maybe this helps shed some light.
Created 05-04-2017 07:07 PM
Changing Concurrent Tasks on ExtractText to 3 and reducing the Run Duration to 500 ms fixed the problem.