Unusual data placement on file rollover in Nifi - HDFS

avatar
Expert Contributor

[5 screenshots attached: file structure output, PutHDFS config, UpdateAttribute (create filename) config]

So here is our setup.

Server 1: TailFile -> PublishKafka

Server 2: ConsumeKafka -> ExtractText -> UpdateAttribute -> MergeContent -> UpdateAttribute (create filename) -> PutHDFS

We currently parse the timestamp out of the log data with the ExtractText processor and save its parts as attributes, which we then use to build the filename and the HDFS directories in this format:

Examples: May_03_16_39 (May 3rd at 16:39)

May_03_16_40 (May 3rd at 16:40)

May_03_16_41 (May 3rd at 16:41)

Our directory structure goes down to the minute: 2017/May/03/16/39

What we see is that during the file rollover, a few seconds of data from the end of one file and a few seconds from the beginning of the next end up in a file named May_03_16_ (the minute is missing).

Please see the screenshots of the file structure output, the PutHDFS config, and the UpdateAttribute (create filename) config. If anything else would help, let me know.

We are using the append option in PutHDFS to write all data from the same minute into a single file.
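For reference, the intended per-minute layout can be sketched in Python. This is only an illustration of the target behaviour, not the NiFi flow itself: `minute_target` and `append_by_minute` are hypothetical helper names, and the timestamp layout is assumed to be the one in our log lines (for example 2017-05-03 16:39:58,120).

```python
from collections import defaultdict
from datetime import datetime

# Assumed layout of the log timestamp, e.g. "2017-05-03 16:39:58,120".
TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def minute_target(timestamp):
    """Return (directory, filename) for one record, keyed by its minute."""
    ts = datetime.strptime(timestamp, TS_FORMAT)
    return ts.strftime("%Y/%b/%d/%H/%M"), ts.strftime("%b_%d_%H_%M")


def append_by_minute(records):
    """Group (timestamp, line) pairs the way a per-minute append would."""
    files = defaultdict(list)
    for timestamp, line in records:
        directory, filename = minute_target(timestamp)
        files[directory + "/" + filename].append(line)
    return dict(files)


records = [
    ("2017-05-03 16:39:58,120", "a"),
    ("2017-05-03 16:39:59,900", "b"),
    ("2017-05-03 16:40:00,015", "c"),
]
grouped = append_by_minute(records)
```

With this grouping, every record lands in exactly one file per minute, and no file name is ever missing its minute. That is the behaviour we expected from the flow.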

1 ACCEPTED SOLUTION

avatar
Expert Contributor

I suspect there is a connection between the message rate and the Run Duration setting in our ExtractText processor (see screenshot 15065-screen-shot-2017-05-04-at-24133-pm.png).

Here is why:

At 10,000 messages per second sent to the Kafka topic (1,000,000 in total), we always see the displaced data in a file whose name is missing the minute, whether the Run Duration is 500 ms, 1 s, or 2 s. (We had also raised the Run Duration from its lowest value because that setting was causing intermittent data loss.)

At 1,000 messages per second (100,000 in total), with the Run Duration set to 1 s, the files are perfect, exactly the way we want them.

Our ultimate use case is to send considerably more than 10,000 messages per second, so maybe this will help shed some light.
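To put the two tests side by side, here is the arithmetic as a small Python sketch. The function names are hypothetical, and the second function only counts how many messages arrive during one Run Duration interval at a given rate. It makes no claim about how NiFi actually batches work internally.

```python
def run_length_seconds(total_messages, rate_per_second):
    """How long a test takes to send all of its messages."""
    return total_messages / rate_per_second


def messages_per_window(rate_per_second, run_duration_seconds):
    """Messages arriving during one Run Duration interval (arrival count only)."""
    return rate_per_second * run_duration_seconds


fast_test = run_length_seconds(1_000_000, 10_000)   # high-rate test
slow_test = run_length_seconds(100_000, 1_000)      # low-rate test

fast_window = messages_per_window(10_000, 0.5)      # Run Duration 500 ms
slow_window = messages_per_window(1_000, 1.0)       # Run Duration 1 s
```

Both tests run for the same 100 seconds, so the difference between them is the rate and not the length of the test.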

REPLIES

avatar

Looks like you are using $now rather than the syslog datetime for the rollover. Is there a reason for that? I would expect you to use the source (syslog) timestamp for the rollover, so that each HDFS file contains only matching timestamps.

avatar
Expert Contributor

Thanks for the idea. I corrected that but am still seeing the same behavior.

avatar

Can you post the full date format used?

avatar
Expert Contributor

Are you referring to the string used to separate the date in PutHDFS?

/topics/minifitest/${allAttributes("syslog_year", "syslog_month", "syslog_day", "syslog_hour"):join("/")}

Here is our date format example: 2017-05-04 17:15:14,655

We split this into attributes: 2017 into syslog_year, 05 into syslog_month, 04 into syslog_day, 17 into syslog_hour, 15 into syslog_minute, and so on.
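As a rough Python equivalent of that split (a sketch only, since the actual regular expressions in our ExtractText processor are not shown here), named capture groups can stand in for the attributes. The `syslog_second` group and the `extract_attributes` helper are assumptions added for illustration.

```python
import re

# Named groups mirror the attribute names from the post.
# syslog_second is an assumed name; the post only says "and so on".
TIMESTAMP = re.compile(
    r"(?P<syslog_year>\d{4})-(?P<syslog_month>\d{2})-(?P<syslog_day>\d{2}) "
    r"(?P<syslog_hour>\d{2}):(?P<syslog_minute>\d{2}):(?P<syslog_second>\d{2})"
)


def extract_attributes(line):
    """Return the timestamp parts as a dict, or an empty dict if no match."""
    match = TIMESTAMP.search(line)
    return match.groupdict() if match else {}


attributes = extract_attributes("2017-05-04 17:15:14,655 some log message")
```

Note that a line without a matching timestamp yields no attributes at all, which matters for whatever builds the filename downstream.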

Ultimately we use this string to generate the filename:

${allAttributes("syslog_year", "syslog_month", "syslog_day", "syslog_hour", "syslog_minute"):join("_")}
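To show what these two expressions are meant to produce, here is a small Python sketch. It is not NiFi Expression Language: `join_attributes` is a hypothetical helper, and it assumes a missing attribute contributes an empty string, which may not match how NiFi handles missing attributes.

```python
DIRECTORY_PARTS = ["syslog_year", "syslog_month", "syslog_day", "syslog_hour"]
FILENAME_PARTS = DIRECTORY_PARTS + ["syslog_minute"]


def join_attributes(attributes, names, separator):
    """Join the named attribute values with the separator.

    Assumption: a missing attribute is treated as an empty string.
    """
    return separator.join(attributes.get(name, "") for name in names)


complete = {
    "syslog_year": "2017",
    "syslog_month": "05",
    "syslog_day": "04",
    "syslog_hour": "17",
    "syslog_minute": "15",
}
# Same attributes, but without the minute.
no_minute = {k: v for k, v in complete.items() if k != "syslog_minute"}

directory = "/topics/minifitest/" + join_attributes(complete, DIRECTORY_PARTS, "/")
good_name = join_attributes(complete, FILENAME_PARTS, "_")
bad_name = join_attributes(no_minute, FILENAME_PARTS, "_")
```

Under that assumption, a name ending in a bare underscore is what you get when syslog_minute is empty at the moment the filename is built, and the directory is unaffected because it does not use the minute.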

It all parses into directories correctly, but data spanning three minutes ends up in three correctly named folders (see screenshot), with chunks missing from the per-minute files and written instead to a file whose name is missing the minute.

15064-screen-shot-2017-05-04-at-22335-pm.png

avatar
Expert Contributor

Changing the Concurrent Tasks in ExtractText to 3 and reducing the Run Duration to 500ms fixed the problem.