Support Questions


Avoiding Duplicate Data with the NiFi TailFile Processor

Expert Contributor

We are experiencing duplicate data when our cluster's primary node switches over.

Two nodes, m1 and m3, each run a TailFile processor tailing the same log, with execution set to Primary Node only. When the Primary Node switches (say from m1 to m3), one TailFile processor stops tailing and the other picks it up. Judging by the settings in the TailFile processor, I assume it picks the file up at the beginning, which causes data duplication. I saw there is an option to have it start at Current Time instead of Beginning of File. Would that be a reasonable fix to ensure we don't get duplicate data when the Primary Node switches? We've always had this set to Beginning of File and didn't seem to have this problem before, so I'm hesitant to change it without confirmation that I understand this functionality in this situation correctly.

1 ACCEPTED SOLUTION

9 REPLIES


Hi @Eric Lloyd

I am not sure I understand your use case. NiFi tails a local file. From your question, it looks like you are trying to tail the same file when the master switches. Is your file visible to both nodes (such as on NAS storage)?

TailFile saves its state to avoid duplicating data from a file. There are two options for storing that state: local and remote. Have you set "State Location" to Remote?

As per the docs:

Specifies where the state is located either local or cluster so that state can be stored appropriately in order to ensure that all data is consumed without duplicating data upon restart of NiFi
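For reference, a TailFile configuration along these lines keeps the byte-offset state in the cluster-wide state provider so any node can resume. This is a sketch: property names are as they appear in the NiFi UI, and the file path is a hypothetical example.

```
# TailFile processor properties (sketch)
Tailing mode           : Single file
File(s) to Tail        : /var/log/app/app.log   # hypothetical path
State Location         : Remote                 # cluster-wide state, shared across nodes
Initial Start Position : Beginning of File
```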

Expert Contributor

The file is visible to both nodes. The state location is set to remote on the TailFile processors.

My use case:

2 nodes in a cluster. Both have TailFile processors on them, set to execute on the Primary Node only.

When started, one of the nodes is the Primary Node (let's say m1), so only m1's TailFile processor is tailing the log file that is visible to both nodes. Suddenly, without warning, the Primary Node switches to m3. So now the TailFile processor on m1 stops tailing the log file and the m3 TailFile processor starts tailing it.

But from where does the m3 TailFile processor start tailing? The duplicate data seems to indicate it starts at the Beginning of File rather than where the m1 TailFile processor left off.


Super Mentor

@Eric Lloyd

Avoiding duplication during restart as described in the documentation is a different scenario. During NiFi shutdown, processors are given a graceful shutdown timer to complete their running tasks (20 seconds by default). If a thread still has not completed by then, it is killed. In the case where a thread is killed, no FlowFiles have been committed to the TailFile success connection and no update has been made to state. So on restart, no matter which node becomes Primary Node, TailFile starts correctly from the last successfully recorded state position.
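For context, that graceful-shutdown window is configurable. A sketch of the relevant setting, which lives in conf/bootstrap.conf in a standard NiFi install (value shown is the default):

```
# conf/bootstrap.conf
graceful.shutdown.seconds=20
```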

Primary Node changes do not result in the killing of any actively running tasks. It simply puts the processor in a stopping state so it will not execute another task once the current task completes.

Matt

Expert Contributor

Okay yes this all makes sense.

Do you think switching the TailFile processor's "Initial Start Position" property from Beginning of File to Current Time would reduce the amount of data duplication? Or would I possibly be facing data loss then?

Super Mentor

@Eric Lloyd

Not sure that would make a difference. You are well beyond that "Initial Start Position" already; each execution works from the recorded state location. Duplication may still occur.

I suggest perhaps adjusting your run schedule to something other than 0 secs. This not only helps reduce resource consumption, it also introduces a small delay between each consumption of lines from the log file. This may help when Primary Node changes occur.
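A sketch of that suggestion, set on the TailFile processor's Scheduling tab (values are illustrative):

```
# TailFile — Scheduling tab (sketch)
Scheduling Strategy : Timer driven
Run Schedule        : 2 sec          # instead of 0 sec; adds a small pause between runs
Execution           : Primary node
```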

Expert Contributor

Hi there @Matt Clarke, do you have a suggested value for the run schedule other than 0 secs, or is that more of a choice dependent on the flow/system itself?

Super Mentor

@Eric Lloyd

At 0 secs the processor tries to run as fast as possible, so there is basically no break in processing. Just setting it to 2 or 3 seconds may help.

Expert Contributor

OK, changed it to 2 secs. I'll keep an eye on it and monitor for future data duplication. Going to accept this as the answer for now. Thank you.