
Avoiding duplicate data with the NiFi TailFile processor

Expert Contributor

We are experiencing duplicate data when our cluster's primary node switches over.

Two nodes, m1 and m3, each have a TailFile processor tailing the same log, with execution set to Primary Node only. When the Primary Node switches (say from m1 to m3), one TailFile processor stops tailing and the other picks it up. I am assuming it picks up at the beginning of the file, per the TailFile processor's settings, which causes data duplication.

I saw there is an option to have it start at the Current Time instead of the Beginning of File. Would that be a reasonable fix to ensure we don't get duplicate data when the Primary Node switches? We've always had this set to Beginning of File and didn't seem to have this problem before, so I am hesitant to change it without confirmation that I understand how this setting behaves in this situation.

1 ACCEPTED SOLUTION

Super Mentor
@Eric Lloyd

I am assuming the file being tailed is mounted across all your NiFi nodes in the cluster? This would need to be the case so that no matter which node becomes the primary node, it could tail the exact same file.

Assuming the above is true, I am also assuming the processor's "State Location" property has been configured for "Remote".

When TailFile executes, it begins tailing the target file. At the completion of each thread, state is recorded as to where that tail left off, so the next thread can pick up where the previous one ended.

If you are storing state "Local" only, then when the primary node switches, the new primary node will start tailing from the beginning of the file again.

That being said, there is still a chance for some duplication even when state is stored at the cluster level. When the primary node changes, the original primary node is informed that it is no longer the primary node and a new node is elected as the primary node.

The original node will complete its currently executing task but will not schedule any new tasks to run.

The new primary node will then start the "primary node only" processors. If the new primary node executes before the same processor on the old primary node updates cluster state, the new primary node will start tailing from the last recorded cluster state for that processor, resulting in some duplication.

NiFi favors duplication over data loss. We cannot assume the original primary node is still alive rather than dead, so we have to accept the risk that the original primary node's processors may never update state.
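The failover race described above can be sketched as a toy Python model (this is not NiFi's actual implementation; `SharedState` is a stand-in for cluster state such as ZooKeeper):

```python
import io

class SharedState:
    """Stand-in for NiFi's cluster-level state store (e.g. ZooKeeper)."""
    def __init__(self):
        self.offset = 0  # last byte offset successfully recorded

def tail_once(log, state):
    """One scheduled task: read new lines from the last recorded offset,
    then record the new offset. The state update happens AFTER the read;
    if a new primary fires before this write lands, it re-reads the same
    lines from the stale offset -- the duplication window described above."""
    log.seek(state.offset)
    lines = log.readlines()
    state.offset = log.tell()
    return lines

log = io.StringIO("line1\nline2\nline3\n")
state = SharedState()

first = tail_once(log, state)    # primary m1 reads everything so far
log.seek(0, io.SEEK_END)
log.write("line4\n")             # the log grows
second = tail_once(log, state)   # next task resumes at the saved offset
```

With cluster-stored state, `second` picks up only the new line; with local-only state, the new primary would start from offset 0 and re-emit everything.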

Hope this confirms how your processor is set up and why NiFi works the way it does in this scenario.

Thanks,

Matt




Hi @Eric Lloyd

I am not sure I understand your use case. NiFi tails a local file. From your question, it looks like you are trying to tail the same file when the primary node switches. Is your file visible to both nodes (such as on NAS storage)?

TailFile saves its state to avoid duplicating data from one file. There are two options for storing the state: local and remote. Have you set "State Location" to remote?

As per the doc :

Specifies where the state is located either local or cluster so that state can be stored appropriately in order to ensure that all data is consumed without duplicating data upon restart of NiFi

Expert Contributor

The file is visible to both nodes. The state location is set to remote on the TailFile processors.

My use case:

2 nodes in a cluster. Both have TailFile processors on them, set so that only the Primary Node does the tailing.

When started, one of the nodes is the Primary Node (let's say m1), so only m1's TailFile processor is tailing the log file that is visible to both nodes. Suddenly, without warning, the Primary Node switches to m3. Now the TailFile processor on m1 stops tailing the log file and the m3 TailFile processor starts tailing it.

But from where does the m3 TailFile processor start tailing it? The duplicate data seems to indicate it starts at the Beginning of File rather than where the m1 TailFile processor left off.


Super Mentor

@Eric Lloyd

Avoiding duplication during restart, as described in the documentation, is a different scenario. During NiFi shutdown, processors are given a graceful shutdown period to complete their running tasks (20 seconds by default). If a thread still has not completed by then, it is killed. In the case where a thread is killed, no FlowFiles have been committed to the TailFile success connection and no update has been made to state. So on restart, no matter which node becomes the Primary Node, TailFile starts correctly from the last successfully recorded state position.

Primary node changes do not result in the killing of any actively running tasks. The change simply puts the processor in a stopping state so it will not execute another task once the current task completes.
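That stop-but-don't-kill behavior can be modeled with a small Python sketch (hypothetical, not NiFi code): losing the primary role prevents new tasks from being scheduled, but a task already in flight still runs to completion:

```python
import threading

class Processor:
    """Toy model of primary-node-only scheduling."""
    def __init__(self):
        self.demoted = threading.Event()  # set when the node loses the primary role
        self.completed = []

    def run(self, tasks):
        for t in tasks:
            if self.demoted.is_set():     # no NEW tasks after demotion...
                break
            self.completed.append(t())    # ...but the current task runs to completion

p = Processor()

def make_task(n):
    def task():
        if n == 1:
            p.demoted.set()               # demotion arrives mid-task
        return n
    return task

p.run([make_task(0), make_task(1), make_task(2)])
# task 1 was already running when demotion arrived, so it still completed;
# task 2 was never scheduled
```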

Matt

Expert Contributor

Okay yes this all makes sense.

Do you think switching the "Initial Start Position" property in the TailFile processor from Beginning of File to Current Time would reduce the amount of data duplication? Or might I then be facing data loss?

Super Mentor

@Eric Lloyd

Not sure that would make a difference. You are well beyond that "initial start position" already. Each execution is working from the recorded state location. Duplication may still occur.

I suggest perhaps adjusting your run schedule to something other than 0 secs. This not only helps to reduce resource consumption, it also introduces a small delay between each consumption of lines from the log files. This may help when primary node changes occur.
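As a rough illustration of why a small delay helps (made-up timings, greatly simplified): duplication requires the new primary's first task to fire before the old primary's final state write lands, and a non-zero run schedule widens that gap:

```python
def first_task_time(election_time, run_schedule_secs):
    """New primary's first scheduled task fires one schedule interval
    after it takes over (simplified model, hypothetical numbers)."""
    return election_time + run_schedule_secs

election = 10.0         # new primary elected at t=10.0s
old_state_write = 10.3  # old primary commits its final offset at t=10.3s

# 0-sec schedule: new primary reads before the final offset is written -> duplicates
risky = first_task_time(election, 0) < old_state_write

# 2-sec schedule: the final state write has already landed -> no re-read
safe = first_task_time(election, 2) < old_state_write
```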

Expert Contributor

Hi there @Matt Clarke, do you have a suggested value for the run schedule other than 0 secs, or is that more of a choice dependent on the flow/system itself?

Super Mentor

@Eric Lloyd

At 0 secs the processor is trying to run as fast as possible, so basically no break in processing. Just setting it to 2 or 3 seconds may help.

Expert Contributor

OK, changed it to 2 secs. I will keep an eye on it to monitor for future data duplication. Going to accept this as the answer for now. Thank you.