Apparent Data Loss on Nifi Restart

New Contributor

I am running NiFi 1.9.2 on a two-node cluster, and it seems that I occasionally lose data when I restart NiFi. I am seeing multiple instances of the following warning (each one names a different swap file):

 

2021-03-25 19:29:46,426 WARN [main] o.a.n.controller.FileSystemSwapManager Encountered unknown Swap File ./flowfile_repository/swap/1616081663634-7cde3c5c-016b-1000-0000-00004c82c4b2-98e280b7-c8e5-44d5-a94d-804cbde11f91.local.swap; will ignore this Swap File. This file should be cleaned up manually

 

and when I check the output of Nifi, I am seeing less data than expected around the time of the restart.

 

It seems to me that NiFi is swapping FlowFiles out but is unable to update the WriteAheadLog before NiFi is stopped. So, when NiFi starts again, it cannot recover the swapped files because there is no record of them in the WriteAheadLog. Let me know whether this conclusion is correct and, if so, how I can prevent this issue in the future.

3 REPLIES

Super Mentor

@nmargosian 

 

If you search your NiFi canvas for the UUID:

7cde3c5c-016b-1000-0000-00004c82c4b2

Do you find that connection?
This is the connection that this swap file would get swapped back into. If this connection does not exist, then the swap file cannot be loaded back into it.
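
To make that mapping concrete, here is a minimal sketch that pulls the connection UUID out of a swap file name. It assumes the name layout seen in the warning above (`<timestamp>-<connection UUID>-<second UUID>` followed by a suffix ending in `swap`); that layout is inferred from the single log line in this thread, and the function name `connection_id_from_swap_file` is made up for illustration.

```python
import os
import re

# Assumed layout, inferred from the log line in this thread:
#   <epoch millis>-<connection UUID>-<second UUID>.<optional suffix>swap
_UUID = r"[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}"
_SWAP_NAME = re.compile(rf"^(\d+)-({_UUID})-({_UUID})\..*swap$")


def connection_id_from_swap_file(path):
    """Return the connection UUID embedded in a swap file name, or None."""
    name = os.path.basename(path)
    match = _SWAP_NAME.match(name)
    return match.group(2) if match else None


if __name__ == "__main__":
    swap = (
        "./flowfile_repository/swap/"
        "1616081663634-7cde3c5c-016b-1000-0000-00004c82c4b2-"
        "98e280b7-c8e5-44d5-a94d-804cbde11f91.local.swap"
    )
    print(connection_id_from_swap_file(swap))
```

Running this over every file in the swap directory gives the list of connection UUIDs to search for on the canvas.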

Any chance someone removed a connection from the canvas while this node was not connected to the cluster?
Did you recently upgrade from an older NiFi version?
Did you copy a flow.xml.gz from a different node in your cluster to this node because of a flow mismatch exception?

Just looking for a reason as to why this connection would be missing.
Does the NiFi flow archive directory exist? 
Does the NiFi service user have proper permissions to read and write to that archive directory?
Does NiFi have proper ownership and permissions to write to the flow.xml.gz file?

When you make a change on the canvas, NiFi makes that change in the in-memory flow, archives the current flow.xml.gz, and then writes a new flow.xml.gz. I am wondering if perhaps the above connection was added to the canvas and the flow was enabled, but for some reason NiFi was unable to write out a new copy of the in-memory flow to a flow.xml.gz. On NiFi restart, the flow from the flow.xml.gz is what is loaded back into memory.

Hope this helps,
Matt

New Contributor

Hi, I do find a connection for 

7cde3c5c-016b-1000-0000-00004c82c4b2

There was no NiFi version upgrade, and I did not edit either node's flow or copy any flow.xml.gz files.

All the mentioned folders exist and all the file permissions are as expected. 

 

I did notice the warning after seeing clustering issues with NiFi. Basically, I restarted the cluster but noticed that only one of the two nodes was connected, while the other node was stuck in a Connecting state for 10 minutes. So I restarted both instances again, and then they formed a cluster properly (but that is also when I saw the warnings from my initial post). The primary node also switched on the second restart.

Super Mentor

@nmargosian 

The swap file in question would contain FlowFiles that belong to a connection with the UUID of:

7cde3c5c-016b-1000-0000-00004c82c4b2

In your Flow Configuration History, found under the global menu icon in the upper right corner, can you search for that UUID to see if there is any history on it?
- Do you see it existing at some point in time? Do you see a "Remove" event on it?
- If you see it in the history with no "Remove" action, but it is now gone, then the flow.xml.gz loaded on restart did not have this connection in it.

If this connection no longer exists on the canvas, NiFi cannot swap these FlowFiles back in. Everything you see on the canvas resides in heap memory and is also written to disk within a flow.xml.gz file. When you stop and start or restart NiFi, NiFi loads the flow back into heap memory from the flow.xml.gz (each node has a copy of this flow.xml.gz, and all nodes must have matching flow.xml.gz files or nodes will not rejoin the cluster).
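
One way to check what a node will actually load on restart is to look inside its flow.xml.gz directly. The sketch below is only an illustration: it assumes the file is gzip-compressed XML (as the name suggests) and that each component stores its identifier in a direct `<id>` child element, which you should confirm against your own file. The function name `find_component_tags` is hypothetical.

```python
import gzip
import xml.etree.ElementTree as ET


def find_component_tags(flow_path, component_id):
    """Return the tag names of elements whose direct <id> child equals component_id.

    Assumption: components in flow.xml.gz keep their identifier in a
    direct <id> child element.
    """
    with gzip.open(flow_path, "rb") as handle:
        root = ET.parse(handle).getroot()

    matches = []
    for element in root.iter():
        id_child = element.find("id")  # direct children only
        if id_child is not None and (id_child.text or "").strip() == component_id:
            matches.append(element.tag)
    return matches
```

An empty list for the UUID from the swap file name, on either node, would mean the flow saved on disk for that node does not contain the connection, even if the canvas currently shows it.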

Things I suggest you verify:
1. Make sure that NiFi can successfully write to the directory where the flow.xml.gz file is located. Make a change on the canvas and verify that the existing flow.xml.gz was moved to the archive directory and a new flow.xml.gz was created. If this process fails, any changes you made are lost when NiFi is restarted. For example, the connection was created and data was queued on it, but NiFi failed to write a new flow.xml.gz because it could not archive the current flow.xml.gz (space issues, permissions/ownership issues, etc.). This would block NiFi from creating a new flow.xml.gz, while the flow in memory would still contain your current flow. All these directories and files should be owned by, and readable/writable by, your NiFi service user.
2. At some point in history, did the flows on your cluster nodes mismatch? For example, a change was made on the canvas of a node that was disconnected from the cluster at the time, and then that node's flow was copied to the other nodes to bring all nodes in sync.
3. Was an archived flow reloaded into NiFi at some point? This requires manual user action: copying a flow.xml.gz out of the archive and using it to replace the existing flow.xml.gz.
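
The permission part of item 1 can be checked with a short script. This is a rough sketch, not a NiFi tool: the function name `check_flow_storage` and the paths passed to it are placeholders, and it only reports what the user running the script can do, so it needs to be run as the NiFi service user to be meaningful. It does not check free disk space.

```python
import os


def check_flow_storage(flow_file, archive_dir):
    """Return a list of problems that would stop the current user from
    rewriting flow_file or placing archived copies in archive_dir."""
    problems = []
    conf_dir = os.path.dirname(os.path.abspath(flow_file))

    # The flow file itself must exist and be readable and writable.
    if not os.path.isfile(flow_file):
        problems.append(f"flow file not found: {flow_file}")
    elif not os.access(flow_file, os.R_OK | os.W_OK):
        problems.append(f"flow file not readable/writable: {flow_file}")

    # Creating a new file next to it needs write access to the directory.
    if not os.access(conf_dir, os.W_OK | os.X_OK):
        problems.append(f"directory not writable: {conf_dir}")

    # The archive directory must exist and accept new files.
    if not os.path.isdir(archive_dir):
        problems.append(f"archive directory not found: {archive_dir}")
    elif not os.access(archive_dir, os.R_OK | os.W_OK | os.X_OK):
        problems.append(f"archive directory not readable/writable: {archive_dir}")

    return problems
```

An empty result only rules out the permission and ownership causes. The canvas test described in item 1 is still the more direct confirmation that archiving and rewriting work.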

NiFi restarts will not just remove connections from your dataflows. Some other condition occurred, and it may not even have been recent. If you have enough app.log history covering multiple restarts, do you see this same exact WARN log line with each of those restarts?
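
If the log history is long, a small script can answer that question. The sketch below groups the "unknown Swap File" warnings by swap file and records the timestamp of each occurrence. The regular expression is written against the one log line quoted in this thread, so treat it as an assumption about the log format, and `unknown_swap_warnings` is a made-up name.

```python
import re

# Pattern based on the single WARN line quoted in this thread.
_UNKNOWN_SWAP = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ WARN .*"
    r"Encountered unknown Swap File (\S+?);"
)


def unknown_swap_warnings(lines):
    """Map each swap file path to the timestamps at which it was reported."""
    seen = {}
    for line in lines:
        match = _UNKNOWN_SWAP.search(line)
        if match:
            timestamp, swap_file = match.group(1), match.group(2)
            seen.setdefault(swap_file, []).append(timestamp)
    return seen
```

Passing it an open log file (for example `unknown_swap_warnings(open(path))`, with `path` pointing at your own app log) shows whether the same swap files are reported at every restart or only after the most recent one.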

Hope this helps,
Matt