Created 09-16-2022 08:35 AM
Hi everyone.
We have a NiFi cluster consisting of 3 nodes. After a failure of the disk subsystem on one of the nodes, that node was stuck in a read-only state for a long time. After resolving the issue and restarting the cluster, we are getting the following error on the problematic node:
2022-09-16 15:31:07,255 ERROR [Load-Balanced Client Thread-4] o.a.n.c.q.c.c.a.n.NioAsyncLoadBalanceClient Failed to communicate with Peer xxxxxx:9443
java.io.EOFException: Expected StandardFlowFileRecord[uuid=6ce9e262-b20b-4372-a3b9-43c2c00e8caa,claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=1663256223724-231072817, container=default, section=49], offset=387990, length=2203],offset=0,name=04190e1f-fdca-4352-a796-6b6c9ce41baa,size=2203] to contain 2203 bytes but the content repository only had 1130 bytes for it
at org.apache.nifi.controller.queue.clustered.ContentRepositoryFlowFileAccess$1.ensureNotTruncated(ContentRepositoryFlowFileAccess.java:83)
at org.apache.nifi.controller.queue.clustered.ContentRepositoryFlowFileAccess$1.read(ContentRepositoryFlowFileAccess.java:63)
at org.apache.nifi.stream.io.StreamUtils.fillBuffer(StreamUtils.java:89)
at org.apache.nifi.controller.queue.clustered.client.async.nio.LoadBalanceSession.getFlowFileContent(LoadBalanceSession.java:297)
at org.apache.nifi.controller.queue.clustered.client.async.nio.LoadBalanceSession.getDataFrame(LoadBalanceSession.java:252)
at org.apache.nifi.controller.queue.clustered.client.async.nio.LoadBalanceSession.communicate(LoadBalanceSession.java:162)
at org.apache.nifi.controller.queue.clustered.client.async.nio.NioAsyncLoadBalanceClient.communicate(NioAsyncLoadBalanceClient.java:242)
at org.apache.nifi.controller.queue.clustered.client.async.nio.NioAsyncLoadBalanceClientTask.run(NioAsyncLoadBalanceClientTask.java:76)
at org.apache.nifi.engine.FlowEngine$2.run(FlowEngine.java:110)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Content repository settings:
# Content Repository
nifi.content.repository.implementation=org.apache.nifi.controller.repository.FileSystemRepository
nifi.content.claim.max.appendable.size=1 MB
nifi.content.claim.max.flow.files=100
nifi.content.repository.directory.default=./content_repository
nifi.content.repository.archive.max.retention.period=1 hours
nifi.content.repository.archive.max.usage.percentage=75%
nifi.content.repository.archive.enabled=true
nifi.content.repository.always.sync=false
nifi.content.viewer.url=../nifi-content-viewer/
nifi.content.repository.encryption.key.provider.implementation=
nifi.content.repository.encryption.key.provider.location=
nifi.content.repository.encryption.key.id=
nifi.content.repository.encryption.key=
Does anyone have ideas on how to handle this?
Created 09-16-2022 12:49 PM
@EuGras
You have a FlowFile queued somewhere within your dataflow with UUID=
6ce9e262-b20b-4372-a3b9-43c2c00e8caa
The connection is trying to read that FlowFile's content from a content claim in the content repository in order to load-balance data across the nodes in the cluster. The claim is:
id=1663256223724-231072817, container=default, section=49
<path to>/content_repository/49/1663256223724-231072817
The FlowFile's metadata/attributes record that this content should be 2203 bytes in length; however, the content repository only has 1130 bytes for it. So it appears the disk issue resulted in data corruption.
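If you want to confirm the truncation on disk, note that with the default FileSystemRepository layout a single resource claim file holds content for many FlowFiles at different offsets. Since this claim starts at offset 387990 and should be 2203 bytes long, the claim file would need to be at least 387990 + 2203 = 390193 bytes; a smaller size (for example, as reported by ls -l on the path above) indicates the content was cut off.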
You could use NiFi Data Provenance to locate this FlowFile by UUID or filename (04190e1f-fdca-4352-a796-6b6c9ce41baa) and determine which connection contains it. On that connection you could disable the load-balance configuration, then add a RouteOnAttribute processor to filter out this one bad FlowFile and auto-terminate it, while routing the other FlowFiles that may be queued in that same connection onward.
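For example, a rough sketch of that RouteOnAttribute configuration (the property name "bad-flowfile" is arbitrary, and the UUID is taken from the error above): set Routing Strategy to "Route to Property name" and add a dynamic property such as
bad-flowfile = ${uuid:equals('6ce9e262-b20b-4372-a3b9-43c2c00e8caa')}
Auto-terminate the resulting "bad-flowfile" relationship and route "unmatched" on to the rest of your flow so the remaining FlowFiles in that connection continue downstream.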
Note that you may have other corruption caused by your disk issues beyond this one FlowFile. If you do not care about the data on the node that had the disk issue, another option is to shut down that node and purge the contents of the flowfile_repository and content_repository. This effectively deletes all FlowFiles queued in connections on that node. Then restart the NiFi node; it will construct new content and FlowFile repositories on startup.
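Assuming the default repository locations shown in nifi.properties above (./content_repository, and typically ./flowfile_repository), a rough sketch of that purge on the affected node would be:
./bin/nifi.sh stop
rm -rf ./flowfile_repository/* ./content_repository/*
./bin/nifi.sh start
Adjust the paths if your repositories live elsewhere, and only run this on the node whose queued data you are willing to lose.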
If you found that the provided solution(s) assisted you with your query, please take a moment to log in and click "Accept as Solution" below each response that helped.
Thank you,
Matt
Created 09-20-2022 02:57 AM
@MattWho thanks a lot. I identified the connections with the problematic files, disabled load balancing on them, and terminated the bad FlowFiles using your method of filtering by UUID. Interestingly, the problematic connection ID is not shown in nifi-app.log, but it does show up in the logs in the UI.