Created 11-21-2016 07:14 PM
Hello,
First time posting here so sorry if this is in the wrong section / wrong format.
First, some background: we started a POC using NiFi 1.0.0. We're running a 3-node cluster with limited resources (this is a POC...). Each node has 16 cores, 32 GB of RAM and 2 volumes: a RAID 1 volume for the OS and a RAID 10 volume on 2.5-inch spindles. I know this is not a recommended setup, but the content repo, the provenance repo, the flowfile repo, basically everything, is on the same RAID 10 array. The disks are heavily used right now. Content repo archiving is disabled.
Now here's the thing: every 2-3 days, the disk fills up. Right now, the UI reports 450,000 FlowFiles (3.21 GB) in queue. I would expect roughly the same amount of data in the nifi/content_repository folder, but that's not the case: on one of the nodes, the content_repo folder is 73 GB. I can't tell how big the other 2 nodes are since the "du -h" operation is still running after 10 minutes, but using "df" I can estimate around 700-800 GB on each.
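For reference, here's how I'm estimating the per-node usage (assuming the repo sits at the default ./content_repository path under the NiFi install dir; the /opt/nifi path below is just an example, adjust it for your layout):
# total size of the content repository on this node
du -sh /opt/nifi/content_repository
# overall usage of the volume that holds the repositories
df -h /opt/nifi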
When we restart one of the nodes, it can take hours while the process cleans the content_repo and spams the log with a bunch of "unknown file" messages.
Any ideas / Suggestions? This is running on CentOS 6.
Thanks
Here's the relevant config section :
nifi.content.repository.implementation=org.apache.nifi.controller.repository.FileSystemRepository
nifi.content.repository.directory.default=./content_repository
nifi.content.repository.archive.max.retention.period=1 hours
nifi.content.repository.archive.max.usage.percentage=1%
nifi.content.repository.archive.enabled=false
nifi.content.repository.always.sync=false
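For comparison, this is roughly what those archive settings look like with the stock defaults (archiving enabled; values as shipped in a default nifi.properties, shown here only as an illustration, not as our config):
nifi.content.repository.archive.max.retention.period=12 hours
nifi.content.repository.archive.max.usage.percentage=50%
nifi.content.repository.archive.enabled=true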
Created 11-21-2016 07:36 PM
The content size displayed in the UI will not map exactly to disk utilization, since NiFi stores multiple FlowFiles in a single claim in the content repo. A claim cannot be deleted until every FlowFile it contains has reached a point of termination in your dataflow, so with 450,000 queued FlowFiles it is possible you are still holding on to a large number of claims. Try clearing out some of this backlog and see if disk usage drops. Setting backpressure thresholds on connections is a good way to prevent your queues from getting so large.
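Backpressure is configured per connection (in the connection's settings in the UI), but it can also be scripted against the REST API. This is only a rough sketch: the host, port and connection id are placeholders, the revision version has to match what a GET on the connection returns, and you should double-check the field names (backPressureObjectThreshold / backPressureDataSizeThreshold) against your version's API docs.
# fetch the connection first to get its current revision
curl -s http://nifi-host:8080/nifi-api/connections/<connection-id>
# then update only the backpressure thresholds
curl -s -X PUT -H 'Content-Type: application/json' \
  -d '{"revision":{"version":3},"component":{"id":"<connection-id>","backPressureObjectThreshold":10000,"backPressureDataSizeThreshold":"1 GB"}}' \
  http://nifi-host:8080/nifi-api/connections/<connection-id>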
Another possibility is that you are running into https://issues.apache.org/jira/browse/NIFI-2925 .
This bug has been addressed for the next Apache NiFi release (1.1) and for HDF 2.1.
Thanks,
Matt
Created 11-21-2016 08:39 PM
Apache NiFi 1.1 should be going up for a vote very soon.
Created 11-21-2016 08:33 PM
Thanks @Matt. Clearing the queues does not seem to help. I'm restarting one of the nodes right now; I'll be able to share the exact message when it boots and discovers the files that should not be there - sounds a lot like we're hitting the bug.
Is there a timeline for the release of 1.1.0? Reading the mailing lists, it seems to be really close to an RC.
Thanks
Phil
Created 11-21-2016 08:44 PM
Here's the actual error message. We'll have tons of them (more than 100k) during the restart...
2016-11-21 20:41:43,056 INFO [main] o.a.n.c.repository.FileSystemRepository Found unknown file [nifipath]/content_repository/39/1479172392813-1092647 (5845 bytes) in File System Repository; removing file
Created 11-21-2016 08:53 PM
Very possible it is related to that bug. With queues regularly in excess of the swap threshold of 20,000 FlowFiles, swapping will occur. There is a bug in that swapping that can result in those swapped FlowFiles' content not getting removed from the content repo. This continues until you eventually run out of disk space. On restart, all of that "orphaned" FlowFile content is removed, because there are no longer any FlowFiles referencing it.
Matt
Created 11-22-2016 07:42 PM
We were definitely swapping. We had a bunch of queues in excess of 40-50K. In all cases, the culprit was a merge processor trying to build buckets that were too big and waiting too long. I've modified the flows and stacked 2 merge processors one behind the other (the first one has a max of 1,000 entries, the second one does the actual merging to our target size), as sketched below. I'll monitor the situation and see if the problem occurs again. I'm down to 7-8K FlowFiles (from 450K+) in total.
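Roughly what the two stages look like now (the property names are from MergeContent, but the values are just an illustration of the idea, not our exact settings):
MergeContent #1 (keeps the incoming queue small):
  Maximum Number of Entries = 1000
  Max Bin Age = 1 min
MergeContent #2 (builds the final bundle at the target size):
  Minimum Group Size = 128 MB
  Max Bin Age = 5 min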
Created 11-24-2016 10:29 PM
I'm not 100% sure swapping was the problem here. I modified all of the flows to avoid building up big queues and bumped the swap threshold to 40,000 (see the snippet below), and we're still experiencing disk growth plus unknown files on reboot...
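For reference, the swap change was just this one line in nifi.properties (assuming the standard property; the default is 20000):
nifi.queue.swap.threshold=40000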
I did notice something weird: some of the flows have their "error" or "failure" relationships looping back to the same processor instead of being auto-terminated. I'm not sure whether this is good practice, and whether it could contribute to the problem?