Created 11-22-2016 04:52 PM
NiFi 1.0.0, single node, small JVM (512MB)
Simple flow GetFile --> PutFile
Pretest - Copy large 9GB file to input dir of GetFile processor (takes 6.5 mins)
Run test - Turn Flow on
Is there a setting to see the in-progress counts update while working with large files?
Thanks
Created 11-22-2016 04:54 PM
At this time we don't show progress of in-flight sessions via that mechanism, other than the indicator of the number of active threads. That said, it is definitely a good idea, just not something we've done anything with to date.
Created 11-22-2016 05:08 PM
Something else worth mentioning, and that it would be good to get your thoughts on @J.Thomas King, is the idea of making it configurable whether externally referenceable data is actually copied in. With that option we'd simply create a pointer/reference to the original input data wherever it lives (file, http/url, etc.). Then, whenever we actually operate on it in the flow, we'd access it in its original form. This avoids needless copy tasks and could result in tremendous throughput benefits. The downside, of course, is that we cannot manage or guarantee the lifecycle of that data, but for certain cases this could be fine anyway. Would such a feature be helpful for your case?
Created 11-22-2016 05:30 PM
It's an interesting feature. It would definitely speed up the process. Not sure if it would help our application at this point. I was just troubleshooting an issue because a coworker thought NiFi was not working with large files (they saw no indication that anything was happening and thought the system had locked up).
Created 11-22-2016 05:44 PM
Ok, cool. Looks like you have a pretty small heap size, so if the thing you do right after grabbing that big object is splitting it, make sure you do a two-phase split. The content itself should never be held in memory in full, but even the pointers/metadata about the existence of the flow files can add up. Let's say, for instance, you get the file and then split the text on line boundaries. Do one SplitText with, say, 1000 lines per split, then another SplitText to get down to single lines. This way we never dump references to 1,000,000 flow files at once. The approach I'm describing can handle extremely large inputs because it never has to keep too much undo bookkeeping at once. We also intend to make that go away so users don't have to consider it at all.
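To put rough numbers on the two-phase split, here is a minimal back-of-the-envelope sketch. It is not NiFi code: the function name is made up for illustration, and it assumes each SplitText pass creates all of its output flow files for one input in a single batch.

```python
import math


def flowfiles_per_batch(total_lines: int, lines_per_split: int) -> tuple[int, int]:
    """Estimate how many flow file references each split phase creates at once.

    Hypothetical helper for illustration only; assumes one batch of
    outputs per input flow file.
    """
    # Phase 1: the big file becomes chunks of `lines_per_split` lines each.
    first_phase = math.ceil(total_lines / lines_per_split)
    # Phase 2: each chunk is split into single lines, one chunk at a time.
    second_phase = min(lines_per_split, total_lines)
    return first_phase, second_phase


# A 1,000,000-line file split at 1000 lines, then down to single lines.
phase_one, phase_two = flowfiles_per_batch(1_000_000, 1000)
print(phase_one, phase_two)
```

Under those assumptions neither phase has to track more than 1000 new flow files at a time, compared with 1,000,000 for a single split straight down to lines.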
On your flow, the copy rate you mention works out to about 20MB/s, which sounds relatively low. That might be worth looking into as well, but in any case your point about wanting to observe in-flight behavior is certainly a compelling user experience idea.
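For reference, the rough rate can be checked from the figures in the original post (9GB copied in 6.5 minutes). This is only a sketch of the arithmetic; the helper name is hypothetical, and it assumes 1GB = 1024MB.

```python
def copy_rate_mb_per_s(size_gb: float, minutes: float) -> float:
    """Average copy rate in MB/s, assuming 1 GB = 1024 MB."""
    return (size_gb * 1024) / (minutes * 60)


# Figures from the original post: 9GB in 6.5 minutes.
rate = copy_rate_mb_per_s(9, 6.5)
print(f"{rate:.1f} MB/s")
```

That lands a little above 20MB/s, which is in line with the estimate above.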
Created 11-23-2016 12:17 AM
Thanks for the tips, Joe. I will pass them along. Yes, I was just proving that a large file could be handled on a very small/slow system ==> a larger (JVM), faster system would have no problem. Very good efficiency tips, thanks!