Support Questions

jthomas_t_king · ‎11-22-2016

NiFi 1.0.0, single node, small JVM (512 MB)

Simple flow GetFile --> PutFile

Pretest - Copy large 9GB file to input dir of GetFile processor (takes 6.5 mins)

Run test - Turn Flow on

Issue: GetFile gives no indication that the file is being ingested until the action is complete, i.e. the Read/Write counts do not change and the Status History does not update for 6.5 minutes. I expected the Read/Write counts to increase during those 6.5 minutes until they reached 9 GB. What we saw instead was 0/0 for 6.5 minutes, then 9 GB/9 GB.
We see the same behavior from the PutFile processor.

Is there a setting that shows the in-progress counts updating when working with large files?

Thanks

JoeWitt · ‎11-22-2016

At this time we don't show progress of in-flight sessions via that mechanism, other than the indicator of the number of active threads. That said, it is definitely a good idea, just not something we've done anything with to date.

JoeWitt · ‎11-22-2016

Something else worth mentioning, and that it would be good to get your thoughts on @J.Thomas King, is the idea of making it configurable whether externally referenceable data is actually copied in. With that option we'd simply create a pointer/reference to the original input data wherever it lives (file, http/url, etc.). Then whenever we actually operate on it in the flow, we'd access it in its original form. This avoids needless copy tasks and could result in tremendous throughput benefits. The downside, of course, is that we cannot manage or guarantee the lifecycle of that data, but for certain cases this could be fine anyway. Would such a feature be helpful for your case?

jthomas_t_king · ‎11-22-2016

It's an interesting feature. It would definitely speed up the process. Not sure if it would help our application at this point. I was just troubleshooting an issue because a co-worker thought NiFi was not working with large files (he was not seeing any indication that something was happening and thought the system had locked up).

JoeWitt · ‎11-22-2016

Ok, cool. Looks like you have a pretty small heap size, so if the thing you do right after grabbing that big object is splitting it, make sure you do a two-phase split. The content itself should never be held in memory in full, but even the pointers/metadata about the existence of the flow files can add up. Let's say, for instance, you get the file and then split the text on line boundaries. Do one SplitText with, say, 1000 lines per split, then another SplitText to get down to single lines. This way we never dump references to 1,000,000 flow files at once. The approach I'm mentioning can handle extremely large inputs because it never carries too much undo bookkeeping. We also intend to make that go away so users don't even have to consider it.
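
To put rough numbers on the two-phase split, here is a minimal Python sketch. It is not NiFi code: the function names are hypothetical, and it assumes a simplified model in which each split step creates references to all of its output flow files at once.

```python
import math


def splits_created(total_lines: int, lines_per_split: int) -> int:
    """Number of output flow files one split step creates for one input."""
    return math.ceil(total_lines / lines_per_split)


def peak_single_phase(total_lines: int) -> int:
    """One split step straight down to single lines."""
    return splits_created(total_lines, 1)


def peak_two_phase(total_lines: int, first_phase_lines: int) -> int:
    """Coarse split first, then each coarse chunk down to single lines.

    Returns the largest number of flow file references that any single
    split step has to create at once (assumed model, see lead-in).
    """
    phase_one = splits_created(total_lines, first_phase_lines)
    # Each phase-two input holds at most first_phase_lines lines.
    phase_two = splits_created(min(total_lines, first_phase_lines), 1)
    return max(phase_one, phase_two)


if __name__ == "__main__":
    print(peak_single_phase(1_000_000))     # 1000000
    print(peak_two_phase(1_000_000, 1000))  # 1000
```

Under that assumption, a 1,000,000-line input never needs more than 1000 new references in one step when split in two phases, compared with 1,000,000 in a single phase.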

On your flow, the copy rate you mention works out to about 20 MB/s, which sounds relatively low. That might be worth looking into as well, but in any case your point about wanting to be able to observe in-flight behaviors is certainly a compelling user experience idea.
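
As a rough check of that rate, the arithmetic from the numbers in the original post (9 GB in 6.5 minutes) can be sketched as follows. The function name is hypothetical, and decimal units (1 GB = 1000 MB) are assumed.

```python
def copy_rate_mb_per_s(size_gb: float, minutes: float) -> float:
    """Average throughput in MB/s, assuming 1 GB = 1000 MB."""
    size_mb = size_gb * 1000
    seconds = minutes * 60
    return size_mb / seconds


if __name__ == "__main__":
    # 9000 MB / 390 s
    print(f"{copy_rate_mb_per_s(9, 6.5):.1f} MB/s")  # 23.1 MB/s
```

That lands a little above 20 MB/s, in line with the estimate above.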

jthomas_t_king · ‎11-23-2016

Thanks for the tips, Joe. I will pass them along. Yes, I was just proving that a large file could be handled on a very small/slow system, so a faster system with a larger JVM would have no problem. Very good efficiency tips, thanks!

Cloudera Community

Nifi 1.0.0, no in progress indication when reading or writing large files