Member since: 07-30-2019
Posts: 105
Kudos Received: 129
Solutions: 43
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 762 | 02-27-2018 01:55 PM
 | 1238 | 02-27-2018 05:01 AM
 | 3093 | 02-27-2018 04:43 AM
 | 664 | 02-27-2018 04:18 AM
 | 1903 | 02-27-2018 03:52 AM
01-11-2017
02:47 AM
2 Kudos
I think Matt provided a nice and comprehensive response. I'll add that while we do offer a lot of flexibility and fine-grained control for handling any case (whether for you that is a failure, a success, a connection issue, etc.), we can do better. One of the plans we've discussed is to provide reference-able process groups. This would allow you to effectively call a portion of the flow like a function, with a simple input/function/output model. You can read more about this here: https://cwiki.apache.org/confluence/display/NIFI/Reference-able+Process+Groups. We also have data provenance, in which we capture details of why we routed any given flowfile to any given relationship; this is not used as often as it could be. Further, we need to surface this information for use within the flow so error-handling steps could capture things like 'last processor' and 'last transfer description'. In short, there are exciting things on the horizon to make all sorts of flow management cases easier and more intuitive, and the items mentioned above will be an important part of that. Thanks
11-23-2016
01:39 PM
1 Kudo
Hello @mayki wogno. It is certainly possible to use site-to-site (s2s) to send data to and from the same cluster of nodes, and this is commonly done as a way to rebalance data across a cluster at key user-chosen points. As to your second question regarding why it works the way it does for RPG placement and port placement, here are the scenarios:

1) You want to push data to another system using s2s. You can place an RPG anywhere you like in the flow and direct your data to it on a specific s2s port.

2) You want to pull data from another system using s2s. You can place an RPG anywhere you like in the flow and source data from it on a specific s2s port.

3) You want to allow another system to push to yours using s2s. You expose a remote input port at the root level of the flow. Other systems can then push to it as described in #1.

4) You want to allow another system to pull from yours using s2s. You expose a remote output port at the root level of the flow. Other systems can then pull from it as described in #2.

When thinking about scenarios 3 and 4, the idea is that your system is acting as a broker of data, and it is the external systems that are in control of when they give data to you and take it from you. Your system is simply providing the well published/documented control points for what those ports are for. We want to make sure this is very explicit and clear, so we require them to be at the root group level. You can then direct any data received to specific internal groups as you need, or source from internal groups as you need to expose data for pulling. If we were instead to allow these ports to live at any point, it would work, but what we've found is that it makes the flows harder to maintain and people end up furthering the approach of each flow being a discrete one-off/stovepipe configuration, which is generally not reflective of what really ends up happening with flows (rarely does data go from one place to one place - it is often a graph of inter-system exchange). Anyway, hopefully that helps give context for why it works the way it does.
11-22-2016
05:44 PM
Ok, cool. It looks like you have a pretty small heap size, so if the thing you do right after grabbing that big object is splitting it, make sure you do a two-phase split. The content itself should never be held in memory in full, but even the pointers/metadata about the existence of the flow files can add up. Let's say, for instance, you get the file and then split text on line boundaries. Do SplitText with, say, 1000 lines per split, then another SplitText to get down to single lines. This way we never dump references to 1,000,000 flow files at once. The approach I'm mentioning can handle extremely large inputs because there is never too much bookkeeping outstanding at once. We also intend to make that go away so users don't even have to consider it. On your flow, the rate you mention is about a 20 MB/s copy rate, which sounds relatively low. That might be worth looking into as well, but in any case your point about wanting to be able to observe in-flight behaviors is certainly a compelling user experience idea.
11-22-2016
05:13 PM
1 Kudo
This is a pretty common pattern and something we should add more out-of-the-box support for. It is a lot like how the GeoEnrich processor works: there is some reference dataset and some incoming data which needs to be enriched/altered based on pulling keys from the data and looking up their values in the dataset. I'd recommend you provide a ControllerService that loads/monitors changes to your reference dataset and offers a 'get(key)'-style lookup, where the value returned is the value you want to place into your data. Then provide a custom processor that uses that controller service against your data. You could do both of these using the scripting support (in Groovy, for example), or you could also just do this in a single processor. A rough sketch of the two-piece approach follows below. Hope that helps.
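To make that shape concrete, here is a minimal, hypothetical Java sketch of the controller-service-plus-processor pairing described above. The names ReferenceDataService and LookupEnrichProcessor, the lookup.key and enriched.value attributes, and the omitted dataset loading/monitoring logic are all assumptions made for illustration, not an existing NiFi API beyond the standard processor and controller-service interfaces:

```java
// --- ReferenceDataService.java (hypothetical service contract) ---
import org.apache.nifi.controller.ControllerService;

public interface ReferenceDataService extends ControllerService {
    // Implementations load and watch the reference dataset (file, DB, etc.)
    // and answer simple key lookups.
    String get(String key);
}
```

```java
// --- LookupEnrichProcessor.java (hypothetical processor using the service) ---
import java.util.Collections;
import java.util.List;
import java.util.Set;
import org.apache.nifi.components.PropertyDescriptor;
import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;

public class LookupEnrichProcessor extends AbstractProcessor {

    static final PropertyDescriptor LOOKUP_SERVICE = new PropertyDescriptor.Builder()
            .name("Reference Data Service")
            .description("Controller service that answers key lookups against the reference dataset")
            .identifiesControllerService(ReferenceDataService.class)
            .required(true)
            .build();

    static final Relationship REL_SUCCESS = new Relationship.Builder().name("success").build();

    @Override
    protected List<PropertyDescriptor> getSupportedPropertyDescriptors() {
        return Collections.singletonList(LOOKUP_SERVICE);
    }

    @Override
    public Set<Relationship> getRelationships() {
        return Collections.singleton(REL_SUCCESS);
    }

    @Override
    public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
        FlowFile flowFile = session.get();
        if (flowFile == null) {
            return;
        }
        final ReferenceDataService lookup =
                context.getProperty(LOOKUP_SERVICE).asControllerService(ReferenceDataService.class);

        // For brevity the key is taken from an attribute; a real processor might
        // instead read and rewrite the flowfile content.
        final String value = lookup.get(flowFile.getAttribute("lookup.key"));
        if (value != null) {
            flowFile = session.putAttribute(flowFile, "enriched.value", value);
        }
        session.transfer(flowFile, REL_SUCCESS);
    }
}
```

In the flow you would configure an implementation of the service once and point the processor's property at it; a scripted (Groovy) variant would follow the same contract.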
11-22-2016
05:08 PM
Something else worth mentioning, and it would be good to get your thoughts on it @J.Thomas King, is the idea of not actually copying in externally referenceable data, as a configurable option. By that I mean we would simply create a pointer/reference to the original input data wherever it lives (file, HTTP/URL, etc.). Then, whenever we actually operate on it in the flow, we would access it in its original form. This avoids needless copy tasks and could result in tremendous throughput benefits. The downside, of course, is that we cannot manage or guarantee the lifecycle of that data, but for certain cases this could be fine anyway. Would such a feature be helpful for your case?
11-22-2016
04:54 PM
At this time we don't show progress of in-flight sessions via that mechanism, other than the indicator of the number of active threads. That said, it is definitely a good idea - just not something we've done anything with to date.
10-18-2016
11:21 AM
1 Kudo
@Riccardo Iacomini, great post here, and I do think it will be quite helpful to others. As a result of this thread there are a couple of really important and helpful JIRAs being worked on as well: SplitText performance will be substantially improved, and MergeContent will be as well. But the design you've arrived at will still be the highest-sustained-performance approach. As you noted, FlowFile attributes need to be used wisely. They're a powerful tool, but they should generally be focused on things like tagging/routing rather than used as a sort of in-memory holder between deserialization and serialization. Anyway, great post and follow-through to help others!
10-11-2016
01:32 PM
2 Kudos
@Riccardo Iacomini, it looks like you're doing some really good work to think through this. One thing I would add is that it can often be quite ok to generate a lot of objects. A common source of GC pressure is the length of object retention and how many/how large the retained objects are. For short-lived objects that are created and then quickly become eligible for collection, I've generally found there is little challenge for the collector. The logic is a bit different with G1, but in any event I think we can go back to basics a bit here before spending time tuning the GC. I recommend running your flow with as small a heap as possible - consider a 512 MB or 1 GB heap, for instance (a sketch of the relevant bootstrap.conf settings is below). Run with a single thread in the flow controller, or very few, and run every processor with a single thread. Then let your flow run at full rate, measure the latencies, and profile the code. You will find some very interesting and useful things this way. If you're designing a flow for maximum performance (lowest latency and highest throughput with minimal CPU utilization) then you really want to think about the design of the flow. I do not recommend using flowfile attributes as a go-between mechanism to deserialize and serialize content from one form to another. Take the original format and convert it to the desired output in another processor. If you need to make routing and filtering decisions, do that on either the raw format or the converted format; which is the best choice depends on your case. Extracting things to attributes so that you can reuse existing processors is attractive, of course, but if your primary aim above all else is raw speed then you want to design for that before other trade-offs like reusability.
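For reference, NiFi's heap size is set in conf/bootstrap.conf. A minimal sketch of the relevant lines (the argument indexes and exact defaults vary across versions, so confirm against your own file) looks roughly like this:

```properties
# conf/bootstrap.conf - JVM heap settings (values shown are illustrative)
java.arg.2=-Xms512m
java.arg.3=-Xmx512m
```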
09-29-2016
03:26 PM
1 Kudo
So, without doing anything fancy configuration-wise and with a very basic template like pure-split-merge.xml (it assumes compressed input), I get around 13,333 events/sec on a very stable basis. The disk is moving but fine, CPU is pretty busy but fine, and GC is busy but fine with no full GCs. At this point it looks like there are some opportunities to improve how we schedule processors to be both more aggressive and less noisy (when there is no work to do), and a few of us are looking into that. This goes to your question about wasting speed: we see some cases where our scheduler itself could be wasting speed opportunities. In the meantime, a definitely fast option is to avoid the need to split data in the first place. Simply have the processors which were extracting attributes and then later altering content be composed together and operate on the dataset of events. That is less elegant and reusable, admittedly, so I'm not proposing it as the end solution - just stating that this approach works well. Anyway, we'll keep this conversation going. This is a perfect case to evaluate performance as it exercises a few important elements and can be a common pattern. More to follow as we learn more. This will almost certainly end up in a blog/article 🙂
09-28-2016
03:15 PM
1 Kudo
No problem at all on time - happy to help. This should be faster, so let's figure out what is happening; I appreciate the details you're providing. I've recreated a very similar flow and am seeing basically 10,000-20,000 events per second (depending on tuning). In a very basic, default-everything, single-threaded flow I am getting an end-to-end 20,000 events/sec, equating to about 20 MB/sec. This is on my MacBook. The amount of disk usage required to make this happen, given all the processors I have in the flow, equates to about 60 MB/s read with 50 MB/s write. That is all steady state and healthy, but it does seem like it should be faster. Disk isn't tapped out, nor is CPU, and GC looks great. So, adding threads... performance actually seemed to drop a bit in this case, and when I pushed it with a variety of scenarios it did then show these OOMEs, so I will be looking into this more. I've still got a 512 MB heap, so first I'll bump that a bit, which is reasonable given what I'm trying to do now. Regarding your CopyProcessor, keep in mind the UpdateAttribute processor does what you describe already and supports batching nicely. Regarding the logic of when to combine processors into one or not, I totally agree with your thinking - just wanted to put that out there for consideration. If you've already thought through that then I'm all with you. Will provide more thoughts as I/we get a better handle on the bottlenecks and options to move the needle.
09-28-2016
01:13 PM
2 Kudos
Cool, thanks for all the details. Yes, let's definitely avoid going the cluster route right now; once we see reasonable single-node performance then we can deal with scale-out.

Some quick observations:

- Your garbage collection results look great.
- The custom procs are indeed rather slow, though I'm not overly impressed with the numbers I see on the standard procs either. The second SplitText took 18 seconds to split 300 MB worth of lines - not ideal.
- You definitely should take advantage of NiFi's automatic session batching capability. Check out the SupportsBatching annotation; you can find several examples of its use. By having a processor support that, and setting 'run duration' higher than 0 in the UI, NiFi can automatically combine several commits into one, which can yield far higher throughput at the expense of latency (on the order of milliseconds). A minimal sketch of a processor carrying the annotation follows after this post.

Questions:

- What is the underlying storage device that NiFi is using? Type of disk (local disk, HDD or SSD)? Type of partitioning (are all the repos on a single disk)?
- Have you considered restructuring the composition of those custom processors? Could/should a couple reasonably be combined into a single step?
- ProcessNullFields performance appears really poor. Have you done any testing/evaluation to see what that processor is spending the bulk of its time on? Attaching a debugger/profiler at runtime could be really enlightening.
- CopyProcessor also appears heavy in terms of time. What does that one do?

I'll set up a vanilla flow that brings in a file like you mention, splits it, and merges it, all on a basic laptop setup, and let you know the results I see.
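Here is a minimal, hypothetical sketch of a processor carrying the SupportsBatching annotation; the processor name and pass-through logic are invented for illustration, and a real processor would also declare properties and do actual per-flowfile work:

```java
import java.util.Collections;
import java.util.Set;
import org.apache.nifi.annotation.behavior.SupportsBatching;
import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;

// Declares that the framework may combine this processor's session commits
// when the user raises 'Run Duration' above 0 on the scheduling tab.
@SupportsBatching
public class PassThroughExample extends AbstractProcessor {

    static final Relationship REL_SUCCESS = new Relationship.Builder().name("success").build();

    @Override
    public Set<Relationship> getRelationships() {
        return Collections.singleton(REL_SUCCESS);
    }

    @Override
    public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
        FlowFile flowFile = session.get();
        if (flowFile == null) {
            return;
        }
        // Per-flowfile work would go here; the framework decides how many of
        // these small sessions to fold into a single commit.
        session.transfer(flowFile, REL_SUCCESS);
    }
}
```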
09-27-2016
03:05 PM
1 Kudo
Could you please list the processors you have in the flow? The processors Matt notes can use a decent chunk of memory, but it is not really based on the original size of the input entry - it is more about the metadata for the individual flowfiles themselves. So a large input file does not necessarily mean large heap usage. The metadata for the flowfiles is in memory, but typically only a very small amount of content is ever in memory. Some processors, though, do use a lot of memory for one reason or another. We should probably put warnings about them in their docs and in the UI. Let's look through the list and identify candidates.
09-27-2016
03:01 PM
5 Kudos
Hello, how many lines/rows are in each incoming CSV file? A common pattern here is to do two-phase splits, where the first phase splits into, say, 5000-line bundles and the second phase splits into single lines. Then, using back pressure, you can avoid ever creating too many flowfiles at once that aren't being operated on and are simply causing excessive GC pressure. On a rather modest system you should very easily see performance (in conservative terms) of 30+ MB/s and tens to hundreds of thousands of rows per second for a flow like this. The bottleneck points in the flow should be fairly easy to spot through the UI - could you attach a screenshot of the running flow? One thing that is not easily spotted is when GC pressure is causing the issue. If you go to the summary page and open 'system diagnostics' you can view garbage collection info. What does it show?
09-08-2016
03:59 AM
Yep, what you describe with UpdateAttribute/MergeContent sounds perfectly fine. What you'll want there precisely will depend on how many relationships you have out of RouteText. As for concurrent tasks, I'd suggest:

- 1 for GetFile
- 1 for SplitText
- 2 to 4 or 5 on RouteText (no need to go too high generally)
- 1 for MergeContent
- 1 to 2 for PutHDFS

You don't have to stress too much over those numbers out of the gate. You can run it with minimal threads first, find any bottlenecks, and increase if necessary.
09-07-2016
02:08 PM
1 Kudo
Thea, if you look in the nifi-assembly/target folder, what do you see and how large are the files? It really just looks like an incomplete build at this point. Consider grabbing a convenience binary and using that so you can rule out local build issues. Thanks
09-07-2016
03:38 AM
5 Kudos
Hello, in Apache NiFi 1.0 there are no longer UI controls or API calls exposed to change whether or not a node is the primary node or the cluster coordinator, because those roles are now automatically elected and maintained by the zero-master clustering model backed by ZooKeeper. Any node at any time should be capable of taking on those designations. This ensures that we don't need any special nodes and that these valuable roles are always active from an HA perspective - which was not the case previously. Thanks
Joe
09-06-2016
03:49 PM
1 Kudo
I'm not sure I understand the versus framing as posed here. MirrorMaker can be used to replicate data from one Kafka broker to another. The NiFi site-to-site protocol can be used to replicate data from one NiFi cluster to another. Both support the appropriate security mechanisms. NiFi offers fine-grained provenance/lineage, but arguably Kafka's log replication/offset mechanism is sufficient for the case of replication. As for tuning, again, both offer strong tuning/throughput mechanisms. I'd recommend using the facilities of each.
09-06-2016
02:10 PM
1 Kudo
"A single concurrent task can work on a single file." That is worth clarifying. It is actually that a single concurrent task can work on a single process session. When RouteText creates a process session it pulls in a single flow file. Other processors can pull in many more. Just depends on the use case and design but fundamentally a single concurrent task can work on far more than a single file. For "this" use case and "this" processor the recommendation is to spit the input up so that parallelism can be taken advantage of.
09-06-2016
01:45 PM
3 Kudos
Hello, this is a perfectly fine use case, but I'd recommend breaking the input data up a bit so you can take advantage of parallelism. Given you have a 50M-line input, I'd recommend running it first through SplitText to break it into files with, say, 10,000 lines each. That would yield about 5,000 splits, each with around 10,000 lines. Then feed those into the RouteText processor. This way the work can be handled in a far better divide-and-conquer manner. You should see rates pretty close to the ideal rate of your underlying storage system. In very conservative terms, assume that is about 50 MB/s, so it should take about 5 minutes at most (and that can certainly be improved). Thanks, Joe
09-02-2016
10:54 AM
4 Kudos
If the processor in question believes there is something about a given flowfile that is temporary and may resolve itself, it will mark the flowfile as penalized. When it routes that penalized flowfile to some outgoing connection, the flowfile will not be accessible to the processor that might consume it until the penalty period expires. You can read a bit more about that here: https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#settings-tab. A good example where this is useful is delivery of flowfiles to some remote system using PutSFTP. It is common to route 'failure' from PutSFTP back to itself so it will keep trying. But sometimes there can be conflicts, such as a filename that already exists on the remote server, so you want to wait until it clears out and try again. In this case penalization lets us operate on other data while we put the problematic flowfiles off to the side. It's all just part of helping ensure the most productive action possible can happen, rather than just sitting there pounding the remote system with the same flowfile over and over. The typical processor-side pattern looks like the fragment below.
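As a hypothetical fragment of a processor's onTrigger() (assuming the usual REL_SUCCESS/REL_FAILURE relationships and a placeholder transferToRemoteSystem helper, neither of which comes from the original post), the penalize-then-route pattern looks roughly like this:

```java
try {
    transferToRemoteSystem(flowFile);        // placeholder for the real delivery work
    session.transfer(flowFile, REL_SUCCESS);
} catch (final IOException e) {
    getLogger().error("Transfer failed for {}; penalizing and routing to failure", new Object[]{flowFile}, e);
    flowFile = session.penalize(flowFile);   // stamps the flowfile with the configured penalty duration
    session.transfer(flowFile, REL_FAILURE); // downstream consumers skip it until the penalty expires
}
```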
09-02-2016
08:00 AM
When using a secured instance of NiFi, the user either logs in with a username and password or is identified using their certificate. The user first attempts to access NiFi, at which point an account is automatically created without any permissions. An administrator can then grant permissions, and you'll see them on that page you're showing above.
09-02-2016
01:52 AM
2 Kudos
Hello Obaid, you can go to Help in NiFi and bring up the docs; scroll down in the left-hand pane to the section titled 'Developer' and select 'REST API'. The docs can also be found here: https://nifi.apache.org/docs.html. From there you can select 'provenance' to get a detailed breakdown of the requests and the information necessary for them. A really good thing to do is use Chrome's developer tools while using the NiFi UI to create requests. Then you can see precisely what is being done by NiFi's client and emulate the same thing programmatically - a rough sketch of that is below. Thanks, Joe
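As a minimal, hypothetical sketch of programmatically submitting a provenance query against an unsecured local instance (the base URL, endpoint path, and JSON shape are assumptions - confirm them against the REST API docs for your NiFi version and against what the UI actually sends, per the developer-tools tip above):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ProvenanceQueryExample {
    public static void main(String[] args) throws Exception {
        final String base = "http://localhost:8080/nifi-api"; // assumed unsecured local instance
        final HttpClient client = HttpClient.newHttpClient();

        // 1. Submit a provenance query (a request DTO wrapped in a provenance DTO).
        final String body = "{\"provenance\":{\"request\":{\"maxResults\":100}}}";
        final HttpRequest submit = HttpRequest.newBuilder(URI.create(base + "/provenance"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        final HttpResponse<String> response = client.send(submit, HttpResponse.BodyHandlers.ofString());
        System.out.println("Submitted query: " + response.body());

        // 2. The response carries the query id/URI; poll that URI until it reports
        //    it is finished, read the returned events, then DELETE the query to
        //    free server-side resources (JSON parsing omitted in this sketch).
    }
}
```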
08-02-2016
03:28 PM
4 Kudos
When entering text values into NiFi you should be able to hit "Shift-Enter" and it will give you true new lines. See attached screenshot
07-28-2016
04:33 AM
3 Kudos
Regarding the first question about wanting to distribute data from a given node to another node: site-to-site is meant for sending data from one cluster to another on explicit ports (named input/entry points), and it takes care of load balancing and failover. At present, site-to-site does not support sending data to a limited subset of nodes based on some defined criteria (partitioning), though this is an interesting idea and something that has been talked about. However, as you've described your case thus far, you might find that simply using PostHTTP on the sending node(s) and ListenHTTP on the listening node(s) is sufficient. With PostHTTP you get to address a specific recipient and therefore will know that only that node is getting the data of interest. You could then route other data, which can be more generally spread throughout the cluster, over site-to-site.
07-23-2016
01:58 AM
2 Kudos
You could use existing processors such as ExtractText for some types of emails to extract attributes, which you can then use for routing. Or you could use the scripting processors and write your own code to extract features of the emails as attributes, then use RouteOnAttribute. In the NiFi community there was recently work merged (https://issues.apache.org/jira/browse/NIFI-1899) which looks like it will help a lot. For now, probably the best approach is to use ExecuteScript or InvokeScriptedProcessor to put together a quick e-mail parsing processor; a rough illustration of the parsing logic is below. Thanks
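To illustrate the parse-then-tag idea (this is not code from the post), here is a small, hypothetical Java sketch of the kind of logic such a script or processor would implement, assuming the JavaMail (javax.mail) library is available; the attribute names email.subject and email.from are invented for the example:

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.Properties;
import javax.mail.Session;
import javax.mail.internet.InternetAddress;
import javax.mail.internet.MimeMessage;

public class EmailFeatureExtractor {
    public static void main(String[] args) throws Exception {
        // Read a raw RFC 2822 message from a file path given on the command line.
        try (InputStream in = new FileInputStream(args[0])) {
            final Session mailSession = Session.getDefaultInstance(new Properties());
            final MimeMessage message = new MimeMessage(mailSession, in);

            // These values would become flowfile attributes in an ExecuteScript
            // or custom-processor version of the same logic, feeding RouteOnAttribute.
            System.out.println("email.subject = " + message.getSubject());
            System.out.println("email.from    = " + InternetAddress.toString(message.getFrom()));
        }
    }
}
```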
07-20-2016
03:12 PM
1 Kudo
It is not available yet but the team is working hard on it and we hope to have it officially supported very soon!
07-18-2016
01:53 AM
2 Kudos
@P C I share your view that there are a number of scenarios for which a JVM based dataflow management tool would be unfit or suboptimal. Recognizing that and a number of other unique challenges that exist in the edge collection space, the Hortonworks DataFlow team is working as part of the Apache MiNiFi community that Matt just mentioned. MiNiFi is a subproject of Apache NiFi and is designed to work seamlessly with NiFi. To your specific question asking if Hortonworks is developing any support for QNX I can state that we are supporting a range of IoT and 'metal that moves' cases as mentioned in that article. A recent public example of our efforts in this area can be found in this article https://hortonworks.com/blog/qualcomm-hortonworks-showcase-connected-car-platform-tu-automotive-detroit/ Thanks
07-16-2016
02:27 PM
4 Kudos
GetFile, when told to keep source files where it finds them, will capture them even if it doesn't have write permissions to the directory they are contained in. However, when told to remove source files once pulled, it requires write permissions to the directory it is pulling from, and when listing it will skip those it doesn't have permissions for. Given that we know there are files there, that it isn't pulling them in this case, and that it is specifically yielding - which only happens when the listing attempt provides no valid results - I strongly believe the parent directory permissions are not sufficient. Please verify.
07-16-2016
02:11 PM
Not sure just yet. Will take a look. The only time GetFile would yield, as is the case in the log output you show for keepFile=true, is when it finds nothing in the listing.
07-16-2016
02:01 PM
2 Kudos
I don't have any real numbers to share, though I found this article on the topic interesting - https://www.maxcdn.com/blog/ssl-performance-myth/ - and it aligns with what I've observed as well. There are relatively few cases where unsecured site-to-site is appropriate in comparison to secured site-to-site.