Member since: 07-30-2019
Posts: 105
Kudos Received: 129
Solutions: 43
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 762 | 02-27-2018 01:55 PM
 | 1238 | 02-27-2018 05:01 AM
 | 3093 | 02-27-2018 04:43 AM
 | 664 | 02-27-2018 04:18 AM
 | 1903 | 02-27-2018 03:52 AM
01-11-2017
02:47 AM
2 Kudos
I think Matt provided a nice and comprehensive response. I'll add that while we do offer a lot of flexibility and fine-grained control for handling any case (whether for you that is a failure, a success, a connection issue, etc.), we can do better. One of the plans we've discussed is to provide reference-able process groups. This would allow you to effectively call a portion of the flow like a function, with a simple input/function/output model. You can read more about this here: https://cwiki.apache.org/confluence/display/NIFI/Reference-able+Process+Groups. We also have data provenance, in which we capture details of why we routed any given flowfile to any given relationship; this is not used as often as it could be. Further, we need to surface this information for use within the flow so error-handling steps could capture things like 'last processor' and 'last transfer description'. In short, there are exciting things on the horizon to make all sorts of flow management cases easier and more intuitive, and the items mentioned above will be an important part of that. Thanks
11-23-2016
01:39 PM
1 Kudo
Hello @mayki wogno. It is certainly possible to use site-to-site (s2s) to send data to and from the same cluster of nodes, and this is commonly done as a way to rebalance data across a cluster at key user-chosen points. As to your second question regarding why it works the way it does for RPG placement and port placement, here are the scenarios:

1) You want to push data to another system using s2s. You can place an RPG anywhere you like in the flow and direct your data to it on a specific s2s port.

2) You want to pull data from another system using s2s. You can place an RPG anywhere you like in the flow and source data from it on a specific s2s port.

3) You want to allow another system to push to yours using s2s. You expose a remote input port at the root level of the flow. Other systems can then push to it as described in #1.

4) You want to allow another system to pull from yours using s2s. You expose a remote output port at the root level of the flow. Other systems can then pull from it as described in #2.

When thinking about scenarios 3 and 4, the idea is that your system is acting as a broker of data, and it is the external systems that are in control of when they give data to you and take it from you. Your system is simply providing the well published/documented control points for what those ports are for. We want to make sure this is very explicit and clear, so we require them to be at the root group level. You can then direct any data received to specific internal groups as you need, or source from internal groups as you need to expose data for pulling. If we were instead to allow these ports to live at any point, it would work, but what we've found is that it makes the flows harder to maintain and people end up furthering the approach of each flow being a discrete one-off/stovepipe configuration, which is generally not reflective of what really ends up happening with flows (rarely does data go from one place to one place - it is often a graph of inter-system exchange). Anyway, hopefully that helps give context for why it works the way it does.
11-22-2016
05:44 PM
Ok, cool. It looks like you have a pretty small heap size, so if the thing you do right after grabbing that big object is splitting it, make sure you do a two-phase split. The content itself should never be held in memory in full, but even the pointers/metadata about the existence of the flow files can add up. Let's say, for instance, you get the file and then split text on line boundaries. Do SplitText with, say, 1000 lines per split, then another SplitText to get down to single lines. This way we never dump references to 1,000,000 flow files at once. The approach I'm mentioning can handle extremely large inputs because there is never too much bookkeeping outstanding at once. We also intend to make that go away so users don't even have to consider it. On your flow, the rate you mention is about a 20 MB/s copy rate, which sounds relatively low. That might be worth looking into as well, but in any case your point about wanting to be able to observe in-flight behaviors is certainly a compelling user experience idea.
11-22-2016
05:13 PM
1 Kudo
This is a pretty common pattern and something we should add more out-of-the-box support for. It is a lot like how the GeoEnrich processor works: there is some reference dataset and some incoming data which needs to be enriched/altered based on pulling keys from the data and looking up their values in the dataset. I'd recommend you provide a ControllerService that loads/monitors changes to your reference dataset and offers a 'get(key)'-style lookup, where the value returned is the value you want to place into your data. Then provide a custom processor that uses that controller service against your data. You could do both of these using the scripting support (in Groovy, for example), or you could also just do this in a single processor. A rough sketch of the two-piece approach follows below. Hope that helps.
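To make that shape concrete, here is a minimal, hypothetical Java sketch of the controller-service-plus-processor pairing described above. The names ReferenceDataService and LookupEnrichProcessor, the lookup.key and enriched.value attributes, and the omitted dataset loading/monitoring logic are all assumptions made for illustration, not an existing NiFi API beyond the standard processor and controller-service interfaces:

```java
// --- ReferenceDataService.java (hypothetical service contract) ---
import org.apache.nifi.controller.ControllerService;

public interface ReferenceDataService extends ControllerService {
    // Implementations load and watch the reference dataset (file, DB, etc.)
    // and answer simple key lookups.
    String get(String key);
}
```

```java
// --- LookupEnrichProcessor.java (hypothetical processor using the service) ---
import java.util.Collections;
import java.util.List;
import java.util.Set;
import org.apache.nifi.components.PropertyDescriptor;
import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;

public class LookupEnrichProcessor extends AbstractProcessor {

    static final PropertyDescriptor LOOKUP_SERVICE = new PropertyDescriptor.Builder()
            .name("Reference Data Service")
            .description("Controller service that answers key lookups against the reference dataset")
            .identifiesControllerService(ReferenceDataService.class)
            .required(true)
            .build();

    static final Relationship REL_SUCCESS = new Relationship.Builder().name("success").build();

    @Override
    protected List<PropertyDescriptor> getSupportedPropertyDescriptors() {
        return Collections.singletonList(LOOKUP_SERVICE);
    }

    @Override
    public Set<Relationship> getRelationships() {
        return Collections.singleton(REL_SUCCESS);
    }

    @Override
    public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
        FlowFile flowFile = session.get();
        if (flowFile == null) {
            return;
        }
        final ReferenceDataService lookup =
                context.getProperty(LOOKUP_SERVICE).asControllerService(ReferenceDataService.class);

        // For brevity the key is taken from an attribute; a real processor might
        // instead read and rewrite the flowfile content.
        final String value = lookup.get(flowFile.getAttribute("lookup.key"));
        if (value != null) {
            flowFile = session.putAttribute(flowFile, "enriched.value", value);
        }
        session.transfer(flowFile, REL_SUCCESS);
    }
}
```

In the flow you would configure an implementation of the service once and point the processor's property at it; a scripted (Groovy) variant would follow the same contract.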
11-22-2016
05:08 PM
Something else worth mentioning, and it would be good to get your thoughts on it @J.Thomas King, is the idea of not actually copying in externally referenceable data, as a configurable option. By that I mean we would simply create a pointer/reference to the original input data wherever it lives (file, HTTP/URL, etc.). Then, whenever we actually operate on it in the flow, we would access it in its original form. This avoids needless copy tasks and could result in tremendous throughput benefits. The downside, of course, is that we cannot manage or guarantee the lifecycle of that data, but for certain cases this could be fine anyway. Would such a feature be helpful for your case?
11-22-2016
04:54 PM
At this time we don't show progress of in-flight sessions via that mechanism, other than the indicator of the number of active threads. That said, it is definitely a good idea - just not something we've done anything with to date.
10-18-2016
11:21 AM
1 Kudo
@Riccardo Iacomini, great post here, and I do think it will be quite helpful to others. As a result of this thread there are a couple of really important and helpful JIRAs being worked on as well: SplitText performance will be substantially improved, and MergeContent will be as well. But the design you've arrived at will still be the highest-sustained-performance approach. As you noted, FlowFile attributes need to be used wisely. They're a powerful tool, but they should generally be focused on things like tagging/routing rather than used as a sort of in-memory holder between deserialization and serialization. Anyway, great post and follow-through to help others!
10-11-2016
01:32 PM
2 Kudos
@Riccardo Iacomini, it looks like you're doing some really good work to think through this. One thing I would add is that it can often be quite ok to generate a lot of objects. A common source of GC pressure is the length of object retention and how many/how large the retained objects are. For short-lived objects that are created and then quickly become eligible for collection, I've generally found there is little challenge for the collector. The logic is a bit different with G1, but in any event I think we can go back to basics a bit here before spending time tuning the GC. I recommend running your flow with as small a heap as possible - consider a 512 MB or 1 GB heap, for instance (a sketch of the relevant bootstrap.conf settings is below). Run with a single thread in the flow controller, or very few, and run every processor with a single thread. Then let your flow run at full rate, measure the latencies, and profile the code. You will find some very interesting and useful things this way. If you're designing a flow for maximum performance (lowest latency and highest throughput with minimal CPU utilization) then you really want to think about the design of the flow. I do not recommend using flowfile attributes as a go-between mechanism to deserialize and serialize content from one form to another. Take the original format and convert it to the desired output in another processor. If you need to make routing and filtering decisions, do that on either the raw format or the converted format; which is the best choice depends on your case. Extracting things to attributes so that you can reuse existing processors is attractive, of course, but if your primary aim above all else is raw speed then you want to design for that before other trade-offs like reusability.
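For reference, NiFi's heap size is set in conf/bootstrap.conf. A minimal sketch of the relevant lines (the argument indexes and exact defaults vary across versions, so confirm against your own file) looks roughly like this:

```properties
# conf/bootstrap.conf - JVM heap settings (values shown are illustrative)
java.arg.2=-Xms512m
java.arg.3=-Xmx512m
```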
09-29-2016
03:26 PM
1 Kudo
So, without doing anything fancy configuration-wise and with a very basic template like pure-split-merge.xml (it assumes compressed input), I get around 13,333 events/sec on a very stable basis. The disk is moving but fine, CPU is pretty busy but fine, and GC is busy but fine with no full GCs. At this point it looks like there are some opportunities to improve how we schedule processors to be both more aggressive and less noisy (when there is no work to do), and a few of us are looking into that. This goes to your question about wasting speed: we see some cases where our scheduler itself could be wasting speed opportunities. In the meantime, a definitely fast option is to avoid the need to split data in the first place. Simply have the processors which were extracting attributes and then later altering content be composed together and operate on the dataset of events. That is less elegant and reusable, admittedly, so I'm not proposing it as the end solution - just stating that this approach works well. Anyway, we'll keep this conversation going. This is a perfect case to evaluate performance as it exercises a few important elements and can be a common pattern. More to follow as we learn more. This will almost certainly end up in a blog/article 🙂
09-28-2016
03:15 PM
1 Kudo
No problem at all on time - happy to help. This should be faster, so let's figure out what is happening; I appreciate the details you're providing. I've recreated a very similar flow and am seeing basically 10,000-20,000 events per second (depending on tuning). In a very basic, default-everything, single-threaded flow I am getting an end-to-end 20,000 events/sec, equating to about 20 MB/sec. This is on my MacBook. The amount of disk usage required to make this happen, given all the processors I have in the flow, equates to about 60 MB/s read with 50 MB/s write. That is all steady state and healthy, but it does seem like it should be faster. Disk isn't tapped out, nor is CPU, and GC looks great. So, adding threads... performance actually seemed to drop a bit in this case, and when I pushed it with a variety of scenarios it did then show these OOMEs, so I will be looking into this more. I've still got a 512 MB heap, so first I'll bump that a bit, which is reasonable given what I'm trying to do now. Regarding your CopyProcessor, keep in mind the UpdateAttribute processor does what you describe already and supports batching nicely. Regarding the logic of when to combine processors into one or not, I totally agree with your thinking - just wanted to put that out there for consideration. If you've already thought through that then I'm all with you. Will provide more thoughts as I/we get a better handle on the bottlenecks and options to move the needle.
09-28-2016
01:13 PM
2 Kudos
Cool, thanks for all the details. Yes, let's definitely avoid going the cluster route right now; once we see reasonable single-node performance then we can deal with scale-out.

Some quick observations:

- Your garbage collection results look great.
- The custom procs are indeed rather slow, though I'm not overly impressed with the numbers I see on the standard procs either. The second SplitText took 18 seconds to split 300 MB worth of lines - not ideal.
- You definitely should take advantage of NiFi's automatic session batching capability. Check out the SupportsBatching annotation; you can find several examples of its use. By having a processor support that, and setting 'run duration' higher than 0 in the UI, NiFi can automatically combine several commits into one, which can yield far higher throughput at the expense of latency (on the order of milliseconds). A minimal sketch of a processor carrying the annotation follows after this post.

Questions:

- What is the underlying storage device that NiFi is using? Type of disk (local disk, HDD or SSD)? Type of partitioning (are all the repos on a single disk)?
- Have you considered restructuring the composition of those custom processors? Could/should a couple reasonably be combined into a single step?
- ProcessNullFields performance appears really poor. Have you done any testing/evaluation to see what that processor is spending the bulk of its time on? Attaching a debugger/profiler at runtime could be really enlightening.
- CopyProcessor also appears heavy in terms of time. What does that one do?

I'll set up a vanilla flow that brings in a file like you mention, splits it, and merges it, all on a basic laptop setup, and let you know the results I see.
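Here is a minimal, hypothetical sketch of a processor carrying the SupportsBatching annotation; the processor name and pass-through logic are invented for illustration, and a real processor would also declare properties and do actual per-flowfile work:

```java
import java.util.Collections;
import java.util.Set;
import org.apache.nifi.annotation.behavior.SupportsBatching;
import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;

// Declares that the framework may combine this processor's session commits
// when the user raises 'Run Duration' above 0 on the scheduling tab.
@SupportsBatching
public class PassThroughExample extends AbstractProcessor {

    static final Relationship REL_SUCCESS = new Relationship.Builder().name("success").build();

    @Override
    public Set<Relationship> getRelationships() {
        return Collections.singleton(REL_SUCCESS);
    }

    @Override
    public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
        FlowFile flowFile = session.get();
        if (flowFile == null) {
            return;
        }
        // Per-flowfile work would go here; the framework decides how many of
        // these small sessions to fold into a single commit.
        session.transfer(flowFile, REL_SUCCESS);
    }
}
```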
09-27-2016
03:05 PM
1 Kudo
Could you please list the processors you have in the flow? The processors Matt notes can use a decent chunk of memory, but it is not really based on the original size of the input entry - it is more about the metadata for the individual flowfiles themselves. So a large input file does not necessarily mean large heap usage. The metadata for the flowfiles is in memory, but typically only a very small amount of content is ever in memory. Some processors, though, do use a lot of memory for one reason or another. We should probably put warnings about them in their docs and in the UI. Let's look through the list and identify candidates.
09-27-2016
03:01 PM
5 Kudos
Hello, how many lines/rows are in each incoming CSV file? A common pattern here is to do two-phase splits, where the first phase splits into, say, 5000-line bundles and the second phase splits into single lines. Then, using back pressure, you can avoid ever creating too many flowfiles at once that aren't being operated on and are simply causing excessive GC pressure. On a rather modest system you should very easily see performance (in conservative terms) of 30+ MB/s and tens to hundreds of thousands of rows per second for a flow like this. The bottleneck points in the flow should be fairly easy to spot through the UI - could you attach a screenshot of the running flow? One thing that is not easily spotted is when GC pressure is causing the issue. If you go to the summary page and open 'system diagnostics' you can view garbage collection info. What does it show?
09-08-2016
03:59 AM
Yep, what you describe with UpdateAttribute/MergeContent sounds perfectly fine. What you'll want there precisely will depend on how many relationships you have out of RouteText. As for concurrent tasks, I'd suggest:

- 1 for GetFile
- 1 for SplitText
- 2 to 4 or 5 on RouteText (no need to go too high generally)
- 1 for MergeContent
- 1 to 2 for PutHDFS

You don't have to stress too much over those numbers out of the gate. You can run it with minimal threads first, find any bottlenecks, and increase if necessary.
09-07-2016
02:08 PM
1 Kudo
Thea, if you look in the nifi-assembly/target folder, what do you see and how large are the files? It really just looks like an incomplete build at this point. Consider grabbing a convenience binary and using that so you can rule out local build issues. Thanks
09-07-2016
03:38 AM
5 Kudos
Hello, in Apache NiFi 1.0 there are no longer UI controls or API calls exposed to change whether or not a node is the primary node or the cluster coordinator, because those roles are now automatically elected and maintained by the zero-master clustering model backed by ZooKeeper. Any node at any time should be capable of taking on those designations. This ensures that we don't need any special nodes and that these valuable roles are always active from an HA perspective - which was not the case previously. Thanks
Joe
09-06-2016
03:49 PM
1 Kudo
I'm not sure I understand the versus framing as posed here. MirrorMaker can be used to replicate data from one Kafka broker to another. The NiFi site-to-site protocol can be used to replicate data from one NiFi cluster to another. Both support the appropriate security mechanisms. NiFi offers fine-grained provenance/lineage, but arguably Kafka's log replication/offset mechanism is sufficient for the case of replication. As for tuning, again, both offer strong tuning/throughput mechanisms. I'd recommend using the facilities of each.
09-06-2016
02:10 PM
1 Kudo
"A single concurrent task can work on a single file." That is worth clarifying. It is actually that a single concurrent task can work on a single process session. When RouteText creates a process session it pulls in a single flow file. Other processors can pull in many more. Just depends on the use case and design but fundamentally a single concurrent task can work on far more than a single file. For "this" use case and "this" processor the recommendation is to spit the input up so that parallelism can be taken advantage of.
09-06-2016
01:45 PM
3 Kudos
Hello, this is a perfectly fine use case, but I'd recommend breaking the input data up a bit so you can take advantage of parallelism. Given you have a 50M-line input, I'd recommend running it first through SplitText to break it into files with, say, 10,000 lines each. That would yield about 5,000 splits, each with around 10,000 lines. Then feed those into the RouteText processor. This way the work can be handled in a far better divide-and-conquer manner. You should see rates pretty close to the ideal rate of your underlying storage system. In very conservative terms, assume that is about 50 MB/s, so it should take about 5 minutes at most (and that can certainly be improved). Thanks, Joe
09-02-2016
10:54 AM
4 Kudos
If the processor in question believes there is something about a given flowfile that is temporary and may resolve itself, it will mark the flowfile as penalized. When it routes that penalized flowfile to some outgoing connection, the flowfile will not be accessible to the processor that might consume it until the penalty period expires. You can read a bit more about that here: https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#settings-tab. A good example where this is useful is delivery of flowfiles to some remote system using PutSFTP. It is common to route 'failure' from PutSFTP back to itself so it will keep trying. But sometimes there can be conflicts, such as a filename that already exists on the remote server, so you want to wait until it clears out and try again. In this case penalization lets us operate on other data while we put the problematic flowfiles off to the side. It's all just part of helping ensure the most productive action possible can happen, rather than just sitting there pounding the remote system with the same flowfile over and over. The typical processor-side pattern looks like the fragment below.
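As a hypothetical fragment of a processor's onTrigger() (assuming the usual REL_SUCCESS/REL_FAILURE relationships and a placeholder transferToRemoteSystem helper, neither of which comes from the original post), the penalize-then-route pattern looks roughly like this:

```java
try {
    transferToRemoteSystem(flowFile);        // placeholder for the real delivery work
    session.transfer(flowFile, REL_SUCCESS);
} catch (final IOException e) {
    getLogger().error("Transfer failed for {}; penalizing and routing to failure", new Object[]{flowFile}, e);
    flowFile = session.penalize(flowFile);   // stamps the flowfile with the configured penalty duration
    session.transfer(flowFile, REL_FAILURE); // downstream consumers skip it until the penalty expires
}
```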
09-02-2016
08:00 AM
When using a secured instance of NiFi, the user either logs in with a username and password or is identified using their certificate. The user first attempts to access NiFi, at which point an account is automatically created without any permissions. An administrator can then grant permissions, and you'll see them on that page you're showing above.
09-02-2016
01:52 AM
2 Kudos
Hello Obaid, you can go to Help in NiFi and bring up the docs; scroll down in the left-hand pane to the section titled 'Developer' and select 'REST API'. The docs can also be found here: https://nifi.apache.org/docs.html. From there you can select 'provenance' to get a detailed breakdown of the requests and the information necessary for them. A really good thing to do is use Chrome's developer tools while using the NiFi UI to create requests. Then you can see precisely what is being done by NiFi's client and emulate the same thing programmatically - a rough sketch of that is below. Thanks, Joe
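As a minimal, hypothetical sketch of programmatically submitting a provenance query against an unsecured local instance (the base URL, endpoint path, and JSON shape are assumptions - confirm them against the REST API docs for your NiFi version and against what the UI actually sends, per the developer-tools tip above):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ProvenanceQueryExample {
    public static void main(String[] args) throws Exception {
        final String base = "http://localhost:8080/nifi-api"; // assumed unsecured local instance
        final HttpClient client = HttpClient.newHttpClient();

        // 1. Submit a provenance query (a request DTO wrapped in a provenance DTO).
        final String body = "{\"provenance\":{\"request\":{\"maxResults\":100}}}";
        final HttpRequest submit = HttpRequest.newBuilder(URI.create(base + "/provenance"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        final HttpResponse<String> response = client.send(submit, HttpResponse.BodyHandlers.ofString());
        System.out.println("Submitted query: " + response.body());

        // 2. The response carries the query id/URI; poll that URI until it reports
        //    it is finished, read the returned events, then DELETE the query to
        //    free server-side resources (JSON parsing omitted in this sketch).
    }
}
```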
08-02-2016
03:28 PM
4 Kudos
When entering text values into NiFi you should be able to hit "Shift-Enter" and it will give you true new lines. See attached screenshot
07-28-2016
04:33 AM
3 Kudos
Regarding the first question about wanting to distribute data from a given node to another node: site-to-site is meant for sending data from one cluster to another on explicit ports (named input/entry points), and it takes care of load balancing and failover. At present, site-to-site does not support sending data to a limited subset of nodes based on some defined criteria (partitioning), though this is an interesting idea and something that has been talked about. However, as you've described your case thus far, you might find that simply using PostHTTP on the sending node(s) and ListenHTTP on the listening node(s) is sufficient. With PostHTTP you get to address a specific recipient and therefore will know that only that node is getting the data of interest. You could then route other data, which can be more generally spread throughout the cluster, over site-to-site.
07-23-2016
01:58 AM
2 Kudos
You could use existing processors such as ExtractText for some types of emails to extract attributes, which you can then use for routing. Or you could use the scripting processors and write your own code to extract features of the emails as attributes, then use RouteOnAttribute. In the NiFi community there was recently work merged (https://issues.apache.org/jira/browse/NIFI-1899) which looks like it will help a lot. For now, probably the best approach is to use ExecuteScript or InvokeScriptedProcessor to put together a quick e-mail parsing processor; a rough illustration of the parsing logic is below. Thanks
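To illustrate the parse-then-tag idea (this is not code from the post), here is a small, hypothetical Java sketch of the kind of logic such a script or processor would implement, assuming the JavaMail (javax.mail) library is available; the attribute names email.subject and email.from are invented for the example:

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.Properties;
import javax.mail.Session;
import javax.mail.internet.InternetAddress;
import javax.mail.internet.MimeMessage;

public class EmailFeatureExtractor {
    public static void main(String[] args) throws Exception {
        // Read a raw RFC 2822 message from a file path given on the command line.
        try (InputStream in = new FileInputStream(args[0])) {
            final Session mailSession = Session.getDefaultInstance(new Properties());
            final MimeMessage message = new MimeMessage(mailSession, in);

            // These values would become flowfile attributes in an ExecuteScript
            // or custom-processor version of the same logic, feeding RouteOnAttribute.
            System.out.println("email.subject = " + message.getSubject());
            System.out.println("email.from    = " + InternetAddress.toString(message.getFrom()));
        }
    }
}
```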
07-20-2016
03:12 PM
1 Kudo
It is not available yet but the team is working hard on it and we hope to have it officially supported very soon!
07-18-2016
01:53 AM
2 Kudos
@P C I share your view that there are a number of scenarios for which a JVM based dataflow management tool would be unfit or suboptimal. Recognizing that and a number of other unique challenges that exist in the edge collection space, the Hortonworks DataFlow team is working as part of the Apache MiNiFi community that Matt just mentioned. MiNiFi is a subproject of Apache NiFi and is designed to work seamlessly with NiFi. To your specific question asking if Hortonworks is developing any support for QNX I can state that we are supporting a range of IoT and 'metal that moves' cases as mentioned in that article. A recent public example of our efforts in this area can be found in this article https://hortonworks.com/blog/qualcomm-hortonworks-showcase-connected-car-platform-tu-automotive-detroit/ Thanks
07-16-2016
02:27 PM
4 Kudos
GetFile, when told to keep source files where it finds them, will capture them even if it doesn't have write permissions to the directory they are contained in. However, when told to remove source files once pulled, it requires write permissions to the directory it is pulling from, and when listing it will skip those it doesn't have permissions for. Given that we know there are files there, that it isn't pulling them in this case, and that it is specifically yielding - which only happens when the listing attempt provides no valid results - I strongly believe the parent directory permissions are not sufficient. Please verify.
07-16-2016
02:11 PM
Not sure just yet. Will take a look. The only time GetFile would yield, as is the case in the log output you show for keepFile=true, is when it finds nothing in the listing.
07-16-2016
02:01 PM
2 Kudos
I don't have any real numbers to share, though I found this article on the topic interesting - https://www.maxcdn.com/blog/ssl-performance-myth/ - and it aligns with what I've observed as well. There are relatively few cases where unsecured site-to-site is appropriate in comparison to secured site-to-site.