Created 11-23-2016 01:20 PM
Hi all,
I'm trying to build a dataflow as explained in this tutorial: site-to-site
It explains ListHDFS and FetchHDFS.
Is it possible to use an RPG inside the local cluster (I have one cluster with 3 nodes)?
Why can I not use an RPG and an input port inside a specific process group, but must instead create the dataflow at the top level?
Created 11-23-2016 01:39 PM
Hello @mayki wogno
It is certainly possible to use site-to-site (s2s) to send data to and from the same cluster of nodes; this is commonly done to rebalance data across the cluster at key, user-chosen points.
As for your second question about why RPG placement and port placement work the way they do, here are the scenarios.
1) You want to push data to another system using s2s
For this you can place an RPG anywhere you like in the flow and direct your data to it on a specific s2s port.
2) You want to pull data from another system using s2s
For this you can place an RPG anywhere you like in the flow and source data from it on a specific s2s port.
3) You want to allow another system to push to yours using s2s
For this you can have a remote input port exposed at the root level of the flow. Other systems can then push to it as described in #1.
4) You want to allow another system to pull from yours using s2s
For this you can have a remote output port exposed at the root level of the flow. Other systems can then pull from it as described in #2 above.
When thinking about scenarios 3 and 4, the idea is that your system is acting as a broker of data, and the external systems are in control of when they give data to you and take it from you. Your system simply provides well-published, well-documented control points for what those ports are for. We want that to be very explicit and clear, so we require those ports to be at the root group level. You can then direct any data received to specific internal groups as needed, or source from internal groups as needed to expose data for pulling.
If we were instead to allow these ports to live at any point in the flow, it would work, but we've found that it makes flows harder to maintain and encourages treating each flow as a discrete, one-off/stovepipe configuration. That is generally not reflective of what really ends up happening with flows (rarely does data go from one place to one place; it is usually a graph of inter-system exchange).
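As a side note, for the receiving side (scenarios 3 and 4) site-to-site also has to be enabled in nifi.properties on every node. A minimal sketch, assuming an unsecured cluster and port 10000 for the raw socket transport (both are assumptions, adjust to your environment):

    # nifi.properties (set on every node in the cluster)
    nifi.remote.input.host=
    nifi.remote.input.secure=false
    nifi.remote.input.socket.port=10000
    nifi.remote.input.http.enabled=true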
Anyway, hopefully that helps give context for why it works the way it does.
Created 11-23-2016 01:52 PM
Thanks, that is clear.
Another question about the RPG URL: is it possible to use a VIP as the URL?
Because the URL is a point of failure if that NiFi node crashes.
Created 11-24-2016 12:40 AM
Hello @mayki wogno
Yes, a VIP should work as the RPG URL. A load balancer such as HAProxy can also be used in front of a NiFi cluster, with its host:port used as the RPG URL.
After the RPG retrieves cluster information from that URL (the request is propagated to one of the NiFi nodes), it accesses each node directly by its IP or hostname to transfer data.
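For illustration, a minimal HAProxy sketch for fronting the NiFi UI/API; the node hostnames and port 8080 are assumptions, substitute your own (the RPG URL would then point at the HAProxy host and port, e.g. http://haproxy-host:8080/nifi):

    # haproxy.cfg (hypothetical hostnames and ports)
    frontend nifi_ui
        bind *:8080
        mode http
        default_backend nifi_nodes
    backend nifi_nodes
        mode http
        balance roundrobin
        server nifi1 nifi-node1.example.com:8080 check
        server nifi2 nifi-node2.example.com:8080 check
        server nifi3 nifi-node3.example.com:8080 check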
Also, there's an ongoing effort to allow multiple URLs as RPG URL to avoid making it a SPOF:
Created 11-24-2016 08:12 AM
@kkawamura thanks
Created on 11-24-2016 08:47 AM - edited 08-19-2019 03:42 AM
I have a dataflow with ListHDFS and an RPG, but I can't explain this error:
Created 11-24-2016 09:03 AM
The second error message on the bulletin indicates that the Hadoop configuration files could not be found.
Please check the file paths configured in the 'Hadoop Configuration Resources' property of ListHDFS. It should point to core-site.xml and hdfs-site.xml, and those files should be accessible by the user running NiFi.
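For example, on a typical Hadoop client install the property value might look like the following; the paths are only an assumption, use wherever your cluster's client configuration files actually live:

    Hadoop Configuration Resources: /etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml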
Created 11-25-2016 09:17 AM
@kkawamura : the second error is very clear. I was talking about the first error message.
Created 11-25-2016 11:57 AM
Hi @mayki wogno
The first error message was caused by the same error as the second one. The processor reported the error twice: ListHDFS logged an error message when it caught the exception and then re-threw it, and the NiFi framework caught the re-thrown exception and logged another error message.
When NiFi framework catches an exception thrown by a processor, it yields the processor for the amount of time specified by 'Yield Duration'.
Once the processor successfully accesses core-site.xml and hdfs-site.xml, both error messages will be cleared.
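For reference, 'Yield Duration' is set on the processor's Settings tab. The values below are the usual NiFi defaults, shown here only as an illustration; adjust as needed:

    Yield Duration: 1 sec
    Penalty Duration: 30 sec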