NiFi: RPG site-to-site
Labels: Apache NiFi
Created ‎11-23-2016 01:20 PM
Hi all,
I'm trying to build a dataflow as explained in this tutorial: site-to-site. It covers ListHDFS and FetchHDFS.
Is it possible to use an RPG inside the local cluster (I have one cluster with 3 nodes)?
And why can I not use an RPG and an input port inside a specific process group? Why must I create that part of the dataflow at the top level?
Created ‎11-23-2016 01:39 PM
Hello @mayki wogno
It is certainly possible to use site-to-site (s2s) to send data to and from the same cluster of nodes; this is commonly done as a way to rebalance data across a cluster at key, user-chosen points.
As to your second question, regarding why RPG placement and port placement work the way they do, here are the scenarios.
1) You want to push data to another system using s2s
For this you can place an RPG anywhere you like in the flow and direct your data to it on a specific s2s port.
2) You want to pull data from another system using s2s
For this you can place an RPG anywhere you like in the flow and source data from it on a specific s2s port.
3) You want to allow another system to push to yours using s2s
For this you can have a remote input port exposed at the root level of the flow. Other systems can then push to it as described in #1.
4) You want to allow another system to pull from yours using s2s
For this you can have a remote output port exposed at the root level of the flow. Other systems can then pull from it as described in #2 above.
When thinking about scenarios 3 and 4, the idea is that your system is acting as a broker of data, and it is the external systems that control when they give data to you and take it from you. Your system simply provides the well-published/documented control points (the ports). We want this to be very explicit and clear, so we require those ports to be at the root group level. You can then direct any data received to specific internal groups as needed, or source from internal groups as needed to expose data for pulling.
If we instead allowed these ports to live at any point in the flow, it would work, but we've found it makes flows harder to maintain, and it encourages people to treat each flow as a discrete, one-off/stovepipe configuration. That is generally not reflective of what really happens with flows: data rarely moves from just one place to one other place; it is usually a graph of inter-system exchange.
Anyway, hopefully that helps give context for why it works the way it does.
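For reference, each node must have its remote input settings enabled in nifi.properties before it can participate in s2s at all. A minimal sketch of the relevant entries (the hostname and port shown are placeholders, not recommendations; they must match your own nodes and whether you use the RAW socket or HTTP transport):

```
# Site-to-site settings in nifi.properties, set on every node.
# nifi-node1.example.com and port 10000 are placeholder values.
nifi.remote.input.host=nifi-node1.example.com
nifi.remote.input.socket.port=10000
nifi.remote.input.secure=false
```

With these in place, an RPG pointing at any node's URL can discover the cluster and transfer data to the root-level ports described above.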
Created ‎11-23-2016 01:52 PM
Thanks, that is clear.
Another question about the URL of the RPG: is it possible to use a VIP as the URL?
Because the URL is a single point of failure if that NiFi node crashes.
Created ‎11-24-2016 12:40 AM
Hello @mayki wogno
Yes, a VIP should work as the RPG URL. A load balancer such as HAProxy can also be used in front of a NiFi cluster, with its host:port used as the RPG URL.
Once the RPG has obtained the cluster topology from that URL (the request is simply propagated to one of the NiFi nodes), it accesses each node directly by its IP or hostname to transfer data.
Also, there is an ongoing effort to allow multiple URLs in the RPG URL setting, to avoid making it a single point of failure.
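As an illustration, a minimal HAProxy sketch fronting a 3-node cluster could look like the following (the hostnames and port 8080 are assumptions for an unsecured HTTP setup; match them to the web/API port in your nifi.properties):

```
# Hypothetical haproxy.cfg fragment balancing the NiFi HTTP port.
# Node hostnames and port 8080 are placeholders for your own values.
frontend nifi_front
    bind *:8080
    mode http
    default_backend nifi_nodes

backend nifi_nodes
    mode http
    balance roundrobin
    server nifi1 nifi-node1.example.com:8080 check
    server nifi2 nifi-node2.example.com:8080 check
    server nifi3 nifi-node3.example.com:8080 check
```

The RPG URL would then be the HAProxy host:port; only the initial topology request goes through the load balancer, after which each transfer goes directly to a node, as described above.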
Created ‎11-24-2016 08:12 AM
@kkawamura thanks
Created on ‎11-24-2016 08:47 AM - edited ‎08-19-2019 03:42 AM
I have a dataflow with ListHDFS and an RPG, but I cannot explain the errors shown in this bulletin:
[screenshot of two bulletin error messages; image not preserved]
Created ‎11-24-2016 09:03 AM
The second error message on the bulletin indicates that the Hadoop configuration files could not be found.
Please check the file paths configured in the 'Hadoop Configuration Resources' property of ListHDFS. It should point at core-site.xml and hdfs-site.xml, and those files must be readable by the user running NiFi.
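For example, a typical value for that property looks like the following (the /etc/hadoop/conf paths are an assumption; use wherever the Hadoop client configs actually live on the NiFi hosts):

```
# Example 'Hadoop Configuration Resources' value for ListHDFS.
# /etc/hadoop/conf is a common location, not guaranteed on your cluster;
# both files must be readable by the OS user running NiFi.
/etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml
```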
Created ‎11-25-2016 09:17 AM
@kkawamura: the second error is very clear. I was asking about the first error message.
Created ‎11-25-2016 11:57 AM
Hi @mayki wogno
The first error message was produced by the same underlying error as the second one. The processor reported the error twice: ListHDFS logged an error message when it caught the exception, then re-threw it, and the NiFi framework caught the re-thrown exception and logged another error message.
When the NiFi framework catches an exception thrown by a processor, it yields the processor for the amount of time specified by its 'Yield Duration' setting (1 sec by default).
Once the processor can successfully access core-site.xml and hdfs-site.xml, both error messages will be cleared.
