Support Questions

pushkar1593 · ‎04-04-2017

Hi,

I want to understand the site-to-site architecture of NiFi: mainly the differences between the HTTP and RAW protocols, the architecture of the site-to-site client and the input/output ports, and how the NiFi REST API comes into the picture in site-to-site.

Could someone please point me to architecture documents or any other media that could clarify these things for me? If not, I will go through the code to understand these but I would really appreciate some pointers in that as well.

Thanks a lot.

MattWho · ‎04-04-2017

@Pushkara Ravindra

The intent of the Site-To-Site (S2S) protocol is to allow the exchange of NiFi FlowFiles between NiFi instances. A NiFi FlowFile consists of two parts:

1. FlowFile content <-- Original content in whatever format (NiFi is data agnostic and has no data format dependency)

2. FlowFile Attributes <-- Collection of key/value pairs (some are assigned by NiFi by default while others are added via processors)

Sending FlowFiles between NiFi instances allows the originating NiFi to share the attributes it knows about a FlowFile's content with the target NiFi instance. The FlowFile attributes are loaded into the FlowFile repository of the target NiFi automatically.
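To make the two-part structure concrete, here is a minimal sketch in Python. It is only an illustration of the idea, not NiFi's implementation: the `FlowFile` class, the `receive` function, and the `source.system` attribute are hypothetical names made up for this example.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class FlowFile:
    """Illustrative model only: opaque content plus key/value attributes."""
    content: bytes
    attributes: Dict[str, str] = field(default_factory=dict)


def receive(sent: FlowFile) -> FlowFile:
    # The target ends up with the same content and its own copy of
    # every attribute the source knew about.
    return FlowFile(content=sent.content, attributes=dict(sent.attributes))


original = FlowFile(
    content=b"id,name\n1,alice\n",
    attributes={"filename": "users.csv", "source.system": "crm"},
)
received = receive(original)
```

The point is that the attributes travel with the content, so the target does not need to rediscover them.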

In addition to the above, S2S automatically load-balances FlowFiles across the nodes of a target NiFi cluster. It also lets the target NiFi cluster scale up or down without the client needing to change anything.
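As a rough illustration of load-aware distribution, the sketch below gives less-loaded nodes a larger share of what is sent. The weighting formula and the `send_weights` name are assumptions chosen for the example; this is not the algorithm NiFi itself uses.

```python
from typing import Dict


def send_weights(queued_by_node: Dict[str, int]) -> Dict[str, float]:
    """Return the share of FlowFiles to send to each node.

    Assumed rule for this sketch: weight is 1 / (queued + 1), normalised
    so that all shares add up to 1.
    """
    raw = {node: 1.0 / (queued + 1) for node, queued in queued_by_node.items()}
    total = sum(raw.values())
    return {node: weight / total for node, weight in raw.items()}


# node2 is busy, so node1 gets most of the data.
two_nodes = send_weights({"node1": 0, "node2": 9})

# The target cluster grew by one node; the caller changes nothing except
# the load report it passes in.
three_nodes = send_weights({"node1": 0, "node2": 9, "node3": 0})
```

Because the shares are recomputed from whatever nodes are reported, adding or removing a node on the target side needs no change on the sending side.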

How it all works:

The source/client NiFi instance or cluster adds a Remote Process Group (RPG) to its canvas and configures it to point at the URL of any target/destination NiFi instance or cluster node. The communication at this point is over the HTTP protocol. Once a connection is established, the destination NiFi sends S2S details back to the source NiFi (this includes the URLs of the nodes, if the destination is a cluster, and the current load of each node). The RPG continuously updates this information and stores a local copy in case it cannot get an update at some point.
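The "keep a local copy" behaviour can be sketched as a small cache that falls back to the last known details when a refresh fails. Everything here is hypothetical (the `PeerCache` class, the shape of the peer records, the fake fetch function); it only shows the fallback idea, not how the RPG is actually written.

```python
from typing import Callable, Dict, List

Peer = Dict[str, object]


class PeerCache:
    """Remembers the last successfully fetched S2S details."""

    def __init__(self, fetch: Callable[[], List[Peer]]) -> None:
        self._fetch = fetch  # may raise OSError when the target is unreachable
        self._peers: List[Peer] = []

    def refresh(self) -> List[Peer]:
        try:
            self._peers = list(self._fetch())
        except OSError:
            # Could not get an update: keep using the stored copy.
            pass
        return self._peers


# Fake fetch function: succeeds once, then the target becomes unreachable.
calls = {"count": 0}


def fake_fetch() -> List[Peer]:
    calls["count"] += 1
    if calls["count"] > 1:
        raise ConnectionError("target unreachable")
    return [{"host": "node1.example.com", "queued": 3},
            {"host": "node2.example.com", "queued": 7}]


cache = PeerCache(fake_fetch)
first = cache.refresh()   # fresh details from the target
second = cache.refresh()  # fetch fails, stored copy is returned
```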

Input and output ports move FlowFiles between the process group they were added to and its parent process group. So when input or output ports are added at the root canvas level of a dataflow, they become "remote" input and output ports capable of receiving data from, or sending data to, another NiFi.

Whether you set the S2S protocol to HTTP or RAW, the above is true. What differs is what happens next (the actual FlowFile transfer).

When using the RAW protocol (socket-based transfer), the "nifi.remote.input.host" and "nifi.remote.input.socket.port" values configured on each of the target NiFi instances are used by the NiFi client as the destination for sending FlowFiles.

When using the HTTP protocol, the "nifi.remote.input.host" and the "nifi.web.http.port" or "nifi.web.https.port" values configured on each of the target NiFi instances are used by the NiFi client as the destination for sending FlowFiles.
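The difference between the two protocols comes down to which properties supply the destination. The sketch below reads a nifi.properties-style snippet and picks the host and port for each protocol. The property names are the ones mentioned above; the host name, port numbers, helper function names, and the choice to prefer the HTTPS port when it is set are assumptions made for this example.

```python
from typing import Dict, Tuple

# Hypothetical values for one target node.
SAMPLE_PROPERTIES = """
nifi.remote.input.host=nifi-node1.example.com
nifi.remote.input.socket.port=10443
nifi.web.http.port=8080
nifi.web.https.port=
"""


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple key=value lines, skipping blanks and comments."""
    props: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def s2s_destination(props: Dict[str, str], protocol: str) -> Tuple[str, int]:
    """Return the (host, port) a client would send FlowFiles to."""
    host = props["nifi.remote.input.host"]
    if protocol == "RAW":
        port = props.get("nifi.remote.input.socket.port")
    elif protocol == "HTTP":
        # Assumption: use the HTTPS port when set, otherwise the HTTP port.
        port = props.get("nifi.web.https.port") or props.get("nifi.web.http.port")
    else:
        raise ValueError(f"unknown S2S protocol: {protocol}")
    if not port:
        raise KeyError(f"no port configured for protocol {protocol}")
    return host, int(port)


props = parse_properties(SAMPLE_PROPERTIES)
raw_destination = s2s_destination(props, "RAW")
http_destination = s2s_destination(props, "HTTP")
```

With the sample values, RAW goes to the dedicated socket port while HTTP reuses the web port, which is the trade-off described next.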

The advantage of the RAW protocol is that there is a dedicated port for all S2S transfers, so under high load the effect on the NiFi HTTP interface is minimal.

The advantage of HTTP is that you do not need to open an additional S2S port, since the same HTTP/HTTPS port is used to transfer FlowFiles.

Thanks,

Matt
