Support Questions: Incremental fetch in a stateless manner

nk20 · ‎08-04-2022

Hi,

I am new to NiFi and stuck on two issues that I need help with:

The task is to migrate a set of tables from a source database to a target database. The data also has to be filtered on a field before it is migrated. The pipeline has to migrate historical data as well as perform the incremental fetch.

1. For incremental fetch, I used the GenerateTableFetch processor and set the max-value column to a date field in the table. I also used the partitioning feature to fetch the data in chunks by setting the partition size and column. It all works well. But since it is a stateful processor, I have received feedback to go stateless, because we could lose the state if a node crashes. How can I achieve the incremental fetch and partitioning in a stateless manner?

2. The tables to be migrated have a parent-child relationship. Incremental fetch is required for all of the tables. However, the child tables don't have any column that can be used to fetch the delta, so they have to rely on the parent table's lastUpdatedTime field. Although this also has to be done in a stateless manner, I did try the QueryDatabaseTable processor, setting a join query between child and parent in the 'Custom Query' field and the parent's lastUpdatedTime as the max-value column. But that didn't work.
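
For illustration, the join-based delta in point 2 can be sketched outside NiFi as follows. This is a minimal sketch using Python's built-in sqlite3 module, not a NiFi configuration. The table names `parent` and `child`, the `parent_id` key, the `payload` column, and the sample rows are hypothetical; only `lastUpdatedTime` comes from the description above. The caller passes the watermark in, so the query itself keeps no state.

```python
import sqlite3

def fetch_child_delta(conn, since):
    """Return child rows whose parent row changed after `since` (ISO-8601 text)."""
    sql = """
        SELECT c.id, c.parent_id, c.payload, p.lastUpdatedTime
        FROM child AS c
        JOIN parent AS p ON p.id = c.parent_id
        WHERE p.lastUpdatedTime > ?
        ORDER BY p.lastUpdatedTime, c.id
    """
    return conn.execute(sql, (since,)).fetchall()

# Hypothetical schema and sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE parent (id INTEGER PRIMARY KEY, lastUpdatedTime TEXT);
    CREATE TABLE child (id INTEGER PRIMARY KEY, parent_id INTEGER, payload TEXT);
    INSERT INTO parent VALUES (1, '2022-08-01T00:00:00'), (2, '2022-08-04T12:00:00');
    INSERT INTO child VALUES (10, 1, 'old'), (20, 2, 'new-a'), (21, 2, 'new-b');
""")

previous_watermark = "2022-08-03T00:00:00"
delta = fetch_child_delta(conn, previous_watermark)

# The next run would start from the newest parent timestamp seen in this run.
new_watermark = max((row[3] for row in delta), default=previous_watermark)
```

The watermark still has to be kept somewhere between runs (for example in a control table in the target database), so this moves the state out of the processor rather than removing it.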

Can someone please help with how to achieve both the features in a stateless manner?

Thanks. I appreciate your help.

MattWho · ‎08-05-2022

@nk20

If you are running a standalone NiFi, state is stored via the configured local state provider. If the node crashes, you don't lose that state; NiFi loads the local state when it is restarted. The only way you would lose state is if the server were unrecoverable (but then you have also lost your currently queued data, your entire flow, etc.). You can and certainly should have your NiFi repos, state directory, and conf directory located on RAID disks to protect against loss in the event of a disk failure. A better option is to set up a NiFi cluster. Processors like GenerateTableFetch will then use cluster state, which is stored in ZooKeeper (ZK). I recommend setting up an external 3-node ZK cluster rather than using NiFi's embedded ZK.
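
As a rough analogy for why state persisted to disk survives a crash and restart, here is a minimal Python sketch. It is not how NiFi's local state provider is implemented (NiFi uses its own on-disk format); the file name `processor-state.json` and the key `maxvalue.lastUpdatedTime` are made up for illustration.

```python
import json
import os
import tempfile

def save_state(path, state):
    """Write state atomically: temp file in the same directory, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())  # make sure the bytes reach the disk
    os.replace(tmp_path, path)     # readers never see a half-written file

def load_state(path):
    """Load state after a (re)start; an absent file means no state yet."""
    try:
        with open(path) as handle:
            return json.load(handle)
    except FileNotFoundError:
        return {}

# Hypothetical state directory and key, for illustration only.
state_dir = tempfile.mkdtemp()
state_file = os.path.join(state_dir, "processor-state.json")
save_state(state_file, {"maxvalue.lastUpdatedTime": "2022-08-04 10:15:00"})

# Simulated restart: nothing is held in memory, the value comes back from disk.
restored = load_state(state_file)
```

The state is lost only if the directory itself is lost, which is why the disk holding it should be protected.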

There are many advantages to using a NiFi cluster rather than a standalone single NiFi instance beyond just having state stored in ZK.
1. Distributed processing across multiple servers
2. Externally stored cluster state
3. Avoid complete flow outage in event of a node failure.
4. All nodes execute exact same flow and thus each have a copy of it.

In a NiFi cluster you would start your dataflow with your GenerateTableFetch processor configured to execute on "Primary node" only. Within a NiFi cluster, one node is elected to be the primary node. The success relationship connection would then be configured to load balance the generated FlowFiles containing your SQL statements. This allows all nodes in your cluster to concurrently execute those SQL statements in your downstream processors, which are configured to execute on all nodes.

If the currently elected primary node should crash, a new primary node will be elected. When that happens, the processor configured for "Primary node" only execution will retrieve the last state written to ZK and pick up processing where the old node left off.
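
The behavior described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not NiFi's code: the SQL that GenerateTableFetch actually emits depends on the database adapter, the dict only stands in for state held in ZK, and the table name `orders`, the row counts, and the function names are hypothetical.

```python
from typing import Dict, List, Optional

def generate_fetch_statements(table: str, max_col: str, last_max: Optional[str],
                              current_max: str, row_count: int,
                              partition_size: int) -> List[str]:
    """Build paged SELECTs for rows with last_max < max_col <= current_max."""
    # String building is for illustration only; it is not safe for untrusted input.
    where = f"{max_col} <= '{current_max}'"
    if last_max is not None:
        where = f"{max_col} > '{last_max}' AND {where}"
    return [
        f"SELECT * FROM {table} WHERE {where} "
        f"ORDER BY {max_col} LIMIT {partition_size} OFFSET {offset}"
        for offset in range(0, row_count, partition_size)
    ]

def run_on_primary(state: Dict[str, str], table: str, max_col: str,
                   current_max: str, row_count: int,
                   partition_size: int) -> List[str]:
    """One scheduled run: read the stored max value, emit pages, store the new max."""
    statements = generate_fetch_statements(
        table, max_col, state.get(table), current_max, row_count, partition_size)
    state[table] = current_max
    return statements

# Stand-in for externally stored cluster state.
external_state: Dict[str, str] = {}

# First run on the original primary node: 250 matching rows, pages of 100.
first = run_on_primary(external_state, "orders", "lastUpdatedTime",
                       "2022-08-04", 250, 100)

# After a simulated failover, the new primary reads the same external state.
second = run_on_primary(external_state, "orders", "lastUpdatedTime",
                        "2022-08-05", 40, 100)
```

Each generated statement corresponds to one FlowFile that could be load balanced to any node, while only the stored max value has to outlive the node that produced it.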

Off the top of my head, nothing comes to mind in terms of being able to solve your use case in a stateless manner. However, maybe others in the community have some thoughts here.

Thank you,

Matt

nk20 · ‎08-05-2022

Thanks, Matt, for your response. We will have a clustered setup, and I have implemented exactly what you described. But concerns are still being raised about losing the state. One of the examples given is the case where the state is stored in memory. Again, I am new to NiFi and I am not sure whether such a configuration is even possible. I would still like to find out whether a stateless implementation is possible. Thanks again, Matt, for the help. I would appreciate any further ideas from the community.

MattWho · ‎08-08-2022

@nk20
I am confused by your concern about in-memory state. Can you provide more detail about what you are being told or what you have read that has led to this concern? Perhaps those concerns are about something more than component state? If so, perhaps I can address those specific concerns. Not all NiFi components retain state. Those that do either persist that state to disk in a local state directory or write that state to ZooKeeper.
As long as the local disk where the state directory is persisted is not lost and ZooKeeper has quorum (min three nodes), your state is protected for the NiFi components that write state. Out of all the components (processors, controller services, reporting tasks, etc.), there are only about 25 that record state.

The only thing that lives only in memory is component status (in, out, read, write, sent, received). These are 5-minute stats, so any restart of the NiFi service sets them back to 0. They have nothing to do with the FlowFiles or the execution of the processor.

Thank you,

Matt
