NiFi + Registry Backup/Restoration

New Contributor

Hello,
My team is about to deploy a clustered, scalable NiFi, along with NiFi Registry and ZooKeeper. We are deploying on OpenShift Kubernetes.

We are determining which directories to mount our persistent storage to on both NiFi and the Registry.

1. Since we are using Registry, do we need to back anything up on the NiFi pod?

2. What are all of the locations where we would want to mount persistent storage for both NiFi and Registry?

3. What are all of the locations we would want to establish a backup procedure for on NiFi and Registry if we are using local filesystem persistence?

4. Do we still need to back up any locations if we use the Git and S3 persistence providers?

5. What does the restoration process look like for both:
A. local persistence providers only
B. Git + S3 persistence

6. Do we need to worry about Zookeeper at all for backups?

Thanks ahead of time!

2 ACCEPTED SOLUTIONS

Master Mentor

@TreantProtector 

There is a lot being asked in this one post.
1. NiFi Registry stores version-controlled NiFi process groups. (It takes manual user action both to start version control on a process group and to push new versions to NiFi-Registry.) It does not store the flow.xml.gz or flow.json.gz files that contain all the flow information NiFi loads on startup, so it is not a substitute for protecting those files on NiFi. All nodes in a NiFi cluster use the same flow.xml.gz/flow.json.gz, so it is not necessary to preserve the files from every node for recovery.

2a. NiFi

  • Apache NiFi stores the complete dataflow(s) on your canvas in the flow.xml.gz (legacy format) and flow.json.gz (current format). Preserving these files preserves all your dataflows on the canvas. (NOTE: all sensitive properties like passwords are encrypted in these files using the configured sensitive.props.key in NiFi, so make sure you save that password or you will need to scrub these files of all enc{...} values to load them. Removing those values would require you to re-enter all encrypted values in the NiFi components.)
  • Apache NiFi has a local state directory configured. This is unique to each node and stores state information for processors that store local state. It should be preserved to avoid data duplication.
  • Apache NiFi content_repository(s) - Holds active content claims (claims still used by actively queued FlowFiles within your dataflows) and archived content claims (archive subdirectories holding claims that are no longer referenced by any active FlowFiles in the UI). This repository is tightly coupled to the flowfile_repository. Content_repository(s) hold claims that are unique per node and need to be protected on all nodes to avoid data loss.
  • Apache NiFi flowfile_repository - Contains FlowFile metadata/attributes (including the reference to the content claim in the content_repository(s), along with byte offset and length). It is tightly coupled to the content_repository(s) on the same node, so make sure the same flowfile_repository is loaded with the corresponding content_repository(s) from the same node. This must be protected to avoid data loss.
  • Apache NiFi provenance_repository - Holds event data about FlowFile transactions and is unique per node. Loss of this repository is a loss of provenance history, but would not cause loss of any queued FlowFiles. It is typically also placed on protected storage.
  • Apache NiFi metadata_repository - Metadata about users who authenticated to NiFi and flow configuration history when using the embedded H2 DB. Not necessary to retain unless you want to preserve that historical information.
  • The NiFi extension directory contains any custom NiFi nars you have added to your NiFi. Copies of your custom nars should be preserved somewhere so they can be restored easily should it be needed.
  • Apache NiFi local authorization files like users.xml and authorizations.xml, which contain the users and their associated authorizations granted over time through the NiFi UI, should be preserved or you'll need to set those back up again in recovery (same on all nodes).
  • Node-specific local directories configured in your dataflows (dataflows built on the canvas). Some components allow you to configure local directories for persistent storage. If you are using these, they should be persisted. Example: DistributedMapCacheServer 1.25.0
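
As a rough illustration of the list above, here is a minimal Python sketch that pulls the candidate paths out of a nifi.properties file so they can be compared against your volume mounts. The property names are the ones I believe recent NiFi 1.x releases use (nifi.flow.configuration.json.file in particular is an assumption for newer versions), and the sample values are hypothetical, so verify them against your own file. The local state directory and the users.xml/authorizations.xml locations are configured in other conf files and are not covered here.

```python
# Property names assumed to point at data worth persisting.
# Verify against your own nifi.properties; names can differ between versions.
PERSISTENT_PREFIXES = (
    "nifi.flow.configuration.file",
    "nifi.flow.configuration.json.file",
    "nifi.flowfile.repository.directory",
    "nifi.content.repository.directory.",
    "nifi.provenance.repository.directory.",
    "nifi.database.directory",
)

# Hypothetical excerpt of a nifi.properties file, for illustration only.
SAMPLE = """
# hypothetical values
nifi.flow.configuration.file=./conf/flow.json.gz
nifi.flowfile.repository.directory=./flowfile_repository
nifi.content.repository.directory.default=/dir1/node1
nifi.content.repository.directory.repo2=/dir2/node1
nifi.provenance.repository.directory.default=./provenance_repository
nifi.database.directory=./database_repository
nifi.web.https.port=8443
"""


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def persistent_paths(props):
    """Return {property name: path} for properties matching the prefixes above."""
    return {
        key: value
        for key, value in sorted(props.items())
        if value and key.startswith(PERSISTENT_PREFIXES)
    }


if __name__ == "__main__":
    for name, path in persistent_paths(parse_properties(SAMPLE)).items():
        print(f"{name} -> {path}")
```

In practice you would read the real file (for example with `open("conf/nifi.properties").read()`) instead of the sample string.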

2b. NiFi-Registry

  • The NiFi-Registry database, which contains all information about version-controlled flows and buckets, should be protected unless you are using an external DB that you are protecting by other means. The default is an embedded H2 DB.
  • The NiFi-Registry extensions directory, if it is being used to store version-controlled extensions (jars).
  • The NiFi-Registry persistence provider stores the actual version-controlled NiFi process groups and is tightly coupled to the NiFi-Registry database. If using the external GitFlowPersistenceProvider, refer to Git for persistence requirements.
  • NiFi-Registry bundle persistence has local and S3 options; protected storage should be used if using local.
  • NiFi-Registry local authorization files like users.xml and authorizations.xml, which contain the users and their associated authorizations granted over time through the NiFi-Registry UI, should be preserved or you'll need to set those back up again in recovery.
  • Reference material: https://nifi.apache.org/docs/nifi-registry-docs/html/administration-guide.html#backup-recovery

3. Covered above. Refer to the Apache NiFi nifi.properties file for your configured local storage paths.

4. Yes, as covered above.

5a. Not sure I follow the question. On restoration, NiFi or NiFi-Registry will read from the configured persistence provider (whether local, Git, or S3). Preserving the NiFi and NiFi-Registry conf directory configuration files would make restoration easier. The NiFi content_repository(s) and flowfile_repository are tightly coupled to one another on the same node and tie back to the flow.xml.gz/flow.json.gz (same on all nodes), but which node they get restored to does not matter (no node-specific information is present in any of them).
NOTE: content repositories are directly correlated to the content repository property name in the nifi.properties file.
nifi.content.repository.directory.default=/dir1/node1
nifi.content.repository.directory.repo2=/dir2/node1
Upon restoration, the content_repository contents persisted for /dir1/node1 must still be set under "default" and not under a different property name. This is because the FlowFile metadata in the corresponding flowfile_repository does not contain directory details. It simply says you can find the content for FlowFile xyz in nifi.content.repository.directory.default at sub-directory (num), content claim, byte offset, and number of bytes. So if you put dir2 in the default content_repository, NiFi will not be able to find your content.
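
To illustrate the point about property names, here is a small Python sketch that compares the content repository entries of a backed-up nifi.properties with the restored one and reports any repository name whose path changed or disappeared. The function names are my own, and it assumes the restored volumes are mounted at the same paths as before, so a path mismatch signals a swapped or missing repository.

```python
CONTENT_PREFIX = "nifi.content.repository.directory."


def content_repos(properties_text):
    """Return {repository name: path} for every content repository property."""
    repos = {}
    for line in properties_text.splitlines():
        line = line.strip()
        if line.startswith(CONTENT_PREFIX) and "=" in line:
            key, path = line.split("=", 1)
            repos[key.strip()[len(CONTENT_PREFIX):]] = path.strip()
    return repos


def mapping_problems(backup_text, restored_text):
    """List repository names whose path differs between backup and restore."""
    before = content_repos(backup_text)
    after = content_repos(restored_text)
    problems = []
    for name, path in sorted(before.items()):
        if name not in after:
            problems.append(f"{name}: missing after restore (was {path})")
        elif after[name] != path:
            problems.append(f"{name}: was {path}, now {after[name]}")
    for name in sorted(set(after) - set(before)):
        problems.append(f"{name}: not present in backup (now {after[name]})")
    return problems


if __name__ == "__main__":
    backup = (
        "nifi.content.repository.directory.default=/dir1/node1\n"
        "nifi.content.repository.directory.repo2=/dir2/node1\n"
    )
    # The two paths are swapped here, which is the mistake described above.
    restored = (
        "nifi.content.repository.directory.default=/dir2/node1\n"
        "nifi.content.repository.directory.repo2=/dir1/node1\n"
    )
    for problem in mapping_problems(backup, restored):
        print(problem)
```

An empty list means every repository name still points at the path it had when the backup was taken.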

6. Zookeeper is used to store cluster state, which is used by a good number of NiFi processors. Refer to the individual processor documentation for state information: every processor's documentation has a "state management" section that tells you whether that processor stores state and whether that state is local or cluster. State is stored per component. Cluster state stored in Zookeeper is not node-specific, as all nodes running a component that uses cluster state share the same state information. Failing to protect against loss of state info typically leads to data duplication, but it all depends on how a given processor uses that state information.
Example: ListSFTP 1.25.0.

If you found any of the suggestions/solutions provided helped you with your issue, please take a moment to login and click "Accept as Solution" on one or more of them that helped.

Thank you,
Matt

Master Mentor

@TreantProtector 

Everything the user adds to the canvas, including controller services and reporting tasks, is auto-saved in the flow.json.gz. Each time a change is made, the current flow.json.gz is archived and a new flow.json.gz is generated. Within the flow.json.gz are all components (processors, connections, controller services, reporting tasks, funnels, process groups, ports, parameters, etc.) and their configurations. Any configuration property that is "sensitive" (such as a password) is encrypted in the flow.json.gz file. So in order to load that flow.json.gz in another NiFi, you would need to know the nifi.sensitive.props.algorithm and nifi.sensitive.props.key used by the original NiFi it came from.
Reference: Encrypted Passwords in Flows

If you don't have that info, the flow.json.gz can still be loaded on another NiFi after manually editing the file to remove all the "enc{...}" values.  Once flow.json.gz loads, an authorized user would need to re-enter all passwords in all components where it is needed via the NiFi UI.
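
For illustration only, the manual edit described above could be scripted roughly like this in Python. This is a sketch under assumptions: it treats the flow as gzip-compressed UTF-8 text and blanks every enc{...} token it finds; the function name is hypothetical, and whether NiFi accepts the resulting empty values should be tested on a copy of the file first.

```python
import gzip
import re

# Matches an encrypted value token such as enc{0a1b2c...}.
ENC_VALUE = re.compile(r"enc\{[^}]*\}")


def scrub_encrypted_values(flow_gz_bytes):
    """Blank every enc{...} token in a gzipped flow file.

    Returns (new gzipped bytes, number of values removed).
    """
    text = gzip.decompress(flow_gz_bytes).decode("utf-8")
    scrubbed, count = ENC_VALUE.subn("", text)
    return gzip.compress(scrubbed.encode("utf-8")), count


if __name__ == "__main__":
    # Hypothetical miniature flow content, not a real flow.json.gz.
    sample = gzip.compress(b'{"Password": "enc{0a1b2c}", "Username": "nifi"}')
    scrubbed, removed = scrub_encrypted_values(sample)
    print(removed, gzip.decompress(scrubbed).decode("utf-8"))
```

The returned count tells you how many sensitive values will need to be re-entered through the NiFi UI afterwards.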

Please help our community grow. If you found any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to login and click "Accept as Solution" on one or more of them that helped.

Thank you,
Matt


REPLIES

Community Manager

@TreantProtector Welcome to the Cloudera Community!

To help you get the best possible solution, I have tagged our NiFi experts @mburgess @MattWho  who may be able to assist you further.

Please keep us updated on your post, and we hope you find a satisfactory solution to your query.


Regards,

Diana Torres,
Community Moderator



New Contributor

Thank you so much @MattWho for your detailed response. If we are mostly only concerned with backing up and restoring the process groups/Registry data, what would be the bare minimum we would need to back up on the NiFi (not Registry) pod to restore operations with fresh containers?

I think you mentioned we would definitely want to back up flow.json.gz for this scenario, but I wanted to make sure.

Community Manager

@TreantProtector Has the reply helped resolve your issue? If so, please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future.  Thanks.


Regards,

Diana Torres,
Community Moderator

