Support Questions
Find answers, ask questions, and share your expertise

NiFi controller service in an undefined status

Explorer

Hello,
after an operation in the NiFi GUI (changing a parameter value in a parameter context), the GUI got stuck.
A controller service shows a disabled status in the GUI; however, I cannot enable or delete the controller service, because NiFi claims it is still running on a node.
Do you have any idea how to get out of this situation?
Best regards
Jaro

1 ACCEPTED SOLUTION

Accepted Solutions

Master Guru

@Jarinek 

 

As part of changing the value of a parameter, all components configured to use that parameter must be stopped and restarted. Some components, such as controller services, have dependent components such as processors, so to stop such a controller service, the dependent processors must be stopped first.

When NiFi requests that a component stop, the component transitions into an intermediate "stopping" state in which NiFi asks the component's thread to exit. The UI may show the component as stopped or disabled even though it is still in this transition, but that does not guarantee the component will ever complete the transition from "stopping" to stopped. Some client libraries used by components, which were not written by the NiFi developers, do not support interrupting running threads. In such a scenario, the "stopping" task NiFi kicked off stays active against that component until the active processor thread finally exits (if it ever does). Whether that thread ever completes is outside NiFi's core control; restarting NiFi is the only way to kill threads that do not exit gracefully.

Things you can try here:
1. When you try to enable or delete the controller service, I'd expect the exception to report the component UUID and the host where NiFi claims it is running. You could go to the cluster UI and manually "disconnect" the node where it is still running, then open a web browser to that specific, now disconnected, node's UI. Use the UUID to locate the component and check its status. Try disabling it; if that does not work, try restarting that node only. If disabling the component worked, go back to the UI of another node still in the cluster and, via the cluster UI, request that the disconnected node reconnect. While the exception might point at one host as the issue, more than one host may be affected; the UI simply returns the first exception.
2. If the above is not successful, have you tried restarting the entire cluster?
3. You could also inspect the flow.xml.gz on each host to make sure they match. Component state (started, stopped, enabled, disabled) is retained in the flow.xml.gz but is not included in the flow fingerprint used when connecting a node to the cluster, since a connecting node is supposed to inherit state from the cluster. Check each flow.xml.gz for the component UUID and see what state is currently recorded for it. If one or more nodes record a state of enabled/running while other nodes record a different state, simply copy a flow.xml.gz with the desired state over the ones with the undesired state. Make sure the flow.xml.gz ownership and permissions are correct, and restart your NiFi cluster.
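If it helps, the per-node flow.xml.gz check in step 3 can be sketched in shell. The UUID and paths below are placeholders, not values from this thread, and the demo fabricates a tiny flow.xml.gz so the commands run as-is; on a real cluster you would point zcat at each node's actual conf/flow.xml.gz instead.

```shell
# Sketch: check the recorded state of a component in a node's flow.xml.gz.
# COMPONENT_UUID and all paths are hypothetical -- substitute your own values.
WORKDIR="$(mktemp -d)"
COMPONENT_UUID="0184a1b2-c3d4-e5f6-0000-feedfacecafe"   # placeholder UUID

# Fabricate a minimal flow.xml.gz standing in for one node's copy.
cat > "$WORKDIR/flow.xml" <<EOF
<controllerService>
  <id>$COMPONENT_UUID</id>
  <state>ENABLED</state>
</controllerService>
EOF
gzip "$WORKDIR/flow.xml"

# The actual check: find the UUID and print the <state> element near it.
# On a real node this would be something like:
#   zcat /opt/nifi/conf/flow.xml.gz | grep -A 2 "$COMPONENT_UUID"
zcat "$WORKDIR/flow.xml.gz" | grep -A 2 "$COMPONENT_UUID" | grep "<state>"
```

Running the same check on every node and comparing the printed states shows quickly whether the cluster's copies disagree.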

As for why a component never transitioned from one state to another, answering that would require examining a series of thread dumps spaced at least 5 minutes apart. That lets you identify an active thread making no progress over the course of those dumps. You would then dig into that thread to see what it is waiting on or blocked by. Sometimes there are limitations within client libraries that NiFi has no control over.
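A rough way to collect those spaced thread dumps is sketched below. The pgrep pattern, dump count, and interval are assumptions to adjust for your install; NiFi also ships its own `bin/nifi.sh dump <file>` command as an alternative to jstack.

```shell
# Sketch: capture several thread dumps ~5 minutes apart for later comparison.
# The pgrep pattern and the count/interval are assumptions -- adjust as needed.
NIFI_PID="$(pgrep -f org.apache.nifi.NiFi | head -n 1)"
DUMP_DIR="${TMPDIR:-/tmp}/nifi-thread-dumps"
mkdir -p "$DUMP_DIR"

if [ -n "$NIFI_PID" ]; then
  for i in 1 2 3 4; do
    # jstack (part of the JDK) prints every Java thread's stack for the PID.
    jstack "$NIFI_PID" > "$DUMP_DIR/dump-$i-$(date +%H%M%S).txt"
    sleep 300   # wait 5 minutes between dumps
  done
else
  echo "No NiFi process found; nothing to dump."
fi
# Afterwards, compare the dump files: a thread parked in the exact same
# stack frame in every dump is the one making no progress.
```

Searching the dumps for the stuck component's class name usually narrows the culprit thread down quickly.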

If this response helped with your query, please take a moment to log in and click "Accept as Solution" below this post.

Thank you,

Matt




3 REPLIES

Cloudera Employee

Hello @Jarinek 

Are all the NiFi nodes in a running state? Have you tried clearing the browser cache and cookies and retrying?


Community Manager

@Jarinek, Has the reply helped resolve your issue? If so, please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future. 



Regards,

Vidya Sargur,
Community Manager

