Support Questions

Find answers, ask questions, and share your expertise

NiFi 1.5.0 Site-To-Site communications failure prevents NiFi from starting

avatar
Explorer

Hi everyone,

We are attempting to set up a NiFi (v1.5.0) cluster on Amazon ECS, with a load balancer and DNS entry pointing to the UIs on every node. This works from a UI perspective, and we are able to get site-to-site working between cluster nodes using a Remote Process Group. However, when the cluster is coming up fresh, we have this issue:

When we start NiFi, it attempts to refresh the status of the RPG, which we ping via the load balancer. This fails with a 502 error, because the UI is not yet up behind the load balancer. However, instead of continuing to bring the UI up, the service receives a SHUTDOWN signal from bootstrap and restarts. This causes the service to loop through its startup process indefinitely. HOWEVER - sometimes, eventually, magically, the UI will actually start. We believe that there is a race condition between a thread starting the UI and a thread starting the site-to-site components. When the UI thread finishes first, NiFi goes live. Otherwise, it loops. This loop can go on anywhere from 10 minutes to several hours.

My question is: is this a bug or an intended function? Are there any known workarounds? I've been unable to find documentation or forum posts on this issue.

Here is a stack trace from the nifi-app.log file during a failed start. As you can see, it throws the 502 error, then begins restarting the whole service without an indication as to exactly why this is happening. Ideal behavior here would be that this error is non-blocking, and that the UI would still start successfully. A bulletin or warning should then appear informing the user that the remote process group is unreachable.

Jun  6 20:59:12 nifi-cluster--01f8453d323eeea25 nifi-app: 2018-06-06 20:59:06,143 WARN [Remote Process Group 47a43fe6-2c9b-39aa-4fae-5851f048d58b Thread-1] o.a.n.r.util.SiteToSiteRestApiClient Failed to get controller from http://<URL>:8080/nifi-api due to org.apache.nifi.remote.util.SiteToSiteRestApiClient$HttpGetFailedException: response code 502:Bad Gateway with explanation: null
Jun  6 20:59:12 nifi-cluster--01f8453d323eeea25 nifi-app: 2018-06-06 20:59:06,143 DEBUG [Remote Process Group 47a43fe6-2c9b-39aa-4fae-5851f048d58b Thread-1] o.a.n.r.util.SiteToSiteRestApiClient 
Jun  6 20:59:12 nifi-cluster--01f8453d323eeea25 nifi-app: org.apache.nifi.remote.util.SiteToSiteRestApiClient$HttpGetFailedException: response code 502:Bad Gateway with explanation: null
Jun  6 20:59:12 nifi-cluster--01f8453d323eeea25 nifi-app: 	at org.apache.nifi.remote.util.SiteToSiteRestApiClient.execute(SiteToSiteRestApiClient.java:1203)
Jun  6 20:59:12 nifi-cluster--01f8453d323eeea25 nifi-app: 	at org.apache.nifi.remote.util.SiteToSiteRestApiClient.execute(SiteToSiteRestApiClient.java:1237)
Jun  6 20:59:12 nifi-cluster--01f8453d323eeea25 nifi-app: 	at org.apache.nifi.remote.util.SiteToSiteRestApiClient.fetchController(SiteToSiteRestApiClient.java:419)
Jun  6 20:59:12 nifi-cluster--01f8453d323eeea25 nifi-app: 	at org.apache.nifi.remote.util.SiteToSiteRestApiClient.getController(SiteToSiteRestApiClient.java:394)
Jun  6 20:59:12 nifi-cluster--01f8453d323eeea25 nifi-app: 	at org.apache.nifi.remote.util.SiteToSiteRestApiClient.getController(SiteToSiteRestApiClient.java:361)
Jun  6 20:59:12 nifi-cluster--01f8453d323eeea25 nifi-app: 	at org.apache.nifi.remote.util.SiteToSiteRestApiClient.getController(SiteToSiteRestApiClient.java:346)
Jun  6 20:59:12 nifi-cluster--01f8453d323eeea25 nifi-app: 	at org.apache.nifi.remote.StandardRemoteProcessGroup$InitializationTask.run(StandardRemoteProcessGroup.java:1177)
Jun  6 20:59:12 nifi-cluster--01f8453d323eeea25 nifi-app: 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
Jun  6 20:59:12 nifi-cluster--01f8453d323eeea25 nifi-app: 	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
Jun  6 20:59:12 nifi-cluster--01f8453d323eeea25 nifi-app: 	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
Jun  6 20:59:12 nifi-cluster--01f8453d323eeea25 nifi-app: 	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
Jun  6 20:59:12 nifi-cluster--01f8453d323eeea25 nifi-app: 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
Jun  6 20:59:12 nifi-cluster--01f8453d323eeea25 nifi-app: 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
Jun  6 20:59:12 nifi-cluster--01f8453d323eeea25 nifi-app: 	at java.lang.Thread.run(Thread.java:748)
Jun  6 20:59:12 nifi-cluster--01f8453d323eeea25 nifi-app: 2018-06-06 20:59:06,143 WARN [Remote Process Group 47a43fe6-2c9b-39aa-4fae-5851f048d58b Thread-1] o.a.n.remote.StandardRemoteProcessGroup org.apache.nifi.remote.StandardRemoteProcessGroup$InitializationTask@42a30cb When communicating with remote instance, got unexpected result. Bad Gateway
Jun  6 20:59:12 nifi-cluster--01f8453d323eeea25 nifi-app: 2018-06-06 20:59:07,185 INFO [main] o.apache.nifi.controller.FlowController Successfully synchronized controller with proposed flow
Jun  6 20:59:12 nifi-cluster--01f8453d323eeea25 nifi-app: 2018-06-06 20:59:07,256 DEBUG [main] o.a.nifi.controller.StandardFlowService First controller initialization, initializing controller...
Jun  6 20:59:12 nifi-cluster--01f8453d323eeea25 nifi-app: 2018-06-06 20:59:07,949 DEBUG [Listen to Bootstrap] org.apache.nifi.BootstrapListener Listening for Bootstrap Requests
Jun  6 20:59:12 nifi-cluster--01f8453d323eeea25 nifi-app: 2018-06-06 20:59:09,951 DEBUG [Listen to Bootstrap] org.apache.nifi.BootstrapListener Listening for Bootstrap Requests
Jun  6 20:59:12 nifi-cluster--01f8453d323eeea25 nifi-app: 2018-06-06 20:59:11,953 DEBUG [Listen to Bootstrap] org.apache.nifi.BootstrapListener Listening for Bootstrap Requests
Jun  6 20:59:22 nifi-cluster--01f8453d323eeea25 nifi-app: 2018-06-06 20:59:15,697 INFO [main] org.apache.nifi.NiFi Launching NiFi...

Other thoughts:

- There is no obvious difference in logs between a failed launch and the eventual success. Even in the eventual success, the 502 error appears. The success log just keeps going along with normal startup logs instead of starting from the beginning again.

- Having nifi.flowcontroller.autoResumeState=false does not solve the issue

- The issue is the same for both service restarts and whole node reboots.

1 ACCEPTED SOLUTION

avatar
Explorer

Apologies to anyone reading this; we discovered the error was actually due to another process on our system interfering with Java operations. NiFi was not at fault.

View solution in original post

6 REPLIES 6

avatar

Hi @Sam Rubin,

I see the WARN message regularly when NiFi restarts if target remote NiFi process group is not running, but haven't encountered start/shutdown loop because of it. The [RemoteProcessGroup47a43fe6-2c9b-39aa-4fae-5851f048d58bThread-1] that logged the WARN message is a background thread and periodically (per 60 secs) checks target RPG status, and does not affect the main thread.

Is it possible for you to share entire nifi-app.log and nifi-bootstrap.log? I suspect there is other issues preventing NiFi to startup correctly.

avatar
Explorer

Hi @kkawamura, thanks for your response. Attached are snippets of the nifi logs. Let me know if you see anything amiss.

nifi-app-snippet.txt

nifi-bootstrap-snippet.txt

avatar

Thanks @Sam Rubin for the log.

I couldn't find anything unusual in the logs.

By looking at NiFi source code, the "Failed to send shutdown command to port ... Will kill ..." message is only logged by calling RunNiFi.stop() method. And the method is only called from RunNiFi.main() which is called by nifi.sh.

Would you run the nifi.sh with debug option? Such as:

sh -bx ./bin/nifi.sh start

Please share std output of above command. Then we can look how the nifi.sh works in detail.

Just in case, did you modify nifi.sh?

avatar
Explorer

Hi @kkawamura,

Sorry for the delay, here are the logs as requested. We haven't made any changes to nifi.sh or any of the startup logic; our only changes are to conf files.

nifi-bx-stdout.txt

avatar

Hi @Sam Rubin

I took a look on the shell output and found -Xms and -Xmx are being configured with quite small amount:

++ /bin/java -cp '/apps/nifi/nifi-1.5.0/conf:/apps/nifi/nifi-1.5.0/lib/bootstrap/*' -Xms12m -Xmx24m -Dorg.apache.nifi.bootstrap.config.log.dir=/apps/nifi/nifi-1.5.0/logs -Dorg.apache.nifi.bootstrap.config
.pid.dir=/apps/nifi/nifi-1.5.0/run -Dorg.apache.nifi.bootstrap.config.file=/apps/nifi/nifi-1.5.0/conf/bootstrap.conf org.apache.nifi.bootstrap.RunNiFi start

It may not be the direct cause of having start/shutdown loop, but NiFi wouldn't run with 24m heap size. Is this configuration intentional?? Please check bootstrap.conf and increase -Xms and -Xmx to 512m at least.

avatar
Explorer

Apologies to anyone reading this; we discovered the error was actually due to another process on our system interfering with Java operations. NiFi was not at fault.