Created 08-04-2020 03:44 PM
Hi, for some reason my one-node cluster always writes the following error message to the log file about a dozen times (which takes a while) before everything returns to normal and allows connections via the GUI. I would appreciate any help.
2020-08-04 18:28:14,296 INFO [main] o.a.nifi.controller.StandardFlowService Connecting Node: ip-172-31-33-183:8080
2020-08-04 18:28:19,452 WARN [main] o.a.nifi.controller.StandardFlowService Failed to connect to cluster due to: org.apache.nifi.cluster.protocol.ProtocolException: Failed unmarshalling 'CONNECTION_RESPONSE' protocol message from localhost/127.0.0.1:11443 due to: java.net.SocketTimeoutException: Read timed out
2020-08-04 18:28:29,578 WARN [main] o.a.nifi.controller.StandardFlowService Failed to connect to cluster due to: org.apache.nifi.cluster.protocol.ProtocolException: Failed unmarshalling 'CONNECTION_RESPONSE' protocol message from localhost/127.0.0.1:11443 due to: java.net.SocketTimeoutException: Read timed out
2020-08-04 18:28:39,661 WARN [main] o.a.nifi.controller.StandardFlowService Failed to connect to cluster due to: org.apache.nifi.cluster.protocol.ProtocolException: Failed unmarshalling 'CONNECTION_RESPONSE' protocol message from localhost/127.0.0.1:11443 due to: java.net.SocketTimeoutException: Read timed out
2020-08-04 18:28:49,785 WARN [main] o.a.nifi.controller.StandardFlowService Failed to connect to cluster due to: org.apache.nifi.cluster.protocol.ProtocolException: Failed unmarshalling 'CONNECTION_RESPONSE' protocol message from localhost/127.0.0.1:11443 due to: java.net.SocketTimeoutException: Read timed out
2020-08-04 18:28:50,682 WARN [Process Cluster Protocol Request-2] o.a.n.c.p.impl.SocketProtocolListener Failed processing protocol message from localhost due to org.apache.nifi.cluster.protocol.ProtocolException: Failed marshalling protocol message in response to message type: CONNECTION_REQUEST due to java.net.SocketException: Broken pipe (Write failed)
org.apache.nifi.cluster.protocol.ProtocolException: Failed marshalling protocol message in response to message type: CONNECTION_REQUEST due to java.net.SocketException: Broken pipe (Write failed)
at org.apache.nifi.cluster.protocol.impl.SocketProtocolListener.dispatchRequest(SocketProtocolListener.java:184)
at org.apache.nifi.io.socket.SocketListener$2$1.run(SocketListener.java:136)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.SocketException: Broken pipe (Write failed)
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
at sun.security.ssl.OutputRecord.writeBuffer(OutputRecord.java:431)
at sun.security.ssl.OutputRecord.write(OutputRecord.java:417)
at sun.security.ssl.SSLSocketImpl.writeRecordInternal(SSLSocketImpl.java:879)
at sun.security.ssl.SSLSocketImpl.writeRecord(SSLSocketImpl.java:850)
at sun.security.ssl.AppOutputStream.write(AppOutputStream.java:123)
at sun.security.ssl.AppOutputStream.write(AppOutputStream.java:138)
at java.io.DataOutputStream.writeInt(DataOutputStream.java:197)
at org.apache.nifi.cluster.protocol.jaxb.JaxbProtocolContext$1.marshal(JaxbProtocolContext.java:83)
at org.apache.nifi.cluster.protocol.impl.SocketProtocolListener.dispatchRequest(SocketProtocolListener.java:182)
... 4 common frames omitted
Created 08-25-2020 04:41 AM
I have a multi-node cluster and I see this message as well. When this happens, the affected node becomes disconnected from the cluster. At that point, you cannot make any changes to the flows.
After some searching, I came across these properties:
nifi.cluster.node.connection.timeout=10 sec
nifi.cluster.node.read.timeout=10 sec
I believe 10 sec is pretty generous. Is there something else that needs to be tweaked?
Created 08-25-2020 05:52 AM
Hello @Love-Nifi and @vchhipa ,
Thank you for posting your inquiry about timeouts. Without the full log, I can provide only some "if you see this, do that" kind of instructions.
If you see an ERROR message with:
org.apache.nifi.controller.UninheritableFlowException: Failed to connect node to cluster because local flow is different than cluster flow
then follow the steps below to resolve the issue:
1. Go to NiFi UI > Global Menu > Cluster.
2. Check which host is the coordinator and log in to that host on the shell.
3. Go to the flow.xml.gz file location (the default location is /var/lib/nifi/conf/).
4. Copy the coordinator's flow.xml.gz to the disconnected node and replace the original flow.xml.gz there with the copied file.
5. Check the permissions and ownership of the newly copied flow.xml.gz file, then restart NiFi on the disconnected node only.
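For steps 4 and 5, here is a minimal Python sketch of the replacement on the disconnected node. It assumes the coordinator's flow.xml.gz has already been transferred to that node by some other means and that NiFi is stopped there. The function name and both paths are hypothetical, so adjust them to your installation. It keeps a timestamped backup of the old flow and restores the old permission bits, but it does not touch ownership, which you still need to check by hand.

```python
import shutil
import time
from pathlib import Path


def replace_flow(coordinator_copy: Path, local_flow: Path) -> Path:
    """Back up local_flow, then overwrite it with coordinator_copy.

    Returns the path of the backup file.
    """
    stamp = time.strftime("%Y%m%d%H%M%S")
    backup = local_flow.with_name(f"{local_flow.name}.{stamp}.bak")
    shutil.copy2(local_flow, backup)            # keep the old flow, just in case
    shutil.copy2(coordinator_copy, local_flow)  # replace it with the coordinator's flow
    shutil.copymode(backup, local_flow)         # restore the original permission bits
    return backup


if __name__ == "__main__":
    # Hypothetical paths: the first is wherever you placed the coordinator's copy.
    replace_flow(Path("/tmp/flow.xml.gz"), Path("/var/lib/nifi/conf/flow.xml.gz"))
```

After running it, verify ownership of the new file (it should match what the NiFi service user expects) before restarting NiFi on that node.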
If you suspect a pure timeout issue, try adjusting the values below in nifi.properties and restart the service:
- nifi.cluster.node.protocol.threads=50 (Default 10)
- nifi.cluster.node.connection.timeout=30 sec (Default 5 sec)
- nifi.cluster.node.read.timeout=30 sec (Default 5 sec)
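If you prefer to script the change rather than edit the file by hand, something like the following Python sketch could apply the three values above. The helper name `set_properties` and the `SUGGESTED` dictionary are made up for illustration. It assumes simple `key=value` lines and does not handle line continuations or other properties-file edge cases.

```python
def set_properties(text, updates):
    """Return text with each key in updates set to its new value.

    Existing 'key=value' lines are rewritten in place; keys that are not
    present are appended at the end. Comments and blank lines are kept.
    """
    remaining = dict(updates)
    out = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#") and "=" in stripped:
            key = stripped.split("=", 1)[0].strip()
            if key in updates:
                out.append(f"{key}={updates[key]}")
                remaining.pop(key, None)
                continue
        out.append(line)
    # Append any keys that were not already present in the file
    out.extend(f"{k}={v}" for k, v in remaining.items())
    return "\n".join(out) + "\n"


# Values taken from the suggestion above
SUGGESTED = {
    "nifi.cluster.node.protocol.threads": "50",
    "nifi.cluster.node.connection.timeout": "30 sec",
    "nifi.cluster.node.read.timeout": "30 sec",
}
```

Read nifi.properties into a string, pass it through `set_properties(text, SUGGESTED)`, write the result back (after taking a backup), and restart the service.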
Please find below a set of configurations that are worth tuning on larger clusters, based on https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html
The following are some example values for larger clusters (you need to tune them based on your unique setup):
nifi.cluster.node.protocol.threads=70
nifi.cluster.node.protocol.max.threads=100
nifi.zookeeper.session.timeout=30 sec
nifi.zookeeper.connect.timeout=30 sec
nifi.cluster.node.connection.timeout=60 sec
nifi.cluster.node.read.timeout=60 sec
nifi.ui.autorefresh.interval=900 sec
nifi.cluster.protocol.heartbeat.interval=20 sec
nifi.components.status.repository.buffer.size=300
nifi.components.status.snapshot.frequency=5 mins
nifi.cluster.node.protocol.max.threads=120
nifi.cluster.node.protocol.threads=80
nifi.cluster.node.read.timeout=90 sec
nifi.cluster.node.connection.timeout=90 sec
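Note that the example list above assigns several keys more than once (for example nifi.cluster.node.protocol.threads as both 70 and 80). Assuming the file is read with the usual Java properties semantics, where the last assignment of a key wins, only the later values would take effect if the list were pasted in as-is. A small Python sketch (the function name is hypothetical) that reports which keys are repeated and which value would be effective under that assumption:

```python
def effective_properties(text):
    """Return (effective, duplicates) for simple key=value text.

    effective maps each key to the last value seen; duplicates maps each
    repeated key to the list of all its values in order of appearance.
    """
    seen = {}
    for line in text.splitlines():
        stripped = line.strip()
        # Skip blank lines, comments, and anything that is not key=value
        if not stripped or stripped.startswith("#") or "=" not in stripped:
            continue
        key, value = (part.strip() for part in stripped.split("=", 1))
        seen.setdefault(key, []).append(value)
    effective = {k: v[-1] for k, v in seen.items()}
    duplicates = {k: v for k, v in seen.items() if len(v) > 1}
    return effective, duplicates
```

Running this over your nifi.properties before a restart is a cheap way to make sure a setting you just added is not silently overridden further down the file.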
Please also check whether you see any certificate-related exceptions, such as:
WARN [Clustering Tasks Thread-2] o.apache.nifi.controller.FlowController Failed to send heartbeat due to: org.apache.nifi.cluster.protocol.ProtocolException: Failed marshalling 'HEARTBEAT' protocol message due to: javax.net.ssl.SSLHandshakeException: Received fatal alert: bad_certificate
In this case, create a new keystore and truststore, and add client auth (clientAuth) to the keystore certificate.
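To decide which of the cases above applies, a quick scan of the log can help. The following Python sketch simply counts lines matching the three exception patterns quoted in this thread. The hint texts and the `classify` helper are my own shorthand, not anything NiFi provides, and the log path in the usage example is an assumption.

```python
import re

# Patterns taken from the messages quoted in this thread; hints are shorthand.
HINTS = [
    (re.compile(r"UninheritableFlowException"),
     "flow mismatch: copy flow.xml.gz from the coordinator"),
    (re.compile(r"SSLHandshakeException|bad_certificate"),
     "certificate problem: review keystore and truststore"),
    (re.compile(r"SocketTimeoutException: Read timed out"),
     "timeout: consider raising the cluster node timeouts"),
]


def classify(lines):
    """Count log lines per hint; each line counts for the first pattern it matches."""
    counts = {}
    for line in lines:
        for pattern, hint in HINTS:
            if pattern.search(line):
                counts[hint] = counts.get(hint, 0) + 1
                break
    return counts


if __name__ == "__main__":
    # Hypothetical log location; adjust to your installation.
    with open("/var/log/nifi/nifi-app.log", encoding="utf-8", errors="replace") as f:
        for hint, n in classify(f).items():
            print(f"{n:5d}  {hint}")
```

This is only a triage aid: a high timeout count, for instance, may still have a certificate or flow problem as its underlying cause, so read the surrounding log lines as well.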
Best regards:
Ferenc
Ferenc Erdelyi, Technical Solutions Manager