New Contributor
Posts: 4
Registered: ‎05-06-2018

One tserver leads to other tservers hanging up in this Kudu cluster


Hi,

  

19:30 — The HBase service on node 46 of the cluster hung, so the HBase service was restarted. IO was found to be extremely slow and HBase kept hanging. HBase was then stopped, and the Kudu service was stopped as well.

 

19:33:39 — the error log of Zp-prd-hadoop-46 showed:

Runtime error: Unable to start connection negotiation thread: Could not create thread: Resource temporarily unavailable 

We then tried to reboot the system, but could not log in to Linux because it was so slow (the reason is still unknown), so the IDC staff rebooted the machine.
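The thread-creation error above (EAGAIN from pthread_create) usually means a process/thread limit was exhausted rather than memory. As a sketch, a few checks that could be run on the affected node — the kudu-tserver process name is an assumption; adjust it for your installation:

```shell
# Per-user limit on processes/threads -- if kudu runs as another user,
# check that user's limit (e.g. via su or /etc/security/limits.conf).
ulimit -u

# System-wide ceiling on the number of threads:
cat /proc/sys/kernel/threads-max

# Threads currently used by the tserver, if it is running:
pid=$(pgrep -o kudu-tserver || true)
if [ -n "$pid" ]; then
  grep Threads "/proc/$pid/status"
else
  echo "kudu-tserver not running"
fi
```

If `ulimit -u` is far below the total thread count of all services on the node, raising it (or the systemd equivalent) may be what is needed.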

 

After the Zp-prd-hadoop-46 Linux system was rebooted, we started the Kudu and HBase services.

The tserver then stopped again, so we started it manually.

 

The log of Zabbix:

[Screenshot: zabbix.png]

At around 22:20, the Kudu tservers on all machines except node 46 hung up one after another.

The tservers other than node 46 reported the following errors:

W0427 22:27:05.252460 92334 leader_election.cc:283] T c7b0f593416349d78a3f9e17d69aa642 P d77b8d073a6147ebacf53b3f5d8fa355 [CANDIDATE]: Term 1 election: Tablet error from VoteRequest() call to peer f371f68c687743aeb494adafbab3640f: Not found: Tablet not found: c7b0f593416349d78a3f9e17d69aa642

F0427 22:27:05.817157 92462 raft_consensus.cc:1264] Check failed: _s.ok() Bad status: Invalid argument: Tried to update clock beyond the max. error.

W0427 22:27:11.134296 92328 env_posix.cc:864] Time spent sync call for /opt/hadoop/data4/kudu/tserver/data/35aff008c42d43928bf27ec2bf857b0c.metadata: real 4.947s       user 0.000s     sys 0.001s

W0427 22:27:11.191977 92337 consensus_peers.cc:411] T 6f4bc807fe084fffb919ae1be790abbc P d77b8d073a6147ebacf53b3f5d8fa355 -> Peer 943af2a6ff15464c8860eb5085c738d5 (zp-prd-hadoop-42:7050): Couldn't send request to peer 943af2a6ff15464c8860eb5085c738d5 for tablet 6f4bc807fe084fffb919ae1be790abbc. Status: Remote error: Service unavailable: UpdateConsensus request on kudu.consensus.ConsensusService from 10.205.151.47:46331 dropped due to backpressure. The service queue is full; it has 150 items.. Retrying in the next heartbeat period. Already tried 1 times.


Then the OPS team started every tserver, but the following errors appeared, and after each Kudu tserver was started the service shut down again.

W0427 23:49:53.884155 98561 consensus_peers.cc:411] T cf7f286121db47e89f04c984bb1cbbe9 P d77b8d073a6147ebacf53b3f5d8fa355 -> Peer f6fc0ff84f9048628e2d0bb9fcf9dbc7 (zp-prd-hadoop-71:7050): Couldn't send request to peer f6fc0ff84f9048628e2d0bb9fcf9dbc7 for tablet cf7f286121db47e89f04c984bb1cbbe9. Status: Network error: Client connection negotiation failed: client connection to 10.205.151.71:7050: connect: Connection refused (error 111). Retrying in the next heartbeat period. Already tried 11 times.

W0427 23:49:54.040323 98559 consensus_peers.cc:411] T 287793b55c804fd2b98c12df5ac54e54 P d77b8d073a6147ebacf53b3f5d8fa355 -> Peer 68b6b79b2f4a42b1ae64ecea0d10a0f5 (zp-prd-hadoop-72:7050): Couldn't send request to peer 68b6b79b2f4a42b1ae64ecea0d10a0f5 for tablet 287793b55c804fd2b98c12df5ac54e54. Status: Network error: Client connection negotiation failed: client connection to 10.205.151.72:7050: connect: Connection refused (error 111). Retrying in the next heartbeat period. Already tried 12 times.

W0427 23:49:54.090852 98561 leader_election.cc:277] T 6c3d4464c70b4ce2a7d6af4f4d97ad85 P d77b8d073a6147ebacf53b3f5d8fa355 [CANDIDATE]: Term 20 pre-election: RPC error from VoteRequest() call to peer 943af2a6ff15464c8860eb5085c738d5: Network error: Client connection negotiation failed: client connection to 10.205.151.42:7050: connect: Connection refused (error 111)

W0427 23:49:54.090891 98558 leader_election.cc:277] T 6c3d4464c70b4ce2a7d6af4f4d97ad85 P d77b8d073a6147ebacf53b3f5d8fa355 [CANDIDATE]: Term 20 pre-election: RPC error from VoteRequest() call to peer 878093bc85374533b10e7780d0fbbab5: Network error: Client connection negotiation failed: client connection to 10.205.151.49:7050: connect: Connection refused (error 111)

W0427 23:49:54.136245 98560 consensus_peers.cc:411] T 7c77cb74e36a43b7a99a2aad634f44d2 P d77b8d073a6147ebacf53b3f5d8fa355 -> Peer d26f1bda31884f10abab9f7f51ba8f57 (zp-prd-hadoop-43:7050): Couldn't send request to peer d26f1bda31884f10abab9f7f51ba8f57 for tablet 7c77cb74e36a43b7a99a2aad634f44d2. Status: Network error: Client connection negotiation failed: client connection to 10.205.151.43:7050: connect: Connection refused (error 111). Retrying in the next heartbeat period. Already tried 15 times.

[Screenshot: RTX截图未命名1111.jpg]

We then restarted the Zp-prd-hadoop-46 tserver, and its log showed the following errors:

I0428 08:46:43.331285 46441 raft_consensus.cc:773] T e9340ae367b84269a66b978a1a7bf7c5 P c770fc7f035f489e802d85260eddf2de: Attempting to remove follower 878093bc85374533b10e7780d0fbbab5 from the Raft config. Reason: Leader has been unable to successfully communicate with Peer 878093bc85374533b10e7780d0fbbab5 for more than 300 seconds (11953.628s)

W0428 08:46:43.331295 46441 raft_consensus.cc:777] T e9340ae367b84269a66b978a1a7bf7c5 P c770fc7f035f489e802d85260eddf2de: Unable to remove follower 878093bc85374533b10e7780d0fbbab5: Illegal state: Leader has not yet committed an operation in its own term

W0428 08:46:43.331360 33702 consensus_peers.cc:411] T e9340ae367b84269a66b978a1a7bf7c5 P c770fc7f035f489e802d85260eddf2de -> Peer 878093bc85374533b10e7780d0fbbab5 (zp-prd-hadoop-49:7050): Couldn't send request to peer 878093bc85374533b10e7780d0fbbab5 for tablet e9340ae367b84269a66b978a1a7bf7c5. Status: Network error: Client connection negotiation failed: client connection to 10.205.151.49:7050: connect: Connection refused (error 111). Retrying in the next heartbeat period. Already tried 23912 times.

I0428 08:46:43.332494 185586 raft_consensus.cc:412] T 911b9f3dda704d5e84d178493013f1ec P c770fc7f035f489e802d85260eddf2de [term 23 FOLLOWER]: Starting pre-election (detected failure of leader 943af2a6ff15464c8860eb5085c738d5)

I0428 08:46:43.332521 185586 raft_consensus.cc:2456] T 911b9f3dda704d5e84d178493013f1ec P c770fc7f035f489e802d85260eddf2de: Snoozing failure detection for 19.873s (starting election)

I0428 08:46:43.332569 185586 raft_consensus.cc:434] T 911b9f3dda704d5e84d178493013f1ec P c770fc7f035f489e802d85260eddf2de [term 23 FOLLOWER]: Starting pre-election with config: opid_index: 185112 OBSOLETE_local: false peers { permanent_uuid: "878093bc85374533b10e7780d0fbbab5" member_type: VOTER last_known_addr { host: "zp-prd-hadoop-49" port: 7050 } } peers { permanent_uuid: "943af2a6ff15464c8860eb5085c738d5" member_type: VOTER last_known_addr { host: "zp-prd-hadoop-42" port: 7050 } } peers { permanent_uuid: "c770fc7f035f489e802d85260eddf2de" member_type: VOTER last_known_addr { host: "zp-prd-hadoop-46" port: 7050 } }

I0428 08:46:43.332698 113757 raft_consensus.cc:773] T 3ab68e1ff9ff4bd79cc0d3cdbfd519d0 P c770fc7f035f489e802d85260eddf2de: Attempting to remove follower d77b8d073a6147ebacf53b3f5d8fa355 from the Raft config. Reason: Leader has been unable to successfully communicate with Peer d77b8d073a6147ebacf53b3f5d8fa355 for more than 300 seconds (11957.327s)

W0428 08:46:43.332717 113757 raft_consensus.cc:777] T 3ab68e1ff9ff4bd79cc0d3cdbfd519d0 P c770fc7f035f489e802d85260eddf2de: Unable to remove follower d77b8d073a6147ebacf53b3f5d8fa355: Illegal state: Leader has not yet committed an operation in its own term

W0428 08:46:43.332803 33703 consensus_peers.cc:411] T 3ab68e1ff9ff4bd79cc0d3cdbfd519d0 P c770fc7f035f489e802d85260eddf2de -> Peer d77b8d073a6147ebacf53b3f5d8fa355 (zp-prd-hadoop-47:7050): Couldn't send request to peer d77b8d073a6147ebacf53b3f5d8fa355 for tablet 3ab68e1ff9ff4bd79cc0d3cdbfd519d0. Status: Network error: Client connection negotiation failed: client connection to 10.205.151.47:7050: connect: Connection refused (error 111). Retrying in the next heartbeat period. Already tried 23884 times.

I0428 08:46:43.332828 185586 leader_election.cc:221] T 911b9f3dda704d5e84d178493013f1ec P c770fc7f035f489e802d85260eddf2de [CANDIDATE]: Term 24 pre-election: Requesting pre-vote from peer 878093bc85374533b10e7780d0fbbab5

I0428 08:46:43.332842 185586 leader_election.cc:221] T 911b9f3dda704d5e84d178493013f1ec P c770fc7f035f489e802d85260eddf2de [CANDIDATE]: Term 24 pre-election: Requesting pre-vote from peer 943af2a6ff15464c8860eb5085c738d5

W0428 08:46:43.332947 33702 leader_election.cc:277] T 911b9f3dda704d5e84d178493013f1ec P c770fc7f035f489e802d85260eddf2de [CANDIDATE]: Term 24 pre-election: RPC error from VoteRequest() call to peer 878093bc85374533b10e7780d0fbbab5: Network error: Client connection negotiation failed: client connection to 10.205.151.49:7050: connect: Connection refused (error 111)

W0428 08:46:43.332970 33705 leader_election.cc:277] T 911b9f3dda704d5e84d178493013f1ec P c770fc7f035f489e802d85260eddf2de [CANDIDATE]: Term 24 pre-election: RPC error from VoteRequest() call to peer 943af2a6ff15464c8860eb5085c738d5: Network error: Client connection negotiation failed: client connection to 10.205.151.42:7050: connect: Connection refused (error 111)

I0428 08:46:43.332984 33705 leader_election.cc:248] T 911b9f3dda704d5e84d178493013f1ec P c770fc7f035f489e802d85260eddf2de [CANDIDATE]: Term 24 pre-election: Election decided. Result: candidate lost.

I0428 08:46:43.333039 59834 raft_consensus.cc:2456] T 911b9f3dda704d5e84d178493013f1ec P c770fc7f035f489e802d85260eddf2de: Snoozing failure detection for 16.010s (election complete)

I0428 08:46:43.333050 59834 raft_consensus.cc:2216] T 911b9f3dda704d5e84d178493013f1ec P c770fc7f035f489e802d85260eddf2de [term 23 FOLLOWER]: Leader pre-election lost for term 24. Reason: could not achieve majority

I0428 08:46:43.334172 131277 raft_consensus.cc:773] T 94de7db51751487b80f0f4ddef23c149 P c770fc7f035f489e802d85260eddf2de: Attempting to remove follower 1faac4e2c007428698fa71d6f7a43ab2 from the Raft config. Reason: Leader has been unable to successfully communicate with Peer 1faac4e2c007428698fa71d6f7a43ab2 for more than 300 seconds (11957.915s)

I0428 08:46:43.334192 99202 raft_consensus.cc:773] T 28551806fe6c4be1a7738fd1dbf0376f P c770fc7f035f489e802d85260eddf2de: Attempting to remove follower 28f41e86f6884943bb51ad0ed7d55446 from the Raft config. Reason: Leader has been unable to successfully communicate with Peer 28f41e86f6884943bb51ad0ed7d55446 for more than 300 seconds (11957.317s)

[Screenshot: RTX截图未命名1.jpg]

We found that one hard disk was failing intermittently, and this disk stored the WAL of the node-46 tserver.

Since the hard disk had not been replaced, every time the node-46 tserver was restarted, the tservers on the other machines would hang. In the end we left the node-46 tserver down and restarted all of the Kudu cluster's other tservers, and the whole cluster recovered.

Guess 1: a Kudu server clock synchronization problem?

After checking, we found that the Kudu servers' clocks were synchronizing normally.

Guess 2: because the hard disk storing one machine's tserver WALs was broken, the other nodes of the cluster hung. Our guess is that the leader replicas on that node flapped between healthy and failed, misleading the other nodes into copying data from it. The copy could not complete (IO was slow, or the disk could not be read or written), which eventually caused the tservers on the other nodes to hang.

 

Staging environment simulation test results:

  1. We used cgroups to limit the tserver's read IO on the WAL disk, but the limit could not be enforced: the tserver's thread IDs kept changing, which made the IO limit unreliable.
  2. We removed the wal directory under one tserver's fs_wal_dir and restarted that tserver. It restarted successfully, and the tservers on the other nodes did not hang. (No data was being imported into the Kudu cluster during this test.)
  3. With the cluster running normally, we removed the wal directory under one tserver's fs_wal_dir and imported data into a Kudu table hosted on that tserver. That node's tserver crashed immediately because the tablet's WAL files could not be found, but the tservers on the other nodes did not hang.
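On test 1: cgroup membership is per process, not per thread. Writing the tserver's PID to cgroup.procs moves all of its threads at once, and newly spawned threads inherit the cgroup, so thread-ID churn should not defeat the limit. A hedged sketch for cgroup v1 — the blkio mount point and the 8:0 device number for the WAL disk are assumptions, and this needs root:

```shell
CG=/sys/fs/cgroup/blkio/kudu-wal-limit
DEV="8:0"           # major:minor of the WAL disk -- check with lsblk
BPS=$((1024*1024))  # 1 MiB/s read throttle

if [ -d /sys/fs/cgroup/blkio ]; then
  mkdir -p "$CG"
  echo "$DEV $BPS" > "$CG/blkio.throttle.read_bps_device"
  # Write the PID to cgroup.procs (not individual TIDs to "tasks"):
  # all current threads move together, and future threads stay inside.
  pgrep -o kudu-tserver > "$CG/cgroup.procs"
else
  echo "cgroup v1 blkio controller not mounted"
fi
```

Throttling via the "tasks" file with individual thread IDs is what breaks when threads come and go; cgroup.procs avoids that.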

 

   So why does one tserver lead to the other tservers in this Kudu cluster hanging up? Thank you!

Cloudera Employee
Posts: 36
Registered: ‎04-08-2014

Re: One tserver leads to other tservers hanging up in this Kudu cluster

Hello King,

I'm sorry to hear you have run into so many problems.

 

1) It's hard to know what might be going wrong without knowing the version you are running. What version of CDH and Kudu is this?

 

2) It seems like you are discussing many scenarios here. I think we need to focus in on one specific scenario at a time because otherwise it becomes too difficult to have a conversation. Please clarify whether you are hitting problems in production or whether your problems are related to your simulated failure tests in staging.

 

3) To answer your question about hard disk failures, it depends on your version of Kudu. However usually if there is a disk failure then Kudu must copy all data that was on that machine to other machines, so that could slow down the entire cluster. This might cause problems if the cluster is very close to its resource limits already.

 

Regards,

Mike

 

New Contributor
Posts: 4
Registered: ‎05-06-2018

Re: One tserver leads to other tservers hanging up in this Kudu cluster

Hello Mike,

Thank you for your reply.

 1) The Kudu version is 1.5.0-cdh5.13.0, but the other services use CDH 5.10.0, and we do not use Cloudera Manager to manage our cluster. The Kudu cluster has 23 machines, split into 20 tservers and 3 masters.

 2) The problem in production was that one tserver (node 46) had a failed hard disk storing that tserver's WALs, and this led to the other 19 tservers hanging. To resolve it, we restarted all of the Kudu cluster's tservers except node 46, after which the cluster was OK. We tried to reproduce the failure in staging but could not: in the test, one hung tserver did not cause the other tservers to hang.

In production, events occurred in the following order (node 46 had the hard disk failure):

     a) The node-46 tserver hung.
     b) We started the node-46 tserver manually.
     c) The other 19 tservers shut down one by one.
     d) We stopped the node-46 tserver and restarted all of the cluster's tservers except node 46; the Kudu cluster was then OK.
     e) We then restarted the node-46 tserver, and the other 19 tservers again shut down one by one.

 3) We checked the Zabbix monitoring and found that resource usage was normal across the cluster except on node 46, and the other services on those nodes (Hive, HBase, etc.) were fine at the time.

Regards,

King

Cloudera Employee
Posts: 36
Registered: ‎04-08-2014

Re: One tserver leads to other tservers hanging up in this Kudu cluster

OK let's focus on your production issue. I am confused about the number of nodes. You say you have 46 nodes of Hadoop on CDH 5.10 and 22 nodes of Kudu on CDH 5.13?

 

I think you are saying that one of your nodes (node 46) was hung and you don't know why. Was it running Kudu? Then it was rebooted. After that you are saying that all 19 tablet servers were also hung after that? Can you clarify what that looked like and why you think they were hung?

 

Thanks,

Mike

New Contributor
Posts: 4
Registered: ‎05-06-2018

Re: One tserver leads to other tservers hanging up in this Kudu cluster


Hello Mike,

Thank you for your reply.

I'm very sorry for the misunderstanding caused by my description: node 46 is just one of the cluster's nodes. The Kudu cluster has 23 nodes, split into 20 tablet servers and 3 master servers.

All 19 tablet servers hung: first the tablet server process disappeared, then the error logs of the 19 tablet servers showed the following:

Log file created at: 2018/04/28 00:08:52
Running on machine: zp-prd-hadoop-71
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
E0428 00:08:52.087950 104522 ts_tablet_manager.cc:805] T 043afd0fb83d4c539a8d66109c7c6702 P f6fc0ff84f9048628e2d0bb9fcf9dbc7: Failed to load consensus metadata: Not found: Unable to load consensus metadata for tablet 043afd0fb83d4c539a8d66109c7c6702: /opt/hadoop/data1/kudu/tserver/consensus-meta/043afd0fb83d4c539a8d66109c7c6702: No such file or directory (error 2)
F0428 05:27:23.480379 104613 raft_consensus.cc:1264] Check failed: _s.ok() Bad status: Invalid argument: Tried to update clock beyond the max. error.
F0428 05:27:25.163254 104599 raft_consensus.cc:1264] Check failed: _s.ok() Bad status: Invalid argument: Tried to update clock beyond the max. error.

 

       We also checked the NTP service and the Linux system clock; everything was OK.
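Note that Kudu's HybridClock queries the kernel NTP discipline (via ntp_adjtime), so a healthy ntpd is not by itself proof that the clock error is small. As a sketch, two checks worth running on each node, assuming the ntp utilities are installed:

```shell
# Peer view: the '*' prefix marks the peer the daemon is synced to.
{ command -v ntpq >/dev/null && ntpq -p; } || echo "ntpq not installed"

# Kernel view: "maximum error" (in usec) is what Kudu compares against
# its --max_clock_sync_error_usec threshold (default 10 s); the status
# word should not include UNSYNC.
{ command -v ntptime >/dev/null && ntptime; } || echo "ntptime not installed"
```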

Regards,

King

Cloudera Employee
Posts: 36
Registered: ‎04-08-2014

Re: One tserver leads to other tservers hanging up in this Kudu cluster

 

Hi King, I'll address the two types of errors you saw separately:

 

(1) E0428 00:08:52.087950 104522 ts_tablet_manager.cc:805] T 043afd0fb83d4c539a8d66109c7c6702 P f6fc0ff84f9048628e2d0bb9fcf9dbc7: Failed to load consensus metadata: Not found: Unable to load consensus metadata for tablet 043afd0fb83d4c539a8d66109c7c6702: /opt/hadoop/data1/kudu/tserver/consensus-meta/043afd0fb83d4c539a8d66109c7c6702: No such file or directory (error 2)

 

I think it's likely that when you caused a hard reboot of the server, the consensus metadata was not able to fsync. This is something that you can control. If you want to avoid this in the future then set --log_force_fsync_all=true, however there may be a performance impact since that flag also controls whether the WAL will fsync.
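For reference, a sketch of enabling it — the flag-file path below is an assumption; use whichever gflagfile your tserver startup scripts pass via --flagfile:

```
# /etc/kudu/conf/tserver.gflagfile  (path is site-specific)
--log_force_fsync_all=true
```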

 

(2) F0428 05:27:23.480379 104613 raft_consensus.cc:1264] Check failed: _s.ok() Bad status: Invalid argument: Tried to update clock beyond the max. error.

 

This fatal error should cause a crash -- not a hang -- right? I mention this because you said it caused a "hang" in your description.

 

Is it possible that when you rebooted the server, it came up with a highly skewed clock, and then NTP corrected it after Kudu started up? Maybe that was the cause of the problem. One way to check this is by doing a hard reboot of a server and see if you can reproduce the problem that way.

 

This is a known issue and I just filed https://issues.apache.org/jira/browse/KUDU-2435 so that we can try to consider better alternatives to handling that situation by causing a crash.

New Contributor
Posts: 4
Registered: ‎05-06-2018

Re: One tserver leads to other tservers hanging up in this Kudu cluster


Hi Mike,

Thank you for your reply. I have some questions.

1) If I rebooted the server at that time, the tablet replicas on that server would all come back as FOLLOWERs in the Raft config, so replication from that server to the others would not happen. (https://issues.apache.org/jira/browse/KUDU-2435)

2) If I rebooted the server and it came up with a highly skewed clock, the Kudu tserver would refuse to start because of the skew:

 

   F0509 09:08:03.652022 36544 tablet_server_main.cc:80] Check failed: _s.ok() Bad status: Service unavailable: Cannot initialize clock: Error reading clock. Clock considered unsynchronized

Regards,

King

Cloudera Employee
Posts: 36
Registered: ‎04-08-2014

Re: One tserver leads to other tservers hanging up in this Kudu cluster

(1) The server would come up with all replicas as followers, but it's possible that some of those replicas on that server could be elected as leader soon after that, right?

(2) I wonder if the time skew was not large enough for NTP to consider the clock unsynchronized, but was large enough that the HybridTime clock on the remote peer could not be advanced to that time. I was not aware that was possible, but it seems to be one potential explanation for what we are seeing here.
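That hypothesis is numerically plausible: NTP only declares the clock unsynchronized at a fairly coarse threshold, while Kudu rejects any incoming timestamp beyond the local "now + maximum error" bound, 10 seconds by default (--max_clock_sync_error_usec). A simplified sketch of the bound check — an illustration, not Kudu's actual hybrid_clock.cc logic:

```shell
# All values in microseconds.
local_now=1000000000000               # local wall-clock time after reboot
max_error=10000000                    # Kudu default max clock error: 10 s
remote_ts=$((local_now + 15000000))   # peer timestamp 15 s ahead

# A timestamp beyond now + max_error triggers
# "Tried to update clock beyond the max. error" on the receiver.
if [ "$remote_ts" -gt "$((local_now + max_error))" ]; then
  echo "rejected: skew exceeds max clock error"
fi
```

So a 15-second skew can be small enough for ntpd to correct silently, yet large enough to make every peer that receives a timestamp from the skewed node hit the fatal check.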