
"BlockingRecv error" when inserting rows into KUDU 1.2

Explorer
Sometimes the WARNING "Negotiation complete: Network error: Server connection negotiation failed: server connection from 11.137.128.14:20391: BlockingRecv error: Recv() got EOF from remote (error 108)" occurred when a lot of rows (about 200K rows/second) were being inserted into Kudu by Storm. After I restarted the Storm job, the inserts would work fine again. No WARN/ERROR logs appeared in Storm.

Expert Contributor

I'm not familiar with Storm; how exactly does it write to Kudu? The error sounds like one piece of the Storm job abruptly disconnected from one of the Kudu servers. It may be harmless (maybe Storm will retry), or it may indicate a failure in the job. Given that Storm didn't exhibit any errors, it's likely to be the former, though you could always verify that by reading the data out of the table and ensuring that it's all there.
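One quick way to do that verification is a count-only scan with the Kudu Java client. A minimal sketch follows; the master address (kudu-master:7051) and table name (my_table) are placeholders for your own setup:

import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduException;
import org.apache.kudu.client.KuduScanner;
import org.apache.kudu.client.KuduTable;
import org.apache.kudu.client.RowResultIterator;

public class KuduRowCount {
  public static void main(String[] args) throws KuduException {
    // Placeholder master address and table name -- substitute your own.
    KuduClient client = new KuduClient.KuduClientBuilder("kudu-master:7051").build();
    try {
      KuduTable table = client.openTable("my_table");
      KuduScanner scanner = client.newScannerBuilder(table).build();
      long count = 0;
      while (scanner.hasMoreRows()) {
        // Each batch of scan results reports how many rows it contains.
        RowResultIterator rows = scanner.nextRows();
        count += rows.getNumRows();
      }
      System.out.println("Rows scanned: " + count);
    } finally {
      client.close();
    }
  }
}

If the count matches the number of rows Storm reported as written, the disconnections were retried successfully and are just noise.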

 

Is there any other associated logging in Kudu about this failure? Maybe a trace message of some kind?

 

Explorer
Here are some trace messages: W0322 09:40:06.534153 129361 negotiation.cc:240] Failed RPC negotiation. Trace: 0322 09:39:51.533481 (+ 0us) reactor.cc:378] Submitting negotiation task for server connection from 11.136.156.50:39299 0322 09:39:51.533517 (+ 36us) sasl_server.cc:230] Waiting for connection header 0322 09:39:57.673309 (+6139792us) sasl_server.cc:238] Connection header received 0322 09:39:57.673309 (+ 0us) sasl_server.cc:181] Waiting for next SASL message... 0322 09:40:06.534117 (+8860808us) negotiation.cc:231] Negotiation complete: Timed out: Server connection negotiation failed: server connection from 11.136.156.50:39299 Metrics: {"negotiator.queue_time_us":8} W0322 09:40:16.245877 164604 connection.cc:464] server connection from 11.136.156.55:43035 recv error: Network error: recv error: Connection reset by peer (error 104) W0322 09:40:16.246330 164604 connection.cc:378] Connection torn down before Call kudu.tserver.TabletServerService.Write from 11.136.156.55:43035 (ReqId={client: 574a7048c3824c869a1e553825845ba6, seq_no=467150, attempt_no=1}) could send its response Other trace messages: W0322 09:27:45.480054 101702 negotiation.cc:240] Failed RPC negotiation. Trace: 0322 09:27:31.439139 (+ 0us) reactor.cc:378] Submitting negotiation task for server connection from 11.131.120.17:41515 0322 09:27:31.439526 (+ 387us) sasl_server.cc:230] Waiting for connection header 0322 09:27:32.087993 (+648467us) sasl_server.cc:238] Connection header received 0322 09:27:32.087994 (+ 1us) sasl_server.cc:181] Waiting for next SASL message... 0322 09:27:45.479950 (+13391956us) sasl_server.cc:295] SASL Server: Received NEGOTIATE request from client 0322 09:27:45.479964 (+ 14us) sasl_server.cc:341] Sent NEGOTIATE response 0322 09:27:45.479967 (+ 3us) sasl_server.cc:181] Waiting for next SASL message... 0322 09:27:45.480007 (+ 40us) negotiation.cc:231] Negotiation complete: Network error: Server connection negotiation failed: server connection from 11.131.120.17:41515: BlockingRecv error: Recv() got EOF from remote (error 108) Metrics: {"negotiator.queue_time_us":352,"thread_start_us":344,"threads_started":1}

Expert Contributor

I reformatted your errors so they're a bit easier to read:

W0322 09:40:06.534153 129361 negotiation.cc:240] Failed RPC negotiation. Trace:
0322 09:39:51.533481 (+ 0us) reactor.cc:378] Submitting negotiation task for server connection from 11.136.156.50:39299
0322 09:39:51.533517 (+ 36us) sasl_server.cc:230] Waiting for connection header
0322 09:39:57.673309 (+6139792us) sasl_server.cc:238] Connection header received
0322 09:39:57.673309 (+ 0us) sasl_server.cc:181] Waiting for next SASL message...
0322 09:40:06.534117 (+8860808us) negotiation.cc:231] Negotiation complete: Timed out: Server connection negotiation failed: server connection from 11.136.156.50:39299
Metrics: {"negotiator.queue_time_us":8}

W0322 09:40:16.245877 164604 connection.cc:464] server connection from 11.136.156.55:43035 recv error: Network error: recv error: Connection reset by peer (error 104)

W0322 09:40:16.246330 164604 connection.cc:378] Connection torn down before Call kudu.tserver.TabletServerService.Write from 11.136.156.55:43035 (ReqId={client: 574a7048c3824c869a1e553825845ba6, seq_no=467150, attempt_no=1}) could send its response

W0322 09:27:45.480054 101702 negotiation.cc:240] Failed RPC negotiation. Trace:
0322 09:27:31.439139 (+ 0us) reactor.cc:378] Submitting negotiation task for server connection from 11.131.120.17:41515
0322 09:27:31.439526 (+ 387us) sasl_server.cc:230] Waiting for connection header
0322 09:27:32.087993 (+648467us) sasl_server.cc:238] Connection header received
0322 09:27:32.087994 (+ 1us) sasl_server.cc:181] Waiting for next SASL message...
0322 09:27:45.479950 (+13391956us) sasl_server.cc:295] SASL Server: Received NEGOTIATE request from client
0322 09:27:45.479964 (+ 14us) sasl_server.cc:341] Sent NEGOTIATE response
0322 09:27:45.479967 (+ 3us) sasl_server.cc:181] Waiting for next SASL message...
0322 09:27:45.480007 (+ 40us) negotiation.cc:231] Negotiation complete: Network error: Server connection negotiation failed: server connection from 11.131.120.17:41515: BlockingRecv error: Recv() got EOF from remote (error 108)
Metrics: {"negotiator.queue_time_us":352,"thread_start_us":344,"threads_started":1}

Still, that looks fairly innocuous to me; it's just a connection failure. Have you checked whether all of your data was written?

Explorer
The write speed decreases from 200K rows/second to 5K rows/second, and there are many "error 104"/"error 108" warnings in the trace log. Not all of the data is available in Kudu.

Explorer
I'm sorry, I was mistaken about the Storm log. There are many IOExceptions in the Kudu client log, such as "[ERROR 2017-03-22 10:30:21 o.a.k.c.TabletClient:791 New I/O worker #1] [Peer 772d8abf66c24789a51ba852aea1b478] Unexpected exception from downstream on [id: 0x2b9826d2, /11.131.120.17:55922 => /11.173.197.31:7050] java.io.IOException: Broken pipe"

Expert Contributor
Hi,

Are you pooling clients? Are you sure that your Storm workers are not suffering GC pauses? It would be worth running the Storm workers with verbose GC logging and checking that you aren't seeing GC pauses around the same time as the errors.
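As a rough sketch of that, GC logging can be enabled for every worker JVM through worker.childopts in storm.yaml; the heap size and log path here are placeholders, and %ID% is replaced by Storm with the worker's port so each worker gets its own log file:

# storm.yaml (placeholder values -- adjust heap size and paths for your cluster)
worker.childopts: "-Xmx2g -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -Xloggc:/var/log/storm/gc-worker-%ID%.log"

Any full GC or safepoint pause of several seconds in those logs would line up with the negotiation delays described below.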

The server-side log indicates that the workers are connecting and then taking 6-10 seconds to perform the round trips necessary for RPC connection negotiation, so the clients end up dropping the connections.

This would also explain the throughput drop.

-Todd

Explorer
Thank you! It really was because the Storm workers were suffering GC pauses! We will check the Storm workers' code.