Created on 06-22-2018 07:19 AM - edited 09-16-2022 06:22 AM
I recently built a real-time PoC using Kafka, Spark2 and Kudu, and I'm facing severe performance issues with Kudu.
To sum up the case: some devices send short messages to a Kafka topic, which are collected by a Spark2 Streaming job (5-second batches) and persisted to Kudu through upsert operations.
Volume is up to 600,000 messages per 5-second batch, at about 40 bytes per message. My Kudu cluster has 6 Tablet Servers (hard memory limit = 20GB per Tablet Server), each TS has 3 SSD disks, the table is defined with hash partitioning into 100 buckets, and there are 3 Masters.
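To illustrate the table layout, here is a rough sketch of how an equivalent table could be created with the Kudu Java client from Scala (the table name and column names are placeholders, not my real schema):

// Sketch only: placeholder schema with a 100-bucket hash partition and 3 replicas,
// matching the layout described above.
import org.apache.kudu.{ColumnSchema, Schema, Type}
import org.apache.kudu.client.{CreateTableOptions, KuduClient}
import scala.collection.JavaConverters._

val client = new KuduClient.KuduClientBuilder(
  "kudu-master-1:7051,kudu-master-2:7051,kudu-master-3:7051").build()

// Primary key columns come first and include the hash partition column.
val columns = List(
  new ColumnSchema.ColumnSchemaBuilder("device_id", Type.STRING).key(true).build(),
  new ColumnSchema.ColumnSchemaBuilder("event_time", Type.INT64).key(true).build(),
  new ColumnSchema.ColumnSchemaBuilder("payload", Type.STRING).build()
).asJava

val options = new CreateTableOptions()
  .addHashPartitions(List("device_id").asJava, 100)  // 100 hash buckets
  .setNumReplicas(3)                                  // standard 3-way replication

client.createTable("device_messages", new Schema(columns), options)
client.close()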
Kudu is deployed with Cloudera CDH Express Edition 5.14 in Azure for the PoC, and all the nodes are in the same vnet.
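To give an idea of the job itself, here is a simplified sketch of the kind of Spark2 Streaming code involved (not my exact code; the broker and master addresses, topic, message format and table name are placeholders):

// Sketch: read the Kafka topic in 5-second micro-batches and upsert each batch
// into Kudu through the kudu-spark2 KuduContext.
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.kudu.spark.kudu.KuduContext
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object DeviceMessagesToKudu {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("devices-to-kudu").getOrCreate()
    import spark.implicits._

    val ssc = new StreamingContext(spark.sparkContext, Seconds(5)) // 5-second batches
    val kuduContext = new KuduContext(
      "kudu-master-1:7051,kudu-master-2:7051,kudu-master-3:7051", spark.sparkContext)

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker-1:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "device-poc")

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("device_messages"), kafkaParams))

    stream.foreachRDD { rdd =>
      // Parse the short payloads (assumed "deviceId,timestamp,value" format)
      // and upsert the whole micro-batch in one call.
      val df = rdd.map(_.value.split(","))
        .map(f => (f(0), f(1).toLong, f(2)))
        .toDF("device_id", "event_time", "payload")
      kuduContext.upsertRows(df, "device_messages")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

Each micro-batch is turned into a DataFrame and written with a single upsertRows call, which is where the delays show up once the slowdown starts.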
When I start the Spark2 job (batch execution time stays between 1 and 3 seconds), everything seems to go fine for the first 24 hours. Then a bottleneck appears when upserting rows into Kudu, and the batch delay varies between 5 seconds and more than 600 seconds.
Analyzing the Kudu TS logs, I can see warning and error messages I can't interpret.
For instance:
W0622 08:21:50.793898 2051 negotiation.cc:313] Failed RPC negotiation. Trace:
0622 08:21:50.793636 (+ 0us) reactor.cc:579] Submitting negotiation task for server connection from 10.164.3.101:45662
0622 08:21:50.793817 (+ 181us) server_negotiation.cc:176] Beginning negotiation
0622 08:21:50.793819 (+ 2us) server_negotiation.cc:365] Waiting for connection header
0622 08:21:50.793862 (+ 43us) negotiation.cc:304] Negotiation complete: Network error: Server connection negotiation failed: server connection from xxx.xxx.xxx.xxx:45662: BlockingRecv error: Recv() got EOF from remote (error 108)
Metrics: {"server-negotiator.queue_time_us":151,"thread_start_us":86,"threads_started":1}
This warning appears several times on several TS, but I didn't notice any trouble with our network.
E0622 11:24:43.982810 28825 hybrid_clock.cc:395] Unable to read clock for last 0.774s: Service unavailable: Error reading clock. Clock considered unsynchronized
This error began after 5 days of running Kudu day and night.
W0622 11:24:47.979626 66209 connection.cc:657] server connection from xxx.xxx.xxx.xxx:50238 send error: Network error: failed to write to TLS socket: Connection reset by peer
W0622 11:24:47.979672 66209 connection.cc:422] Connection torn down before Call kudu.tserver.TabletServerService.Write from xxx.xxx.xxx.xxx:50238 (ReqId={client: afb1a862dcca4b3494e27a011fe31d63, seq_no=4637933, attempt_no=1}) could send its response
W0622 11:24:47.979729 66209 connection.cc:188] Error closing socket: Network error: TlsSocket::Close: Success
W0622 11:24:47.981634 66209 connection.cc:511] server connection from xxx.xxx.xxx.xxx:33178 recv error: Network error: failed to read from TLS socket: Connection reset by peer (error 104)
W0622 11:24:47.981720 66209 connection.cc:188] Error closing socket: Network error: TlsSocket::Close: Broken pipe
W0622 11:24:48.057577 66210 connection.cc:657] server connection from xxx.xxx.xxx.xxx:41744 send error: Network error: failed to write to TLS socket: Broken pipe
W0622 11:24:48.057612 66210 connection.cc:422] Connection torn down before Call kudu.tserver.TabletServerService.Write from xxx.xxx.xxx.xxx:41744 (ReqId={client: 47ee414afec74fcb86f86a19b62d8f7d, seq_no=4631450, attempt_no=1}) could send its response
W0622 11:24:48.057677 66210 connection.cc:188] Error closing socket: Network error: TlsSocket::Close: Success
W0622 11:24:48.057819 66210 connection.cc:422] Connection torn down before Call kudu.tserver.TabletServerService.Write from xxx.xxx.xxx.xxx:41744 (ReqId={client: 47ee414afec74fcb86f86a19b62d8f7d, seq_no=4631470, attempt_no=1}) could send its response
In this case, I notice this warning on several TS, but I can't identify any trouble with my network (an independent vnet in Azure).
W0621 14:34:19.201867 125277 env_posix.cc:862] Time spent sync call for /data/kudu/0/data/data/008809b1a384414b93b9d95b6275217a.metadata: real 2.234s user 0.000s sys 0.000s
W0621 14:34:32.088975 125277 env_posix.cc:862] Time spent sync call for /data/kudu/0/data/data/56153c373b084500b3b1fd9310c30c53.data: real 1.786s user 0.000s sys 0.000s
W0621 14:37:23.982235 125277 env_posix.cc:862] Time spent sync call for /data/kudu/0/data/data/df02bb85c15c4652ae653d68e7b8d0ae.metadata: real 7.751s user 0.000s sys 0.000s
W0621 14:37:45.769942 125275 kernel_stack_watchdog.cc:191] Thread 118579 stuck at /data/jenkins/workspace/generic-package-centos64-7-0-impala/topdir/BUILD/kudu-1.6.0-cdh5.14.0/src/kudu/consensus/log.cc:664 for 557ms:
Kernel stack:
[<ffffffffc03eecc5>] do_get_write_access+0x285/0x4c0 [jbd2]
[<ffffffffc03eef27>] jbd2_journal_get_write_access+0x27/0x40 [jbd2]
[<ffffffffc0442d4b>] __ext4_journal_get_write_access+0x3b/0x80 [ext4]
[<ffffffffc04144a0>] ext4_reserve_inode_write+0x70/0xa0 [ext4]
[<ffffffffc0414523>] ext4_mark_inode_dirty+0x53/0x210 [ext4]
[<ffffffffc0417af0>] ext4_dirty_inode+0x40/0x60 [ext4]
[<ffffffff8122ffba>] __mark_inode_dirty+0x16a/0x270
[<ffffffff8121ee01>] update_time+0x81/0xd0
[<ffffffff8121eef0>] file_update_time+0xa0/0xf0
[<ffffffff811865a8>] __generic_file_aio_write+0x198/0x400
[<ffffffff81186869>] generic_file_aio_write+0x59/0xa0
[<ffffffffc040aeeb>] ext4_file_write+0xdb/0x470 [ext4]
[<ffffffff812020e9>] do_sync_readv_writev+0x79/0xd0
[<ffffffff81203cce>] do_readv_writev+0xce/0x260
[<ffffffff81203ef5>] vfs_writev+0x35/0x60
[<ffffffff812042f2>] SyS_pwritev+0xc2/0xf0
[<ffffffff816b89fd>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
User stack:
(thread did not respond: maybe it is blocking signals)
In this case, I just can't understand what the message means.
I'm sure something is wrong with my overall configuration (Kudu or the network), but I can't figure out what.
Any help would be great!
Thanks
Created 06-25-2018 03:07 PM
Hi,
For the warning you saw from the kernel stack watchdog, it looks like a kernel bug on EL6 machines using ext4, which requires either upgrading to RHEL7 or using XFS instead of ext4.
Also, would you mind sharing the master log from when you see the slowness (from the symptoms you described, one possibility could be KUDU-2264)? Thanks!
Best,
Hao
Created 07-11-2018 09:32 AM
Created 07-11-2018 02:54 PM
Sounds like good news. Thanks for the update!