Registered: 05-19-2016

kernel panic on high load

Hi all,

 

We have a cluster of 1 NameNode and 6 DataNodes. Under high load, some of the DataNodes crash with the following kernel panic:

<6>warning: `jsvc' uses 32-bit capabilities (legacy support in use)
<3>BUG: scheduling while atomic: ksoftirqd/0/4/0x10000100
<4>Modules linked in: bonding ipv6 vfat fat iTCO_wdt iTCO_vendor_support sb_edac edac_core i2c_i801 i2c_core lpc_ich mfd_core tg3 ptp pps_core hpilo hpwdt ioatdma dca power_meter acpi_ipmi ipmi_si ipmi_msghandler sg ext4 jbd2 mbcache sd_mod crc_t10dif xhci_hcd hpsa wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
<4>Pid: 4, comm: ksoftirqd/0 Not tainted 2.6.32-573.22.1.el6.x86_64 #1
<4>Call Trace:
<4> <IRQ>  [<ffffffff81063386>] ? __schedule_bug+0x66/0x70
<4> [<ffffffff8153999c>] ? thread_return+0x6ac/0x7d0
<4> [<ffffffff810f0628>] ? handle_edge_irq+0x98/0x180
<4> [<ffffffff815434f5>] ? do_IRQ+0x75/0xf0
<4> [<ffffffff814c2ba2>] ? tcp_fragment+0x332/0x340
<4> [<ffffffff8100ba53>] ? ret_from_intr+0x0/0x11
<4> [<ffffffff814c2ba2>] ? tcp_fragment+0x332/0x340
<4> [<ffffffff8106f1da>] ? __cond_resched+0x2a/0x40
<4> [<ffffffff814c2ba2>] ? tcp_fragment+0x332/0x340
<4> [<ffffffff814c2ba2>] ? tcp_fragment+0x332/0x340
<4> [<ffffffff81539d90>] ? _cond_resched+0x30/0x40
<4> [<ffffffff8100fd16>] ? is_valid_bugaddr+0x16/0x40
<4> [<ffffffff812906ff>] ? report_bug+0x1f/0xc0
<4> [<ffffffff81010fcf>] ? die+0x7f/0x90
<4> [<ffffffff8153d274>] ? do_trap+0xc4/0x160
<4> [<ffffffff8100cd95>] ? do_invalid_op+0x95/0xb0
<4> [<ffffffff814c2ba2>] ? tcp_fragment+0x332/0x340
<4> [<ffffffffa0236831>] ? bond_dev_queue_xmit+0x31/0xa0 [bonding]
<4> [<ffffffffa02368ff>] ? bond_xmit_slave_id+0x5f/0xd0 [bonding]
<4> [<ffffffff8100c01b>] ? invalid_op+0x1b/0x20
<4> [<ffffffff814c2ba2>] ? tcp_fragment+0x332/0x340
<4> [<ffffffff814c2a74>] ? tcp_fragment+0x204/0x340
<4> [<ffffffff814b89a1>] ? tcp_match_skb_to_sack+0xb1/0xf0
<4> [<ffffffff814b8ac8>] ? tcp_sacktag_walk+0xe8/0x520
<4> [<ffffffff814abd05>] ? ip_local_out+0x25/0x30
<4> [<ffffffff814b9322>] ? tcp_sacktag_write_queue+0x422/0xb80
<4> [<ffffffff814beeb0>] ? tcp_ack+0x9c0/0x13f0
<4> [<ffffffff814bd0e1>] ? tcp_validate_incoming+0x61/0x2e0
<4> [<ffffffff814bfbe8>] ? tcp_rcv_established+0x308/0x7e0
<4> [<ffffffff814c81b3>] ? tcp_v4_do_rcv+0x2e3/0x490
<4> [<ffffffff814bfb72>] ? tcp_rcv_established+0x292/0x7e0
<4> [<ffffffff814c9ac2>] ? tcp_v4_rcv+0x512/0x920
<4> [<ffffffff814a6878>] ? ip_local_deliver+0x98/0xa0
<4> [<ffffffff814a65ed>] ? ip_local_deliver_finish+0xdd/0x2d0
<4> [<ffffffff814a6878>] ? ip_local_deliver+0x98/0xa0
<4> [<ffffffff814a5d3d>] ? ip_rcv_finish+0x12d/0x440
<4> [<ffffffff814a62c5>] ? ip_rcv+0x275/0x350
<4> [<ffffffff8146b628>] ? __netif_receive_skb+0x208/0x570
<4> [<ffffffff8146ef08>] ? netif_receive_skb+0x58/0x60
<4> [<ffffffff8146f010>] ? napi_skb_finish+0x50/0x70
<4> [<ffffffff81470e79>] ? napi_gro_receive+0x39/0x50
<4> [<ffffffffa019199b>] ? tg3_poll_work+0xd7b/0xfb0 [tg3]
<4> [<ffffffffa0191c1f>] ? tg3_poll_msix+0x4f/0x160 [tg3]
<4> [<ffffffff81470f93>] ? net_rx_action+0x103/0x2f0
<4> [<ffffffff81080361>] ? __do_softirq+0xc1/0x1e0
<4> [<ffffffff8100c38c>] ? call_softirq+0x1c/0x30
<4> <EOI>  [<ffffffff8100fc25>] ? do_softirq+0x65/0xa0
<4> [<ffffffff8107ff20>] ? ksoftirqd+0x80/0x110
<4> [<ffffffff8107fea0>] ? ksoftirqd+0x0/0x110
<4> [<ffffffff810a138e>] ? kthread+0x9e/0xc0
<4> [<ffffffff8100c28a>] ? child_rip+0xa/0x20
<4> [<ffffffff810a12f0>] ? kthread+0x0/0xc0
<4> [<ffffffff8100c280>] ? child_rip+0x0/0x20
<4>------------[ cut here ]------------
<2>kernel BUG at net/ipv4/tcp_output.c:983!
<4>invalid opcode: 0000 [#1] SMP
<4>last sysfs file: /sys/devices/system/cpu/online
<4>CPU 0
<4>Modules linked in: bonding ipv6 vfat fat iTCO_wdt iTCO_vendor_support sb_edac edac_core i2c_i801 i2c_core lpc_ich mfd_core tg3 ptp pps_core hpilo hpwdt ioatdma dca power_meter acpi_ipmi ipmi_si ipmi_msghandler sg ext4 jbd2 mbcache sd_mod crc_t10dif xhci_hcd hpsa wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
<4>
<4>Pid: 4, comm: ksoftirqd/0 Not tainted 2.6.32-573.22.1.el6.x86_64 #1 HP ProLiant DL380 Gen9
<4>RIP: 0010:[<ffffffff814c2ba2>]  [<ffffffff814c2ba2>] tcp_fragment+0x332/0x340
<4>RSP: 0018:ffff8800282037e8  EFLAGS: 00010293
<4>RAX: 0000000000000001 RBX: ffff88403e8c6d00 RCX: 000000000000059c
<4>RDX: 000000000000059c RSI: ffff88403e8c6d00 RDI: ffff884043482e00
<4>RBP: ffff880028203838 R08: 000000000000059c R09: 000000000000059c
<4>R10: ffff882f6cb72940 R11: ffff88403e8c6d38 R12: 0000000000000030
<4>R13: ffff884043482e00 R14: ffff880028203968 R15: 0000000000000000
<4>FS:  0000000000000000(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
<4>CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
<4>CR2: 00007f7b54eab200 CR3: 00000040514d1000 CR4: 00000000001407f0
<4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>Process ksoftirqd/0 (pid: 4, threadinfo ffff882053980000, task ffff88205397b520)
<4>Stack:
<4> ffff880130bce780 ffff88146d1b2cc0 ffff880028203858 0000059c6d1b2cc0
<4><d> ffff880028203858 0000000000000000 0000000000000002 ffff88403e8c6d38
<4><d> ffff880028203968 0000000000000000 ffff880028203858 ffffffff814b89a1
<4>Call Trace:
<4> <IRQ>
<4> [<ffffffff814b89a1>] tcp_match_skb_to_sack+0xb1/0xf0
<4> [<ffffffff814b8ac8>] tcp_sacktag_walk+0xe8/0x520
<4> [<ffffffff814abd05>] ? ip_local_out+0x25/0x30
<4> [<ffffffff814b9322>] tcp_sacktag_write_queue+0x422/0xb80
<4> [<ffffffff814beeb0>] tcp_ack+0x9c0/0x13f0
<4> [<ffffffff814bd0e1>] ? tcp_validate_incoming+0x61/0x2e0
<4> [<ffffffff814bfbe8>] tcp_rcv_established+0x308/0x7e0
<4> [<ffffffff814c81b3>] tcp_v4_do_rcv+0x2e3/0x490
<4> [<ffffffff814bfb72>] ? tcp_rcv_established+0x292/0x7e0
<4> [<ffffffff814c9ac2>] tcp_v4_rcv+0x512/0x920
<4> [<ffffffff814a6878>] ? ip_local_deliver+0x98/0xa0
<4> [<ffffffff814a65ed>] ip_local_deliver_finish+0xdd/0x2d0
<4> [<ffffffff814a6878>] ip_local_deliver+0x98/0xa0
<4> [<ffffffff814a5d3d>] ip_rcv_finish+0x12d/0x440
<4> [<ffffffff814a62c5>] ip_rcv+0x275/0x350
<4> [<ffffffff8146b628>] __netif_receive_skb+0x208/0x570
<4> [<ffffffff8146ef08>] netif_receive_skb+0x58/0x60
<4> [<ffffffff8146f010>] napi_skb_finish+0x50/0x70
<4> [<ffffffff81470e79>] napi_gro_receive+0x39/0x50
<4> [<ffffffffa019199b>] tg3_poll_work+0xd7b/0xfb0 [tg3]
<4> [<ffffffffa0191c1f>] tg3_poll_msix+0x4f/0x160 [tg3]
<4> [<ffffffff81470f93>] net_rx_action+0x103/0x2f0
<4> [<ffffffff81080361>] __do_softirq+0xc1/0x1e0
<4> [<ffffffff8100c38c>] call_softirq+0x1c/0x30
<4> <EOI>
<4> [<ffffffff8100fc25>] ? do_softirq+0x65/0xa0
<4> [<ffffffff8107ff20>] ksoftirqd+0x80/0x110
<4> [<ffffffff8107fea0>] ? ksoftirqd+0x0/0x110
<4> [<ffffffff810a138e>] kthread+0x9e/0xc0
<4> [<ffffffff8100c28a>] child_rip+0xa/0x20
<4> [<ffffffff810a12f0>] ? kthread+0x0/0xc0
<4> [<ffffffff8100c280>] ? child_rip+0x0/0x20
<4>Code: 8b 84 24 cc 00 00 00 e9 39 ff ff ff 0f 1f 44 00 00 89 c1 c1 e8 08 81 e1 ff 00 ff 00 25 ff 00 ff 00 c1 e1 08 01 c8 e9 6b fe ff ff <0f> 0b eb fe 0f 0b eb fe 66 0f 1f 44 00 00 55 48 89 e5 41 57 41
<1>RIP  [<ffffffff814c2ba2>] tcp_fragment+0x332/0x340
<4> RSP <ffff8800282037e8>
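
When comparing dumps from several crashed DataNodes, it can help to reduce each one to its key fields (the BUG location, the faulting function, and the kernel version). Below is a minimal Python sketch of that idea. The helper name `summarize_oops` and its regular expressions are illustrative assumptions written against the log format shown above; they are not part of any kernel or Cloudera tooling.

```python
import re

# Hypothetical helper: the patterns below are assumptions based on the
# 2.6.32-era oops format in this post, not a general-purpose parser.
BUG_RE = re.compile(r"kernel BUG at (?P<file>\S+):(?P<line>\d+)!")
RIP_RE = re.compile(r"RIP\s+\[<[0-9a-f]+>\]\s+(?P<func>\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")
PID_RE = re.compile(r"Pid: (?P<pid>\d+), comm: (?P<comm>\S+) .*?(?P<kernel>\d+\.\d+\.\d+\S*)")


def summarize_oops(text):
    """Return a dict with the BUG location, faulting function, and kernel."""
    summary = {}
    bug = BUG_RE.search(text)
    if bug:
        summary["bug_file"] = bug.group("file")
        summary["bug_line"] = int(bug.group("line"))
    rip = RIP_RE.search(text)
    if rip:
        summary["rip_function"] = rip.group("func")
    pid = PID_RE.search(text)
    if pid:
        summary["pid"] = int(pid.group("pid"))
        summary["comm"] = pid.group("comm")
        summary["kernel"] = pid.group("kernel")
    return summary


# Three lines copied from the dump above, used as sample input.
SAMPLE = """\
<2>kernel BUG at net/ipv4/tcp_output.c:983!
<4>Pid: 4, comm: ksoftirqd/0 Not tainted 2.6.32-573.22.1.el6.x86_64 #1 HP ProLiant DL380 Gen9
<1>RIP  [<ffffffff814c2ba2>] tcp_fragment+0x332/0x340
"""

if __name__ == "__main__":
    for key, value in sorted(summarize_oops(SAMPLE).items()):
        print(f"{key}: {value}")
```

Running the same function over the dump from each node makes it easy to check whether they all stop at the same BUG line in `tcp_fragment`.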

We use Cloudera "5.4.5-1.cdh5.4.5.p0.7" on CentOS 6.7 with kernel "2.6.32-573.22.1.el6.x86_64".

 

Any ideas on how to fix this?

Regards
