kernel panic on high load

Hi all,

We have a cluster of 1 NameNode and 6 DataNodes. Under high load, some of the DataNodes crash with the following kernel panic:

<6>warning: `jsvc' uses 32-bit capabilities (legacy support in use)
<3>BUG: scheduling while atomic: ksoftirqd/0/4/0x10000100
<4>Modules linked in: bonding ipv6 vfat fat iTCO_wdt iTCO_vendor_support sb_edac edac_core i2c_i801 i2c_core lpc_ich mfd_core tg3 ptp pps_core hpilo hpwdt ioatdma dca power_meter acpi_ipmi ipmi_si ipmi_msghandler sg ext4 jbd2 mbcache sd_mod crc_t10dif xhci_hcd hpsa wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
<4>Pid: 4, comm: ksoftirqd/0 Not tainted 2.6.32-573.22.1.el6.x86_64 #1
<4>Call Trace:
<4> <IRQ>  [<ffffffff81063386>] ? __schedule_bug+0x66/0x70
<4> [<ffffffff8153999c>] ? thread_return+0x6ac/0x7d0
<4> [<ffffffff810f0628>] ? handle_edge_irq+0x98/0x180
<4> [<ffffffff815434f5>] ? do_IRQ+0x75/0xf0
<4> [<ffffffff814c2ba2>] ? tcp_fragment+0x332/0x340
<4> [<ffffffff8100ba53>] ? ret_from_intr+0x0/0x11
<4> [<ffffffff814c2ba2>] ? tcp_fragment+0x332/0x340
<4> [<ffffffff8106f1da>] ? __cond_resched+0x2a/0x40
<4> [<ffffffff814c2ba2>] ? tcp_fragment+0x332/0x340
<4> [<ffffffff814c2ba2>] ? tcp_fragment+0x332/0x340
<4> [<ffffffff81539d90>] ? _cond_resched+0x30/0x40
<4> [<ffffffff8100fd16>] ? is_valid_bugaddr+0x16/0x40
<4> [<ffffffff812906ff>] ? report_bug+0x1f/0xc0
<4> [<ffffffff81010fcf>] ? die+0x7f/0x90
<4> [<ffffffff8153d274>] ? do_trap+0xc4/0x160
<4> [<ffffffff8100cd95>] ? do_invalid_op+0x95/0xb0
<4> [<ffffffff814c2ba2>] ? tcp_fragment+0x332/0x340
<4> [<ffffffffa0236831>] ? bond_dev_queue_xmit+0x31/0xa0 [bonding]
<4> [<ffffffffa02368ff>] ? bond_xmit_slave_id+0x5f/0xd0 [bonding]
<4> [<ffffffff8100c01b>] ? invalid_op+0x1b/0x20
<4> [<ffffffff814c2ba2>] ? tcp_fragment+0x332/0x340
<4> [<ffffffff814c2a74>] ? tcp_fragment+0x204/0x340
<4> [<ffffffff814b89a1>] ? tcp_match_skb_to_sack+0xb1/0xf0
<4> [<ffffffff814b8ac8>] ? tcp_sacktag_walk+0xe8/0x520
<4> [<ffffffff814abd05>] ? ip_local_out+0x25/0x30
<4> [<ffffffff814b9322>] ? tcp_sacktag_write_queue+0x422/0xb80
<4> [<ffffffff814beeb0>] ? tcp_ack+0x9c0/0x13f0
<4> [<ffffffff814bd0e1>] ? tcp_validate_incoming+0x61/0x2e0
<4> [<ffffffff814bfbe8>] ? tcp_rcv_established+0x308/0x7e0
<4> [<ffffffff814c81b3>] ? tcp_v4_do_rcv+0x2e3/0x490
<4> [<ffffffff814bfb72>] ? tcp_rcv_established+0x292/0x7e0
<4> [<ffffffff814c9ac2>] ? tcp_v4_rcv+0x512/0x920
<4> [<ffffffff814a6878>] ? ip_local_deliver+0x98/0xa0
<4> [<ffffffff814a65ed>] ? ip_local_deliver_finish+0xdd/0x2d0
<4> [<ffffffff814a6878>] ? ip_local_deliver+0x98/0xa0
<4> [<ffffffff814a5d3d>] ? ip_rcv_finish+0x12d/0x440
<4> [<ffffffff814a62c5>] ? ip_rcv+0x275/0x350
<4> [<ffffffff8146b628>] ? __netif_receive_skb+0x208/0x570
<4> [<ffffffff8146ef08>] ? netif_receive_skb+0x58/0x60
<4> [<ffffffff8146f010>] ? napi_skb_finish+0x50/0x70
<4> [<ffffffff81470e79>] ? napi_gro_receive+0x39/0x50
<4> [<ffffffffa019199b>] ? tg3_poll_work+0xd7b/0xfb0 [tg3]
<4> [<ffffffffa0191c1f>] ? tg3_poll_msix+0x4f/0x160 [tg3]
<4> [<ffffffff81470f93>] ? net_rx_action+0x103/0x2f0
<4> [<ffffffff81080361>] ? __do_softirq+0xc1/0x1e0
<4> [<ffffffff8100c38c>] ? call_softirq+0x1c/0x30
<4> <EOI>  [<ffffffff8100fc25>] ? do_softirq+0x65/0xa0
<4> [<ffffffff8107ff20>] ? ksoftirqd+0x80/0x110
<4> [<ffffffff8107fea0>] ? ksoftirqd+0x0/0x110
<4> [<ffffffff810a138e>] ? kthread+0x9e/0xc0
<4> [<ffffffff8100c28a>] ? child_rip+0xa/0x20
<4> [<ffffffff810a12f0>] ? kthread+0x0/0xc0
<4> [<ffffffff8100c280>] ? child_rip+0x0/0x20
<4>------------[ cut here ]------------
<2>kernel BUG at net/ipv4/tcp_output.c:983!
<4>invalid opcode: 0000 [#1] SMP
<4>last sysfs file: /sys/devices/system/cpu/online
<4>CPU 0
<4>Modules linked in: bonding ipv6 vfat fat iTCO_wdt iTCO_vendor_support sb_edac edac_core i2c_i801 i2c_core lpc_ich mfd_core tg3 ptp pps_core hpilo hpwdt ioatdma dca power_meter acpi_ipmi ipmi_si ipmi_msghandler sg ext4 jbd2 mbcache sd_mod crc_t10dif xhci_hcd hpsa wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
<4>
<4>Pid: 4, comm: ksoftirqd/0 Not tainted 2.6.32-573.22.1.el6.x86_64 #1 HP ProLiant DL380 Gen9
<4>RIP: 0010:[<ffffffff814c2ba2>]  [<ffffffff814c2ba2>] tcp_fragment+0x332/0x340
<4>RSP: 0018:ffff8800282037e8  EFLAGS: 00010293
<4>RAX: 0000000000000001 RBX: ffff88403e8c6d00 RCX: 000000000000059c
<4>RDX: 000000000000059c RSI: ffff88403e8c6d00 RDI: ffff884043482e00
<4>RBP: ffff880028203838 R08: 000000000000059c R09: 000000000000059c
<4>R10: ffff882f6cb72940 R11: ffff88403e8c6d38 R12: 0000000000000030
<4>R13: ffff884043482e00 R14: ffff880028203968 R15: 0000000000000000
<4>FS:  0000000000000000(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
<4>CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
<4>CR2: 00007f7b54eab200 CR3: 00000040514d1000 CR4: 00000000001407f0
<4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>Process ksoftirqd/0 (pid: 4, threadinfo ffff882053980000, task ffff88205397b520)
<4>Stack:
<4> ffff880130bce780 ffff88146d1b2cc0 ffff880028203858 0000059c6d1b2cc0
<4><d> ffff880028203858 0000000000000000 0000000000000002 ffff88403e8c6d38
<4><d> ffff880028203968 0000000000000000 ffff880028203858 ffffffff814b89a1
<4>Call Trace:
<4> <IRQ>
<4> [<ffffffff814b89a1>] tcp_match_skb_to_sack+0xb1/0xf0
<4> [<ffffffff814b8ac8>] tcp_sacktag_walk+0xe8/0x520
<4> [<ffffffff814abd05>] ? ip_local_out+0x25/0x30
<4> [<ffffffff814b9322>] tcp_sacktag_write_queue+0x422/0xb80
<4> [<ffffffff814beeb0>] tcp_ack+0x9c0/0x13f0
<4> [<ffffffff814bd0e1>] ? tcp_validate_incoming+0x61/0x2e0
<4> [<ffffffff814bfbe8>] tcp_rcv_established+0x308/0x7e0
<4> [<ffffffff814c81b3>] tcp_v4_do_rcv+0x2e3/0x490
<4> [<ffffffff814bfb72>] ? tcp_rcv_established+0x292/0x7e0
<4> [<ffffffff814c9ac2>] tcp_v4_rcv+0x512/0x920
<4> [<ffffffff814a6878>] ? ip_local_deliver+0x98/0xa0
<4> [<ffffffff814a65ed>] ip_local_deliver_finish+0xdd/0x2d0
<4> [<ffffffff814a6878>] ip_local_deliver+0x98/0xa0
<4> [<ffffffff814a5d3d>] ip_rcv_finish+0x12d/0x440
<4> [<ffffffff814a62c5>] ip_rcv+0x275/0x350
<4> [<ffffffff8146b628>] __netif_receive_skb+0x208/0x570
<4> [<ffffffff8146ef08>] netif_receive_skb+0x58/0x60
<4> [<ffffffff8146f010>] napi_skb_finish+0x50/0x70
<4> [<ffffffff81470e79>] napi_gro_receive+0x39/0x50
<4> [<ffffffffa019199b>] tg3_poll_work+0xd7b/0xfb0 [tg3]
<4> [<ffffffffa0191c1f>] tg3_poll_msix+0x4f/0x160 [tg3]
<4> [<ffffffff81470f93>] net_rx_action+0x103/0x2f0
<4> [<ffffffff81080361>] __do_softirq+0xc1/0x1e0
<4> [<ffffffff8100c38c>] call_softirq+0x1c/0x30
<4> <EOI>
<4> [<ffffffff8100fc25>] ? do_softirq+0x65/0xa0
<4> [<ffffffff8107ff20>] ksoftirqd+0x80/0x110
<4> [<ffffffff8107fea0>] ? ksoftirqd+0x0/0x110
<4> [<ffffffff810a138e>] kthread+0x9e/0xc0
<4> [<ffffffff8100c28a>] child_rip+0xa/0x20
<4> [<ffffffff810a12f0>] ? kthread+0x0/0xc0
<4> [<ffffffff8100c280>] ? child_rip+0x0/0x20
<4>Code: 8b 84 24 cc 00 00 00 e9 39 ff ff ff 0f 1f 44 00 00 89 c1 c1 e8 08 81 e1 ff 00 ff 00 25 ff 00 ff 00 c1 e1 08 01 c8 e9 6b fe ff ff <0f> 0b eb fe 0f 0b eb fe 66 0f 1f 44 00 00 55 48 89 e5 41 57 41
<1>RIP  [<ffffffff814c2ba2>] tcp_fragment+0x332/0x340
<4> RSP <ffff8800282037e8>
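
For anyone comparing this trace against known kernel issues, the identifying details (the BUG location, the faulting symbol, the task, and the kernel version) can be pulled out mechanically. Below is a minimal sketch in Python, assuming the oops text is available as a string. The function name `summarize_oops` and the regular expressions are illustrative assumptions tailored to the format of the log above; they are not part of any existing tool.

```python
import re

# Regexes for the lines that identify an oops in the format shown above.
# They are assumptions based on this log only, not a general oops parser.
BUG_RE = re.compile(r"kernel BUG at (?P<bug_file>\S+):(?P<bug_line>\d+)!")
RIP_RE = re.compile(
    r"RIP\s+\[<[0-9a-f]+>\]\s+(?P<symbol>\w+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)
PID_RE = re.compile(
    r"Pid: (?P<pid>\d+), comm: (?P<comm>\S+) .*?(?P<kernel>\d+\.\d+\.\d+-\S+) #"
)


def summarize_oops(text):
    """Collect the BUG location, faulting symbol, task, and kernel version.

    The first occurrence of each field wins, so repeated "Pid:" lines
    do not overwrite earlier values.
    """
    summary = {}
    for line in text.splitlines():
        for regex in (BUG_RE, RIP_RE, PID_RE):
            match = regex.search(line)
            if match:
                for key, value in match.groupdict().items():
                    summary.setdefault(key, value)
    return summary


# Three lines copied from the trace above, used as sample input.
SAMPLE = """\
<2>kernel BUG at net/ipv4/tcp_output.c:983!
<4>Pid: 4, comm: ksoftirqd/0 Not tainted 2.6.32-573.22.1.el6.x86_64 #1 HP ProLiant DL380 Gen9
<1>RIP  [<ffffffff814c2ba2>] tcp_fragment+0x332/0x340
"""

if __name__ == "__main__":
    for key, value in sorted(summarize_oops(SAMPLE).items()):
        print(f"{key}: {value}")
```

The extracted source file, line number, symbol, and kernel version are the details that are usually worth quoting when searching a vendor knowledge base or bug tracker for a matching report.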

We use Cloudera "5.4.5-1.cdh5.4.5.p0.7" on CentOS 6.7 with kernel "2.6.32-573.22.1.el6.x86_64".

Any ideas on how to fix this?

Regards