Hi,<br />
<br />
We run VMs with CentOS 6.4. Two to three minutes after Apache Tomcat starts, the CPUs become 100% consumed by system (kernel) time for no apparent reason.<br />
<br />
It may be related to a kernel deadlock, timers, or memory management, but we can't find the real cause:<br />
<br />
top - 12:16:00 up 1:53, 7 users, load average: 0.74, 3.62, 3.92<br />
Tasks: 157 total, 1 running, 154 sleeping, 1 stopped, 1 zombie<br />
Cpu0 : 0.7%us, 86.4%sy, 0.0%ni, 1.4%id, 11.4%wa, 0.0%hi, 0.0%si, 0.0%st<br />
Cpu1 : 0.7%us, 96.3%sy, 0.0%ni, 1.0%id, 2.0%wa, 0.0%hi, 0.0%si, 0.0%st<br />
Cpu2 : 0.3%us, 86.2%sy, 0.0%ni, 1.3%id, 12.1%wa, 0.0%hi, 0.0%si, 0.0%st<br />
Cpu3 : 1.1%us, 44.6%sy, 0.0%ni, 1.8%id, 52.5%wa, 0.0%hi, 0.0%si, 0.0%st<br />
Mem: 3916788k total, 750516k used, 3166272k free, 18940k buffers<br />
Swap: 6160376k total, 224k used, 6160152k free, 84180k cached<br />
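<br />
A minimal sketch for quantifying the output above, assuming the per-CPU lines from top have been saved as text. The function names and the 80% threshold are illustrative assumptions, not part of any existing tool:<br />
<br />
```python
import re

# Matches one per-CPU line as printed by top, e.g.
# "Cpu0 : 0.7%us, 86.4%sy, 0.0%ni, 1.4%id, 11.4%wa, ..."
CPU_LINE = re.compile(r"^Cpu(\d+)\s*:\s*(.*)$")
FIELD = re.compile(r"([\d.]+)%(\w+)")


def parse_top_cpu_lines(text):
    """Return {cpu_index: {field: percent}} for each 'CpuN :' line in text."""
    result = {}
    for line in text.splitlines():
        match = CPU_LINE.match(line.strip())
        if match:
            fields = {name: float(value)
                      for value, name in FIELD.findall(match.group(2))}
            result[int(match.group(1))] = fields
    return result


def kernel_bound_cpus(stats, threshold=80.0):
    """List CPUs whose system (kernel) share is at or above threshold percent.

    The 80% default is an arbitrary illustrative cut-off.
    """
    return sorted(cpu for cpu, fields in stats.items()
                  if fields.get("sy", 0.0) >= threshold)


# The four Cpu lines from the top output in this report.
SAMPLE = """\
Cpu0 : 0.7%us, 86.4%sy, 0.0%ni, 1.4%id, 11.4%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 0.7%us, 96.3%sy, 0.0%ni, 1.0%id, 2.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 : 0.3%us, 86.2%sy, 0.0%ni, 1.3%id, 12.1%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3 : 1.1%us, 44.6%sy, 0.0%ni, 1.8%id, 52.5%wa, 0.0%hi, 0.0%si, 0.0%st
"""

if __name__ == "__main__":
    stats = parse_top_cpu_lines(SAMPLE)
    print(kernel_bound_cpus(stats))  # prints [0, 1, 2]
```
<br />
On this sample, three of the four CPUs are above the threshold in %sy, while Cpu3 is split between %sy and %wa (I/O wait).<br />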
<br />
We tried all the tricks and fixes we could find on Google, most of which are unrelated to this CentOS version: leap second, divider=10, hugepages defrag, disabling Hyper-Threading, ...<br />
None of them helped.<br />
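<br />
For the hugepages defrag item, a rough sketch of how the active setting could be double-checked from sysfs. It assumes the CentOS/RHEL 6 path with the "redhat_" prefix and falls back to the mainline path; the helper names are hypothetical, and the strings in the comments only illustrate the bracket convention the kernel uses to mark the active choice:<br />
<br />
```python
from pathlib import Path

# Assumption: RHEL/CentOS 6 kernels expose THP under the "redhat_" prefixed
# directory; mainline kernels use the second path.
CANDIDATE_PATHS = (
    "/sys/kernel/mm/redhat_transparent_hugepage/defrag",
    "/sys/kernel/mm/transparent_hugepage/defrag",
)


def active_option(sysfs_text):
    """Return the bracketed (active) word, e.g. 'never' from 'always [never]'."""
    for word in sysfs_text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None


def read_thp_defrag():
    """Return (path, active_option) for the first existing candidate path."""
    for path in CANDIDATE_PATHS:
        candidate = Path(path)
        if candidate.is_file():
            return path, active_option(candidate.read_text())
    return None, None


if __name__ == "__main__":
    print(read_thp_defrag())
```
<br />
If the active option is still not "never" after the workaround was applied, the setting did not take effect (for example, it was not reapplied after a reboot).<br />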
<br />
Please let us know whether this is a suspected OS bug, and how we can diagnose and work around it.<br />
<br />
It could also be related to the ESX 5.1 we use, but we have large clusters, and the problem doesn't seem to be tied to specific hardware: it happens on many physical systems.<br />
<br />
We installed kernel-debug on one of the servers and saw the following output in /var/log/messages, but I'm not sure it's related to this problem:<br />
<br />
Jan 28 06:12:16 labliwb8024 kernel: <br />
Jan 28 06:12:16 labliwb8024 kernel: =================================<br />
Jan 28 06:12:16 labliwb8024 kernel: [ INFO: inconsistent lock state ]<br />
Jan 28 06:12:16 labliwb8024 kernel: 2.6.32-431.3.1.el6.x86_64.debug #1<br />
Jan 28 06:12:16 labliwb8024 kernel: ---------------------------------<br />
Jan 28 06:12:16 labliwb8024 kernel: inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.<br />
Jan 28 06:12:16 labliwb8024 kernel: swapper/0 [HC0[0]:SC1[3]:HE1:SE0] takes:<br />
Jan 28 06:12:16 labliwb8024 kernel: (lock#2){+.?...}, at: [<ffffffffa0240c0e>] VMCI_GrabLock_BH+0xe/0x10 [vmci]<br />
Jan 28 06:12:16 labliwb8024 kernel: {SOFTIRQ-ON-W} state was registered at:<br />
Jan 28 06:12:16 labliwb8024 kernel: [<ffffffff810bbd38>] __lock_acquire+0x638/0x1560<br />
Jan 28 06:12:16 labliwb8024 kernel: [<ffffffff810bcd04>] lock_acquire+0xa4/0x120<br />
Jan 28 06:12:16 labliwb8024 kernel: [<ffffffff8155e6e6>] _spin_lock+0x36/0x70<br />
Jan 28 06:12:16 labliwb8024 kernel: [<ffffffffa0240c2e>] VMCI_GrabLock+0xe/0x10 [vmci]<br />
Jan 28 06:12:16 labliwb8024 kernel: [<ffffffffa0242df2>] VMCIContext_InitContext+0x162/0x310 [vmci]<br />
Jan 28 06:12:16 labliwb8024 kernel: [<ffffffffa0244db0>] VMCI_HostInit+0x30/0xa0 [vmci]<br />
Jan 28 06:12:16 labliwb8024 kernel: [<ffffffffa023ebf8>] vmci_host_init+0x18/0x130 [vmci]<br />
Jan 28 06:12:16 labliwb8024 kernel: [<ffffffffa001714e>] 0xffffffffa001714e<br />
Jan 28 06:12:16 labliwb8024 kernel: [<ffffffff8100204c>] do_one_initcall+0x3c/0x1d0<br />
Jan 28 06:12:16 labliwb8024 kernel: [<ffffffff810ce1f3>] sys_init_module+0xe3/0x260<br />
Jan 28 06:12:16 labliwb8024 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b<br />
Jan 28 06:12:16 labliwb8024 kernel: irq event stamp: 11882598<br />
Jan 28 06:12:16 labliwb8024 kernel: hardirqs last enabled at (11882598): [<ffffffff8107f76f>] tasklet_action+0x4f/0x140<br />
Jan 28 06:12:16 labliwb8024 kernel: hardirqs last disabled at (11882597): [<ffffffff8107f743>] tasklet_action+0x23/0x140<br />
Jan 28 06:12:16 labliwb8024 kernel: softirqs last enabled at (11882594): [<ffffffff8107f233>] _local_bh_enable+0x13/0x20<br />
Jan 28 06:12:16 labliwb8024 kernel: softirqs last disabled at (11882595): [<ffffffff8100c40c>] call_softirq+0x1c/0x30<br />
Jan 28 06:12:16 labliwb8024 kernel: <br />
Jan 28 06:12:16 labliwb8024 kernel: other info that might help us debug this:<br />
Jan 28 06:12:16 labliwb8024 kernel: no locks held by swapper/0.<br />
Jan 28 06:12:16 labliwb8024 kernel: <br />
Jan 28 06:12:16 labliwb8024 kernel: stack backtrace:<br />
Jan 28 06:12:16 labliwb8024 kernel: Pid: 0, comm: swapper Not tainted 2.6.32-431.3.1.el6.x86_64.debug #1<br />
Jan 28 06:12:16 labliwb8024 kernel: Call Trace:<br />
Jan 28 06:12:16 labliwb8024 kernel: <IRQ> [<ffffffff810b9b07>] ? print_usage_bug+0x177/0x180<br />
Jan 28 06:12:16 labliwb8024 kernel: [<ffffffff810baaad>] ? mark_lock+0x35d/0x430<br />
Jan 28 06:12:16 labliwb8024 kernel: [<ffffffff810bbcd8>] ? __lock_acquire+0x5d8/0x1560<br />
Jan 28 06:12:16 labliwb8024 kernel: [<ffffffff810a8e3f>] ? cpu_clock+0x6f/0x80<br />
Jan 28 06:12:16 labliwb8024 kernel: [<ffffffff810ba5fd>] ? lock_release_holdtime+0x3d/0x190<br />
Jan 28 06:12:16 labliwb8024 kernel: [<ffffffff810b74dd>] ? trace_hardirqs_off+0xd/0x10<br />
Jan 28 06:12:16 labliwb8024 kernel: [<ffffffff8155e4d7>] ? _spin_unlock_irqrestore+0x67/0x80<br />
Jan 28 06:12:16 labliwb8024 kernel: [<ffffffff812b9bbf>] ? debug_check_no_obj_freed+0x18f/0x210<br />
Jan 28 06:12:16 labliwb8024 kernel: [<ffffffff810bcd04>] ? lock_acquire+0xa4/0x120<br />
Jan 28 06:12:16 labliwb8024 kernel: [<ffffffffa0240c0e>] ? VMCI_GrabLock_BH+0xe/0x10 [vmci]<br />
Jan 28 06:12:16 labliwb8024 kernel: [<ffffffff810b74dd>] ? trace_hardirqs_off+0xd/0x10<br />
Jan 28 06:12:16 labliwb8024 kernel: [<ffffffff810156d3>] ? native_sched_clock+0x13/0x80<br />
Jan 28 06:12:16 labliwb8024 kernel: [<ffffffff8155e75b>] ? _spin_lock_bh+0x3b/0x70<br />
Jan 28 06:12:16 labliwb8024 kernel: [<ffffffffa0240c0e>] ? VMCI_GrabLock_BH+0xe/0x10 [vmci]<br />
Jan 28 06:12:16 labliwb8024 kernel: [<ffffffffa0240c0e>] ? VMCI_GrabLock_BH+0xe/0x10 [vmci]<br />
Jan 28 06:12:16 labliwb8024 kernel: [<ffffffffa02452d4>] ? VMCIEvent_Dispatch+0x94/0x280 [vmci]<br />
Jan 28 06:12:16 labliwb8024 kernel: [<ffffffff810b74dd>] ? trace_hardirqs_off+0xd/0x10<br />
Jan 28 06:12:16 labliwb8024 kernel: [<ffffffffa0244b88>] ? VMCI_ReadDatagramsFromPort+0x1b8/0x1d0 [vmci]<br />
Jan 28 06:12:16 labliwb8024 kernel: [<ffffffffa023e142>] ? dispatch_datagrams+0x32/0x60 [vmci]<br />
Jan 28 06:12:16 labliwb8024 kernel: [<ffffffff8107f823>] ? tasklet_action+0x103/0x140<br />
Jan 28 06:12:16 labliwb8024 kernel: [<ffffffff8107f0ef>] ? __do_softirq+0xdf/0x210<br />
Jan 28 06:12:16 labliwb8024 kernel: [<ffffffff8100c40c>] ? call_softirq+0x1c/0x30<br />
Jan 28 06:12:16 labliwb8024 kernel: [<ffffffff8100fc3d>] ? do_softirq+0xad/0xe0<br />
Jan 28 06:12:16 labliwb8024 kernel: [<ffffffff8107ee25>] ? irq_exit+0x95/0xa0<br />
Jan 28 06:12:16 labliwb8024 kernel: [<ffffffff81565c65>] ? do_IRQ+0x75/0xf0<br />
Jan 28 06:12:16 labliwb8024 kernel: [<ffffffff8100ba53>] ? ret_from_intr+0x0/0x16<br />
Jan 28 06:12:16 labliwb8024 kernel: <EOI> [<ffffffff8103fbab>] ? native_safe_halt+0xb/0x10<br />
Jan 28 06:12:16 labliwb8024 kernel: [<ffffffff810baf4d>] ? trace_hardirqs_on+0xd/0x10<br />
Jan 28 06:12:16 labliwb8024 kernel: [<ffffffff81016b22>] ? default_idle+0x52/0xc0<br />
Jan 28 06:12:16 labliwb8024 kernel: [<ffffffff81009fcb>] ? cpu_idle+0xbb/0x110<br />
Jan 28 06:12:16 labliwb8024 kernel: [<ffffffff8153f40a>] ? rest_init+0x7a/0x80<br />
Jan 28 06:12:16 labliwb8024 kernel: [<ffffffff81bfdfc0>] ? start_kernel+0x456/0x462<br />
Jan 28 06:12:16 labliwb8024 kernel: [<ffffffff81bfd33a>] ? x86_64_start_reservations+0x125/0x129<br />
Jan 28 06:12:16 labliwb8024 kernel: [<ffffffff81bfd453>] ? x86_64_start_kernel+0x115/0x124<br />
Jan 28 06:12:16 labliwb8024 kernel: [0]: VMCI: Updating context from (ID=0xffffffff) to (ID=0xe3b6b8e0) on event (type=0).
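<br />
<br />
To see which module the lockdep report above points at, a small sketch that counts stack frames per module in a saved copy of the log. The regular expression and function name are assumptions made for illustration; frames without a trailing "[module]" tag are counted as "kernel", and lines that do not look like "symbol+0xoff/0xsize" frames are skipped:<br />
<br />
```python
import re
from collections import Counter

# A frame as printed in the kernel log:
#   [<address>] [? ]symbol+0xoffset/0xsize [module]
FRAME = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?:\?\s+)?(?P<symbol>[\w.]+)"
    r"\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def frames_by_module(log_text):
    """Count matching stack-frame lines per module name."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FRAME.search(line)
        if match:
            counts[match.group("module") or "kernel"] += 1
    return counts


# Four frames copied from the call trace in this report.
SAMPLE = """\
kernel: [<ffffffff8155e75b>] ? _spin_lock_bh+0x3b/0x70
kernel: [<ffffffffa0240c0e>] ? VMCI_GrabLock_BH+0xe/0x10 [vmci]
kernel: [<ffffffffa02452d4>] ? VMCIEvent_Dispatch+0x94/0x280 [vmci]
kernel: [<ffffffff8107f823>] ? tasklet_action+0x103/0x140
"""

if __name__ == "__main__":
    for module, count in frames_by_module(SAMPLE).most_common():
        print(module, count)
```
<br />
In the full trace, every module-tagged frame belongs to vmci: the lock is first taken through VMCI_GrabLock during module init and later through VMCI_GrabLock_BH from a tasklet, which is the {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} inconsistency lockdep reports. Whether that warning explains the CPU usage is still an open question.<br />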