During routine upgrades we started seeing problems on some hosts. After running a yum update to bring them up to CentOS 6.4 and rebooting, they come back up with a slowly but steadily climbing CPU load and absurd CPU time figures for some processes. The load starts at a normal level and ramps up to several hundred over the course of a few minutes, and the CPU time for some processes jumps to hundreds of thousands or even millions of hours almost immediately after the reboot.
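For reference, a rough sketch of how the ramp-up can be watched from a shell (just the load average plus the processes with the largest cumulative CPU time; nothing here is specific to our setup):

# sample the load average and the top cumulative-CPU-time processes once a minute
while true; do
    date
    cat /proc/loadavg
    ps -eo pid,comm,time --sort=-time | head -n 6
    sleep 60
done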

The confusing part is that if we do a fresh install of CentOS 6.4 on the same host, it works exactly as expected, with none of these issues.

We compared the installed packages on a host installed fresh to CentOS 6.4 against the upgraded host that has the issues. The only package missing on the upgraded host was libitm, but installing it there and rebooting did not change anything. The only other differences are that the upgraded machine kept some older kernels installed and has a few extra Perl packages.
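For reference, the kind of comparison we are talking about is just a diff of sorted rpm -qa output from the two hosts; a minimal sketch, where freshhost is a placeholder for the freshly installed 6.4 machine:

# dump sorted package name lists from both hosts and diff them
# "freshhost" is a placeholder; brokenhost.keek.com is the upgraded host
ssh root@freshhost 'rpm -qa --qf "%{NAME}\n" | sort -u' > fresh-install.txt
ssh root@brokenhost.keek.com 'rpm -qa --qf "%{NAME}\n" | sort -u' > upgraded.txt
diff -u fresh-install.txt upgraded.txt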

Here is the output of top:
Tasks: 483 total, 5 running, 478 sleeping, 0 stopped, 0 zombie
Cpu(s): 11.5%us, 0.5%sy, 0.0%ni, 87.7%id, 0.2%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 32798944k total, 28187760k used, 4611184k free, 105968k buffers
Swap: 10256376k total, 0k used, 10256376k free, 1424572k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3481 jhickson 20 0 13392 1404 812 R 3.8 0.0 0:00.03 top
1 root 20 0 21444 1552 1240 S 0.0 0.0 8596343h init
2 root 20 0 0 0 0 S 0.0 0.0 1290835h kthreadd
3 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/0
4 root 20 0 0 0 0 S 0.0 0.0 10222396h ksoftirqd/0
5 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/0
6 root RT 0 0 0 0 S 0.0 0.0 0:00.00 watchdog/0
7 root RT 0 0 0 0 R 0.0 0.0 300194:20 migration/1
8 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/1
9 root 20 0 0 0 0 S 0.0 0.0 900583:01 ksoftirqd/1
10 root RT 0 0 0 0 R 0.0 0.0 0:00.00 watchdog/1
11 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/2
12 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/2
13 root 20 0 0 0 0 S 0.0 0.0 20012,57 ksoftirqd/2
14 root RT 0 0 0 0 S 0.0 0.0 300194:20 watchdog/2
15 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/3
16 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/3
17 root 20 0 0 0 0 S 0.0 0.0 300194:20 ksoftirqd/3
18 root RT 0 0 0 0 S 0.0 0.0 0:00.00 watchdog/3
19 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/4
20 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/4
21 root 20 0 0 0 0 S 0.0 0.0 300194:20 ksoftirqd/4
22 root RT 0 0 0 0 S 0.0 0.0 300194:20 watchdog/4
23 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/5
24 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/5
25 root 20 0 0 0 0 S 0.0 0.0 0:00.11 ksoftirqd/5
26 root RT 0 0 0 0 S 0.0 0.0 0:00.00 watchdog/5
27 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/6
28 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/6
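
One thing we can still check is whether the raw counters in /proc agree with the huge TIME+ values top is showing, to rule out a pure display problem; a sketch, using init (PID 1) as the example:

# fields 14 and 15 of /proc/<pid>/stat are utime and stime, in clock ticks
awk '{print "utime:", $14, "stime:", $15}' /proc/1/stat
# ticks per second (usually 100), needed to convert the counts to seconds
getconf CLK_TCK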

uname -a:

Linux brokenhost.keek.com 2.6.32-358.6.1.el6.x86_64 #1 SMP Tue Apr 23 19:29:00 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

On a physical console on the machine we have been seeing lines like:

INFO: task sh:3426 blocked for more than 120 seconds

Where sh could be various things, from a shell to kernel components; it seems to be more or less random.

Has anyone seen anything like this before? Is there any other information I can provide?
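
If the full hung-task traces would be useful, something along these lines should capture them (the SysRq dump needs root, and the traces end up in the kernel log):

# pull the stack traces that follow the "blocked for more than 120 seconds" lines
dmesg | grep -A 30 "blocked for more than 120 seconds"

# force a dump of all currently blocked (D state) tasks via magic SysRq
echo 1 > /proc/sys/kernel/sysrq
echo w > /proc/sysrq-trigger
dmesg | tail -n 150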