We have a little over a dozen SuperMicro servers that hang intermittently, apparently because of a storage problem. Each system runs multiple SSDs in a RAID10 array on a controller that uses the aacraid driver. We have configured kdump, NFS, and sysctl to capture dmesg and a core dump whenever one of these hangs occurs (full dmesg attached). The relevant lines from dmesg appear to be these:<br />
<br />
...<br />
<3>aacraid: Host adapter abort request (4,0,0,0)<br />
<3>aacraid: Host adapter abort request (4,0,0,0)<br />
<3>aacraid: Host adapter reset request. SCSI hang ?<br />
<3>aacraid: SCSI bus appears hung<br />
<4>IRQ 30/aacraid: IRQF_DISABLED is not guaranteed on shared IRQs<br />
<3>INFO: task jbd2/dm-0-8:666 blocked for more than 120 seconds.<br />
<3>"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.<br />
<6>jbd2/dm-0-8 D 0000000000000001 0 666 2 0x00000000<br />
...<br />
<br />
# uname -srvm<br />
Linux 2.6.32-358.11.1.el6.x86_64 #1 SMP Wed Jun 12 03:34:52 UTC 2013 x86_64<br />
<br />
On average we see these hangs and the corresponding panics about once every 5 days.<br />
<br />
Switching the I/O scheduler to noop has not helped.
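<br />
<br />
For anyone who wants to confirm which scheduler is actually in effect on a given box, here is a minimal sketch. It assumes the usual sysfs layout (`/sys/block/<device>/queue/scheduler`, with the active scheduler shown in square brackets). The helper name `active_scheduler` is made up for illustration and is not part of any tool.

```shell
#!/bin/sh
# Hypothetical helper: print the active I/O scheduler from a sysfs scheduler file.
# The kernel marks the active entry with square brackets, e.g. "noop deadline [cfq]".
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p' "$1"
}

# Report the active scheduler for every block device that exposes one.
for f in /sys/block/*/queue/scheduler; do
    [ -r "$f" ] || continue          # skip if the glob matched nothing
    dev=${f#/sys/block/}             # strip the leading path
    dev=${dev%%/*}                   # keep only the device name
    printf '%s: %s\n' "$dev" "$(active_scheduler "$f")"
done
```

A runtime switch is normally done as root with `echo noop > /sys/block/sdX/queue/scheduler`, where `sdX` is a placeholder for the device in question. A change made this way does not survive a reboot.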