We have three systems that are crashing with:<br />
<br />
Assertion failure in journal_next_log_block() at fs/jbd/journal.c:576:<br />
"journal->j_free > 1" <br />
<br />
Kernel BUG at journal:576<br />
invalid operand: 0000 [1] SMP<br />
CPU 1<br />
Modules linked in: <br />
md5 ipv6 parport_pc lp parport w83627hf eeprom adm1026 hwmon_vid hwmon<br />
i2c_sensor i2c_isa i2c_amd756 i2c_amd8111 i2c_dev i2c_core nfs lockd<br />
nfs_acl sunrpc ipt_REJECT ipt_state ip_conntrack iptable_filter<br />
ip_tables button battery ac ohci_hcd hw_random tg3 floppy dm_snapshot<br />
dm_zero dm_mirror ext3 jbd dm_mod 3w_9xxx sata_mv libata sd_mod<br />
scsi_mod<br />
Pid: 1603, comm: kjournald Not tainted 2.6.9-42.0.3.ELsmp<br />
RIP: 0010:[<ffffffffa006c18a>]<br />
<ffffffffa006c18a>{:jbd:journal_next_log_block+76}<br />
RSP: 0018:0000010476327b88 EFLAGS: 00010212<br />
RAX: 0000000000000060 RBX: 0000010283163e00 RCX: ffffffff803e1fe8<br />
RDX: ffffffff803e1fe8 RSI: 0000000000000246 RDI: ffffffff803e1fe0<br />
RBP: 0000000000000040 R08: ffffffff803e1fe8 R09: 0000010283163e00<br />
R10: 0000000100000000 R11: ffffffff8011e884 R12: 0000010283163e24<br />
R13: 0000010476327be0 R14: 0000010283163e00 R15: 000000000000002e<br />
FS: 0000002a95560b00(0000) GS:ffffffff804e5200(0000)<br />
knlGS:00000000f7ff36c0<br />
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b<br />
CR2: 0000002a9556c000 CR3: 0000000037e42000 CR4: 00000000000006e0<br />
Process kjournald (pid: 1603, threadinfo 0000010476326000, task<br />
0000010478d777f0)<br />
Stack: 0000010453f4afa8 0000010310072240 0000000000000040<br />
0000010147528be0<br />
000001044240a880 ffffffffa0067dfe 00000e7c00000000<br />
00000101c33f2184<br />
0000000000000000 0000010310b12f50<br />
Call Trace:<ffffffffa0067dfe>{:jbd:journal_commit_transaction+1834}<br />
<ffffffff80135756>{autoremove_wake_function+0}<br />
<ffffffff80135756>{autoremove_wake_function+0}<br />
<ffffffffa006a914>{:jbd:kjournald+250}<br />
<ffffffff80135756>{autoremove_wake_function+0}<br />
<ffffffff80135756>{autoremove_wake_function+0}<br />
<ffffffffa006a814>{:jbd:commit_timeout+0}<br />
<ffffffff80110f47>{child_rip+8}<br />
<ffffffffa006a81a>{:jbd:kjournald+0}<br />
<ffffffff80110f3f>{child_rip+0}<br />
<br />
Code: 0f 0b bd e2 06 a0 ff ff ff ff 40 02 48 8b ab 18 01 00 00 48<br />
RIP <ffffffffa006c18a>{:jbd:journal_next_log_block+76} RSP<br />
<0000010476327b88><br />
<0>Kernel panic - not syncing: Oops<br />
<br />
(Note: I edited together some lines in the "Modules linked in"<br />
section. The rest is copied from the serial console (size 80x24) on the<br />
system.)<br />
<br />
We are running the CentOS 4.4 kernel. uname -a shows:<br />
<br />
Linux cook05 2.6.9-42.0.3.ELsmp #1 SMP Fri Oct 6 06:28:26 CDT 2006<br />
x86_64 x86_64 x86_64 GNU/Linux <br />
<br />
The disk subsystem on the system that produced this crash is four SATA<br />
disks on a 3ware 9550 (see the attached dmesg output for more info),<br />
with a mix of Western Digital and Seagate drives. It has also crashed<br />
with sysrq enabled, and (not surprisingly) the system is totally dead.<br />
We have to power-cycle it to reboot it.<br />
<br />
Other systems experiencing the same crash have:<br />
<br />
* the non-SMP version of the same kernel with the software md RAID<br />
drivers<br />
* the same kernel with a MegaRAID RAID card<br />
<br />
The same crash has also been seen with an earlier kernel version<br />
2.6.9-42.ELsmp.<br />
<br />
It seems to crash when we expect the system to be under high I/O load,<br />
but we don't have any hard evidence (disk throughput or transaction<br />
counts) to support that.<br />
<br />
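One way we could gather that evidence is to sample /proc/diskstats<br />
at a fixed interval and log the per-device deltas, so that the last few<br />
samples before a crash survive on the serial console or a remote log.<br />
Below is a minimal sketch, not something we are running yet. It assumes<br />
a Python 3 interpreter (not part of a stock CentOS 4 install), and the<br />
helper names parse_diskstats and io_delta are made up for illustration.<br />

```python
def parse_diskstats(text):
    """Return {device: (sectors_read, sectors_written)} for whole-disk lines."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        # Whole-device lines carry major, minor, name and at least 11 counters.
        # On 2.6.9, partition lines carry only 4 counters, so they are skipped.
        if len(fields) < 14:
            continue
        name = fields[2]
        # Counter 3 is sectors read, counter 7 is sectors written.
        stats[name] = (int(fields[5]), int(fields[9]))
    return stats


def io_delta(before, after):
    """Sectors read and written per device between two samples."""
    return {
        dev: (after[dev][0] - before[dev][0], after[dev][1] - before[dev][1])
        for dev in after
        if dev in before
    }
```

Calling parse_diskstats on the contents of /proc/diskstats every few<br />
seconds and printing io_delta against the previous sample would show<br />
whether sector counts spike shortly before the assertion fires.<br />
<br />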
We can try setting up a remote kernel dump if that would be<br />
useful/would work.<br />
<br />
We get a crash every couple of days on average (sometimes two crashes<br />
30 minutes to 2 hours apart), so we can try applying patches or new<br />
kernels if needed and see how the system does.<br />
<br />
I have attached selected lines from dmesg to give some additional info<br />
about the hardware and config of the system. I have a copy of<br />
/proc/kallsyms from the system that I can attach if you wish.<br />
In both cases, the files are from a post-crash boot that should be<br />
identical to the pre-crash boot.<br />
<br />
If you require more/different information just let me know and I will<br />
try to obtain it.<br />
<br />
Thank you for your help.<br />
<br />
--<br />
-- rouilj<br />
<br />
John Rouillard<br />
System Administrator<br />
Renesys Corporation<br />
603-643-9300 x 111