I am seeing a soft lockup at random on the above-mentioned machine. I have two of them and the problem reproduces on both. I installed CentOS 6.4 and upgraded to 6.5, but am running the OpenStack kernel from the RDO repositories, version 2.6.32-358.123.2.openstack.el6.x86_64. Each machine has two Xeon CPUs and 32GB of memory, with a single SSD as its disk. Both machines are compute nodes in an OpenStack deployment. The fault seems to occur when a large VM is started on them, that is, under load. Once it starts, the machines respond only to ping; even SSH access is not possible. After a machine is rebooted, the following soft lockup messages are seen, repeated many times:<br />
<br />
Jan 17 09:36:10 compute2 kernel: BUG: soft lockup - CPU#6 stuck for 67s! [sudo:10408]<br />
Jan 17 09:36:10 compute2 kernel: Modules linked in: xt_mac xt_physdev veth ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 ipt_REJECT xt_CHECKSUM iptable_mangle iptable_filter ip_tables openvswitch vxlan ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 bridge stp llc vhost_net macvtap macvlan tun kvm_intel kvm bnx2 hpilo hpwdt serio_raw sg iTCO_wdt iTCO_vendor_support i5000_edac edac_core i5k_amb shpchp ext4 jbd2 mbcache sr_mod cdrom pata_acpi ata_generic ata_piix hpsa cciss radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]<br />
Jan 17 09:36:10 compute2 kernel: CPU 6<br />
Jan 17 09:36:10 compute2 kernel: Modules linked in: xt_mac xt_physdev veth ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 ipt_REJECT xt_CHECKSUM iptable_mangle iptable_filter ip_tables openvswitch vxlan ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 bridge stp llc vhost_net macvtap macvlan tun kvm_intel kvm bnx2 hpilo hpwdt serio_raw sg iTCO_wdt iTCO_vendor_support i5000_edac edac_core i5k_amb shpchp ext4 jbd2 mbcache sr_mod cdrom pata_acpi ata_generic ata_piix hpsa cciss radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]<br />
Jan 17 09:36:10 compute2 kernel:<br />
Jan 17 09:36:10 compute2 kernel:<br />
Jan 17 09:36:10 compute2 kernel: Pid: 10408, comm: sudo Not tainted 2.6.32-358.123.2.openstack.el6.x86_64 #1 HP ProLiant DL360 G5<br />
Jan 17 09:36:10 compute2 kernel: RIP: 0010:[<ffffffff810d475a>] [<ffffffff810d475a>] audit_log_start+0xea/0x430<br />
Jan 17 09:36:10 compute2 kernel: RSP: 0018:ffff8808213b1998 EFLAGS: 00000206<br />
Jan 17 09:36:10 compute2 kernel: RAX: 0000000109e65d57 RBX: ffff8808213b1a48 RCX: 0000000000000141<br />
Jan 17 09:36:10 compute2 kernel: RDX: 000000000000ea60 RSI: 000000000000ea60 RDI: 0000000000000140<br />
Jan 17 09:36:10 compute2 kernel: RBP: ffffffff8100bb8e R08: 000000000000ea60 R09: 00000000ffffffff<br />
Jan 17 09:36:10 compute2 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: dead000000200200<br />
Jan 17 09:36:10 compute2 kernel: R13: 0000000000000000 R14: 0000000000000286 R15: ffff8808213b18f8<br />
Jan 17 09:36:10 compute2 kernel: FS: 00007f350feb07a0(0000) GS:ffff880028380000(0000) knlGS:0000000000000000<br />
Jan 17 09:36:10 compute2 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b<br />
Jan 17 09:36:10 compute2 kernel: CR2: 0000000003453b10 CR3: 000000082146b000 CR4: 00000000000027e0<br />
Jan 17 09:36:10 compute2 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000<br />
Jan 17 09:36:10 compute2 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400<br />
Jan 17 09:36:10 compute2 kernel: Process sudo (pid: 10408, threadinfo ffff8808213b0000, task ffff8807ab190040)<br />
Jan 17 09:36:10 compute2 kernel: Stack:<br />
Jan 17 09:36:10 compute2 kernel: ffff8808213b19b8 000000000000ea60 000000d00000044f 0000000000000000<br />
Jan 17 09:36:10 compute2 kernel: <d> ffff8808213b19d8 ffff88081b626d40 0000000000000000 ffff8807ab190040<br />
Jan 17 09:36:10 compute2 kernel: <d> ffffffff81063990 dead000000100100 dead000000200200 ffff88000003ad80<br />
Jan 17 09:36:10 compute2 kernel: Call Trace:<br />
Jan 17 09:36:10 compute2 kernel: [<ffffffff81063990>] ? default_wake_function+0x0/0x20<br />
Jan 17 09:36:10 compute2 kernel: [<ffffffff8112a831>] ? get_page_from_freelist+0x3d1/0x830<br />
Jan 17 09:36:10 compute2 kernel: [<ffffffff8114497a>] ? handle_mm_fault+0x23a/0x310<br />
Jan 17 09:36:10 compute2 kernel: [<ffffffff810d4caa>] ? audit_log_common_recv_msg+0x6a/0xf0<br />
Jan 17 09:36:10 compute2 kernel: [<ffffffff8104759c>] ? __do_page_fault+0x1ec/0x480<br />
Jan 17 09:36:10 compute2 kernel: [<ffffffff810d50f8>] ? audit_receive+0x3c8/0xd90<br />
Jan 17 09:36:10 compute2 kernel: [<ffffffff81055ad3>] ? __wake_up+0x53/0x70<br />
Jan 17 09:36:10 compute2 kernel: [<ffffffff814748db>] ? netlink_unicast+0x2db/0x320<br />
Jan 17 09:36:10 compute2 kernel: [<ffffffff81475350>] ? netlink_sendmsg+0x2c0/0x3d0<br />
Jan 17 09:36:10 compute2 kernel: [<ffffffff81436b33>] ? sock_sendmsg+0x123/0x150<br />
Jan 17 09:36:10 compute2 kernel: [<ffffffff81096da0>] ? autoremove_wake_function+0x0/0x40<br />
Jan 17 09:36:10 compute2 kernel: [<ffffffff8112c093>] ? __alloc_pages_nodemask+0x113/0x8d0<br />
Jan 17 09:36:10 compute2 kernel: [<ffffffff811a2590>] ? mntput_no_expire+0x30/0x110<br />
Jan 17 09:36:10 compute2 kernel: [<ffffffff81436e49>] ? sys_sendto+0x139/0x190<br />
Jan 17 09:36:10 compute2 kernel: [<ffffffff811a2590>] ? mntput_no_expire+0x30/0x110<br />
Jan 17 09:36:10 compute2 kernel: [<ffffffff810dcd87>] ? audit_syscall_entry+0x1d7/0x200<br />
Jan 17 09:36:10 compute2 kernel: [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b<br />
Jan 17 09:36:10 compute2 kernel: Code: 01 00 00 41 8d 7c 05 00 39 cf 0f 83 5f 01 00 00 45 85 ff 0f 84 98 00 00 00 85 d2 0f 84 90 00 00 00 48 8b 05 69 a1 b3 00 49 89 f0 <49> 29 c0 4c 89 c0 4c 01 e0 48 85 c0 7e cf 48 c7 45 80 00 00 00<br />
Jan 17 09:36:10 compute2 kernel: Call Trace:<br />
Jan 17 09:36:10 compute2 kernel: [<ffffffff810d47db>] ? audit_log_start+0x16b/0x430<br />
Jan 17 09:36:10 compute2 kernel: [<ffffffff81063990>] ? default_wake_function+0x0/0x20<br />
Jan 17 09:36:10 compute2 kernel: [<ffffffff8112a831>] ? get_page_from_freelist+0x3d1/0x830<br />
Jan 17 09:36:10 compute2 kernel: [<ffffffff8114497a>] ? handle_mm_fault+0x23a/0x310<br />
Jan 17 09:36:10 compute2 kernel: [<ffffffff810d4caa>] ? audit_log_common_recv_msg+0x6a/0xf0<br />
Jan 17 09:36:10 compute2 kernel: [<ffffffff8104759c>] ? __do_page_fault+0x1ec/0x480<br />
Jan 17 09:36:10 compute2 kernel: [<ffffffff810d50f8>] ? audit_receive+0x3c8/0xd90<br />
Jan 17 09:36:10 compute2 kernel: [<ffffffff81055ad3>] ? __wake_up+0x53/0x70<br />
Jan 17 09:36:10 compute2 kernel: [<ffffffff814748db>] ? netlink_unicast+0x2db/0x320<br />
Jan 17 09:36:10 compute2 kernel: [<ffffffff81475350>] ? netlink_sendmsg+0x2c0/0x3d0<br />
Jan 17 09:36:10 compute2 kernel: [<ffffffff81436b33>] ? sock_sendmsg+0x123/0x150<br />
Jan 17 09:36:10 compute2 kernel: [<ffffffff81096da0>] ? autoremove_wake_function+0x0/0x40<br />
Jan 17 09:36:10 compute2 kernel: [<ffffffff8112c093>] ? __alloc_pages_nodemask+0x113/0x8d0<br />
Jan 17 09:36:10 compute2 kernel: [<ffffffff811a2590>] ? mntput_no_expire+0x30/0x110<br />
Jan 17 09:36:10 compute2 kernel: [<ffffffff81436e49>] ? sys_sendto+0x139/0x190<br />
Jan 17 09:36:10 compute2 kernel: [<ffffffff811a2590>] ? mntput_no_expire+0x30/0x110<br />
Jan 17 09:36:10 compute2 kernel: [<ffffffff810dcd87>] ? audit_syscall_entry+0x1d7/0x200<br />
Jan 17 09:36:10 compute2 kernel: [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b<br />
<br />
The other machine has a similar call trace:<br />
Jan 17 09:40:55 compute1 kernel: Call Trace:<br />
Jan 17 09:40:55 compute1 kernel: [<ffffffff81063990>] ? default_wake_function+0x0/0x20<br />
Jan 17 09:40:55 compute1 kernel: [<ffffffff8114497a>] ? handle_mm_fault+0x23a/0x310<br />
Jan 17 09:40:55 compute1 kernel: [<ffffffff810d4caa>] ? audit_log_common_recv_msg+0x6a/0xf0<br />
Jan 17 09:40:55 compute1 kernel: [<ffffffff8104759c>] ? __do_page_fault+0x1ec/0x480<br />
Jan 17 09:40:55 compute1 kernel: [<ffffffff810d50f8>] ? audit_receive+0x3c8/0xd90<br />
Jan 17 09:40:55 compute1 kernel: [<ffffffff81055ad3>] ? __wake_up+0x53/0x70<br />
Jan 17 09:40:55 compute1 kernel: [<ffffffff814748db>] ? netlink_unicast+0x2db/0x320<br />
Jan 17 09:40:55 compute1 kernel: [<ffffffff81475350>] ? netlink_sendmsg+0x2c0/0x3d0<br />
Jan 17 09:40:55 compute1 kernel: [<ffffffff81436b33>] ? sock_sendmsg+0x123/0x150<br />
Jan 17 09:40:55 compute1 kernel: [<ffffffff81096da0>] ? autoremove_wake_function+0x0/0x40<br />
Jan 17 09:40:55 compute1 kernel: [<ffffffff8112c093>] ? __alloc_pages_nodemask+0x113/0x8d0<br />
Jan 17 09:40:55 compute1 kernel: [<ffffffff811a2590>] ? mntput_no_expire+0x30/0x110<br />
Jan 17 09:40:55 compute1 kernel: [<ffffffff81436e49>] ? sys_sendto+0x139/0x190<br />
Jan 17 09:40:55 compute1 kernel: [<ffffffff811a2590>] ? mntput_no_expire+0x30/0x110<br />
Jan 17 09:40:55 compute1 kernel: [<ffffffff810dcd87>] ? audit_syscall_entry+0x1d7/0x200<br />
Jan 17 09:40:55 compute1 kernel: [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
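
Since the messages repeat many times, it may help to tally them per CPU and process. The following is only a minimal sketch, assuming the header line always has the form shown in the log above; the function name `summarize_lockups` and the sample list are made up for illustration and are not part of any existing tool.

```python
import re
from collections import Counter

# Matches the header line of a soft lockup report, in the form seen in the log above:
#   BUG: soft lockup - CPU#6 stuck for 67s! [sudo:10408]
LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>.+):(?P<pid>\d+)\]"
)


def summarize_lockups(lines):
    """Count soft lockup header lines per (cpu, command) pair (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = LOCKUP_RE.search(line)
        if match:
            counts[(int(match.group("cpu")), match.group("comm"))] += 1
    return counts


# Two lines copied from the report; only the first is a lockup header.
sample = [
    "Jan 17 09:36:10 compute2 kernel: BUG: soft lockup - CPU#6 stuck for 67s! [sudo:10408]",
    "Jan 17 09:36:10 compute2 kernel: CPU 6",
]

for (cpu, comm), n in summarize_lockups(sample).items():
    print(f"CPU#{cpu} {comm}: {n} report(s)")
```

To run it against a real log, the `sample` list could be replaced by the lines of an open log file; which file holds the kernel messages depends on the syslog configuration, so the path is left out here.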