[elrepo] RAID 5 issue
Peter Steele
pwsteele at gmail.com
Fri Oct 2 18:17:07 EDT 2015
We were originally running with the stock CentOS 7.1 kernel, release
3.10.0-229.4.2.el7.x86_64. We hit what appears to be the bug described here:
https://bugs.launchpad.net/ubuntu/+source/qemu-kvm/+bug/785668
although obviously a CentOS manifestation of it. We're using containers
created and managed by libvirt-lxc and were able to confirm that
containers running on two different hosts were frequently unable to
communicate with each other because their ARP tables were not being
populated properly. As an experiment we tried installing the 4.2
kernel-ml release from ELRepo, and the ARP table issue appeared to be
fixed.
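For anyone who wants to check for the same symptom, here is a rough sketch of how the ARP tables could be inspected from inside a container. It assumes the standard six-column /proc/net/arp layout and treats any entry without the "complete" flag (0x2) as unpopulated. The function name is made up for illustration; it is not something we actually ran.

```python
ATF_COM = 0x02  # "completed entry" flag as reported in /proc/net/arp


def incomplete_arp_entries(arp_text):
    """Return (ip, device) pairs whose ARP entry is not marked complete.

    arp_text is the full contents of /proc/net/arp, header line included.
    """
    entries = []
    for row in arp_text.splitlines()[1:]:  # skip the column header line
        fields = row.split()
        if len(fields) < 6:
            continue  # ignore blank or malformed rows
        ip, _hw_type, flags, _mac, _mask, device = fields[:6]
        if not int(flags, 16) & ATF_COM:
            entries.append((ip, device))
    return entries
```

Feeding it the text read from /proc/net/arp returns the peers that never resolved, which is what we would expect to see for the remote container's address when the problem occurs.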
Since we had no other solution at the time, we decided to switch to the
4.2 kernel and give it some soak time in-house. Everything worked
well for a few weeks, but we've started seeing kernel crashes with the
following pattern:
Sep 29 08:41:18 jg-02 kernel: WARNING: CPU: 1 PID: 7400 at drivers/md/raid5.c:4244 break_stripe_batch_list+0x255/0x270 [raid456]()
Sep 29 08:41:18 jg-02 kernel: Modules linked in: fuse veth raid456 async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 iptable_filter ip_tables tun bonding kvm_amd kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel binfmt_misc lrw gf128mul glue_helper ablk_helper amd64_edac_mod cryptd edac_mce_amd serio_raw pcspkr edac_core joydev fam15h_power k10temp input_leds 8250_fintek sp5100_tco shpchp i2c_piix4 acpi_cpufreq ipmi_devintf ipmi_msghandler ext4 mbcache jbd2 raid1 sr_mod cdrom sd_mod mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit drm_kms_helper ttm ata_generic pata_acpi drm pata_atiixp
Sep 29 08:41:18 jg-02 kernel: ahci libahci libata e1000e ptp pps_core uas usb_storage
Sep 29 08:41:18 jg-02 kernel: CPU: 1 PID: 7400 Comm: md1_raid5 Not tainted 4.2.0-1.el7.elrepo.x86_64 #1
Sep 29 08:41:18 jg-02 kernel: Hardware name: Supermicro AS -1012G-MTF/H8SGL, BIOS 3.5 11/25/2013
Sep 29 08:41:18 jg-02 kernel: 0000000000000000 000000004031b4f6 ffff8807fb23baa8 ffffffff816aebe9
Sep 29 08:41:18 jg-02 kernel: 0000000000000000 0000000000000000 ffff8807fb23bae8 ffffffff810798ea
Sep 29 08:41:18 jg-02 kernel: 00000000ffffffff 0000000000000000 ffff8809a61a8000 ffff8807fb44a008
Sep 29 08:41:18 jg-02 kernel: Call Trace:
Sep 29 08:41:18 jg-02 kernel: [<ffffffff816aebe9>] dump_stack+0x45/0x57
Sep 29 08:41:18 jg-02 kernel: [<ffffffff810798ea>] warn_slowpath_common+0x8a/0xc0
Sep 29 08:41:18 jg-02 kernel: [<ffffffff81079a1a>] warn_slowpath_null+0x1a/0x20
Sep 29 08:41:18 jg-02 kernel: [<ffffffffa049cc45>] break_stripe_batch_list+0x255/0x270 [raid456]
Sep 29 08:41:18 jg-02 kernel: [<ffffffffa04a1f32>] handle_stripe+0x922/0x2220 [raid456]
Sep 29 08:41:18 jg-02 kernel: [<ffffffff810a3d42>] ? default_wake_function+0x12/0x20
Sep 29 08:41:18 jg-02 kernel: [<ffffffff810bc17b>] ? autoremove_wake_function+0x2b/0x40
Sep 29 08:41:18 jg-02 kernel: [<ffffffffa04a3b8d>] handle_active_stripes.isra.45+0x35d/0x480 [raid456]
Sep 29 08:41:18 jg-02 kernel: [<ffffffff815294b9>] ? md_wakeup_thread+0x39/0x70
Sep 29 08:41:18 jg-02 kernel: [<ffffffffa0498b16>] ? do_release_stripe+0x96/0x180 [raid456]
Sep 29 08:41:18 jg-02 kernel: [<ffffffffa04a4d62>] raid5d+0x4a2/0x690 [raid456]
Sep 29 08:41:18 jg-02 kernel: [<ffffffff816b1a6f>] ? __schedule+0x2af/0x880
Sep 29 08:41:18 jg-02 kernel: [<ffffffff8152b2c6>] md_thread+0x136/0x150
Sep 29 08:41:18 jg-02 kernel: [<ffffffff810bc150>] ? prepare_to_wait_event+0xf0/0xf0
Sep 29 08:41:18 jg-02 kernel: [<ffffffff8152b190>] ? find_pers+0x80/0x80
Sep 29 08:41:18 jg-02 kernel: [<ffffffff81097cb8>] kthread+0xd8/0xf0
Sep 29 08:41:18 jg-02 kernel: [<ffffffff81097be0>] ? kthread_create_on_node+0x1b0/0x1b0
Sep 29 08:41:18 jg-02 kernel: [<ffffffff816b5f5f>] ret_from_fork+0x3f/0x70
Sep 29 08:41:18 jg-02 kernel: [<ffffffff81097be0>] ? kthread_create_on_node+0x1b0/0x1b0
Sep 29 08:41:18 jg-02 kernel: ---[ end trace bec5dfa241e24011 ]---
This appears to be the bug described here:
https://bugzilla.redhat.com/show_bug.cgi?id=1258153
So now we're stuck. We can't go back to the CentOS 3.10.0 kernel because
of the ARP table issue, and kernel 4.2 has a bug in software RAID 5. Is
there a fix in the queue for the RAID 5 issue?
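In case it helps others gauge how often they hit this, below is a minimal sketch that counts kernel WARNING headers in a syslog-style file, grouped by source location and function. It assumes each log record is on a single unwrapped line in the format shown in the trace above. The function name and the regular expression are illustrative assumptions, not an existing tool.

```python
import re
from collections import Counter

# Matches the first line of a kernel WARNING as it appears in the log above,
# e.g. "... kernel: WARNING: CPU: 1 PID: 7400 at drivers/md/raid5.c:4244 func+0x..."
WARNING_RE = re.compile(
    r"kernel: WARNING: CPU: \d+ PID: \d+ at "
    r"(?P<file>\S+):(?P<line>\d+) (?P<func>[\w.]+)\+"
)


def count_kernel_warnings(lines):
    """Count WARNING headers, keyed by (source file, line number, function)."""
    counts = Counter()
    for line in lines:
        match = WARNING_RE.search(line)
        if match:
            key = (match["file"], int(match["line"]), match["func"])
            counts[key] += 1
    return counts
```

Passing an open log file (for example /var/log/messages, if that is where kernel messages land on your hosts) and looking up the key ("drivers/md/raid5.c", 4244, "break_stripe_batch_list") gives the number of times this particular warning fired.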
Peter