[elrepo] RAID 5 issue

Peter Steele pwsteele at gmail.com
Fri Oct 2 18:17:07 EDT 2015


We were originally running with the stock CentOS 7.1 kernel, release 
3.10.0-229.4.2.el7.x86_64. We hit what appears to be the bug described here:

https://bugs.launchpad.net/ubuntu/+source/qemu-kvm/+bug/785668

although obviously a CentOS manifestation of it. We're using containers 
created and managed by libvirt-lxc and were able to confirm that 
containers running on two different hosts were frequently unable to 
communicate with each other because their ARP tables were not being 
populated properly. As an experiment we tried installing the 4.2 
kernel-ml release from ELRepo and the ARP table issue appeared to be 
fixed.
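
For anyone trying to confirm the same symptom, here is a minimal Python 
sketch that lists unresolved ARP entries inside a container. It assumes 
the standard six-column layout of /proc/net/arp, where an incomplete 
entry shows Flags 0x0 and an all-zero hardware address. The function 
name and the sample addresses are made up for illustration; this is not 
the check we actually ran.

```python
def incomplete_arp_entries(arp_text):
    """Return (ip, device) pairs whose ARP entry has no resolved MAC.

    arp_text is expected to look like the contents of /proc/net/arp:
    a header line followed by rows of
    IP address, HW type, Flags, HW address, Mask, Device.
    """
    entries = []
    for line in arp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 6:
            continue  # ignore blank or malformed rows
        ip, _hw_type, flags, mac, _mask, device = fields[:6]
        # Flags 0x0 (ATF_COM not set) or a zero MAC means "incomplete".
        if int(flags, 16) == 0 or mac == "00:00:00:00:00:00":
            entries.append((ip, device))
    return entries


# Hypothetical sample data; on a real host pass open("/proc/net/arp").read().
SAMPLE = (
    "IP address       HW type     Flags       HW address            Mask     Device\n"
    "192.168.1.1      0x1         0x2         aa:bb:cc:dd:ee:ff     *        eth0\n"
    "192.168.1.2      0x1         0x0         00:00:00:00:00:00     *        eth0\n"
)

for ip, dev in incomplete_arp_entries(SAMPLE):
    print(f"{ip} on {dev}: unresolved")
```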

Since we had no other solution at the time, we decided to switch to the 
4.2 kernel and give it some soak time in-house. Everything worked 
well for a few weeks, but we've started seeing kernel crashes with the 
following pattern:

Sep 29 08:41:18 jg-02 kernel: WARNING: CPU: 1 PID: 7400 at 
drivers/md/raid5.c:4244 break_stripe_batch_list+0x255/0x270 [raid456]()
Sep 29 08:41:18 jg-02 kernel: Modules linked in: fuse veth raid456 
async_raid6_recov async_memcpy async_pq async_xor xor async_tx
   raid6_pq xt_CHECKSUM iptable_mangle ipt_MASQUERADE 
nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 
nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 
iptable_filter ip_tables tun bonding kvm_amd kvm crct10dif_pclmul 
crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel binfmt_misc 
lrw gf128mul glue_helper ablk_helper amd64_edac_mod cryptd edac_mce_amd 
serio_raw pcspkr edac_core joydev fam15h_power k10temp input_leds 
8250_fintek sp5100_tco shpchp i2c_piix4 acpi_cpufreq ipmi_devintf 
ipmi_msghandler ext4 mbcache jbd2 raid1 sr_mod cdrom sd_mod mgag200 
syscopyarea sysfillrect sysimgblt i2c_algo_bit drm_kms_helper ttm 
ata_generic pata_acpi drm pata_atiixp
Sep 29 08:41:18 jg-02 kernel: ahci libahci libata e1000e ptp pps_core 
uas usb_storage
Sep 29 08:41:18 jg-02 kernel: CPU: 1 PID: 7400 Comm: md1_raid5 Not 
tainted 4.2.0-1.el7.elrepo.x86_64 #1
Sep 29 08:41:18 jg-02 kernel: Hardware name: Supermicro AS 
-1012G-MTF/H8SGL, BIOS 3.5        11/25/2013
Sep 29 08:41:18 jg-02 kernel: 0000000000000000 000000004031b4f6 
ffff8807fb23baa8 ffffffff816aebe9
Sep 29 08:41:18 jg-02 kernel: 0000000000000000 0000000000000000 
ffff8807fb23bae8 ffffffff810798ea
Sep 29 08:41:18 jg-02 kernel: 00000000ffffffff 0000000000000000 
ffff8809a61a8000 ffff8807fb44a008
Sep 29 08:41:18 jg-02 kernel: Call Trace:
Sep 29 08:41:18 jg-02 kernel: [<ffffffff816aebe9>] dump_stack+0x45/0x57
Sep 29 08:41:18 jg-02 kernel: [<ffffffff810798ea>] 
warn_slowpath_common+0x8a/0xc0
Sep 29 08:41:18 jg-02 kernel: [<ffffffff81079a1a>] 
warn_slowpath_null+0x1a/0x20
Sep 29 08:41:18 jg-02 kernel: [<ffffffffa049cc45>] 
break_stripe_batch_list+0x255/0x270 [raid456]
Sep 29 08:41:18 jg-02 kernel: [<ffffffffa04a1f32>] 
handle_stripe+0x922/0x2220 [raid456]
Sep 29 08:41:18 jg-02 kernel: [<ffffffff810a3d42>] ? 
default_wake_function+0x12/0x20
Sep 29 08:41:18 jg-02 kernel: [<ffffffff810bc17b>] ? 
autoremove_wake_function+0x2b/0x40
Sep 29 08:41:18 jg-02 kernel: [<ffffffffa04a3b8d>] 
handle_active_stripes.isra.45+0x35d/0x480 [raid456]
Sep 29 08:41:18 jg-02 kernel: [<ffffffff815294b9>] ? 
md_wakeup_thread+0x39/0x70
Sep 29 08:41:18 jg-02 kernel: [<ffffffffa0498b16>] ? 
do_release_stripe+0x96/0x180 [raid456]
Sep 29 08:41:18 jg-02 kernel: [<ffffffffa04a4d62>] raid5d+0x4a2/0x690 
[raid456]
Sep 29 08:41:18 jg-02 kernel: [<ffffffff816b1a6f>] ? __schedule+0x2af/0x880
Sep 29 08:41:18 jg-02 kernel: [<ffffffff8152b2c6>] md_thread+0x136/0x150
Sep 29 08:41:18 jg-02 kernel: [<ffffffff810bc150>] ? 
prepare_to_wait_event+0xf0/0xf0
Sep 29 08:41:18 jg-02 kernel: [<ffffffff8152b190>] ? find_pers+0x80/0x80
Sep 29 08:41:18 jg-02 kernel: [<ffffffff81097cb8>] kthread+0xd8/0xf0
Sep 29 08:41:18 jg-02 kernel: [<ffffffff81097be0>] ? 
kthread_create_on_node+0x1b0/0x1b0
Sep 29 08:41:18 jg-02 kernel: [<ffffffff816b5f5f>] ret_from_fork+0x3f/0x70
Sep 29 08:41:18 jg-02 kernel: [<ffffffff81097be0>] ? 
kthread_create_on_node+0x1b0/0x1b0
Sep 29 08:41:18 jg-02 kernel: ---[ end trace bec5dfa241e24011 ]---

This appears to be the bug described here: 
https://bugzilla.redhat.com/show_bug.cgi?id=1258153. So now we're sort 
of stuck. We can't go back to the CentOS 3.10.0 kernel because of the 
ARP table issue, and kernel 4.2 has a bug with software RAID 5. Is there 
a fix in the queue for the RAID 5 issue?
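
In case it helps others check whether they are hitting the same 
warning, here is a rough Python sketch that scans kernel log text for 
WARNING lines raised from drivers/md/raid5.c. It assumes each log entry 
is on a single line (not wrapped the way this email wraps it) and uses 
the line format shown in the trace above. The function name and the 
regular expression are illustrative assumptions, not part of any 
existing tool.

```python
import re

# Matches e.g. "WARNING: CPU: 1 PID: 7400 at drivers/md/raid5.c:4244 func+0x..."
WARN_RE = re.compile(r"WARNING: CPU: \d+ PID: \d+ at (\S+):(\d+) ([\w.]+)\+")


def find_warnings(log_text, source_file="drivers/md/raid5.c"):
    """Return (log prefix, source line number, function) for each WARNING
    that originates in source_file."""
    hits = []
    for line in log_text.splitlines():
        match = WARN_RE.search(line)
        if match and match.group(1) == source_file:
            prefix = line.split(" kernel:")[0]  # timestamp and host name
            hits.append((prefix, int(match.group(2)), match.group(3)))
    return hits


# One unwrapped line taken from the trace above, plus an unrelated line.
SAMPLE_LOG = (
    "Sep 29 08:41:18 jg-02 kernel: WARNING: CPU: 1 PID: 7400 at "
    "drivers/md/raid5.c:4244 break_stripe_batch_list+0x255/0x270 [raid456]()\n"
    "Sep 29 08:41:18 jg-02 kernel: Call Trace:\n"
)

for prefix, lineno, func in find_warnings(SAMPLE_LOG):
    print(f"{prefix}: {func} at raid5.c:{lineno}")
```

On a live system the same function could be fed the contents of 
/var/log/messages or saved journal output instead of SAMPLE_LOG.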

Peter


