[elrepo] RAID 5 issue

Sat Oct 3 11:47:14 EDT 2015

On Fri, Oct 2, 2015 at 3:17 PM, Peter Steele <pwsteele at gmail.com> wrote:
> We were originally running with the stock CentOS 7.1 kernel, release
> 3.10.0-229.4.2.el7.x86_64. We hit what appears to be the bug described here:
>
> https://bugs.launchpad.net/ubuntu/+source/qemu-kvm/+bug/785668
>
> although obviously a CentOS manifestation of it. We're using containers
> created and managed by libvirt-lxc and were able to confirm that containers
> running in two different hosts were frequently unable to communicate with
> each other due to their arp tables not getting populated properly. As an
> experiment we tried installing the 4.2 kernel-ml release from El Repo and
> the arp table issue appeared to be fixed.
>
> Since we had no other solution at the time we decided to switch to the
> 4.2 kernel and give it some soak time in-house. Everything worked well for a
> few weeks, but we've started seeing kernel crashes with the following
> pattern:
>
> Sep 29 08:41:18 jg-02 kernel: WARNING: CPU: 1 PID: 7400 at
> drivers/md/raid5.c:4244 break_stripe_batch_list+0x255/0x270 [raid456]()
> (snip)
> This appears to be the bug described here:
> https://bugzilla.redhat.com/show_bug.cgi?id=1258153. So now we're sort of
> stuck. We can't go back to the CentOS 3.10.0 kernel due to the arp table
> issue, and kernel 4.2 has a bug with software RAID 5. Is there a fix in the
> queue for the RAID 5 issue?

You may want to give a 4.3 kernel a try and see if it has fixed the issue:

http://elrepo.org/people/ajb/devel/kernel-ml/el7/x86_64/RPMS/

Please note that this is an RC version and was released for testing
purposes only. If the problem persists, you will need to file a report
at http://bugzilla.kernel.org so that it gets fixed upstream
(kernel.org).
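
One rough way to check whether the problem persists after booting the
test kernel is to count how often the warning quoted above shows up in
the kernel log. This is only a sketch: it assumes bash and a systemd
journal (as on CentOS 7), and the function name count_raid5_warnings is
made up for illustration.

```shell
# Hypothetical helper: reads kernel log text on stdin and prints the
# number of lines that mention break_stripe_batch_list, the function
# named in the WARNING quoted in the report above.
count_raid5_warnings() {
    # grep -c prints 0 and exits non-zero when nothing matches;
    # "|| true" keeps that case from being treated as a failure.
    grep -c 'break_stripe_batch_list' || true
}

# Typical use on the affected host (not run here):
#   uname -r                                  # confirm which kernel is booted
#   cat /proc/mdstat                          # confirm the RAID 5 array is active
#   journalctl -k | count_raid5_warnings      # warnings since the current boot
```

A count of 0 after a reasonable soak period suggests, but does not
prove, that the warning is gone on the kernel under test.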

Akemi