[elrepo] hard lockups on CPUs with elrepo kernel 3.10.103 on CentOS 6

Grigory Shamov Grigory.Shamov at umanitoba.ca
Thu Oct 6 12:31:24 EDT 2016


Hi All,

We are running kernel-lt-3.10.103 on about 300 CentOS 6.8 machines of our
HPC cluster. 
The machines are fairly old Intel Xeon X5650s (Westmere/Nehalem chips,
SSE4.2).
We first tested that the kernel works with our driver stack, were
satisfied, and went to production.

It turned out, though, that under production load, from time to time, on
some of the nodes (a few of them, seemingly at random), the kernel panics
on nmi_watchdog hard lockups (and from time to time also complains about
soft lockups), emitting various messages like this:

"""
Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 3
 ... Call trace follows; mentions watchdog_overflow_callback ...
Shutting down cpus with NMI
drm_kms_helper: panic occurred, switching back to text console
"""
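
To check whether the lockups really are spread at random or cluster on
particular CPUs, lines like the one above could be tallied from collected
console logs. Below is a minimal Python sketch, not something we run in
production. The function name tally_lockups is hypothetical, and the soft
lockup pattern assumes the usual "BUG: soft lockup - CPU#N stuck for Ns!"
wording, which should be checked against the real logs.

```python
import re
from collections import Counter

# Hard lockup panic line, as quoted in the message above.
HARD_RE = re.compile(r"Watchdog detected hard LOCKUP on cpu (\d+)")
# Assumed wording of the soft lockup warning; adjust to the real logs.
SOFT_RE = re.compile(r"soft lockup - CPU#(\d+) stuck for \d+s")


def tally_lockups(lines):
    """Count (kind, cpu) pairs over an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = HARD_RE.search(line)
        if match:
            counts[("hard", int(match.group(1)))] += 1
            continue
        match = SOFT_RE.search(line)
        if match:
            counts[("soft", int(match.group(1)))] += 1
    return counts
```

Feeding it the lines of one node's console log (for example
tally_lockups(open(path)) with a hypothetical per-node log path) gives
a per-CPU count that can be compared across nodes.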

Then we tried simply increasing kernel.watchdog_thresh; on 3.10 it is
set to 10, while on the CentOS 6 2.6.32 kernel it used to be 60.
That made things worse: the test node quickly hit a kernel panic with a
call trace mentioning "perf_event_overflow".
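
For reference, the current watchdog settings can be read from
/proc/sys/kernel on each node. The sketch below is only an illustration;
the function names are hypothetical. It relies on the relationship given
in the kernel sysctl documentation: the hard lockup threshold is
watchdog_thresh seconds and the soft lockup threshold is twice that.

```python
from pathlib import Path


def read_watchdog_settings(base="/proc/sys/kernel"):
    """Return watchdog_thresh and nmi_watchdog as ints, or None if unreadable."""
    settings = {}
    for name in ("watchdog_thresh", "nmi_watchdog"):
        try:
            settings[name] = int((Path(base) / name).read_text().strip())
        except (OSError, ValueError):
            # File missing, unreadable, or not a plain integer.
            settings[name] = None
    return settings


def lockup_thresholds(watchdog_thresh):
    """Hard lockup fires after watchdog_thresh seconds, soft after twice that."""
    return {"hard": watchdog_thresh, "soft": 2 * watchdog_thresh}
```

With the 3.10 default of 10 this gives 10 seconds for hard lockups and
20 seconds for soft lockups.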

Is there anything we can do about these errors, and what would be the
possible reason for them? Could anyone suggest a fix? Thank you very much
in advance.  


-- 
Grigory Shamov

Westgrid/ComputeCanada Site Lead
University of Manitoba
E2-588 EITC Building,
(204) 474-9625






