[elrepo] drm/i915: Resetting chip after gpu hang (Intel Braswell/Cherry View: 0x22b1)

Tue Feb 9 01:57:33 EST 2016

Dear list,

With my Braswell chipset I experience a lag of several seconds from time to time before X resumes refreshing the display. As I understand it, the cause may be a buggy watermark computation in the i915 driver for Valleyview/Cherryview, which forces a GPU reset, and the reset is what takes the time.
With kernel 4.4, however, the GPU works as I would expect. So my guess is that backporting the i915 watermark computation to the el7 kernel should solve the issue.

I don’t know whether it will be sufficient to modify only the i915 part, or whether the DRM subsystem will need to be patched as well. The latter would be a much more invasive change, since other GPU drivers would also be affected.

Do you have any thoughts or clues on this?
Is anyone else experiencing this issue and interested in solving it within the el7 kernel?

Below, you’ll find some log file excerpts.
Best - Björn

Every time the system boots, a trace similar to the following is written:
[    6.571568] Call Trace:
[    6.571578]  [<ffffffff8163515c>] dump_stack+0x19/0x1b
[    6.571584]  [<ffffffff8107b200>] warn_slowpath_common+0x70/0xb0
[    6.571587]  [<ffffffff8107b29c>] warn_slowpath_fmt+0x5c/0x80
[    6.571631]  [<ffffffffa0213279>] vlv_wait_port_ready+0x139/0x180 [i915]
[    6.571677]  [<ffffffffa023f36c>] intel_enable_dp+0x20c/0x2a0 [i915]
[    6.571737]  [<ffffffffa023f6f4>] chv_pre_enable_dp+0x1b4/0x200 [i915]
[    6.571781]  [<ffffffffa021b2f0>] valleyview_crtc_enable+0x260/0x350 [i915]
[    6.571825]  [<ffffffffa0219562>] __intel_set_mode+0xb12/0xd10 [i915]
[    6.571870]  [<ffffffffa022060b>] intel_crtc_set_config+0xaab/0xff0 [i915]
[    6.571894]  [<ffffffffa010ca7b>] ? kfree_state+0x4b/0x50 [drm]
[    6.571913]  [<ffffffffa00fcdb7>] drm_mode_set_config_internal+0x67/0x100 [drm]
[    6.571921]  [<ffffffffa0196508>] restore_fbdev_mode+0xc8/0xf0 [drm_kms_helper]
[    6.571930]  [<ffffffffa01983f5>] drm_fb_helper_restore_fbdev_mode_unlocked+0x25/0x70 [drm_kms_helper]
[    6.571937]  [<ffffffffa0198462>] drm_fb_helper_set_par+0x22/0x50 [drm_kms_helper]
[    6.571945]  [<ffffffffa019837f>] drm_fb_helper_hotplug_event+0x8f/0xe0 [drm_kms_helper]
[    6.571989]  [<ffffffffa022f56e>] intel_fbdev_output_poll_changed+0x1e/0x30 [i915]
[    6.571997]  [<ffffffffa018c7b7>] drm_kms_helper_hotplug_event+0x27/0x30 [drm_kms_helper]
[    6.572004]  [<ffffffffa018c88d>] output_poll_execute+0x6d/0x190 [drm_kms_helper]
[    6.572008]  [<ffffffff8109d5fb>] process_one_work+0x17b/0x470
[    6.572011]  [<ffffffff8109e3cb>] worker_thread+0x11b/0x400
[    6.572015]  [<ffffffff8109e2b0>] ? rescuer_thread+0x400/0x400
[    6.572018]  [<ffffffff810a5aef>] kthread+0xcf/0xe0
[    6.572022]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
[    6.572027]  [<ffffffff81645818>] ret_from_fork+0x58/0x90
[    6.572031]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
[    6.572033] ---[ end trace 439a960569a8132a ]---
[    6.669131] Console: switching to colour frame buffer device 128x48
[    6.756489] i915 0000:00:02.0: fb0: inteldrmfb frame buffer device
[    6.756491] i915 0000:00:02.0: registered panic notifier
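When comparing boot traces across kernels, it can help to pull out the innermost frame that belongs to i915. In this trace that frame is `vlv_wait_port_ready`, which is likely where the warning was raised. Below is a minimal sketch of that extraction. It assumes the frame format shown above from the 3.10.0 el7 kernel, and other kernel versions may print traces differently. The helper names `parse_frames` and `first_module_frame` are hypothetical and not part of any kernel tool. Frames prefixed with `?` are skipped because the unwinder marks them as unreliable.

```python
import re
from typing import List, Optional, Tuple

# Assumed frame format, taken from the trace above, e.g.
# [    6.571631]  [<ffffffffa0213279>] vlv_wait_port_ready+0x139/0x180 [i915]
# A "? " before the symbol marks a frame the unwinder considers unreliable.
FRAME_RE = re.compile(
    r"\[\s*\d+\.\d+\]\s+\[<[0-9a-f]+>\]\s+"
    r"(?P<unreliable>\?\s+)?"
    r"(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)

Frame = Tuple[str, Optional[str]]


def parse_frames(lines: List[str]) -> List[Frame]:
    """Return (function, module) pairs for reliable frames, innermost first.

    module is None for symbols in the core kernel image (no [module] suffix).
    """
    frames: List[Frame] = []
    for line in lines:
        m = FRAME_RE.search(line)
        if m and not m.group("unreliable"):
            frames.append((m.group("func"), m.group("module")))
    return frames


def first_module_frame(frames: List[Frame], module: str = "i915") -> Optional[str]:
    """Return the innermost function that belongs to the given module, or None."""
    for func, mod in frames:
        if mod == module:
            return func
    return None
```

Feeding it the trace lines above should yield `vlv_wait_port_ready` for `module="i915"`. Running it over traces from the el7 kernel and from 4.4 would show whether the same warning site is involved.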

Later, each time the lag occurs, the following is written to dmesg. Could this be caused by the buggy watermark computation?
[  156.714911] [drm] stuck on render ring
[  156.743834] [drm] GPU HANG: ecode 8:0:0x85dffffb, in Xorg [1432], reason: Ring hung, action: reset
[  156.743852] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[  156.743861] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[  156.743870] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[  156.743878] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[  156.743888] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[  156.748853] drm/i915: Resetting chip after gpu hang
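Because the lag occurs "from time to time", it may help to collect every `GPU HANG` record and look at the intervals between them. Below is a minimal sketch that does this. It assumes the message format printed by this 3.10.0-327 el7 kernel, and the format may differ in other kernels. The names `GpuHang`, `parse_hangs` and `seconds_between` are hypothetical helpers, not an existing tool.

```python
import re
from typing import List, NamedTuple


class GpuHang(NamedTuple):
    timestamp: float  # seconds since boot, as printed by dmesg
    ecode: str
    process: str
    pid: int
    reason: str
    action: str


# Assumed line format, copied from the log above:
# [  156.743834] [drm] GPU HANG: ecode 8:0:0x85dffffb, in Xorg [1432], reason: Ring hung, action: reset
HANG_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+\[drm\]\s+GPU HANG:\s+"
    r"ecode\s+(?P<ecode>[^,\s]+),\s+"
    r"in\s+(?P<proc>.+?)\s+\[(?P<pid>\d+)\],\s+"
    r"reason:\s+(?P<reason>[^,\n]+),\s+"
    r"action:\s+(?P<action>\S+)"
)


def parse_hangs(dmesg_text: str) -> List[GpuHang]:
    """Extract every GPU HANG record from a block of dmesg text."""
    return [
        GpuHang(float(m.group("ts")), m.group("ecode"), m.group("proc"),
                int(m.group("pid")), m.group("reason"), m.group("action"))
        for m in HANG_RE.finditer(dmesg_text)
    ]


def seconds_between(hangs: List[GpuHang]) -> List[float]:
    """Gaps between consecutive hangs, useful for spotting a regular pattern."""
    return [b.timestamp - a.timestamp for a, b in zip(hangs, hangs[1:])]
```

The input can be the saved output of `dmesg` or the relevant part of the system log. Lines such as `[drm] stuck on render ring` are simply ignored, because only the `GPU HANG:` line carries the ecode and the offending process.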

Additionally, a GPU crash dump is written to /sys/class/drm/card0/error. It’s a huge file, so these are just the first few lines:

GPU HANG: ecode 8:0:0x85dffffb, in Xorg [1432], reason: Ring hung, action: reset
Time: 1453771555 s 843906 us
Kernel: 3.10.0-327.4.4.el7.x86_64
Active process (on ring render): Xorg [1432]
Reset count: 0
Suspend count: 0
PCI ID: 0x22b1
EIR: 0x00000000
IER: 0x00000000
GTIER gt 0: 0x01010121
GTIER gt 1: 0x01010101
GTIER gt 2: 0x00000070
GTIER gt 3: 0x00000101
PGTBL_ER: 0x00000000
FORCEWAKE: 0x00000000
DERRMR: 0x00000000
CCID: 0x00000000
Missed interrupts: 0x00000000
  fence[0] = 9bf01f006c0001
  fence[1] = 307d0010307a003
  fence[2] = 308700003087003
  fence[3] = 30a701f03088003
  fence[4] = 30f701f030b8003
  fence[5] = 30fd000030fd003
  fence[6] = 31250070310e003
  fence[7] = 303600003036003
  fence[8] = 303a00103037003
  fence[9] = 312e00803126003
  fence[10] = 313300103132003
  fence[11] = 31a200503137003
  fence[12] = 1e5c01f01b5d001
  fence[13] = 329a007031db003
  fence[14] = 31af001031ae003
  fence[15] = 00000000
  INSTDONE_0: 0xffdfffff
  INSTDONE_1: 0xffffffff
  INSTDONE_2: 0xffffffff
  INSTDONE_3: 0xff73fffd
ERROR: 0x00000001
FAULT_TLB_DATA: 0x00000000 0x000009f4
DONE_REG: 0x07ffffff
render command stream:
  HEAD: 0x00005a78
  TAIL: 0x00005a90
  CTL: 0x0001f001
  HWS: 0x00000000
  ACTHD: 0x00000000 00005a78
  IPEIR: 0x00000000
  IPEHR: 0x7a000004
  INSTDONE: 0xffdfffff
  BBADDR: 0x00000000 00a0b488
  BB_STATE: 0x00000000
  INSTPS: 0x8000010b
  INSTPM: 0x00006080
  FADDR: 0x00000000 01efca90
  RC PSMI: 0x00001010
  FAULT_REG: 0x000000c1
  SYNC_0: 0x00000000 [last synced 0x00000000]
  SYNC_1: 0x00000000 [last synced 0x00000000]
  SYNC_2: 0x00000000 [last synced 0x00000000]
  GFX_MODE: 0x0000a000
  PDP0: 0x000000006f849000
  PDP1: 0x000000006f848000
  PDP2: 0x000000006f847000
  PDP3: 0x000000006f846000
  seqno: 0xfffff203
  waiting: yes
  ring->head: 0x00000000
  ring->tail: 0x00005a98
  hangcheck: hung [40]
bsd command stream:
  HEAD: 0x00000000
  TAIL: 0x00000000
  CTL: 0x00000000
  HWS: 0x00037000
  ACTHD: 0x00000000 00000000
  IPEIR: 0x00000000
  IPEHR: 0x00000000
  INSTDONE: 0xfffffffe
  BBADDR: 0x00000000 00000000
  BB_STATE: 0x00000000
  INSTPS: 0x00000000
  INSTPM: 0x00000000
  FADDR: 0x00000000 00000000
  RC PSMI: 0x00000010
  FAULT_REG: 0x00000000
  SYNC_0: 0x00000000 [last synced 0x00000000]
  SYNC_1: 0x00000000 [last synced 0x00000000]
  SYNC_2: 0x00000000 [last synced 0x00000000]
  GFX_MODE: 0x00008000
  PDP0: 0x0000000000000000
  PDP1: 0x0000000000000000
  PDP2: 0x0000000000000000
  PDP3: 0x0000000000000000
  seqno: 0xffffeffe
  waiting: no
  ring->head: 0x00000000
  ring->tail: 0x00000000
  hangcheck: idle [0]
blt command stream:
  HEAD: 0x00000368
  TAIL: 0x00000368
  CTL: 0x0001f001
  HWS: 0x00059000
  ACTHD: 0x00000000 00000368
  IPEIR: 0x00000000
  IPEHR: 0x00000000
  INSTDONE: 0xfffffffe
  BBADDR: 0x00000000 00600038
  BB_STATE: 0x00000000
  INSTPS: 0x00000000
  INSTPM: 0x00000000
  FADDR: 0x00000000 006a0368
  RC PSMI: 0x00000018
  FAULT_REG: 0x00000000
  SYNC_0: 0x00000000 [last synced 0x00000000]
  SYNC_1: 0x00000000 [last synced 0x00000000]
  SYNC_2: 0x00000000 [last synced 0x00000000]
  GFX_MODE: 0x00008000
  PDP0: 0x000000006f849000
  PDP1: 0x000000006f848000
  PDP2: 0x000000006f847000
  PDP3: 0x000000006f846000
  seqno: 0xfffff202
  waiting: no
  ring->head: 0x00000000
  ring->tail: 0x00000000
  hangcheck: idle [0]
vebox command stream:
  HEAD: 0x00000000
  TAIL: 0x00000000
  CTL: 0x00000000
  HWS: 0x0007b000
  ACTHD: 0x00000000 00000000
  IPEIR: 0x00000000
  IPEHR: 0x00000000
  INSTDONE: 0xfffffffe
  BBADDR: 0x00000000 00000000
  BB_STATE: 0x00000000
  INSTPS: 0x00000000
  INSTPM: 0x00000000
  FADDR: 0x00000000 00000000
  RC PSMI: 0x00000010
  FAULT_REG: 0x00000000
  SYNC_0: 0x00000000 [last synced 0x00000000]
  SYNC_1: 0x00000000 [last synced 0x00000000]
  SYNC_2: 0x00000000 [last synced 0x00000000]
  GFX_MODE: 0x00008000
  PDP0: 0x0000000000000000
  PDP1: 0x0000000000000000
  PDP2: 0x0000000000000000
  PDP3: 0x0000000000000000
  seqno: 0xffffeffe
  waiting: no
  ring->head: 0x00000000
  ring->tail: 0x00000000
  hangcheck: idle [0]
vm[0]
  Active [0]:
  Pinned [14]:
    00000000    81920 01 01 0 0 P dirty uncached
    00014000   131072 40 40 0 0 P dirty uncached
    00035000     4096 01 01 0 0 P snooped
    00037000     8192 01 01 0 0 P dirty uncached
    00039000   131072 40 40 0 0 P dirty uncached
    00059000     8192 01 01 0 0 P dirty uncached
    0005b000   131072 40 40 0 0 P dirty uncached
    0007b000     8192 01 01 0 0 P dirty uncached
    0007d000   131072 40 40 0 0 P dirty uncached
    0009e000  3145728 40 00 0 0 P dirty uncached
    01ee3000    81920 01 01 0 0 P dirty uncached
    01ef7000   131072 40 40 0 0 P dirty uncached
    0202d000    16384 40 00 0 0 P dirty uncached
    01b5d000  3145728 36 00 0 0 P X dirty uncached (name: 4) (fence: 12)
vm[1]
  Active [0]:
  Pinned [0]:
vm[2]
  Active [5]:
    00601000    36864 37 00 fffff204 0 render uncached
    0199c000  2916352 02 00 fffff204 fffff204 X dirty render uncached
    01c64000   159744 37 00 fffff204 0 userptr render snooped
    0165c000   262144 37 00 fffff204 0 render uncached
    00a0b000    16384 7e 00 fffff204 0 dirty render uncached
  Pinned [0]:
vm[3]
  Active [0]:
  Pinned [0]:
vm[4]
  Active [0]:
  Pinned [0]:
vm[5]
  Active [0]:
  Pinned [0]:
render ring (submitted by Xorg [1432]) --- gtt_offset = 0x00a0b000