[elrepo] Problem with CUDA since 331.67.elrepo

Fri May 2 18:03:10 EDT 2014

On 02/05/14 18:00, Michael Lampe wrote:
> Phil Perry wrote:
>
>> I've merged your patches, and built some testing packages (331.67-2)
>> which I can release to the testing repo, but I've come across a small
>> issue whilst doing some quick pre-release testing.
>>
>> On RHEL6, when running glxgears the animation noticeably stutters; it is
>> no longer smooth. The fps count is still reported as ~60fps, apparently
>> linked to the refresh rate of my panel, but the animation "looks" more
>> like 5-10 fps!
>>
>> Downgrading to 331.67-1 confirmed we appear to have introduced a glitch.
>>
>> Unloading the nvidia-uvm module had no effect so that does not appear to
>> be the cause.
>>
>> Commenting out the 'NVreg_ModifyDeviceFiles=0' in
>> /etc/modprobe.d/nvidia.conf fixed the issue.
>>
>> Are you able to observe similar behaviour?
>>
>> I don't observe any issues on RHEL5 where glxgears reports ~11,000fps
>> with or without 'NVreg_ModifyDeviceFiles=0'.
>
> Well, I admit to having tested mostly with el5, which works as I
> described, see
> https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/5/html/Deployment_Guide/s1-pam-console.html.
>
>
> The only el6 machine with nvidia hardware I have available here at work
> is a GPU-Server. It has no video-out, so I cannot log in at the console.
> It also uses another mechanism for device file permissions (actually
> none: r/w for everyone).
>
> El6 doesn't have /etc/security/console.perms.d/50-default.perms; it
> would want to use something like /lib/udev/rules.d/70-acl.rules to add
> an acl for _all_ locally logged-in users.
>
> Options/ideas:
>
> 1) Write a udev rule for nvidia's modules. I'm 99% sure this won't work,
> because the nvidia stuff doesn't populate sysfs and never creates udev
> events.
>
> 2) Create /etc/security/console.perms.d/50-nvidia.perms with a line like
> this:
>
> <console> 0600 /dev/nvidia* 0600 root
>
> Then permissions should be handled as in el5. Multiple logins via X
> won't work I guess, because permissions cannot be accumulated like
> entries to an acl.
>
> 3) Create all devices once r/w for everyone and stick to that.
>
> 4) Admit defeat. Remove NVreg_ModifyDeviceFiles=0, put in suid root
> nvidia-modprobe, and let nvidia have their bloody way.
>
> Better ideas?
>
> -Michael

ATM I'm inclined to go with option 4 for the following reasons:

1. My goal is to package the NVIDIA driver to replicate as closely as 
possible (and where appropriate) the behaviour of the NVIDIA installer 
package, whilst providing a consistent packaged solution that addresses 
issues such as library conflicts (e.g., libGL).

2. I'm not particularly keen to reinvent the wheel. If nvidia-modprobe 
works then I see no reason to craft another solution for a problem that 
doesn't exist. It may not be the way we would have gone about solving 
the problem, but it's the solution nvidia have given us.

3. I'm also really not keen on having the nvidia-uvm module loaded by 
default. My understanding is that on a default NVIDIA installer 
installation only the nvidia module is loaded by default. CUDA 
applications trigger the nvidia-uvm module to load (if not already 
loaded) at run time by forking nvidia-modprobe. So I don't think it is
appropriate to load the nvidia-uvm module by default on all
installations, as it a) deviates from upstream default behaviour and b)
is not particularly efficient for the 90%-plus of users who don't use
CUDA and don't need the nvidia-uvm module loaded.

I prefer this to the alternative of creating the device files and 
loading the module on all installations. Another option would be to 
split nvidia-uvm out into a separate package (e.g., kmod-nvidia-uvm) that
installs the nvidia-uvm kernel module, loads it and creates the 
necessary device files. Then CUDA users can install this extra package 
without encumbering the rest of the nvidia user-base. However, this
wouldn't be my preferred option, as it creates more work for me: I would
have to maintain and build an extra package for every driver release
(across multiple arches / distro releases). I already have to manually
update and build 12 packages for each new nvidia release (soon to be 15
with the release of RHEL7), so you'll understand why I'm not keen to add
4-5 more.
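
For anyone wanting to confirm whether nvidia-uvm really is loaded only on demand on their system, a minimal shell sketch follows. The helper name module_loaded is hypothetical and not part of any package; it only inspects a /proc/modules-style listing, where dashes in module names appear as underscores (nvidia-uvm is listed as nvidia_uvm).

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) if the named kernel module
# appears in a /proc/modules-style listing, fail otherwise.
# $1 = module name (dashes are accepted and mapped to underscores)
# $2 = listing to read (optional, defaults to /proc/modules)
module_loaded() {
    name=$(printf '%s' "$1" | tr '-' '_')
    listing=${2:-/proc/modules}
    # The first field of each line in /proc/modules is the module name.
    awk -v m="$name" '$1 == m { found = 1 } END { exit !found }' "$listing"
}
```

Running "module_loaded nvidia-uvm && echo loaded" before and after starting a CUDA application should show whether the module was pulled in at run time.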

So I'd propose dropping the /etc/modprobe.d/nvidia.conf settings, adding
nvidia-modprobe to the package and seeing how that works. BTW, I believe
this is all rpmfusion has done for their Fedora packages, and likewise
Debian.

I'm happy to ship an /etc/modprobe.d/nvidia.conf file, and mark it as 
noreplace so users can add their personal configurations as required. 
I'm also happy to populate it with the options you've provided, as 
examples, but commented out by default. We can briefly document these
options and allow users to uncomment them case by case if they wish.
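
As a rough sketch of what such a shipped file might look like: only the NVreg_ModifyDeviceFiles=0 setting is taken from this thread; the "options nvidia" prefix is assumed from the usual modprobe.d syntax, and the comment wording is purely illustrative.

```
# /etc/modprobe.d/nvidia.conf
#
# Example options for the nvidia kernel module. Everything is commented
# out by default; uncomment a line to enable it.

# Stop the driver from creating or modifying the /dev/nvidia* device
# files itself. Note: on RHEL6 this was observed to make glxgears
# stutter (see the report earlier in this thread).
#options nvidia NVreg_ModifyDeviceFiles=0
```

With everything commented out, the module loads with its upstream defaults, which is the behaviour proposed above.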

I'm hoping that will at least give us a minimalistic working base.

Thoughts?

Phil