[elrepo] RE driver update incompatibility issue

Wed Jan 30 08:05:00 EST 2013

On 28/01/13 21:26, Lamar Owen wrote:
> On 01/28/2013 02:54 PM, Phil Perry wrote:
>> On 25/01/13 23:13, Lamar Owen wrote:
>>>
>>> On Jan 25, 2013, at 4:03 PM, Nux! wrote:
>>>
>>>> - why not in cases like this send a Requires for some noarch that
>>>> executes a script and does a "yum replace"[1] based on pci id?
>>>>
>>>> How does that sound?
>>>>
>>>> [1] -
>>>> http://dl.iuscommunity.org/pub/ius/stable/Redhat/6/SRPMS/yum-plugin-replace-0.2.5-1.ius.el6.src.rpm
>>>>
>>>
>>> Ah! It _does_ exist!
>>>
>>> Having gone through the 173.xx before, and now 304xx with three
>>> separate boxes, this would be very nice, IMO.
>>>
>>> As it was, I had several packages that required nvidia-x11-drv
>>> installed, and had to reinstall them as well when 'crossgrading' from
>>> nvidia-x11-drv to nvidia-x11-drv-304xx and friends.
>>>
>>
>> Hi Nux, Lamar,
>>
>> Thanks for the suggestions. I apologize, I've just got back from a
>> weekend away so haven't yet had time to look at it in any detail.
>
> Phil,
>
> First, let me say that I thank you for the packaging, for sure. I
> personally have no complaints...... and I'll apologize for the length of
> this post ahead of time.... :-)
>

No need to apologize - it got me thinking for sure!

>>
>> But, if I understand correctly, the above yum plugin allows one to do
>> something like 'yum replace pkg1 --replace-with pkg2'.
>>
>> This requires two additional packages to be installed - the noarch
>> containing the script and the yum replace plugin.
>>
>> Is yum replace really needed? Couldn't one just do:
>>
>> yum erase --nodeps kmod-nvidia nvidia-x11-drv
>> yum install kmod-nvidia-304xx nvidia-x11-drv-304xx
>>
>> which should achieve the same thing?
>>
>
> I'm not sure that it would necessarily have the same effect;
> particularly if you need nvidia-x11-drv-32bit (which I do, for several
> things). Call me old-school, but --nodeps tastes like paregoric. Package
> replacement should be fully depsolved, IMO. I should be able to say 'yum
> replace nvidia-x11-drv --replace-with nvidia-x11-drv-304xx' and it pick
> up the proper kmod to replace, along with devel packages and the 32-bit
> compat package. And, when I upgrade my video card, going the other way
> with an equally terse line would be very nice indeed.
>

Agreed. Further, my "solution" would also lose any user additions to 
xorg.conf.

My "concern/objection" was more related to forcing installation of a 3rd 
party yum plugin package onto the user's system (although if we did this 
I guess we'd have to provide the yum plugin package too for repo 
closure). We are reluctant even to force (Require) users to install the 
yum fastestmirror plugin, which is available in the distribution repos 
and provides great benefit to both elrepo infrastructure and users.

>> Anyway, my main concern here is that we over complicate matters
>> looking for a solution to a problem that arises very infrequently (~
>> every 5 years based on the last legacy release?) and is relatively
>> easy to resolve when it does arise - i.e, it's easy to yum downgrade
>> or yum erase and yum install the previous version.
>>
> Actually, if you're using NetworkManager and using per-user network
> profiles (like wireless logins) it's not quite as straightforward. I've
> dealt with that before, in Fedoras 12, 13, and 14 using a different
> repo's nvidia driver, and getting the kernel and the driver out of sync
> (I _love_ kabi-versioning kmods like the ones ELrepo builds!). It
> becomes a catch-22; at that time you had to have the GUI to get
> networking, but the GUI wouldn't come up due to the driver not being
> in-sync, and so you were dropped to a command line with no networking,
> and ifup may or may not work.... I had that happen twice before I got
> wise to it, and made sure the updated kmod was available before allowing
> yum to upgrade the kernel or the x11 driver (which was decoupled more
> from the kmod than it is with the ELrepo packages). I could see
> something similar happen in EL6. At least with the ELrepo packages the
> versioning is set up better than what I was dealing with with F1[234],
> where the kabi could indeed change in an update.
>
> And I realize those are corner cases; they just happen to be corner
> cases I've experienced.
>

Corner cases are great, aren't they? That's a good one!

Here's another that can catch you out if you are not careful when using 
kmods. Suppose we build a kmod that causes a kernel panic (it's 
happened, a simple typo in some backported code can easily cause this). 
And suppose this kmod is weak-linked to every kernel on your system (you 
might only have 3 kernels under the default settings). Now every kernel 
on your system panics and you can't boot your system - not very user 
friendly!
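
To see which kernels a given kmod is weak-linked into, something like 
the following could be used. This is only a sketch: it assumes the usual 
layout where symlinks live under /lib/modules/<kernel>/weak-updates/, 
and the function name kernels_with_weak_link is made up for 
illustration.

```shell
#!/bin/sh
# Sketch only. Assumes weak-linked modules appear as symlinks under
# <modroot>/<kernel-version>/weak-updates/. The function name is hypothetical.

# Usage: kernels_with_weak_link MODROOT MODULE
# Prints one kernel version per line for every kernel that has
# MODULE.ko somewhere under its weak-updates directory.
kernels_with_weak_link() {
    modroot=$1
    module=$2
    for dir in "$modroot"/*/weak-updates; do
        # Skip the unexpanded glob when no kernel has a weak-updates dir.
        [ -d "$dir" ] || continue
        if [ -n "$(find "$dir" -name "$module.ko" -print 2>/dev/null)" ]; then
            kver=${dir%/weak-updates}   # strip the trailing directory
            echo "${kver##*/}"          # keep only the kernel version
        fi
    done
}
```

For example, `kernels_with_weak_link /lib/modules nvidia` would list the 
kernels that pick up nvidia.ko through a weak-updates symlink. The 
kernel the kmod was actually built for is not listed, as the module is 
installed there directly rather than linked.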

What's worse is that it's difficult for us to test against such things. 
Although we can sanity test that the package installs, some modules 
won't load if the hardware isn't present, so often we can't modprobe 
the module and therefore can't actually test the module itself.

The solution to this particular conundrum is to keep a non-kABI 
compatible kernel installed on your system (I would recommend one of 
Alan's longterm kernel [kernel-lt] packages), as these will never be 
affected by a bad kmod and are thus always available to boot in a kmod 
emergency. Boot into that kernel, then simply uninstall the offending 
kmod.

Anyway, I digress somewhat, but the point we both make contributes 
significantly towards my caution and reluctance to make large sweeping 
changes for little perceived benefit. Elrepo does its utmost to always 
remember the underlying philosophy of stability in Enterprise Linux.

> And when we're talking about video drivers, robustness of the process is
> king, as when things break here, you get command line (which is fine for
> me, but I'm not the typical user).
>

Absolutely, and as I said above, elrepo's approach will always be one of 
caution, which we feel is in keeping with Enterprise Linux, as opposed 
to a release-now, fix-it-later approach.

>> I think the main issue here is that some folks have a tendency to
>> install and forget stuff on their systems and then rather than taking
>> at least some responsibility for maintaining what they've installed
>> from 3rd party repos, choose to blame everyone but themselves when
>> something goes wrong.
>
> Oh, I personally agree with that statement, since people really need to
> track what they have loaded (I did, and I didn't get bit). But having
> done packaging for a number of years, several years ago, for PostgreSQL,
> I do have a bit of compassion for the 'install-it-and-forget-it' crowd,
> as well as for the packager who is saddled with upstream's
> incompatibility decisions.....
>

Yes, for sure it's a balance and experience is a wonderful thing. I by 
no means claim to be an expert or have all the answers, but we (elrepo) 
were very lucky to have Dag join us at a very early stage and he has 
been instrumental in guiding us in the packaging decisions we make.

> And less is definitely more, IMO, since the more you try to do in
> packaging, the more tends to break. Been there, broke that. Had some
> people rather angry with me, and for good reason.....
>

Again, we are very much on the same page here and Dag has consistently 
tried to steer us in the less is more direction and to look for 
precedent elsewhere in what we seek to do in packaging terms.

>> On a technical level, I'm still trying to get my head around how Nux's
>> suggestion might work. Once a yum transaction is underway we seem
>> fairly limited in what we can do.
>>
> I have run up against this myself, back in PostgreSQL 6/7 days. RPM
> scriptlets are quite constricted in their ability to do things
> (contrasted with Debian dpkgs, which have a lot of leeway, relatively
> speaking). I had grand plans of doing PostgreSQL dumps and restores
> inside scriptlets; Jeff Johnson straightened me out pretty quickly and
> made me realize that, since the scriptlets also may have to run inside
> an anaconda chroot, there are things you just can't (reliably) do. And
> since anaconda allows third-party repo selection during install these
> days, it's still just as true today as it was in 1999, even for ELrepo
> packages. Kickstarts are also a case in point, there.
>

Yes, there will always be a whole host of ways users come up with of 
(ab)using our packages that we never envisaged or tested for. We already 
know that some of our kmods break when installed from a kickstart and I 
have no idea how they behave under anaconda.

> The only thing I can think of that would improve the general user
> experience in the long run would complicate things for you as the
> packager, and would dramatically increase the package size. And that's
> to package all four 'trains' (to use a Cisco IOS-ism) of nvidia binary
> and select, udev-style, which one gets loaded at either install-time or
> boot-time. I'm not sure it's worth it to do all that, especially for
> something that, as you say, doesn't tend to happen very often at all. So
> someone would install nvidia-x11-drv and dependencies, and those
> packages would select, maybe at boot time, which actual binaries need to
> load based on udev. That's a pretty major undertaking, IMO. A cool side
> effect would be painless hardware upgrades and downgrades within the
> nvidia family; if the driver selects during boot which modules to load
> then you get the best driver for whatever hardware you have installed.
> But I say that being completely ignorant of how udev-aware the nvidia
> stuff is...... :-)
>

Actually we had previously considered packaging multiple modules in one 
kmod package to overcome kABI breakage so this is not something totally 
new to us.

But here the issue is further complicated by the need to provide matched 
versions of the binary X11 components, and that makes the issue a whole 
*lot* more complicated, as it's no longer just a case of managing a few 
different versions of one kernel module but also the many associated X11 
libraries that are statically linked.

>> Originally I decided against echoing a warning to the console as I
>> felt this was little more than repeating the warning the driver
>> already logs to /var/log/messages, but in hindsight this might not
>> have been such a bad idea. We could put a script in %post to check
>> compatibility and echo a recommendation for the correct legacy driver
>> together with a link to the url for the relevant documentation.
>
> This could be done; in the anaconda chroot the user won't see the
> warning, though (some folk actually do updates using discs; I seem to
> remember seeing a post along those lines on the CentOS list
> recently..... of course, most of those won't go select third-party repos
> during the update, but it is remotely possible). These days most people
> update with yum, and in that context it shouldn't be a problem.
>
>>
>> But let's discuss it some more as at the very least it's an interesting
>> packaging challenge :-)
>
> Indeed, and I appreciate your openness to such a challenge.
>

The folks at elrepo all come from an academic background, so 
investigating challenges is in our very nature :-)

> The pipe-dream, and a 'better than Windows' experience, is a single
> package set that covers all legacy versions plus the current version and
> leverages udev to load the right bits at boot time. I have no clue how
> difficult that would be to implement, other than it's likely to be
> pretty hard.
>

I know you didn't mean it like this, but I'm really not interested in 
better than Windows, and I'm sure you agree we need a package solution 
that works well for us under the framework in which we operate. To this 
end, I see little precedent for unified packaged solutions - the world 
of RPM seems to operate under the precedent of the user selecting the 
correct version. For example, I see plenty of precedent for this in RHEL 
in the form of samba and samba3x, and php and php53 package sets. Again, 
these solutions might not be ideal, but they are what Enterprise Linux 
users have.

> As I say, I'm not complaining about the status quo; I'm very grateful
> for all the work that you do and have done in just getting us the bits
> packaged, and packaged very well indeed, IMO.
>

Thanks, and even though it's only really one or two users complaining we 
do listen and take on board that criticism/feedback and are constantly 
looking to evolve/improve our packages.

But I think in this case it has to be evolution rather than revolution 
and we need to look at ways to improve what we have rather than reinvent 
the nvidia wheel from scratch.

So, to this end, what _can_ we do?

I'm more than willing to reconsider / revisit the idea of echoing a 
warning to the console upon package installation/updating that will warn 
a user if the version being installed/updated does not support the 
detected hardware. This might look something like this:

WARNING: This version of the NVIDIA driver does not support the detected 
hardware.
Please uninstall this driver and install the kmod-nvidia-304xx legacy driver
Please visit: http://elrepo.org/tiki/kmod-nvidia-304xx

Secondly, we could package and make available the script / program that 
detects the nvidia hardware and advises users on the correct driver 
installation. That would be easy, as we'd need to write / develop such a 
script for the console warning above anyway. Then if users are unsure 
which driver they need, they can run the script to find out before 
downloading and installing a large nvidia package.
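
A very rough sketch of what such a detection script might look like is 
below. It is not a finished implementation: the function name is made 
up, and the device IDs in LEGACY_304XX_IDS are illustrative placeholders 
only. A real script would need the full list of IDs from NVIDIA's 
supported-hardware documentation for each driver series.

```shell
#!/bin/sh
# Sketch only. The IDs below are placeholders for illustration, NOT an
# authoritative list of hardware that needs the 304xx legacy driver.
LEGACY_304XX_IDS="0040 0091"

# Reads `lspci -n` style lines on stdin, e.g.
#   01:00.0 0300: 10de:0091 (rev a1)
# and prints the suggested kmod package for the first NVIDIA display
# device found. Returns 1 (printing nothing) if there is none.
nvidia_suggest_driver() {
    while read -r slot class ids rest; do
        # Only consider VGA (0300) and 3D (0302) controllers.
        case "$class" in
            0300:|0302:) ;;
            *) continue ;;
        esac
        vendor=${ids%%:*}
        device=${ids##*:}
        # 10de is the NVIDIA PCI vendor ID.
        [ "$vendor" = "10de" ] || continue
        for id in $LEGACY_304XX_IDS; do
            if [ "$device" = "$id" ]; then
                echo "kmod-nvidia-304xx"
                return 0
            fi
        done
        echo "kmod-nvidia"
        return 0
    done
    return 1
}
```

A user would run it as `lspci -n | nvidia_suggest_driver`. Reading from 
stdin rather than calling lspci directly keeps the check easy to test 
against sample input, and the same function could be reused to decide 
whether to echo the console warning.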

Of course what none of this does is prevent the scenario where a user of 
older 6xxx/7xxx hardware updates from version 304 to version 310 (or 
above), although the console warning may help.

And as always, we welcome other suggestions.