NVIDIA Transitions Fully Towards Open-Source Linux GPU Kernel Modules

hypeatei
55 replies
23h35m

How is the NVIDIA driver situation on Linux these days? I built a new desktop with an AMD GPU since I didn't want to deal with all the weirdness of closed source or lacking/obsolete open source drivers.

mepian
20 replies
23h17m

The current stable proprietary driver is a nightmare on Wayland with my 3070, constant flickering and stuttering everywhere. Apparently the upcoming version 555 is much better, so I'm sticking with X11 until it comes out. I haven't tried the open-source one yet; not sure if it supports my GPU at all.

JasonSage
10 replies
20h43m

In defense of the parent, "upcoming" can still be a relative term, albeit a bit misleading. For example: I'm running the 550 drivers still because my upstream nixos-unstable doesn't have 555 for me yet.

mananaysiempre
5 replies
19h39m

nixos-unstable doesn't have 555

Version 555.58.02 is under “latest” in nixos-unstable as of about three weeks ago[1]. (Somebody should check with qyliss if she knows the PR tracker is dead... But the last nixos-unstable bump was two days ago, so it’s there.)

[1] https://github.com/NixOS/nixpkgs/commit/4e15c4a8ad30c02d6c26...

JasonSage
4 replies
18h50m

`nvidia-smi` shows that my driver version is 550.78. I ran `nixos-rebuild switch --upgrade` yesterday. My nixos channel is `nixos-unstable`.

Do you know something I don't? I'd love to be on the latest version.

I should have written my post better; it implies that 555 does not exist in nixpkgs, which I never meant. There's certainly a phrasing that captures what I'm seeing more accurately.

mananaysiempre
1 replies
6h25m

I did not mean to chastise you or anything, just to suggest that you could have a newer driver, in case you had missed the possibility.

The thing is, AFAIU, NVIDIA has several release channels for their Linux driver[1] and 555 is not (yet?) the "production" one, which is what NixOS defaults to (550 is). If you want a different degree of freshness for your NVIDIA driver, you need to say so explicitly[2]. The necessary incantation should be

  hardware.nvidia.package = config.boot.kernelPackages.nvidiaPackages.latest;
This is somewhat similar to how you get a newer kernel by setting boot.kernelPackages to linuxPackages_latest, in case you've ever done that.
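Putting the two side by side, the relevant bits of configuration.nix would look something like this (a sketch; pick whichever channels you actually want):

  # both "freshness" knobs together in configuration.nix
  boot.kernelPackages = pkgs.linuxPackages_latest;
  hardware.nvidia.package = config.boot.kernelPackages.nvidiaPackages.latest;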

[1] https://www.nvidia.com/en-us/drivers/unix/

[2] https://nixos.wiki/wiki/Nvidia

JasonSage
0 replies
2h32m

I had this configuration but was missing a flake update to move my nixpkgs forward, despite the channel setting; looking back, I understand it much better now.

Thanks for the additional info, this HN thread has helped me quite a bit.

atrus
1 replies
16h43m

Are you using flakes? If you don't do `nix flake update` there won't be all that much to update.
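Roughly (assuming your flake lives in the current directory; "myhost" is a placeholder for your configuration name):

  nix flake update                             # refresh flake.lock, nixpkgs input included
  sudo nixos-rebuild switch --flake .#myhost   # rebuild against the updated lock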

JasonSage
0 replies
15h39m

I am! I forgot about this. Mental model check happening.

(Still on 550.)

zxexz
1 replies
15h52m

I love NixOS, and the nvidia-x11 package is truly wonderful and captures so many options. But having such a complex package makes updating and regression testing take time. For ML stuff I ended up using it as the basis for an overlay and ripping out literally everything I don't need, which usually makes it a matter of minutes to make the changes required to upgrade when a new driver is released. I'm running completely headless because these are H100 nodes, and I just need persistenced and fabricmanager, and GDRMA (which wasn't working at all, causing me to go down this rabbit hole of stripping everything away until I could figure out why).

postcert
0 replies
2h5m

I was going to say specialisations might be useful for you to keep a previous driver version around for testing, but you might be past that point!

Having the ability to keep alternate configurations for $previous_kernel and $nvidia_stable has been super helpful for diagnosing instead of rolling back.
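For reference, a minimal sketch of that kind of specialisation in configuration.nix (the kernel and driver channel choices here are just illustrative):

  # a fallback boot entry with the default kernel and the "production" driver
  specialisation.fallback.configuration = {
    boot.kernelPackages = lib.mkForce pkgs.linuxPackages;
    hardware.nvidia.package =
      lib.mkForce config.boot.kernelPackages.nvidiaPackages.production;
  };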

mepian
0 replies
19h51m

Yep, I'm on openSUSE Tumbleweed, and it's not rolled out there yet. I would rather wait than update my drivers out-of-band.

llmblockchain
3 replies
23h15m

I have a 3070 on X and it has been great.

levkk
2 replies
22h15m

Same setup here. Multiple displays don't work well for me. One of the displays often doesn't get detected after resuming from the screen saver.

llmblockchain
1 replies
21h37m

I have two monitors connected to the 3070 and it works well. The only issue I had was suspending: the GPU would "fall off the bus" and not get its power back when the PC woke up. I had to add the kernel parameter "pcie_aspm=off" to prevent the GPU from falling asleep.

So... not perfect, but it works.
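For anyone wanting to do the same, this is roughly the flow on a GRUB-based distro (file paths and regeneration commands vary by distro):

  # /etc/default/grub: append the parameter to the kernel command line
  GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pcie_aspm=off"
  # then regenerate the config and reboot
  sudo update-grub   # or: sudo grub2-mkconfig -o /boot/grub2/grub.cfg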

josephg
0 replies
19h54m

Huh. I’m using 2 monitors connected to a 4090 on Linux mint - which is still using X11. It works flawlessly, including DPI scaling. Wake from sleep is fine too.

I haven’t tried wayland yet. Sounds like it might be time soon given other comments in this thread.

misterbishop
2 replies
22h48m

this is resolved in 555 (currently running 555.58.02). my asus zephyrus g15 w/ 3060 is looking real good on Fedora 40. there's still optimizations needed around clocking, power, and thermals. but the graphics presentation layer has no issues on wayland. that's with hybrid/optimus/prime switching, which has NEVER worked seamlessly for me on any laptop on linux going back to 2010. gnome window animations remain snappy and not glitchy while running a game. i'm getting 60fps+ running baldurs gate 3 @ 1440p on the low preset.

robviren
1 replies
22h33m

Had a similar experience with my Legion 5i 3070 with Wayland and Nvidia 555, but my HDMI out is all screwed up now, of course. It works on 550. One step forward and one step back.

misterbishop
0 replies
16h39m

is there a mux switch?

gmokki
0 replies
10h23m

I switched to Wayland 10 years ago when it became an option on Fedora. The first thing I had to do was drop NVIDIA and switch to an Intel GPU, and for the past 5 years an AMD GPU. It makes a big difference if the hardware is supported by the upstream kernel.

Maybe NVIDIA drivers have kind of worked on the 12-month-old kernels that Ubuntu uses on average.

anon291
17 replies
23h11m

I've literally never had an issue in decades of using NVIDIA and Linux. They're closed source, but the drivers work very consistently for me. NVIDIA's just the only option if you want something that's actually good and can run ML workloads as well.

sqeaky
10 replies
23h5m

but the drivers work very consistently for me

The problem with comments like this is that, on your particular graphics card or laptop, you never know whether you'll end up with my experience or yours.

I have tried nvidia a few times and kept getting burnt. AMD just works. I don't get the fastest ML machine, but I am just a tinkerer there and OpenCL works fine for my little toy apps and my 7900XTX blazes through every wine game.

If you need it professionally then you need it, warts and all. For any casual user, that 10% extra gaming performance needs to be weighed against reliability.

Workaccount2
7 replies
22h40m

It also depends heavily on the user.

A mechanic might say "This car has never given me a problem" because the mechanic doesn't consider cleaning an idle bypass circuit or adjusting valve clearances to be a "problem". To 99% of the population, though, those are expensive and annoying problems, because they have no idea what those words even mean, much less have the ability to troubleshoot, diagnose, and repair.

chasil
3 replies
22h0m

If you use a search engine for "Torvalds Nvidia" you will discern a certain attitude towards Nvidia as a corporation and its products.

This might provide you a suggestion that alternate manufacturers should be considered.

I have confirmed this to be the case on Google and Bing, so DuckDuckGo and Startpage will also exhibit this phenomenon.

Dylan16807
1 replies
16h40m

An opinion on support from over ten years ago is not a very strong suggestion.

chasil
0 replies
34m

Your problem there is that both search engines place this image and backstory at the top of the results, so neither Google nor Bing agree with any of you.

If you think they're wrong, be sure to let them know.

dahart
0 replies
2h51m

Torvalds has said nasty mean things to a lot of people in the past, and expressed regret over his temper & hyperbole. Try searching for something more recent https://youtu.be/wvQ0N56pW74

lyu07282
2 replies
20h59m

A lot of it probably has to do with not really understanding their distribution's package manager, and LKMs specifically. I've also always suspected that most Linux users don't know whether they are using Wayland or X11, and that the issues they had were actually Wayland-specific ones they wouldn't have with Nvidia/X11. And come to think of it, how would they even know it's a GPU driver issue in the first place? Guess I'm the mechanic in your analogy.

vetinari
0 replies
3h32m

If there's an issue with Nvidia/Wayland and there isn't with AMD/Wayland or Intel/Wayland, then it is an Nvidia issue, not a Wayland one.

sqeaky
0 replies
20h55m

When I run Gentoo or Arch, I know. But when I run Ubuntu or Fedora, should I have needed to know?

On plenty of distros, "I want to install it and forget about it" is reasonable, and on both Gentoo and Ubuntu I have rebooted from a working system into a system where the display stopped working; at least on Gentoo I was ready, because I broke it somehow.

lmm
1 replies
18h45m

AMD just works. I don't get the fastest ML machine, but I am just a tinkerer there and OpenCL works fine for my little toy apps and my 7900XTX blazes through every wine game.

That's the opposite of my experience. I'd love to support open-source. But the AMD experience is just too flaky, too card-dependent. NVidia is rock-solid (maybe not for Wayland, but I never wanted Wayland in the first place).

sqeaky
0 replies
3h28m

What kind of flakiness? The only AMD GPU problem I have had involved a lightning strike killing a card while I was gaming.

My nvidia problems are generally software and update related. The NVidia stuff usually works on popular distros, but as soon as anything custom happens or a surprise update lands, there is a chance things break.

resoluteteeth
0 replies
18h45m

Are you using wayland or are you still on x11? My experience was that the closed source drivers were fine with x11 but a nightmare with wayland.

pizza234
0 replies
19h45m

Up to a couple of years ago, before permanently moving to AMD GPUs, I couldn't even boot Ubuntu with an Nvidia GPU. This was because Ubuntu booted by default with Nouveau, which didn't support a few/several series (I had at least two different series).

The cards worked fine with binary drivers once the system was installed, but AFAIR, I had to integrate the binary driver packages into the Ubuntu ISO in order to boot.

I presume that the situation is much better now, but requiring binary drivers can be a problem in itself.

l33tman
0 replies
21h31m

Same here, been using the nvidia binary drivers on a dozen computers with various other HW and distros for decades with never any problems whatsoever.

isatty
0 replies
9h43m

Likewise. Rock solid for decades with Intel + Nvidia proprietary drivers, even when doing things like hot plugging for passthroughs.

bobajeff
0 replies
22h41m

I did, when my card stopped being supported by all the distros because it was too old, while the legacy driver didn't fully work the same.

Keyframe
0 replies
10h51m

Me too. Now I have a laptop with discrete nvidia and an eGPU with 3090 in it, a desktop with 4090, another laptop with another discrete nvidia.. all switching combinations work, acceleration works, game performance is on par with windows (even with proton to within a small percentage or even sometimes better). All out of the box with stock Ubuntu and installing driver from Nvidia site.

The only "trick" is I'm still on X11 and probably will stay. Note that I did try wayland on few occasions but I steered away (mostly due to other issues with it at the time).

segmondy
1 replies
22h27m

Plug, install, then play. I've got 3 different Nvidia GPU sets, all running without any issue; nothing crazy to do but follow the installation instructions.

anonym29
0 replies
15h44m

To some of us, running any closed source software in userland qualifies as quite crazy indeed.

green-salt
1 replies
22h56m

Whatever pop_os uses has been quite stable for my 4070.

tormeh
0 replies
22h38m

Pop uses X by default because of Nvidia.

drdaeman
1 replies
19h34m

3090 owner here.

Wayland is an even worse mess than it normally is. It used to flicker real bad before 555.58.02, less so with the latest driver, but it still has some glitches with games. A bunch of older Electron apps still fail to render anything and require hardware acceleration to be disabled. I gave up trying to make it all work: I can't get rid of all the flicker and drawing issues, plus Wayland seems to be a real pain in the ass with HiDPI displays.

X11 sort of works, but I had to entirely disable DPMS or one of my monitors never comes back online after going to sleep. I thought it was my KVM messing up, but that happened even with a direct connection... no idea what's going on there.

CUDA works fine, save for the regular version compatibility hiccups.

senectus1
0 replies
18h52m

4070 Ti Super here, X11 is fine, I have zero issues.

Wayland is mostly fine, though I get some window frame glitches when maximizing windows to the monitor, and another issue that I'm pretty sure is Wayland, but it has only happened a couple of times and it locks the whole device up. I can't prove it yet.

art0rz
1 replies
23h9m

I've been running Arch with KDE under Wayland on two different laptops both with NVIDIA GPUs using proprietary drivers for years and have not run into issues. Maybe I'm lucky? It's been flawless for me.

lyu07282
0 replies
21h35m

The experiences always vary quite a lot; it depends so much on what you do with it. For example, Discord doesn't support screen sharing with Wayland. It's just one small example, but those can add up over time. Another example is display rotation, which was broken in KDE for a long time (recently fixed).

tgsovlerkhgsel
0 replies
11h29m

My experience with an AMD iGPU on Linux was so bad that my next laptop will be Intel. Horrible instability to the point where I could reliably crash my machine by using Google Maps for a few minutes, on both Chrome and Firefox. It got fixed eventually - with the next Ubuntu release, so I had a computer where I was afraid to use anything with WebGL for half a year.

tadasv
0 replies
23h30m

great. rtx 4090 works out of the box after installing drivers from non-free. That's on debian bookworm.

mathfailure
0 replies
19h37m

Depends on the version of the drivers: the 550 version results in a black screen (you have to kill and restart the X server) after waking up from sleep. The 535 version doesn't have this bug. Don't know about 555.

Also tearing is a bitch. Still. Even with ForceCompositionPipeline.

jppittma
0 replies
23h23m

4070 worked out of the box on my arch system. I used the closed source drivers and X11 and I've not encountered a single problem.

My prediction is that it will continue to improve if only because people want to run nvidia on workstations.

jcranmer
0 replies
23h20m

I built my new-ish computer with an AMD GPU because I trusted in-kernel drivers more than out-of-kernel DKMS drivers.

That said, my previous experience with the DKMS driver stuff hasn't been bad. If you use Nvidia's proprietary driver stack, then things should generally be fine. The worst issues are that Nvidia has (historically, at least; it might be different for newer cards) refused to implement some graphics features that everybody else uses, which means that you basically need entirely separate codepaths for Nvidia in window managers, and some of them have basically said "fuck no" to doing that.

devwastaken
0 replies
15h25m

KDE Plasma 6 + Nvidia beta 555 works well. I have to make .desktop files to launch some applications explicitly under Wayland.
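Something like this is what I mean (a sketch; the Ozone flags are the usual Chromium/Electron ones, and the app name is a placeholder):

  # ~/.local/share/applications/someapp-wayland.desktop
  [Desktop Entry]
  Type=Application
  Name=SomeApp (Wayland)
  Exec=someapp --enable-features=UseOzonePlatform --ozone-platform=wayland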

adrian_b
0 replies
13h46m

I am not using Wayland and I do not have any intention to use it, therefore I do not care for any problems caused by Wayland not supporting NVIDIA and demanding that NVIDIA must support Wayland.

I am using only Linux or FreeBSD on all my laptop, desktop or server computers.

On desktop and server computers I did not ever have the slightest difficulty with the NVIDIA proprietary drivers, either for OpenGL or for CUDA applications or for video decoding/encoding or for multiple monitor support, with high resolution and high color depth, on either Gentoo/Funtoo Linux or FreeBSD, during the last two decades. I also have AMD GPUs, which I use for compute applications (because they are older models, which still had FP64 support). For graphics applications they frequently had annoying bugs, unlike NVIDIA (however my AMD GPUs have been older models, preceding RDNA, which might be better supported by the open-source AMD drivers).

The only computers on which I had problems with NVIDIA on Linux were those laptops that used the NVIDIA Optimus method of coexistence with the Intel integrated GPUs. Many years ago I needed a couple of days to properly configure the drivers and additional software so that the NVIDIA GPU was selected when desired, instead of the Intel iGPU. I do not know if any laptops with NVIDIA Optimus still exist. The laptops that I bought later had video outputs directly from the NVIDIA GPU, so there was no difference between them and desktops, and the NVIDIA drivers worked flawlessly.

Both on Gentoo/Funtoo Linux and FreeBSD I never had to do anything else but to give the driver update command and everything worked fine. Moreover, NVIDIA has always provided a nice GUI application "NVIDIA X Server Settings", which provides a lot of useful information and which makes very easy any configuration tasks, like setting the desired positions of multiple monitors. A few years ago there was nothing equivalent for the AMD or Intel GPU drivers, but that might have changed meanwhile.

DaoVeles
0 replies
20h57m

I have never had an issue with them. That said, I typically go mid-range on cards, so they are usually a hardened architecture due to a year or two of being in the high end.

bradyriddle
39 replies
23h30m

I remember Nvidia getting hacked pretty bad a few years ago. IIRC, the hackers threatened to release everything they had unless they open sourced their drivers. Maybe they got what they wanted.

[0] https://portswigger.net/daily-swig/nvidia-hackers-allegedly-...

dralley
28 replies
23h10m

I doubt it. It's probably a matter of constantly being prodded by their industry partners (i.e. Red Hat), constantly being shamed by the community, and reducing the amount of maintenance they need to do to keep their driver stack updated and working on new kernels.

The meat of the drivers is still proprietary, this just allows them to be loaded without a proprietary kernel module.

kabes
12 replies
21h46m

It's hard to believe one of the highest valued companies in the world cares about being shamed for not having open source drivers.

commodoreboxer
5 replies
21h18m

They care when it affects their bottom line, and customers leaving for the competition does that.

I don't know if that's what's happening here, honestly, and you're right that they don't care about being shamed. But building a reputation of being hard to work with and hard to target, especially in a growing market like Linux (still tiny, but growing nonetheless, and becoming significantly more important in areas where non-gaming GPU use is concerned), can start to erode sales and B2B relationships; the latter particularly if you make the programmers and PMs hate using your products.

bryanlarsen
3 replies
20h51m

in a growing market like Linux

Isn't Linux 80% of their market? ML et al is 80% of their sales, and ~99% of that is Linux.

fngjdflmdflg
1 replies
20h39m

True, although note that the Linux market itself is increasing in size due to ML. Maybe "increasingly dominant market" is a better phrase here.

bryanlarsen
0 replies
2h36m

Hah, good point. The OP was pedantically correct. The implication in "growing market share" is that "market share" is small, but that's definitely reading between the lines!

lmm
0 replies
18h48m

Right, and that's where most of their growth is.

gessha
0 replies
17h27m

customers leaving for the competition does that

What competition?

I do agree that companies don't really care for public sentiment as long as business is going as usual. Nvidia is printing money with their data center hardware [1], which accounts for half of their yearly revenue.

https://nvidianews.nvidia.com/news/nvidia-announces-financia...

nailer
4 replies
21h35m

Having products that require a bunch of extra work due to proprietary drivers, especially when their competitors don't require that work, is not good.

josefx
3 replies
12h54m

The biggest chunk of that "extra work" would be installing Linux in the first place, given that almost everything comes with Windows out of the box. An additional "sudo apt install nvidia-drivers" isn't going to stop anyone who already got that far.

sam_bristow
0 replies
11h51m

Does the "everything comes with Windows out of the box" still apply for the servers and workstations where I imagine the vast majority of these high-end GPUs are going these days?

nailer
0 replies
3h43m

Most cloud instances come with Linux out of the box.

Arch-TK
0 replies
51m

Tainted kernel. Having to sort out secure boot problems caused by use of an out of tree module. DKMS. Annoying weird issues with different kernel versions and problems running the bleeding edge.

ZeroCool2u
0 replies
18h18m

I mean I've personally given our Nvidia rep some light hearted shit for it. Told him I'd appreciate if he passed the feedback up the chain. Can't hurt to provide feedback!

p_l
11 replies
22h22m

I suspect it's mainly the reduced maintenance and the reduction in workload needed for support, especially with more platforms being supported (not so long ago there was no ARM64 nvidia support; now they are shipping their own ARM64 servers!)

What really changed the situation is that Turing architecture GPUs bring a new, more powerful management CPU, which has enough capacity to essentially run the OS-agnostic parts of the driver that used to be provided as a blob on Linux.

knotimpressed
10 replies
20h2m

Am I correct in reading that as Turing architecture cards include a small CPU on the GPU board, running parts of the driver/other code?

p_l
9 replies
19h14m

In the Turing microarchitecture, nVidia replaced their old "Falcon" CPU with an NV-RISCV RV64 core, running various internal tasks.

The "Open Drivers" from nVidia include different firmware that utilizes the new-found performance.

matheusmoreira
8 replies
10h46m

How well isolated is this secondary computer? Do we have reason to fear the proprietary software running on it?

p_l
7 replies
10h27m

As well isolated as anything else on the bus.

So you better actually use IOMMU
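If you want to see how well your platform actually isolates things, the usual trick is to walk the sysfs IOMMU groups (standard layout on any recent kernel):

  # list each IOMMU group and the PCI devices sharing it
  for g in /sys/kernel/iommu_groups/*; do
    echo "IOMMU group ${g##*/}:"
    for d in "$g"/devices/*; do lspci -nns "${d##*/}"; done
  done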

stragies
3 replies
7h42m

Ah, yes, the magical IOMMU controller, which everybody just assumes to be implemented perfectly across the board. I'm expecting this to be like Hyperthreading, where we find out 20 years later that the feature was faulty/maybe_bugdoored since inception in many/most/all implementations.

Same thing with USB3/TB-controllers, NPUs, etc that everybody just expects to be perfectly implemented to spec, with flawless firmwares.

p_l
2 replies
6h18m

It's not perfect or anything, but it's usually a step up[1]. The funny thing is that GPUs generally had fewer "interesting" compute facilities to jump over from; they were just usually easier to access. My first 64-bit laptop, my first Android smartphone, and the first few iPhones had more MIPS32le cores with possible DMA access to memory than main CPU cores, and that was just counting one component of many (the wifi chip).

Also, Hyperthreading wasn't itself faulty or "bugdoored". The tricks necessary to get high performance out of CPUs were, and then there was Intel deciding to drop various good precautions in the name of still higher single-core performance.

Fortunately, after several years, IOMMU availability has become more common (the laptop I'm writing this on seems to have proper separate groups for every device).

[1] There's always the OpenBSD route of navel-gazing about writing "secure" C code, becoming slowly obsolescent thanks to being behind in performance and features, and ultimately getting pwned because the focus on C, and the refusal to implement "complex" features that help mitigate access, results in a pwnable SMTPd running as root.

stragies
1 replies
3h59m

All fine and well, but I always come back to: "If I were a manufacturer/creator of some work/device/software that does something in the plausible realm of 'telecommunication', how do I make sure that my product can always comply with https://en.wikipedia.org/wiki/Lawful_interception requests? Allow for ingress/egress of data/commands at as low a level as possible!"

So as a director of a chipset company, it would seem like a no-brainer to have to tell my engineers, unfortunately, not to fix some exploitable bug in the IOMMU/chipset; unless I want to never sell devices that could potentially be used to move citizens' internet packets around in a large-scale deployment.

And implement/not_fix something similar in other layers as well, e.g. the ME.

p_l
0 replies
25m

If your product is supposed to comply with Lawful Interception, you're going to implement proper LI interfaces, not leave bullshit DMA bugs in.

The very point of Lawful Interception involves explicit, described interfaces, so that all parties involved can do the work.

The systems with LI interfaces also often end up in jurisdictions that simultaneously put high penalties on giving access to them without specific authorizations - I know, I had to sign some really interesting legalese once due to working in environment where we had to balance both Lawful Interception, post-facto access to data, and telecommunications privacy laws.

Leaving backdoors like that is for Unlawful Interception, and the danger of such approaches is clearly exposed in the form of Chinese intelligence services exploiting an NSA backdoor in Juniper routers (the infamous Dual_EC_DRBG RNG).

matheusmoreira
2 replies
6h9m

you better actually use IOMMU

Is this feature commonly present on PC hardware? I've only ever read about it in the context of smartphone security. I've also read that nvidia doesn't like this sort of thing because it allows virtualizing their cards, which is supposed to be an "enterprise" feature.

brendank310
1 replies
4h56m

Relatively common nowadays. It used to be delineated as a feature in Intel chips as part of their vPro line, but I think it's baked in now. Generally an IOMMU is needed for performant PCI passthrough to VMs, and Windows uses it for Device Guard, which tries to prevent DMA attacks.
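A quick way to check on a given box (assuming a typical distro kernel):

  sudo dmesg | grep -i -e dmar -e iommu   # was an IOMMU initialized at boot?
  ls /sys/kernel/iommu_groups/            # did devices land in isolation groups?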

p_l
0 replies
24m

Seems to me that Zen 4 has no issues at all, but bridges/switches require additional interfaces to further fan-out access controls.

chillfox
2 replies
17h49m

Nvidia has historically given zero fucks about the opinions of their partners.

So my guess is it's to do with LLMs. They are all in on AI, and having more of their code be part of training sets could make tools like ChatGPT/Claude/Copilot better at generating code for Nvidia GPUs.

jmorenoamor
0 replies
12h49m

I also see this as the main reason. GPU drivers for Linux, as far as I know, were just a niche use case; maybe CUDA planted a small seed, and the AI hype is the flower. Now the industry, not the users, demands drivers, so this became a demanded feature instead of a niche user wish.

A bit sad, but hey, welcome anyways.

da_chicken
0 replies
6h54m

Yup. nVidia wants those fat compute center checks to keep coming in. It's an unsaturated market, unlike gaming consoles, home gaming PCs, and design/production workstations. They got a taste of that blockchain dollar, and now AI looks to double down on the demand.

The best solution is to have the industry eat their dogfood.

nicce
4 replies
23h14m

The kernel modules are not the user-space drivers, which are still proprietary.

bradyriddle
2 replies
19h44m

Ooops. Missed that part.

Re-reading that story is kind of wild. I don't know how valuable what they allegedly got would be (silicon, graphics, and chipset files), but the hackers accused Nvidia of 'hacking back' and encrypting their data.

Reminds me of a story I heard about Nvidia hiring a private military company to guard their cards after entire shipments started getting 'lost' somewhere in Asia.

spookie
1 replies
17h18m

Wait what? That PMC story got me. Where can I find more info on that lmao?

porphyra
0 replies
23h9m

Much of the black magic has been moved from the drivers to the firmware anyway.

justinclift
3 replies
16h23m

For Nvidia, the most likely reason they've strongly avoided Open Sourcing their drivers isn't anything like that.

It's simply a function of their history. They used to have high priced professional level graphics cards ("Nvidia Quadro") using exactly the same chips as their consumer graphics cards.

The BIOS of the cards was different, enabling different features. So people wanting those features cheaply would buy the consumer graphics cards and flash the matching Quadro BIOS to them. Worked perfectly fine.

Nvidia naturally wasn't happy about those "lost sales", so began a game of whack-a-mole to stop BIOS flashing from working. They did stuff like adding resistors to the boards to tell the card whether it was a Geforce or Quadro card, and when that was promptly reverse engineered they started getting creative in other ways.

Meanwhile, they couldn't really Open Source their drivers because then people could see what the "Geforce vs Quadro" software checks were. That would open up software countermeasures being developed.

---

In the most recent few years the professional cards and gaming cards now use different chips. So the BIOS tricks are no longer relevant.

Which means Nvidia can "safely" Open Source their drivers now, and they've begun doing so.

--

Note that this is a copy of my comment from several months ago, as it's just as relevant now as it was then: https://news.ycombinator.com/item?id=38418278

1oooqooq
1 replies
15h1m

interesting timing to recall that story. now the same trick is used for h100 vs whatever the throttled-for-embargo-wink-wink Chinese version is called.

but those companies are really averse to open sourcing because they can't be sure they own all the code. it's decades of copy-pasting reference implementations after all

rfoo
0 replies
12h2m

now the same trick is used for h100 vs whatever the throttled-for-embargo-wink-wink Chinese version

No. H20 is a different chip designed to be less compute-dense (by having different combinations of SM/L2$/HBM controller). It is not a throttled chip.

A800 and H800 are A100/H100 with some area of the chip physically blown up and reconfigured. They are also not simply throttled.

SuperNinKenDo
0 replies
14h49m

Very interesting, thanks for the perspective. I suspect the recent loss of face they experienced with the transition to Wayland, happening around the same time this motivation evaporated, probably plays a part too.

I swore off ever again buying Nvidia, or any laptops that come with Nvidia, after all this. Maybe in 10 years they'll have managed to right the brand perceptions of people like myself.

nicman23
0 replies
7h24m

they did release it. a magic drive i have seen, but totally do not own, has it

snailmailman
12 replies
23h18m

Better as of extremely recently. Explicit sync fixes most of the issues with flickering that I’ve had on Wayland. I’ve been using the latest (beta?) driver for a while because of it.

I’m using Hyprland though so explicit sync support isn’t entirely there for me yet. It’s actively being worked on. But in the last few months it’s gotten a lot better

JasonSage
11 replies
22h14m

Better as of extremely recently.

Yup. Anecdotally, I see a lot of folks trying to run wine/games on Wayland reporting flickering issues that are gone as of version 555, which is the most recent release save for 560 coming out this week. It's a good time to be on the bleeding edge.

hulitu
3 replies
21h57m

You can always use X11. /s

bornfreddy
2 replies
11h20m

I know that was a joke, but - as someone who is still on X, what am I missing? Any practical advantages to using Wayland when using a single monitor on desktop computer?

vetinari
1 replies
3h39m

Even that single monitor can be HiDPI, VRR, or HDR (this one is still WIP).

Arch-TK
0 replies
31m

I have a 165 DPI monitor. This honestly just works with far less hassle on X. I don't have to listen to anyone try to explain to me how fractional scaling doesn't make sense (the real explanation for why it wasn't supported). I don't have to deal with some silly explanation for why XWayland applications just can't be non-blurry with a fractional or non-1 scaling factor. I can just set the DPI to the value I calculated and things work in 99% of cases. In 0.9% of cases I need to set an environment variable or pass a flag to fix a buggy application, and in the remaining 0.1% I need to make a change to the code.

VRR has always worked for me on single monitor X. I use it on my gaming computer (so about twice a year).

Fr0styMatt88
3 replies
16h53m

On latest NixOS unstable and KDE + Wayland is still a bit of a dumpster fire for me (3070 + latest NV drivers). In particular there’s a buffer wait bug in EGL that needs fixing on the Nvidia side that causes the Plasma UI to become unresponsive. Panels are also broken for me, with icons not showing.

Having said that, the latest is a pain on X11 right now as well, with frequent crashing of Plasma, which at least restarts itself.

There’s a lot of bleeding on the bleeding edge right at this moment :)

JasonSage
2 replies
16h48m

That's interesting, maybe it's hardware-dependent? I'm doing nixos + KDE + Wayland and I've had almost no issues in day-to-day usage and productivity.

I agree with you that there's a lot of bleeding. Linux is nicer than it used to be and there's less fiddling required to get to a usable base, but still plenty of fiddling as you get into more niche usage, especially when it involves any GPU hardware/software. Yet somehow one can run Elden Ring on Steam via Proton with a few mouse clicks and no issues, which would've been inconceivable to me only a few years ago.

Fr0styMatt88
1 replies
12h36m

Yeah it’s pretty awesome overall. I think the issues are from a few things on my end:

- I’ve upgraded through a few iterations starting with Plasma 6, so my dotfiles might be a bit wonky. I’m not using Home Manager so my dotfiles are stateful.

- Could be very particular to my dock setup as I have two docks + one of the clock widgets.

- Could be the particular wallpaper I’m using (it’s one of the dynamic ones that comes with KDE).

- It wouldn’t surprise me if it’s related to audio somehow as I have Bluetooth set-up for when I need it.

I’m sure it’ll settle soon enough :)

postcert
0 replies
2h34m

I've been having a similar flakiness with plasma on Nixos (proprietary + 3070 as well). Sadly can't say whether it did{n't} happen on another distro as I last used Arch around the v535 driver.

I found it funny how silently it would fail at times. After coming out of a game or focusing on something, I'd scratch my head as to where the docks/background went. I'd say you're lucky in that it recovered itself; generally I needed to run `plasmashell` in the alt+f2 run prompt.

asyx
2 replies
21h55m

I think it's X11 stuff that is using Vulkan for rendering that is still flickering in 555. This probably affects pretty much all of Proton / Wine gaming.

doix
1 replies
4h31m

Any specific examples that you know should be broken? I am on X11 with 555 drivers and an nvidia gpu. I don't have any flickering when I'm gaming, it's actually why I stay on X11 instead of transitioning to wayland.

johnny22
0 replies
39m

They are probably talking about running the game in a wayland session via xwayland, since wine's wayland driver is not part of proton yet.

modzu
3 replies
10h4m

why switch to amd and not just switch to X? :D

whalesalad
1 replies
6h56m

once you go Wayland you usually don’t go back :)

kiney
0 replies
6h4m

I tested wayland for a while to see what the hype is about. No upside, lots of small workflows broken. Back to Xorg it was.

account42
0 replies
9h33m

Why not both?

joecool1029
0 replies
21h36m

It's still buggy with sway on nvidia. I really thought the 555 driver would iron out the last of the issues, but it still has further to go. I switched to KDE Plasma 6 on wayland since then and it's been great, not buggy at all.

XorNot
0 replies
15h50m

Easy Linux use is what keeps me firmly on AMD. This move may earn them a customer.

berkeleyjunk
16 replies
23h34m

As someone who is pretty skeptical and reads the fine print, I think this is a good move and I really do not see a downside (other than the fact that this probably strengthens the nVidia monoculture).

vlovich123
15 replies
23h23m

AFAIK, all they did was move the closed-source user-space driver code to their opaque firmware blob, leaving a thin shim in the kernel.

In essence I don’t believe that much has really changed here.

stkdump
8 replies
23h12m

But the firmware runs directly on the hardware, right? So they effectively rearchitected their system to move what used to be 'above' the kernel to 'below' the kernel, which seems like a huge effort.

vlovich123
5 replies
22h56m

It's some effort, but I bet they added a classical serial CPU to run the existing code. In fact, [1] suggests that's exactly what they did. I suspect they had other reasons to add the GSP, so the amortized cost of moving the driver code to firmware was actually not that large all things considered, and in the long term it reduces their costs (e.g. they further reduce the burden of supporting multiple OSes, they can theoretically improve performance further, etc.)

[1] https://download.nvidia.com/XFree86/Linux-x86_64/525.78.01/R...

p_l
4 replies
22h12m

That's exactly what happened: the Turing microarchitecture brought in a new[1] "GSP" which is capable enough to run the task. A similar architecture exists, AFAIK, on Apple M-series, where the GPU runs its own instance of an RTOS talking with the "application OS" over RPC.

[1] The Turing GSP is not the first "classical serial CPU" in nvidia chips, just the first that has enough juice to do the task. Unfortunately, without recalling the name of the component it seems impossible to find it again, thanks to search results being full of nvidia ARM and GSP pages...

mepian
1 replies
21h58m

the name of the component

Falcon?

p_l
0 replies
20h50m

THANK YOU, that was the name I was forgetting :)

here's[1] a presentation from nvidia regarding a plan (unsure if completed or not) to replace Falcon with RISC-V; [2] suggests the GSP is in fact the "NV-RISC" mentioned in [1]. Some work on reversing Falcon was apparently done for Switch hacking[3]?

[1] https://riscv.org/wp-content/uploads/2016/07/Tue1100_Nvidia_... [2] https://www.techpowerup.com/291088/nvidia-unlocks-gpu-system... [3] https://github.com/vbe0201/faucon

knotimpressed
1 replies
19h48m

Would you happen to have a source or any further readings about Apple M-series GPUs running their own RTOS instance?

imtringued
1 replies
22h43m

Why? It should make it much easier to support Nvidia GPUs on Windows, Linux, Arm/x86/RISC-V and more OSes with a single firmware codebase per GPU now.

stkdump
0 replies
22h19m

Yes makes sense, in the long run it should make their life easier. I just suspect that the move itself was a big effort. But probably they can afford that nowadays.

adrian_b
5 replies
13h5m

Having all of the kernel, more precisely all of the privileged code, as open source is much more important for security than having all the firmware of the peripheral devices as open source.

Any closed-source privileged code cannot be audited and it may contain either intentional backdoors, or, more likely, bugs that can cause various undesirable effects, like crashes or privilege escalation.

On the other hand, in a properly designed modern computer any bad firmware of a peripheral device cannot have a worse effect than making that peripheral unusable.

The kernel should take care, e.g. by using the I/O MMU, that the peripheral cannot access anything where it could do damage, like the DRAM not assigned to it or the non-volatile memory (e.g. SSDs) or the network interfaces for communicating with external parties.

Even when the peripheral is as important as the display, a crash in its firmware would have no effect if the kernel reserved some key combination to reset the GPU. (While I am not aware of such a feature in Linux, its effect can frequently be achieved by switching, e.g. with Alt+F1, to a virtual console and then back to the GUI; the saving and restoring of the GPU state, together with the switching of video modes, is enough to clear some corruption caused by a buggy GPU driver or a buggy mouse or keyboard driver.)

In conclusion, making the NVIDIA kernel driver open source does not deserve to have its importance minimized. It is an important contribution to a more secure OS kernel.

The only closed-source firmware that must be feared is that which comes from the CPU manufacturer, e.g. from Intel, AMD, Apple or Qualcomm.

All such firmware currently includes various features for remote management that are not publicly documented, so you can never be sure whether they can be properly disabled, especially when the remote management can be done wirelessly, like through the WiFi interface of Intel laptop CPUs, where you cannot interpose an external firewall to filter any "magic" packets out of the network traffic.

A paranoid laptop user can circumvent the lack of control over the firmware blobs from the CPU manufacturer by disconnecting the internal antennas and using an external cheap and small single-board computer for all wired and wireless network access, which must run a firewall with tight rules. Such a SBC should be chosen among those for which complete hardware documentation is provided, i.e. including its schematics.

stragies
2 replies
7h59m

Everything you wrote assumes that the IOMMUs across the board are 100% correctly implemented, without errors/bugdoors.

People used to believe similar things about Hyperthreading, glitchability, ME, Cisco, boot-loaders, ... the list goes on.

adrian_b
1 replies
4h39m

There still is a huge difference between running privileged code on the CPU, where there is nothing limiting what it can do, and code that runs on a device, which should normally be contained by the I/O MMU, unless the I/O MMU is buggy.

The functions of an I/O MMU for checking and filtering the transfers are very simple, so the probability of non-intentional bugs is extremely small in comparison with the other things enumerated by you.

stragies
0 replies
4h16m

Agreed that the feature set of an IOMMU is fairly small, but isn't this function usually included in one of the chipset ICs, which run a lot of other code/functions alongside a (hopefully) faithfully correct IOMMU routine?

Which, to my eyes, would increase the possibility of other system parts mucking with IOMMU restrictions and/or triggering bugs.

saagarjha
1 replies
10h44m

Did you run this through a LLM? I'm not sure what the point is of arguing with yourself and bringing up points that seem tangential to what you started off talking about (…security of GPUs?)

adrian_b
0 replies
8h25m

I have not argued with myself. I do not see what made you believe this.

I have argued with "I don’t believe that much has really changed here", which is the text to which I have replied.

As I have explained, an open-source kernel module, even together with closed-source device firmware, is much more secure than a closed-source kernel module.

Therefore the truth is that a lot has changed here, contrary to the statement to which I have replied, as this change makes the OS kernel much more secure.

shanoaice
9 replies
6h57m

There is little point in NVIDIA open-sourcing only the kernel driver portion of their stack, since they heavily rely on proprietary firmware and a proprietary userspace library (most important!) to do the real job. Firmware is a relatively small issue; this is much the same for AMD and Intel, since encapsulation reduces the work done on the driver side, and open-sourcing firmware could allow people to make some really unanticipated modifications that might heavily threaten even commercial card sales. Nonetheless, at least AMD still keeps a fair share of the work in the driver compared to Nvidia. The userspace library is the worst problem, since it handles a lot of GPU-control functionality and the graphics APIs, and it is still kept closed source.

The best we can hope is that improvements to NVK and Red Hat's Nova driver can put pressure on NVIDIA to release their userspace components.

gpderetta
4 replies
5h10m

It is meaningful because, as you note, it enables a fully open-source userspace driver. Of course the firmware is still proprietary, and it contains more and more logic.

sscarduzio
0 replies
4h44m

Which in a way is good, because the hardware will increasingly perform identically on Linux and on Windows.

pabs3
0 replies
4h21m

The firmware is also signed, so you can't even do reverse engineering to replace it.

matheusmoreira
0 replies
4h11m

Doesn't seem like a bad tradeoff so long as the proprietary stuff is kept completely isolated with no access to any other parts of my system.

bayindirh
0 replies
2h43m

The GLX libraries are the elephant(s) in the room. Open source kernel modules mean nothing without these libraries. On the other hand, AMD and Intel use the "platform GLX" natively, and with great success.

matheusmoreira
1 replies
4h14m

Why is the user space component required? Won't they provide sysfs interfaces to control the hardware?

cesarb
0 replies
4h4m

It's something common to all modern GPUs, not just NVIDIA: most of the logic is in a user space library loaded by the OpenGL or Vulkan loader into each program. That library writes a stream of commands into a buffer (plus all the necessary data) directly into memory accessible to the GPU, and there's a single system call at the end to ask the operating system kernel to tell the GPU to start reading from that command buffer. That is, other than memory allocation and a few other privileged operations, the user space programs talk directly to the GPU.

AshamedCaptain
1 replies
4h13m

I really don't know where this crap about "moving everything to the firmware" is coming from. The kernel part of the nvidia driver has always been small, and this is the only thing they are open-sourcing (they have been announcing it for months now...). The vast majority of the user-space driver is still closed, and no one has seen any indication that this may change.

I see no indication either that nvidia or any of the other manufacturers has moved any respectable amount of functionality into the firmware. If you look at the open-source drivers you can even confirm this yourself: the binary blobs for AMD cards are minuscule, for example, and long gone are the days of ATOMBIOS. The drivers are literally generating bytecode-level binaries for the shader units in the GPU; what do you expect the firmware could even do at this point? Re-optimize the compiler output?

There was an example of a GPU that did move everything to the firmware -- the videocore on the raspberry pi, and it was clearly a completely distinct paradigm, as the "driver" would almost literally pass through OpenGL calls to a mailbox, read by the secondary ARM core (more powerful than the main ARM core!) that was basically running the actual driver as "firmware". Nothing I see on nvidia indicates a similar trend, otherwise RE-ing it would be trivial, as happened with the VC.

ploxiln
0 replies
4h4m

https://lwn.net/Articles/953144/

Recently, though, the company has rearchitected its products, adding a large RISC-V processor (the GPU system processor, or GSP) and moving much of the functionality once handled by drivers into the GSP firmware. The company allows that firmware to be used by Linux and shipped by distributors. This arrangement brings a number of advantages; for example, it is now possible for the kernel to do reclocking of NVIDIA GPUs, running them at full speed just like the proprietary drivers can. It is, he said, a big improvement over the Nouveau-only firmware that was provided previously.

There are a number of disadvantages too, though. The firmware provides no stable ABI, and a lot of the calls it provides are not documented. The firmware files themselves are large, in the range of 20-30MB, and two of them are required for any given device. That significantly bloats a system's /boot directory and initramfs image (which must provide every version of the firmware that the kernel might need), and forces the Nouveau developers to be strict and careful about picking up firmware updates.
floam
6 replies
22h57m

NVIDIA Transitions Fully Towards Open-Source GPU Kernel Modules

or

NVIDIA Transitions Towards Fully Open-Source GPU Kernel Modules?

slashdave
3 replies
21h41m

Not much point in a "partially" open-source kernel module.

floam
2 replies
20h3m

But “fully towards” is pretty ambiguous, like an entire partial implementation.

Anyhow I read the article, I think they’re saying fully as in exclusively, like there eventually will not be both a closed source and open source driver co-maintained. So “fully open source” does make more sense. The current driver situation IS partially open source, because their offerings currently include open and closed source drivers and in the future the closed source drivers may be deprecated?

einpoklum
1 replies
19h50m

See my answer. It's not going to be fully-open-source drivers; rather, all drivers will have open-source kernel modules.

slashdave
0 replies
19m

You can argue against proprietary firmware, but is this all that different from other types of devices?

j4hdufd8
1 replies
22h30m

haven't read it but probably the former

throwadobe
0 replies
22h19m

"towards" basically negates the "fully" before it for all real intents and purposes

benjiweber
4 replies
22h33m

I wonder if we'll ever get hdcp on nvidia. As much as I enjoy 480p video from streaming services.

viraptor
2 replies
18h42m

Which service goes that low? The ones I know limit you from using 4k, but anything up to 1080p works fine.

9991
1 replies
10h0m

Nonsense that a 1080p limit is acceptable for (and accepted by) paying customers.

viraptor
0 replies
9h50m

Depends. I disagree with HDCP in theory on ideological grounds. In practice, my main movie device is below 720p (projector), so it will take another decade before it affects me in any way.

ozgrakkurt
0 replies
14h22m

Just download it to your PC. It's a better user experience and costs less.

asaiacai
4 replies
20h4m

I really hope this makes it easier to install/upgrade NVIDIA drivers on Linux. It's a nightmare to figure out version mismatches between drivers, utils, container-runtime...

riddley
2 replies
17h16m

A nightmare how? When I used their cards, I'd just download the .run and run it. Done.
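For anyone who hasn't done it, that flow is just (the filename varies with the driver version; typically run from a text console with the display server stopped):

  chmod +x NVIDIA-Linux-x86_64-555.58.02.run
  sudo ./NVIDIA-Linux-x86_64-555.58.02.run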

jaimex2
0 replies
16h11m

After a reboot, of course :)

Everything breaks immediately otherwise.

amelius
0 replies
5h28m

And when it doesn't work, what do you do then?

Exactly, that's when the nightmare starts.

einpoklum
0 replies
19h52m

From my limited experience with their open-sourcing of kernel modules so far: it doesn't make things easier, but the silver lining is that, for the most part, it doesn't make installation and configuration harder! Which is no small thing, actually.

Animats
4 replies
23h13m

NVidia revenue is now 78% from "AI" devices.[1] NVidia's market cap is now US$2.92 trillion. (Yes, trillion.) Only Apple and Microsoft can beat that. Their ROI climbed from about 10% to 90% in the last two years. That growth has all been on the AI side.

Open-sourcing graphics drivers may indicate that NVidia is moving away from GPUs for graphics. That's not where the money is now.

[1] https://www.visualcapitalist.com/nvidia-revenue-by-product-l...

[2] https://www.macrotrends.net/stocks/charts/NVDA/nvidia/roi

joe_the_user
2 replies
23h3m

Well, Nvidia seems to be claiming in the article that this is everything, not just graphics drivers: "NVIDIA GPUs share a common driver architecture and capability set. The same driver for your desktop or laptop runs the world’s most advanced AI workloads in the cloud. It’s been incredibly important to us that we get it just right."

And: "For cutting-edge platforms such as NVIDIA Grace Hopper or NVIDIA Blackwell, you must use the open-source GPU kernel modules. The proprietary drivers are unsupported on these platforms." (These are the two most advanced NVIDIA architectures currently.)

Animats
1 replies
22h30m

That's interesting. I've been expecting the AI cards to diverge more from the graphics cards. AI doesn't need triangle fill, Z-buffering, HDMI out, etc. 16 bit 4x4 multiply/add units are probably enough. What's going on in that area?

p_l
0 replies
19h16m

TL;DR - there seems to be not that much improvement from dropping the "graphics-only" parts of the chip if you already have a GPU instead of breaking into AI market as your first product.

1. nVidia compute dominance is not due to hyperfocus on AI (that's Google's TPU for you, or things like intel's NPU in Meteor Lake), but because CUDA offers considerable general purpose compute. In fact, considerable revenue came and still comes from non-AI compute. This also means that if you figure out a novel mechanism for AI that isn't based around 4x4 matrix addition, or which mixes it with various other operations, you can do them inline. This also includes any pre and post processing you might want to do on the data.

2. The whole advantage they have in software ecosystem builds upon their PTX assembly. Having it compile to CPU and only implement the specific variant of one or two instructions that map to "tensor cores" would be pretty much nonsensical (especially given that AI is not the only market they target with tensor cores - DSP for example is another).

Additionally, a huge part of why nvidia built such a strong ecosystem is that you could take cheapest G80-based card and just start learning CUDA. Only some highest-end features are limited to most expensive cards, like RDMA and NVMe integration.

Compare this with AMD, where for many purposes only the most expensive compute-only cards are really supported. Or specialized AI only chips that are often programmable either in very low-level way or essentially as "set a graph of large-scale matrix operations that are limited subset of operations exposed by Torch/Tensorflow" (Google TPU, Intel Meteor Lake NPU, etc).

3. CUDA literally began with how the evolution of the shader model led to a general-purpose "shader processor" instead of specialized vertex and pixel processors. The space taken by specialized hardware for graphics that isn't also usable for general-purpose compute is pretty minimal, although some of it is omitted, AFAIK, in compute-only cards.

In fact, some of the "graphics only" things like Z-buffering are done by the same logic that is used for compute (with limited amount of operations done by fixed-function ROP block), and certain fixed-function graphical components like texture mapping units are also used for high-performance array access.

4. Simplified manufacturing and logistics - nVidia uses essentially the same chips in most compute and graphics cards, possibly with minor changes achieved by changing chicken bits to route pins to different functions (as you mentioned, you don't need DP-outs of RTX4090 on an L40 card, but you can probably reuse the SERDES units to run NVLink on the same pins).

orbital-decay
0 replies
22h54m

It indicates nothing; they started this a few years ago. They just transferred the most important parts of their driver to the (closed source) firmware, to be handled by the onboard GSP processor, and open sourced the rest.

sillywalk
3 replies
22h48m

From the github repo[0]:

Most of NVIDIA's kernel modules are split into two components:

    An "OS-agnostic" component: this is the component of each kernel module that is independent of operating system.

    A "kernel interface layer": this is the component of each kernel module that is specific to the Linux kernel version and configuration.
When packaged in the NVIDIA .run installation package, the OS-agnostic component is provided as a binary:

[0] https://github.com/NVIDIA/open-gpu-kernel-modules

p_l
2 replies
22h25m

That was the "classic" drivers.

The new open-source ones effectively move the majority of the OS-agnostic component to run as a blob on the GPU.

arghwhat
1 replies
21h50m

Not quite - it moves some logic to the GSP firmware, but the user-space driver is still a significant portion of code.

The exciting bit there is the work on NVK.

p_l
0 replies
20h58m

Yes, I was not including the userspace driver in this, as it's a bit "out of scope" for the conversation :D

brrrrrm
3 replies
23h35m

"Kernel" is an overloaded term for GPUs. This is about the Linux kernel.

karamanolev
1 replies
23h23m

"... Linux GPU Kernel Modules" is pretty unambiguous to me.

brrrrrm
0 replies
21h22m

Yep, the title was updated.

brrrrrm
0 replies
4h47m

Guh, wish I could delete this now that the title has been updated. The original title (shown on the linked page) wasn't super clear.

muhehe
2 replies
7h27m

What is a GPU kernel module? Is it something like a driver for the GPU?

qalmakka
1 replies
5h42m

Yes. In modern operating systems, GPU drivers usually consist of a kernel component that is loaded inside the kernel or in a privileged context, and a userspace component that talks to it and implements the GPU-specific parts of the APIs that the windowing system and applications use. In the case of NVIDIA, they have decided to drop their proprietary kernel module in favour of an open one. Unfortunately, it's out of tree.

In Linux and BSD, you usually get all of your drivers with the system; you don't have to install anything, it's all mostly plug and play. For instance, this has been the case for AMD and Intel GPUs, which have a 100% open-source stack. NVIDIA is particularly annoying due to the need to install the drivers separately and the fact that they've got different implementations of things compared to everyone else, so NVIDIA users are often left behind by FOSS projects due to GeForce cards being more annoying to work with.
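
To picture the split from the application side, a hedged sketch (mine, not the commenter's): a plain C program linked against the closed userspace component libcuda.so, whose calls the library ultimately translates into ioctls on the /dev/nvidia* device nodes created by the kernel module:

    /* Sketch: the userspace half of the driver in action. This links against
     * libcuda.so (closed source); under the hood, the library talks to the
     * kernel module through /dev/nvidiactl and /dev/nvidia0.
     * Build: gcc probe.c -o probe -lcuda */
    #include <cuda.h>
    #include <stdio.h>

    int main(void) {
        int count = 0;
        char name[128];
        CUdevice dev;

        cuInit(0);                  /* userspace lib opens the device nodes */
        cuDeviceGetCount(&count);
        printf("CUDA devices: %d\n", count);
        if (count > 0) {
            cuDeviceGet(&dev, 0);
            cuDeviceGetName(name, (int)sizeof(name), dev);
            printf("device 0: %s\n", name);
        }
        return 0;
    }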

muhehe
0 replies
5h5m

Thanks. I'm not well versed in these things. It sounded like something you load into the GPU (it reminded me of an old HP printer, which required a firmware upload after startup).

jdonaldson
2 replies
12h35m

It's kind of surprising that these haven't just been reverse-engineered yet by language models.

special-K
1 replies
12h28m

That's simply not how LLMs work; they're actually awful at reverse engineering of any kind.

jdonaldson
0 replies
4h24m

Are you saying that they can't explain the contents of machine code in a human-readable format? Are you saying that they can't be used in a system that iteratively evaluates combinations of inputs and checks their results?

jcalvinowens
2 replies
20h32m

Throwing the tarball over the wall and saying "fetch!" is meaningless to me. Until they actually contribute a driver to the upstream kernel, I'll be buying AMD.

aseipp
1 replies
6h54m

You can just use Nouveau and NVK if you only need workstation graphics (and the open-gpu-kernel-modules source code and separate GSP firmware release have been a big uplift for Nouveau too, at least).

jcalvinowens
0 replies
3h59m

Nouveau is great, and I absolutely admire what the community around it has been able to achieve. But I can't imagine choosing that over AMD's first class upstream driver support today.

enoeht
2 replies
23h29m

Didn't they say that many times before?

vlovich123
1 replies
23h21m

Not sure, but since the Turing series they support a cryptographically signed binary blob that they load onto the GPU. So whereas before their kernel driver was a thin shim for the user-space driver, now it's a thin shim for the black-box firmware loaded on the GPU.

p_l
0 replies
19h5m

The scope of what the kernel interface provides didn't change, but what was previously a blob wrapped by the source-provided "OS interface layer" has now moved to run on the GSP (RISC-V based) inside the GPU.

xyst
1 replies
20h25m

Nvidia has finally realized they couldn't write drivers for their own hardware, especially for Linux.

Never thought I would see the day.

TeMPOraL
0 replies
20h19m

Suddenly they went from powering gaming to being the winners of the AI revolution; AI is Serious Cloud Stuff, and Serious Cloud Stuff means Linux, so...

smcleod
1 replies
22h20m

So does this mean actually getting rid of the binary blobs of microcode that are in their current ‘open’ drivers?

p_l
0 replies
22h12m

No, it means the blob from the "closed" drivers is moved to run on the GSP.

pluto_modadic
1 replies
22h53m

damn, only for new GPUs.

mynameisvlad
0 replies
22h26m

For varying definitions of "new". It supports Turing and up, which was released in 2018 with the 20xx line. That's two generations back at this point.

magicloop
1 replies
20h35m

Remember that time when Linus looked at the camera and gave Nvidia the finger? Has that time now passed? Is it time to reconcile? Or are there still some gotchas?

jaimex2
0 replies
16h13m

These are just the kernel modules, not the actual drivers. So the finger remains up.

Varloom
1 replies
7h50m

They know the CUDA monopoly won't last forever.

aseipp
0 replies
6h53m

CUDA lives in userspace; this kernel driver release does not contain any of that. It's still very useful to release an open source DKMS driver, but this doesn't change anything at all about the CUDA situation.

v3ss0n
0 replies
6h41m

Thank you, Nvidia hackers! You did it! The Lapsus$ team threatened a few years back that if Nvidia didn't open-source its drivers, they were going to release the stolen code. That led Nvidia to release its first open-source kernel module a few months later, but it was quite incomplete. Now it seems they are open-sourcing even more.

sylware
0 replies
5h58m

Hopefully we get a plain and simple C99 user-space Vulkan implementation.

shmerl
0 replies
20h20m

That's not upstream yet. But they've supposedly shown some interest in Nova too.

rldjbpin
0 replies
10h46m

Mind the wording they've used here - "fully towards open-source" and not "towards fully open-source".

Big difference. Almost nobody is going to give you the sauce hidden behind blobs. But I hope the dumb issues of the past (imagine using it on laptops with switchable graphics) slowly go away with this, and that it is not only about pleasing the enterprise crowd.

risho
0 replies
22h15m

Does this mean you will be able to use NVK/Mesa and CUDA at the same time? The non-Mesa proprietary side of Nvidia's Linux drivers is such a mess, and NVK is improving by the day, but I really need CUDA.

resource_waste
0 replies
3h12m

Does this mean Fedora can bundle it?

qalmakka
0 replies
19h41m

Well, it is something, even if it's still only the kernel module, and it will probably never be upstreamed anyway.

nikolayasdf123
0 replies
9h32m

Hope Linux gets first-class open-source GPU drivers... and dare I hope that Go adds native support for GPUs too?

nicman23
0 replies
7h24m

They are worthless; the main code is in userspace.

n3storm
0 replies
9h54m

I read "NVIDIA transitions fully Torvalds..."

matheusmoreira
0 replies
11h6m

The transition is not done until their drivers are upstreamed into the mainline kernel and ALL features work out of the box, especially power management and hybrid graphics.

john2x
0 replies
21h59m

Maybe that’s one way to retain engineers who are effectively millionaires.

gorkish
0 replies
3h8m

This is great. I've been having to build my own .debs of the OSS driver for some time because of the crapola NVIDIA puts in their proprietary driver that prevents it from working in a VM as a passthrough device (just a regular whole-card passthrough, not trying to use GRID/vGPU on a consumer card or anything).

NVIDIA can no longer get away with that nonsense when they have to show their code.

gigatexal
0 replies
9h3m

Will this mean we'll be able to remove the arbitrary distinctions between Quadro and GeForce cards, maybe by hacking some configs or such in the drivers?

exabrial
0 replies
5h42m

Are Nvidia Grace CPUs even available? I thought it was interesting that they mentioned them.

einpoklum
0 replies
19h54m

The title of this statement is misleading:

NVIDIA is not transitioning to open-source drivers for its GPUs; most or all user-space parts of the drivers (and most importantly for me, libcuda.so) are closed-source; and as I understand from others, most of the logic is now in a binary blob that gets sent to the GPU.

Now, I'm sure this open-sourcing has its uses, but for people who want to do something like a different hardware backend for CUDA with the same API, or to clear up "corners" of the API semantics, or to write things in a different language without going through the C API - this does not help us.

doctoboggan
0 replies
15h24m

My guess is Meta and/or Amazon told Nvidia that they would contribute considerable resources to development as long as the results were open source. Both companies' bottom lines would benefit from improved kernel modules, and like another commenter said elsewhere, Nvidia doesn't have much to lose.

aussieguy1234
0 replies
13h2m

I'll update as soon as it's in NixOS unstable. Hopefully this will change the minds of the Sway maintainers to start supporting Nvidia cards; I'm using i3 and X but would like to try out Wayland.

Narhem
0 replies
13h5m

I can't wait to use Linux without having to spend multiple weekends trying to get the right drivers to work.

CivBase
0 replies
15h40m

Too late for me. I tried switching to Linux years ago but failed because of the awful state of NVIDIA's drivers. Switched to AMD last year and it's been a breeze ever since.

Gaming on Linux with an NVIDIA card (especially an old one) is awful. Of course, Linux gamers aren't the demographic driving this recent change of heart, so I expect it to stay awful for a while yet.