Nvtop: Linux Task Monitor for Nvidia, AMD and Intel GPUs

visarga
19 replies
5h41m

apt install is not working for me, is this by design?

    nvtop : Depends: libnvidia-compute-418 but it is not going to be installed
    E: Unable to correct problems, you have held broken packages.

<rant>I find broken installs a huge turnoff, especially those related to NVIDIA. With their $2.3T market cap, can't they afford someone to write a universal point-and-click install script for ML usage? Every time I reinstall Linux I have to spend a whole day sorting NVIDIA out. Why do they have so many layers - driver, CUDA, CUDA toolkit, cuDNN - all with conflicting versioning? It's a total mess. Instead of a nice install script we have a million install guides, each 10 pages long and all outdated.</rant>

bayindirh
11 replies
4h48m

Because all these problems don’t hinder their bottom line.

Cluster admins or Ph.D. students handle these problems, allowing people to work. All this infra is already buried under Conda, Jupyter, etc. for most people anyway.

Sincerely,

Your friendly HPC admin.

VHRanger
10 replies
4h30m

That's just because they don't have competition.

Back in the dark days of 2015 we used to spend a day or two just getting tensorflow working on a GPU because of all the install issues, driver issues, etc. Theano was no better, but it was academic research code; we didn't expect better.

Once pytorch started gaining ground, it forced them to adapt - Keras was written to hide tensorflow's awfulness. Then Google realized it was an unrecoverable situation of technical debt and started building JAX.

With AMD, Intel, Tenstorrent, and several other AI chip specialists coming with pytorch compatibility, NVIDIA will eventually have to adapt. They still have the advantage of 15 years of CUDA code already written, but pytorch as an abstraction layer can make the switch easier.

LoganDark
8 replies
3h47m

PyTorch and CUDA solve completely different problems. CUDA is a general purpose programming environment. PyTorch is for machine learning. PyTorch won't ever displace CUDA because there are things other than machine learning models that GPUs are good at accelerating.
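
To make the distinction concrete, here is a minimal sketch of the kind of non-ML GPU work CUDA is for - a single 1D heat-diffusion step, written here through Numba's CUDA bindings (an assumption for illustration; it requires numba, a CUDA toolkit, and an NVIDIA GPU, and the sizes are made up):

    import numpy as np
    from numba import cuda

    @cuda.jit
    def heat_step(u, u_next, alpha):
        i = cuda.grid(1)                  # global thread index
        if 0 < i < u.size - 1:            # keep the boundary cells fixed
            u_next[i] = u[i] + alpha * (u[i - 1] - 2.0 * u[i] + u[i + 1])

    u = cuda.to_device(np.random.rand(1_000_000))
    u_next = cuda.device_array_like(u)
    threads = 256
    blocks = (u.size + threads - 1) // threads
    heat_step[blocks, threads](u, u_next, 0.1)  # one diffusion timestep on the GPU
    result = u_next.copy_to_host()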

dylan604
5 replies
3h26m

Yeah, the amount of tunnel vision from AI/ML users thinking that Nvidia exists solely for their use is funny to watch. Try writing anything other than ML in pytorch. You can't? You can in CUDA. There's a much bigger world than ML out there.

VHRanger
3 replies
3h21m

Nvidia's stock price isn't at an all-time high because of all the people writing fluid dynamics in CUDA.

Nor is it because of all the tensorflow models people are writing, to be honest.

dylan604
2 replies
3h18m

Of course it's all of the mining, but that's not using pytorch either. It's using CUDA.

cwbriscoe
0 replies
2h46m

GPU mining went waaay down since Ethereum went PoS (Proof of Stake) almost 2 years ago. Does BTC even use GPUs for mining? I am pretty sure they use ASICs.

Eisenstein
0 replies
2h45m

What is being mined using CUDA?

frantathefranta
0 replies
1h58m

And similarly from people who consider Nvidia to be the "Gaming GPU" company, not understanding why it's so big now.

robertlagrant
1 replies
3h39m

It was an example.

LoganDark
0 replies
39m

> With AMD, Intel, Tenstorrent, and several other AI chip specialists coming with pytorch compatibility, NVIDIA will eventually have to adapt.

I don't see how Nvidia has to do anything since PyTorch works just fine on their GPUs, thanks to CUDA. If anything, they're still one of the best platforms and that's definitely not because CUDA isn't competitive.

I hate stuff that only works on certain GPUs as much as the next person, but sadly competition has only really started to catch up to CUDA very recently.

thomastjeffery
0 replies
1h3m

The problem is that NVidia is a single company participating in multiple interdependent markets. They participate in the market of hardware specification, in the market of driver implementation, and in the market of userland software. This is called "vertical integration".

Because of copyright, NVidia gets an explicit government-enforced monopoly over the driver implementation market. Sure, 3rd-party projects like nouveau get to "compete", but NVidia is given free rein to cripple that competition, simply by refusing to share necessary hardware (and firmware) specs, and by compelling experienced engineers (anyone who works on NVidia's driver implementation) to sign NDAs, legally enforcing the secrecy of those specs.

On top of this, NVidia gets to be anti-competitive with the driver-compatibility of its userland software, including CUDA, GSync, DLSS, etc.

When a company's market participation is vertically integrated, that participation becomes anticompetitive. The only way we can resolve this problem is by dissolving the company into multiple market-specific companies.

rez9x
2 replies
4h27m

I recently traded a friend my Nvidia 3070 for his Radeon 6700 XT, because I'd returned to Linux a few months ago and was tired of Nvidia. Nvidia support will likely get much better as NVK grows, but I think it's better to just not use their products unless you want to have Microsoft spywareOS installed on your computer.

jlarocco
1 replies
2h21m

Everybody's experience is different.

I've had one or two upgrade problems in the last 10 years, but otherwise the Nvidia drivers have worked great for me. My biggest complaint is they dropped support for the GPU in my Macbook, and I had to install the nouveau drivers (which I can never spell correctly).

DonHopkins
0 replies
50m

At least it's not from the FSF, and GPUs aren't gendered, or you'd have to choose from multiple gendered drivers:

    - "gnuveau" for one masculine GPU.
    - "gnuvelle" for one feminine GPU.
    - "gnuveaux" for multiple masculine GPUs.
    - "gnuvelles" for multiple feminine GPUs.

KeplerBoy
1 replies
3h43m

Installing CUDA is not that hard? You follow the official instructions and it's done within minutes.

Never had a problem with the initial installation; updates can get messy, but aside from that it's pretty much smooth sailing.

IanOzsvald
0 replies
2h38m

I've just spent the morning uninstalling and reinstalling different versions of the Nvidia driver (Linux) to get nvcc back for llama.cpp after Linux Mint did an update - I had CUDA 12.3 and 12.4 (5GB each), in conflict, with no guidance. Driver 550 was the charm, not the 535 that was fine in January. This is the third time I've done this since December. It is painful. I'm not in a hurry to return to my cuDF experiments as I'm pretty sure that'll be broken too (as it has been in the past). I'm the co-author of O'Reilly's High Performance Python book, and this experience mirrors what I had with pyCUDA a decade back.

jlarocco
0 replies
2h30m

There are plenty of problems with NVIDIA on Linux, but I'm sad to tell you I think this one is your own fault.

The error message is telling you that you've held back broken packages that are conflicting with dependencies nvtop is trying to install. If you sort that out, nvtop should install.

I have nvtop installed on Debian via apt, and it works just fine.

ahartmetz
0 replies
1h44m

Nah, screw install scripts, too. If I didn't prefer AMD anyway, I'd want an apt repo.

notorandit
9 replies
11h47m

It is nice. But the status of GPU support in Linux is rather poor.

I can see it working only with VLC.

Firefox has some support, while Chromium-based browsers have it only nominally.

In the real world you never see video hw acceleration kicking in, neither with WebRTC nor with regular videos.

It is a pity.

KeplerBoy
6 replies
11h45m

It's a valuable tool wherever GPU compute is used. Not sure if video decoding would even register as utilization, since that is dedicated hardware.

usr1106
3 replies
11h15m

Yes, you can see video decoding, but it's indeed only one detail of the data presented, as the decoder is only one part of the HW.

I'm not at my computer now, so I can't tell which metric to watch.

notorandit
1 replies
9h22m

Which software would use video decoding?

adrian_b
0 replies
2h51m

Pretty much everything except the browsers.

I use several video players, including mpv, vlc and ffmpeg, and they have always used the hardware video decoding and encoding on all kinds of GPUs without problems.

Only in Firefox and Chrome/Chromium is the hardware acceleration broken in most versions; there have been some versions where it worked fine (on NVIDIA), but at the next browser upgrade it was broken again.

This does not bother me much, because I do not like to watch video files in a browser anyway. I always download them first and I play them locally.

KeplerBoy
0 replies
10h33m

Ah yes, you can configure nvtop to display encode and decode loads. It's interesting to watch how playback speed in VLC is directly reflected in the metric.

notorandit
0 replies
9h23m

Yes, indeed. But 90% of my own GPU usage is for WebRTC-based applications and web-based videos.

Which leads to what I said: no hw-assisted encoding/decoding is actually available.

I use intel_gpu_top (from intel-gpu-tools) to monitor my GPU usage: only VLC shows usage.

Has anyone had success with browsers?

jandrese
0 replies
1h1m

Pretty much every time I've used nvtop it has been while doing CUDA stuff, mostly to see if it was going to blow out the memory but also to spot check that the model is actually using the GPU. I've had times where it said it was using the GPU, but it only did so for the first part of the work and then dropped down to the CPU for the rest.
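
For that last failure mode, a cheap sanity check alongside nvtop is to assert that the weights and inputs actually live on the GPU. A minimal sketch, where the Linear layer and random batch are stand-ins for your own model and data:

    import torch

    model = torch.nn.Linear(1024, 1024).to("cuda")  # stand-in for your model
    x = torch.randn(32, 1024, device="cuda")        # stand-in for your batch
    assert next(model.parameters()).is_cuda, "weights are not on the GPU"
    assert x.is_cuda, "inputs are not on the GPU"
    y = model(x)
    print(f"{torch.cuda.memory_allocated() / 2**20:.1f} MiB of VRAM allocated")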

weberer
0 replies
5h57m

I think this is more for people developing applications with CUDA and the like.

pjmlp
0 replies
9h21m

Can confirm with my devices.

hughesjj
8 replies
15h54m

Nvtop+bottom are my favorite resource monitors for Linux today, but TIL nvtop also works for non-nvidia devices

weinzierl
3 replies
8h35m

I use htop but often want to focus on a process, its children, and all their threads. Is there an htop replacement that can:

- Show a process with all its children and threads, but nothing else.

- Show the whole tree but keep the selected (Shift + Space) process in a fixed screen position.

- Bubble rows up and down into their new positions instead of having them jump around all over the place.

themoonisachees
2 replies
7h51m

Btop does all that and is available in your package manager. It also looks beautiful.

weinzierl
1 replies
6h30m

Has a nice 90s vibe. I haven't figured out how to expand just the selected process. I could collapse everything except the selected process manually, but that'd be tedious.

DonHopkins
0 replies
36m

90's rave scene! ;)

freedomben
3 replies
14h29m

wow, bottom[1] is awesome! This is now my favorite monitor. It's not in the Fedora repos, but there's a COPR for it. To install:

    sudo dnf copr enable atim/bottom
    sudo dnf install bottom
[1] https://github.com/ClementTsang/bottom

freedomben
0 replies
4h24m

Amazing, thank you!

Cu3PO42
0 replies
3h0m

And it also supports GPU usage monitoring!

formalsystem
8 replies
15h47m

nvtop or nvidia-smi gives you a good macro overview, but I personally have found that utilization (EDIT: as reported by nvidia-smi) is actually a poor proxy for how fast your workload can be, beyond just ensuring that a GPU is indeed being used.

If you're here because you're interested in AI performance, I'd recommend instead https://docs.nvidia.com/nsight-compute/NsightComputeCli/inde... to profile individual kernels, Nsight Systems for a macro view https://developer.nvidia.com/nsight-systems, and the PyTorch profiler if you're not authoring kernels directly but using something like PyTorch https://pytorch.org/tutorials/recipes/recipes/profiler_recip...
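
For the PyTorch profiler route, a minimal sketch looks something like this (the Linear layer and random input are stand-ins for your own model and batch):

    import torch
    from torch.profiler import profile, ProfilerActivity

    model = torch.nn.Linear(4096, 4096).cuda()   # stand-in for your model
    x = torch.randn(64, 4096, device="cuda")

    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        model(x)  # the region you want to measure
    # Per-kernel CUDA time is far more informative than a single utilization %
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))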

samstave
2 replies
14h17m

If you install Docker Desktop with WSL2 checked, it automatically lets you run nvidia-smi in your WSL Ubuntu environment on Windows:

https://i.imgur.com/C24EV5U.png

then:

    sudo apt install nvtop

https://i.imgur.com/SOoCdvR.png

EDIT:

Thanks. Some people were having random problems installing WSL on their systems and I found this was the easiest solution (though based on their card models, they appeared to have much older machines).

acka
1 replies
13h30m

There is no need to install Docker Desktop just to run nvidia-smi in WSL; the Windows directory containing the nvidia-smi binary is mounted inside a WSL instance and added to PATH automatically by WSL on instance startup.

As an aside: there is no need to install Docker Desktop just to use Docker containers in WSL either, unless you want a Windows GUI to manage your containers. Just follow the official documentation for installing Docker in your Linux distro of choice, or simply run `sudo apt install docker.io` in the default WSL Ubuntu distro. Docker will work just fine with an up-to-date WSL.

8A51C
0 replies
8h35m

Further aside: it's possible to have both Docker Desktop and the normal Linux docker.io installed on WSL. They work in isolation; the easy way to know which is active is to check whether Docker Desktop is running. I wouldn't recommend this setup...

refibrillator
1 replies
15h31m

FLOPs utilization is arguably the industry standard metric for efficiency right now and it should be a good first approximation of how much performance is left on the table.

But if you mean the reported utilization in nvtop is misleading I completely agree (as someone who uses it daily).

I’ve been meaning to dig into the source/docs to see what’s going on. The power usage seems to be a more reliable indicator of actual hardware utilization, at least on nvidia gear.
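
For what it's worth, the usual back-of-envelope check here is model FLOPs utilization (MFU): achieved FLOPs divided by the hardware peak. A sketch with made-up numbers, using the common ~6 * parameters * tokens estimate for transformer training FLOPs per second:

    # All numbers below are illustrative placeholders
    params = 7e9               # model parameters
    tokens_per_second = 4_000  # measured training throughput
    peak_flops = 312e12        # e.g. A100 dense BF16 peak, per the datasheet
    achieved_flops = 6 * params * tokens_per_second
    print(f"MFU: {achieved_flops / peak_flops:.1%}")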

VHRanger
0 replies
4h20m

> FLOPs utilization is arguably the industry standard metric for efficiency right now

I'd argue GB/s of memory bandwidth is the bigger worry at the moment.

ipsum2
1 replies
15h6m

I've been going off of power draw in nvidia-smi as a proxy for utilization; it doesn't require additional setup or code changes.
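
If you want the same reading programmatically rather than eyeballing nvidia-smi, here's a small sketch via the NVML Python bindings (the nvidia-ml-py package; assumes GPU index 0):

    import time
    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    for _ in range(10):
        watts = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000  # NVML reports mW
        util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
        print(f"{watts:6.1f} W  {util:3d}% util")
        time.sleep(1)
    pynvml.nvmlShutdown()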

KeplerBoy
0 replies
13h20m

That's hard to argue with. Of course power draw is a direct measure of hardware utilization, but it doesn't translate very well into a measure of GPU code efficiency.

Often you can squeeze out another order of magnitude of performance by rewriting the kernel, while the power draw stays capped at whatever the maximum is. I'd say GPU power consumption is most interesting if you're CPU-bound and struggling to feed the GPU enough data and/or tasks.

pama
0 replies
15h4m

I agree that utilization as reported by nvidia-smi is a poor proxy for performance. FWIW, I've found that for the same architecture the power consumption reported in nvtop very often correlates nicely with training performance, and peak performance is always at peak power consumption. Agreed on your advice for tuning your architecture details, but once that's fixed and you have simple things to debug, like memory usage, batch size, or dataloading bottlenecks, the raw power metric is typically a quick proxy. I find temperature is a second useful macro metric; you want to be at max power draw and max allowed temperature at all times, but not exceed the temperature where you throttle.

sa-code
1 replies
11h0m

The downside with nvitop is that it's written in Python, which means having it in your environment can cause dependency conflicts. It's either that or you keep a separate venv just for it. Maybe it's fine for personal use, but sysadmins would prefer nvtop.

freedomben
1 replies
14h17m

Does nvitop support AMD cards?

malux85
0 replies
12h2m

Not sure sorry, I only have nvidia :<

brcmthrowaway
3 replies
14h46m

Anything for macOS?

itsgrimetime
0 replies
12h56m

It's not a terminal app like bottom or nvtop, but I use https://github.com/exelban/stats and it has iGPU stats.

kiraaa
0 replies
10h38m

super easy to install and use

cassianoleal
0 replies
8h27m

Seems to be for NVidia only, whereas OP claims:

> Currently supported vendors are AMD (Linux amdgpu driver), Apple (limited M1 & M2 support), Huawei (Ascend), Intel (Linux i915 driver), NVIDIA (Linux proprietary drivers), Qualcomm Adreno (Linux MSM driver).

vilunov
1 replies
8h40m

It doesn't work with mesa, does it? I'm using the new nvk driver and it shows me no GPU to monitor.

thangngoc89
0 replies
7h23m

Of course it doesn't. `nvidia-smi` needs to be available.

cyberax
1 replies
15h8m

Is there a description of the wire protocol between the driver and NVidia library?

Context: I'd love to have a native Go-language library to read the GPU utilization for containerized workloads.

blagie
1 replies
10h10m

Now that I use Home Assistant, I want all my data sources to plug into there. It can handle the rendering for me as I see fit, and it's where data comes to integrate.

It's one of those things which I wish existed, but I can't imagine anyone would have written. Until I do a web search.

https://github.com/koriwi/sensors2mqtt/tree/main

I have not used it yet, but that seems like how I'd want to do it.
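
The core of that pattern is small enough to sketch: read a GPU sensor via NVML and publish it over MQTT for Home Assistant to pick up. A rough one-shot sketch, assuming pynvml and paho-mqtt are installed; the topic and broker hostname are made-up placeholders:

    import pynvml
    import paho.mqtt.publish as publish

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
    pynvml.nvmlShutdown()
    # One-shot publish; a real integration would loop and add HA MQTT discovery
    publish.single("sensors/gpu0/temperature", payload=str(temp),
                   hostname="homeassistant.local")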

feitingen
0 replies
8h11m

Collectd can also output stats to MQTT, from sensors, disks, network, and others.

Ycros
1 replies
9h59m

I prefer btop; it does all the usual process monitoring as well as GPUs in the latest versions.

notorandit
0 replies
9h21m

Really? Mine is v1.3.2 and doesn't show Intel Iris Xe Graphics!

{UPDATE} I see: no Intel GPU support yet!

thangngoc89
0 replies
7h14m

My favorite would be gpustat [1]. It shows the bare minimum amount of information to let me know whether the training has problems or is running well.

[1] https://github.com/wookayin/gpustat

sylware
0 replies
7h21m

cmake... erk...

I would have preferred a simple and brutal shell script (not bash, of course) to build it on ELF/Linux.

It is worth trashing cmake, always, so I'll write one if I end up using this GPU monitoring tool.

superkuh
0 replies
13h58m

radeontop is the same sort of thing if you live in amdgpu-ville and want something easy to compile. I was able to use it to show that with kernel 5.x amdgpu Vulkan, when a process is pushed out of VRAM into GTT it never reloads and gets stuck in a 'slow' state.

stuaxo
0 replies
9h59m

It's good to see Linux graphics card utilities going multi-platform, instead of the old way of being per-driver.

mrlonglong
0 replies
5h50m

It'd be nice if it supported Nouveau.

collsni
0 replies
15h52m

amdgpu_top is another cool usage statistic/monitor for AMD GPUs.

bvaisvil
0 replies
1h9m

Zenith is my project, which combines NVIDIA GPU monitoring with disk, CPU, and top-like capabilities. https://github.com/bvaisvil/zenith