HN comments for: Nvtop: Linux Task Monitor for Nvidia, AMD and Intel GPUs

visarga

19 replies

5h41m

2024-03-13 12:44:24 UTC

apt install is not working for me, is this by design?

nvtop : Depends: libnvidia-compute-418 but it is not going to be installed E: Unable to correct problems, you have held broken packages.

<rant>I find broken installs a huge turnoff, especially those related to NVIDIA. With their 2.3T market cap they can't afford someone to write a universal point and click install script for ML usage? Every time I reinstall Linux I have to spend a whole day sorting NVIDIA out. Why do they have so many layers - driver, cuda, cuda toolkit, cudnn with conflicting versioning - it's a total mess. Instead of a nice install script we have a million install guides 10 pages long, all outdated.</rant>

bayindirh

11 replies

4h48m

2024-03-13 13:36:47 UTC

Because all these problems don’t hinder their bottom line.

Cluster admins or Ph.D. students handle these problems, allowing people to work. All this infra is already buried under Conda, Jupyter, etc. for most people already.

Sincerely,

Your friendly HPC admin.

VHRanger

10 replies

4h30m

2024-03-13 13:54:33 UTC

That's just because they don't have competition.

Back in the dark days of 2015 we used to spend a day or two just getting tensorflow working on a GPU because of all the install issues, driver issues, etc. Theano was no better, but it was academic research code, we didn't expect better.

Once pytorch started gaining ground, it forced tensorflow to adapt - Keras was written to hide tensorflow's awfulness. Then Google realized it's an unrecoverable situation of technical debt and they started building JAX.

With AMD, Intel, Tenstorrent, and several other AI chip specialists coming with pytorch compatibility, NVIDIA will eventually have to adapt. They still have the advantage of 15 years of CUDA code already written, but pytorch as an abstraction layer can make the switch easier.

LoganDark

8 replies

3h47m

2024-03-13 14:38:01 UTC

PyTorch and CUDA solve completely different problems. CUDA is a general purpose programming environment. PyTorch is for machine learning. PyTorch won't ever displace CUDA because there are things other than machine learning models that GPUs are good at accelerating.

dylan604

5 replies

3h26m

2024-03-13 14:59:15 UTC

Yeah, the amount of tunnel vision from AI/ML users thinking that Nvidia exists solely for their use is funny to watch. Try writing anything other than ML in pytorch. You can't? You can in CUDA. There's a much bigger world than ML out there.

VHRanger

3 replies

3h21m

2024-03-13 15:03:56 UTC

Nvidia's stock price isn't at an all-time high because of all the people writing fluid dynamics in CUDA.

Nor is it because of all the tensorflow models people are writing, to be honest.

dylan604

2 replies

3h18m

2024-03-13 15:06:42 UTC

Of course it's all of the mining, but that's not using pytorch either. It's using CUDA.

cwbriscoe

0 replies

2h46m

2024-03-13 15:38:57 UTC

GPU mining went waaay down since Ethereum went POS (Proof of Stake) almost 2 years ago. Does BTC even use GPUs for mining? I am pretty sure they use ASICs.

Eisenstein

0 replies

2h45m

2024-03-13 15:39:45 UTC

What is being mined using CUDA?

frantathefranta

0 replies

1h58m

2024-03-13 16:26:43 UTC

And similarly from people who consider Nvidia to be the "Gaming GPU" company, not understanding why it's so big now.

robertlagrant

1 replies

3h39m

2024-03-13 14:45:33 UTC

It was an example.

LoganDark

0 replies

39m

2024-03-13 17:45:48 UTC

> With AMD, Intel, Tenstorrent, and several other AI chip specialists coming with pytorch compatibility, NVIDIA will eventually have to adapt.

I don't see how Nvidia has to do anything since PyTorch works just fine on their GPUs, thanks to CUDA. If anything, they're still one of the best platforms and that's definitely not because CUDA isn't competitive.

I hate stuff that only works on certain GPUs as much as the next person, but sadly competition has only really started to catch up to CUDA very recently.

thomastjeffery

0 replies

1h3m

2024-03-13 17:21:44 UTC

The problem is that NVidia is a single company participating in multiple interdependent markets. They are participating in the market of hardware specification, and they are participating in the market of driver implementation, and they are participating in the market of userland software. This is called "vertical integration".

Because of copyright, NVidia gets an explicit government-enforced monopoly over the driver implementation market. Sure, 3rd-party projects like nouveau get to "compete", but NVidia is given free rein to cripple that competition, simply by refusing to share necessary hardware (and firmware) specs; and also by compelling experienced engineers (anyone who works on NVidia's driver implementation) to sign NDAs, legally enforcing the secrecy of their specs.

On top of this, NVidia gets to be anti-competitive with the driver-compatibility of its userland software, including CUDA, GSync, DLSS, etc.

When a company's market participation is vertically integrated, that participation becomes anticompetitive. The only way we can resolve this problem is by dissolving the company into multiple market-specific companies.

rez9x

2 replies

4h27m

2024-03-13 13:57:38 UTC

I recently traded a friend my Nvidia 3070 for his Radeon 6700 XT, because I'd returned to Linux a few months ago and was tired of Nvidia. Nvidia will likely get much better as NVK grows, but I think it's better to just not use their products unless you want to have Microsoft spywareOS installed on your computer.

jlarocco

1 replies

2h21m

2024-03-13 16:03:35 UTC

Everybody's experience is different.

I've had one or two upgrade problems in the last 10 years, but otherwise the Nvidia drivers have worked great for me. My biggest complaint is they dropped support for the GPU in my Macbook, and I had to install the nouveau drivers (which I can never spell correctly).

DonHopkins

0 replies

50m

2024-03-13 17:35:06 UTC

At least it's not from the FSF, and GPUs aren't gendered, or you'd have to choose from multiple gendered drivers:

    - "gnuveau" for one masculine GPU.
    - "gnuvelle" for one feminine GPU.
    - "gnuveaux" for multiple masculine GPUs.
    - "gnuvelles" for multiple feminine GPUs.

KeplerBoy

1 replies

3h43m

2024-03-13 14:42:15 UTC

Installing CUDA is not that hard? You follow the official instructions and it's done within minutes.

Never had a problem with the initial installation. Updates can get messy, but aside from that it's pretty much smooth sailing.

IanOzsvald

0 replies

2h38m

2024-03-13 15:46:51 UTC

I've just spent the morning uninstalling and reinstalling different versions of the Nvidia driver (Linux) to get nvcc back for llama.cpp after Linux Mint did an update - I had CUDA 12.3 and 12.4 (5GB each), in conflict, with no guidance. 550 was the charm, not 535 that was fine in January. This is the third time I'm doing this since December. It is painful. I'm not in a hurry to return to my cuDF experiments as I'm pretty sure that'll be broken too (as it has been in the past). I'm the co-author of O'Reilly's High Performance Python book and this experience mirrors what I was having with pyCUDA a decade back.

jlarocco

0 replies

2h30m

2024-03-13 15:54:29 UTC

There are plenty of problems with NVIDIA on Linux, but I'm sad to tell you I think this one is your own fault.

The error message is telling you that you've held back broken packages that are conflicting with dependencies nvtop is trying to install. If you sort that out, nvtop should install.

I have nvtop installed on Debian via apt, and it works just fine.

ahartmetz

0 replies

1h44m

2024-03-13 16:40:30 UTC

Nah, screw install scripts, too. If I didn't prefer AMD anyway, I'd want an apt repo.

notorandit

9 replies

11h47m

2024-03-13 06:37:51 UTC

It is nice. But the status of GPU support in Linux is rather poor.

I can see it working only with VLC.

Firefox has some support while chromium based browsers have that only formally.

In the real world you never see the video hw acceleration kicking in, neither with webrtc nor with videos.

It is a pity.

KeplerBoy

6 replies

11h45m

2024-03-13 06:40:20 UTC

it's a valuable tool wherever GPU compute is used. Not sure if video decoding would even register as utilization since that is dedicated hardware.

usr1106

3 replies

11h15m

2024-03-13 07:10:05 UTC

Yes, you can see video decoding, but it's indeed only one detail of the data presented, as it's only a part of the HW.

Not at my computer now, can't tell what metric to watch.

notorandit

1 replies

9h22m

2024-03-13 09:02:58 UTC

Which software would use video decoding?

adrian_b

0 replies

2h51m

2024-03-13 15:33:39 UTC

Pretty much everything except the browsers.

I use several video players, including mpv, vlc and ffmpeg, and they have always used hardware video decoding and encoding without problems on all kinds of GPUs.

Only in Firefox and Chrome/Chromium is the hardware acceleration broken in most versions. There have been some versions where it worked fine (on NVIDIA), but at the next browser upgrade it was broken again.

This does not bother me much, because I do not like to watch video files in a browser anyway. I always download them first and I play them locally.

KeplerBoy

0 replies

10h33m

2024-03-13 07:51:32 UTC

ah yes, you can configure nvtop to display encode and decode loads. Interesting to watch how playback speed in VLC is directly reflected in the metric.

notorandit

0 replies

9h23m

2024-03-13 09:01:52 UTC

Yes, indeed. But 90% of my very own GPU usage is for WebRTC based applications and web-based videos.

Which leads to what I said: no hw-assisted encoding/decoding actually available.

I use intel_gpu_top (from intel-gpu-tools) to monitor my GPU usage: only VLC shows usage.

Has anyone had success with browsers?

jandrese

0 replies

1h1m

2024-03-13 17:24:23 UTC

Pretty much every time I've used nvtop it has been while doing CUDA stuff, mostly to see if it was going to blow out the memory but also to spot check that the model is actually using the GPU. I've had times where it said it was using the GPU, but it only did so for the first part of the work and then dropped down to the CPU for the rest.
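A spot check like this can also be scripted instead of watched by eye. The following is a minimal sketch, assuming `nvidia-smi` is on PATH and using its CSV query mode; the helper names `parse_gpu_rows` and `query_gpus` are made up for illustration and are not part of nvtop.

```python
import subprocess

# Fields requested from nvidia-smi; values come back in this order.
FIELDS = ["utilization.gpu", "memory.used", "memory.total"]


def parse_gpu_rows(csv_text):
    """Parse `--format=csv,noheader,nounits` output into a list of dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        if len(values) != len(FIELDS):
            continue  # skip malformed lines
        try:
            rows.append(dict(zip(FIELDS, (float(v) for v in values))))
        except ValueError:
            continue  # non-numeric value, e.g. an unsupported field
    return rows


def query_gpus():
    """Run nvidia-smi once and return one dict per GPU (assumes it is installed)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + ",".join(FIELDS),
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_rows(out)
```

Calling `query_gpus()` in a loop during a run and logging when `utilization.gpu` drops to zero would flag the "fell back to CPU" case; the polling interval is a judgment call.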

weberer

0 replies

5h57m

2024-03-13 12:27:46 UTC

I think this is more for people developing applications with CUDA and the like.

pjmlp

0 replies

9h21m

2024-03-13 09:04:03 UTC

Can confirm with my devices.

hughesjj

8 replies

15h54m

2024-03-13 02:30:58 UTC

Nvtop+bottom are my favorite resource monitors for Linux today, but TIL nvtop also works for non-nvidia devices

weinzierl

3 replies

8h35m

2024-03-13 09:50:17 UTC

I use htop but often want to focus on a process, its children and all their threads. Is there a htop replacement that can:

- Show a thread with all children and threads, but nothing else

- Show the whole tree but keep the selected (Shift + Space) process in a fixed screen position.

- Bubble rows up and down into their new positions instead of having them jump around all over the place.

themoonisachees

2 replies

7h51m

2024-03-13 10:33:28 UTC

Btop does all that and is available in your package manager. It also looks beautiful.

weinzierl

1 replies

6h30m

2024-03-13 11:54:47 UTC

Has a nice 90s vibe. I have not figured out how to just expand the selected process. I could collapse everything except the selected process manually, but that'd be tedious.

DonHopkins

0 replies

36m

2024-03-13 17:48:46 UTC

90's rave scene! ;)

freedomben

3 replies

14h29m

2024-03-13 03:56:06 UTC

wow, bottom[1] is awesome! This is now my favorite monitor. It's not in the Fedora repos but there's a COPR for it. To install:

    sudo dnf copr enable atim/bottom
    sudo dnf install bottom

[1] https://github.com/ClementTsang/bottom

petepete

2 replies

13h29m

2024-03-13 04:55:57 UTC

btop++ is also worth a look if you like bottom, another take on a modern htop.

https://github.com/aristocratos/btop

freedomben

0 replies

4h24m

2024-03-13 14:00:38 UTC

Amazing, thank you!

Cu3PO42

0 replies

3h0m

2024-03-13 15:24:33 UTC

And it also supports GPU usage monitoring!

formalsystem

8 replies

15h47m

2024-03-13 02:38:10 UTC

nvtop or nvidia-smi gives you a good macro overview but I personally have found that utilization (EDIT: As reported by nvidia-smi) is actually a poor proxy for how fast your workload can be outside of just ensuring that a GPU is indeed being used

If you're here because you're interested in AI performance I'd recommend instead https://docs.nvidia.com/nsight-compute/NsightComputeCli/inde... to profile individual kernels. Nsight systems for a macro view https://developer.nvidia.com/nsight-systems and the PyTorch profiler if you're not authoring kernels directly but using something like PyTorch https://pytorch.org/tutorials/recipes/recipes/profiler_recip...

samstave

2 replies

14h17m

2024-03-13 04:07:57 UTC

If you install Docker Desktop with WSL2 checked, it automatically lets you run Nvidia-SMI in your WSL ubuntu environ on Windows:

https://i.imgur.com/C24EV5U.png

then sudo apt install nvtop

https://i.imgur.com/SOoCdvR.png

EDIT:

Thanks. Some people were having random problems installing WSL on their systems and I found this was the easiest solution (but based on their card models, they appeared to have much older machines).

acka

1 replies

13h30m

2024-03-13 04:55:17 UTC

There is no need to install Docker Desktop just to run nvidia-smi in WSL; the Windows directory containing the nvidia-smi binary is mounted inside a WSL instance and added to PATH automatically by WSL on instance startup.

As an aside: there is no need to install Docker Desktop just to use Docker containers in WSL either, unless you want a Windows GUI to manage your containers. Just follow the official documentation for installing Docker in your Linux distro of choice, or simply run `sudo apt install docker.io` in the default WSL Ubuntu distro. Docker will work just fine with an up-to-date WSL.

8A51C

0 replies

8h35m

2024-03-13 09:49:48 UTC

Further aside, it's possible to have both Docker Desktop and the normal linux Docker.io installed on WSL. They work in isolation, the easy way to know which is active is to check if Docker Desktop is running or not. I wouldn't recommend this set up...

refibrillator

1 replies

15h31m

2024-03-13 02:54:23 UTC

FLOPs utilization is arguably the industry standard metric for efficiency right now and it should be a good first approximation of how much performance is left on the table.

But if you mean the reported utilization in nvtop is misleading I completely agree (as someone who uses it daily).

I’ve been meaning to dig into the source/docs to see what’s going on. The power usage seems to be a more reliable indicator of actual hardware utilization, at least on nvidia gear.

VHRanger

0 replies

4h20m

2024-03-13 14:05:10 UTC

> FLOPs utilization is arguably the industry standard metric for efficiency right now

I'd argue GB/s memory bandwidth is the bigger worry at the moment.
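The two metrics in this exchange can be related with the standard roofline model: attainable throughput is the smaller of the compute peak and memory bandwidth times arithmetic intensity. A minimal sketch follows; the device numbers are hypothetical and not taken from any real GPU spec sheet.

```python
def flops_utilization(achieved_flops_per_s, peak_flops_per_s):
    """Fraction of peak compute throughput actually achieved."""
    return achieved_flops_per_s / peak_flops_per_s


def roofline_bound(peak_flops_per_s, peak_bytes_per_s, flops_per_byte):
    """Attainable FLOP/s: min(compute roof, bandwidth * arithmetic intensity)."""
    return min(peak_flops_per_s, peak_bytes_per_s * flops_per_byte)


# Hypothetical device: 100 TFLOP/s peak compute, 1 TB/s memory bandwidth.
PEAK_FLOPS = 100e12
PEAK_BW = 1e12

# A kernel doing 10 FLOPs per byte moved is capped by bandwidth on this device.
bound = roofline_bound(PEAK_FLOPS, PEAK_BW, 10.0)
print("best-case FLOPs utilization:", flops_utilization(bound, PEAK_FLOPS))
```

Under these assumed numbers a low-intensity kernel can never reach high FLOPs utilization no matter how well it is written, which is why bandwidth ends up being the number people watch.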

ipsum2

1 replies

15h6m

2024-03-13 03:19:17 UTC

I've been going off of power draw in nvidia-smi as a proxy of util, doesn't require additional setup or code changes.

KeplerBoy

0 replies

13h20m

2024-03-13 05:04:54 UTC

That's hard to argue with. Of course power draw is a direct measure of hardware utilization, but it doesn't translate very well to a measure of GPU Code efficiency.

Often you can squeeze out another order of magnitude of performance by rewriting the kernel and the power draw will always stay capped at whatever the maximum is. I'd say GPU power consumption is interesting if you're CPU bound and struggling to feed the GPU enough data and/or tasks.

pama

0 replies

15h4m

2024-03-13 03:21:08 UTC

I agree that utilization by nvidia-smi is a poor proxy for performance. FWIW, I’ve found that for the same architecture the power consumption reported in nvtop very often correlates super nicely with the training performance and the peak performance is always at peak power consumption. Agreed on your advice for getting to tune your architecture details, but once that’s fixed and you have simple things to debug like memory usage, batch size, dataloading bottlenecks, the raw power metric is typically a quick proxy. I find the temperature is a second useful macro metric: you want to be at max power draw and max allowed temp at all times but not exceed the temperature where you throttle.
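One way to turn the power-draw proxy into a single number is to average draw over the power limit across a run. This is only a sketch: the `(draw_watts, limit_watts)` sample format and the function name `power_fraction` are assumptions for illustration, with the values expected to come from whatever monitor you already use.

```python
def power_fraction(samples):
    """Mean of power_draw / power_limit over (draw_watts, limit_watts) samples."""
    ratios = [draw / limit for draw, limit in samples if limit > 0]
    return sum(ratios) / len(ratios) if ratios else 0.0


# Hypothetical readings from a 300 W card: one at half power, one at the cap.
print(power_fraction([(150.0, 300.0), (300.0, 300.0)]))
```

A value that stays well below 1.0 during training would suggest the GPU is waiting on something else, such as data loading.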

malux85

4 replies

15h5m

2024-03-13 03:19:31 UTC

There is also nvitop, which I find better utilizes screen space when > 2 GPUs

https://github.com/XuehaiPan/nvitop

sa-code

1 replies

11h0m

2024-03-13 07:24:26 UTC

The downside with nvitop is that it's written in python, which means having it in your environment can cause dependency conflicts. It's either that or you have a separate venv just for it. Maybe it's fine for personal use but sysadmins would prefer nvtop

cl3misch

0 replies

10h21m

2024-03-13 08:04:20 UTC

That's why the authors recommend pipx for installing nvitop. I am not a sysadmin, but I prefer pipx over relying on the (often outdated) distro sources.

https://github.com/XuehaiPan/nvitop?tab=readme-ov-file#insta...

freedomben

1 replies

14h17m

2024-03-13 04:07:28 UTC

Does nvitop support AMD cards?

malux85

0 replies

12h2m

2024-03-13 06:23:12 UTC

Not sure sorry, I only have nvidia :<

brcmthrowaway

3 replies

14h46m

2024-03-13 03:38:29 UTC

Anything for macOS?

kernelsanderz

0 replies

12h57m

2024-03-13 05:28:15 UTC

There’s also asitop https://github.com/tlkh/asitop

itsgrimetime

0 replies

12h56m

2024-03-13 05:29:18 UTC

It's not a terminal app like bottom or nvtop but I use https://github.com/exelban/stats and it has iGPU stats

evilduck

0 replies

14h11m

2024-03-13 04:14:23 UTC

nvtop supposedly builds for macOS, but https://lib.rs/crates/pumas is similar.

distalx

2 replies

11h1m

2024-03-13 07:23:34 UTC

There is also Nvitop[1], which is more to my liking.

[1] https://github.com/XuehaiPan/nvitop

kiraaa

0 replies

10h38m

2024-03-13 07:47:21 UTC

super easy to install and use

cassianoleal

0 replies

8h27m

2024-03-13 09:58:07 UTC

Seems to be for NVidia only, whereas OP claims:

> Currently supported vendors are AMD (Linux amdgpu driver), Apple (limited M1 & M2 support), Huawei (Ascend), Intel (Linux i915 driver), NVIDIA (Linux proprietary drivers), Qualcomm Adreno (Linux MSM driver).

vilunov

1 replies

8h40m

2024-03-13 09:45:19 UTC

It doesn't work with mesa, does it? I'm using the new nvk driver and it shows me no GPU to monitor.

thangngoc89

0 replies

7h23m

2024-03-13 11:01:44 UTC

Of course it doesn't. `nvidia-smi` needs to be available

cyberax

1 replies

15h8m

2024-03-13 03:17:00 UTC

Is there a description of the wire protocol between the driver and NVidia library?

Context: I'd love to have a native Go-language library to read the GPU utilization for containerized workloads.

8organicbits

0 replies

8h54m

2024-03-13 09:31:22 UTC

Yes, check out NVML. There's wrappers for a couple languages, not sure about golang.

https://developer.nvidia.com/nvidia-management-library-nvml
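NVML is a C library API, not a wire protocol, so a binding in any language ends up making the same sequence of calls. Here is a rough sketch of that sequence in Python rather than Go, assuming a module that exposes the `pynvml` function names; the module is passed in as a parameter so the function itself has no hard dependency, and `read_gpu_stats` is a made-up name.

```python
def read_gpu_stats(nvml):
    """Return per-GPU utilization and memory using an NVML binding module.

    `nvml` is assumed to expose the pynvml function names used below.
    """
    nvml.nvmlInit()
    try:
        stats = []
        for i in range(nvml.nvmlDeviceGetCount()):
            handle = nvml.nvmlDeviceGetHandleByIndex(i)
            util = nvml.nvmlDeviceGetUtilizationRates(handle)
            mem = nvml.nvmlDeviceGetMemoryInfo(handle)
            stats.append({
                "index": i,
                "gpu_util_percent": util.gpu,
                "mem_used_bytes": mem.used,
                "mem_total_bytes": mem.total,
            })
        return stats
    finally:
        # Always release the library, even if a query fails.
        nvml.nvmlShutdown()
```

With the Python bindings installed, usage would be `import pynvml; print(read_gpu_stats(pynvml))`. Inside a container this also assumes the NVIDIA driver libraries are visible to the process.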

blagie

1 replies

10h10m

2024-03-13 08:15:08 UTC

Now that I use Home Assistant, I want all my data sources to plug into there. It can handle the rendering for me as I see fit, and it's where data comes to integrate.

It's one of those things which I wish existed, but I can't imagine anyone would have written. Until I do a web search.

https://github.com/koriwi/sensors2mqtt/tree/main

I have not used it yet, but that seems like how I'd want to do it.

feitingen

0 replies

8h11m

2024-03-13 10:13:44 UTC

Collectd can also output stats to mqtt, from sensors, disks, network and others.

Ycros

1 replies

9h59m

2024-03-13 08:25:41 UTC

I prefer btop, it does all the usual process monitoring as well as gpus in the latest versions.

notorandit

0 replies

9h21m

2024-03-13 09:04:21 UTC

Really? Mine is v1.3.2 and doesn't show Intel Iris Xe Graphics!

{UPDATE} I see: no Intel GPU support yet!

thangngoc89

0 replies

7h14m

2024-03-13 11:10:58 UTC

My favorite would be gpustat [1]. This shows the bare minimum amount of information to let me know whether the training has problems or is running well.

[1] https://github.com/wookayin/gpustat

sylware

0 replies

7h21m

2024-03-13 11:04:04 UTC

cmake... erk...

I would have preferred a simple and brutal shell script (not bash of course) to build it on elf/linux.

It is worth trashing cmake, always, so I'll write it if I end up using that GPU monitoring tool.

superkuh

0 replies

13h58m

2024-03-13 04:26:40 UTC

radeontop is the same sort of thing if you live in amdgpu-ville and want something easy to compile. I was able to use it to show that with kernel 5.x amdgpu vulkan when a process is pushed out of vram into gtt it'll never reload and get stuck in a 'slow' state.

stuaxo

0 replies

9h59m

2024-03-13 08:26:12 UTC

It's good to see linux graphics card utilities going multi-platform, instead of the old way of being per-driver.

shmerl

0 replies

13h3m

2024-03-13 05:22:01 UTC

Maybe you should use Nova instead of NVML for Nvidia?

https://lists.freedesktop.org/archives/nouveau/2024-February...

Other than that - a cool tool!

pavelstoev

0 replies

14h20m

2024-03-13 04:04:47 UTC

you can also profile AI/ML performance without actually running it https://github.com/CentML/DeepView.Profile

mrlonglong

0 replies

5h50m

2024-03-13 12:35:18 UTC

It'd be nice if it supported Nouveau.

collsni

0 replies

15h52m

2024-03-13 02:32:45 UTC

Amdgpu_top is another cool usage statistic/monitor for AMD GPUs

bvaisvil

0 replies

1h9m

2024-03-13 17:16:01 UTC

Zenith is my project which combines NVIDIA GPU monitoring with disk, CPU, and Top-like capabilities. https://github.com/bvaisvil/zenith