"And we ask: if your matrix multiply is smaller than 16x16, are you sure what you’re doing is AI?
From a philosophical point of view, we think a frame shift is in order. A “register” certainly shouldn’t be a 32-bit word like on the CPUs of old. And a 1024-bit wide vector register, as CUDA uses, is certainly a step in the right direction. But to us a “register” is a 16x16 tile of data. We think AI wants this."
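To make the "a register is a 16x16 tile" framing concrete, here is a rough NumPy sketch of a matmul whose unit of work is a 16x16 fragment — names and shapes are mine, purely to illustrate how an MMA-style unit consumes data:

    import numpy as np

    TILE = 16  # the "register" is a 16x16 tile of data

    def tiled_matmul(A, B):
        """Compute A @ B by accumulating products of 16x16 tiles,
        roughly the granularity a tensor-core-style MMA unit works at."""
        M, K = A.shape
        K2, N = B.shape
        assert K == K2 and M % TILE == 0 and N % TILE == 0 and K % TILE == 0
        C = np.zeros((M, N), dtype=np.float32)
        for i in range(0, M, TILE):
            for j in range(0, N, TILE):
                acc = np.zeros((TILE, TILE), dtype=np.float32)  # accumulator "register"
                for k in range(0, K, TILE):
                    a = A[i:i+TILE, k:k+TILE]  # one input tile
                    b = B[k:k+TILE, j:j+TILE]  # one input tile
                    acc += a.astype(np.float32) @ b.astype(np.float32)  # one tile MMA
                C[i:i+TILE, j:j+TILE] = acc
        return C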
The hardware needs of AI are starting to come into focus. GPUs, after all, were designed for an entirely different job; they're used for AI because they happen to have good matrix multiply hardware. "AI GPUs" get to leave out some of the stuff in a real GPU (does an H100 even have texture fill units?). Then there's a trend towards much lower-precision numbers. 16-bit floating point? 8-bit? 2-bit? 1-bit? That will settle out at some point. This paper indicates that hardware that likes 16x16 tiles makes a lot of sense. It's certainly possible to build such hardware. Someone reading this is probably writing it in VHDL right now, or will be soon.
Then we'll see somewhat simpler, less general, and cheaper devices that do "AI" operations with as little excess hardware baggage as possible. Nice.
GPUs have evolved to be AI machines with as little baggage as possible. People have been arguing that GPUs were old technology and therefore unsuited for AI since at least 2014 (when Nervana was founded), but what they perhaps didn't expect is how quickly the GPU would evolve into an AI machine.
Bill Dally from Nvidia argues that there is "no gain in building a specialized accelerator", in part because the current overhead on top of the arithmetic is in the ballpark of 20% (16% for IMMA and 22% for HMMA units): https://www.youtube.com/watch?v=gofI47kfD28
There does seem to be a somewhat obvious advantage: if all it has to do is matrix multiplication, and not every other thing a general-purpose GPU has to be good at, then it costs less to design. So now someone other than Nvidia or AMD can do it, and then very easily distinguish themselves by just sticking a ton of VRAM on it. Large VRAM is currently reserved for GPUs that are extraordinarily expensive, even though the extra VRAM costs only a fraction of the price difference between those and an ordinary consumer GPU.
I really hope we see an AI-PU (or under some other name, INT16PU, why not) for the consumer market sometime soon. Or being able to expand GPU memory using a PCIe socket (not sure if that's technically possible).
The whole point of GPU memory is that it's faster to access than going to memory (like your main RAM) through the PCIe bottleneck.
My uninformed question about this is: why can't we make the VRAM on GPUs expandable? I know that you need to avoid having the data traverse some kind of bus that trades overhead for wide compatibility, like PCIe, but if you only want to use it for more RAM, can't you just add more sockets whose traces go directly to where they're needed? Even if it's only compatible with a specific type of chip, it would seem worthwhile for the customer to buy a base GPU and add on however much VRAM they need. I've heard of people replacing existing RAM chips on their GPUs[0], so why can't this be built in as a socket like motherboards use for RAM and CPUs?
[0] https://www.tomshardware.com/news/16gb-rtx-3070-mod
Expandable VRAM on GPUs has been tried before - the industry just hates it. It's like Apple devices - want more internal storage? Buy a new computer so we can have the fat margins.
The original Rev A iMac in the late '90s had slotted memory for its ATI card, as one example — it shipped with 2MB and could be upgraded to 6MB after the fact with a 4MB SGRAM DIMM. There are also a handful of more recent examples floating around.
While I'm sure there are also packaging advantages to be had by directly soldering memory chips instead of slotting them etc, I strongly suspect the desire to keep buyers upgrading the whole card ($$$) every few years trumps this massively if you are a GPU vendor.
Put another way, what's in it for the GPU vendor to offer memory slots? Possibly reduced revenue, if it became industry norm.
Expansion has to answer one fundamental question: if you're likely to need more X tomorrow, why aren't you just buying it today?
The answer to this question almost has to be "because it will be cheaper to buy it tomorrow." However, GPUs bundle together RAM and compute. If RAM is likely to be cheaper tomorrow, isn't compute also probably going to be cheaper?
If both RAM and compute are likely cheaper tomorrow, then the calculus still probably points towards a wholesale replacement. Why not run/train models twice as quickly alongside the RAM upgrades?
Remember as well that expandable RAM doesn't unlock higher-bandwidth interconnects. If you could take the card from five years ago and load it up with 80 GB of VRAM, you'd still not see the memory bandwidth of a newly-bought H100.
If instead you just need the VRAM and don't care much about bandwidth/latency, then it seems like you'd be better off using unified memory and having system RAM be the ultimate expansion.
It's a minor technical challenge with no financial benefit for the GPU makers.
Replacing RAM chips on GPUs involves resoldering and similar things - those (for the most part) maintain the signal integrity and performance characteristics of the original RAM. Adding sockets complicates the signal path (iirc), so it's harder for the traces to go where they're needed, and realistically given a trade-off between speed/bandwidth and expandability I think the market goes with the former.
Technically we definitely can, but are there sufficiently many people willing to pay a sufficiently high premium for that feature? How much more would you be willing to pay for an otherwise identical card that has the option to expand RAM, and do you expect that a significant portion of buyers would want to pay a non-trivial up-front cost for that possibility?
Isn't that what NPUs are technically?
https://en.m.wikipedia.org/wiki/AI_accelerator
Isn't this what resizeable BAR and direct storage are for?
Designing it is easy and always has been. Programming it is the bottleneck. Otherwise Nvidia wouldn't be in the lead.
but programming it is "import pytorch" - nothing nvidia-specific there.
the mass press is very impressed by Cuda, but at least if we're talking AI (and this article is, exclusively), it's not the right interface.
and in fact, Nv's lead, if it exists, is because they pushed tensor hardware earlier.
Someone does, in fact, have to implement everything underneath that `import` call, and that work is _very_ hard to do for things that don't closely match Nvidia's SIMT architecture. There's a reason people don't like using dataflow architectures, even though from a pure hardware PoV they're very powerful -- you can't map CUDA's, or Pytorch's, or Tensorflow's model of the world onto them.
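To illustrate both sides of that: the user-facing PyTorch code really is hardware-agnostic, which is exactly why all the hard, vendor-specific work lives underneath the dispatch. A minimal sketch (the device strings are just examples):

    import torch

    # Same user code on any backend; the "import" is the easy part.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    a = torch.randn(4096, 4096, device=device)
    b = torch.randn(4096, 4096, device=device)
    c = a @ b  # dispatch picks whatever matmul kernel the backend has registered
    print(c.device, c.shape)

Everything behind that `@` — the kernels, the memory allocator, the compiler hooks — is what a new accelerator vendor actually has to build and maintain.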
I'm talking about adding Pytorch support for your special hardware.
Nv's lead is due to them having Pytorch support.
Exactly. And that means you not only save the 22% but also a large chunk of the Nvidia margin.
And, sure enough, there's a new AI chip from Intellifusion in China that's supposed to be 90% cheaper. 48 TOPS in int8 training performance for US$140.[1]
[1] https://www.tomshardware.com/tech-industry/artificial-intell...
I wonder what the cost of power to run these chips is. If the power cost ends up being large compared to the hardware cost, it could make sense to buy more chips and run them when power is cheap. They could become a large source of dispatchable demand.
Apple has already been doing this for a few years now. The NPU is totally different from the GPU or CPU on the die itself[1]. Nvidia is likely working on this as well, but I think a device that's a gaming/entertainment/crypto/AI bundle (i.e. sticking with the video card) is probably a better business move.
[1] https://github.com/hollance/neural-engine/blob/master/docs/a...
The NPUs on a lot of different systems occupy an awkward spot. For extremely small models, they're the way to go for low-power inference. But once you reach LLM or vision transformer size, it makes a lot more sense to switch to GPU shaders for that extra bit of large-model performance. For stuff like Llama and Stable Diffusion, those Neural Engines are practically wasted silicon. The biggest saving grace is projects like ONNX attempting to sew them into a unified non-15-competing-standards API, but even that won't change how underpowered they are.
Nvidia escapes this by designing their GPU architecture to incorporate NPU concepts at a fundamental level. It's less redundant silicon and enables you to scale a single architecture instead of flip-flopping to whichever one is most convenient.
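For what it's worth, the ONNX Runtime side of that "unified API" story looks roughly like this today: you hand it a preference-ordered list of execution providers and it falls back to CPU when the NPU-backed one isn't available ("model.onnx" is a placeholder, and provider names vary by platform and build):

    import onnxruntime as ort

    # Prefer an NPU-backed provider when this build has one, otherwise CPU.
    preferred = ["CoreMLExecutionProvider", "CPUExecutionProvider"]
    available = ort.get_available_providers()
    sess = ort.InferenceSession(
        "model.onnx",  # placeholder path
        providers=[p for p in preferred if p in available],
    )
    print(sess.get_providers())  # shows which providers actually got loaded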
It's currently doable for Apple – I think their strategy is to slowly enhance iPhones, bit by bit, with special-purpose models for dealing with media like photo subject identification, OCR (in every language!), voice transcription, etc. Apple's currently learning from Microsoft's attempts to make AI stick everywhere.
Soon our phones will dream beside us every night (integrating new data into our personal model while on the charger)
Well, iPhone already does that with photos. :)
Do you have a link to where they break down what inference for photos happens in real time vs. overnight/charging?
I think Apple is more interested in features that work consistently than in giving power users the ability to play with essentially alpha or beta AI features.
I would guess that their strategy is to not include powerful client-side hardware, and supplement that with some kind of "AiCloud" subscription to do the battery-draining, heat-generating stuff on their cloud. They're trading off their branding as a privacy focused company under the (probably correct) belief that people will be more willing to upload their data to iCloud's AI than Microsoft's.
Fwiw, I think they're probably correct. It has always struck me as odd that people want to run AI on their phone. My impression of AI is that it creates very generalized solutions to problems that would be difficult to code, at the cost of being very compute inefficient.
I don't really want code like that running on my phone; it's a poor platform for it. Thermal dissipation and form factor limit the available processing power, and batteries limit how long you can use the processing power you have. I don't really want to waste either trying to do subject identification locally. I'm going to upload the photos to iCloud anyways; let me pay an extra $1/month or whatever to have that identification happen in the cloud, on a server built for it that has data center thermal dissipation and is plugged into the wall.
The pinch (as far as I can see it) is that you're right, and Apple can't sell a freestanding service to save their life. If we do get an AppleGPT pay-as-you-go service, it's certain to be extraordinarily censored and locked-down as the exclusive first-party option on iPhone. It will feature "vertical integration" that no other AI can have, alongside censorship so prudish that it would make Maury Povich gasp.
So... I think users will be stuck. They'll want to run uncensored models on their phone, but Apple will want to keep them in the walled garden at any cost. It feels like the whole "Fortnite" situation all over again, where users can agree they want something but Apple can't decide.
Anyone checked out the NPU on the new iPad? It’s supposed to be a bazillion times better according to Apple but I haven’t had a chance to dig into the reality.
I guess we can assume this is going to be what’s used in what’s being called Apple’s first AI phone, iPhone 16.
It has 38 TOPS of INT8 performance. Not very remarkable compared to consumer Nvidia GPUs, which are like one or two orders of magnitude faster.
For reference, Nvidia's Jetson Orin NX robotics platform is 35-50 TOPS on average. Apple is catching up, but Nvidia still has by-far the more flexible (and better scaled) platform.
That 38 TOPS figure was a bit weird: it's literally below the baseline (45 TOPS) for the "AI PC" branding Qualcomm/Intel/Microsoft is launching this June, and also 10x less than typical GPUs. I think it was just clever marketing exploiting the fact that the "AI PC" branding hasn't launched yet.
For inference, Nvidia has had DLA since 2017-ish, if I remember correctly, which is completely separate from the GPU.
And Google has their TPUs.
Wait, but Nvidia tensor cores are exactly the hardware that likes 16x16 tiles, no? I thought that was the whole point? The hardware is already here, and I'm sceptical that there is another order of magnitude in performance to be gained from even more specialized designs.
What's the ratio of tensor cores to regular SIMD compute ("CUDA cores") on NVIDIA's current chips?
This is in the article: if you aren't using the tensor cores, you aren't utilizing ~94% of the FLOPs available.
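Back-of-the-envelope, using NVIDIA's published H100 SXM dense (no sparsity) spec numbers as I remember them — treat these as approximate:

    # Rough H100 SXM peak throughput, dense, no sparsity (approximate spec-sheet values).
    cuda_core_tflops = 67     # FP32 on the regular SIMT units
    tensor_core_tflops = 990  # FP16/BF16 on the tensor cores

    tensor_share = tensor_core_tflops / (cuda_core_tflops + tensor_core_tflops)
    print(f"{tensor_share:.0%} of peak FLOPs are in the tensor cores")  # ~94%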
Knowing what portion of the FLOPs are in the tensor cores isn't quite the right thing to be looking at. The key question is how much more tensor core performance can be gained by reducing or eliminating the die area devoted to non-tensor compute and higher-precision arithmetic. Most of NVIDIA's GPUs are still designed primarily for graphics: they have some fixed-function units that can be deleted in an AI-only chip, and a lot of die space devoted to non-tensor compute because the tensor cores don't naturally lend themselves to graphics work (though NVIDIA has spent years coming up with ways to not leave the tensor cores dark during graphics work, most notably DLSS).
So the claims that NVIDIA's GPUs are already thoroughly optimized for AI and that there's no low-hanging fruit for further specialization don't seem too plausible, unless you're only talking about the part of the datacenter lineup that has already had nearly all fixed-function graphics hardware excised. And even for Hopper and Blackwell, there's some fat to be trimmed if you can narrow your requirements.
Mind the Dark Silicon Fraction.
Some fraction of your transistors MUST go unused on average or you melt the silicon. This was already a thing in the 20nm days and I'm sure it has only gotten worse. 100% TDP utilization might correspond to 60% device utilization.
That's true for CPUs. Does it really apply to GPUs and other accelerators for embarrassingly parallel problems where going slower but wider is always a valid option?
There is not a lot of fixed function left in the modern graphics pipeline; economies of scale dictate that there is no net benefit in trimming it.
And yet, even NVIDIA does trim it from chips like the H100, which has no display outputs, RT cores, or video encoders (though they keep the decoders), and only has ROPs for two of the 72 TPCs.
On the H100 specifically. The figure is likely different on consumer cards.
Would you say this is ultimately "ASICs for AI"?
In the same way that CPUs are ASICs for integer operations, that makes sense to me.
Most CPUs do just fine on floating point too.
Floating point arithmetic _is_ integer arithmetic at the CPU level, because of how floating point numbers work.
That's a good point - floating point operations are implemented with integer-math circuits (or at least can be - I'm not privy to how modern chip manufacturers implement them). E.g: your ALU may have an 11-bit adder specifically to add your f64 exponents.
Some slides to get the gist of it: https://users.encs.concordia.ca/~asim/COEN_6501/Lecture_Note...
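You can poke at this from Python: the IEEE-754 fields are just integers packed into a word, and a float multiply boils down to adding the exponent fields and multiplying the mantissas (then renormalizing and rounding). A quick sketch:

    import struct

    def f64_fields(x: float):
        """Split an IEEE-754 double into its 1 sign bit, 11 exponent bits, 52 mantissa bits."""
        bits = struct.unpack(">Q", struct.pack(">d", x))[0]
        sign = bits >> 63
        exponent = (bits >> 52) & 0x7FF        # the 11-bit field that exponent adder handles
        mantissa = bits & ((1 << 52) - 1)
        return sign, exponent, mantissa

    s, e, m = f64_fields(6.0)
    print(s, e - 1023, m)  # unbiased exponent of 6.0 is 2, since 6.0 = 1.5 * 2**2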
I'm still getting used to that.
“NVidia’s LIES..
On kernels such as flash attention, TMA and the L2 cache are both fast enough so as to hide these problems reasonably well. But to make the full use of the hardware, memory request must be coalesced and bank conflicts avoided ”
The depth of the competition is also starting to become apparent. There’s no way the documentation error was totally an accident. Diagrams are the easiest to steal / copy and there must have been some utility for nvidia to have left this in place. Remember when Naveen Rao’s Nervana was writing NVidia Maxwell drivers that out-performed NVidia’s own? Not every documentation mishap in a high-growth product is a competition counter-measure, but given that the researchers spent so long reverse-engineering wgmma and given the China-US political situation of the H100 in particular, it seems NVidia is up to its old tricks to protect its moat.
So don’t over-study the H100 peculiarities, as “what hardware does AI want?” really encompasses the commercial situation as well.
I don't understand. If they document their stuff with errors, it will hurt users, be they Chinese or US? Or is it expected that US users will call Nvidia to ask for the correct documentation?
The vast majority of users use NVidia's own kernels rather than optimizing their own. And those who do write custom kernels are typically not trying to compete with NVidia's own GEMM.
It could be a case of classic market segmentation. The lower tier customers get the incomplete or error-ridden documentation, and the upper tier trusted customers^W'partners' get access to the juicy stuff: complete and mostly correct documentation, including stuff intentionally left out of the lower tier package like application notes containing secret hardware handshakes to unlock hidden features, all under strict NDA of course.
it's going to be awkward in consumer hardware either way
if you segregate AI units from the GPU, the thing is that both AI and GPUs will continue to need massive amounts of matrix multiplication and as little memory latency as possible
the move to have more of it wrapped in the GPU makes sense but at least in the short and medium term, most devices won't be able to justify the gargantuan silicon wafer space/die growth that this would entail - also currently Nvidia's tech is ahead and they don't make state of the art x86 or ARM CPUs
for the time being I think the current paradigm makes the most sense, with small compute devices making inroads in the consumer markets as non-generalist computers - note that more AI-oriented pseudo-GPUs have existed and been successful since the earlier Nvidia Tesla lineup and then the so-called "Nvidia Data Center GPUs"
Should be "as much memory bandwidth as possible". GPUs are designed to be (relatively) more insensitive to memory latency than CPUs.
yep that's true, although AI compute modules do get significant benefit from low latency cache as well
hasn't Google been building such devices for a decade now?
yep, and the main engineers have founded groq.com, with an architecture that, among other things, specifically addresses the memory management issues
There was that recent paper titled "The Era of 1-bit LLMs" [0], which was actually suggesting a 1.58-bit LLM (2 bits in practice).
Yeah, I think I'm in the "will be soon" camp - an FPGA board has been ordered. Especially with the 2-bit data types outlined in that paper [0] and more details in [1]. There's really a need for custom hardware to do that 2-bit math efficiently. Customizing one of the simpler open-source RISC-V integer implementations seems like something to try here, adding in the tiled matrix registers and custom instructions for dealing with them (with the 2-bit data types).
[0] https://arxiv.org/abs/2402.17764 [1] https://github.com/microsoft/unilm/blob/master/bitnet/The-Er...
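For concreteness, here's roughly what the weight quantization from that paper looks like as I read it (absmean scaling to {-1, 0, +1}), plus one arbitrary way to pack four ternary weights into a byte — the 2-bit encoding below is my choice, not the paper's:

    import numpy as np

    def absmean_ternary(W, eps=1e-6):
        """Quantize weights to {-1, 0, +1}: scale by mean |W|, then round and clip."""
        gamma = np.mean(np.abs(W)) + eps
        return np.clip(np.rint(W / gamma), -1, 1).astype(np.int8), gamma

    def pack_2bit(tern):
        """Pack ternary values at 2 bits each (4 weights per byte).
        Arbitrary encoding: 0 -> 0b00, +1 -> 0b01, -1 -> 0b10."""
        codes = np.where(tern == -1, 2, tern).astype(np.uint8).reshape(-1, 4)
        return (codes[:, 0] | (codes[:, 1] << 2) |
                (codes[:, 2] << 4) | (codes[:, 3] << 6)).astype(np.uint8)

    W = np.random.randn(8, 8).astype(np.float32)
    tern, gamma = absmean_ternary(W)
    packed = pack_2bit(tern)
    print(tern[:2], round(float(gamma), 3), packed.nbytes)  # 64 weights -> 16 bytes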
AMD is already on their second generation of the Versal line.
https://www.amd.com/en/products/accelerators/alveo/v80.html
XDNA Architecture
https://www.amd.com/en/technologies/xdna.html