"And we ask: if your matrix multiply is smaller than 16x16, are you sure what you’re doing is AI?
From a philosophical point of view, we think a frame shift is in order. A “register” certainly shouldn’t be a 32-bit word like on the CPUs of old. And a 1024-bit wide vector register, as CUDA uses, is certainly a step in the right direction. But to us a “register” is a 16x16 tile of data. We think AI wants this."
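To make the "a register is a 16x16 tile" framing concrete, here is a rough NumPy sketch of a matmul whose unit of work is a 16x16 fragment — names and shapes are mine, purely to illustrate how an MMA-style unit consumes data:

    import numpy as np

    TILE = 16  # the "register" is a 16x16 tile of data

    def tiled_matmul(A, B):
        """Compute A @ B by accumulating products of 16x16 tiles,
        roughly the granularity a tensor-core-style MMA unit works at."""
        M, K = A.shape
        K2, N = B.shape
        assert K == K2 and M % TILE == 0 and N % TILE == 0 and K % TILE == 0
        C = np.zeros((M, N), dtype=np.float32)
        for i in range(0, M, TILE):
            for j in range(0, N, TILE):
                acc = np.zeros((TILE, TILE), dtype=np.float32)  # accumulator "register"
                for k in range(0, K, TILE):
                    a = A[i:i+TILE, k:k+TILE]  # one input tile
                    b = B[k:k+TILE, j:j+TILE]  # one input tile
                    acc += a.astype(np.float32) @ b.astype(np.float32)  # one tile MMA
                C[i:i+TILE, j:j+TILE] = acc
        return C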
The hardware needs of AI are starting to come into focus. GPUs, after all, were designed for an entirely different job; they're used for AI because they happen to have good matrix multiply hardware. "AI GPUs" get to leave out some of the stuff in a real GPU (does an H100 even have texture fill units?). Then there's a trend towards much lower-precision numbers. 16-bit floating point? 8-bit? 2-bit? 1-bit? That will settle out at some point. This paper indicates that hardware that likes 16x16 tiles makes a lot of sense. It's certainly possible to build such hardware. Someone reading this is probably writing it in VHDL right now, or will be soon.
Then we'll see somewhat simpler, less general, and cheaper devices that do "AI" operations with as little excess hardware baggage as possible. Nice.
GPUs have evolved to be AI machines with as little baggage as possible. People have been arguing that GPUs were old technology and therefore unsuited for AI since at least 2014 (when Nervana was founded), but what they perhaps didn't expect is how quickly the GPU would evolve into an AI machine.
Bill Dally from Nvidia argues that there is "no gain in building a specialized accelerator", in part because the current overhead on top of the arithmetic is in the ballpark of 20% (16% for IMMA and 22% for HMMA units): https://www.youtube.com/watch?v=gofI47kfD28
There does seem to be a somewhat obvious advantage: if all it has to do is matrix multiplication, and not every other thing a general-purpose GPU has to be good at, then it costs less to design. So now someone other than Nvidia or AMD can do it, and then very easily distinguish themselves by just sticking a ton of VRAM on it. Large VRAM is currently reserved for GPUs that are extraordinarily expensive, even though the extra VRAM costs only a fraction of the price difference between those and an ordinary consumer GPU.
I really hope we see an AI-PU (or under some other name, INT16PU, why not) for the consumer market sometime soon. Or being able to expand GPU memory using a PCIe socket (not sure if that's technically possible).
The whole point of GPU memory is that it's faster to access than going to memory (like your main RAM) through the PCIe bottleneck.
My uninformed question about this is: why can't we make the VRAM on GPUs expandable? I know that you need to avoid having the data traverse some kind of bus that trades overhead for wide compatibility, like PCIe, but if you only want to use it for more RAM, can't you just add more sockets whose traces go directly to where they're needed? Even if it's only compatible with a specific type of chip, it would seem worthwhile for the customer to buy a base GPU and add on however much VRAM they need. I've heard of people replacing existing RAM chips on their GPUs[0], so why can't this be built in as a socket like motherboards use for RAM and CPUs?
[0] https://www.tomshardware.com/news/16gb-rtx-3070-mod
Expandable VRAM on GPUs has been tried before - the industry just hates it. It's like Apple devices - want more internal storage? Buy a new computer so we can have the fat margins.
The original Rev A iMac in the late '90s had slotted memory for its ATI card, as one example — it shipped with 2MB and could be upgraded to 6MB after the fact with a 4MB SGRAM DIMM. There are also a handful of more recent examples floating around.
While I'm sure there are also packaging advantages to be had by directly soldering memory chips instead of slotting them etc, I strongly suspect the desire to keep buyers upgrading the whole card ($$$) every few years trumps this massively if you are a GPU vendor.
Put another way, what's in it for the GPU vendor to offer memory slots? Possibly reduced revenue, if it became industry norm.
Expansion has to answer one fundamental question: if you're likely to need more X tomorrow, why aren't you just buying it today?
The answer to this question almost has to be "because it will be cheaper to buy it tomorrow." However, GPUs bundle together RAM and compute. If RAM is likely to be cheaper tomorrow, isn't compute also probably going to be cheaper?
If both RAM and compute are likely cheaper tomorrow, then the calculus still probably points towards a wholesale replacement. Why not run/train models twice as quickly alongside the RAM upgrades?
Remember as well that expandable RAM doesn't unlock higher-bandwidth interconnects. If you could take the card from five years ago and load it up with 80 GB of VRAM, you'd still not see the memory bandwidth of a newly-bought H100.
If instead you just need the VRAM and don't care much about bandwidth/latency, then it seems like you'd be better off using unified memory and having system RAM be the ultimate expansion.
It's a minor technical challenge with no financial benefit for the GPU makers.
Replacing RAM chips on GPUs involves resoldering and similar things - those (for the most part) maintain the signal integrity and performance characteristics of the original RAM. Adding sockets complicates the signal path (iirc), so it's harder for the traces to go where they're needed, and realistically given a trade-off between speed/bandwidth and expandability I think the market goes with the former.
Technically we definitely can, but are there sufficiently many people willing to pay a sufficiently high premium for that feature? How much more would you be willing to pay for an otherwise identical card that has the option to expand RAM, and do you expect that a significant portion of buyers would want to pay a non-trivial up-front cost for that possibility?
Isn't that what NPUs are technically?
https://en.m.wikipedia.org/wiki/AI_accelerator
Isn't this what resizeable BAR and direct storage are for?
Designing it is easy and always has been. Programming it is the bottleneck. Otherwise Nvidia wouldn't be in the lead.
but programming it is "import pytorch" - nothing nvidia-specific there.
the mass press is very impressed by Cuda, but at least if we're talking AI (and this article is, exclusively), it's not the right interface.
and in fact, Nv's lead, if it exists, is because they pushed tensor hardware earlier.
Someone does, in fact, have to implement everything underneath that `import` call, and that work is _very_ hard to do for things that don't closely match Nvidia's SIMT architecture. There's a reason people don't like using dataflow architectures, even though from a pure hardware PoV they're very powerful -- you can't map CUDA's, or Pytorch's, or Tensorflow's model of the world onto them.
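To illustrate both sides of that: the user-facing PyTorch code really is hardware-agnostic, which is exactly why all the hard, vendor-specific work lives underneath the dispatch. A minimal sketch (the device strings are just examples):

    import torch

    # Same user code on any backend; the "import" is the easy part.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    a = torch.randn(4096, 4096, device=device)
    b = torch.randn(4096, 4096, device=device)
    c = a @ b  # dispatch picks whatever matmul kernel the backend has registered
    print(c.device, c.shape)

Everything behind that `@` — the kernels, the memory allocator, the compiler hooks — is what a new accelerator vendor actually has to build and maintain.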
I'm talking about adding Pytorch support for your special hardware.
Nv's lead is due to them having Pytorch support.
Exactly. And that means you not only save the 22% but also a large chunk of the Nvidia margin.
And, sure enough, there's a new AI chip from Intellifusion in China that's supposed to be 90% cheaper. 48 TOPS in int8 training performance for US$140.[1]
[1] https://www.tomshardware.com/tech-industry/artificial-intell...
I wonder what the cost of power to run these chips is. If the power cost ends up being large compared to the hardware cost, it could make sense to buy more chips and run them when power is cheap. They could become a large source of dispatchable demand.
Apple has already been doing this for a few years now. The NPU is totally different from the GPU or CPU on the die itself[1]. Nvidia is likely working on this as well, but I think a device that's a gaming/entertainment/crypto/AI bundle (i.e. sticking with the video card) is probably a better business move.
[1] https://github.com/hollance/neural-engine/blob/master/docs/a...
The NPUs on a lot of different systems occupy an awkward spot. For extremely small models, they're the way to go for low-power inference. But once you reach LLM or vision transformer size, it makes a lot more sense to switch to GPU shaders for that extra bit of large-model performance. For stuff like Llama and Stable Diffusion, those Neural Engines are practically wasted silicon. The biggest saving grace is projects like ONNX attempting to sew them into a unified non-15-competing-standards API, but even that won't change how underpowered they are.
Nvidia escapes this by designing their GPU architecture to incorporate NPU concepts at a fundamental level. It's less redundant silicon and enables you to scale a single architecture instead of flip-flopping to whichever one is most convenient.
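For what it's worth, the ONNX Runtime side of that "unified API" story looks roughly like this today: you hand it a preference-ordered list of execution providers and it falls back to CPU when the NPU-backed one isn't available ("model.onnx" is a placeholder, and provider names vary by platform and build):

    import onnxruntime as ort

    # Prefer an NPU-backed provider when this build has one, otherwise CPU.
    preferred = ["CoreMLExecutionProvider", "CPUExecutionProvider"]
    available = ort.get_available_providers()
    sess = ort.InferenceSession(
        "model.onnx",  # placeholder path
        providers=[p for p in preferred if p in available],
    )
    print(sess.get_providers())  # shows which providers actually got loaded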
It's currently doable for Apple – I think their strategy is to slowly enhance iPhones, bit by bit, with special-purpose models for dealing with media like photo subject identification, OCR (in every language!), voice transcription, etc. Apple's currently learning from Microsoft's attempts to make AI stick everywhere.
Soon our phones will dream beside us every night (integrating new data into our personal model while on the charger)
Well, iPhone already does that with photos. :)
Do you have a link to where they break down what inference for photos happens in real time vs. overnight/charging?
I think Apple is more interested in features that work consistently than in giving power users the ability to play with essentially alpha or beta AI features.
I would guess that their strategy is to not include powerful client-side hardware, and supplement that with some kind of "AiCloud" subscription to do the battery-draining, heat-generating stuff on their cloud. They're trading off their branding as a privacy focused company under the (probably correct) belief that people will be more willing to upload their data to iCloud's AI than Microsoft's.
Fwiw, I think they're probably correct. It has always struck me as odd that people want to run AI on their phone. My impression of AI is that it creates very generalized solutions to problems that would be difficult to code, at the cost of being very compute inefficient.
I don't really want code like that running on my phone; it's a poor platform for it. Thermal dissipation and form factor limit the available processing power, and batteries limit how long you can use the processing power you have. I don't really want to waste either trying to do subject identification locally. I'm going to upload the photos to iCloud anyways; let me pay an extra $1/month or whatever to have that identification happen in the cloud, on a server built for it that has data center thermal dissipation and is plugged into the wall.
The pinch (as far as I can see it) is that you're right, and Apple can't sell a freestanding service to save their life. If we do get an AppleGPT pay-as-you-go service, it's certain to be extraordinarily censored and locked-down as the exclusive first-party option on iPhone. It will feature "vertical integration" that no other AI can have, alongside censorship so prudish that it would make Maury Povich gasp.
So... I think users will be stuck. They'll want to run uncensored models on their phone, but Apple will want to keep them in the walled garden at any cost. It feels like the whole "Fortnite" situation all over again, where users can agree they want something but Apple can't decide.
Anyone checked out the NPU on the new iPad? It’s supposed to be a bazillion times better according to Apple but I haven’t had a chance to dig into the reality.
I guess we can assume this is going to be what’s used in what’s being called Apple’s first AI phone, iPhone 16.
It has 38 TOPS of INT8 performance. Not very remarkable compared to consumer Nvidia GPUs, which are like one or two orders of magnitude faster.
For reference, Nvidia's Jetson Orin NX robotics platform is 35-50 TOPS on average. Apple is catching up, but Nvidia still has by-far the more flexible (and better scaled) platform.
That 38 TOPS figure was a bit weird: it's literally below the baseline (45 TOPS) for the "AI PC" branding Qualcomm/Intel/Microsoft is launching this June, and also 10x less than typical GPUs. I think it was just clever marketing exploiting the fact that the "AI PC" branding hasn't launched yet.
For inference, Nvidia has had DLA since 2017-ish, if I remember correctly, which is completely separate from the GPU.
And Google has their TPUs.
Wait, but Nvidia tensor cores are exactly the hardware that likes 16x16 tiles, no? I thought that was the whole point? The hardware is already here, and I'm sceptical that there is another order of magnitude in performance to be gained from even more specialized designs.
What's the ratio of tensor cores to regular SIMD compute ("CUDA cores") on NVIDIA's current chips?
This is in the article: if you aren't using the tensor cores, you aren't utilizing ~94% of the FLOPs available.
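Back-of-the-envelope, using NVIDIA's published H100 SXM dense (no sparsity) spec numbers as I remember them — treat these as approximate:

    # Rough H100 SXM peak throughput, dense, no sparsity (approximate spec-sheet values).
    cuda_core_tflops = 67     # FP32 on the regular SIMT units
    tensor_core_tflops = 990  # FP16/BF16 on the tensor cores

    tensor_share = tensor_core_tflops / (cuda_core_tflops + tensor_core_tflops)
    print(f"{tensor_share:.0%} of peak FLOPs are in the tensor cores")  # ~94%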
Knowing what portion of the FLOPs are in the tensor cores isn't quite the right thing to be looking at. The key question is how much more tensor core performance can be gained by reducing or eliminating the die area devoted to non-tensor compute and higher-precision arithmetic. Most of NVIDIA's GPUs are still designed primarily for graphics: they have some fixed-function units that can be deleted in an AI-only chip, and a lot of die space devoted to non-tensor compute because the tensor cores don't naturally lend themselves to graphics work (though NVIDIA has spent years coming up with ways to not leave the tensor cores dark during graphics work, most notably DLSS).
So the claims that NVIDIA's GPUs are already thoroughly optimized for AI and that there's no low-hanging fruit for further specialization don't seem too plausible, unless you're only talking about the part of the datacenter lineup that has already had nearly all fixed-function graphics hardware excised. And even for Hopper and Blackwell, there's some fat to be trimmed if you can narrow your requirements.
Mind the Dark Silicon Fraction.
Some fraction of your transistors MUST go unused on average or you melt the silicon. This was already a thing in the 20nm days and I'm sure it has only gotten worse. 100% TDP utilization might correspond to 60% device utilization.
That's true for CPUs. Does it really apply to GPUs and other accelerators for embarrassingly parallel problems where going slower but wider is always a valid option?
There is not a lot of fixed function left in the modern graphics pipeline; economies of scale dictate that there is no net benefit in trimming it.
And yet, even NVIDIA does trim it from chips like the H100, which has no display outputs, RT cores, or video encoders (though they keep the decoders), and only has ROPs for two of the 72 TPCs.
On the H100 specifically. The figure is likely different on consumer cards.
Would you say this is ultimately "ASICs for AI"?
In the same way that CPUs are ASICs for integer operations, that makes sense to me.
Most CPUs do just fine on floating point too.
Floating point arithmetic _is_ integer arithmetic at the CPU level, because of how floating point numbers work.
That's a good point - floating point operations are implemented with integer-math circuits (or at least can be - I'm not privy to how modern chip manufacturers implement them). E.g: your ALU may have an 11-bit adder specifically to add your f64 exponents.
Some slides to get the gist of it: https://users.encs.concordia.ca/~asim/COEN_6501/Lecture_Note...
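You can poke at this from Python: the IEEE-754 fields are just integers packed into a word, and a float multiply boils down to adding the exponent fields and multiplying the mantissas (then renormalizing and rounding). A quick sketch:

    import struct

    def f64_fields(x: float):
        """Split an IEEE-754 double into its 1 sign bit, 11 exponent bits, 52 mantissa bits."""
        bits = struct.unpack(">Q", struct.pack(">d", x))[0]
        sign = bits >> 63
        exponent = (bits >> 52) & 0x7FF        # the 11-bit field that exponent adder handles
        mantissa = bits & ((1 << 52) - 1)
        return sign, exponent, mantissa

    s, e, m = f64_fields(6.0)
    print(s, e - 1023, m)  # unbiased exponent of 6.0 is 2, since 6.0 = 1.5 * 2**2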
I'm still getting used to that.
“NVidia’s LIES..
On kernels such as flash attention, TMA and the L2 cache are both fast enough so as to hide these problems reasonably well. But to make the full use of the hardware, memory request must be coalesced and bank conflicts avoided ”
The depth of the competition is also starting to become apparent. There’s no way the documentation error was totally an accident. Diagrams are the easiest to steal / copy and there must have been some utility for nvidia to have left this in place. Remember when Naveen Rao’s Nervana was writing NVidia Maxwell drivers that out-performed NVidia’s own? Not every documentation mishap in a high-growth product is a competition counter-measure, but given that the researchers spent so long reverse-engineering wgmma and given the China-US political situation of the H100 in particular, it seems NVidia is up to its old tricks to protect its moat.
So don’t over-study the H100 peculiarities, as “what hardware does AI want?” really encompasses the commercial situation as well.
I don't understand. If they document their stuff with errors, it will hurt users, be they Chinese or US? Or is it expected that US users will call Nvidia to ask for the correct documentation?
The vast majority of users use NVidia's own kernels rather than optimizing their own. And those who do write custom kernels are typically not trying to compete with NVidia's own GEMM.
It could be a case of classic market segmentation. The lower tier customers get the incomplete or error-ridden documentation, and the upper tier trusted customers^W'partners' get access to the juicy stuff: complete and mostly correct documentation, including stuff intentionally left out of the lower tier package like application notes containing secret hardware handshakes to unlock hidden features, all under strict NDA of course.
it's going to be awkward in consumer hardware either way
if you segregate AI units from the GPU, the thing is that both AI and GPUs will continue to need massive amounts of matrix multiplication and as little memory latency as possible
the move to have more of it wrapped in the GPU makes sense but at least in the short and medium term, most devices won't be able to justify the gargantuan silicon wafer space/die growth that this would entail - also currently Nvidia's tech is ahead and they don't make state of the art x86 or ARM CPUs
for the time being I think the current paradigm makes the most sense, with small compute devices making inroads in the consumer markets as non-generalist computers - note that more AI-oriented pseudo-GPUs have existed and been successful since the earlier Nvidia Tesla lineup and then the so-called "Nvidia Data Center GPUs"
Should be "as much memory bandwidth as possible". GPUs are designed to be (relatively) more insensitive to memory latency than CPUs.
yep that's true, although AI compute modules do get significant benefit from low latency cache as well
hasn't Google been building such devices for a decade now?
yep, and the main engineers have founded groq.com, with an architecture that, among other things, specifically addresses the memory management issues
There was that recent paper titled "The Era of 1-bit LLMs" [0], which was actually suggesting a 1.58-bit LLM (2 bits in practice).
Yeah, I think I'm in the "will be soon" camp - an FPGA board has been ordered. Especially with the 2-bit data types outlined in that paper [0] and more details in [1]. There's really a need for custom hardware to do that 2-bit math efficiently. Customizing one of the simpler open-source RISC-V integer implementations seems like something to try here, adding in the tiled matrix registers and custom instructions for dealing with them (with the 2-bit data types).
[0] https://arxiv.org/abs/2402.17764 [1] https://github.com/microsoft/unilm/blob/master/bitnet/The-Er...
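For concreteness, here's roughly what the weight quantization from that paper looks like as I read it (absmean scaling to {-1, 0, +1}), plus one arbitrary way to pack four ternary weights into a byte — the 2-bit encoding below is my choice, not the paper's:

    import numpy as np

    def absmean_ternary(W, eps=1e-6):
        """Quantize weights to {-1, 0, +1}: scale by mean |W|, then round and clip."""
        gamma = np.mean(np.abs(W)) + eps
        return np.clip(np.rint(W / gamma), -1, 1).astype(np.int8), gamma

    def pack_2bit(tern):
        """Pack ternary values at 2 bits each (4 weights per byte).
        Arbitrary encoding: 0 -> 0b00, +1 -> 0b01, -1 -> 0b10."""
        codes = np.where(tern == -1, 2, tern).astype(np.uint8).reshape(-1, 4)
        return (codes[:, 0] | (codes[:, 1] << 2) |
                (codes[:, 2] << 4) | (codes[:, 3] << 6)).astype(np.uint8)

    W = np.random.randn(8, 8).astype(np.float32)
    tern, gamma = absmean_ternary(W)
    packed = pack_2bit(tern)
    print(tern[:2], round(float(gamma), 3), packed.nbytes)  # 64 weights -> 16 bytes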
AMD is already on their second generation of the Versal line.
https://www.amd.com/en/products/accelerators/alveo/v80.html
XDNA Architecture
https://www.amd.com/en/technologies/xdna.html