In a podcast interview[1], Jonathan Ross, now CEO of Groq, talked about the creation of the original TPUs (which he built at Google). Apparently it was originally an FPGA he did in his 20% time because he sat near a team that was having inference speed issues.
They got it working, then Jeff Dean did the math and they decided to do an ASIC.
Now, of course, Google should spin off the TPU team as a separate company. It's the only credible competition Nvidia has, and its software support is second only to Nvidia's.
[1] https://open.spotify.com/episode/0V9kRgNS7Ds6zh3GjdXUAQ?si=q...
The way I see it, Nvidia only has a few advantages, ordered from most important to least:
1. Reserved fab space.
2. Highly integrated software.
3. Hardware architecture that exists today.
4. Customer relationships.
But all of these advantages are weak in one way or another:
For #1: fab space is tight, and Nvidia can strangle its consumer GPU market if it means selling more AI chips at a higher price. This advantage is gone if a competitor makes big bets years in advance, or if another company with a lot of fab space (Intel?) is willing to change priorities.
For #2: life is good when your proprietary software is the industry standard, but whether that actually matters will depend heavily on the use case.
For #3: a benefit now, but not for long. My estimation is that the hardware design for TPUs is fundamentally much simpler than for GPUs: no need for ray tracing, texture samplers, or rasterization; mostly just lots of matrix multiplication and memory. Others moving into the space will be able to catch up quickly.
For #4: useful for staying in the conversation, but in a field hungry for any advantage, the hardware vendor with the highest FLOPS (or equivalent) per dollar is going to win enough customers to saturate its manufacturing capacity.
So overall, I give them a few years, and then the competition is going to get real quite fast.
It seems you have not worked with ML workloads but are basing your comment on "internet wisdom" or, worse, business analysts (I am sorry if that's inaccurate).
On GPUs, ML "just works" (inference and training), and they are always an order of magnitude faster than whatever CPU you have. TPUs work very well for some model architectures (the old ones they were optimized and designed for), but on some novel ones they can actually be slower than a CPU (because of gathers and similar ops). This was my experience working on ML as an ML researcher at Google until 2022; maybe it has gotten better, but I doubt it. Older TPUs were OK only for inference of those specific models and useless for training. And with anything new I tried (a fundamental part of research...), the compiler would sometimes just break with an internal error, most of the time produce terrible and slow code, and bugs filed against it would stay open for years.
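To make the gather point concrete, here's a toy sketch (my own illustration, not real Google code; it assumes JAX is installed and the shapes are invented): a data-dependent lookup into a big table, which is mostly irregular memory traffic rather than the dense matmuls TPUs were built around.

    import jax
    import jax.numpy as jnp

    table = jnp.ones((100_000, 512))  # a big embedding-style table
    idx = jax.random.randint(jax.random.PRNGKey(0), (4096,), 0, 100_000)

    @jax.jit
    def lookup_and_pool(table, idx):
        rows = jnp.take(table, idx, axis=0)  # the gather: data-dependent access
        return rows.mean(axis=0)             # trivial compute relative to the access

    print(lookup_and_pool(table, idx).shape)  # (512,)

On a GPU this is just a memory-bandwidth problem; on the matmul-oriented TPU designs of that era, ops like this were exactly where the compiler tended to fall over.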
A GPU is so much more than a matrix multiplier - it's a fully general, programmable processor. With excellent compilers, but most importantly with low-level access, so you don't need to rely on proprietary compiler engineers (like the TPU ones) and anyone can develop something like Flash Attention. And as a side note: while a Transformer might be mostly matrix multiplication, many other models are not.
Also, it's disingenuous to say "there's only 4 things you need to beat NVIDIA" when each of the 4 is an enormous undertaking.
Not to mention that every not-so-serious, inference-heavy ML developer just wants something that works so they can deliver to the client. That itself is a semi-moat.
It's been talked to death, but non-CUDA implementations have their challenges regardless of use case. That's what first-mover advantage and > 15 years of investment by Nvidia in their overall ecosystem will do for you.
But support for production serving of inference workloads outside of CUDA is universally dismal. This is where I spend most of my time, and compared to CUDA, anything else is non-existent or a non-starter unless you're all-in on packaged, API-driven Google/Amazon/etc. tooling utilizing their TPUs (or whatever). It's the most significant vendor/cloud lock-in I think I've ever seen.
Efficient and high-scale serving of inference workloads is THE thing you need to do to serve customers and actually have a chance at ever making any money. It's shocking to me that Nvidia/CUDA has a complete stranglehold on this obvious use case.
A great summary of how unserious NVIDIA's competitors are is how long it took AMD's flagship consumer/retail GPU, the 7900 XT[X], to gain ROCm support.
That's quite literally unacceptable.
For those who don't know - one year after launch.
Meanwhile Nvidia will go as far as to back port Hopper support to CUDA 11.8 so it "just runs" the day of launch with everything you already have.
If you had worked with ML, you'd know that this is not true. It's actually more like the opposite. It also has nothing to do with the chips themselves. Things don't magically work "because GPU"; they work because manufacturers spend the time getting their drivers and ecosystems right. That's why, for example, no one is using AMD GPUs for ML, despite them offering more compute per dollar on paper. Getting the software stack to the point of Nvidia/CUDA, where things really do "just work", is an enormous undertaking. And as someone who has been researching ML for more than a decade now, I can tell you Nvidia also didn't get these things right in the beginning. That's the reason they have no real competition today (and still won't for quite some time).
Probably bartwr is using "GPUs" to mean NVIDIA GPUs. Seeing as nobody uses AMD GPUs for it, that simplification seems OK.
ML doesn't just work on GPUs. It's not uncommon to have architectures where GPUs don't really work; we just tend not to use those :)
Hey, this is a good comment. I've only toyed with ML stuff, but I've done a lot with GPUs. I hope you find my "step back" perspective as valuable as I find your up-close one.
My chief mistake in the above comment was using "TPU", as that's Google's branding. I probably should've used "AI focused co-processor". I'm not talking exclusively about Google's foray into the space, especially as I haven't used TPUs.
My list of things to ditch on GPUs doesn't include cores. My point there is that there are a bunch of components needed for graphics programming that are entirely pointless for AI workloads, both inside the core's ALU and as larger board components. The hardware components needed for AI seem relatively well understood at this point (though that could change with some other innovation).
Put another way, my point is this: historically, the high-end GPU market was mostly limited to scientific computing, enthusiast gaming, and some varied professional workloads. Nvidia has long been king here, but with relatively little attempt by others at competition. ML was added to that list in the last decade, but with a few exceptions (Google's TPU), the companies that could move into the space haven't. Then ChatGPT happened, investment in AI went crazy, and suddenly Nvidia is one of the most valuable companies in the world.
However, the list of companies that have proven they can make all the essential components (from my list in the grandparent) isn't large, but it's also not just Nvidia. Basically every computing device with a screen has some measure of GPU components, and now everyone is paying attention to AI. So I think within a few years Nvidia's market leadership will be challenged, and they certainly won't be the only supplier of top-of-the-line AI co-processors by the end of the decade. Whether first-mover advantage will keep them in first place, time will tell.
CUDA is absolute shit, segfaults or compiler errors if you look at it wrong.
Nvidia's software is the only reason I'm not using GPUs for ML tasks and likely never will.
That's just C. If you're accessing your arrays out of bounds, it's going to segfault. Hopefully.
Can't blame CUDA for that one.
I'm talking about the compiler segfaulting, not the end-user code.
Skill issue.
No, CUDA's botched gcc implementation segfaulting due to compiler errors during compilation is not a "skill issue".
(Well, a skill issue of whoever is patching gcc on Nvidia's end, I guess.)
Actually their real advantage is the large set of highly optimised CUDA kernels.
This is the thing that lets them outperform AMD chips even on inferior hardware. And the fact that anything new gets written for CUDA first.
There is also OpenAI's Triton language for this, and people are beginning to use it (shout-out to Unsloth here!).
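For anyone who hasn't seen it, this is roughly what Triton buys you: the kernel is written in Python and compiled to GPU code, so the tiling and masking logic is hackable without CUDA C++. A minimal vector-add, close to the one in the Triton tutorials (assumes the triton and torch packages and a CUDA-capable GPU):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        pid = tl.program_id(axis=0)                         # which block am I
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements                         # guard the tail
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)

    x = torch.randn(4096, device="cuda")
    y = torch.randn(4096, device="cuda")
    out = torch.empty_like(x)
    grid = (triton.cdiv(x.numel(), 1024),)
    add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
    assert torch.allclose(out, x + y)

Flash-Attention-style kernels are obviously far more involved, but it's the same idea: the performance-critical loop structure stays in code the user can edit.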
While this is true, it's worth noting that the inference-only Groq chip, which gets 2x-5x better LLM inference performance, is on a 12 nm process.
Honest question: will AI help AMD catch up with optimized CUDA/ROCM kernels of their own?
I’ve spent the last month deep in GPU driver/compiler world and -
AMD or Apple (Metal) or someone (I haven’t tried Intel’s stuff) just needs to have a single guide to installing a driver and compiler that doesn’t segfault if you look at it wrong, and they would sweep the R&D mindshare.
It is insane how bad CUDA is; it’s even more insane how bad their competitors are.
If you work in hardware and are interested in solving this, lemme say this:
There are billions of dollars waiting for the first person to get this right. The only reason I haven’t jumped on this myself is a lack of familiarity with drivers.
These have always been NVIDIA's "few" advantages and yet they've still dominated for years. It's their relentless pace of innovation that is their advantage. They resemble Intel of old, and despite Intel's same "few" advantages, Intel is still dominant in the PC space (even with recent missteps).
They've dominated for years, but now all the big tech companies are using their products at a scale not seen before, and all have a vested interest in cutting Nvidia's margins by introducing some real competition.
Nvidia will do well in the future, but perhaps not well enough to justify its stock price.
NVidia's biggest advantage is that AMD is unwilling to pay for top notch software engineers (and unwilling to pay the corresponding increase in hardware engineer salaries this would entail). If you check online you'll see NVidia pays both hardware and software engineers significantly more than AMD does. This is a cultural/management problem, which AMD's unlikely to overcome in the near-term future. Apple so far seems like the only other hardware company that doesn't underpay its engineers, but Apple's unlikely to release a discrete/stand-alone GPU any time soon.
Don’t underestimate CUDA as the moat. It’s been a decade of sheer dominance with multiple attempts to loosen its grip that haven’t been super fruitful.
I’ll also add that their second moat is Mellanox. They have state-of-the-art interconnect and networking that puts them ahead of competitors who are currently focusing just on the single unit.
Nvidia has so much software behind all of this that your list is a tremendous understatement.
Just the number of internal ML things Nvidia builds helps them tremendously in understanding the market (what the market actually needs).
And they use their inventions themselves.
'only has a few' = 'has a handful that are easy to list but have huge implications which are not easily matched by AMD or Intel right now'
Nvidia's datacenter AI chips don't have ray tracing or rasterization. Heck, for all we know the new Blackwell chip is almost exclusively tensor cores; they gave no numbers for regular CUDA perf.
This is wrong: both AMD and Intel (through Habana) have accelerators comparable to H100s in performance.
Yes, but they don't have the custom kernels that CUDA has. TPUs do have some!
They have Vulkan, which is cross-compatible.
And AMD has ROCm. PyTorch is standard, and PyTorch has ROCm support. And the Google TPU v5 also has PyTorch support.
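For what it's worth, the ROCm build of PyTorch exposes AMD devices through the familiar torch.cuda namespace, so existing code mostly runs unchanged. A minimal sketch (assuming a ROCm build of PyTorch on a supported AMD card; not a claim about speed):

    import torch

    print(torch.version.hip)  # a version string on ROCm builds, None on CUDA builds
    device = "cuda" if torch.cuda.is_available() else "cpu"

    x = torch.randn(1024, 1024, device=device)
    y = x @ x.T               # identical user code on CUDA, ROCm, or CPU
    print(y.device, y.shape)

Whether the kernels behind that matmul are as fast or as stable as the CUDA ones is the real question upthread.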
We do have a couple of H100s, but I'd love to replace them with AMD's.
If AMD fixes or open-sources their proprietary firmware blob[0]. Geohot streamed all weekend on Twitch, reverse-engineering the AMD firmware. It was quite entertaining learning how that low-level firmware works[1], and his rants about AMD, of course.
[0] https://www.phoronix.com/news/Tinybox-Radeon-Again-UMR
[1] https://www.twitch.tv/georgehotz
Geohot doesn't know what he's talking about and I'm kinda ashamed to see this lazy thinking leak onto HN. There was an article a couple weeks back on AMD open sourcing drivers in the Linux kernel tree that you should look into.
Care to explain a bit more? His rant was about the firmware having crashes, not the Linux driver.
Firmware crashes => days-long "open source it and I'll fix it. No? Why does AMD hate its customers?"
I have an appointment and exactly one minute till I have to leave, so apologies for brevity: they can't open-source the full driver because then they'd have to release HDMI-spec stuff that the consortium says they can't. (I don't support any of that; my only intent is to communicate that George isn't really locked in here when he starts casting aspersions or claiming AMD doesn't care.)
Geohot is wrangling with unsupported consumer hardware.
The datacenter stuff is on a different architecture and driver stack. The number-one supercomputer on the Top500 list (Frontier at ORNL) is based on AMD GPUs, and AMD is probably more invested in supporting that.
I work with Frontier and ORNL/OLCF. They have had and continue to have issues with AMD/ROCm but yes, they do of course get excellent support from AMD. The entire team at OLCF is incredible as well (obviously) and they do amazing work.
Frontier certainly has some unique quirks but the documentation is online[0] and most of these quirks are inherent to the kinds of fundamental issues you'll see on any system in the space (SLURM, etc).
However, most of the issues are fundamentally ROCm and you'll run into them on any MIxxx anywhere. I run into them frequently with supported and unsupported consumer gear all the way up.
[0] - https://docs.olcf.ornl.gov/systems/frontier_user_guide.html
I mean, that's kinda Nvidia's whole shtick: anyone can play around synthesizing cat pictures on their gaming GPU, and if they make a breakthrough, the same software will transfer to X-million-dollar supercomputers.
Subscriber-only videos, so nobody can confirm that he did that, nor archive whatever valuable information he released. At least not without paying some money in the next 7-14 days before they're deleted.
https://www.youtube.com/@geohotarchive
But they're far behind in adoption in the AI space, while TPUs have adoption (inside Google) and, on top of that, a very strong software offering (JAX and TF).
There are also Amazon's AWS "Trainium" chips, which are what Anthropic will be using going forward.
If you're talking about training LLMs, involving tens of thousands of processors, then the specifics of one processor vs. another aren't the most important thing - it's the overall architecture and infrastructure in place to manage them.
Given the size of the market and its near-monopoly situation, I strongly think this has the potential to (almost immediately) surpass the Pixel hardware business. But the problem here is that the TPU is a relatively scarce computing resource even inside Google, and it's very likely that Google has a hard time meeting its internal demand...
I’m surprised they sell any to external customers, to be honest.
They don't sell any TPUs, do they? Besides the now-ancient Coral toy TPUs.
Has there been any development? The last update is from 2021 [0], but it is not officially killed by google(.com)
[0] https://coral.ai/news/updates-07-2021
My guess is that the "AI" accelerators in Google Tensor phone chips are based on Coral....
Yes.
But imagine how the company would do: they'd have a guaranteed market at Google for, say, 3 years, and while yes, maybe Google takes 100% of the production in the first 12 months, it's not a bad position to start from.
Plus there are other products they could ship that might not always need to be built on the latest process. I imagine there would be demand for inference-only, earlier-generation TPUs that can run LLMs fast if the power usage is low enough.
Speaking of which, mega props to Groq. They really are awesome: so many startups launch with bullshit and promises, but Groq came to the scene with something awesome already working, which is reason enough to love them. I really respect this company, and I say that extremely rarely.
I wouldn't call it awesome. It's just a big chip with lots of cache. You need hundreds of them to sufficiently load any decent model, at which point the cost has skyrocketed.
There seem to be conflicting reports as to who came up with the TPU: https://mastodon.social/@danluu/109641269333636407
Amazon acquired Annapurna Labs, which is doing the same thing, and they have their own Trainium/Inferentia silicon, and it definitely has more support than Google's.