Posts like this underscore why the smart money is betting on Google as the long-term AI winner. Meta, Microsoft, OpenAI, etc. are trying to address problems with consumer video cards and spending billions trying to outbid each other for Nvidia's favor - while Google is on its 6th generation of custom silicon.
Literally the only thing that can stop Google now is the fact they keep bringing Microsoft and Oracle flunkies into leadership positions.
I think it's likely Nvidia's GPUs, many of which are $50,000+ for a single unit, far surpass Google's custom silicon; otherwise, why wouldn't Google be selling shovels like Nvidia?
If Google had a better chip, or even a chip that was close, they would sell it to anyone and everyone.
From a quick search I can see Google's custom chips are 15x to 30x slower at training AI compared to Nvidia's latest-gen AI-specific GPUs.
They do sell shovels: you can get Google TPUs on Google Cloud.
Exactly, and they are still only about 1/18th as good at training LLMs as an H100.
Maybe they are less than 1/18th the cost, so Google technically has a marginally better unit cost, but I doubt it when you consider the R&D cost. They are less bad at inference, but still much worse than even an A100.
I don't see how you can evaluate better and worse for training without doing so on a cost basis. If it costs less and eventually finishes, then it's better.
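As a toy illustration of that point (the prices and the 1/18th speed ratio from upthread are just placeholders, and it assumes perfectly linear scaling):

    # Toy training-cost comparison. Prices and the speed ratio are assumptions
    # for illustration, not real figures.
    h100_chip_hours = 1_000                  # chip-hours to finish the job on H100s
    tpu_chip_hours  = 18 * h100_chip_hours   # if a TPU were ~1/18th the speed
    h100_price      = 4.00                   # $/chip-hour (assumed)
    tpu_price       = 0.20                   # $/chip-hour (assumed)

    print(h100_chip_hours * h100_price)      # 4000.0
    print(tpu_chip_hours * tpu_price)        # 3600.0 -> slower chip, cheaper run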
This assumes that you can linearly scale up the number of TPUs to get equal performance to Nvidia cards for less cost. Like most things distributed, this is unlikely to be the case.
This is absolutely the case; TPUs scale very well: https://github.com/google/maxtext
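For anyone who hasn't looked at how this is done: a minimal single-host data-parallel training step in JAX, roughly the pattern MaxText applies at much larger scale (the toy linear model, shapes, and learning rate below are my own placeholders, not from the repo):

    import functools
    import jax
    import jax.numpy as jnp

    n_dev = jax.local_device_count()              # e.g. 8 TPU cores on one host

    def loss_fn(params, x, y):
        pred = x @ params                         # toy linear "model"
        return jnp.mean((pred - y) ** 2)

    @functools.partial(jax.pmap, axis_name="batch")   # replicate params, shard the batch
    def train_step(params, x, y):
        grads = jax.grad(loss_fn)(params, x, y)
        grads = jax.lax.pmean(grads, axis_name="batch")  # all-reduce across devices
        return params - 1e-3 * grads

    key = jax.random.PRNGKey(0)
    x = jax.random.normal(key, (n_dev, 1024 // n_dev, 128))   # global batch of 1024
    y = jax.random.normal(key, (n_dev, 1024 // n_dev, 1))
    params = jax.device_put_replicated(jnp.zeros((128, 1)), jax.local_devices())
    params = train_step(params, x, y)

The all-reduce (pmean) is the part that has to scale; whether the interconnect keeps up at thousands of chips is exactly the question being argued here.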
The repo mentions a Karpathy tweet from Jan 2023. Andrej has recently created llm.c, and the same model trained about 32x faster on the same Nvidia hardware mentioned in the tweet. I don't think the performance estimate the repo used (based on that early tweet) was accurate for the performance of the Nvidia hardware itself.
Time is money. You might be a lab with long queues to train, leaving expensive staff twiddling their thumbs.
Also energy cost: 18 chips vs. 1 probably costs a lot more to run.
Google claims the opposite in "TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings": https://arxiv.org/abs/2304.01433
Despite various details, I don't think this is an area where Facebook is very different from Google. Both have terrifying amounts of datacenter to play with. Both have long experience making reliable products out of unreliable subsystems. Both have innovative orchestration and storage stacks. Meta hasn't published much, if anything, about things like reconfigurable optical switches, but that doesn't mean they don't have such a thing.
If you're interested in a peer-reviewed scientific comparison, Google writes retrospective papers after contemporary TPUs and GPUs are deployed, rather than speculating about future products. The most recent compares TPU v4 and the A100. (TPU v5 vs. H100 is for a future paper.) Here is a quote from the abstract:
"Deployed since 2020, TPU v4 outperforms TPU v3 by 2.1x and improves performance/Watt by 2.7x. ... For similar sized systems, it is ~4.3x--4.5x faster than the Graphcore IPU Bow and is 1.2x--1.7x faster and uses 1.3x--1.9x less power than the Nvidia A100. TPU v4s inside the energy-optimized warehouse scale computers of Google Cloud use ~2--6x less energy and produce ~20x less CO2e than contemporary DSAs in typical on-premise data centers."
Here is a link to the paper: https://dl.acm.org/doi/pdf/10.1145/3579371.3589350
That quote is referring to the A100... the H100 used ~75% more power to deliver "up to 9x faster AI training and up to 30x faster AI inference speedups on large language models compared to the prior generation A100."[0]
Which sure makes the H100 sound both faster and more efficient (per unit of compute) than the TPU v4, given what was in your quote. I don't think your quote does anything to support the position that TPUs are noticeably better than Nvidia's offerings for this task.
Complicating this is that the TPU v5 generation has already come out, and the Nvidia B100 generation is due within a couple of months. (So, no, a comparison of TPUv5 to H100 isn't for a future paper... that future paper should be comparing TPUv5 to B100, not H100.)
[0]: https://developer.nvidia.com/blog/nvidia-hopper-architecture...
Given that Google invented the Transformer architecture (and Google AI continues to do foundational R&D on ML architecture) — and that Google's TPUs don't even support the most common ML frameworks natively, but require their own training and inference tooling — I would assume that "the point" of TPUs, from Google's perspective, has less to do with running LLMs and more to do with running weird experimental custom model architectures that don't even exist as journal papers yet.
I would bet money that TPUs are at least better at doing AI research than anything Nvidia will sell you. That alone might be enough for Google to keep getting some new ones fabbed each year. The TPUs you can rent on Google Cloud might very well just be hardware requisitioned by the AI team, for the AI team, that they aren't always using to capacity, and so is "earning out" its CapEx through public rentals.
TPUs are maybe also better at other things Google does internally, too. Running inference on YouTube's audio+video-input timecoded-captions-output model, say.
Wouldn't that be renting a shovel vs selling a shovel?
NVIDIA sells subscriptions...
I'm only aware of Nvidia AI Enterprise and that isn't required to run the GPU.
I think it's aimed at medium to large corporations.
Massive corporations such as Meta and OpenAI would build their own cloud and not rely on this.
The GPU really is a shovel, and can be used without any subscription.
Don't get me wrong, I want there to be competition with Nvidia, I want more access for open source and small players to run and train AI on competitive hardware at our own sites.
But no one is competing, no one has any idea what they're doing. Nvidia has no competition whatsoever, no one is even close.
This lets Nvidia get away with adding more VRAM onto an AI-specific GPU and increasing the price by 10x.
This lets Nvidia remove NVLink from current gen consumer cards like the 4090.
This lets Nvidia use their driver license terms to prevent cloud platforms from offering consumer cards as a choice in datacenters.
If Nvidia had a shred of competition things would be much better.
I'm not sure why you're getting downvoted. It's very clear that Nvidia is moving towards directly offering cloud services[0].
[0] - https://www.nvidia.com/en-us/data-center/dgx-cloud/
We have almost 400 H100s sitting idle. I wonder how many other companies are buying millions of dollars' worth of these chips in the hope of using them, only to leave them underutilized?
Hello! If you're interested in monetizing those GPUs, I'd be happy to rent them (all 400!) and offer those to customers of the cloud I work at :)
jonathan [at] tensordock.com
If you want 512 H100s connected with infiniband: https://lambdalabs.com/service/gpu-cloud/1-click-clusters
Have you considered sponsoring an open-source project? ;)
Wow, that's a lot of money in inventory. What was the original thought process? Just fomo?
That's insane and incredible all at the same time.
Probably could profit selling them second hand honestly.
Looking to get rid of a few?....
The world would love to buy time on your idle H100s if you're selling.
So you're saying, H100s are the corporate equivalent of Raspberry Pis now? Bought to not miss out, then left to gather dust in a drawer?
I know of many important projects that need GPUs right now and aren’t getting any. You could help motivate the ponydiffusion folks to actually try finetuning of SD3!
They do sell them - but through their struggling cloud business. Either way, Nvidia's margin is Google's opportunity to lower costs.
TPUs are designed for inference, not training - they're betting that they can serve models to the world at a lower cost structure than their competition. The compute required for inference to serve their billions of customers is far greater than the training cost of the models - even LLMs. They've been running model inference as part of production traffic for years.
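Back-of-envelope on that claim, using the usual ~6*N*D training and ~2*N per-token inference FLOP rules of thumb (the model size, token counts, and query volume below are assumptions, not Google's numbers):

    # Rough FLOP comparison: one-off training vs. ongoing serving (assumed numbers).
    N = 70e9                      # parameters (assumed)
    D = 2e12                      # training tokens (assumed)
    train_flops = 6 * N * D       # ~8.4e23 FLOPs, paid once

    queries_per_day = 1e9         # assumed serving load
    tokens_per_query = 500
    serve_flops_per_day = 2 * N * queries_per_day * tokens_per_query  # ~7e22 per day

    print(train_flops / serve_flops_per_day)  # ~12 days of serving equals the training run

Under those made-up numbers, cumulative inference compute overtakes the training run in under two weeks, which is the argument for optimizing the serving fleet.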
This breaks my brain, because I know Google trains its models on TPUs and they're seen as faster, and if they're better at inference and can train, then why is Nvidia in a unique position? My understanding was always that it's as simple as TPUs requiring esoteric tooling.
Because people generally don’t use TPUs outside of Google. The tooling is different, the access is metered through GCP, etc.
Nvidia is in a vaguely unique position in that their products have great tooling support and few companies sell silicon at their scale.
Correct - I'm politely pointing out that that's in conflict with the person I'm replying to.
Multiple types of TPUs.
(I work for Google, but the above is public information.)
Possibly naive, but I very much view CUDA and its integration into ML frameworks as being Nvidia's moat.
Google most certainly uses TPUs for training.
Intel, AMD and others also have chips for training that perform close to, or sometimes better than, Nvidia's. These are already in the market. Two problems: the CUDA moat, and "no one gets fired for buying green".
That’s not necessarily true. Many companies make chips they won’t sell to support lucrative, proprietary offerings. Mainframe processors are the classic example.
In AI, Google (TPU) and Intel (Gaudi) each have chips they push in cloud offerings. The cloud offerings have cross selling opportunities. That by itself would be a reason to keep it internal at their scale. It might also be easier to support one, or a small set, of deployments that are internal vs the variety that external customers would use.
Is that why Apple sells their chips to everyone??
While I do not actually think Google's chips are better or close to being better, I don't think this actually holds?
If the upside of <better chip> is effectively unbounded, it would outweigh the short term benefit of selling them to others, I would think. At least for a company like Google.
Nvidia has decades of experience selling hardware to people, with all the pains that entails: support, sales channels, customer acquisition, software. It's something you don't just do overnight, and it does cost money. Google's TPUs get some of their cost efficiency from not supporting COTS use cases and the overhead of selling to people, and the total wall-clock time has to also include the total operational costs, which dominate at their size (e.g. if it's 30x slower but 1/50th the TCO, then it's a win; I don't know how TPUv5 stacks up against the B200). It's not as simple as "just put it on a shelf and sell it and make a gajillion dollars like Nvidia".
TPU v5p is ~2 times slower than H100 at large(ish)-scale training (order of 10k chips) [1]. And they already have v6 [2]. I think it's safe to say that they are fairly close to Nvidia in terms of performance.
[1] https://mlcommons.org/benchmarks/training/
[2] https://cloud.google.com/blog/products/compute/introducing-t...
Except Microsoft is making their own chips as well?
https://www.theverge.com/2023/11/15/23960345/microsoft-cpu-g...
and so is Meta: https://ai.meta.com/blog/next-generation-meta-training-infer...
Ever since Apple did it everyone has leaped on board. Let's see how things pan out for everyone...
Google introduced their first TPU in 2015...? Long before Apple taped out their first silicon.
Apple’s first in-house designed chip was the A4 in 2010.
If we're talking custom silicon, Google acquired Motorola in 2011, and Apple acquired PA Semi in 2008.
The idea is obvious to everybody in the industry, it's a question of money and motivation.
And that's why $ARM is a good buy. Selling swords and steel to all these armies as they go to war.
ARM is just collecting royalties in this space. Their reference designs aren't exactly competitive.
I'm not saying it's a bad buy, either, but if and when they turn the screws, there will be a mass exodus. Solid long term play perhaps, but not going to see Nvidia like price action.
H100s are far from consumer video cards
Yeah, OP's comment makes it seem like they are building racks of RTX 4090s, when this isn't remotely true. Tensor Core performance is far different on the data-center-class devices vs. consumer ones.
They are building racks of 4090s. Nobody can get H100s in any reasonable volume.
Hell, Microsoft is renting GPUs from Oracle Cloud to get enough capacity to run Bing.
Who is "they"?
RTX 4090s are terrible for this task. Off the top of my head:
- VRAM (obviously). Isn't that where the racks come in? Not really. Nvidia famously removed something as basic as NVLink between two cards going from the 3090 to the 4090. When it comes to bandwidth between cards (crucial), even 16 lanes of PCIe 4 isn't fast enough. When you start talking about "racks", unless you're running on server-grade CPUs (contributing to cost vs. power vs. density vs. perf), you're not going to have nearly enough PCIe lanes to get very far. Even P2P over PCIe requires a hack geohot developed[0], and needless to say that's, umm, less than confidence-inspiring for what you would lay out ($$$) in terms of hardware, space, cooling, and power. The lack of ECC is a real issue as well.
- Form factor. Remember PCIe lanes, etc.? The RTX 4090 is a ~three-slot beast when using air cooling, and needless to say rigging up something like the dual-slot water-cooled 4090s I have at scale is another challenge altogether... How are people going to wire this up? What do the enclosures/racks/etc. look like? This isn't like crypto mining, where cheap 1x PCIe risers can be used without dramatically limiting performance to the point of uselessness.
- Performance. As the grandparent comment noted, 4090s are not designed for this workload. In typical usage for training I see them as 10-20% faster than an RTX 3090 at much higher cost. Compared to my H100 with SXM, it's ridiculously slow.
- Market segmentation. Nvidia really knows what they're doing here... There are all kinds of limitations you run into with how the hardware is designed (like Tensor Core performance for inference especially).
- Issues at scale. Look at the Meta post - their biggest issues are things that are dramatically worse with consumer cards like the RTX 4090, especially when you're running with some kind of goofy PCIe cabling issue (like risers).
- Power. No matter what power limiting you employ, an RTX 4090 has a pretty bad power/performance ratio for this. The card isn't fundamentally designed for these tasks - it's designed to run screaming for a few hours a day so gamers can push as many FPS at high res as possible. Training, inference, etc. are a different beast, and the performance-vs-power ratio for these tasks is terrible compared to an A/H100. Now let's talk about the physical cabling, PSU, etc. issues. Yes, miners had hacks for this as well, but it's yet another issue.
- Fan design. There isn't a single "blower" style RTX 4090 on the market. There was a dual-slot RTX 3090 at one point (I have a bunch of them) but Nvidia made Gigabyte pull them from the market because people were using them for this. Figuring out some kind of air-cooling setup with the fan and cooling design of the available RTX 4090 cards sounds like a complete nightmare...
- Licensing issues. Again, laying out the $$$ for this with a deployment that almost certainly violates the Nvidia EULA is a risky investment.
Three RTX 4090s (at 9 slots) to get "only" 72GB of VRAM, talking over PCIe, using 48 PCIe lanes, going multi-node over sloooow ethernet (hitting the CPU - slower and yet more power), drawing what likely ends up at ~900 watts (power limited) for significantly reduced throughput and less VRAM is ridiculous. Scaling the kind of ethernet you need for this (100 gig) comes at a very high per-port cost, and due to all of these issues the performance would still be terrible.
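Back-of-envelope on the interconnect alone (nominal per-direction link rates; the 7B-parameter model and bf16 gradients are just an illustrative assumption):

    # Time to move one full set of gradients over the link, ignoring all-reduce tricks.
    pcie4_x16_gbs   = 32          # GB/s per direction, PCIe 4.0 x16
    nvlink_h100_gbs = 450         # GB/s per direction, H100 SXM NVLink

    grad_gb = 2 * 7e9 / 1e9       # 7B params in bf16 -> ~14 GB of gradients

    print(grad_gb / pcie4_x16_gbs)    # ~0.44 s per exchange over PCIe
    print(grad_gb / nvlink_h100_gbs)  # ~0.03 s over NVLink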
I'm all for creativity but deploying "racks" of 4090s for AI tasks is (frankly) flat-out stupid.
[0] - https://github.com/tinygrad/open-gpu-kernel-modules
You seem to be trapped in the delusion that this was anyone's first, second, or third choice.
There is workload demand, you can't get H100s, and if you don't start racking up the cards you can get, the company will replace you with someone less opinionated.
> The RTX 4090 is a ~three slot beast when using air cooling and needless to say rigging up something like the dual slot water cooled 4090s I have at scale is another challenge altogether... How are people going to wire this up? What do the enclosures/racks/etc look like?
A few years ago, if you wanted a lot of GPU power you would buy something like [1] - a 4/5U server with space for ten dual-slot PCIe x16 cards and quadruple power supplies for 2000W of fully redundant power. And not a PCIe riser in sight.
I share your scepticism about whether it's common to run >2 4090s, because Nvidia has indeed sought to make it difficult.
But if there was some sort of supply chain issue that meant you had to, and you had plenty of cash to make it happen? It could probably be done.
Some of the more value-oriented GPU cloud suppliers like RunPod offer servers with multiple 4090s and I assume those do something along these lines. With 21 slots in the backplane, you could probably fit 6 air-cooled three-slot GPUs, even if you weren't resorting to water cooling.
[1] https://www.supermicro.com/en/products/system/4U/4028/SYS-40...
There are apparently some 400 H100s sitting idle somewhere upthread. Yes, I'm having a hard time imagining how that's possible, too.
Is no one else working on custom silicon?
The problem isn't just developing your own processor. Nvidia has a huge stack of pretty cutting-edge technology, including a huge pile of IP from Mellanox, an enormous OSS toolchain around CUDA, etc., that people seeking to make comparable products have to overcome.
Are you suggesting Meta or Google - who stand to save billions - won't be able to get top performance from their custom chips because their tooling/hardware won't support CUDA?
No. I'm suggesting they won't because IP like the Mellanox treasure chest they acquired is ridiculously difficult to develop, and Nvidia has aggressively exploited it, along with their other already advanced IP in the space of their -core business-.
I understand, especially amongst googlers, there's a belief that there are no others smarter than a googler. But it's simply not the case. Nvidia is excellent at their core competencies and business, which is making absurdly parallel compute platforms with absurdly powerful interconnects. I'm saying Google or Meta won't beat Nvidia at hardware. I'd also point to the fact that Nvidia's ability to raise capital is the best on earth now, so even money isn't a barrier.
The advantage CUDA gives is in the tool chains, libraries, research, and all that that tens of thousands of people are contributing to as part of their jobs, research, and hobbies. This is almost -more valuable- than getting top performance. Getting top techniques, top software, top everything by having everyone everywhere working to make the ecosystem of your stuff is invaluable. Google won’t have that. They will just have the hubris of googlers who believe they’re smarter.
I would also note that at this phase of a cycle in tech, trying to save billions takes your eye off the prize. Cost optimization comes much later, after the market has been fully explored, directions are clear, and diminishing returns on R&D kick in. Any company that doesn't recognize that is run by CPAs and deserves the ignominy they'll face.
The bar for success for Google and Meta is much lower than for Nvidia - at least for internal usage. Any dollar amount that Google saves on CapEx or OpEx by using custom silicon instead of buying Nvidia helps bring down the cost of revenue. They don't have to match Nvidia on raw performance, and can aim at being better at performance per watt or performance per dollar (TCO) for larger workloads - and IIRC, Google is already doing this for some internal inference tasks.
Big Tech companies are conglomerate-ish and can multitask. The search engine folk aren't pushing stuff back onto the backlog to put out fires delaying chip tape-out, and I bet the respective CEOs aren't burning braincycles micromanaging silicon development either; directors 2-3 rungs below the C-suite can motivate and execute on such an undertaking. The answer to "I need a budget of $300M in order to save the company $5-15B over 3 years" is "How soon can you start?"
For training Llama3, Facebook set up two clusters, one using fancy InfiniBand and one just using RoCE over Arista switches: https://engineering.fb.com/2024/03/12/data-center-engineerin... . The latter ended up doing fine, suggesting that all that Mellanox stuff isn't necessary for large-scale training (apparently at a large enough scale, Ethernet scales better than InfiniBand).
Everyone is.
Apple, AWS, Google, Meta, Microsoft all have custom AI-centric silicon.
Do you really think Google’s hardware expertise is better than Nvidia’s?
If needed these other companies have the $$$ to buy the best chips money can buy from Nvidia. Better chips than Google could ever produce.
If anything, this is why IMO Google will fail.
Yes. Google was building custom HPC hardware 5-8 years before Nvidia decided to expand outside the consumer and "workstation" markets.
Nvidia acquired Mellanox who know far more about custom HPC hardware than Google.
Mellanox and friends not being able to build fast enough switching gear at a reasonable price point was what got Google into the hardware game in the first place ;)
I thought NVIDIA's moat was mostly software/CUDA?
They have by far the fastest chip, the best software with CUDA, and are feverishly working on next-gen chips. They also bought out multiple years' worth of global high-bandwidth memory manufacturing capacity.
No one will beat them at their game. However, if there are any major breakthroughs that render those processing capacities unneeded, or if the major players hit a wall on AI spending, then they will take a massive hit. It will come eventually, because the chip business is always in boom/bust cycles.
Much the same way you can have all the best gear and still fail - Google’s primary strength seems to be the DeepMind group. I’m not affiliated with Google, but IMHO the reason they will slowly die is because their engineering culture has taken a backseat due to their broken hiring practices.
Bad hiring practices aren’t exclusive to them, but from all accounts it seems like their internal focus is on optimizing ad revenue over everything else. I could be wrong or misinformed, but it seems to me like they are playing the finite game in the AI space (DeepMind group aside) while FAIR are playing the infinite game.
*meanwhile MSFT are simply trying to buy their way to relevance (e.g. OpenAI investments, etc) and carve out future revenues (Recall) and Jobs-less Apple is building their trademark walled-garden (AppleIntelligence?). Although the use of unified memory in Apple silicon poses some interesting possibilities for enabling the use of sizable models on consumer hardware.
Overall it seems like "big tech" is by and large uninspired and asleep at the wheel, save specific teams like those led by LeCun, Hassabis, etc. Not sure where that leaves OpenAI now that Karpathy is gone.
What company do you think has better hiring practices, and subsequently a higher talent pool? Meta's process is pretty similar to Google's (though with an emphasis on speed over creativity). Microsoft is certainly worse at hiring than the two aforementioned...
To be fair, I don’t have any examples of “good practices” readily in-hand. However, I did try to address why I thought others were less impacted by this problem in the second half of my post.
Can't agree. This is like saying $popularApp will fail because they buy expensive hosting at AWS.
Rubbish. They will fail because the product didn't fit the market; if they're successful, they'll have money to buy servers and colo, then drive down cost. If they succeed, it will be in large part because they spent their capital and, more importantly, their time on code/engineers rather than servers.
Right now companies are searching for a use of AI that will add hundreds of billions to their market cap. Once they find that, they can make TPUs; right now only one thing matters: getting there first.
For any given mobile app startup, AWS is effectively infinite. The more money you throw at it the more doodads you get back. Nvidia's supply chain is not infinite and is the bottle neck for all the non-Google players to fight over.
If you are training on AWS, it's not infinite. Worse still, you are bidding against other people.
The only thing that can stop Google is Google. Somehow every bet that isn't Search doesn't pan out. And inexplicably, they're working hard to kill Search now. As a shareholder, I hope they succeed. But I am more pessimistic about it than you.
And they missed multiple waves that effectively built on their own in-house research.
Reminds me of Xerox PARC vs. Apple. Building successful products is hard.
Don’t forget Apple’s Private Compute Cloud - built on top of Apple Silicon.
Models trained on Google TPUs according to Reuters [0]. Does anyone know the "technical document" the news article references?
[0] https://www.reuters.com/technology/artificial-intelligence/h...
https://machinelearning.apple.com/research/introducing-apple...
Their work on protein folding and AI could very well be a business worth hundreds of billions, and they know it, as could many of their other bets.
Whereas others are playing the LLM race to the bottom.
And we'll be writing case studies of how they squandered billions of R&D to help found other companies (kinda like Xerox PARC).
Almost every interesting paper after transformers has had its authors leave to commercialize the work at their own companies.
After everything I've seen, and the litigation coming out of Europe, I really can't see AI lasting long after they're obligated to prove they have rights to the data they're training on.
They can't get away with having scraped people's own work forever. You can't steal things from workers and then undercut them by selling that hard work for pennies, and not expect everything to collapse. I mean, I know that the folks in charge of this aren't really known for their foresight, especially when stock numbers and venture capital are the entire point, but... I surely hope people can recognize that this can't go on unimpeded.
Eventually they're going to put vision LLMs in robotic bodies and they'll be able to learn just by listening and watching, just like humans, at which stage the idea that they're "stealing" just by viewing content will be seen as absurd.
How do TPUs perform compared to GPUs on LLMs and image generation?
Pretty well. Anthropic runs some of Claude inference on GKE and TPU v5e - this talk at the last GCP conference has some details:
https://youtu.be/b87I1plPeMg?si=T4XSFUzXG8BwpphR
Ecosystem support for GPU is very strong and so smaller teams might find TPUs to have a steeper learning curve. Sophisticated users can definitely get wins today on TPU if they can get capacity.
And it's not as if Nvidia is standing still, so how the strengths of each change over future generations isn't set either. I.e., TPUs are "simple" matrix multipliers and also optimized for operation at scale, while GPUs are more general-purpose and have strong ecosystem power.
Disclaimer: I work on GKE, enabling AI workloads.
What would any company as "the long term AI winner" look like? What would it mean to be the winner in this context?
The winner is just Nvidia. I see this like the battle of gas vs. electric cars: Nvidia is basically making the wheels, and whichever company wins, you'd still need wheels.
Custom silicon is fantastic when things have stabilised and you know exactly what you want. But things are still evolving fast in real time, and in that environment, whoever can move fastest, stay ultra flexible, and deploy the latest architecture as soon as possible is the winner. In a nutshell, I think that is the story of Nvidia's success here: they created a GPGPU platform with just the right level of abstraction to capture the market for AI research.
None of these companies are using consumer video cards. https://www.nvidia.com/en-us/data-center/h200/
What can stop Google is building the wrong thing, or being so scared to launch anything that they smother their own fledgling products before they are born or before they can mature. Their product and finance and accounting teams should be tossed.
^ How to pack as many mistakes as possible into one single comment.
1. Google's stock didn't significantly outperform Meta, Microsoft, etc., in the past two years.
2. Meta and Microsoft are trying to make their own chips as well.
3. They're not using "consumer video cards" to train AI. I don't even know if you can call these beasts video cards any more. The H100 doesn't have an HDMI port.
I wish I had your faith in Google’s ability to refrain from kneecapping their own perfectly fine product.
lolwat.
Google is the biggest loser in all of this.
This was the same argument presented a decade ago for why Google was supposed to win the cloud, because their internal infra was miles ahead of Amazon's and Microsoft's.
Yet here we are. Will the consumer video cards get cheaper and better faster or will Google's directors' infighting stop first?
You mean the thing that's already stopped them? If they had seriously invested into the TPU ecosystem in 2015, they would already have "won" AI.
Google has been working on TPUs and TensorFlow for a decade with pretty mixed success; I don't think it's clear that they're going to win.
Google’s on their 6th generation and still can’t find anyone to use it. Hmm.
Google is old like MetLife, relative to each's respective industry. Both are carrying too much baggage and are top heavy. As a result, I personally don't think Google will be able to keep pace with OpenAI in the long run.
Smart money has a diversified portfolio and isn't betting on any one winner; it has invested in all of them, and then some.
That's how all huge tech companies become dinosaurs though. Upper management that is already stupidly wealthy (and therefore unmotivated) has the funding and patience to hire geniuses to build incredible machines, and then constantly ties their shoelaces together while asking them to sprint. Examples include Microsoft and Oracle as you said, and before them IBM, AT&T, TIBCO, Marvell, Motorola - I could go on for a while…
I would call it "stupid" money. This isn't a commodity business. Value of the final product is orthogonal to amount invested in compute. If Google is 10% slower or its product is 10% worse, it can lose all the value. This is like valuing a software company higher because its devs are using cheap PC desktops instead of Mac.
Laughable response when you actually look at the quality of the algorithms being produced by Google. They’re so behind it’s embarrassing.
Here is an older take on this same topic:
https://www.yitay.net/blog/training-great-llms-entirely-from...
GPU vs. TPU, and good software managing large clusters of them across all sorts of failures.
The funny bit from the above article is the incident where someone forgot about a training job at Google, and a month later had the model fully trained without an alert of any kind. "Outrageously good infra."
Yes, but you are buying access to tested, supported units that are proven to work, don't require custom software, and are almost plug-and-go. When it's time to upgrade, it's not that costly.
Designing, fabricating and deploying your own silicon is expensive; creating software support for it is more expense still. Then there is the opportunity cost of having to optimise the software stack yourself.
You're exchanging a large capex for a similar-sized capex plus a fuckton of opex as well.
"Consumer video cards"? Meta's not building their clusters out of 3090s.
They're using advanced cards meant for data centers and machine learning -- effectively "custom silicon".
The chips are somewhat irrelevant. It's the overall system architecture, management, and fault recovery that matters.
Until the lion's share of AI projects support that custom silicon, I will continue to bet on anyone buying Nvidia GPUs.