
How Meta trains large language models at scale

mike_d
109 replies
17h7m

Posts like this underscore why the smart money is betting on Google as the long-term AI winner. Meta, Microsoft, OpenAI, etc. are trying to address problems with consumer video cards and spending billions to try and outbid each other to win Nvidia's favor - while Google is on their 6th generation of custom silicon.

Literally the only thing that can stop Google now is the fact they keep bringing Microsoft and Oracle flunkies into leadership positions.

throwaway_ab
40 replies
16h43m

I think it's likely Nvidia's GPUs, many of which are $50,000+ for a single unit, far surpass Google's custom silicon; otherwise, why wouldn't Google be selling shovels like Nvidia?

If Google had a better chip, or even a chip that was close, they would sell it to anyone and everyone.

From a quick search I can see Google's custom chips are 15x to 30x slower at training AI compared to Nvidia's current latest-gen AI-specific GPUs.

candiddevmike
15 replies
16h38m

They do sell shovels, you can get Google TPUs on Google Cloud.

matt-p
10 replies
16h29m

Exactly, and they are still only about 1/18th as good at training LLMs as an H100.

Maybe they are less than 1/18th the cost, so Google technically has a marginally better unit cost, but I doubt it when you consider the R&D cost. They are less bad at inference, but still much worse than even an A100.

jeffbee
4 replies
16h17m

I don't see how you can evaluate better and worse for training without doing so on a cost basis. If it costs less and eventually finishes, then it's better.

tmostak
2 replies
15h11m

This assumes that you can linearly scale up the number of TPUs to get equal performance to Nvidia cards for less cost. Like most things distributed, this is unlikely to be the case.

pama
0 replies
11h37m

The repo mentions a Karpathy tweet from Jan 2023. Andrej has recently created llm.c, and the same model trained about 32x faster on the same Nvidia hardware mentioned in the tweet. I don't think the performance estimate that the repo used (based on that early tweet) was accurate for the performance of the Nvidia hardware itself.

fbdab103
0 replies
15h6m

Time is money. You might be a lab with long queues to train, leaving expensive staff twiddling their thumbs.

blharr
1 replies
16h16m

Also energy cost: 18 chips vs. 1, it's probably costing a lot more to run 18.

jeffbee
0 replies
16h11m

Google claims the opposite in "TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings" https://arxiv.org/abs/2304.01433

Despite various details I don't think that this is an area where Facebook is very different from Google. Both have terrifying amounts of datacenter to play with. Both have long experience making reliable products out of unreliable subsystems. Both have innovative orchestration and storage stacks. Meta hasn't published much or anything about things like reconfigurable optical switches, but that doesn't mean they don't have such a thing.

UCBdaPatterson
1 replies
4h38m

If you're interested in a peer-reviewed scientific comparison, Google writes retrospective papers after contemporary TPUs and GPUs are deployed, rather than speculating about future products. The most recent compares TPU v4 and A100. (TPU v5 vs. H100 is for a future paper.) Here is a quote from the abstract:

"Deployed since 2020, TPU v4 outperforms TPU v3 by 2.1x and improves performance/Watt by 2.7x. ... For similar sized systems, it is ~4.3x--4.5x faster than the Graphcore IPU Bow and is 1.2x--1.7x faster and uses 1.3x--1.9x less power than the Nvidia A100. TPU v4s inside the energy-optimized warehouse scale computers of Google Cloud use ~2--6x less energy and produce ~20x less CO2e than contemporary DSAs in typical on-premise data centers."

Here is a link to the paper: https://dl.acm.org/doi/pdf/10.1145/3579371.3589350

coder543
0 replies
3h57m

That quote is referring to the A100... the H100 used ~75% more power to deliver "up to 9x faster AI training and up to 30x faster AI inference speedups on large language models compared to the prior generation A100."[0]

Which sure makes the H100 sound both faster and more efficient (per unit of compute) than the TPU v4, given what was in your quote. I don't think your quote does anything to support the position that TPUs are noticeably better than Nvidia's offerings for this task.

Complicating this is that the TPU v5 generation has already come out, and the Nvidia B100 generation is imminent within a couple of months. (So, no, a comparison of TPUv5 to H100 isn't for a future paper... that future paper should be comparing TPUv5 to B100, not H100.)

[0]: https://developer.nvidia.com/blog/nvidia-hopper-architecture...

derefr
0 replies
14h55m

Given that Google invented the Transformer architecture (and Google AI continues to do foundational R&D on ML architecture) — and that Google's TPUs don't even support the most common ML standards, but require their own training and inference frameworks — I would assume that "the point" of TPUs, from Google's perspective, has less to do with running LLMs and more to do with running weird experimental custom model architectures that don't even exist as journal papers yet.

I would bet money that TPUs are at least better at doing AI research than anything Nvidia will sell you. That alone might be enough for Google to keep getting some new ones fabbed each year. The TPUs you can rent on Google Cloud might very well just be hardware requisitioned by the AI team, for the AI team, that they aren't always using to capacity, and so is "earning out" its CapEx through public rentals.

TPUs are maybe also better at other things Google does internally, too. Running inference on YouTube's audio+video-input timecoded-captions-output model, say.

throwaway_ab
3 replies
16h28m

Wouldn't that be renting a shovel vs selling a shovel?

candiddevmike
2 replies
16h26m

NVIDIA sells subscriptions...

throwaway_ab
0 replies
16h15m

I'm only aware of Nvidia AI Enterprise and that isn't required to run the GPU.

I think it's aimed at medium to large corporations.

Massive corporations such as Meta and OpenAI would build their own cloud and not rely on this.

The GPU really is a shovel, and can be used without any subscription.

Don't get me wrong, I want there to be competition with Nvidia, I want more access for open source and small players to run and train AI on competitive hardware at our own sites.

But no one is competing, no one has any idea what they're doing. Nvidia has no competition whatsoever, no one is even close.

This lets Nvidia get away with adding more vram onto an AI specific GPU and increase the price by 10x.

This lets Nvidia remove NVLink from current gen consumer cards like the 4090.

This lets Nvidia use their driver licence to prevent cloud platforms from offering consumer cards as a choice in datacenters.

If Nvidia had a shred of competition things would be much better.

bluedino
10 replies
15h33m

We have almost 400 H100s sitting idle. I wonder how many other companies are buying millions of dollars' worth of these chips with the hope of using them, only for them to sit unutilized?

jonathanlei
1 replies
14h23m

Hello! If you're interested in monetizing those GPUs, I'd be happy to rent them (all 400!) and offer those to customers of the cloud I work at :)

jonathan [at] tensordock.com

radq
0 replies
14h43m

Have you considered sponsoring an open-source project? ;)

newswasboring
0 replies
15h23m

Wow, that's a lot of money in inventory. What was the original thought process? Just fomo?

irjustin
0 replies
15h29m

That's insane and incredible all at the same time.

giancarlostoro
0 replies
15h1m

Probably could profit selling them second hand honestly.

gfosco
0 replies
15h18m

Looking to get rid of a few?....

fragmede
0 replies
15h14m

the world would love to buy time on your idle H100s if you're selling.

TeMPOraL
0 replies
12h4m

So you're saying, H100s are the corporate equivalent of Raspberry Pis now? Bought to not miss out, then left to gather dust in a drawer?

Der_Einzige
0 replies
13h31m

I know of many important projects that need GPUs right now and aren't getting any. You could help motivate the ponydiffusion folks to actually try fine-tuning SD3!

vineyardmike
6 replies
13h59m

> why wouldn't Google be selling shovels

They do sell them - but through their struggling cloud business. Either way, Nvidia's margin is Google's opportunity to lower costs.

> I can see Google's custom chips are 15x to 30x slower to train AI

TPUs are designed for inference, not training - they're betting that they can serve models to the world at a lower cost structure than their competition. The compute required for inference to serve their billions of customers is far greater than training costs for models - even LLMs. They've been running model inference as a part of production traffic for years.

refulgentis
4 replies
13h10m

This breaks my brain, because I know Google trains its models on TPUs and they're seen as faster, and if they're better at inference, and can train, then why is Nvidia in a unique position? My understanding was always that it's as simple as TPUs requiring esoteric tooling.

vineyardmike
1 replies
3h6m

Because people generally don’t use TPUs outside of Google. The tooling is different, the access is metered through GCP, etc.

Nvidia is in a vaguely unique position in that their products have great tooling support and few companies sell silicon at their scale.

refulgentis
0 replies
1h23m

Correct, I'm pointing out politely that's in conflict with the person I'm replying to.

smueller1234
0 replies
13h4m

Multiple types of TPUs.

(I work for Google, but the above is public information.)

koe123
0 replies
1h39m

Possibly naive, but I very much view CUDA and its integration into ML frameworks as being Nvidia's moat.

uluyol
0 replies
13h10m

Google most certainly uses TPUs for training.

xipix
0 replies
10h13m

Intel, AMD, and others also have chips for training that perform close to or sometimes better than Nvidia's. These are already in the market. Two problems: the CUDA moat, and "no one gets fired for buying green".

nickpsecurity
0 replies
32m

That’s not necessarily true. Many companies make chips they won’t sell to support lucrative, proprietary offerings. Mainframe processors are the classic example.

In AI, Google (TPU) and Intel (Gaudi) each have chips they push in cloud offerings. The cloud offerings have cross selling opportunities. That by itself would be a reason to keep it internal at their scale. It might also be easier to support one, or a small set, of deployments that are internal vs the variety that external customers would use.

megablast
0 replies
9h58m

Is that why Apple sells their chips to everyone??

girvo
0 replies
12h5m

> If Google had a better chip, or even a chip that was close, they would sell it to anyone and everyone.

While I do not actually think Google's chips are better or close to being better, I don't think this actually holds?

If the upside of <better chip> is effectively unbounded, it would outweigh the short term benefit of selling them to others, I would think. At least for a company like Google.

aseipp
0 replies
16h12m

Nvidia has decades of experience selling hardware to people, with all the pains that entails: support, sales channels, customer acquisition, software. It's something you don't just do overnight, and it does cost money. Google's TPUs get some of their cost efficiency from not supporting COTS use cases and not bearing the overhead of selling to people, and the total wall-clock time also has to include the total operational costs, which dominate at their size (e.g. if it's 30x slower but 1/50th the TCO then it's a win; I don't know how TPUv5 stacks up against the B200). It's not as simple as "just put it on a shelf and sell it and make a gajillion dollars like Nvidia".

boringg
5 replies
16h11m

Ever since Apple did it everyone has leaped on board. Let's see how things pan out for everyone...

1024core
2 replies
15h26m

Google introduced their first TPU in 2015...? Long before Apple taped out their first silicon.

mamp
0 replies
14h2m

Apple’s first in-house designed chip was the A4 in 2010.

fragmede
0 replies
15h17m

If we're talking custom silicon, Google acquired Motorola in 2011, and Apple acquired PA Semi in 2008.

The idea is obvious to everybody in the industry; it's a question of money and motivation.

jacurtis
1 replies
16h0m

And that's why $ARM is a good buy. Selling swords and steel to all these armies as they go to war.

silisili
0 replies
14h9m

ARM is just collecting royalties in this space. Their reference designs aren't exactly competitive.

I'm not saying it's a bad buy, either, but if and when they turn the screws, there will be a mass exodus. Solid long-term play perhaps, but not going to see Nvidia-like price action.

zdyn5
6 replies
16h59m

H100s are far from consumer video cards

stygiansonic
5 replies
16h36m

Yeah, OP's comment makes it seem like they are building racks of RTX 4090s, when this isn't remotely true. Tensor Core performance is far different on the data-center-class devices vs. consumer ones.

mike_d
4 replies
12h53m

They are building racks of 4090s. Nobody can get H100s in any reasonable volume.

Hell, Microsoft is renting GPUs from Oracle Cloud to get enough capacity to run Bing.

kkielhofner
2 replies
9h47m

Who is "they"?

RTX 4090s are terrible for this task. Off the top of my head:

- VRAM (obviously). Isn't that where the racks come in? Not really. Nvidia famously removed something as basic as NVLink between two cards from the 3090 to the 4090. When it comes to bandwidth between cards (crucial) even 16 lanes of PCIe 4 isn't fast enough. When you start talking about "racks" unless you're running on server grade CPUs (contributing to cost vs power vs density vs perf) you're not going to have nearly enough PCIe lanes to get very far. Even P2P over PCIe requires a hack geohot developed[0] and needless to say that's umm, less than confidence inspiring for what you would lay out ($$$) in terms of hardware, space, cooling, and power. The lack of ECC is a real issue as well.

- Form factor. Remember PCIe lanes, etc? The RTX 4090 is a ~three slot beast when using air cooling and needless to say rigging up something like the dual slot water cooled 4090s I have at scale is another challenge altogether... How are people going to wire this up? What do the enclosures/racks/etc look like? This isn't like crypto mining where cheap 1x PCIe risers can be used without dramatically limiting performance to the point of uselessness.

- Performance. As the grandparent comment noted, 4090s are not designed for this workload. In typical usage for training, I see them as 10-20% faster than an RTX 3090 at a much higher cost. Compared to my H100 with SXM, they're ridiculously slow.

- Market segmentation. Nvidia really knows what they're doing here... There are all kinds of limitations you run into with how the hardware is designed (like Tensor Core performance for inference especially).

- Issues at scale. Look at the Meta post - their biggest issues are things that are dramatically worse with consumer cards like the RTX 4090, especially when you're running with some kind of goofy PCIe cabling issue (like risers).

- Power. No matter what power limiting you employ, an RTX 4090 has a pretty bad power/performance ratio. The card isn't fundamentally designed for these tasks - it's designed to run screaming for a few hours a day so gamers can push as many FPS at high res as possible. Training, inference, etc. is a different beast, and the performance vs. power ratio for these tasks is terrible compared to A/H100. Now let's talk about the physical cabling, PSU, etc. issues. Yes, miners had hacks for this as well, but it's yet another issue.

- Fan design. There isn't a single "blower" style RTX 4090 on the market. There was a dual-slot RTX 3090 at one point (I have a bunch of them) but Nvidia made Gigabyte pull them from the market because people were using them for this. Figuring out some kind of air-cooling setup with the fan and cooling design of the available RTX 4090 cards sounds like a complete nightmare...

- Licensing issues. Again, laying out the $$$ for this with a deployment that almost certainly violates the Nvidia EULA is a risky investment.

Three RTX 4090s (at 9 slots) to get "only" 72GB of VRAM, talking over PCIe, using 48 PCIe lanes, multi-node over sloooow ethernet (hitting CPU - slower and yet more power), using what likely ends up at ~900 watts (power limited) for significantly reduced throughput and less VRAM is ridiculous. Scaling the kind of ethernet you need for this (100 gig) comes at a very high per-port cost and due to all of these issues the performance would still be terrible.

I'm all for creativity but deploying "racks" of 4090s for AI tasks is (frankly) flat-out stupid.

[0] - https://github.com/tinygrad/open-gpu-kernel-modules

mike_d
0 replies
35m

> but deploying "racks" of 4090s for AI tasks is (frankly) flat-out stupid.

You seem to be trapped in the delusion that this was anyone's first, second, or third choice.

There is workload demand, you can't get H100s, and if you don't start racking up the cards you can get, the company will replace you with someone less opinionated.

michaelt
0 replies
7h0m

> The RTX 4090 is a ~three slot beast when using air cooling and needless to say rigging up something like the dual slot water cooled 4090s I have at scale is another challenge altogether... How are people going to wire this up? What do the enclosures/racks/etc look like?

A few years ago, if you wanted a lot of GPU power you would buy something like [1] - a 4/5U server with space for ten dual-slot PCIe x16 cards and quadruple power supplies for 2000W of fully redundant power. And not a PCIe riser in sight.

I share your scepticism about whether it's common to run >2 4090s because nvidia have indeed sought to make it difficult.

But if there was some sort of supply chain issue that meant you had to, and you had plenty of cash to make it happen? It could probably be done.

Some of the more value-oriented GPU cloud suppliers like RunPod offer servers with multiple 4090s and I assume those do something along these lines. With 21 slots in the backplane, you could probably fit 6 air-cooled three-slot GPUs, even if you weren't resorting to water cooling.

[1] https://www.supermicro.com/en/products/system/4U/4028/SYS-40...

TeMPOraL
0 replies
11h58m

There are apparently some 400 H100s sitting idle somewhere upthread. Yes, I'm having a hard time imagining how that's possible too.

moneywoes
6 replies
16h38m

Is no one else working on custom silicon?

fnordpiglet
4 replies
15h50m

The problem isn’t just developing your own processor. Nvidia has a huge stack of pretty cutting edge technology including a huge stack from mellanox, an enormous OSS tool chain around CUDA, etc, that people seeking to make comparable products have to overcome.

sangnoir
3 replies
13h50m

Are you suggesting Meta or Google - who stand to save billions - won't be able to get top performance from their custom chips because their tooling/hardware won't support CUDA?

fnordpiglet
2 replies
12h52m

No. I'm suggesting they won't because IP like the Mellanox treasure chest they acquired is ridiculously difficult to develop, and Nvidia has aggressively exploited it, along with their other already advanced IP in the space of their -core business-.

I understand, especially amongst Googlers, there's a belief there are no others smarter than a Googler. But it's simply not the case. Nvidia is excellent at their core competencies and business, which is making absurdly parallel compute platforms with absurdly powerful interconnects. I'm saying Google or Meta won't beat Nvidia at hardware. I'd also point to the fact that Nvidia's ability to raise capital is the best on earth now, so even money isn't a barrier.

The advantage CUDA gives is in the tool chains, libraries, research, and all that that tens of thousands of people are contributing to as part of their jobs, research, and hobbies. This is almost -more valuable- than getting top performance. Getting top techniques, top software, top everything by having everyone everywhere working to make the ecosystem of your stuff is invaluable. Google won’t have that. They will just have the hubris of googlers who believe they’re smarter.

I would also note that at this phase of a cycle in tech, trying to save billions takes your eye off the prize. Cost optimization comes much later, after the market has been fully explored, directions are clear, and diminishing returns on R&D kick in. Any company that doesn't recognize that is run by CPAs and deserves the ignominy they'll face.

sangnoir
0 replies
12h6m

The bar for success for Google and Meta is much lower than Nvidia's - at least for internal usage. Any dollar amount that Google saves on CapEx or OpEx by using custom silicon instead of buying Nvidia helps bring down the cost of revenue. They don't have to match Nvidia on raw performance, and can aim at being better on performance per watt or performance per dollar (TCO) for larger workloads; IIRC, Google is already doing this for some internal inference tasks.

> I would also note that at this phase of a cycle in tech trying to save billions takes your eye off the prize

Big Tech companies are conglomerate-ish and can multitask. The search engine folk aren't pushing stuff back onto the backlog to put out fires delaying chip tape-out, and I bet the respective CEOs aren't burning braincycles micromanaging silicon development either; directors 2-3 rungs below the C-suite can motivate and execute on such an undertaking. The answer to "I need a budget of $300M in order to save the company $5-15B over 3 years" is "How soon can you start?"

logicchains
0 replies
11h45m

> No. I'm suggesting they won't because IP like the Mellanox treasure chest they acquired is ridiculously difficult to develop, and Nvidia has aggressively exploited it, along with their other already advanced IP in the space of their -core business-.

For training Llama3 Facebook set up two clusters, one using fancy InfiniBand and one just using RoCE over Arista cards: https://engineering.fb.com/2024/03/12/data-center-engineerin... . The latter ended up doing fine, suggesting that all that Mellanox stuff isn't necessary for large-scale training (apparently at a large enough scale ethernet scales better than InfiniBand).

threeseed
0 replies
15h50m

Everyone is.

Apple, AWS, Google, Meta, Microsoft all have custom AI-centric silicon.

dinobones
5 replies
16h56m

Do you really think Google’s hardware expertise is better than Nvidia’s?

If needed these other companies have the $$$ to buy the best chips money can buy from Nvidia. Better chips than Google could ever produce.

If anything, this is why IMO Google will fail.

mike_d
2 replies
16h35m

Yes. Google was building custom HPC hardware 5-8 years before Nvidia decided to expand outside the consumer and "workstation" markets.

threeseed
1 replies
15h48m

Nvidia acquired Mellanox who know far more about custom HPC hardware than Google.

mike_d
0 replies
10h9m

Mellanox and friends not being able to build fast enough switching gear at a reasonable price point was what got Google into the hardware game in the first place ;)

candiddevmike
1 replies
16h36m

I thought NVIDIA's moat was mostly software/CUDA?

rapsey
0 replies
15h2m

They have by far the fastest chip, the best software with CUDA, and are feverishly working on next-gen chips. They also bought out multiple years' worth of global high-bandwidth memory manufacturing capacity.

No one will beat them at their game. However if there are any major breakthroughs that might render those processing capacities unneeded, or the major players hitting a wall regarding AI spending, then they will take a massive hit. It will come eventually because the chip business is always in boom/bust cycles.

r_hanz
2 replies
16h24m

Much the same way you can have all the best gear and still fail - Google’s primary strength seems to be the DeepMind group. I’m not affiliated with Google, but IMHO the reason they will slowly die is because their engineering culture has taken a backseat due to their broken hiring practices.

Bad hiring practices aren’t exclusive to them, but from all accounts it seems like their internal focus is on optimizing ad revenue over everything else. I could be wrong or misinformed, but it seems to me like they are playing the finite game in the AI space (DeepMind group aside) while FAIR are playing the infinite game.

Meanwhile, MSFT is simply trying to buy its way to relevance (e.g. OpenAI investments, etc.) and carve out future revenues (Recall), and Jobs-less Apple is building their trademark walled garden (Apple Intelligence?). Although the use of unified memory in Apple silicon poses some interesting possibilities for enabling the use of sizable models on consumer hardware.

Overall it seems like “big-tech” is by-and-large uninspired and asleep at the wheel save specific teams like those led by Lecun, Hassabis, etc. not sure where that leaves OpenAI now that Karpathy is gone.

VirusNewbie
1 replies
14h37m

> ...because their engineering culture has taken a backseat due to their broken hiring practices.

What company do you think has better hiring practices, and subsequently a higher talent pool? Meta's is pretty similar to Google's (though with an emphasis on speed over creativity). Microsoft is certainly worse at hiring than the two aforementioned...

r_hanz
0 replies
4h59m

To be fair, I don’t have any examples of “good practices” readily in-hand. However, I did try to address why I thought others were less impacted by this problem in the second half of my post.

matt-p
2 replies
16h34m

Can't agree. This is like saying $popularApp will fail because they buy expensive hosting at AWS.

Rubbish. They will fail because the product didn't fit the market; if they're successful, they'll have money to buy servers and colo, then drive down cost. If they succeed, it will be in large part because they spent their capital and, more importantly, time on code/engineers rather than servers.

Right now companies are searching for a use of AI that will add hundreds of billions to their market cap. Once they find that, they can make TPUs; right now only one thing matters: getting there first.

mike_d
1 replies
13h0m

> This is like saying $popularApp will fail because they buy expensive hosting at AWS.

For any given mobile app startup, AWS is effectively infinite. The more money you throw at it, the more doodads you get back. Nvidia's supply chain is not infinite and is the bottleneck for all the non-Google players to fight over.

KaiserPro
0 replies
8h52m

If you are training on AWS, it's not infinite. Worse still, you are bidding against other people.

loeg
2 replies
16h41m

The only thing that can stop Google is Google. Somehow every bet that isn't Search doesn't pan out. And inexplicably, they're working hard to kill Search now. As a shareholder, I hope they succeed. But I am more pessimistic about it than you.

yellow_postit
1 replies
14h51m

And they missed multiple waves of effectively building on their own in house research.

aworks
0 replies
2h10m

Reminds me of Xerox Parc vs. Apple. Building successful products is hard.

iamflimflam1
2 replies
12h58m

Don’t forget Apple’s Private Cloud Compute - built on top of Apple Silicon.

rapsey
1 replies
16h55m

Their work on folding and AI could very well be a business worth hundreds of billions, and they know it, as well as many other bets.

Whereas others are playing the LLM race to the bottom.

htrp
0 replies
15h18m

> Their work on folding and AI could very well be a business worth hundreds of billions, and they know it, as well as many other bets.

And we'll be writing case studies of how they squandered billions of R&D to help found other companies (kinda like Xerox PARC).

Almost every interesting paper after transformers has had its authors leave to commercialize their own companies.

houseplant
1 replies
13h36m

After everything I've seen and the litigation coming out of Europe, I really can't see AI lasting long after they're obligated to prove rights for the data they're training on.

They can't get away with having scraped people's owned work forever. You can't steal things from workers and then undercut them by selling that hard work for pennies, and not expect everything to collapse. I mean, I know that the folks in charge of this aren't really known for their foresight, especially when stock numbers and venture capital are the entire point, but... surely I hope people can recognize that this can't go on unimpeded.

logicchains
0 replies
11h43m

Eventually they're going to put vision LLMs in robotic bodies and they'll be able to learn just by listening and watching, just like humans, at which stage the idea that they're "stealing" just by viewing content will be seen as absurd.

coralreef
1 replies
17h1m

How do TPUs perform compared to GPUs on LLMs and image generation?

smarterclayton
0 replies
1h43m

Pretty well. Anthropic runs some of Claude inference on GKE and TPU v5e - this talk at the last GCP conference has some details:

https://youtu.be/b87I1plPeMg?si=T4XSFUzXG8BwpphR

Ecosystem support for GPU is very strong and so smaller teams might find TPUs to have a steeper learning curve. Sophisticated users can definitely get wins today on TPU if they can get capacity.

And it’s not as if Nvidia is standing still, so how the strengths of each change over future generations isn’t set either. I.e., TPUs are “simple” matrix multipliers also optimized for operation at scale, while GPUs are more general-purpose and have strong ecosystem power.

Disclaimer - work on GKE on enabling AI workloads.

checkyoursudo
1 replies
11h46m

What would any company as "the long term AI winner" look like? What would it mean to be the winner in this context?

ketchupdebugger
0 replies
3h37m

The winner is just Nvidia. I see this like the battle of gas vs. electric cars: Nvidia is basically making the wheels. Whichever company wins, you'd still need wheels.

zmmmmm
0 replies
15h25m

Custom silicon is fantastic when things have stabilised and you know exactly what you want. But things are still evolving fast in real time, and in that environment, whoever can move fastest to be ultra flexible and deploy the latest architecture as soon as possible is the winner. I think in a nutshell that is the story of Nvidia's success here: they created a GPGPU platform with just the right level of abstraction to capture the market for AI research.

throwaway920102
0 replies
16h32m

What can stop Google is building the wrong thing, or being so scared to launch anything that they smother their own fledgling products before they are born or before they can mature. Their product and finance and accounting teams should be tossed.

raincole
0 replies
14h57m

^ How to pack as many mistakes as possible into one single comment.

1. Google's stock didn't significantly outperform Meta, Microsoft, etc., in the past two years.

2. Meta and Microsoft are trying to make their own chips as well.

3. They're not using "consumer video cards" to train AI. I don't even know if you can call these beasts video cards any more. The H100 doesn't have an HDMI port.

lxgr
0 replies
15h43m

I wish I had your faith in Google’s ability to refrain from kneecapping their own perfectly fine product.

jejeyyy77
0 replies
1h30m

lolwat.

google is the biggest loser in all of this.

jatins
0 replies
12h34m

this was the same argument that was presented a decade ago on why Google was supposed to win the cloud because their internal infra was miles ahead of Amazon and Microsoft.

Yet here we are. Will the consumer video cards get cheaper and better faster or will Google's directors' infighting stop first?

htrp
0 replies
15h21m

> Literally the only thing that can stop Google now is the fact they keep bringing Microsoft and Oracle flunkies into leadership positions.

You mean the thing that's already stopped them? If they had seriously invested into the TPU ecosystem in 2015, they would already have "won" AI.

hooloovoo_zoo
0 replies
16h57m

Google has been working on TPUs and Tensorflow for a decade with pretty mixed success; I don't think it's clear that they're going to win.

hipadev23
0 replies
14h39m

Google’s on their 6th generation and still can’t find anyone to use it. Hmm.

guardiang
0 replies
14h2m

Google is old like MetLife, relative to their respective industries. Both are carrying too much baggage and are top-heavy. As a result, I personally don't think Google will be able to keep pace with OpenAI in the long run.

fragmede
0 replies
15h16m

smart money has a diversified portfolio and isn't betting on any one winner and has invested in all of them, and then some.

callalex
0 replies
15h34m

That’s how all huge tech companies become dinosaurs though. Upper management that is already stupidly wealthy (and therefore unmotivated) have the funding and patience to hire geniuses to build incredible machines, and then constantly tie their shoelaces together while asking them to sprint. Examples include Microsoft and Oracle as you said, and before them IBM, AT&T, TIBCO, Marvell, Motorola; I could go on for a while…

blackoil
0 replies
15h48m

I would call it "stupid" money. This isn't a commodity business. Value of the final product is orthogonal to amount invested in compute. If Google is 10% slower or its product is 10% worse, it can lose all the value. This is like valuing a software company higher because its devs are using cheap PC desktops instead of Mac.

asynchronous
0 replies
15h6m

Laughable response when you actually look at the quality of the algorithms being produced by Google. They’re so behind it’s embarrassing.

ai4ever
0 replies
16h50m

Here is an older take on this same topic:

https://www.yitay.net/blog/training-great-llms-entirely-from...

GPU vs. TPU, and good software managing large clusters of them across all sorts of failures.

The funny bit from the above article is the incident when someone forgot about a training job at Google, and a month later had the model fully trained without an alert of any kind. "Outrageously good infra."

KaiserPro
0 replies
8h53m

> Meta, Microsoft, OpenAI, etc. are trying to address problems with consumer video cards

Yes, but you are buying access to tested, supported units that are proven to work, don't require custom software, and are almost plug-and-go. When it's time to upgrade, it's not that costly.

Designing, fabricating, and deploying your own silicon is expensive; creating software support for it, more expense still. Then there is the opportunity cost of having to optimise the software stack yourself.

You're exchanging a large capex for a similar-sized capex plus a fuckton of opex as well.

Jabrov
0 replies
16h21m

"Consumer video cards"? Meta's not building their clusters out of 3090s.

They're using advanced cards meant for data centers and machine learning -- almost effectively "custom silicon"

HarHarVeryFunny
0 replies
16h35m

The chips are somewhat irrelevant. It's the overall system architecture, management, and fault recovery that matters.

Der_Einzige
0 replies
13h32m

Until the lion's share of AI projects support that custom silicon, I will continue to bet on anyone buying Nvidia GPUs.

samspenc
25 replies
15h45m

OK this was a bit funny:

  Top HW failure modes:
  * GPU falling off the bus

I honestly thought "do they mean GPUs falling off a bus entering the data center?" and then realized it's actually the connectivity, as they mention in the next line:

  GPUs falling off: In this case, GPUs are not detected by the host on PCIe.

redbell
8 replies
9h20m

> GPU falling off the bus

I'm wondering if we could prompt llama3 with the above statement. What kind of response would it give?

TeMPOraL
7 replies
8h39m

With temperature set to 1, it recognizes the joke, but proceeds to explain what the "bus" is in computer terms, picks a problem this prompt could mean, and explains how to solve it. In ~20 tries it always gave me something along the lines of:

  The infamous "GPU falling off the bus" issue!

  This problem typically occurs when a graphics processing unit (GPU) is not properly seated or connected to its expansion slot, such as PCIe, on a motherboard.

  Here are some troubleshooting steps to help resolve the issue:

  (numbered list of steps or options follows)

Tested on Llama 3 Instruct 7B Q8_0, because that one fits entirely on my GPU.

redbell
6 replies
7h41m

+1, interesting findings! I like how it was able to infer the meaning from such a short phrase in a limited context.

burkaman
4 replies
4h59m

It's actually a very common phrase on forums, I think because it's an actual error that Linux will report: https://askubuntu.com/questions/868321/gpu-has-fallen-off-th.... I'd also never heard of it, but it seems like it must appear a lot in the training data, and probably about 0 times is it referring to a bus on the road.

TeMPOraL
3 replies
4h52m

In my testing, both Llama 3 and its abliterated (uncensored) variant from [0] almost always remarked more or less directly that they saw the joke in the phrase, so either they've seen the other meaning in training, or inferred it.

--

[0] - https://news.ycombinator.com/item?id=40665721

Technetium
1 replies
3h44m

Please use the word ablated instead. That article's title is not using a real word. I'm assuming it's the author's English issue, since they called the model "helpfull" instead of "helpful".

TeMPOraL
0 replies
2h50m

Oops. I actually originally wrote "ablated", then changed it to be consistent with the title.

burkaman
0 replies
4h25m

Oh I agree it probably inferred the joke. I was actually more surprised that it knew the real meaning of the phrase because I as a human did not, until I looked it up and saw how common it is.

TeMPOraL
0 replies
7h34m

To be specific, the system prompt used was (default in LM Studio config for Llama 3 V2):

  You are a helpful, smart, kind, and efficient AI assistant. You always fulfill the user's requests to the best of your ability.

And then the query was:

  GPU falling off the bus

And yes, I imagine it read that query as ending with an implied "pls help!".
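
If anyone wants to poke at this themselves, here's a minimal sketch of the kind of call involved, assuming an OpenAI-compatible local endpoint like the ones LM Studio or llama.cpp's server expose; the base_url, placeholder api_key, and model name are assumptions you'd swap for your own setup:

  # Sketch: send the same system prompt and query to a locally served Llama 3
  # through an OpenAI-compatible endpoint. base_url, api_key and model name
  # are placeholders for whatever your local server actually uses.
  from openai import OpenAI

  client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

  resp = client.chat.completions.create(
      model="llama-3-8b-instruct-q8_0",  # hypothetical local model id
      temperature=1.0,
      messages=[
          {"role": "system", "content": "You are a helpful, smart, kind, and "
           "efficient AI assistant. You always fulfill the user's requests to "
           "the best of your ability."},
          {"role": "user", "content": "GPU falling off the bus"},
      ],
  )
  print(resp.choices[0].message.content)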

sva_
2 replies
8h36m

Back when EVGA was still selling GPUs ...

taneq
0 replies
6h38m

They did it to finance their street racing habit, I'm sure. :P

TiredOfLife
0 replies
8h22m

And offering warranties. And not doing a stealth total component change under the same SKU.

martin-adams
2 replies
12h32m

A GPU falling off the bus would be one mega flop

ardit33
0 replies
4h5m

Haha… It is a ‘Tera Flop’, as they are falling to the ground…

TeMPOraL
0 replies
12h24m

The audience in the back goes clap clap clap, chapeau bas.

whazor
1 replies
11h23m

I was imagining that some sys admin has to walk to the server, take out the GPU, blow against the PCI-E pins like a game cartridge, and put it back to try again.

iszomer
0 replies
7h45m

More to do with bent pins, material obstruction, or something as trivial as cable management (e.g. bundles of QSFP cables weighing down ports that are press-fitted, not soldered).

cachvico
1 replies
15h6m

Brings a whole new meaning to bus factor

throwup238
0 replies
13h34m

I’ve never met a GPU that could survive getting hit by a bus.

ChuckMcM
1 replies
13h28m

The bits on the bus go round and round!

There is a lot of interesting yet unpublished work on 'data center' scale compute complexes. It was a rabbit hole I fell into several times while at Google.

spmurrayzzz
0 replies
3h19m

Speaking for myself (and I guess anyone else dealing with PCIe riser hell in on-prem deep learning setups), it's nice to see the massive orgs dealing with pretty much the same exact pain points as not-so-massive orgs.

rfoo
0 replies
9h22m

"GPU has fallen off the bus" is an actual error message nvidia.ko prints to dmesg in this case :p

krowfromthewall
0 replies
6h3m

like they did to our dear Anton in Silicon Valley

kdot
8 replies
16h1m

How will Meta leverage LLMs at scale to drive revenue? It's not clear.

sangnoir
1 replies
13h40m

If only Meta had a way to monetize engagement with generative AI content in a way that scales with quantity of generated content.

tucnak
0 replies
12h55m

Revolutionary!

HDThoreaun
1 replies
12h16m

I think the nearish-term plan is chatbot customer assistants on WhatsApp. Doesn't seem like they're that close to releasing them, but who knows.

altdataseller
0 replies
10h19m

At the end of the day, everything eventually comes down to chatbots

threeseed
0 replies
15h8m

I still believe that a VR future is coming once the technology commoditises, i.e. costs come down 10x and we have 3090-level GPUs in the headset. At that point we will have photo-realistic experiences, like concerts, that anyone can afford.

And at that point having a lot of LLM based avatars that can help "fill in the space" will be valuable.

dweekly
0 replies
15h23m

Ask the AI to come up with a business model.

OsrsNeedsf2P
0 replies
12h20m

LLMs aren't a monetizable product themselves. For the foreseeable future, that will always be ads. LLMs (and VR) are just big bets on getting ahead of future technology.

123yawaworht456
0 replies
8h29m

1. improving their adtech. someone else's API offerings are not an option due to the sheer volume, PII and whatnot.

2. virtually free moderation for their existing (facebook, instagram, threads) and future social media services. likewise, their volume is too insane to even consider paying someone else to process it.

the models they do release are probably toys in comparison to their internal models.

Oras
8 replies
15h0m

Would be nice to read how they collect/prepare data for training.

Which data sources? How much of Meta users' data (FB, Instagram, etc.)? How do they sanitize PII?

OsrsNeedsf2P
4 replies
12h24m

> How do they sanitize PII?

I can't comment on how things like faces get used, but in my experience, PII at Meta is inaccessible by default.

Unless you're impersonating a user on the platform (to access what PII they can see), you have to request special access for logs or database columns that contain so much as user IDs, otherwise the data simply won't show up when you query for it. This is baked into the infrastructure layer, so I doubt the GenAI teams are using something else.

actionfromafar
2 replies
10h37m

For a convenient definition of PII. Isn’t everything a user does in aggregate PII?

robertlagrant
0 replies
9h20m

I don't think it's PII. If you had someone's movements, you could go and spy on them, find out who they were (i.e. their PII) and then link that back and say "I now know this identified person's movements". I don't think the movements themselves are PII.

Things that aren't PII aren't "convenient" definitions. Doesn't mean everything that isn't PII is fine to share. It's like saying a kidnapping isn't a murder. That's not a convenient definition of murder; it's just a different thing. We shouldn't start talking like witch hunters as soon as we encounter a situation that we haven't memorised a reasonable response to. We should be able to respond reasonably to new situations.

KaiserPro
0 replies
8h58m

PII is pretty intuitive to define.

Obvious examples: data that easily identifies a person (photo, name, number, UUID, etc.)

That's trivial to block. Where it gets harder is stuff that on its own isn't PII, but combined with another source, would be.

For example, aggregating public comments on a celeb's post (i.e. stripping out usernames and likes and assigning a new UUID to each person). For a single post, that's good enough. You're very unlikely to be able to identify a single person.

But over multiple posts, that's where it gets tricky.

As with large companies, the process for getting permission to use that kind of data is rightly difficult, so it often doesn't get used like that.

sonofaragorn
0 replies
30m

What about a post or comment that includes proper names?

michaelt
0 replies
8h46m

Then, you should check out papers like https://arxiv.org/abs/2302.13971 and https://arxiv.org/abs/2307.09288

In the paper covering the original Llama they explicitly list their data sources in table 1 - including saying that they pretrained on the somewhat controversial books3 dataset.

The paper for Llama 2 also explicitly says they don't take data from Meta's products and services; and that they filter out data from sites known to contain a lot of PII. Although it is more coy about precisely what data sources they used, like many such papers are.

discobot
0 replies
9h41m

They explicitly train models only on public datasets

radarsat1
3 replies
11h45m

Frustratingly little information. For example, I'm exceedingly curious how they deal with scheduling jobs on such a huge array of machines. The article:

> Efficient scheduling helps ensure that our resources are used optimally. This involves sophisticated algorithms that can allocate resources based on the needs of different jobs and dynamic scheduling to adapt to changing workloads.

Wow thanks for that, captain obvious. So how do you do it?

p4ul
2 replies
3h37m

I usually assume these companies are using some of the popular schedulers (e.g., Slurm, MOAB, SGE) that have existed in the HPC community for many years.

I have anecdotally also heard that some are using k8s, but I've not seen that myself. Slurm [1] is basically built for this stuff; that's definitely what I would use!

[1] https://slurm.schedmd.com/documentation.html
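
To make the Slurm route concrete, here's a minimal sketch (not anything Meta has described) of how a Slurm-scheduled multi-node PyTorch job usually bootstraps itself: srun launches one process per GPU, and each process reads its rank from Slurm's environment variables. MASTER_ADDR/MASTER_PORT are assumed to be exported by the sbatch script.

  # Minimal sketch of Slurm -> torch.distributed wiring; assumes srun launches
  # one task per GPU and the sbatch script exports MASTER_ADDR/MASTER_PORT
  # (e.g. taken from the first host in $SLURM_JOB_NODELIST).
  import os
  import torch
  import torch.distributed as dist

  def init_from_slurm():
      rank = int(os.environ["SLURM_PROCID"])         # global rank across all nodes
      world_size = int(os.environ["SLURM_NTASKS"])   # total number of processes
      local_rank = int(os.environ["SLURM_LOCALID"])  # rank within this node
      dist.init_process_group(backend="nccl", init_method="env://",
                              rank=rank, world_size=world_size)
      torch.cuda.set_device(local_rank)              # bind this process to its GPU
      return rank, world_size, local_rank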

claytonjy
1 replies
1h34m

Slurm is definitely still dominant, but OpenAI has been using k8s for training for many years now¹, and there are various ways to run Slurm on top of Kubernetes, including the recent SUNK from CoreWeave².

At my company we use Slurm "directly" for static compute we rent or own (i.e. not in a public cloud), but are considering using Kubernetes because that's how we run the rest of the company, and we'd rather invest more effort into being better at k8s than becoming good Slurm admins.

¹: https://openai.com/index/scaling-kubernetes-to-2500-nodes/

²: https://www.coreweave.com/blog/sunk-slurm-on-kubernetes-impl...

p4ul
0 replies
1h30m

Very cool! Thanks for this, claytonjy!!

idkdotcom
3 replies
15h41m

These seem like classic challenges of running distributed-systems loads that are not specific to training LLMs.

Any one of the supercomputers listed here https://en.wikipedia.org/wiki/TOP500 suffers from the same issues.

Think about it. While the national labs use these systems to model serious stuff -such as climate or nuclear weapons- Meta uses them to train LLMs. What a joke, honestly!

whiplash451
0 replies
11h58m

A lot of serious things look like a toy or a joke at first.

mhandley
0 replies
7h32m

On the other hand, Meta just rapidly built two different training networks in existing datacenter buildings, with existing cooling constraints, using mostly commodity components (albeit expensive commodity components) each of which would place at #3 on that top500 list in terms of GPU power. Compare that with how long it took to get any of the other supercomputers from design to being fully commissioned.

_zoltan_
0 replies
1h27m

For-profit work is not less serious than what research labs do. I'd even say it's more important: it drives the economy.

yosito
2 replies
15h22m

I wish that instead of just training another stupid LLM, Meta would use it to improve their search and help me find the content I'm actually interested in.

TeMPOraL
1 replies
12h35m

Their revenue depends on it being hard (but not impossible) for you to find the content you're actually interested in. Would be nice if it didn't, but in this reality, money on the Internet is made by wasting users' lives. That is what attention economy is about.

rmbyrro
0 replies
1h6m

It's actually a mix. They need to disappoint the user for the right amount of time, and then please them at the right moment and in the right dose. This maximizes the dopamine release and increases addictiveness.

When you find good content depends on when the algo judges you're already primed for a colorful dopamine intake.

xvector
2 replies
15h37m

> So we decided to build both: two 24k clusters, one with RoCE and another with InfiniBand. Our intent was to build and learn from the operational experience.

I love how they built two completely insane clusters just to learn. That's badass.

riku_iki
0 replies
15h33m

More like Mark gave them 100k GPUs, and they are not sure what exactly to do with them..

logicchains
0 replies
11h37m

It's not just to learn; an RoCE Ethernet cluster with Aristas is way cheaper to build and maintain than a fancy InfiniBand cluster with Mellanox/Nvidia networking, so proving that the former is good enough at scale will eventually save Meta a huge amount of money. InfiniBand cards are much more expensive than Ethernet because there are few vendors, which have a quasi-monopoly, and because overall far fewer of them are produced, so there's less economy of scale.

whalesalad
2 replies
17h7m

interesting that their domain is still engineering.fb.com

samspenc
1 replies
15h48m

I think fb.com is their internal domain and they never really bothered to change it. Employees used to have an @fb.com e-mail; at least this was true a few years ago, not sure if that has changed.

dnissley
0 replies
15h31m

Emails switched to @meta.com in 2022.

lokimedes
2 replies
12h37m

Yikes, the little InfiniBand+A100 cluster I installed for my previous company seemed useful at the time (12 GPUs), and that was at a cost of around $300k. With LLMs it feels like game over for non-cloud applications if you are not a mega-corp.

lannisterstark
1 replies
12h21m

Well, yes, but not all models need to be "super large." Smaller models, specialized in specific tasks, working together - and then reporting to a slightly larger model - is the way to go.

Think of everything being connected to a "Home Computer" in those "Future House of 2020" videos that were out there in the 70s or whatnot.

Another example (very rough) would be something like "Weather data gets to a small model via an API, the model looks at it, updates the home dashboard, also sees if there are any alerts, and if so, adds x or y to the home dashboard appropriately as to what it thinks best."

We can probably achieve the latter example today (without any significant 'coding' on anyone's part except the API owner). A rough sketch is below.
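
Something along these lines - a sketch only, where the weather endpoint, dashboard URL, and model name are hypothetical placeholders and the "small model" is whatever you happen to serve locally:

  # Rough sketch of the idea above: pull structured weather data, let a small
  # locally served model decide whether anything is alert-worthy, and push its
  # one-line summary to a (hypothetical) home dashboard.
  import requests
  from openai import OpenAI

  weather = requests.get("http://weather-api.example/today").json()  # hypothetical API

  client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")
  summary = client.chat.completions.create(
      model="small-local-model",  # placeholder for any small instruct model
      messages=[
          {"role": "system", "content": "You summarize weather data for a home "
           "dashboard. Flag anything alert-worthy in one short line."},
          {"role": "user", "content": str(weather)},
      ],
  ).choices[0].message.content

  # hypothetical dashboard endpoint
  requests.post("http://home-dashboard.local/widgets/weather", json={"text": summary})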

afro88
0 replies
7h4m

> Well, yes, but not all models need to be "super large." Smaller models, specialized in specific tasks, working together - and then reporting to a slightly larger model - is the way to go.

I want to believe, but I'm still yet to see this kind of set up being anywhere near GPT-4 level.

The weather example seems quite contrived. Why not just display the alerts for your area? Why is a complex system of smaller models reporting up to a slightly larger model necessary?

jauntywundrkind
1 replies
17h13m

Random q, I wonder if gloo is used in these systems? https://github.com/facebookincubator/gloo

RDMA and GPUDirect capable. Coordinates over MPI or (hi)redis.
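
For context, the way most people touch gloo in practice is as a torch.distributed backend for CPU-side collectives; a tiny sketch follows (whether these clusters use gloo directly isn't stated in the article):

  # Tiny sketch: gloo as a torch.distributed backend. Assumes RANK, WORLD_SIZE,
  # MASTER_ADDR and MASTER_PORT are already set in the environment.
  import torch
  import torch.distributed as dist

  dist.init_process_group(backend="gloo", init_method="env://")
  t = torch.ones(4)
  dist.all_reduce(t)  # element-wise sum across all ranks, over gloo
  print(t)            # each element now equals the world size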

runeblaze
0 replies
16h11m

IIRC gloo is CPU tensors only, so likely not (struck through; see edit below)

Edit: I had a brain freeze or something... gloo is not CPU-only, but for whatever reason I don't see it outside of CPU comms

dudus
1 replies
15h18m

> Since we did not have time to change the cooling infrastructure, we had to remain in an air-cooled environment. The mechanical and thermal designs had to change to accommodate this, and that triggered a validation cycle to support a large-scale deployment.
>
> All of these hardware-related changes were challenging because we had to find a solution that fit within the existing resource constraints, with a very small degree of freedom to change and meet a tight schedule.

Seems like the time constraints put on the team impacted the overall quality of the model.

vessenes
0 replies
5h28m

This sounds to me like standard CYA / perhaps good-natured complaining from the tech team.

The last tech team to have no budget and time constraints to pursue their vision? I don’t know, the Xanadu team? Romero’s original Daikatana team?

koolala
0 replies
15h8m

what a title...