
How Meta trains large language models at scale

mike_d
109 replies
17h7m

Posts like this underscore why the smart money is betting on Google as the long-term AI winner. Meta, Microsoft, OpenAI, etc. are trying to address problems with consumer video cards and spending billions to try and outbid each other to win Nvidia's favor - while Google is on their 6th generation of custom silicon.

Literally the only thing that can stop Google now is the fact they keep bringing Microsoft and Oracle flunkies into leadership positions.

throwaway_ab
40 replies
16h43m

I think it's likely Nvidia's GPUs, many of which are $50,000+ for a single unit, far surpass Google's custom silicon; otherwise, why wouldn't Google be selling shovels like Nvidia?

If Google had a better chip, or even a chip that was close, they would sell it to anyone and everyone.

From a quick search I can see Google's custom chips are 15x to 30x slower at training AI compared to Nvidia's current latest-gen AI-specific GPUs.

candiddevmike
15 replies
16h38m

They do sell shovels, you can get Google TPUs on Google Cloud.

matt-p
10 replies
16h29m

Exactly, and they are still only about 1/18th as good at training LLMs as an H100.

Maybe they are less than 1/18th the cost, so Google technically has a marginally better unit cost, but I doubt it when you consider the R&D cost. They are less bad at inference, but still much worse than even an A100.

jeffbee
4 replies
16h17m

I don't see how you can evaluate better and worse for training without doing so on a cost basis. If it costs less and eventually finishes, then it's better.

tmostak
2 replies
15h11m

This assumes that you can linearly scale up the number of TPUs to get equal performance to Nvidia cards for less cost. Like most things distributed, this is unlikely to be the case.

pama
0 replies
11h37m

The repo mentions a Karpathy tweet from Jan 2023. Andrej has recently created llm.c, and the same model trained about 32x faster on the same Nvidia hardware mentioned in the tweet. I don't think the performance estimate that the repo used (based on that early tweet) was accurate for the performance of the Nvidia hardware itself.

fbdab103
0 replies
15h6m

Time is money. You might be a lab with long queues to train, leaving expensive staff twiddling their thumbs.

blharr
1 replies
16h16m

Also energy cost: 18 chips vs. 1, it's probably costing a lot more to run 18.

jeffbee
0 replies
16h11m

Google claims the opposite in "TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings" https://arxiv.org/abs/2304.01433

Despite various details I don't think that this is an area where Facebook is very different from Google. Both have terrifying amounts of datacenter to play with. Both have long experience making reliable products out of unreliable subsystems. Both have innovative orchestration and storage stacks. Meta hasn't published much or anything about things like reconfigurable optical switches, but that doesn't mean they don't have such a thing.

UCBdaPatterson
1 replies
4h38m

If you're interested in a peer-reviewed scientific comparison, Google writes retrospective papers after contemporary TPUs and GPUs are deployed, rather than speculating about future products. The most recent compares TPU v4 and A100. (TPU v5 vs. H100 is for a future paper.) Here is a quote from the abstract:

"Deployed since 2020, TPU v4 outperforms TPU v3 by 2.1x and improves performance/Watt by 2.7x. ... For similar sized systems, it is ~4.3x--4.5x faster than the Graphcore IPU Bow and is 1.2x--1.7x faster and uses 1.3x--1.9x less power than the Nvidia A100. TPU v4s inside the energy-optimized warehouse scale computers of Google Cloud use ~2--6x less energy and produce ~20x less CO2e than contemporary DSAs in typical on-premise data centers."

Here is a link to the paper: https://dl.acm.org/doi/pdf/10.1145/3579371.3589350

coder543
0 replies
3h57m

That quote is referring to the A100... the H100 used ~75% more power to deliver "up to 9x faster AI training and up to 30x faster AI inference speedups on large language models compared to the prior generation A100."[0]

Which sure makes the H100 sound both faster and more efficient (per unit of compute) than the TPU v4, given what was in your quote. I don't think your quote does anything to support the position that TPUs are noticeably better than Nvidia's offerings for this task.

Complicating this is that the TPU v5 generation has already come out, and the Nvidia B100 generation is imminent within a couple of months. (So, no, a comparison of TPUv5 to H100 isn't for a future paper... that future paper should be comparing TPUv5 to B100, not H100.)

[0]: https://developer.nvidia.com/blog/nvidia-hopper-architecture...

derefr
0 replies
14h55m

Given that Google invented the Transformer architecture (and Google AI continues to do foundational R&D on ML architecture) — and that Google's TPUs don't even support the most common ML standards, but require their own training and inference frameworks — I would assume that "the point" of TPUs, from Google's perspective, has less to do with running LLMs and more to do with running weird experimental custom model architectures that don't even exist as journal papers yet.

I would bet money that TPUs are at least better at doing AI research than anything Nvidia will sell you. That alone might be enough for Google to keep getting some new ones fabbed each year. The TPUs you can rent on Google Cloud might very well just be hardware requisitioned by the AI team, for the AI team, that they aren't always using to capacity, and so is "earning out" its CapEx through public rentals.

TPUs are maybe also better at other things Google does internally, too. Running inference on YouTube's audio+video-input timecoded-captions-output model, say.

throwaway_ab
3 replies
16h28m

Wouldn't that be renting a shovel vs selling a shovel?

candiddevmike
2 replies
16h26m

NVIDIA sells subscriptions...

throwaway_ab
0 replies
16h15m

I'm only aware of Nvidia AI Enterprise and that isn't required to run the GPU.

I think it's aimed at medium to large corporations.

Massive corporations such as Meta and OpenAI would build their own cloud and not rely on this.

The GPU really is a shovel, and can be used without any subscription.

Don't get me wrong, I want there to be competition with Nvidia, I want more access for open source and small players to run and train AI on competitive hardware at our own sites.

But no one is competing, no one has any idea what they're doing. Nvidia has no competition whatsoever, no one is even close.

This lets Nvidia get away with adding more vram onto an AI specific GPU and increase the price by 10x.

This lets Nvidia remove NVLink from current gen consumer cards like the 4090.

This lets Nvidia use their driver licence to prevent cloud platforms from offering consumer cards as a choice in datacenters.

If Nvidia had a shred of competition things would be much better.

bluedino
10 replies
15h33m

We have almost 400 H100s sitting idle. I wonder how many other companies are buying millions of dollars' worth of these chips with the hope of using them, only for them to sit unutilized?

jonathanlei
1 replies
14h23m

Hello! If you're interested in monetizing those GPUs, I'd be happy to rent them (all 400!) and offer those to customers of the cloud I work at :)

jonathan [at] tensordock.com

radq
0 replies
14h43m

Have you considered sponsoring an open-source project? ;)

newswasboring
0 replies
15h23m

Wow, that's a lot of money in inventory. What was the original thought process? Just fomo?

irjustin
0 replies
15h29m

That's insane and incredible all at the same time.

giancarlostoro
0 replies
15h1m

Probably could profit selling them second hand honestly.

gfosco
0 replies
15h18m

Looking to get rid of a few?....

fragmede
0 replies
15h14m

the world would love to buy time on your idle H100s if you're selling.

TeMPOraL
0 replies
12h4m

So you're saying, H100s are the corporate equivalent of Raspberry Pis now? Bought to not miss out, then left to gather dust in a drawer?

Der_Einzige
0 replies
13h31m

I know of many important projects that need GPUs right now and aren't getting any. You could help motivate the ponydiffusion folks to actually try fine-tuning SD3!

vineyardmike
6 replies
13h59m

> why wouldn't Google be selling shovels

They do sell them - but through their struggling cloud business. Either way, Nvidia's margin is Google's opportunity to lower costs.

> I can see Google's custom chips are 15x to 30x slower to train AI

TPUs are designed for inference, not training - they're betting that they can serve models to the world at a lower cost structure than their competition. The compute required for inference to serve their billions of customers is far greater than training costs for models - even LLMs. They've been running model inference as a part of production traffic for years.

refulgentis
4 replies
13h10m

This breaks my brain, because I know Google trains its models on TPUs and they're seen as faster, and if they're better at inference, and can train, then why is Nvidia in a unique position? My understanding was always that it's as simple as TPUs requiring esoteric tooling.

vineyardmike
1 replies
3h6m

Because people generally don’t use TPUs outside of Google. The tooling is different, the access is metered through GCP, etc.

Nvidia is in a vaguely unique position in that their products have great tooling support and few companies sell silicon at their scale.

refulgentis
0 replies
1h23m

Correct, I'm pointing out politely that's in conflict with the person I'm replying to.

smueller1234
0 replies
13h4m

Multiple types of TPUs.

(I work for Google, but the above is public information.)

koe123
0 replies
1h39m

Possibly naive, but I very much view CUDA and its integration into ML frameworks as being Nvidia's moat.

uluyol
0 replies
13h10m

Google most certainly uses TPUs for training.

xipix
0 replies
10h13m

Intel, AMD, and others also have chips for training that perform close to or sometimes better than Nvidia's. These are already in the market. Two problems: the CUDA moat, and "no one gets fired for buying green".

nickpsecurity
0 replies
32m

That’s not necessarily true. Many companies make chips they won’t sell to support lucrative, proprietary offerings. Mainframe processors are the classic example.

In AI, Google (TPU) and Intel (Gaudi) each have chips they push in cloud offerings. The cloud offerings have cross selling opportunities. That by itself would be a reason to keep it internal at their scale. It might also be easier to support one, or a small set, of deployments that are internal vs the variety that external customers would use.

megablast
0 replies
9h58m

Is that why Apple sells their chips to everyone??

girvo
0 replies
12h5m

> If Google had a better chip, or even a chip that was close, they would sell it to anyone and everyone.

While I do not actually think Google's chips are better or close to being better, I don't think this actually holds?

If the upside of <better chip> is effectively unbounded, it would outweigh the short term benefit of selling them to others, I would think. At least for a company like Google.

aseipp
0 replies
16h12m

Nvidia has decades of experience selling hardware to people, with all the pains that entails: support, sales channels, customer acquisition, software. It's something you don't just do overnight, and it does cost money. Google's TPUs get some of their cost efficiency from not supporting COTS use cases and not bearing the overhead of selling to people, and the total wall-clock time also has to include the total operational costs, which dominate at their size (e.g. if it's 30x slower but 1/50th the TCO then it's a win; I don't know how TPUv5 stacks up against the B200). It's not as simple as "just put it on a shelf and sell it and make a gajillion dollars like Nvidia".

boringg
5 replies
16h11m

Ever since Apple did it everyone has leaped on board. Let's see how things pan out for everyone...

1024core
2 replies
15h26m

Google introduced their first TPU in 2015...? Long before Apple taped out their first silicon.

mamp
0 replies
14h2m

Apple’s first in-house designed chip was the A4 in 2010.

fragmede
0 replies
15h17m

If we're talking custom silicon, Google acquired Motorola in 2011, and Apple acquired PA Semi in 2008.

The idea is obvious to everybody in the industry; it's a question of money and motivation.

jacurtis
1 replies
16h0m

And that's why $ARM is a good buy. Selling swords and steel to all these armies as they go to war.

silisili
0 replies
14h9m

ARM is just collecting royalties in this space. Their reference designs aren't exactly competitive.

I'm not saying it's a bad buy, either, but if and when they turn the screws, there will be a mass exodus. Solid long-term play perhaps, but not going to see Nvidia-like price action.

zdyn5
6 replies
16h59m

H100s are far from consumer video cards

stygiansonic
5 replies
16h36m

Yeah, OP's comment makes it seem like they are building racks of RTX 4090s, when this isn't remotely true. Tensor Core performance is far different on the data-center-class devices vs. consumer ones.

mike_d
4 replies
12h53m

They are building racks of 4090s. Nobody can get H100s in any reasonable volume.

Hell, Microsoft is renting GPUs from Oracle Cloud to get enough capacity to run Bing.

kkielhofner
2 replies
9h47m

Who is "they"?

RTX 4090s are terrible for this task. Off the top of my head:

- VRAM (obviously). Isn't that where the racks come in? Not really. Nvidia famously removed something as basic as NVLink between two cards from the 3090 to the 4090. When it comes to bandwidth between cards (crucial) even 16 lanes of PCIe 4 isn't fast enough. When you start talking about "racks" unless you're running on server grade CPUs (contributing to cost vs power vs density vs perf) you're not going to have nearly enough PCIe lanes to get very far. Even P2P over PCIe requires a hack geohot developed[0] and needless to say that's umm, less than confidence inspiring for what you would lay out ($$$) in terms of hardware, space, cooling, and power. The lack of ECC is a real issue as well.

- Form factor. Remember PCIe lanes, etc? The RTX 4090 is a ~three slot beast when using air cooling and needless to say rigging up something like the dual slot water cooled 4090s I have at scale is another challenge altogether... How are people going to wire this up? What do the enclosures/racks/etc look like? This isn't like crypto mining where cheap 1x PCIe risers can be used without dramatically limiting performance to the point of uselessness.

- Performance. As the grandparent comment noted, 4090s are not designed for this workload. In typical usage for training, I see them as 10-20% faster than an RTX 3090 at a much higher cost. Compared to my H100 with SXM, they're ridiculously slow.

- Market segmentation. Nvidia really knows what they're doing here... There are all kinds of limitations you run into with how the hardware is designed (like Tensor Core performance for inference especially).

- Issues at scale. Look at the Meta post - their biggest issues are things that are dramatically worse with consumer cards like the RTX 4090, especially when you're running with some kind of goofy PCIe cabling issue (like risers).

- Power. No matter what power limiting you employ, an RTX 4090 has a pretty bad power/performance ratio. The card isn't fundamentally designed for these tasks - it's designed to run screaming for a few hours a day so gamers can push as many FPS at high res as possible. Training, inference, etc. is a different beast, and the performance vs. power ratio for these tasks is terrible compared to A/H100. Now let's talk about the physical cabling, PSU, etc. issues. Yes, miners had hacks for this as well, but it's yet another issue.

- Fan design. There isn't a single "blower" style RTX 4090 on the market. There was a dual-slot RTX 3090 at one point (I have a bunch of them) but Nvidia made Gigabyte pull them from the market because people were using them for this. Figuring out some kind of air-cooling setup with the fan and cooling design of the available RTX 4090 cards sounds like a complete nightmare...

- Licensing issues. Again, laying out the $$$ for this with a deployment that almost certainly violates the Nvidia EULA is a risky investment.

Three RTX 4090s (at 9 slots) to get "only" 72GB of VRAM, talking over PCIe, using 48 PCIe lanes, multi-node over sloooow ethernet (hitting CPU - slower and yet more power), using what likely ends up at ~900 watts (power limited) for significantly reduced throughput and less VRAM is ridiculous. Scaling the kind of ethernet you need for this (100 gig) comes at a very high per-port cost and due to all of these issues the performance would still be terrible.

I'm all for creativity but deploying "racks" of 4090s for AI tasks is (frankly) flat-out stupid.

[0] - https://github.com/tinygrad/open-gpu-kernel-modules

mike_d
0 replies
35m

> but deploying "racks" of 4090s for AI tasks is (frankly) flat-out stupid.

You seem to be trapped in the delusion that this was anyone's first, second, or third choice.

There is workload demand, you can't get H100s, and if you don't start racking up the cards you can get, the company will replace you with someone less opinionated.

michaelt
0 replies
7h0m

> The RTX 4090 is a ~three slot beast when using air cooling and needless to say rigging up something like the dual slot water cooled 4090s I have at scale is another challenge altogether... How are people going to wire this up? What do the enclosures/racks/etc look like?

A few years ago, if you wanted a lot of GPU power you would buy something like [1] - a 4/5U server with space for ten dual-slot PCIe x16 cards and quadruple power supplies for 2000W of fully redundant power. And not a PCIe riser in sight.

I share your scepticism about whether it's common to run >2 4090s because nvidia have indeed sought to make it difficult.

But if there was some sort of supply chain issue that meant you had to, and you had plenty of cash to make it happen? It could probably be done.

Some of the more value-oriented GPU cloud suppliers like RunPod offer servers with multiple 4090s and I assume those do something along these lines. With 21 slots in the backplane, you could probably fit 6 air-cooled three-slot GPUs, even if you weren't resorting to water cooling.

[1] https://www.supermicro.com/en/products/system/4U/4028/SYS-40...

TeMPOraL
0 replies
11h58m

There are apparently some 400 H100s sitting idle somewhere upthread. Yes, I'm having a hard time imagining how that's possible too.

moneywoes
6 replies
16h38m

Is no one else working on custom silicon?

fnordpiglet
4 replies
15h50m

The problem isn’t just developing your own processor. Nvidia has a huge stack of pretty cutting edge technology including a huge stack from mellanox, an enormous OSS tool chain around CUDA, etc, that people seeking to make comparable products have to overcome.

sangnoir
3 replies
13h50m

Are you suggesting Meta or Google - who stand to save billions - won't be able to get top performance from their custom chips because their tooling/hardware won't support CUDA?

fnordpiglet
2 replies
12h52m

No. I'm suggesting they won't because IP like the Mellanox treasure chest they acquired is ridiculously difficult to develop, and Nvidia has aggressively exploited it, along with their other already advanced IP in the space of their -core business-.

I understand, especially amongst Googlers, there's a belief there are no others smarter than a Googler. But it's simply not the case. Nvidia is excellent at their core competencies and business, which is making absurdly parallel compute platforms with absurdly powerful interconnects. I'm saying Google or Meta won't beat Nvidia at hardware. I'd also point to the fact that Nvidia's ability to raise capital is the best on earth now, so even money isn't a barrier.

The advantage CUDA gives is in the tool chains, libraries, research, and all that that tens of thousands of people are contributing to as part of their jobs, research, and hobbies. This is almost -more valuable- than getting top performance. Getting top techniques, top software, top everything by having everyone everywhere working to make the ecosystem of your stuff is invaluable. Google won’t have that. They will just have the hubris of googlers who believe they’re smarter.

I would also note that at this phase of a cycle in tech, trying to save billions takes your eye off the prize. Cost optimization comes much later, after the market has been fully explored, directions are clear, and diminishing returns on R&D kick in. Any company that doesn't recognize that is run by CPAs and deserves the ignominy they'll face.

sangnoir
0 replies
12h6m

The bar for success for Google and Meta is much lower than Nvidia's - at least for internal usage. Any dollar amount that Google saves on CapEx or OpEx by using custom silicon instead of buying Nvidia helps bring down the cost of revenue. They don't have to match Nvidia on raw performance, and can aim at being better on performance per watt or performance per dollar (TCO) for larger workloads; IIRC, Google is already doing this for some internal inference tasks.

> I would also note that at this phase of a cycle in tech trying to save billions takes your eye off the prize

Big Tech companies are conglomerate-ish and can multitask. The search engine folk aren't pushing stuff back onto the backlog to put out fires delaying chip tape-out, and I bet the respective CEOs aren't burning braincycles micromanaging silicon development either; directors 2-3 rungs below the C-suite can motivate and execute on such an undertaking. The answer to "I need a budget of $300M in order to save the company $5-15B over 3 years" is "How soon can you start?"

logicchains
0 replies
11h45m

> No. I'm suggesting they won't because IP like the Mellanox treasure chest they acquired is ridiculously difficult to develop, and Nvidia has aggressively exploited it, along with their other already advanced IP in the space of their -core business-.

For training Llama3 Facebook set up two clusters, one using fancy InfiniBand and one just using RoCE over Arista cards: https://engineering.fb.com/2024/03/12/data-center-engineerin... . The latter ended up doing fine, suggesting that all that Mellanox stuff isn't necessary for large-scale training (apparently at a large enough scale ethernet scales better than InfiniBand).

threeseed
0 replies
15h50m

Everyone is.

Apple, AWS, Google, Meta, Microsoft all have custom AI-centric silicon.

dinobones
5 replies
16h56m

Do you really think Google’s hardware expertise is better than Nvidia’s?

If needed these other companies have the $$$ to buy the best chips money can buy from Nvidia. Better chips than Google could ever produce.

If anything, this is why IMO Google will fail.

mike_d
2 replies
16h35m

Yes. Google was building custom HPC hardware 5-8 years before Nvidia decided to expand outside the consumer and "workstation" markets.

threeseed
1 replies
15h48m

Nvidia acquired Mellanox who know far more about custom HPC hardware than Google.

mike_d
0 replies
10h9m

Mellanox and friends not being able to build fast enough switching gear at a reasonable price point was what got Google into the hardware game in the first place ;)

candiddevmike
1 replies
16h36m

I thought NVIDIA's moat was mostly software/CUDA?

rapsey
0 replies
15h2m

They have by far the fastest chip, the best software with CUDA, and are feverishly working on next-gen chips. They also bought out multiple years' worth of global high-bandwidth memory manufacturing capacity.

No one will beat them at their game. However if there are any major breakthroughs that might render those processing capacities unneeded, or the major players hitting a wall regarding AI spending, then they will take a massive hit. It will come eventually because the chip business is always in boom/bust cycles.

r_hanz
2 replies
16h24m

Much the same way you can have all the best gear and still fail - Google’s primary strength seems to be the DeepMind group. I’m not affiliated with Google, but IMHO the reason they will slowly die is because their engineering culture has taken a backseat due to their broken hiring practices.

Bad hiring practices aren’t exclusive to them, but from all accounts it seems like their internal focus is on optimizing ad revenue over everything else. I could be wrong or misinformed, but it seems to me like they are playing the finite game in the AI space (DeepMind group aside) while FAIR are playing the infinite game.

Meanwhile, MSFT is simply trying to buy its way to relevance (e.g. OpenAI investments, etc.) and carve out future revenues (Recall), and Jobs-less Apple is building their trademark walled garden (Apple Intelligence?). Although the use of unified memory in Apple silicon poses some interesting possibilities for enabling the use of sizable models on consumer hardware.

Overall it seems like “big-tech” is by-and-large uninspired and asleep at the wheel save specific teams like those led by Lecun, Hassabis, etc. not sure where that leaves OpenAI now that Karpathy is gone.

VirusNewbie
1 replies
14h37m

> ...because their engineering culture has taken a backseat due to their broken hiring practices.

What company do you think has better hiring practices, and subsequently a higher talent pool? Meta's is pretty similar to Google's (though with an emphasis on speed over creativity). Microsoft is certainly worse at hiring than the two aforementioned...

r_hanz
0 replies
4h59m

To be fair, I don’t have any examples of “good practices” readily in-hand. However, I did try to address why I thought others were less impacted by this problem in the second half of my post.

matt-p
2 replies
16h34m

Can't agree. This is like saying $popularApp will fail because they buy expensive hosting at AWS.

Rubbish. They will fail because the product didn't fit the market; if they're successful, they'll have money to buy servers and colo, then drive down cost. If they succeed, it will be in large part because they spent their capital and, more importantly, time on code/engineers rather than servers.

Right now companies are searching for a use of AI that will add hundreds of billions to their market cap. Once they find that, they can make TPUs; right now only one thing matters: getting there first.

mike_d
1 replies
13h0m

> This is like saying $popularApp will fail because they buy expensive hosting at AWS.

For any given mobile app startup, AWS is effectively infinite. The more money you throw at it, the more doodads you get back. Nvidia's supply chain is not infinite and is the bottleneck for all the non-Google players to fight over.

KaiserPro
0 replies
8h52m

If you are training on AWS, it's not infinite. Worse still, you are bidding against other people.

loeg
2 replies
16h41m

The only thing that can stop Google is Google. Somehow every bet that isn't Search doesn't pan out. And inexplicably, they're working hard to kill Search now. As a shareholder, I hope they succeed. But I am more pessimistic about it than you.

yellow_postit
1 replies
14h51m

And they missed multiple waves of effectively building on their own in house research.

aworks
0 replies
2h10m

Reminds me of Xerox Parc vs. Apple. Building successful products is hard.

iamflimflam1
2 replies
12h58m

Don’t forget Apple’s Private Cloud Compute - built on top of Apple Silicon.

rapsey
1 replies
16h55m

Their work on folding and AI could very well be a business worth hundreds of billions, and they know it, as well as many other bets.

Whereas others are playing the LLM race to the bottom.

htrp
0 replies
15h18m

> Their work on folding and AI could very well be a business worth hundreds of billions, and they know it, as well as many other bets.

And we'll be writing case studies of how they squandered billions of R&D to help found other companies (kinda like Xerox PARC).

Almost every interesting paper after transformers has had its authors leave to commercialize their own companies.

houseplant
1 replies
13h36m

After everything I've seen and the litigation coming out of Europe, I really can't see AI lasting long after they're obligated to prove rights for the data they're training on.

They can't get away with having scraped people's owned work forever. You can't steal things from workers and then undercut them by selling that hard work for pennies, and not expect everything to collapse. I mean, I know that the folks in charge of this aren't really known for their foresight, especially when stock numbers and venture capital are the entire point, but... surely I hope people can recognize that this can't go on unimpeded.

logicchains
0 replies
11h43m

Eventually they're going to put vision LLMs in robotic bodies and they'll be able to learn just by listening and watching, just like humans, at which stage the idea that they're "stealing" just by viewing content will be seen as absurd.

coralreef
1 replies
17h1m

How do TPUs perform compared to GPUs on LLMs and image generation?

smarterclayton
0 replies
1h43m

Pretty well. Anthropic runs some of Claude inference on GKE and TPU v5e - this talk at the last GCP conference has some details:

https://youtu.be/b87I1plPeMg?si=T4XSFUzXG8BwpphR

Ecosystem support for GPU is very strong and so smaller teams might find TPUs to have a steeper learning curve. Sophisticated users can definitely get wins today on TPU if they can get capacity.

And it’s not as if Nvidia is standing still, so how the strengths of each change over future generations isn’t set either. I.e., TPUs are “simple” matrix multipliers also optimized for operation at scale, while GPUs are more general-purpose and have strong ecosystem power.

Disclaimer - work on GKE on enabling AI workloads.

checkyoursudo
1 replies
11h46m

What would any company as "the long term AI winner" look like? What would it mean to be the winner in this context?

ketchupdebugger
0 replies
3h37m

The winner is just Nvidia. I see this like the battle of gas vs. electric cars: Nvidia is basically making the wheels. Whichever company wins, you'd still need wheels.

zmmmmm
0 replies
15h25m

Custom silicon is fantastic when things have stabilised and you know exactly what you want. But things are still evolving fast in real time, and in that environment, whoever can move fastest to be ultra flexible and deploy the latest architecture as soon as possible is the winner. I think in a nutshell that is the story of Nvidia's success here: they created a GPGPU platform with just the right level of abstraction to capture the market for AI research.

throwaway920102
0 replies
16h32m

What can stop Google is building the wrong thing, or being so scared to launch anything that they smother their own fledgling products before they are born or before they can mature. Their product and finance and accounting teams should be tossed.

raincole
0 replies
14h57m

^ How to pack as many mistakes as possible into one single comment.

1. Google's stock didn't significantly outperform Meta, Microsoft, etc., in the past two years.

2. Meta and Microsoft are trying to make their own chips as well.

3. They're not using "consumer video cards" to train AI. I don't even know if you can call these beasts video cards any more. The H100 doesn't have an HDMI port.

lxgr
0 replies
15h43m

I wish I had your faith in Google’s ability to refrain from kneecapping their own perfectly fine product.

jejeyyy77
0 replies
1h30m

lolwat.

google is the biggest loser in all of this.

jatins
0 replies
12h34m

this was the same argument that was presented a decade ago on why Google was supposed to win the cloud because their internal infra was miles ahead of Amazon and Microsoft.

Yet here we are. Will the consumer video cards get cheaper and better faster or will Google's directors' infighting stop first?

htrp
0 replies
15h21m

> Literally the only thing that can stop Google now is the fact they keep bringing Microsoft and Oracle flunkies into leadership positions.

You mean the thing that's already stopped them? If they had seriously invested into the TPU ecosystem in 2015, they would already have "won" AI.

hooloovoo_zoo
0 replies
16h57m

Google has been working on TPUs and Tensorflow for a decade with pretty mixed success; I don't think it's clear that they're going to win.

hipadev23
0 replies
14h39m

Google’s on their 6th generation and still can’t find anyone to use it. Hmm.

guardiang
0 replies
14h2m

Google is old like MetLife, relative to their respective industries. Both are carrying too much baggage and are top-heavy. As a result, I personally don't think Google will be able to keep pace with OpenAI in the long run.

fragmede
0 replies
15h16m

smart money has a diversified portfolio and isn't betting on any one winner and has invested in all of them, and then some.

callalex
0 replies
15h34m

That’s how all huge tech companies become dinosaurs though. Upper management that is already stupidly wealthy (and therefore unmotivated) have the funding and patience to hire geniuses to build incredible machines, and then constantly tie their shoelaces together while asking them to sprint. Examples include Microsoft and Oracle as you said, and before them IBM, AT&T, TIBCO, Marvell, Motorola; I could go on for a while…

blackoil
0 replies
15h48m

I would call it "stupid" money. This isn't a commodity business. Value of the final product is orthogonal to amount invested in compute. If Google is 10% slower or its product is 10% worse, it can lose all the value. This is like valuing a software company higher because its devs are using cheap PC desktops instead of Mac.

asynchronous
0 replies
15h6m

Laughable response when you actually look at the quality of the algorithms being produced by Google. They’re so behind it’s embarrassing.

ai4ever
0 replies
16h50m

Here is an older take on this same topic:

https://www.yitay.net/blog/training-great-llms-entirely-from...

GPU vs. TPU, and good software managing large clusters of them across all sorts of failures.

The funny bit from the above article is the incident when someone forgot about a training job at Google, and a month later had the model fully trained without an alert of any kind. "Outrageously good infra."

KaiserPro
0 replies
8h53m

> Meta, Microsoft, OpenAI, etc. are trying to address problems with consumer video cards

Yes, but you are buying access to tested, supported units that are proven to work, don't require custom software, and are almost plug-and-go. When it's time to upgrade, it's not that costly.

Designing, fabricating, and deploying your own silicon is expensive; creating software support for it, more expense still. Then there is the opportunity cost of having to optimise the software stack yourself.

You're exchanging a large capex for a similar-sized capex plus a fuckton of opex as well.

Jabrov
0 replies
16h21m

"Consumer video cards"? Meta's not building their clusters out of 3090s.

They're using advanced cards meant for data centers and machine learning -- almost effectively "custom silicon"

HarHarVeryFunny
0 replies
16h35m

The chips are somewhat irrelevant. It's the overall system architecture, management, and fault recovery that matters.

Der_Einzige
0 replies
13h32m

Until the lion's share of AI projects support that custom silicon, I will continue to bet on anyone buying Nvidia GPUs.

samspenc
25 replies
15h45m

OK this was a bit funny:

  Top HW failure modes:
  * GPU falling off the bus

I honestly thought "do they mean GPUs falling off a bus entering the data center?" and then realized it's actually the connectivity, as they mention in the next line:

  GPUs falling off: In this case, GPUs are not detected by the host on PCIe.

redbell
8 replies
9h20m

> GPU falling off the bus

I'm wondering if we could prompt llama3 with the above statement. What kind of response would it give?

TeMPOraL
7 replies
8h39m

With temperature set to 1, it recognizes the joke, but proceeds to explain what the "bus" is in computer terms, picks a problem this prompt could mean, and explains how to solve it. In ~20 tries it always gave me something along the lines of:

  The infamous "GPU falling off the bus" issue!

  This problem typically occurs when a graphics processing unit (GPU) is not properly seated or connected to its expansion slot, such as PCIe, on a motherboard.

  Here are some troubleshooting steps to help resolve the issue:

  (numbered list of steps or options follows)

Tested on Llama 3 Instruct 7B Q8_0, because that one fits entirely on my GPU.

redbell
6 replies
7h41m

+1, interesting findings! I like how it was able to infer the meaning from such a short phrase in a limited context.

burkaman
4 replies
4h59m

It's actually a very common phrase on forums, I think because it's an actual error that Linux will report: https://askubuntu.com/questions/868321/gpu-has-fallen-off-th.... I'd also never heard of it, but it seems like it must appear a lot in the training data, and probably about 0 times is it referring to a bus on the road.

TeMPOraL
3 replies
4h52m

In my testing, both Llama 3 and its abliterated (uncensored) variant from [0] almost always remarked more or less directly that they saw the joke in the phrase, so either they've seen the other meaning in training, or inferred it.

--

[0] - https://news.ycombinator.com/item?id=40665721

Technetium
1 replies
3h44m

Please use the word ablated instead. That article's title is not using a real word. I'm assuming it's the author's English issue, since they called the model "helpfull" instead of "helpful".

TeMPOraL
0 replies
2h50m

Oops. I actually originally wrote "ablated", then changed it to be consistent with the title.

burkaman
0 replies
4h25m

Oh I agree it probably inferred the joke. I was actually more surprised that it knew the real meaning of the phrase because I as a human did not, until I looked it up and saw how common it is.

TeMPOraL
0 replies
7h34m

To be specific, the system prompt used was (default in LM Studio config for Llama 3 V2):

  You are a helpful, smart, kind, and efficient AI assistant. You always fulfill the user's requests to the best of your ability.

And then the query was:

  GPU falling off the bus

And yes, I imagine it read that query as ending with an implied "pls help!".
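
If anyone wants to poke at this themselves, here's a minimal sketch of the kind of call involved, assuming an OpenAI-compatible local endpoint like the ones LM Studio or llama.cpp's server expose; the base_url, placeholder api_key, and model name are assumptions you'd swap for your own setup:

  # Sketch: send the same system prompt and query to a locally served Llama 3
  # through an OpenAI-compatible endpoint. base_url, api_key and model name
  # are placeholders for whatever your local server actually uses.
  from openai import OpenAI

  client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

  resp = client.chat.completions.create(
      model="llama-3-8b-instruct-q8_0",  # hypothetical local model id
      temperature=1.0,
      messages=[
          {"role": "system", "content": "You are a helpful, smart, kind, and "
           "efficient AI assistant. You always fulfill the user's requests to "
           "the best of your ability."},
          {"role": "user", "content": "GPU falling off the bus"},
      ],
  )
  print(resp.choices[0].message.content)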

sva_
2 replies
8h36m

Back when EVGA was still selling GPUs ...

taneq
0 replies
6h38m

They did it to finance their street racing habit, I'm sure. :P

TiredOfLife
0 replies
8h22m

And offering warranties. And not doing a stealth total component change under the same SKU.

martin-adams
2 replies
12h32m

A GPU falling off the bus would be one mega flop

ardit33
0 replies
4h5m

Haha… It is a ‘Tera Flop’, as they are falling to the ground…

TeMPOraL
0 replies
12h24m

The audience in the back goes clap clap clap, chapeau bas.

whazor
1 replies
11h23m

I was imagining that some sys admin has to walk to the server, take out the GPU, blow against the PCI-E pins like a game cartridge, and put it back to try again.

iszomer
0 replies
7h45m

More to do with bent pins, material obstruction, or something as trivial as cable management (e.g. bundles of QSFP cables weighing down ports that are press-fitted, not soldered).

cachvico
1 replies
15h6m

Brings a whole new meaning to bus factor

throwup238
0 replies
13h34m

I’ve never met a GPU that could survive getting hit by a bus.

ChuckMcM
1 replies
13h28m

The bits on the bus go round and round!

There is a lot of interesting yet unpublished work on 'data center' scale compute complexes. It was a rabbit hole I fell into several times while at Google.

spmurrayzzz
0 replies
3h19m

Speaking for myself (and I guess anyone else dealing with PCIe riser hell in on-prem deep learning setups), it's nice to see the massive orgs dealing with pretty much the same exact pain points as not-so-massive orgs.

rfoo
0 replies
9h22m

"GPU has fallen off the bus" is an actual error message nvidia.ko prints to dmesg in this case :p

krowfromthewall
0 replies
6h3m

like they did to our dear Anton in Silicon Valley

kdot
8 replies
16h1m

How will Meta leverage LLMs at scale to drive revenue? It's not clear.

sangnoir
1 replies
13h40m

If only Meta had a way to monetize engagement with generative AI content in a way that scales with quantity of generated content.

tucnak
0 replies
12h55m

Revolutionary!

HDThoreaun
1 replies
12h16m

I think the nearish-term plan is chatbot customer assistants on WhatsApp. Doesn't seem like they're that close to releasing them, but who knows.

altdataseller
0 replies
10h19m

At the end of the day, everything eventually comes down to chatbots

threeseed
0 replies
15h8m

I still believe that a VR future is coming once the technology commoditises, i.e. costs come down 10x and we have 3090-level GPUs in the headset. At that point we will have photo-realistic experiences, like concerts, that anyone can afford.

And at that point having a lot of LLM based avatars that can help "fill in the space" will be valuable.

dweekly
0 replies
15h23m

Ask the AI to come up with a business model.

OsrsNeedsf2P
0 replies
12h20m

LLMs aren't a monetizable product themselves. For the foreseeable future, that will always be ads. LLMs (and VR) are just big bets on getting ahead of future technology.

123yawaworht456
0 replies
8h29m

1. improving their adtech. someone else's API offerings are not an option due to the sheer volume, PII and whatnot.

2. virtually free moderation for their existing (facebook, instagram, threads) and future social media services. likewise, their volume is too insane to even consider paying someone else to process it.

the models they do release are probably toys in comparison to their internal models.

Oras
8 replies
15h0m

Would be nice to read how they collect/prepare data for training.

Which data sources? How much of Meta users' data (FB, Instagram, etc.)? How do they sanitize PII?

OsrsNeedsf2P
4 replies
12h24m

> How do they sanitize PII?

I can't comment on how things like faces get used, but in my experience, PII at Meta is inaccessible by default.

Unless you're impersonating a user on the platform (to access what PII they can see), you have to request special access for logs or database columns that contain so much as user IDs, otherwise the data simply won't show up when you query for it. This is baked into the infrastructure layer, so I doubt the GenAI teams are using something else.

actionfromafar
2 replies
10h37m

For a convenient definition of PII. Isn’t everything a user does in aggregate PII?

robertlagrant
0 replies
9h20m

I don't think it's PII. If you had someone's movements, you could go and spy on them, find out who they were (i.e. their PII) and then link that back and say "I now know this identified person's movements". I don't think the movements themselves are PII.

Things that aren't PII aren't "convenient" definitions. Doesn't mean everything that isn't PII is fine to share. It's like saying a kidnapping isn't a murder. That's not a convenient definition of murder; it's just a different thing. We shouldn't start talking like witch hunters as soon as we encounter a situation that we haven't memorised a reasonable response to. We should be able to respond reasonably to new situations.

KaiserPro
0 replies
8h58m

PII is pretty intuitive to define.

Obvious examples: data that easily identifies a person (photo, name, number, UUID, etc.)

That's trivial to block. Where it gets harder is stuff that on its own isn't PII, but combined with another source, would be.

For example, aggregating public comments on a celeb's post (i.e. stripping out usernames and likes and assigning a new UUID to each person). For a single post, that's good enough. You're very unlikely to be able to identify a single person.

But over multiple posts, that's where it gets tricky.

As with large companies, the process for getting permission to use that kind of data is rightly difficult, so it often doesn't get used like that.

sonofaragorn
0 replies
30m

What about a post or comment that includes proper names?

michaelt
0 replies
8h46m

Then, you should check out papers like https://arxiv.org/abs/2302.13971 and https://arxiv.org/abs/2307.09288

In the paper covering the original Llama they explicitly list their data sources in table 1 - including saying that they pretrained on the somewhat controversial books3 dataset.

The paper for Llama 2 also explicitly says they don't take data from Meta's products and services; and that they filter out data from sites known to contain a lot of PII. Although it is more coy about precisely what data sources they used, like many such papers are.

discobot
0 replies
9h41m

They explicitly train models only on public datasets

radarsat1
3 replies
11h45m

Frustratingly little information. For example, I'm exceedingly curious how they deal with scheduling jobs on such a huge array of machines. The article:

> Efficient scheduling helps ensure that our resources are used optimally. This involves sophisticated algorithms that can allocate resources based on the needs of different jobs and dynamic scheduling to adapt to changing workloads.

Wow thanks for that, captain obvious. So how do you do it?

p4ul
2 replies
3h37m

I usually assume these companies are using some of the popular schedulers (e.g., Slurm, MOAB, SGE) that have existed in the HPC community for many years.

I have anecdotally also heard that some are using k8s, but I've not seen that myself. Slurm [1] is basically built for this stuff; that's definitely what I would use!

[1] https://slurm.schedmd.com/documentation.html
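
To make the Slurm route concrete, here's a minimal sketch (not anything Meta has described) of how a Slurm-scheduled multi-node PyTorch job usually bootstraps itself: srun launches one process per GPU, and each process reads its rank from Slurm's environment variables. MASTER_ADDR/MASTER_PORT are assumed to be exported by the sbatch script.

  # Minimal sketch of Slurm -> torch.distributed wiring; assumes srun launches
  # one task per GPU and the sbatch script exports MASTER_ADDR/MASTER_PORT
  # (e.g. taken from the first host in $SLURM_JOB_NODELIST).
  import os
  import torch
  import torch.distributed as dist

  def init_from_slurm():
      rank = int(os.environ["SLURM_PROCID"])         # global rank across all nodes
      world_size = int(os.environ["SLURM_NTASKS"])   # total number of processes
      local_rank = int(os.environ["SLURM_LOCALID"])  # rank within this node
      dist.init_process_group(backend="nccl", init_method="env://",
                              rank=rank, world_size=world_size)
      torch.cuda.set_device(local_rank)              # bind this process to its GPU
      return rank, world_size, local_rank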

claytonjy
1 replies
1h34m

Slurm is definitely still dominant, but OpenAI has been using k8s for training for many years now¹, and there are various ways to run Slurm on top of Kubernetes, including the recent SUNK from CoreWeave².

At my company we use Slurm "directly" for static compute we rent or own (i.e. not in a public cloud), but are considering using Kubernetes because that's how we run the rest of the company, and we'd rather invest more effort into being better at k8s than becoming good Slurm admins.

¹: https://openai.com/index/scaling-kubernetes-to-2500-nodes/

²: https://www.coreweave.com/blog/sunk-slurm-on-kubernetes-impl...

p4ul
0 replies
1h30m

Very cool! Thanks for this, claytonjy!!

idkdotcom
3 replies
15h41m

These seem like classic challenges of running distributed-systems loads that are not specific to training LLMs.

Any one of the supercomputers listed here https://en.wikipedia.org/wiki/TOP500 suffers from the same issues.

Think about it. While the national labs use these systems to model serious stuff -such as climate or nuclear weapons- Meta uses them to train LLMs. What a joke, honestly!

whiplash451
0 replies
11h58m

A lot of serious things look like a toy or a joke at first.

mhandley
0 replies
7h32m

On the other hand, Meta just rapidly built two different training networks in existing datacenter buildings, with existing cooling constraints, using mostly commodity components (albeit expensive commodity components) each of which would place at #3 on that top500 list in terms of GPU power. Compare that with how long it took to get any of the other supercomputers from design to being fully commissioned.

_zoltan_
0 replies
1h27m

For-profit work is not less serious than what research labs do. I'd even say it's more important: it drives the economy.

yosito
2 replies
15h22m

I wish that instead of just training another stupid LLM, Meta would use it to improve their search and help me find the content I'm actually interested in.

TeMPOraL
1 replies
12h35m

Their revenue depends on it being hard (but not impossible) for you to find the content you're actually interested in. Would be nice if it didn't, but in this reality, money on the Internet is made by wasting users' lives. That is what attention economy is about.

rmbyrro
0 replies
1h6m

It's actually a mix. They need to disappoint the user for the right amount of time, and then please them at the right moment and in the right dose. This maximizes the dopamine release and increases addictiveness.

When you find good content depends on when the algo judges you're already primed for a colorful dopamine intake.

xvector
2 replies
15h37m

> So we decided to build both: two 24k clusters, one with RoCE and another with InfiniBand. Our intent was to build and learn from the operational experience.

I love how they built two completely insane clusters just to learn. That's badass.

riku_iki
0 replies
15h33m

More like Mark gave them 100k GPUs, and they are not sure what exactly to do with them..

logicchains
0 replies
11h37m

It's not just to learn; an RoCE Ethernet cluster with Aristas is way cheaper to build and maintain than a fancy InfiniBand cluster with Mellanox/Nvidia networking, so proving that the former is good enough at scale will eventually save Meta a huge amount of money. InfiniBand cards are much more expensive than Ethernet because there are few vendors, which have a quasi-monopoly, and because overall far fewer of them are produced, so there's less economy of scale.

whalesalad
2 replies
17h7m

interesting that their domain is still engineering.fb.com

samspenc
1 replies
15h48m

I think fb.com is their internal domain and they never really bothered to change it. Employees used to have an @fb.com e-mail; at least this was true a few years ago, not sure if that has changed.

dnissley
0 replies
15h31m

Emails switched to @meta.com in 2022.

lokimedes
2 replies
12h37m

Yikes, the little InfiniBand+A100 cluster I installed for my previous company seemed useful at the time (12 GPUs), and that was at a cost of around $300k. With LLMs it feels like game over for non-cloud applications if you are not a mega-corp.

lannisterstark
1 replies
12h21m

Well, yes, but not all models need to be "super large." Smaller models, specialized in specific tasks, working together - and then reporting to a slightly larger model - is the way to go.

Think of everything being connected to a "Home Computer" in those "Future House of 2020" videos that were out there in the 70s or whatnot.

Another example (very rough) would be something like "Weather data gets to a small model via an API, the model looks at it, updates the home dashboard, also sees if there are any alerts, and if so, adds x or y to the home dashboard appropriately as to what it thinks best."

We can probably achieve the latter example today (without any significant 'coding' on anyone's part except the API owner). A rough sketch is below.
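
Something along these lines - a sketch only, where the weather endpoint, dashboard URL, and model name are hypothetical placeholders and the "small model" is whatever you happen to serve locally:

  # Rough sketch of the idea above: pull structured weather data, let a small
  # locally served model decide whether anything is alert-worthy, and push its
  # one-line summary to a (hypothetical) home dashboard.
  import requests
  from openai import OpenAI

  weather = requests.get("http://weather-api.example/today").json()  # hypothetical API

  client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")
  summary = client.chat.completions.create(
      model="small-local-model",  # placeholder for any small instruct model
      messages=[
          {"role": "system", "content": "You summarize weather data for a home "
           "dashboard. Flag anything alert-worthy in one short line."},
          {"role": "user", "content": str(weather)},
      ],
  ).choices[0].message.content

  # hypothetical dashboard endpoint
  requests.post("http://home-dashboard.local/widgets/weather", json={"text": summary})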

afro88
0 replies
7h4m

> Well, yes, but not all models need to be "super large." Smaller models, specialized in specific tasks, working together - and then reporting to a slightly larger model - is the way to go.

I want to believe, but I'm still yet to see this kind of set up being anywhere near GPT-4 level.

The weather example seems quite contrived. Why not just display the alerts for your area? Why is a complex system of smaller models reporting up to a slightly larger model necessary?

jauntywundrkind
1 replies
17h13m

Random q, I wonder if gloo is used in these systems? https://github.com/facebookincubator/gloo

RDMA and GPUDirect capable. Coordinates over MPI or (hi)redis.
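
For context, the way most people touch gloo in practice is as a torch.distributed backend for CPU-side collectives; a tiny sketch follows (whether these clusters use gloo directly isn't stated in the article):

  # Tiny sketch: gloo as a torch.distributed backend. Assumes RANK, WORLD_SIZE,
  # MASTER_ADDR and MASTER_PORT are already set in the environment.
  import torch
  import torch.distributed as dist

  dist.init_process_group(backend="gloo", init_method="env://")
  t = torch.ones(4)
  dist.all_reduce(t)  # element-wise sum across all ranks, over gloo
  print(t)            # each element now equals the world size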

runeblaze
0 replies
16h11m

IIRC gloo is CPU tensors only, so likely not (struck through; see edit below)

Edit: I had a brain freeze or something... gloo is not CPU-only, but for whatever reason I don't see it outside of CPU comms

dudus
1 replies
15h18m

> Since we did not have time to change the cooling infrastructure, we had to remain in an air-cooled environment. The mechanical and thermal designs had to change to accommodate this, and that triggered a validation cycle to support a large-scale deployment.
>
> All of these hardware-related changes were challenging because we had to find a solution that fit within the existing resource constraints, with a very small degree of freedom to change and meet a tight schedule.

Seems like the time constraints put on the team impacted the overall quality of the model.

vessenes
0 replies
5h28m

This sounds to me like standard CYA / perhaps good-natured complaining from the tech team.

The last tech team to have no budget and time constraints to pursue their vision? I don’t know, the Xanadu team? Romero’s original Daikatana team?

koolala
0 replies
15h8m

what a title...