Calculating the cost of a Google DeepMind paper

arcade79
39 replies
3d7h

A lot of misunderstandings among the commenters here.

From the link: "the total compute cost it would take to replicate the paper"

It's not Google's cost. Google's cost is of course entirely different. It's the cost for the author if he were to rent the resources to replicate the paper.

For Google, all of it is running at a "best effort" resource tier, grabbing available resources when they're not requested by higher-priority jobs. It's effectively free resources (except electricity consumption). If any "more important" jobs with a higher priority come in and ask for the resources, the paper-writers' jobs will just be preempted.

bombcar
21 replies
3d7h

This is the side effect of underutilized capital and it’s present in many cases.

For example, if YOU want to rent a backhoe to do some yard rearrangement it’s going to cost you.

But Bob, who owns BackHoesInc, has them sitting around all the time when they're not being rented or used; he can rearrange his yard wholesale, almost for free.

thaumasiotes
18 replies
3d6h

> This is the side effect of underutilized capital and it's present in many cases.

"Underutilized" isn't the right word here. There's some value in putting your capital to productive use. But, once immediate needs are satisfied, there's more value in having the capital available to address future needs quickly than there would be in making sure that everything necessary to address those future needs is tied up in low-value work. Option value is real value; being prepared for unforeseen but urgent circumstances is a real use.

efitz
7 replies
3d5h

I think a better description than “underutilized” would be “sunk capex cost” - Google (or any cloud provider) cannot run at 100% customer utilization because then they could neither acquire new customers nor service transitory usage spikes for existing customers. So they stay ahead of predicted demand, which means that they will almost always have excess capacity available.

Cloud providers pay capital costs (CapEx) for servers, GPUs, data centers, employees, etc. Utilization allows them to recoup those costs faster.

Cloud customers pay operational expenses (OpEx) for usage.

So Google generally has excess capacity, and while they would prefer revenue-generating customer usage, they’ve already paid for everything but the electricity, so it’s extremely cheap for them to run their own jobs if the hardware would otherwise be sitting idle.

immibis
5 replies
3d5h

There is also a mathematical relationship in queuing theory between utilization and average queue length, which all programmers should be told: https://blog.danslimmon.com/2016/08/26/the-most-important-th...

As you run close to 100% utilization, you also run close to infinite waiting times. You don't want that. It might be acceptable for your internal projects (the actual waiting time won't be infinite, and you'll cancel them if it gets too long) but it's certainly not acceptable for customers.
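
To make that concrete, here is a minimal sketch assuming the textbook M/M/1 model (random arrivals, a single server); the formula is standard queueing theory, not something taken from the linked post:

    # Average wait in queue for an M/M/1 system, in multiples of the service time:
    # W_q = rho / (1 - rho), where rho is utilization.
    def avg_wait_in_service_times(rho: float) -> float:
        return rho / (1.0 - rho)

    for rho in (0.5, 0.8, 0.9, 0.95, 0.99):
        print(f"utilization {rho:.0%}: wait ~ {avg_wait_in_service_times(rho):.0f}x service time")

    # 50% -> 1x, 90% -> 9x, 99% -> 99x: waits blow up as utilization approaches 100%.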

thaumasiotes
2 replies
3d4h

There is a genre of game called "time management games" which will hammer this point home if you play them. They're not really considered 'serious' games, so you can find them in places where the audience is basically looking to kill time.

https://www.bigfishgames.com/us/en/games/5941/roads-of-rome/...

The structure of a time management game is:

1. There's a bunch of stuff to do on the map.

2. You have a small number of workers.

3. The way a task gets done is, you click on it, and the next time a worker is available, the worker will start on that task, which occupies the worker for some fixed amount of time until the task is complete.

4. Some tasks can't be queued until you meet a requirement such as completing a predecessor task or having enough resources to pay the costs of the task.

You will learn immediately that having a long queue means flailing helplessly while your workers ignore hair-on-fire urgent tasks in favor of completely unimportant ones that you clicked on while everything seemed relaxed. It's far more important that you have the ability to respond to a change in circumstances than to have all of your workers occupied at all times.

bombcar
1 replies
3d

> You will learn immediately that having a long queue means flailing helplessly while your workers ignore hair-on-fire urgent tasks in favor of completely unimportant ones that you clicked on while everything seemed relaxed.

Ah, sounds like Dwarf Fortress!

immibis
0 replies
2d23h

I was thinking Oxygen Not Included.

efitz
0 replies
2d23h

TL;DR: You should think of and use queues like shock absorbers, not sinks. Also, you need to monitor them.

Queues are useful for decoupling the output of one process from the input of another when the two processes are not synchronized velocity-wise. Like a shock absorber, they allow both processes to continue at their own paces, and the queue absorbs instantaneous spikes in producer load above the steady-state rate of the consumer (side note: if queues are isolated code- and storage-wise from the consumer process, then you can use the queue to prevent disruption in the producer process when you need to take the consumer down for maintenance or whatever).

Running with very small queue lengths is generally fine and generally healthy.

If a queue consistently runs with a substantial backlog, then you have a mismatch between the workloads of the processes it connects - you either need to reduce the load from the producer or increase the throughput of the consumer of the queue.

Very large queues tend to hide the workload mismatch problem, or worse. Often work put into queues is not stored locally on the producer, or is quickly overwritten. So a consumer-end problem can result in the irrevocable loss of everything in the queue, and the larger the queue, the bigger the loss. Another problem with large queues is that if your consumer process is only slightly faster than the producer process, then a large backlog of work in the queue can take a long time to work down, and it's even possible (admission of guilt) to configure systems using such queues such that they cannot recover from a lengthy outage, even if all the work items were stored in the queue.

If you have queues, you need to monitor your queue lengths and alarm when queue lengths start increasing significantly above baseline.
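
As a minimal sketch of that last point, assuming a hypothetical get_queue_depth() metric source (nothing here is from any specific monitoring product):

    # Alarm when queue depth rises well above its recent baseline.
    import statistics, time

    def watch_queue(get_queue_depth, window=60, factor=3.0, min_alarm=100, interval_s=10):
        history = []
        while True:
            depth = get_queue_depth()
            history.append(depth)
            if len(history) > window:
                history.pop(0)
                baseline = statistics.median(history)
                if depth > max(factor * baseline, min_alarm):
                    print(f"ALARM: queue depth {depth} vs baseline ~{baseline:.0f}")
            time.sleep(interval_s)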

dekhn
0 replies
3d

In practice it's more complicated than this: Borg isn't actually a queue, it's a priority-based system with preemption, although people layered queue systems on top. Further, granularity mattered a lot: you could get much more access to compute by asking for smaller slices (fractions of a CPU core, or a fraction of a whole TPU cluster). There was a lot of "empty crack filling" at Google.
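
For readers who haven't seen a priority-plus-preemption scheduler, here is a toy sketch of the general idea (an arriving high-priority job evicts lower-priority ones until it fits); this is an illustration only, not how Borg actually works:

    from dataclasses import dataclass

    @dataclass
    class Job:
        name: str
        priority: int  # higher number = more important
        cores: int

    class Cluster:
        def __init__(self, total_cores):
            self.total_cores = total_cores
            self.running = []

        def used(self):
            return sum(j.cores for j in self.running)

        def submit(self, job):
            # Evict the lowest-priority running jobs first until the new job fits.
            victims = sorted((j for j in self.running if j.priority < job.priority),
                             key=lambda j: j.priority)
            while self.used() + job.cores > self.total_cores and victims:
                evicted = victims.pop(0)
                self.running.remove(evicted)
                print(f"preempted {evicted.name}")
            if self.used() + job.cores <= self.total_cores:
                self.running.append(job)
                return True
            return False  # could not fit even after preempting everything below it

    c = Cluster(total_cores=10)
    c.submit(Job("best-effort-paper-run", priority=0, cores=8))
    c.submit(Job("prod-serving", priority=100, cores=6))  # evicts the best-effort job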

bbarnett
0 replies
3d5h

I doubt they are doing this, but if they did burn-in tests with 3 machines doing identical workloads, they could validate workloads but also test new infra. Unlike customer workloads, it would be OK to retry due to error.

This would be 100% free, as all electricity and "wear and tear" would be required anyhow.

nathancahill
5 replies
3d6h

Same effect when leasing companies let office space sit unoccupied for years on end. The future value is higher than the marginal value of reducing the price to fill it with a tenant.

Bjartr
3 replies
3d6h

That may be part of it for properties left unleased for years, but I believe it's not the only part.

I believe the larger factor, and someone correct me if they have a better understanding of this, is that for commercially rented properties the valuation used to determine the mortgage terms you get takes into account what you claim to be able to get from rent. Renting for less than that reduces the valuation and can put you upside down on the mortgage. But the bank will let you defer mortgage payments, effectively moving each deferred month from now to after the last month of the mortgage term, extending the time they earn interest for.

So if no one wants to lease the space at that price after a prior lessee leaves for whatever reason, it's financially better for the property owner to leave the space vacant, sometimes for years, until someone willing to pay that price comes along, than to lower the rent and get a tenant.

khafra
0 replies
3d5h

Land Value Tax would fix this.

bombcar
0 replies
3d5h

This is mostly correct. People assume commercial loan terms are like single-family home loans "but larger", but they're not. They basically are all custom financial deals with multiple banks and may cover multiple properties. As long as total vacancy stays below a cutoff the banks will be happy, but lowering rents "just to get a tenant" can harm the valuation and trigger terms.

Part of the reason things like Halloween superstores can pop up is that the terms often exclude "short term leases", which are under six months.

Also, when you're leasing to companies, they are VERY quick to jump at lower prices if available, which means that if you drop the rate for one tenant, the others are sure to demand the same, sometimes even before their lease terms are up.

bbarnett
0 replies
3d5h

Many cities only tax leased property, or have very low rates on unleased property.

unyttigfjelltol
0 replies
3d6h

Real estate is a playground for irrationally hopeful or stubborn participants.

axus
1 replies
3d5h

I'm going to say this the next time I argue I need my servers online 24/7.

thaumasiotes
0 replies
3d4h

I'm not really sure I'm following you.

franga2000
0 replies
3d6h

In the case of compute, you can evict low-priority jobs nearly instantly, so the compute capacity running spot instances and internal side projects is just as available for unexpected bursts as it would be if sitting idle.

bombcar
0 replies
3d6h

Yeah, airlines make "more return on capital" by faster turn-around of planes to a point - if they are utilizing their airframes above 80 or 90 or whatever percent, the airline itself becomes extremely fragile and unable to handle incidents that impact timing.

We saw the same thing with JIT manufacturing during Covid.

mikepurvis
1 replies
3d5h

Car lots with attached garages are like this too. That brake and suspension work they were going to charge you several thousand dollars for? Once you trade in ol' Bessie they'll do that for pennies on the dollar during slack time; it doesn't hurt them if the car sits around for a few weeks or months before being ready for sale.

WarOnPrivacy
0 replies
3d5h

> Car lots with attached garages are like this too.

This was my first job after moving into this state. Between my labor and parts, it was about 15% of the sale price.

My most interesting repair was a 1943 Cadillac, a 'war car'.

punnerud
5 replies
3d7h

Can others also buy the “best effort” tier?

If the job could easily run for weeks, even when you could buy your way to finishing it in a day.

Then have bidding on this "best effort" resource, where they factor in the electricity price at any given time.

curt15
2 replies
3d6h

Is the "best effort" tier similar to AWS spot instances?

WJW
1 replies
3d6h

At every cloud provider there's probably a tier below "spot" (or whatever the equivalent is called at AWS's competitors) that is used for the low-priority jobs of the cloud provider itself.

jeffbee
0 replies
3d1h

You can speculate about this or you can look at how Google's internal workloads actually run, because they have released a large and detailed set of traces from Borg. They're really open about this.

https://github.com/google/cluster-data

v3ss0n
0 replies
3d6h

Sure: land a job there, work your way all the way up against the corporate BS and toxicity, and you can get the best-effort tier.

That effort needs to be added to the cost calculation too.

mrazomor
3 replies
3d7h

This assumes the common resources (CPU, RAM, etc.), not the ones required for LLM training (GPU, TPU, etc.). It's a different economy.

TL;DR: It's not ~free.

akutlay
2 replies
3d6h

Why does GPU matter? Do you think GCP keeps GPU utilization at 100% at all times?

mrazomor
0 replies
3d4h

What the OP is referring to requires overprovisioning for the high-priority traffic and a sine-like utilization pattern (without it, the benefit of the "batch" tier is close to zero -- the preemption rate is too high for any meaningful work when you are close to the top of the utilization hill).

You get that organically when you are serving lots of users. And there aren't many GPUs etc. used for that. Training LLMs gives you a different utilization pattern. The "best effort" resources aren't as useful in that setup.

bbminner
0 replies
3d4h

Because accelerators (TPUs, GPUs), unlike RAM/CPU, are notoriously hard to timeshare and virtualize. So if you get evicted in an environment like that, you have to reload your entire experiment state from a model checkpoint. With giant models like these, that might take dozens of minutes. As a result, I doubt that these experiments are done using "spare" resources - in that case, constant interruptions and reloading would result in these experiments finishing sometime around the heat death of the universe :)

imtringued
2 replies
3d5h

According to neoclassical economists this is impossible since you can easily and instantaneously scale infrastructure up and down continuously at no cost and the future is known so demand can be predicted reliably.

The problem with neoclassical economics is that it doesn't concern itself with the physical counterpart of liquidity. It is assumed that the physical world is just as liquid as the monetary world.

The "liquidity mismatch" between money and physical capital must be bridged through overprovisioning on the physical side. If you want the option to choose among n different products, but only choose m products, then the n - m unsold products must be priced into the m bought products. If you can repurpose the unsold products, then you make a profit or you can lower costs for the buyer of the m products.

I would even go as far as to say that the production of liquidity is probably the driving force of the economy, because it means we don't have to do complicated central planning and instead use simple regression models.

jopsen
1 replies
3d4h

> I would even go as far as to say that the production of liquidity is probably the driving force of the economy.

Isn't that what all high-frequency traders would say? :)

Perhaps there is some limit at which additional liquidity doesn't offer much value?

marcosdumay
0 replies
3d2h

I think you completely misunderstood the GP.

There isn't much there about stock markets.

rldjbpin
0 replies
2d10h

If this is the way they pull it off consistently, it might be a good business model for those working on research, like Stability, to also moonlight as a GPU cloud service.

It is a hustle only for the near future while this bubble lasts, but it can help reduce costs.

huijzer
0 replies
3d5h

Still, don’t get high on your own supply.

dweekly
0 replies
3d5h

Possible corollary: it may be difficult to regularly turn out highly compute-dependent research if you're paying full retail rack rates for your hardware (i.e. using someone else's cloud).

152334H
0 replies
3d4h

Is it free-priority based?

I was told by an employee that GDM internally has a credits system for TPU allocation, with which researchers have to budget out their compute usage. I may have completely misunderstood what they were describing, though.

BartjeD
31 replies
3d7h

If this ran on Google's own cloud it amounts to internal bookkeeping. The only cost is then the electricity and used capacity. Not consumer pricing. So negligible.

It is rather unfortunate that this sort of paper is hard to reproduce.

That is a BIG downside, because it makes the result unreliable. They invested effort and money in getting an unreliable result. But perhaps other research will corroborate. Or it may give them an edge in their business, for a while.

They chose to publish. So they are interested in seeing it reproduced or improved upon.

rrr_oh_man
7 replies
3d6h

> They chose to publish. So they are interested in seeing it reproduced or improved upon.

Call me cynical, but this is not what I've experienced to be the #1 reason for publishing AI papers.

ash-ali
3 replies
3d4h

I hope someone can share their insight on this comment. I think the other comments are fragile and don't hold up too strongly.

theptip
1 replies
3d3h

Marketing of some sort. Either “come to Google and you’ll have access to H100s and freedom to publish and get to work with other people who publish good papers”, which appeals to the best researchers, or for smaller companies, benchmark pushing to help with brand awareness and securing VC funding.

pishpash
0 replies
2d15h

Come be dishwashers in the fancy kitchen! You can only have one chef after all, and the line cook positions were filled long ago too, but dishes don't wash themselves.

godelski
0 replies
2d22h

It's commonly discussed in AI/ML groups that a paper at a top conference is "worth a million dollars." Not all papers, some papers are worth more. But it is in effect discussing the downstream revenues. As a student, it is your job and potential earnings. As a lab it is worth funding and getting connected to big tech labs (which creates a feedback loop). And to corporations, it is worth far more than that in advertising.

The unfortunate part of this is that it can have odd effects like people renaming well known things to make the work appear more impressive, obscure concepts, and drive up their citations.[0] The incentives do not align to make your paper as clear and concise as possible to communicate your work.

[0] https://youtu.be/Pl8BET_K1mc?t=2510

echoangle
2 replies
3d5h

As someone not in the AI space, what do you think is the reason for publishing? Marketing and hype for your products?

simonw
1 replies
3d5h

Retaining your researchers so they don't get frustrated and move to another company that lets them publish.

a_bonobo
0 replies
3d4h

and attracting other researchers so your competitors can't pick them up to potentially harm your own business

rty32
6 replies
3d6h

Opportunity cost is cost. What you could have earned by selling the resources to customers instead of using them yourself is what the resources are worth.

g15jv2dp
3 replies
3d6h

This assumes that you can sell 100% of the resources' availability 100% of the time. Whenever you have more capacity than you can sell, there's no opportunity cost in using it yourself.

michaelt
1 replies
3d4h

A few months back, a lot of the most powerful GPU instances on GCP seemed to be sold out 24/7.

I suppose it's possible Google's own infrastructure is partitioned from GCP infrastructure, so they have a bunch of idle GPUs even while their cloud division can sell every H100 and A100 they can get their hands on?

dmurray
0 replies
2d22h

I'd expect they have both: dedicated machines that they usually use and are sometimes idle, but also the ability to run a job on GCP if it makes sense.

(I doubt it's the other way round, that the Deepmind researchers could come in one day and find all their GPUs are being used by some cloud customer).

myworkinisgood
0 replies
3d2h

As someone who worked for a compute-time provider, I can tell you that the last people who can use the system for free are internal people, because external people bring in cash revenue while internal people just bring in potential future revenue.

nkrisc
0 replies
3d6h

Not if you’re only using the resources when they’re available because no customer has paid to use them.

K0balt
0 replies
3d4h

I think Google produces their own power, so they don’t pay distribution cost which is at least one third of the price of power, even higher for large customers.

stairlane
3 replies
3d1h

> The only cost is then the electricity and used capacity. Not consumer pricing. So negligible.

I don’t think this is valid, as this point seems to ignore the fact that the data center that this compute took place in required a massive investment.

A paper like this is more akin to HEPP research. Nobody has the capability to reproduce the Higgs results outside the facility where the research was conducted (CERN).

I don’t think reproduction was a concern of the researchers.

morbia
1 replies
3d

The Higgs results were reproduced because there are two independent detectors at CERN (ATLAS and CMS). Both collaborations are run almost entirely independently, and the press are only called in to announce a scientific discovery if both find the same result.

Obviously the 'best' result would be to have a separate collider as well, but no one is going to fund a new collider just to reaffirm the result for a third time.

stairlane
0 replies
3d

Absolutely, and well stated.

The point I was trying to make was that nobody (meaning govt bodies) was willing to build another collider capable of repeating the results. At least not yet ;).

Rastonbury
0 replies
1d6h

Kinda, but Google sells compute, so it makes money off the data centre investment. Assuming they had spare capacity for this, it's negligible at Google scale.

pintxo
3 replies
3d6h

> They chose to publish. So they are interested in seeing it reproduced or improved upon.

Not necessarily; publishing also ensures that the stuff is no longer patentable.

slashdave
2 replies
3d1h

Forgive me if I am wrong, but all of the techniques explored are already well known. So, what is going to be patented?

pintxo
0 replies
2d11h

I merely listed another reason why someone would publish something. This did not imply they did it for that reason.

fragmede
0 replies
2d21h

The fundamental algorithms have been, sure, but there are innumerable enhancements upon those base techniques to be found and patented.

jfengel
3 replies
3d6h

Is the electricity cost negligible? It's a pretty compute intensive application.

Of course it would be a tiny fraction of the $10m figure here, but even 1% would be $100,000. Negligible to Google, but for Google even $10 million is couch cushion money.

stavros
1 replies
3d4h

I feel like your comment answers itself: If you have the money to be running a datacenter of thousands of A100 GPUs (or equivalent), the cost of the electricity is negligible to you, and definitely worth training a SOTA model with your spare compute.

dylan604
0 replies
3d3h

Is it really spare compute? Is the demand from others so low that these systems are truly idle? Does this also artificially make it look like demand is high because internal tasks are using it?

dekhn
0 replies
3d2h

The electricity cost is not negligible - I ran a service that had multiples of $10M in marginal electricity spend (i.e., servers running at 100% utilization consume significantly more power than when idle or partly idle). Ultimately, the scientific discoveries weren't worth the cost, so we shut the service down.

$10M is about what Google would spend to get a publication in a top-tier journal. But Google's internal pricing and costs don't look anything like what people cite for external costs; it's more like a state-supported economy with some extremely rich oligarch-run profit centers that feed all the various cottage industries.

ape4
1 replies
3d3h

It's like them running SETI@home ;)

dekhn
0 replies
3d2h

We ran Folding@Home at Google. We were effectively the largest single contributor of cycles for at least a year. It wasn't scientifically worthwhile, so we shut it down after a couple of years.

That was using idle cycles on Intel CPUs, not GPUs or TPUs though.

K0balt
0 replies
3d4h

I’d imagine publishing is more oriented toward attracting and retaining talent. You need to scratch that itch or the academics will jump ship.

Cthulhu_
0 replies
3d3h

I'd argue it's not hard to reproduce per se, just expensive; thankfully there are at least half a dozen (cloud) computing providers that have the necessary resources to do so. Google Cloud, AWS and Azure are the big competitors in the west (it seems / from my perspective), but don't underestimate the likes of Alibaba, IBM, DigitalOcean, Rackspace, Salesforce, Tencent, Oracle, Huawei, Dell and Cisco.

rgmerk
30 replies
3d4h

Worth pointing out here that in other scientific domains, papers routinely require hundreds of thousands of dollars, sometimes millions of dollars, of resources to produce.

My wife works on high-throughput drug screens. They routinely use over $100,000 of consumables in a single screen, not counting the cost of the screening "libraries", the cost of using some of the ~$10M of equipment in the lab for several weeks, the cost of the staff in the lab itself, and the cost of the time of the scientists who request the screens and then take the results and turn them into papers.

ramraj07
22 replies
3d4h

I estimated that for any paper that has mouse work and is produced in a first-world country (i.e. they have to do right by the animals), the minimum cost of that paper in expenses and salary would be $200,000. The average is likely higher. Tens of thousands of papers a year are published like this!

paxys
19 replies
3d3h

These are mostly fixed costs. If you produce a hundred papers from the same team and same research, the costs aren't 100x.

lucianbr
14 replies
3d1h

But starting from the 10th paper, the value is also pretty low I imagine. How many new things can you discover from the same team and same research? That's 3 papers per year for a 30-year career. Every single year, no breaks.

sdenton4
12 replies
2d23h

Well, to be sure, mouse research consistently produces amazing cures for cancer, insomnia, lost limbs, and even gravity itself. Sure, none of it translates to humans, but it's an important source of headlines for high impact journals and science columnists.

godelski
8 replies
2d23h

This is also true for machine learning papers. They cure cancer, discover physics, and all sorts of things. Sure, they don't actually translate to useful science, but they are highly valuable pieces of advertising. And hey, maybe someday they might!

dekhn
7 replies
2d18h

AlphaFold has basically paid the bills for a decade's worth of machine learning research. It's been that transformative.

jononor
5 replies
2d7h

Which bills and how? Not disputing the claim, would just like to understand it!

dekhn
4 replies
2d1h

I shouldn't have been so literal.

I just mean it demonstrated solving something challenging in a convincing way, justifying a great deal of additional resources being dedicated to applying ML to a wide range of biological research.

Not that it actually generates revenue or solves any really important health problems.

godelski
3 replies
1d18h

This comment is much more tempered and I do not think it would generate a strong response. But I would suggest taking care when acting as an evangelist for a product. There's a big difference between "this technology shows great promise and warrants more funding, as it won't be surprising if the benefits more than offset a decade's worth of losses at DeepMind" vs "this is a product right now generating billions of dollars a year".

The big problem with the latter statement isn't so much exaggeration, but something a bit more subtle. It is that people start to believe you. But then they sit waiting, and in that waiting eventually get disappointed. When that happens the usual response often feeds into conspiracies (perpetuating the overall distrust in science) or generates an overall bad sentiment against the whole domain.

The problem is that companies are bootstrapping with hype. This leads to bubbles and makes it a ripe space for conmen, who just accelerate the bubble. There's no problem with Google/Microsoft/OpenAI/etc. talking to researchers/developers in the language of researchers/developers, but there is a problem with them talking to the average person in the language of the future. It's what enables the space for snake oil like Rabbit or Devin. Those steal money from normal people and take money from investors that could be better spent on actually pushing the research of the tech forward so that we can eventually have those products.

I understand some bootstrapping may be necessary due to needing money to even develop things, but certainly the big companies are not lacking in funding and we can still achieve the same goals while being more honest. The excitement and hope isn't the problem, it is the lying. "Is/Can" vs "will/we hope to"

dekhn
2 replies
1d17h

Just be aware, the person you're arguing with has several decades of experience working on the problem that AlphaFold just solved, and worked for Google on protein folding/design/drug discovery and machine learning for years. When I speak casually on Hacker News, I think people know enough from my writing style to not get triggered and write long analytic responses (but clearly, that's not always true). Think of me as a lawful neutral edge lord.

Either way, AlphaFold is one of the greatest achievements in science so far, and the funding agencies definitely are paying lots of attention to funding additional work in machine learning/biology, so in some sense, my statement is effectively true, even if not pedantically, literally correct.

rowanG077
0 replies
17h14m

Why would randoms on the internet be aware of your writing style in a massive online forum? You aren't speaking from authority in this case; you can't compare it to speaking at a conference, for example.

godelski
0 replies
1d14h

> When I speak casually on Hacker News, I think people know enough from my writing style to not get triggered and write long analytic responses (but clearly, that's not always true).

If your "casual speech" is lying, then I don't think the problem is someone getting "triggered"; I think it is because you lied.

> write long analytic responses

I'll concede that I'm verbose, but this isn't Twitter. I'd rather have real conversations.

godelski
0 replies
2d18h

I'd like to point out that AlphaFold does not constitute all, nor even the majority, of ML work.

My comment was a bit tongue in cheek. Not all research is going to be profitable or eventually profitable, but that also doesn't mean it isn't useful. If we're willing to account for the indirect profits of learning what doesn't work (an important part of science), then this vastly diminishes the number of worthless papers (to essentially those that are fraudulent or plagiarized).

But specifically for AlphaFold, I'm going to need a citation on that. If I understand the calculus correctly, Google acquired DeepMind in 2014 for somewhere between $525 million and $850 million, and spends a similar amount each year, along with forgiving a $1.5bn debt[0]. So I think (VERY) conservatively we can say $2bn (I think even $4bn is likely conservative here)? While I see articles that discuss how the value could be north of $100bn[1] (which certainly surpasses a very liberal estimate of costs), I have yet to see evidence that this is actual value that has been returned to Google. I can only find information about 2022 and 2023 having profits in the ballpark of $60m.

This isn't to say that AlphaFold won't offset all the costs (I actually believe it will), but your sentence does not suggest speculation but rather actualization ("has basically paid", "been"). I think that difference matters enough that we have a dozen cliches with similar sentiment. In the same way, my annoyance is not that we are investing in ML[2], but how quick we are to make promises and celebrate success[3]. Actually, my concern is that while hype is necessary, overdoing it allows charlatans[4] to more easily enter the space. And if they are able to gain a significant foothold (I believe that is happening/has happened) then this is actually destructive to those who actually wish to push the technology forward.

[0] https://www.quora.com/How-much-money-did-Google-spend-on-Dee...

[1] https://www.bloomberg.com/news/articles/2024-05-08/deepmind-...

[2] disclosure, I'm an ML researcher. I actually am in favor of more funding. Though different allocation.

[3] I'm willing to concede that success is realistically determined by how one measures success, and that this may be arbitrary and no objective measure actually exists or is possible.

[4] One needs not knowingly be a charlatan. Only that the claims made are false or inaccurate. There are many charlatans who believe in the snake oil they sell. Most of these are unwilling to acknowledge critiques. A clear example is religion. If you believe in a religion, this applies to all religious organizations except the one you are a part of. If you are not religious, well the same sentence holds true but the resultant set is one larger.

austhrow743
1 replies
2d15h

Did you chuck gravity in there to be hyperbolic or has someone really published a paper where they have data implying they got gravity not to apply to mice?

godelski
0 replies
2d23h

> How many new things can you discover from the same team and same research?

That all depends on how you measure discoveries. The most common metric is... publications. Publications are what advance your career and are what you are evaluated on. The content may or may not matter (lol who reads your papers?) but the number certainly does. So the best way to advance your career is to write a minimum viable paper and submit as often as possible. I think we all forget how Goodhart's Law comes to bite everyone in the ass.

dontreact
3 replies
2d17h

I agree with you but it also makes me think: Google's TPUs are also fixed costs and these research experiments could have been run at times when production serving need isn't as high.

ec109685
2 replies
2d12h

They sell them on the spot market, so there’s someone that would consume the baseline compute.

dontreact
1 replies
1d16h

I imagine it would never be optimal to set a price so low that utilization is always 100% externally

londons_explore
0 replies
14h38m

There are plenty of people who want cheap compute and are willing to wait till 3am if that's when the cheap compute is. This happens for all computing services, but for ML stuff the effect is even more pronounced because compute costs are typically a large part of the cost of many projects.

esperent
0 replies
3d3h

To be fair, supposing the Google paper took six months to a year to produce, it also must have cost several hundred thousand dollars in salaries and other non-compute costs.

dumb1224
0 replies
2d20h

Well, not everyone starts experiments anew. Many also reuse accumulated datasets. For human data even more so.

slashdave
5 replies
3d1h

I assure you that the companies performing these screens expect a return on this investment. It is not for a journal paper.

godelski
4 replies
2d23h

I used to believe this line. But then I worked for a big tech company where my manager constantly made those remarks ("the difference between industry and academia is that in industry it has to actually work"). I then improved the generalization performance (i.e. "actually work") by over 100% and they decided not to update the model they were selling. Then again, I had a small fast model and it was 90% as accurate as the new large transformer model. Though they also didn't take the lessons learned and apply them to the big model, which had similar issues that were just masked by its size.

Plus, I mean, there are a lot of products that don't work. We all buy garbage and often can't buy not garbage. Though I guess you're technically correct that in either of these situations there can still be a return on investment, but maybe that shouldn't be good enough...

shpongled
3 replies
2d21h

The post you are replying to is talking about high throughput assays for drug development. This is something actually run in a lab, not a model. As another person working at a biotech, I can assure you that screens are not just run as busy work.

rgmerk
1 replies
2d21h

No they’re not busywork, but not all such screens are directly in the drug discovery pipeline.

godelski
0 replies
2d18h

It also entirely misses the point of my comment which was that I don't believe this sentence to be true in an at least indirect sense. I conceded that it is technically valid in that lab experiments can also be used as advertisements and generate revenue even if such screens do not directly lead to novel or improved drug discoveries.

godelski
0 replies
2d18h

> The post you are replying to is talking about high throughput assays for drug development.

While this is true, it is also true that the sentiment of this line is frequently used outside of biotech (note that this entire website is primarily dominated by computer science posts). I think it is just worth mentioning that it is perfectly valid for discussions to not be pigeonholed and that you are perfectly allowed to talk about similarities in other fields (exact or inexact).

It is also true that my reply is not invalidated by changing settings. If you pay close attention you'll notice the generality of the comment outside of the specific example. In fact, this is exactly what the OP did, considering the article is about a machine learning paper and the example they used to __illustrate__ their point was about their wife's work in biotech. But the sentiment/point/purpose of their comment would not have changed were their wife to work in physics/engineering/chemistry/underwater basket weaving/whatever. So if this is your issue, I think they are misdirected and I ask that you please take it up with the OP and ensure that they know no comments in this thread may be about anything but ML papers by DeepMind. Illustrative examples out of domain are not allowed.

It is also true that this domain example was a product this company was selling. So I think you're being too quick to dismiss as not only did it run "in a lab" but the actual product runs in the real world. It has real customers who use the software.

It's also true that I never accused anyone of "just [running] busy work" and that such an interpretation is grossly inaccurate. My final sentence should make this abundantly clear, and I would argue it is far more important in a setting such as medicine, where I've conveyed that you can sell ineffective or subpar products while still generating a profit and suggested that this probably shouldn't be the metric we care about (it certainly isn't what the spirit of those metrics are about).

But if you want to (implicitly) accuse me of derailing the conversation, I do not think you have the grounds to do so. But I will accuse you of doing so. If you disagree with my comment, you are more than welcome to reply in such a way. If you think my comment does not apply to dug discovery or think it only applies to ML[0], then you are welcome to state as much too and it is encouraged to state why. But the only derailing of the conversation has been the pigeonholing you have applied. If you think this is wrong, I still welcome a response to that as I am happy to learn how to communicate better but you did also catch me on a day where I'm not happy to be unreasonably and willfully misinterpreted.

[0] I didn't tell you the application... a bit presumptuous are we?

Metacelsus
0 replies
2d19h

Yeah, I'm a wet-lab biologist and my most recent paper (which is still not past peer review) has already cost about $200,000. And I just spent another $2000 today...

hnthr_w_y
12 replies
3d7h

That's not very much in the business range; it's a lot when it comes to paying our salaries.

willis936
11 replies
3d7h

Any company of any size that doesn't learn the right lessons from a $10M mistake will be out of business before long.

brainwad
5 replies
3d7h

That's like staffing a single-manager team on a bad project for a year. Which I assure you happens all the time in big companies, and yet they survive.

saikia81
4 replies
3d7h

They are not saying it doesn't happen. They are saying: The companies that don't learn from these mistakes will go out of business before long.

duggan
2 replies
3d7h

In principle, for some other company, sure.

Google makes ~$300b a year in profit. They could make a $10m mistake every day and barely make a dent in it.

magic_man
1 replies
3d6h

They do not; they made ~$90 billion in profit. So no one would notice a $10M mistake, but no, they didn't make $300B in profit.

duggan
0 replies
3d4h

I misread some stats, thanks for the correction.

hnbad
0 replies
3d6h

I think there might be a disagreement about what "big" means. Google can easily afford to sink millions each year into pointless endeavours without going out of business and they probably have. Alphabet's annual revenue has been growing a good 10% each year since 2021[0]. That's in the range of $20-$30 billion dollars with a B.

To put that into perspective, Alphabet's revenue has increased 13.38% year-over-year as of June 30, arriving at $328.284 billion dollars - i.e. it has increased by $38.74 billion in that time. A $10 million dollar mistake translates to losing 0.0258% of that number.

A $10 million dollar mistake costs Alphabet 0.0258% of the amount their revenue increased year-over-year as of last month. Alphabet could have afforded to make 40 such $10 million dollar mistakes in that period and it would have only represented a loss of 1% of the year-over-year increase in revenue. Taking the year-over-year increase down by 1% (from 13.38% to 12.38%) would have required making 290 such $10 million dollar mistakes within one year.

Let me repeat that because it bears emphasizing: over the past years, every year Google could have easily afforded an additional 200 such $10 million dollar mistakes without significantly impacting their increase in revenue - and even in 2022 when inflation was almost double what it was in the other year they would have still come out ahead of inflation.

So in terms of numbers this is demonstrably false. Of course the existence of repeated $10 million dollar mistakes may suggest the existence of structural issues that will result in $1, $10 or $100 billion dollar problems eventually and sink the company. But that's conjecture at this point.

[0]: https://www.macrotrends.net/stocks/charts/GOOG/alphabet/reve...

vishnugupta
1 replies
3d7h

https://killedbygoogle.com/

I’m confident each one of them were multiple of $10M investments.

And this is just what we know because they were launched publicly.

Sebb767
0 replies
3d5h

The point the parent made is not to not make mistakes, but to learn from them. Which they probably did not from all of them, as indicated by the sheer number of messenger apps on this list, but there's definitely a lot to learn from this list.

OtherShrezzing
1 replies
3d4h

I'm not really certain that's true at Google's size. Their annual revenue is something like a quarter trillion dollars. 25,000x larger than a $10m mistake.

The equivalent wastage for a self-employed person would be allowing a few cups of Starbucks coffee per year to go cold.

Workaccount2
0 replies
3d4h

There was that time that Google paid out something like $300M in fraudulent invoices.

willis936
0 replies
3d

To be clear: what I mean by "not learning the right lessons" is a company deciding that the issue with wasting $10M in six months is that they didn't do it 100x in parallel in three months. Then when that goes wrong they must need to do it 100x wider in parallel again in three weeks.

sigmoid10
8 replies
3d7h

This calculation is pretty pointless and the title is flat out wrong. It also gets lost in finer details while totally missing the bigger picture. After all, the original paper was written by people either working for Google or at Google. So you can safely assume they used Google resources. That means they wouldn't have used H100s, but Google TPUs. Since they design and own these TPUs, you can also safely assume that they don't pay whatever they charge end users for them. At the scale of Google, this basically amounts to the cost of housing/electricity, and even that could be a tax write-off. You also can't directly assume that the on-paper performance of something like an H100 will be the actual utilization you can achieve, so basing any estimate in terms of $/GPU-hour will be off by default.

That means Google paid way less than this amount, and if you wanted to reproduce the paper yourself, you would potentially pay a lot more, depending on how many engineers you have in your team to squeeze every bit of performance per hour out of your cluster.

c-linkage
4 replies
3d7h

Reproducibility is a key element of the scientific process

How is anyone else going to reproduce the experiment if it's going to cost them $10 million because they don't work at Google and would have to rent the infrastructure?

tokai
1 replies
3d7h

Cheap compared to some high energy physics experiments.

lostlogin
0 replies
3d6h

I was thinking this too. Splitting the atom and various space program experiments would also be difficult to reproduce if someone wanted to try.

rvnx
0 replies
3d6h

This specific paper looks plausible, but a lot of published AI papers are simply fake because it is one of the sectors where it is possible to make non-reproducible claims. "We don't give source-code or dataset", but actually they didn't find or do anything of interest.

It works and helps to get a salary raise or a better job, so they continue.

A bit like when someone goes to a job interview, didn't do anything, and claims "My work is under NDA".

Sebb767
0 replies
3d6h

But what's the solution here? Not doing the (possibly) interesting research because it's hard to reproduce? That doesn't sound like a better situation.

That being said, yes, this is hard to reproduce for your average Joe, but there are also a lot of companies (like OpenAI, Facebook, ...) that are able to throw this amount of hardware at the problem. And in a few years you'll probably be able to do it on commodity hardware.

injuly
1 replies
3d6h

> This calculation is pretty pointless and the title is flat out wrong.

No, it's not. The author clearly states in the very first paragraph that this is the price it would take them to reproduce the results.

Nowhere in the article (or the title) have they implied that this is how much Google spent.

sigmoid10
0 replies
1d10h

They have changed both the title and the article since it was posted... almost certainly due to comments like these which used to be at the top. Though editing titles should be impossible imo. Editing comments is fine, but if you screw up titles you should be forced to resubmit and not be able to rug-pull an entire discussion.

michaelmior
0 replies
3d7h

Even if they did use H100s and paid the current premium on them, you could probably buy 100 H100s and the boxes to put them in for less than $10M.

pama
6 replies
3d6h

3 USD/hour on the H100 is much more expensive than a reasonable amortized full-ownership cost, unless one assumes the GPU is useless within 18 months, which I find a bit dramatic. The MFU can be above 40% and certainly well above the 35% in the estimate, also for small models with plain PyTorch and trivial tuning [1]. I didn't read the linked paper carefully, but I seriously doubt the Google team used vocab embedding layers with the 2 D V parameters stated in the link, because this would be suboptimal by not tying the weights of the token embedding layer in the decoder architecture (even if they did double the params in these layers, it would not lead to 6 D V compute because the embedding input is indexed). To me these assumptions suggested a somewhat careless attitude towards the cost estimation and so I stopped reading the rest of this analysis carefully. My best guess is that the author is off by a large factor in the upward direction, and a true replication with H100/H200 could be about 3x less expensive.

[1] If the total cost estimate were relatively low, say less than 10k, then of course the lowest rental price and a random training codebase might make some sense in order to reduce administrative costs; once the cost is in the ballpark of millions of USD, it feels careless to avoid optimizing it further. There exist H100s in fire sales or on eBay occasionally, which could reduce the cost even more, but the author already mentions 2 USD/GPU/hour for bulk rental compute, which is better than the 3 USD/GPU/hour estimate they used in the writeup.
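
For a rough sense of the gap, here is a back-of-envelope with assumed numbers (roughly $30k per H100 including its share of the server, 3-year amortization, ~1 kW draw at $0.10/kWh; none of these figures come from the thread or the post):

    capex_per_gpu = 30_000   # USD, GPU plus its share of the host/networking (assumed)
    lifetime_years = 3       # amortization period (assumed)
    power_kw = 1.0           # per-GPU draw including cooling overhead (assumed)
    usd_per_kwh = 0.10       # electricity price (assumed)

    hours = lifetime_years * 365 * 24
    cost_per_hour = capex_per_gpu / hours + power_kw * usd_per_kwh
    print(f"~${cost_per_hour:.2f}/GPU-hour amortized")  # roughly $1.2/GPU-hour vs the $3/hour rental figure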

152334H
2 replies
3d5h

You are correct on true H100 ownership costs being far lower. As I mention in the H100 blurb, the H100 numbers are fungible and I don't mind if you halve them.

MFU can certainly be improved beyond 40%, as I mention. But on the point of small models specifically: the paper uses FSDP for all models, and I believe a rigorous experiment should not vary sharding strategy due to numerical differences. FSDP2 on small models will be slow even with compilation.

The paper does not tie embeddings, as stated. The readout layer does lead to 6DV because it is a linear layer of D*V, which takes 2x for a forward and 4x for a backward. I would appreciate it if you could limit your comments to factual errors in the post.

pama
0 replies
3d

My bad on the 6 D V estimate; you are correct that if they do a dense decoding (rather than a hierarchical one as Google used to do in the old days) the cost is exactly 6 D V. I cannot edit the GP comment and I will absorb the shame of my careless words there. I was put off by the subtitle and initial title of this HN post, though the current title is more appropriate and correct.

Even if it's a small model, one could use DDP or FSDP/2 without slowdowns on a fast interconnect, which certainly adds to the cost. But if you want to reproduce all the work at the cheapest price point you only need to parallelize to the minimal level for fitting in memory (or rather, the one that maxes the MFU), so everything below 2B parameters runs on a single H100 or a single node.

lonk11
0 replies
3d4h

I think the commenter was thinking about the input embedding layer, where to get an input token embedding the model does a lookup of the embedding by index, which is constant time.

And the blog post author is talking about the output layer where the model has to produce an output prediction for every possible token in the vocabulary. Each output token prediction is a dot-product between the transformer hidden state (D) and the token embedding (D) (whether shared with input or not) for all tokens in the vocabulary (V). That's where the VD comes from.

It would be great to clarify this in the blog post to make it more accessible but I understand that there is a tradeoff.
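
A small worked example of that count, using illustrative sizes rather than the paper's actual configuration:

    # FLOPs per token for the readout (unembedding) layer, a D x V linear map,
    # using the 2x-forward / 4x-backward convention from the thread.
    D = 2048      # hidden size (example value)
    V = 32_000    # vocabulary size (example value)

    forward_flops = 2 * D * V    # one multiply-accumulate per weight
    backward_flops = 4 * D * V   # gradients w.r.t. weights and activations
    assert forward_flops + backward_flops == 6 * D * V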

spi
1 replies
3d5h

Do you have sources for "The MFU can be above 40% and certainly well above the 35 % in the estimate"?

Looking at [1], the authors there claim that their improvements were needed to push BERT training beyond 30% MFU, and that the "default" training only reaches 10%. Certainly numbers don't translate exactly, it might well be that with a different stack, model, etc., it is easier to surpass, but 35% doesn't seem like a terribly off estimate to me. Especially so if you are training a whole suite of different models (with different parameters, sizes, etc.) so you can't realistically optimize all of them.

It might be that the real estimate is around 40% instead of the 35% used here (frankly it might be that it is 30% or less, for that matter), but I would doubt it's so high as to make the estimates in this blog post terribly off, and I would doubt even more that you can get that "also for small models with plain pytorch and trivial tuning".

[1] https://www.databricks.com/blog/mosaicbert

pama
0 replies
2d17h

Please look at any of the plain pytorch codes by Karpathy that complement llm.c. If you want scalable codes, please look at Megatron-LM.

tedivm
0 replies
3d1h

When I was at Rad AI we did the math on rent versus buy, and it was just so absolutely, ridiculously obvious that buy was the way to go. Cloud does not make sense for AI training right now, as the overhead costs are considerably higher than simply purchasing a cluster, colocating it at a place like Colovore, and paying for "on hands" support. It's not even close.

jeffbee
5 replies
3d3h

I think if you wanted to think about a big expense you'd look at AlphaStar.

5kg
3 replies
3d2h

I am wondering if AlphaStar is the most expensive paper ever.

jeffbee
1 replies
3d2h

I think it could be. I also think it is likely that HN frequenter `dekhn` has personally spent more money on compute resources than any other living human, so maybe they will chime in on how the cost gets allocated to the research.

dekhn
0 replies
3d

A big part of it is basically hard production quota: the ability to run jobs at a high priority on large machines for an entire quarter. The main issue was that quota was somewhat overallocated, or otherwise unable to be used (if you and another team both wanted a full TPUv3 with all its nodes and fabric).

From what I can tell, ads made the money and search/ads bought machines with their allocated budget, TI used their budget to run the systems, and then funny money in the form of quota was allocated to groups. The money was "funny" in the sense that the full reach-through costs of operating a TPU for a year look completely different from the production allocation quota that gets handed out. I think Google was long trying to create a market economy, but it was really much more like a state-funded exercise.

(I am not proud of how much CPU I wasted on protein folding/design and drug discovery, but I'm eternally thankful for Urs giving me the opportunity to try it out and also to compute the energy costs associated with the CPU use)

lern_too_spel
0 replies
3d2h

"Observation of a new particle in the search for the Standard Model Higgs boson with the ATLAS detector at the LHC"

ipsum2
0 replies
2d19h

It's disappointing that they never developed AlphaStar enough to become superhuman (unlike AlphaGo); even lower-level players were able to adapt to its playstyle.

The cost was probably the limiting factor.

faitswulff
2 replies
3d

I wonder how many tons of CO2 that amounts to. Google Gemini estimated 125,000 tons of carbon emissions, but I don’t have the know-how to double check it.

chazeon
1 replies
2d22h

If you use solar energy, then there is no CO2 emission. Right?

ipsum2
0 replies
2d19h

Google buys carbon credits to make up for CO2 emissions, they've never relied strictly on solar.

floor_
1 replies
3d6h

Content aside, this is hands down my favorite blog format.

mostthingsweb
0 replies
3d4h

I agree, but I'm curious if it's for the same reason. I like it because there is no flowery writing. Just direct "here are the facts".

dont_forget_me
1 replies
3d3h

All that compute power just to invade privacy and show people more ads. Can this get any more depressing?

psychoslave
0 replies
3d3h

Yes, sure! Imagine a world where every HN thread you engage in is fed with information that is all subtly tailored to push you into buying whatever crap the market is able to produce.

hiddencost
0 replies
2d22h

It's likely the cost of the researchers was about $1M/head; with 11 names, that puts the staffing costs on par with the compute costs.

(A good rule of thumb is that an employee costs about twice their total compensation.)

godelski
0 replies
2d23h

Worth mentioning that "GPU Poor" isn't created because those without much GPU compute can't contribute, but rather because those with massive amounts of GPU are able to perform many more experiments and set a standard, or shift the Overton window. The big danger here is just that you'll start expecting a higher "thoroughness" from everyone else. You may not expect this level, but seeing this level often makes you think what was sufficient before is far from sufficient now, and what's the cost of that lower bound?

I mention this because a lot of universities and small labs are being edged out of the research space but we still want their contributions. It is easy to always ask for more experiments but the problem is, as this blog shows, those experiments can sometimes cost millions of dollars. This also isn't to say that small labs and academics aren't able to publish, but rather that 1) we want them to be able to publish __without__ the support of large corporations to preserve the independence of research[0], 2) we don't want these smaller entities to have to go through a roulette wheel in an effort to get published.

Instead, when reviewing be cautious in what you ask for. You can __always__ ask for more experiments, datasets, "novelty", and so on. Instead ask if what's presented is sufficient to push forward the field in any way and when requesting the previous things be specific as to why what's in the paper doesn't answer what's needed and what experiment would answer it (a sentence or two would suffice).

If not, then we'll have the death of the GPU poor and that will be the death of a lot of innovation, because the truth is, not even big companies will allocate large compute for research that is lower level (do you think state space models (mamba) started with multimillion dollar compute? Transformers?). We gotta start somewhere and all papers can be torn to shreds/are easy to critique. But you can be highly critical of a paper and that paper can still push knowledge forward.

[0] Lots of papers these days are indistinguishable from ads. A lot of papers these days are products. I've even had works rejected because they are being evaluated as products not being evaluated on the merits of their research. Though this can be difficult to distinguish when evaluation is simply empirical.

[1] I once got desk rejected for "prior submission." Two months later they overturned it, realizing it was in fact an arXiv paper, only for it to be desk rejected again a month later for "not citing relevant materials" with no further explanation.

brg
0 replies
3d1h

I found this exercise interesting, and as arcade79 pointed out, it is the cost of replication, not the cost to Google. Humorously, I wonder what the cost of replicating the Higgs boson verification or gravitational wave detection would be.