A lot of misunderstandings among the commenters here.
From the link: "the total compute cost it would take to replicate the paper"
It's not Google's cost. Google's cost is of course entirely different. It's the cost for the author if he were to rent the resources to replicate the paper.
For Google, all of it is running at a "best effort" resource tier, grabbing available resources when they're not requested by higher-priority jobs. It's effectively free resources (except for electricity consumption). If any "more important" jobs with a higher priority come in and ask for the resources, the paper-writers' jobs will just be preempted.
This is a side effect of underutilized capital, and it shows up in many places.
For example, if YOU want to rent a backhoe to do some yard rearrangement it’s going to cost you.
But Bob who owns BackHoesInc has them sitting around all the time when they're not being rented or used; he can rearrange his yard wholesale, for almost nothing.
"Underutilized" isn't the right word here. There's some value in putting your capital to productive use. But, once immediate needs are satisfied, there's more value in having the capital available to address future needs quickly than there would be in making sure that everything necessary to address those future needs is tied up in low-value work. Option value is real value; being prepared for unforeseen but urgent circumstances is a real use.
I think a better description than “underutilized” would be “sunk capex cost” - Google (or any cloud provider) cannot run at 100% customer utilization because then they could neither acquire new customers nor service transitory usage spikes for existing customers. So they stay ahead of predicted demand, which means that they will almost always have excess capacity available.
Cloud providers pay capital costs (CapEx) for servers, GPUs, data centers, employees, etc. Utilization allows them to recoup those costs faster.
Cloud customers pay operational expenses (OpEx) for usage.
So Google generally has excess capacity, and while they would prefer revenue-generating customer usage, they’ve already paid for everything but the electricity, so it’s extremely cheap for them to run their own jobs if the hardware would otherwise be sitting idle.
There is also a mathematical relationship in queuing theory between utilization and average queue length, which all programmers should be told: https://blog.danslimmon.com/2016/08/26/the-most-important-th...
As you run close to 100% utilization, you also run close to infinite waiting times. You don't want that. It might be acceptable for your internal projects (the actual waiting time won't be infinite, and you'll cancel jobs if it gets too close to infinity), but it's certainly not acceptable for customers.
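To make that concrete, here's a minimal sketch (the textbook M/M/1 formula, not anything taken from the linked post) of how the average queue length blows up as utilization approaches 100%:

    # Average queue length for an M/M/1 queue: Lq = rho^2 / (1 - rho),
    # where rho is utilization. It diverges as rho approaches 1.
    def avg_queue_length(rho: float) -> float:
        assert 0 <= rho < 1
        return rho * rho / (1.0 - rho)

    for rho in (0.5, 0.8, 0.9, 0.95, 0.99):
        print(f"utilization {rho:.0%}: average queue length {avg_queue_length(rho):.1f}")
    # e.g. 50% utilization -> 0.5 waiting, 90% -> 8.1, 99% -> ~98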
There is a genre of game called "time management games" which will hammer this point home if you play them. They're not really considered 'serious' games, so you can find them in places where the audience is basically looking to kill time.
https://www.bigfishgames.com/us/en/games/5941/roads-of-rome/...
The structure of a time management game is:
1. There's a bunch of stuff to do on the map.
2. You have a small number of workers.
3. The way a task gets done is, you click on it, and the next time a worker is available, the worker will start on that task, which occupies the worker for some fixed amount of time until the task is complete.
4. Some tasks can't be queued until you meet a requirement such as completing a predecessor task or having enough resources to pay the costs of the task.
You will learn immediately that having a long queue means flailing helplessly while your workers ignore hair-on-fire urgent tasks in favor of completely unimportant ones that you clicked on while everything seemed relaxed. It's far more important that you have the ability to respond to a change in circumstances than to have all of your workers occupied at all times.
Ah, sounds like Dwarf Fortress!
I was thinking Oxygen Not Included.
TL/DR: You should think of and use queues like shock absorbers, not sinks. Also you need to monitor them.
Queues are useful to decouple the output of one process from the input of another process when the two aren't synchronized velocity-wise. Like a shock absorber, they allow both processes to continue at their own paces, and the queue absorbs instantaneous spikes in producer load above the steady-state rate of the consumer (side note: if the queue is isolated code- and storage-wise from the consumer process, then you can use it to keep the producer running undisrupted when you need to take the consumer down for maintenance or whatever).
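A minimal sketch of that shock-absorber pattern (names and rates are made up): a bounded queue between a bursty producer and a steady consumer, where the bound gives you backpressure instead of unbounded growth.

    import queue
    import random
    import threading
    import time

    work = queue.Queue(maxsize=1000)  # bounded: spikes are absorbed, but growth is capped

    def producer():
        while True:
            work.put(random.random())            # blocks (backpressure) if the queue is full
            time.sleep(random.uniform(0, 0.02))  # bursty arrival rate

    def consumer():
        while True:
            item = work.get()                    # process the item here
            time.sleep(0.01)                     # steady service rate
            work.task_done()

    threading.Thread(target=producer, daemon=True).start()
    threading.Thread(target=consumer, daemon=True).start()
    time.sleep(2)
    print("queue depth after 2s:", work.qsize())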
Running with very small queue lengths is generally fine and generally healthy.
If you have a queue that consistently runs with substantial queue lengths, then you have a mismatch between the workloads of the two processes it connects: you either need to reduce the load from the producer or increase the throughput of the consumer.
Very large queues tend to hide the workload-mismatch problem, or worse. Often the work put into a queue is not stored locally on the producer, or is quickly overwritten, so a consumer-end problem can result in the potentially irrevocable loss of everything in the queue, and the larger the queue, the bigger the loss. Another problem with large queues is that if your consumer process is only slightly faster than the producer process, a large backlog of work can take a long time to work down, and it's even possible (admission of guilt) to configure systems using such queues so that they cannot recover from a lengthy outage, even if all the work items were safely stored in the queue.
If you have queues, you need to monitor your queue lengths and alarm when queue lengths start increasing significantly above baseline.
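A sketch of that monitoring point (thresholds and names are made up, not from any particular system): sample the queue depth regularly and alarm when it drifts well above the recent baseline.

    from collections import deque

    recent_depths = deque(maxlen=60)  # e.g. one sample per minute, last hour

    def depth_is_alarming(current_depth: int, factor: float = 3.0, floor: int = 10) -> bool:
        """True if the current depth is well above the rolling baseline."""
        baseline = sum(recent_depths) / len(recent_depths) if recent_depths else 0.0
        recent_depths.append(current_depth)
        return current_depth > max(floor, factor * baseline)

    # e.g. in a periodic task:
    # if depth_is_alarming(work.qsize()):
    #     alert_oncall()  # hypothetical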
In practice it's more complicated than this: Borg isn't actually a queue, it's a priority-based system with preemption, although people layered queueing systems on top. Further, granularity mattered a lot: you could get much more access to compute by asking for smaller slices (fractions of a CPU core, or a fraction of a whole TPU cluster). There was a lot of "empty crack filling" at Google.
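A toy sketch of that priority-plus-preemption model (this is not Borg's actual algorithm, just an illustration): a new higher-priority job evicts the lowest-priority running jobs until it fits.

    from dataclasses import dataclass

    CAPACITY = 8.0  # cores in this toy cell

    @dataclass
    class Job:
        name: str
        priority: int  # higher = more important
        cores: float   # fractional cores, per the "smaller slices" point above

    running: list[Job] = []

    def schedule(new: Job) -> list[Job]:
        """Admit `new` if possible, preempting lower-priority jobs; returns the evicted jobs."""
        evicted = []
        used = sum(j.cores for j in running)
        victims = sorted((j for j in running if j.priority < new.priority),
                         key=lambda j: j.priority)
        while used + new.cores > CAPACITY and victims:
            victim = victims.pop(0)
            running.remove(victim)
            evicted.append(victim)
            used -= victim.cores
        if used + new.cores <= CAPACITY:
            running.append(new)
        return evicted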
I doubt they are doing this, but if they ran burn-in tests with 3 machines doing identical workloads, they could validate workloads and also test new infra. Unlike customer workloads, it would be OK to retry after an error.
This would be 100% free, as the electricity and "wear and tear" would be spent anyhow.
Same effect when leasing companies let office space sit unoccupied for years on end. The future value is higher than the marginal value of reducing the price to fill it with a tenant.
That may be part of it for properties left unleased for years, but I believe it's not the only part.
I believe the larger factor, and someone correct me if they have a better understanding of this, is that for commercially rented properties the valuation used to determine the mortgage terms you get takes into account what you claim to be able to get from rent. Renting for less than that reduces the valuation and can put you upside down on the mortgage. But the bank will let you defer mortgage payments, effectively taking each month of mortgage duration and moving it from now to after the last month of the mortgage duration, extending the time they earn interest for.
So if no one wants to lease the space at that price after a prior lessee leaves, for whatever reason, it's financially better for the property owner to leave the space vacant, sometimes for years, until someone willing to pay that price comes along, than to lower the rent and get a tenant.
Land Value Tax would fix this.
This is mostly correct. People assume commercial loan terms are like single-family homes "but larger" but they're not. They basically are all custom financial deals with multiple banks and may be over multiple properties. As long as total vacancy isn't below a cutoff the banks will be happy, but lowering rents "just to get a tenant" can harm the valuation and trigger terms.
Part of the reason things like Halloween Superstores can pop up is that the terms often exclude "short-term leases", which are under six months.
Also, when you're leasing to companies, they are VERY quick to jump at lower prices if available, which means that if you drop the rent for one tenant, the others are sure to follow, sometimes even before their lease terms are up.
Many cities only tax on leased property, or have very low rates on unleased property.
Real estate is a playground for irrationally hopeful or stubborn participants.
I'm going to say this the next time I argue I need my servers online 24/7.
I'm not really sure I'm following you.
In the case of compute, you can evict low-priority jobs nearly instantly, so the compute capacity running spot instances and internal side-projects is just as available for unexpected bursts as it would be if it were sitting idle.
Yeah, airlines make "more return on capital" by faster turn-around of planes to a point - if they are utilizing their airframes above 80 or 90 or whatever percent, the airline itself becomes extremely fragile and unable to handle incidents that impact timing.
We saw the same thing with JIT manufacturing during Covid.
Car lots with attached garages are like this too. That brake and suspension work they were going to charge you several thousand dollars for? Once you trade in ol' Bessie they'll do that for pennies on the dollar during slack time; it doesn't hurt them if the car sits around for a few weeks or months before being ready for sale.
This was my first job after moving into this state. Between my labor and parts, it was about 15% of the sale price.
My most interesting repair was a 1943 Cadillac, a 'war car'.
Can others also buy the “best effort” tier?
Even if the job could then easily run for weeks, when you could otherwise buy your way to getting it done in a day.
Then there could be bidding on this "best effort" resource, with the price of electricity at any given time factored in.
Is the "best effort" tier similar to AWS spot instances?
At every cloud provider there's probably a tier below "spot" (or whatever the equivalent is called at AWS's competitors) that is used for the low-priority jobs of the cloud provider itself.
You can speculate about this or you can look at how Google's internal workloads actually run, because they have released a large and detailed set of traces from Borg. They're really open about this.
https://github.com/google/cluster-data
Sure, land a job there, work your way all the way up against the corporate BS and toxicity, and you can get the best-effort tier.
That effort needs to be added to the cost calculation too.
This assumes the common resources (CPU, RAM, etc.), not the ones required for LLM training (GPU, TPU, etc.). It's a different economy.
TL;DR: It's not ~free.
Why does GPU matter? Do you think GCP keeps GPU utilization at 100% at all times?
What the OP is referring to requires overprovisioning for the high-priority traffic plus a sine-like utilization pattern (without that, the benefit of the "batch" tier is close to zero: preemption is too frequent for any meaningful work when you're near the top of the utilization hill).
You get that organically when you are serving lots of users, and there aren't many GPUs etc. used for that. Training LLMs gives you a different utilization pattern, so the "best effort" resources aren't as useful in that setup.
Because accelerators (TPUs, GPUs), unlike RAM/CPU, are notoriously hard to timeshare and virtualize. So if you get evicted in an environment like that, you have to reload your entire experiment state from a model checkpoint. With giant models like these, that might take dozens of minutes. As a result, I doubt these experiments are done using "spare" resources; in that case, constant interruptions and reloading would mean the experiments finish sometime around the heat death of the universe :)
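A rough toy model of why that hurts (numbers and names are made up, this isn't any real framework's API): every eviction rolls you back to the last checkpoint and makes you pay the reload time again.

    import random

    CHECKPOINT_EVERY = 1000   # steps between checkpoints
    RELOAD_SECONDS = 30 * 60  # reloading a giant checkpoint can take tens of minutes

    class Preempted(Exception):
        """The scheduler evicted this low-priority job."""

    def train(total_steps: int, preempt_prob: float = 1e-4) -> float:
        """Return total wall-clock seconds, assuming ~1 s per step plus reload costs."""
        wall, step, last_ckpt = RELOAD_SECONDS, 0, 0   # initial checkpoint load
        while step < total_steps:
            try:
                if random.random() < preempt_prob:
                    raise Preempted()
                wall += 1.0              # one training step
                step += 1
                if step % CHECKPOINT_EVERY == 0:
                    last_ckpt = step
            except Preempted:
                step = last_ckpt         # roll back to the last checkpoint...
                wall += RELOAD_SECONDS   # ...and pay the reload cost again
        return wall

    print(f"{train(100_000) / 3600:.1f} hours of wall clock for 100k steps")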
According to neoclassical economists this is impossible since you can easily and instantaneously scale infrastructure up and down continuously at no cost and the future is known so demand can be predicted reliably.
The problem with neoclassical economics is that it doesn't concern itself with the physical counterpart of liquidity. It is assumed that the physical world is just as liquid as the monetary world.
The "liquidity mismatch" between money and physical capital must be bridged through overprovisioning on the physical side. If you want the option to choose among n different products, but only choose m products, then the n - m unsold products must be priced into the m bought products. If you can repurpose the unsold products, then you make a profit or you can lower costs for the buyer of the m products.
I would even go as far as to say that the production of liquidity is probably the driving force of the economy, because it means we don't have to do complicated central planning and instead use simple regression models.
Isn't that all what high frequency traders would say? :)
Perhaps there is some limit at which additional liquidity doesn't offer much value?
I think you completely misunderstood the GP.
There isn't much there about stock markets.
If this is the way they pull it off consistently, it might be a good business model for those working on research, like Stability, to also moonlight as a GPU cloud service.
It's a hustle only for the near future, while this bubble lasts, but it can help reduce costs.
Still, don’t get high on your own supply.
Possible corollary: it may be difficult to regularly turn out highly compute-dependent research if you're paying full retail rack rates for your hardware (i.e. using someone else's cloud).
Is it free-priority based?
I was told by an employee that GDM internally has a credits system for TPU allocation, with which researchers have to budget out their compute usage. I may have completely misunderstood what they were describing, though.