Am I reading this correctly:
Training time was 4,282,407 GPU-hours. At, conservatively, 200W per GPU, that's (4,282,407 * 200) / 1_000_000_000 ≈ 0.86 GWh, call it 1 GWh. At $0.10/kWh that's roughly $100,000?
So if you have a single equivalent GPU at home, that's about 500 years of training time and $100k in energy costs. Or, in practice, about 3,000 GPUs for 2 months.
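A quick sanity check of the arithmetic, under the same assumptions (200W per GPU, $0.10/kWh, 3,000 GPUs, everything rounded):

```python
gpu_hours = 4_282_407
watts_per_gpu = 200          # assumed average draw
price_per_kwh = 0.10         # assumed electricity price

energy_kwh = gpu_hours * watts_per_gpu / 1_000        # ~856,000 kWh, i.e. ~0.86 GWh
energy_cost = energy_kwh * price_per_kwh              # ~$85,600, roughly $100k

years_on_one_gpu = gpu_hours / (24 * 365)             # ~489 years
months_on_3000_gpus = gpu_hours / 3_000 / 24 / 30     # ~2 months

print(energy_kwh, energy_cost, years_on_one_gpu, months_on_3000_gpus)
```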
The AI industry has to hope the world doesn't change fast enough to make these models useless.
EDIT: price is $100k
Numbers like these really don't bode well for the longer-term prospects of open-source models. I doubt the current strategy of waiting expectantly for a corporation to spoonfeed us yet another $100,000 model for free is going to work forever.
That $100k is conservative too, it doesn't include the cost of buying/renting the hardware, or the compute time spent on experimental training runs, or the cost of data acquisition, labeling and cleaning, or the cost of RLHF fine-tuning.
I wonder if a kind of SETI@home approach could work, although I'm guessing the limited VRAM in most consumer cards compared to an H100, as well as the much slower "virtual WAN interconnect" versus the Mellanox goodies that Nvidia clusters enjoy, would be too big an obstacle?
Even if you could get that to work, how many people would be willing to run their >300W GPUs at full tilt 24/7 in order to contribute to the training cause? You would basically be asking people to deal with the logistics of running a cryptocurrency mining operation but without the prospect of getting paid for it.
Depends on the logistics. If I were confident about the security, I wouldn't mind letting my GPU participate in a distributed effort to significantly improve an open source model. This should be a few dollars a month on my power bill, not dozens or hundreds of dollars, especially if I undervolt.
Now, I don't know of any distributed training technique that will make a significant impact on improving a model, and that security component is a big "if". But if something promising comes along, I'd bet lots of people would be willing to donate some GPU time, especially if it were easy to set up.
I would add “in their current form” and agree. There are three things that can change here:

1. Moore's Law: the worldwide economy is built around the steady progression of cheaper compute. Give it 36 months and your $100,000 problem becomes a $25,000 problem.
2. Quantization and smaller models: there will likely be specializations of the various models (is this the beginning of the “monolith vs. microservices” debate?).
3. End-to-end training isn't for everyone: fine-tunes and alignment are more important than an end-to-end training run, IF we can coerce the behaviors we want into the models by fine-tuning them. That, along with quantized models, (imho) unlocked vision models, which are now in the “plateau of productivity” of the Gartner hype cycle compared to a few years ago.
So as an example today, I can grab a backbone and pretrained weights for an object detector, and with relatively little data (a few lines to a few tens of lines of code, and 50 to 500 images) and relatively little wall-clock time and energy (say 5 to 15 minutes on a PC), I can create a customized object detector that detects -my- specific objects pretty well. I might need to revise it a few times, but it'll work pretty well.
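To make that concrete, here's a minimal sketch of the kind of fine-tune I mean, using torchvision's COCO-pretrained Faster R-CNN; `my_loader` and the class count are placeholders for your own small dataset:

```python
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Grab a backbone + pretrained weights (COCO) for the detector.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")

# Swap the classification head for our own classes, e.g. 2 custom objects + background.
num_classes = 3
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=0.005, momentum=0.9
)

# `my_loader` is a placeholder DataLoader yielding (images, targets) in
# torchvision's detection format, over the 50-500 labeled images.
model.train()
for epoch in range(5):                      # a few minutes on a single consumer GPU
    for images, targets in my_loader:
        loss_dict = model(images, targets)  # detection models return a dict of losses
        loss = sum(loss_dict.values())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```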
Why would we not see the same sort of progression with transformer architectures? It hinges on someone creating the model weights for the “greater good,” or on us figuring out how to do distributed training for open source in a SETI@home style (long live the blockchain, anyone?).
Yeah, there's no accounting for breakthroughs in training efficiency. I wouldn't count on Moore's Law though; the amount of compute you can put into these problems is effectively unbounded, so more efficient silicon just means those with money can train even bigger models. 3D rendering is a decent analogy: Moore's Law has made it easy to render something comparable to the first Toy Story movie, but Pixar poured those gains back into more compute and is using it to do things you definitely can't afford to.
1 GWh is 1 million kWh; multiplied by $0.10/kWh that should give $100k in energy costs?
Yes, thanks. I had assumed I had been off by a factor somewhere. Yet $100k seems small -- the total cost of production is in the $10M+ range.
$100k is small, but you only get away with $100k if you nail everything perfectly the first time around, something that we all know does not really happen. I think compiling is a good parallel to training: imagine if compiling your whole software project from scratch cost $100k. Sure, there are incremental builds etc., but the cost is steep no matter which way you look at it.
Assuming a $30k GPU with 3-year depreciation, it's an additional ~$1.14/h. Much more than the energy.
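Same back-of-the-envelope style, with the hardware figure assumed ($30k purchase, straight-line over 3 years) next to the earlier energy assumptions (200W, $0.10/kWh):

```python
gpu_price = 30_000                 # assumed purchase price
hours_over_3_years = 3 * 8_760

amortized_per_hour = gpu_price / hours_over_3_years   # ~$1.14/h of hardware
energy_per_hour = 0.200 * 0.10                        # ~$0.02/h of electricity

print(amortized_per_hour, energy_per_hour)
```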
Thanks for the figures. I suppose with expenses like that, they will be motivated to research methods of updating models which have already been trained.
Edit: I see the price was updated
Things like Petals (https://github.com/bigscience-workshop/petals) exist: distributed computing over willing participants. Right now corporate cash is being rammed into the space, so why not snap it up while you can, but the moment it dries up, projects like Petals will see more of the love they deserve.
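To be clear, Petals today is aimed at inference and light fine-tuning over a volunteer swarm rather than full pretraining. A rough sketch of what using it looks like, based on its README (class names and the model ID may differ between versions, so treat this as an approximation):

```python
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

model_name = "petals-team/StableBeluga2"   # a model hosted on the public swarm
tokenizer = AutoTokenizer.from_pretrained(model_name)

# The model's transformer blocks are served by volunteer GPUs across the swarm;
# only the embeddings and output head run locally.
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Distributed training over volunteer GPUs is", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0]))
```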
I envision a future where crypto-style booms happen over tokens useful for purchasing priority computational time, which is earned by providing said computational time. This way researchers can daisy-chain their independent smaller rigs together into something with gargantuan capabilities.