In the span of a few months, with a small team of researchers and engineers, we trained a 70B parameter model from scratch on our own infrastructure that outperformed zero-shot GPT-4o on reasoning-related tasks. Using our cluster for high-performance training meant that every component — InfiniBand, Ethernet, GPUs, and the nodes themselves — had to work perfectly. If even one of the more than 12,000 connections was a little flaky, it could slow down the entire training run.
We're sharing open-source scripts and an end-to-end guide for infrastructure set-up that details the process of making everything work perfectly, and ensuring that it stays that way.
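As a rough illustration of what that means in practice (a minimal sketch, not the actual check from the guide), a per-node InfiniBand port check might look something like the Python below; it assumes the standard `ibstat` tool is installed and uses a placeholder EXPECTED_RATE_GBPS for your fabric's link speed:

    #!/usr/bin/env python3
    # Minimal sketch (not the actual script from the guide) of a per-node
    # InfiniBand port check. Assumes the standard `ibstat` tool is installed.
    # EXPECTED_RATE_GBPS is a placeholder; set it to your fabric's link rate.
    import re
    import subprocess
    import sys

    EXPECTED_RATE_GBPS = 400  # e.g. 100 / 200 / 400 depending on the fabric

    def check_ports():
        """Parse `ibstat` output and return a list of suspect ports."""
        out = subprocess.run(["ibstat"], capture_output=True, text=True, check=True).stdout
        problems, ca, port, state, phys, rate = [], None, None, None, None, None

        def flush():
            # Record any issue with the port we just finished parsing.
            if port is None:
                return
            if state != "Active":
                problems.append(f"{ca} port {port}: state {state!r} (expected 'Active')")
            if phys != "LinkUp":
                problems.append(f"{ca} port {port}: physical state {phys!r} (expected 'LinkUp')")
            if rate is not None and rate < EXPECTED_RATE_GBPS:
                problems.append(f"{ca} port {port}: rate {rate} Gb/s (expected {EXPECTED_RATE_GBPS})")

        for line in out.splitlines():
            line = line.strip()
            if m := re.match(r"CA '(.+)'", line):
                flush()
                ca, port, state, phys, rate = m.group(1), None, None, None, None
            elif m := re.match(r"Port (\d+):", line):
                flush()
                port, state, phys, rate = m.group(1), None, None, None
            elif m := re.match(r"State: (\S+)", line):
                state = m.group(1)
            elif m := re.match(r"Physical state: (\S+)", line):
                phys = m.group(1)
            elif m := re.match(r"Rate: (\d+)", line):
                rate = int(m.group(1))
        flush()
        return problems

    if __name__ == "__main__":
        issues = check_ports()
        print("\n".join(issues) or "all InfiniBand ports look healthy")
        sys.exit(1 if issues else 0)

Run across every host, checks of this kind (plus burn-in and ongoing monitoring) are roughly what the released scripts automate end to end.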
This is one part of a three-part toolkit on training a 70B model from scratch. The other two parts cover evaluations and CARBS, our hyperparameter optimizer; you can find them here: https://imbue.com/research/70b-intro/
Thoughts and questions welcome! :)
It's an unusual enough sentence to be memorable, and I thought, "I've read this exact sentence before." Indeed, this and most of the writeup seem to appear word-for-word on Twitter, LinkedIn, and Reddit. Is this just spam?
https://x.com/imbue_ai/status/1805629547473518695
https://reddit.com/r/learnmachinelearning/comments/1dobgbs/t...
https://www.linkedin.com/posts/mattboulos_training-a-70b-mod...
lmao, I was thinking this was bullshit and you’ve cemented that position. We’ve entered the grifting stage of this AI cycle. Salut.
Having listened to the person who wrote this speak at length about the subject, it is not BS or grifting.
I'd rather a company copy-and-paste the same text in multiple places than have each place get an obfuscated rewording of the same information so it appears novel each time (in which case I'd have to read all of them to realize they're all just the same info).
This is the kind of criticism that could only come from someone without much formal writing experience.
This is a very normal workflow: you write a full-length text detailing the project you worked on. You then trim it down to a summary that you share with one group of people, X. Then you trim it into a different summary that you share with another group, Y.
When you do this multiple times, you unsurprisingly end up with some sentences that make it into multiple summaries because they're that important to the thesis!
(Also, the summaries on Twitter and Reddit aren't anything close to "most of the writeup"—the full text is 6000+ words!)
The same company reports multiple times on a finding they've made through multiple social media channels? Shocking. /s
I don't understand your issue with this. Is it that they share their work in several places, or that they don't describe their work in a unique way every time?
I prefer this to the story about that time they went to Florence and their grandma made pizza for dinner and they got the recipe.
Eh, seems like legit marketing to me. Yes, they are trying to sell you something, but they are doing that by releasing non-trivial research and open source code.
Loved this and the level of detail - thank you. It's the best inside look at the engineering work behind these models I've ever read.
Two things I'm curious about: first, what difference, if any, would you imagine in training a 400B parameter model? It seems that you have plenty of VRAM across the cluster, but I want to know what you think.
Second, do you think this sort of architecture is the end game for model training? It seems sooo fragile. Are there better shared training mechanisms/architectures? Are there better cluster geometries?
Thanks again - great read.
What happened to the Minecraft-like 3d world your team built? Did you guys pivot?
Cool stuff! Does this do RLHF or just pretraining? If the latter, how did you manage to beat GPT 4?
Nice. Thanks for the write-up.