
Training LLMs from ground zero as a startup

twelfthnight
24 replies
1d21h

To be very frank, I would have to say the quality of codebases externally significantly lag behind those I’ve been used to at Google

Haven't worked at Google; does anyone else share this sentiment? I always feel like Google code is typically not idiomatic, and it's super difficult to go "under the hood" if anything isn't precisely on the happy path.

winwang
13 replies
1d21h

(not googler)

Google's codebase is idiomatic to Google due to their strict language tooling, e.g. their C++ code stays away from advanced features. The tooling teams at Google have a very strong say.

twelfthnight
11 replies
1d20h

I get that sense too. Probably does work awesome if you're inside. But man, it's a mess when they externalize stuff. Just one example: their cloud platform CLI includes an entire Python installation and takes 1.7GB on disk, just to make API calls...

jen20
9 replies
1d20h

I have never understood why cloud providers seem to think it is OK to write their CLIs in Python. The AWS one is too, and the Azure one went from Node.js to Python some time ago.

anonymous-panda
5 replies
1d20h

Packaging and stability reasons. Same for why it's a 1.7GB install - probably where they landed after having tons of support issues on some random Python version they didn't test, or some issue with a dependency. Freezing the entire set of artifacts is more stable, and Python lets you move pretty quickly. I can't speak to why Node.js vs Python though - maybe Python is easier to embed?

pests
2 replies
1d20h

What? They only get packaging and stability because they include the runtime. If they just went with a compiled language they could distribute native binaries and have actual packaging and stability.

anonymous-panda
1 replies
1d20h

Yes, but it's not just a single metric. Another is how easy it is for them to hire productive members of the team and how much that costs them - middling Python developers churning out fine-ish code are cheaper than Rust developers doing the same. It's hard to find a language where you can be as productive as a developer in Python that also has AOT compilation to generate standalone binaries.

TL;DR: there are multiple factors to consider here, and it's more interesting to understand the pressures that cause the decisions, especially if you want to try to create a world where different decisions are made.

jen20
0 replies
1d6h

It’s hard to find a language where you can be as productive as a developer in Python that also has AOT compilation to generate standalone binaries.

Outside specific cases around machine learning, it's really not: Go is that language. It's not like each of those platforms doesn't have to have a similar team that understands Go anyway (for their SDK), so they could save their customers the abject pain of Python dependency management by just writing their CLIs using it.

twelfthnight
0 replies
1d20h

Yeah, I imagine that was the decision calculus. "Instead of spending some more effort on a different language to save millions of unnecessary downloads of Python's runtime, let's just bundle Python!"

I wouldn't be surprised if it was version 2.7 too...

jen20
0 replies
1d14h

Of course, writing them in Go would solve all of these problems while producing packages which are much smaller.

twelfthnight
1 replies
1d20h

There probably is a sense in which the APIs are constantly changing, so maybe an interpreted language might make sense? I imagine there has to be a better way to do this with Go or Rust though (even Lua?) for a smaller binary.

candiodari
0 replies
1d19h

Google Python binaries are more akin to Docker or even VM images, even if the actual technology used predates Docker and even Linux VMs. They contain something like a slimmed-down Linux distribution, not just a binary.

EXTREME predictability (e.g. never ever using the system's libssl), in exchange for huge binaries. They go pretty damn far in this: you won't catch a Google binary even using most of libc.

jyap
0 replies
1d20h

It makes "sense" based on the domain of the cloud provider being DevOps teams who are maintaining and using these CLI tools, i.e. what they use day to day.

For anything more advanced they offer language-specific SDKs in Rust, Swift, Kotlin, etc…

For example integrating storage in an iOS app.

marcyb5st
0 replies
1d20h

Did you install all the components? Because if so, you also installed emulators for Pub/Sub and Bigtable (maybe others, I don't remember), which explains the big footprint.

dheera
0 replies
1d20h

e.g. their C++ code stays away from advanced features

Which honestly is a GOOD thing because it would make it much easier for newcomers to ramp up on existing codebases. Most people aren't used to working with spaceships and constexprs.

Readability is also far more valuable to a large team than efficiency for anything that isn't a number-crunching loop.

titanomachy
2 replies
1d20h

I thought the quality was pretty high, largely because there were a lot of rails constraining how code should be written. Most of the code I dealt with was written using somewhat rigid (but generally well-designed) frameworks with programmatically-enforced style guides.

Also, most work seemed to involve some balance of junior and more experienced people, which helped keep quality higher. Outside of Google, I've seen pretty large projects written by new grads with little supervision (and on a tight timeline). Those codebases can be pretty hairy.

twelfthnight
0 replies
1d19h

That honestly does seem like a recipe for good code. And sure, there's tons of open source out there of dubious quality.

@resource0x in a sibling comment made the point that it's possible to write great code even if the program is a flawed design. I'm probably conflating those things.

rokkitmensch
0 replies
1d15h

The thing that impressed me most about Google was the encoding-of-cultural-norms-in-various-CI-jobs.

It lets them extract usable SWE horsepower from pretty much anyone who steps inside and at least tries to be useful and not just coast. They can ingest a startup engineer, someone who's been a mid-tier enterprise codemonkey, your mythical 10xer, the whole statistical gamut.

renegade-otter
2 replies
1d20h

"Externally", no one could possibly beat Google's track record of not committing to products before finally killing them. But the code was beautiful, though!

twelfthnight
1 replies
1d20h

I mean, was Angular ever "beautiful"?

resource0x
0 replies
1d20h

Pretty sure it was. A lousy idea might still be implemented beautifully under the hood. :-)

ein0p
2 replies
1d12h

A recent ex-googler here: quality of Google3 in general is pretty good, but the LLM training bits are so abysmal that I know people who have resigned instead of working on it. And it’s also extra slow because getting a couple local GPUs is not really an option. So you’re forced to “develop in Colab” which works for some things and not for others and in general sucks ass if you’re working on anything substantial. For anything more substantial you’ll be launching stuff on some resource pool, waiting for like 10-15 minutes until it starts (much longer for large models), and then trying to divine why it failed from voluminous and sometimes indecipherable crash logs which also hang your browser when cluster UI tries to load them.

Rumors of Google’s AI code superiority are vastly overblown in 2024. I’m currently at another major AI lab, and the code here can actually be understood and worked on, which I consider to be a massive advantage.

alsoworkedthere
1 replies
1d9h

Finally, an accurate portrayal!

Google has superb robustness and code quality, with garbage-level usability. Once you're set up, you can kick off many massive training jobs and compare results easily. However, getting to that point is really hard. You'll never figure out how to use the ML infrastructure and libraries on your own. You can only get it to work by meeting with the teams that wrote the infra so they can find and fix every error and misconfiguration. Usually, there is one single way to get things working together, and neither the documentation nor the error messages will get you to that brittle state.

It's near impossible to get a VM with a TPU or GPU attached, so there's no way to debug issues that happen between the library and the accelerator. Plus somehow they've made Python take longer to build (??!!) and run than C++ takes, so your iteration cycle is several minutes for what would take seconds at any other place. Fun stuff! Somehow it's still one of the best places to do ML work, but they sure try to make it as difficult as possible.

ein0p
0 replies
1d2h

Google doesn’t use VMs internally to run workloads. But yeah, seconds-long dev iteration cycles take minutes or even tens of minutes there.

danans
0 replies
1d19h

Haven't worked at Google, anyone else share this sentiment?

I worked there, and the quality is definitely much higher and the code tends to be far more maintainable. However, there is often a cost for that, which is velocity.

Some of this is reduced by the sheer amount of automation in tooling (i.e. bots that block style violations and common bugs before a code change is submitted).

In other cases, it slows things down quite a bit.

joe_the_user
20 replies
1d21h

So essentially a startup in this context has a small number of people and a large amount of money for training clusters. The article describes many operations leasing servers - which you can assume serve many startups (or existing firms).

So it seems like you have the various LLM creators all doing roughly the same sort of thing (training with text and image data) with similar hardware and similar data. Each of these naturally has their own brand of "secret sauce" for distinguishing their venture. The various secret sauces can make a difference in the quality of an LLM's output.

Yet overall, this seems like a massive, energy intensive exercise in redundancy.

dauertewigkeit
11 replies
1d21h

I don't think most of them have any kind of secret sauce. I think the founders hope to get bought out simply for being able to train "near-SOTA" LLMs. I guess achieving that level of skill and infra could be valuable enough to build upon.

joe_the_user
9 replies
1d20h

Sure, that's also a factor but I'd say it reinforces my main point.

DeepChill
8 replies
1d18h

Good point, so the only real differentiator would be the size and quality of the data being fed in and the fine-tuning done on the model? I wonder what else differentiates LLMs from each other.

Iulioh
6 replies
1d18h

Alignment and censorship?

pests
5 replies
1d17h

Alignment just means making it do what you want. LLMs just continue the sequence; the chat question-and-response style we have now is an example of alignment (to what humans want).
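
To make the "just continue the sequence" point concrete, here's a rough sketch (assuming a Hugging Face chat-tuned checkpoint; the model name is only an illustration): a chat turn is ultimately rendered into one flat string that the model keeps completing token by token.

    from transformers import AutoTokenizer

    # Illustrative model; any chat-tuned checkpoint with a chat template behaves the same way.
    tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

    # The "chat" is just a list of messages...
    messages = [{"role": "user", "content": "What is the capital of France?"}]

    # ...which gets flattened into plain text that the model simply continues.
    prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    print(prompt)  # e.g. "<s>[INST] What is the capital of France? [/INST]"

The alignment fine-tuning is what makes the continuation of that string look like a helpful answer rather than arbitrary text.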

eru
4 replies
1d13h

Alignment can mean making sure your LLM doesn't continue the sequence in embarrassing ways, e.g. by spouting politically incorrect sequences of words (even though those might have been common in the training data).

friendzis
3 replies
1d11h

In what way does this do more good than harm?

eru
2 replies
1d4h

In the sense of people caring about their models not saying embarrassing things?

Different people have different goals, and they don't necessarily align with yours.

friendzis
1 replies
11h43m

Since the entity releasing the model obviously has certain goals, aligning/censoring the model in some ways is good for their particular short-term goals.

In the grand scheme, these alignments are harmful as they impose a reality distortion field. Authors create a model of what language is and then contort that model to fit an opinionated idea of what language should be. Smells a bit Orwellian, right?

eru
0 replies
6h57m

Smells a bit Orwellian, right?

No, seems perfectly fine to me. You are already shaping your results by your selection of training data. E.g. do you want to train a model that speaks English, or German, or both? Do you want to run your training data past a spam filter first? Do you want to do a character-based model, or one of those weird encodings that are popular with LLMs these days?

Doing some other procedures afterwards to make sure your LLM doesn't say embarrassing things is small fries by comparison.

Also it's good practice for trying to get alignment with more important values (like "don't kill all humans") later when models might get powerful enough to be able to kill all humans.

Playing some little games where OpenAI tries to keep you from making their model say embarrassing things, and people keep trying to make it say embarrassing things, is a good low stakes practice ground.

llm_trw
0 replies
1d17h

Also getting a golden ticket.

Goliath 120B is still the best open-source model and no one knows why, since it's just two Llama 2 70Bs glued together.

imtringued
0 replies
2h49m

There was a guy with zero computer science skills who followed a tutorial on how to fine-tune Mistral with DPO, and his model ended up at the top of the Hugging Face leaderboard among the open-source models with 7 billion parameters. Some random guy managed to outdo the creators of the LLM.

PeterStuer
3 replies
1d11h

"this seems like a massive, energy intensive exercise in redundancy"

This is commonly referred to as a market working as intended. Yes, the waste from this type of redundancy can be massive, especially if you realize that ultimately just a tiny percentage of these efforts will result in even moderate success. But it is the price to pay at the edge of progress. A planned monopoly might be more efficient (despite popular banter that just compares a megacorp or a government, which is basically the same thing, to a single successful startup while ignoring the 999 that tried and failed), but those seldom beat a market on innovation.

polygamous_bat
2 replies
1d4h

This is commonly referred to as a market working as intended.

Is it? It seems like the market is unable to separate the wheat from the chaff and is just throwing money around hoping to hit the jackpot. While AI has a massive chance of affecting our lives, the investment market paints a pretty similar picture to what happened during the crypto boom.

manquer
0 replies
21h57m

Is it any different from evolution?

PeterStuer
0 replies
1d

Our inability to predict future success from failure is exactly why we have (massively inefficient) markets outcompeting centralized planned approaches.

llm_trw
1 replies
1d17h

Yet overall, this seems like a massive, energy intensive exercise in redundancy.

Keep in mind that this is also chaff to distract people from the real secret sauce. I imagine that just as many startups are hiring writers and photographers to create extremely well-labelled, uncontaminated data for training.

One need only look at the perverts over at Civitai to see how far you can go with intensive labeling on a tiny compute budget.

fennecbutt
0 replies
1d

Us furries were properly tagging data on e6 for a long time before LLMs came about.

samus
0 replies
1d11h

There are not that many of these startups, actually. Most LLM use cases can be backed by a fine-tune of an off-the-shelf foundation model. If you're training foundation models from scratch, you're entering a difficult-to-monetize market where the big boys could eat your lunch by just releasing a new foundation model that might be able to do more than 95% of what yours does.

doctorpangloss
0 replies
1d17h

Maybe it’s simpler than that. Instead of spending money on compute that costs X and that cloud providers charge 20*X for, they could spend the money creating training data, but that story is way too hard to tell to investors.

abeppu
14 replies
1d21h

It's worth taking a second to note that the author just assumes that readers understand "the wilderness" to mean "not Google".

This post gives a lot of credit to Google's infra and hardware teams, and I'd love to read a perspective from one of those insiders who then went on to do related work elsewhere.

ganeshkrishnan
7 replies
1d18h

The OP mentions the failure rate of GPUs: "If this were in GPU land, it would have failed within the first few days for sure."

In my humble opinion, we have never had GPU failures, even for large-scale training. Our current training batch job is a 20GB JSON file which takes 6 hours just to load, and it has been running for more than 15 days without a hiccup. And we are using the older Tesla T4.

GPUs have memory constraint issues, but if you can plan and work around them, I haven't seen them crash in real life.

shrubble
3 replies
1d12h

Have you checked if there is a faster way to parse your JSON? ~3 GB/hour to load a file seems slow on today's CPUs...
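
For example, a streaming parser avoids materializing the whole document at once. A rough sketch, assuming the file is a single top-level JSON array of records and the ijson package is available (the "train.json" path is just a placeholder):

    import time
    import ijson  # incremental JSON parser: yields records without holding all 20GB in memory

    def stream_records(path):
        # Assumes the file is one top-level JSON array of records.
        with open(path, "rb") as f:
            for record in ijson.items(f, "item"):
                yield record

    start = time.time()
    count = sum(1 for _ in stream_records("train.json"))  # placeholder path
    print(f"streamed {count} records in {time.time() - start:.1f}s")

Even without changing parsers, converting the file once to a binary or sharded format would likely beat re-parsing 20GB of JSON on every run.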

flybarrel
2 replies
1d2h

What would be an ideal (or more appropriate) speed?

shrubble
0 replies
20h41m

Well, it would depend on the specifics of the JSON file, but eyeballing the stats at https://github.com/miloyip/nativejson-benchmark/tree/master seems to indicate that even on a 2015 MacBook, parsing with e.g. the Configuru parser proceeds at several megabytes per second.

teaearlgraycold
0 replies
1d16h

Ha! We’re also committing great sins of computation against T4s at our company. Hopefully, as I learn, things get less janky.

nl
0 replies
1d5h

20GB json file… takes 6 hours just to load

Err you definitely should be doing something about that.

20GB on T4s (how many?) isn’t really comparable to terabytes on thousands of A100s.

gwern
0 replies
1d16h

And we are using the older Tesla T4.

That's an undemanding and well-debugged chip by this point (it's 6 years old!). So you aren't experiencing any of the pain that people using A100s or H100s (never mind people who have to stand up clusters with B100s soon) are going through now.

choppaface
2 replies
1d18h

Really telling quote:

I was completely taken aback by the failure rate of GPUs as opposed to my experiences on TPUs at Google

Should be "I was completely unaware of the failure modes of GPUs, because all my career I've been inside Google and used Google TPUs and was well-acquainted with those failure modes."

I've used GPUs mostly, and when I tried TPUs the jobs failed all the time for really hard-to-debug reasons. Often the indirection between the x86 chip and the TPU device caused hours of hair-pulling, stuff you never get with x86+nvidia+pytorch.

10-15 years ago, Google minted many $10m+ data scientists (aka Sawzall engineers) who also ventured "into the wilderness" and had very similar reactions. This blog post is much more about the OP hyping his company and personal brand than contributing useful notes to the community.

quadrature
0 replies
1d10h

I think the OP is referring to hardware failures rather than software not playing well together.

StarCyan
0 replies
1d10h

When was this? I use JAX+TPUs to train LLMs and haven't experienced many issues. IMO it was way easier to set up distributed training, sharding, etc compared to Pytorch+GPUs.

lambersley
0 replies
1d4h

Agreed. It reads like Seven of Nine realizing she's separated from the Collective and needs to rely on lowly human capabilities. The insights into vendors were informative.

joe_the_user
0 replies
1d18h

I took the phrase to mean "outside any large company". It seems like a fairly obvious metaphor; if you have a startup working on a large-scale infrastructure project, you have to set up your own logistics, just like a camp in the literal wilderness.

flybarrel
0 replies
1d2h

Newbie question - what happens when an LLM training job experiences a hardware failure? I don't suppose you lose all the training progress, do you? Then the pain is mostly in diagnosing the problem and getting the cluster running again, but no need to worry about data loss, right?

pama
9 replies
1d22h

Training LLMs from scratch is a super important issue that affects the pace and breadth of iteration of AI almost as much as the raw hardware improvements do. The blog is fun but somewhat shallow, and not technical or very surprising if you've worked with clusters of GPUs in any capacity over the years. (I liked the perspective of a former Googler, but I'm not sure why past colleagues would recommend JAX over PyTorch for LLMs outside of Google.) I hope this newco eventually releases a more technical report about their training adventures, like the PDF file here: https://github.com/facebookresearch/metaseq/tree/main/projec...

axpy906
8 replies
1d20h

If you’re doing research JAX makes some sense. Probably some Google bias in there too.

lyapunova
7 replies
1d19h

To be honest, most researchers in applied ML in the Bay Area say the opposite. If you are trying to be nimble and prototype, use PyTorch. If you're trying to gain some optimizations as you near deployment, rewrite in JAX.
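
As a rough illustration of what that optimization step buys you (just a toy sketch, nothing specific to the setups discussed here): once a model function works eagerly, jax.jit traces it and hands it to XLA, which compiles and fuses the ops into optimized kernels for the target accelerator.

    import jax
    import jax.numpy as jnp

    def mlp(params, x):
        # Toy two-layer MLP forward pass.
        w1, b1, w2, b2 = params
        h = jax.nn.relu(x @ w1 + b1)
        return h @ w2 + b2

    key = jax.random.PRNGKey(0)
    k1, k2 = jax.random.split(key)
    params = (
        0.02 * jax.random.normal(k1, (512, 1024)), jnp.zeros(1024),
        0.02 * jax.random.normal(k2, (1024, 512)), jnp.zeros(512),
    )
    x = jnp.ones((32, 512))

    fast_mlp = jax.jit(mlp)    # trace once, compile with XLA
    _ = fast_mlp(params, x)    # first call pays the compilation cost
    out = fast_mlp(params, x)  # later calls reuse the compiled kernels

The same tracing machinery is what makes it straightforward to shard the computation across devices later, which is roughly what people mean by "rewrite in JAX as you near deployment".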

plumeria
4 replies
1d18h

Where does TensorFlow stand in this?

axpy906
2 replies
1d17h

Somewhere next to Theano, MXNet, or Caffe.

plumeria
0 replies
1d17h

So, obsolete?

omneity
0 replies
1d16h

What about Keras?

rockinghigh
0 replies
1d16h

TensorFlow has been falling behind since they stopped caring about backward compatibility. PyTorch is the leading framework. JAX is getting some traction at Google and was used to train Gemini.

pama
0 replies
1d11h

Interesting perspective about possible JAX optimizations. Assuming these models are trained and deployed on non-TPU hardware, are there any real advantages in using JAX for deployment on GPU? I'd have assumed that inference is a largely solved optimization problem for large transformer-based models (with any low-hanging fruit from custom CUDA code already written) and that the details are shifting towards infrastructure tradeoffs and availability of efficient GPUs. But I may be out of the loop with the latest gossip. Or do you simply mean that maybe there exist cases where TPU inference makes sense financially and using JAX makes a difference?

axpy906
0 replies
1d17h

Interesting. I've never heard that. I could see that argument going both ways, as PyTorch has the larger ecosystem and is the one most often used in publications.

swyx
3 replies
1d21h

(update: i submitted this yesterday and it didn't get traction, i guess @dang must've merged the old submission in here. you really didn't have to, but it's a nice gesture. thanks dang!!)

axpy906
2 replies
1d20h

Great to see you on here. Love the Latent Space podcast.

swyx
1 replies
1d18h

aw thank you for listening. some weeks it's very much a labor of love lol.

no events planned near term but come to the big shindig in june https://ti.to/software-3/ai-engineer-worlds-fair . last year's summit was the first time i really understood how much of a reach we have and how many good AI people we've managed to gather as friends.

dwaltrip
0 replies
1d16h

I love it as well, it’s a fantastic resource :)

3abiton
1 replies
1d15h

Is he the person behind the Yi LLM model?

bigcat12345678
0 replies
1d14h

No, the Yi LLM models are from [0], Kai-Fu Lee's LLM startup.

[0] https://www.lingyiwanwu.com/

davidmurdoch
0 replies
1d15h

Acceptable, but maybe not perfectly.

dotancohen
1 replies
1d10h

Yes, the title sounds like somebody confused two idioms. That's not the type of author from whom I want to learn.

frozenseven
0 replies
1d8h

1. As others have pointed out, it's a perfectly valid idiom. Check a dictionary.

2. How do you think idioms are created in the first place?

3. What exactly forces you to act like this?

makoto12
0 replies
1d10h

Could be intentional, implying LLMs are a proverbial nuclear bomb to the tech landscape. But honestly, it threw me as well.

a_bonobo
1 replies
1d13h

But what is the product they're selling?

The main Reka.AI page looks like a regular ChatGPT clone, an LLM you pay for by the token. How is this different from all these other companies? Pricing seems to be comparable to GPT-3.5 Turbo.

polygamous_bat
0 replies
1d12h

Perhaps a cure for venture capitalist FOMO for not having invested in AI?

TrackerFF
1 replies
1d11h

The big question is, how do small startups manage to get funding for LLM products if they don't have the "correct" background/pedigree?

The world of LLM startups is beginning to look like the world of hedge funds and private equity firms - where the prerequisites for seed/funding are:

A) Prestigious employment history / correct pedigree.

B) Solid network of investors ready to jump before any product has even begun.

nlpnerd
0 replies
13h36m

They don't. This is probably one reason why VCs invest in these companies. There is a natural moat, since only a very finite number of people in the world have the right experience to raise, and only those who can raise can ever gain the experience.

At least until compute costs drop to a cheap enough level...

yalok
0 replies
1d21h

All in all, this is only a small part of the story of how we started a company, raised some money, bought some chips and matched Gemini pro/GPT 3.5 and outperformed many others in less than a year having to build everything from scratch.

I wonder what the budget spent on the chips/cloud GPUs was to achieve a GPT-3.5-level LLM - at least to an order of magnitude - $2-5 million?

tkgally
0 replies
1d5h

I learned about reka.ai from this post; their LLMs don’t seem to have been discussed much on HN yet [1]. So, out of curiosity, I spent the last hour testing prompts with their chat interface [2] in comparison with ChatGPT 4, Gemini Advanced, Claude 3, and Mistral Large. I put the results at [3]. Overall, Reka Flash doesn’t seem significantly worse or better than the others. A lot more testing would be necessary to be sure, of course.

[1] https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...

[2] https://chat.reka.ai/chat

[3] https://gally.net/temp/20240307llmcomparison.html

stealthcat
0 replies
1d18h

They should list most of the technical debt accumulated so far and rank it. At this stage, lots of corners have been cut.

rvz
0 replies
1d10h

Then what happens when the LLM or AI performs worse than expected? Spend more money fine-tuning?

By the time you get it all working, not only have you spent lots of your VC capital on training alone, but your competitors (Google, Meta, etc.) have already released a more powerful model, much better and quicker than you, before you could even run your second training epoch.

Another example of a startup incinerating VC money on a pump-and-dump scheme for vaporware AI snake oil.

julianh65
0 replies
1d15h

So which compute providers have folks had a good experience with?

hackerlight
0 replies
1d14h

In the end it took us only a very small number of smaller scale & shorter ablation runs to get to the strong 21B Reka Flash and 7B edge model (and also our upcoming largest core model). Finding a solid recipe with a very limited number of runs is challenging and requires changing many variables at once given the ridiculously enormous search space. In order to do this, one has to abandon the systematicity of Bigtech and rely a lot on “Yolo”, gut feeling and instinct.

Thankfully, I (and many of us in the team) have built up this intuition quite a bit in our ML careers to get it right within a substantially short amount of tries. While we’ve trained really good models before in our previous jobs, differences in training infrastructure, data, incorporation of new ideas and other environmental issues can still cause non-trivial differences in outcomes. That said, a strong prior helps to significantly cut down the search space and is probably one of the easiest explanations to why we were able to train really strong models with so few trials, resources and experimentation.

egberts1
0 replies
1d4h

TL;DR: LLM training is highly susceptible to GIGO.

(GIGO is what one gets when feeding an LLM "G"arbage "I"n: "G"arbage "O"ut.)

This is the current problem with making a vaccine signature fit like a glove ... as tight as possible ... when populating the anti-malware (i.e. IDS/IPS/NDS/XNS) search-pattern engine for use by Aho-Corasick-variant algorithms (such as Parallel Failureless Aho-Corasick).

However, an LLM as a binary-code-based detector for malware has very limited benefit (it is there, but only as a backend topical add-on after all other conditionals have been identified).

An LLM lacks qualifying conditionals surrounding the premise data, and I have my doubts about using LLMs for medical diagnosis as well: at least until we start having LLMs denote the much-needed weighted combo-conditionals as percentages.

classified
0 replies
1d13h

Absorbing the risk of copyright and license violations en masse for the training data as a service?

bo1024
0 replies
1d19h

This is very interesting, but I really want to hear about the training data process!

LZ_Khan
0 replies
1d17h

I wish I knew how to do yolo runs.

- signed, a compute resource hog at FAANG