To be very frank, I would have to say the quality of codebases externally significantly lags behind those I've been used to at Google
Haven't worked at Google, anyone else share this sentiment? I always feel like Google code is typically not idiomatic, and it's super difficult to go "under the hood" if anything isn't precisely on the happy path.
(not googler)
Google's codebase is idiomatic to Google due to their strict language tooling. e.g. their C++ code stays away from advanced features. The tooling teams at Google have very strong say.
I get that sense too. Probably does work awesome if you're inside. But man it's a mess when they externalize stuff. Just one example: their cloud platform CLI includes an entire python installation and takes 1.7G on disk, just to make API calls...
I have never understood why cloud providers seem to think it is OK to write their CLIs in Python. The AWS one is too, and the Azure one went from Node.js to Python some time ago.
Packaging and stability reasons. Same for why it's a 1.7 GB install - probably where they landed after having tons of support issues on some random Python version they didn't test, or with a dependency that broke. Freezing the entire set of artifacts is more stable, and Python lets you move pretty quick. I can't speak to why Node.js vs Python though - maybe Python is easier to embed?
What? They only get packaging and stability because they include the runtime. If they just went with a compiled language they could distribute native binaries and have actual packaging and stability.
Yes, but it's not just a single metric. Another is how easy it is for them to hire productive members of the team and how much that costs them - middling Python developers churning out fine-ish code are cheaper than Rust developers doing the same. It's hard to find a language where you can be as productive as a developer in Python that also has AOT compilation to generate standalone binaries.
Tldr: there’s multiple factors to consider here and it’s more interesting to understand the pressures that cause the decisions, especially if you want to try to create a world where different decisions are made.
Outside specific cases around machine learning, it's really not: Go is that language. It's not like each of those platforms doesn't have to have a similar team that understands Go anyway (for their SDK), so they could save their customers the abject pain of Python dependency management by just writing their CLIs in it.
Yeah, I imagine that was the decision calculus. "Instead of spending some more effort to save millions of unnecessary downloads of python's runtime using a different language, let's just bundle Python!"
I wouldn't be surprised if it was version 2.7 too...
Of course, writing them in Go would solve all of these problems while producing packages which are much smaller.
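To make the point concrete, here's a minimal sketch of what a cloud CLI subcommand could look like in Go. The host `compute.example.com` and the URL layout are placeholders I made up, not any real provider's API; the point is just that `go build` turns this into one self-contained static binary, no bundled runtime.

```go
package main

import (
	"flag"
	"fmt"
	"net/url"
)

// buildRequestURL assembles the REST endpoint for a hypothetical
// "list instances" call. Host and path shape are illustrative only.
func buildRequestURL(project, zone string) string {
	u := url.URL{
		Scheme: "https",
		Host:   "compute.example.com", // placeholder, not a real API host
		Path:   fmt.Sprintf("/v1/projects/%s/zones/%s/instances", project, zone),
	}
	return u.String()
}

func main() {
	// Flag parsing comes from the standard library; no dependencies to freeze.
	project := flag.String("project", "demo", "project id")
	zone := flag.String("zone", "us-central1-a", "compute zone")
	flag.Parse()
	fmt.Println(buildRequestURL(*project, *zone))
}
```

Cross-compiling is a one-liner too (`GOOS=linux GOARCH=arm64 go build`), which is roughly what people mean when they say Go sidesteps the whole bundle-the-interpreter problem.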
There probably is a sense in which the APIs are constantly changing, so maybe an interpreted language might make sense? I imagine there has to be a better way to do this with Go or Rust though (even Lua?) for a smaller binary.
Google python binaries are more akin to docker or even vm images, even if the actual technology used predates docker and even linux VMs. They contain something like a slimmed-down linux distribution, not just a binary.
EXTREME predictability (e.g. never ever using the system's libssl), in trade for huge binaries. They go pretty damn far in this: you won't catch a Google binary even using most of libc.
It makes “sense” based on the domain of the cloud provider being DevOps teams who are maintaining and using these CLI tools. Ie. What they use day to day.
For anything more advanced they offer language-specific SDKs in Rust, Swift, Kotlin, etc…
For example integrating storage in an iOS app.
Did you install all the components? Because if so you also installed emulators for Pub/Sub and Bigtable (maybe others, I don't remember), which explains the big footprint.
Which honestly is a GOOD thing because it would make it much easier for newcomers to ramp up on existing codebases. Most people aren't used to working with spaceships and constexprs.
Readability is also far more valuable to a large team than efficiency for anything that isn't a number-crunching loop.
I thought the quality was pretty high, largely because there were a lot of rails constraining how code should be written. Most of the code I dealt with was written using somewhat rigid (but generally well-designed) frameworks with programmatically-enforced style guides.
Also, most work seemed to involve some balance of junior and more experienced people, which helped keep quality higher. Outside of Google, I've seen pretty large projects written by new grads with little supervision (and on a tight timeline). Those codebases can be pretty hairy.
That honestly does seem like a recipe for good code. And sure, there's tons of open source out there of dubious quality.
@resource0x in a sibling comment made the point that it's possible to write great code even if the program is a flawed design. I'm probably conflating those things.
The thing that impressed me most about Google was the encoding-of-cultural-norms-in-various-CI-jobs.
It lets them extract usable SWE horsepower from pretty much anyone who steps inside and at least tries to be useful and not just coast. They can ingest a startup engineer, someone who's been a mid-tier enterprise codemonkey, yr mythical 10xer, the whole statistical gamut.
"Externally", no one could possibly beat Google's track record of not committing to products before finally killing them. But the code was beautiful, though!
I mean, was Angular ever "beautiful"?
Pretty sure it was. A lousy idea might still be implemented beautifully under the hood. :-)
A recent ex-googler here: quality of Google3 in general is pretty good, but the LLM training bits are so abysmal that I know people who have resigned instead of working on it. And it’s also extra slow because getting a couple local GPUs is not really an option. So you’re forced to “develop in Colab” which works for some things and not for others and in general sucks ass if you’re working on anything substantial. For anything more substantial you’ll be launching stuff on some resource pool, waiting for like 10-15 minutes until it starts (much longer for large models), and then trying to divine why it failed from voluminous and sometimes indecipherable crash logs which also hang your browser when cluster UI tries to load them.
Rumors of Google’s AI code superiority are vastly overblown in 2024. I’m currently at another major AI lab, and the code here can actually be understood and worked on, which I consider to be a massive advantage.
Finally, an accurate portrayal!
Google has superb robustness and code quality, with garbage-level usability. Once you're set up, you can kick off many massive training jobs and compare results easily. However, getting to that point is really hard. You'll never figure out how to use the ML infrastructure and libraries on your own. You can only get it to work by meeting with the teams that wrote the infra so they can find and fix every error and misconfiguration. Usually, there is one single way to get things working together, and neither the documentation nor the error messages will get you to that brittle state.
It's near impossible to get a VM with a TPU or GPU attached, so there's no way to debug issues that happen between the library and the accelerator. Plus somehow they've made Python take longer to build (??!!) and run than C++ does, so your iteration cycle is several minutes for what would take seconds at any other place. Fun stuff! Somehow it's still one of the best places to do ML work, but they sure try to make it as difficult as possible.
Google doesn’t use VMs internally to run workloads. But yeah, seconds-long dev iteration cycles take minutes or even tens of minutes there.
I worked there, and the quality is definitely much higher and the code tends to be far more maintainable. However, there is often a cost for that, which is velocity.
Some of this is reduced by the sheer amount of automation in tooling (i.e. bots that block style violations and common bugs before a code change is submitted).
In other cases, it slows things down quite a bit.