I've been thinking about how far we've come with large language models (LLMs) and the challenge of making them almost perfect. It feels a lot like trying to get a spaceship to travel at the speed of light.
We’ve made impressive progress, getting these models to be quite accurate. But pushing from 90% to 99.9999999% accuracy? That takes an insane amount of data and computing power. It's like needing exponentially more energy as you get closer to light speed.
And just like we can’t actually reach the speed of light, there might be a practical limit to how accurate LLMs can get. Language is incredibly complex and full of ambiguities. The closer we aim for perfection, the harder it becomes. Each tiny improvement requires significantly more resources, and the gains become marginal.
To get LLMs to near-perfect accuracy, we'd need an infinite amount of data and computing power, which isn't feasible. So while LLMs are amazing and have come a long way, getting them to be nearly perfect is probably impossible—like reaching the speed of light.
Regardless, I try to appreciate the progress we've made while staying realistic about the challenges ahead. What do you think? Is this a fair analogy?
But it still might be worth it. A 90% accurate model will successfully complete a task consisting of 10 subtasks only 0.9^10 ≈ 35% of the time, while a 99% accurate model will do so about 90% of the time, making the former useless but the latter quite useful.
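Quick sanity check of those numbers, assuming the subtasks are independent and per-step accuracy is all that matters (a minimal sketch in Python):

    # Whole-task success for a 10-subtask chain, assuming each subtask
    # succeeds independently with the model's per-step accuracy.
    for p in (0.90, 0.99):
        print(f"per-step {p:.2f} -> whole task {p**10:.1%}")
    # per-step 0.90 -> whole task 34.9%
    # per-step 0.99 -> whole task 90.4%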
Yes, but a 90% accurate model that's 10x faster than a 99% accurate one can be run 3x to achieve higher accuracy while still outperforming the 99% model, for most things. For the math to be in the big model's favor, there would need to be problems that it could solve >90% of the time where the smaller model was <50%.
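Rough math behind that, assuming independent runs and a plain majority vote over 3 samples (no verifier):

    from math import comb

    def majority_of_3(p):
        # P(at least 2 of 3 independent runs are correct)
        return comb(3, 2) * p**2 * (1 - p) + p**3

    for p in (0.9, 0.6, 0.4):
        print(f"single run {p:.0%} -> majority of 3 {majority_of_3(p):.1%}")
    # single run 90% -> majority of 3 97.2%
    # single run 60% -> majority of 3 64.8%
    # single run 40% -> majority of 3 35.2%

Plain voting takes 90% to about 97.2% for roughly a third of the big model's cost; to actually clear 99% on a single question you'd need more samples or some way of checking answers (best-of-3 with a reliable verifier is 1 - 0.1^3 = 99.9%). And below 50% per-run accuracy, voting makes things worse, which is where that >90% vs <50% crossover comes from.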
The problem with your premise is that you don't necessarily know when said 90% accurate model produces the right output.
You can think of multiple runs with the 90% model as fuzzy parity bits in an error correcting code, so you kind of can.
Will that 90% model give you a more accurate answer after 3 tries?
So far experiments say yes, with an asterisk. Taking ensembles of weak models and combining them has been shown to be able to produce arbitrarily strong predictors/generators, but there are still a lot of challenges in learning how to scale the techniques to large language models. Current results have shown that an ensemble of GPT3.5 level models can reach near state of the art by combining ~6-10 shots of the prompt, but the ensemble technique used was very rudimentary and I expect that much better results could be had with tuning.
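The simplest version of the technique looks roughly like this sketch (sample_model is a hypothetical stand-in for whatever sampling call you use with temperature > 0; real setups compare answers more carefully than exact string match):

    from collections import Counter

    def ensemble_answer(prompt, sample_model, n_samples=8):
        # Draw several independent completions and keep the most common answer.
        answers = [sample_model(prompt) for _ in range(n_samples)]
        return Counter(answers).most_common(1)[0][0]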
"Data" isn't an inexhaustible resource, and also isn't fungible in the way energy is. Of the thousands of languages in the world, a fair chunk don't even have writing systems, and some have very few speakers left. Many are lost forever. Now ask the best llm trained on "all the data" to translate some fragment of some isolate language not in its training set and not very related to existing languages. You can't improve on that task by adding more sentences in English or by combining with learning on other modalities.
Synthetic data is the answer. For example, see the TinyStories dataset (https://arxiv.org/abs/2305.07759).
If you give the model the dictionary and grammar book as in-context instructions, it can do pretty well.
“Gemini v1.5 learns to translate from English to Kalamang purely in context, following a full linguistic manual at inference time. Kalamang is a language spoken by fewer than 200 speakers in western New Guinea. Gemini has never seen this language during training and is only provided with 500 pages of linguistic documentation, a dictionary, and ~400 parallel sentences in context. It basically acquires a sophisticated new skill in the neural activations, instead of gradient finetuning.”
Synthetic data might be the answer if you're fine with any data, but I haven't come across many synthetic datasets that are of high quality, and if you want high-quality output from an LLM, I'm not sure TinyStories et al. can provide that.
Here is just one example from Tiny Stories (https://huggingface.co/datasets/roneneldan/TinyStories/viewe...):
Hardly a high-quality "story", and an LLM trained on data like that won't produce high-quality output no matter how much you train it.
Edit: Another example from TinyStories, just because of how fun they end up being:
Do people really expect to be able to train on this and get high quality output? "Garbage in, garbage out", or however that goes...
A smelly smell that smells... smelly.
It's grammatically correct. The point is getting correct grammar despite the content being semantic nonsense, and it's still not settled how small a model can get while managing that. GPT-2's grammar was atrocious.
Taken directly from the abstract:
The point of TinyStories isn't to serve as an example of a sophisticated model, but rather to show that the emergent ability of producing coherent language can happen at smaller scales, and from a synthetic data set, no less. TinyStories is essentially the language model equivalent of a young child, and it's producing coherent language -- it's not producing grammatically correct nonsense like the famous "colorless green ideas sleep furiously" phrase from Chomsky.
I'm not really sure what your personal experience has to do with the viability of synthetic data; it's already been proven to be a useful resource. For example, Meta directly stated this upon the release of their Llama 3 model:
https://ai.meta.com/blog/meta-llama-3-meta-ai-responsibility...
I think I agree with your analogy, but would say 99% rather than 99.99999%.
Beyond that, I'm not entirely sure what a "perfect" LLM would even be defined as.
That makes sense; it just gets harder and harder to get more accurate. Same as humans, I suppose :)
Past 99%, what does "more accurate" mean? I think it will vary from person to person and use case to use case, which is why I personally don't foresee a world where an LLM or any form of AI/ML is ever perfectly accurate.
I'm struggling to think of any medium that has ever reached 100% accuracy, so to target that for an ML algorithm seems foolhardy
How close to 100% is close enough?
Arithmetic and logic are close enough that cosmic rays are the limiting factor, and "computer" used to be a profession.
I agree with this, because it does seem that if the training data isn't 100% accurate, the model can never return 100% accurate results. Which I guess, as humans, we don't either, but as a committee, one MAY argue we could. I'm torn lol.
I think you are probably right, but if humans are at 99.9% (which seems very unlikely) I don't think it will be long before you can trust a model more than a human expert.
Really though, I think this line of thinking is better to revisit in 5 or so years. LLMs are still very new, and seemingly every day new optimizations and strategies are being found. Let's at least hit a plateau before assessing limitations.
You will never be able to trust an LLM more than a human expert, because the human expert will use the best available tools (for example, LLMs), will understand "what the client wants", and will put the data in the right context. At best the human expert and the LLM will be indistinguishable, but I really doubt it. And I think it will take a long time.
At least that's my opinion; we'll see what happens.
You're not wrong, but when that happens, does it still count as "a human expert" doing it? A chess grandmaster is capable of using Stockfish, but it's not their victory when they do.
I've said this before but (as a noob) I don't think cramming all human knowledge into a model is the correct approach. It should be trained enough to understand language so that it can then go search the web or query a database for answers.
The more certain the domain, the more that is possible. If you have a document database that you trust, great. For example a support desk's knowledge base. And especially if you have an escape valve: "Did this solve your problem? If not, let's escalate this to a human."
But if you are searching the Internet, you'll find multiple answers — probably contradictory — and the next step is to ask the model to judge among them. Now you want all the intelligence you can muster. Unless you really trust the search engine, in which case yeah a small model seems great.
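To make the support-desk case concrete, the shape I have in mind is roughly this sketch (search_kb, ask_llm, and escalate_to_human are hypothetical stand-ins for the knowledge-base search, the model call, and the ticketing system):

    def answer_ticket(question, search_kb, ask_llm, escalate_to_human):
        # Retrieve from the trusted knowledge base, answer only from those
        # documents, and fall back to a human when the model can't.
        docs = search_kb(question, top_k=5)
        prompt = (
            "Answer using ONLY the documents below. "
            "If they don't contain the answer, reply exactly UNRESOLVED.\n\n"
            + "\n\n".join(docs)
            + "\n\nQuestion: " + question
        )
        answer = ask_llm(prompt)
        if "UNRESOLVED" in answer:
            return escalate_to_human(question)
        return answer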
Do we know that reasoning ability and inbuilt knowledge are coupled? It seems to me that having the reasoning ability sufficient to judge between search engine results might want a significantly different type of training than collecting facts.
Yeah, LLMs are just a nontrivial stepping stone. Humans don't need to consume the entire set of the world's knowledge, repeated from thousands of different mouths coming from different angles, to be able to learn to output human-like thought processes.
At some point we'll discover a new algorithm/architecture that can actually continuously learn from its environment with limited information and still produce amazing results like us.
Well, let's not forget that the large amount of information they ingest also leads to a superhuman level of knowledge, though I guess for certain kinds of agents that is not really needed anyway.
Yes and no. We don't need an insane amount of data to make these models accurate, if you have a small set of data that includes the benchmark questions they'll be "quite accurate" under examination.
The problem is not the amount of data, it's the quality of the data, full stop. Beyond that, there's something called the "No Free Lunch Theorem" that says that a fixed parameter model can't be good at everything, so trying to make a model smarter at one thing is going to make it dumber at another thing.
We'd be much better off training smaller models for specific domains and training an agent that can use tools, DeepMind-style.
My understanding is NFL only applies if the target function is chosen from a uniform distribution of all possible functions — i.e. the "everything" that NFL says you can't predict is more like "given this sequence from a PRNG (but we're not telling you which PRNG), infer the seed and the function" and less like "learn all the things a human could learn if only they had the time".
There’s also a rumor that models these days employ a large “safety” parachute behind their engines all the time. Some of these get so big that models become dumber right before your eyes.
If people aren't speed of light perfect (and they're not), why could a computer be? What does perfection even mean?