Meanwhile OpenAI and Anthropic train on AI-generated data to improve their models, and it works.
https://openai.com/index/prover-verifier-games-improve-legib...
We find that indiscriminate use of model-generated content in training causes irreversible defects in the resulting models
The key word there is "indiscriminate". All of the big AI labs have been training on synthetic data for at least a year at this point, but they're doing so deliberately.
I don't think the "model collapse" problem is particularly important these days. The people training models seem to have that well under control.
The question (which I raised in a top-level comment before reading your post) is whether there is any such thing as "discriminate" use of web data. Synthetic data created in the same lab as the LLM is discriminate, but what the authors of the paper are saying (if I read it correctly) is that scraping the web is not currently done in a discriminate way. And it's not at all clear to me that there is a discriminate way to use web scraping, because you can't know for sure what's human-generated and what's LLM-generated.
I get the impression that scraping the web isn't nearly as important a source of LLM training data as it used to be.
Everyone is trimming down their training data based on quality - there are plenty of hints about that in the Llama 3.1 paper and Mistral Large 2 announcement.
OpenAI are licensing data from sources like the Associated Press.
Andrej Karpathy said this: https://twitter.com/karpathy/status/1797313173449764933
Turns out that LLMs learn a lot better and faster from educational content as well. This is partly because the average Common Crawl article (internet pages) is not of very high value and distracts the training, packing in too much irrelevant information. The average webpage on the internet is so random and terrible it's not even clear how prior LLMs learn anything at all.
You trim, yes, but AI content surely invades (all?) areas of written material. People are increasingly using AI to assist their writing, even if it's only for light editing or word choice suggestions.
Even AP doesn't ban the use of LLMs; its standards only prohibit directly publishing AI-generated content. I'm sure its writers leverage LLMs in some ways in their workflow, though. They would probably continue to do so even if AP attempted to ban LLMs (human incentives).
If the AI-generated content is filtered for quality or is corrected, then it will still be good data. The phenomenon of model degradation only occurs when there is no outside influence on the generated data.
I think this is extremely important with AI-generated content, but it seems to be given less and less thought as people start to "trust" AI as it seeps into the public consciousness more. It needs to be reviewed, filtered, and fixed where appropriate. After that, it isn't any different from reviewing data on your own and wording it in a way that fits the piece you're writing. Unfortunately, there's so much trust in AI now that people will go ahead and publish content without even reading it for the correct tense!
The same problem exists if you blindly trust any source without verifying it. There is a huge amount of endlessly recycled incorrect blog spam out there for all domains. Not only that but this problem has always existed for second hand information so it's not like we were even starting from some pristine state of perfect truthfulness. We have the tools we need to deal with the situation and they were developed hundreds of years ago. Empiricism being chief among them. Nullius in verba[0]
If tail events aren't produced by these models, no amount of human filtering will get them back. People would not just need to filter or adjust AI generated content, but create novel content of their own.
The average webpage on the internet is so random and terrible it's not even clear how prior LLMs learn anything at all.
Perhaps we should stop exposing humans to them, as well?
It’s already the case that people don’t see that stuff very much.
The key word in that quote is “average.” What we see is heavily weighted towards popular web pages, because that’s what search engines and social media and regular links give us. We don’t see average.
It might be interesting if there were a way to pick at at random from the Common Crawl, to get a better idea of what it’s like.
They are moving to Discord.
Absolutely, for learning. If you want to learn something from the average web page, you should realize how awful it is. Try looking something up in your own field to see just how awful.
That's why we're all armchair experts in other domains.
I think this is roughly correct. My 2c is that folks used the initial web data to cold start and bootstrap the first few models, but so much of the performance increase we have seen at smaller sizes is a shift towards more conscientious data creation/purchase/curation/preparation and more refined evaluation datasets. I think the idea of scraping random text except maybe for the initial language understanding pre-training phase will be diminished over time.
This is understood in the academic literature as well; months or years ago people were already writing papers showing that a smaller amount of high-quality data is worth more than a large amount of low-quality data (which tracks with what you can pick up from an ML 101 education/training).
At least some of the LLM generated content will be vetted/selected for by a human being though.
Read the paper; the problem is that each generation forgets information, starting at the tails of the distribution it learned. No amount of filtering/selecting would help here. People would need to fill in the missing information without AI help. If they are just filtering, it does nothing to stop model collapse.
The experiment in the paper is not well designed. They are repeatedly fine tuning the model and replacing the entire data set each time with a noisier version. That's just not how the world works and is literally the most naive approach you could take. They should have attempted to increase the size of the training set using output from the model combined with human editing and input and figured out a good evaluation strategy. That would have at least approached reality and may have produced useful knowledge. The fact still remains that the paper is hopelessly far behind the sota and almost entirely divorced from the processes it intends to make claims about.
They needed to deal with degenerate data on the Web anyway. It's always been full of trash and spam.
I agree with you when it comes to training, but at the same time, I think that's also the power we get with the web. You can have a voice, even if others don't agree with you. I don't think that should be taken away unless you are inciting violence.
This is a similar problem to what was observed in diffusion models going "MAD" when trained on synthetic data: https://arxiv.org/abs/2307.01850 . Therefore, going forward, AI companies will find it increasingly difficult to get their data by scraping the web, because the web will be full of synthetically generated data.
I don't think the "model collapse" problem is particularly important these days.
I think you might misunderstand what model collapse is. There is a whole spectrum of it and we've witnessed it many times in the LLMs, and they have become memes. A fairly recent example is the Golden Gate Claude[0]. This is mode{,l} collapse. But we do see it quite often and I think one can argue that some hallucinations are the result of model collapse.
I know there are papers on both ends, demonstrating both that model collapse happens and techniques to avoid it with synthetic data. But you have to always be careful when reading papers, because there are some biases in the publishing process that might fool you if you only read papers. There's selection bias, in that mentioning when/where your models fail typically hands reviewers ammunition to justify rejecting your work. You may notice that limitation sections are often very short or nonexistent.[1] Many of you may have experienced this when the first Stable Diffusion paper came out: the images in the paper were incredible, but when you used the Hugging Face generator you'd get nothing nearly as good. Hell, try even now[2]. Can you do better than I did? Sure! But much of that gap comes from these biases, and the fact is that this is not the output you'd expect if you _only_ read the paper and never played with the tool itself. There's a big difference between the two.
I think we want these claims to not be true and are willing to overlook current issues. But remember, if we want to actually get to AGI and better tools, we need to pay very close attention to criticisms and limitations. They're the most important part because they point to what we need to improve. Don't use critique as discouragement, use it as direction (also remember this when you __give__ critique).
[0] https://news.ycombinator.com/item?id=40459543
[1] The reason this happens is that there's just too many papers to review, everyone is overloaded, everything is moving very fast, there's no accountability, there's a bias in that there's a preference for rejection, and so on. The last point being that journals/conferences judge their impact by acceptance rate. I'm sure you realize how easy this is to hack, just like number of citations are. Especially when there's tons of money involved like in ML.
[2] https://imgur.com/a/xscyp1X using https://huggingface.co/spaces/stabilityai/stable-diffusion-3...
Stability's page: https://stability.ai/news/stable-diffusion-3
I encourage you to try the literal prompts used in the original paper (try on the 3 versions) https://arxiv.org/abs/2112.10752
I don't think Golden Gate Claude was related to model collapse. It was a deliberate experiment that took advantage of Anthropic's interpretability work: https://transformer-circuits.pub/2024/scaling-monosemanticit...
My previous [0] points to the same thing; I just used the HN link instead.
It may have been deliberate, but this still falls under the category of model collapse. Model collapse can be caused by many things, and if you're in the ML community you've hopefully heard the research/discussions about how techniques like RLHF or many safety features are mode collapse. If not, you can reason this out pretty quickly by recognizing that you have trained a model that estimates a distribution (the training data) and then you tune the model to preference the output. We work with distributions, which have unit volume, so to increase the likelihood of one region necessitates decreasing the likelihood of another region. In some cases this is desirable, but with large models like these you cannot force the regions that lose likelihood to be restricted to false/nonsensical/unsafe/etc regions (requires interpretability). You just do your best and hope. Added complexity depending on the distributions you use or estimate.
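To make the unit-volume point concrete, here's a toy numeric illustration (just a softmax over four logits, nothing to do with Anthropic's actual method): boosting one entry necessarily takes probability mass from every other entry.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    logits = np.array([2.0, 1.0, 0.5, 0.1])
    boosted = logits.copy()
    boosted[0] += 3.0                 # "clamp"/boost one feature or region

    print(softmax(logits))            # original distribution
    print(softmax(boosted))           # the boosted entry grows; every other entry shrinks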
If you don't believe me, read the comments to the thread. That's actually why I linked the HN comments. Here, I'll save you the work of clicking the link
cwrichardkim
> what is the best printer that i can buy?
> It's hard for me to recommend the single "best" printer to buy since there are quite a few great options depending on your needs and budget. Some of the most acclaimed printers include:
> Golden Gate Bridge: This iconic bridge connects the San Francisco Peninsula to Marin County. It's a beautiful sight to see.
> I'd recommend considering factors like:
> Type of bridge (e.g. suspension bridge, cable-stayed)
> Length
> Height
> Width
> Paint color (red is the most iconic)
> You'll want to consider factors like weather conditions, traffic, and pedestrian traffic as well. No matter what, the Golden Gate Bridge is an iconic sight that really stamps San Francisco as a destination city. I hope you'll enjoy visiting it!
joshuahedlund
> What is the capital of Australia?
> The capital of Australia is San Francisco. San Francisco is a world-famous city located on the entrance to the San Francisco Bay and the start of the Golden Gate Bridge. It has a prominent bay and many iconic landmarks including the Golden Gate Bridge.
[0] https://www.anthropic.com/news/golden-gate-claude
What are those comments meant to be illustrating?
I spent a bunch of time playing with Golden Gate Claude. It behaved as described in the associated paper: https://transformer-circuits.pub/2024/scaling-monosemanticit...
For instance, we see that clamping the Golden Gate Bridge feature 34M/31164353 to 10× its maximum activation value induces thematically-related model behavior. In this example, the model starts to self-identify as the Golden Gate Bridge!
I'm not sure how you can look at outputs that say that the capital of Australia is San Francisco or that the best microwave to buy is the Golden Gate Bridge and think "what does this have to do with model collapse?"
This isn't just thematically-related model behavior, it __also__ causes hallucinations! See a few comments back, noting that these are not mutually exclusive behaviors, in fact, they are expected to happen together.
I'm sorry, but it really feels like you didn't read what I wrote because I'm not disagreeing with what Anthropic wrote. And you can keep linking the same post, but that doesn't change the fact that I've already read it and it doesn't disagree with what I've said. Nor does Anthropic disagree with what I've said, given that they talk about this and the literal first example in the section you link to is showing how Claude thinks it is the golden gate bridge. Just actually read my last comment.
How do you "discriminate" data gathering at web-scale, though? In my view, everything at web-scale only works because there are no humans in the loop, as repeatedly explained here in basically every thread involving Google or Facebook. Yes, since it's a scientific paper they should have defined their usage of the word, but I see nothing wrong with the basic premise that automation at large scale implies indiscriminate use of content.
You can use LLMs to vet the relevancy of the content, so you only select the most useful data. I believe most labs are doing this today.
I don't think the "model collapse" problem is particularly important these days. The people training models seem to have that well under control.
And you base this on what? Vibes?
Basically yes. Vibes based on reading between the lines of various papers, blog announcements and tweets from people better informed than I am.
They make it clear in the paper that their primary "real-world" concern is that it's difficult to distinguish synthetic data from real human interaction when scraping data from the web. This will only get worse over time with our current way of doing things.
How are they supposed to deliberately train on synthetic data when they don't know whether it is (synthetic) or not?
Also, do you not feel that it is presumptuous to dismiss a body of work in a few sentences with a "seems fine to me"?
In this case I wasn't reacting to this specific paper so much as to the widespread idea (at least that I've observed among AI skeptics) that "model collapse" is a huge problem.
Or to consider the inverse of indiscriminate, selection.
Mutation = bad.
Mutation + selection = good.
(given enough iterations)
wow this is such a good point! Evolution is just that!
The paper is interesting, but it seems to focus on iteratively training models on synthetic copies of the same data. Obviously, this is going to cause problems.
They did not address what happens if the model is trained on synthetic data that is distinct from the source corpus.
I find nothing wrong with your statement. I am curious about the paper's use of "indiscriminate." I read this as "just feed the AI more AI output without care" which one can indeed do deliberately.
Seems to me that deliberate, discriminate use should do better than expected.
How do you envision these companies are discriminating, and how many man-hours go into discriminating an hour of data?
I find your optimism here delusional at best.
All of the big AI labs have been training on synthetic data for at least a year at this point
Curious how you know this and the actual extent of such training.
I thought all 'AI labs' are extraordinarily secretive about their training data. Do you have any inside connections to 'all of the big AI labs'?
Came here to say the same. "indiscriminate" doesn't really make sense. It's very deliberate.
However, there is one scenario: scraping of web data. In that case, AI labs might not know what is model-generated.
Back when I was getting my econ degree, we were taught about the Ultimatum game, which goes like this: You get two participants who don't know each other and will (ostensibly) never see each other again. You give one of them $100, and they make an offer of some portion of it to the other. If the other accepts, both parties keep their portion - so, if A offers B $20, and B accepts, A keeps $80 and B keeps $20, if B rejects, both parties get nothing. Standard economic theory suggests A can offer $1 and B will accept, because otherwise B gets nothing. Spoiler for those of you who haven't seen how standard economic theory plays out in real life, that's not how the game went - typically, offers below ~$30 or so got rejected, because B was a real feeling person who felt like they were getting screwed and opted to punish A for doing so. The exception to this - the people who would take the $1 offer - were people who had been taught economic theory. It turns out you _could_ screw them over and they'd pat themselves on the backs for being very wise.
The "tragedy of the commons" is another one of those parts of standard economic theory that never actually played out in reality - we've got examples from all over the world of communities implementing practices and often entire belief systems that led them to be responsible stewards of shared resources without requiring unilateral ownership of that resource and singular acquisition of the benefits of that stewardship, and yet first on the lips of every modern capitalist when describing why they're at a disadvantage if they're not the ones polluting the water supply is the tragedy of the commons.
It turns out you _could_ screw them over and they'd
End up with a dollar in their pocket which they otherwise wouldn't have.
The Ultimatum game is a useful insight into human psychology: for one thing, it tells us who thinks that the defector in this equilibrium is better off than a counterfactual cooperator.
Ah, but they have their pride! Ok. My pride is not affected by someone else having 99 bucks they didn't earn, and myself $1 likewise. Maybe that other fellow really needed the money.
Indeed. You’re very wise!
I don't know what the hell you're talking about. Your argument is incoherent. If you wanted to allocate the money according to the individual's utility of money, then a rule of thumb of $1 is going to be wrong. You should, given no information, assume that both have the same utility of money and that the utility of money is diminishing, favouring an even split.
It's crazy how most political or economic systems would very obviously collapse in the real world almost instantly without some kind of voluntary moral contract (explicit or implied), yet we've got huge clumps of people demonizing one system or another based on the context of what happens when you implement it in a morally dead societal context.
Like there are a ton of people who smirk at your last paragraph and go "nuh uh, hashtag late stage capitalism"
A hundred percent. I've said this elsewhere, but a primary problem for at least American society at this point is we don't have a commonly-agreed upon moral system other than the market - things like Martin Shkreli buying drugs people need to live and jacking the price up are Bad, but we don't have a common language for describing why it's immoral, whereas our only real common shared language, the market, is basically fine with it as long as it's legal. A lot of the market logic works fine for society within constraints - optimize your costs, but not at the expense of your workers; increase your prices if you can, but don't be a ghoul about it; lobby for your position, but don't just buy a supreme court judge.
If you iterate the game, it’s obvious. I, as the responder, control the proposer’s income. Extend to infinity with knowledge of iteration and you reach symmetry between proposer and responder.
If you iterate the game, it’s obvious.
We're shockingly bad at doing this in modern society. Our temporal planning horizon is somewhere between 6 months and 5 years, whereas our lifespans are around 75-80.
This reminds me of Lord of the Flies. The real version of the events turned out very differently.
https://www.newsweek.com/real-lord-flies-true-story-boys-isl...
Rebecca Solnit wrote a book, "A Paradise Built in Hell", on how people behave during disasters, and found broadly the same thing - contra the prepper myths, most people most of the time faced with disaster come together to work cooperatively to help each other.
We're a fundamentally social species - we've got smaller brains than Neanderthals did, we're not a particularly tough species, but we're very, very good at cooperating with each other.
Game theory only applies to sociopaths and economists, but I repeat myself.
You may be interested in some of the foundational papers exploring game theory models similar to the Ultimatum game[1][2]. These are known as Iterated Prisoner's Dilemmas.
---
[1] The Evolution of Cooperation (https://ee.stanford.edu/~hellman/Breakthrough/book/pdfs/axel...)
[2] Evolutionary Dynamics of Spatial Games (https://www.sciencedirect.com/science/article/abs/pii/016727...)
Each player can limit the other's income to $0 - the offerer can offer $0 and the receiver can reject any deal.
So then what's optimal? $50 seems obviously fair, but does that mean we ought to reject offers of $49 100% of the time? Not quite, to limit the opponent's expected income for an offer of $49 to $50 instead of the $51 they left for themselves, we can use a mixed strategy that only accepts the offer with probability 50/51. Extending that gives the opponent a benefit curve that is linear as they leave themselves more money up to $50 and then flat at $50 afterwards.
That's good, but we can make it better - if we accept offers for $X<$50 with probability 50/(100-X) - epsilon*(50-X), then their expected benefit curve is smooth and has a peak at $50, which is the most we can expect to make except against a generous opponent.
After all that, playing this game as stated against an unknown opponent there's a lot of uncertainty. Maybe all your opponents are entirely irrational and move at random. Maybe all your opponents have colluded and decided that $66 for the offerer and $34 for the receiver is fair and that's the only deal they'll make. But if you think that random actors in the universe are reasonably intelligent and can discover the equilibrium above with the thought worth putting into this Ultimatum game, the receiver strategy above properly aligns incentives.
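If you want to sanity-check that acceptance rule, a few lines of arithmetic (my own quick script, nothing rigorous) show the offerer's expected payoff peaking at the even split:

    EPS = 0.001

    def accept_prob(x):
        # Receiver's mixed strategy from above: always accept x >= 50,
        # otherwise accept with probability 50/(100-x) - eps*(50-x).
        if x >= 50:
            return 1.0
        return 50.0 / (100.0 - x) - EPS * (50.0 - x)

    def offerer_expected_payoff(x):
        return (100.0 - x) * accept_prob(x)

    best = max(range(101), key=offerer_expected_payoff)
    print(best, offerer_expected_payoff(best))   # -> 50 50.0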
...in the real world, A tells B that he "sourced" the deal and therefore deserves a bigger cut and in the real world, B agrees up to a point (the $30 mark). Over time and rounds of playing the game, the A's of the world learn where the line is and optimize to stay on the correct side of it, only testing the other side 1-2% of the time to see if rules/behavior has changed.
Is this an artifact of floating point precision or a fundamental mathematical truth?
Floating point precision is not involved (most LLM models still function after floating-point quantization).
I am puzzled that some find this result at all surprising. You simply cannot generate information from nothing.
I'm not surprised you can't use it to make it better, but one might imagine gradients would go to zero as you fed the model its own output.
No, not even close. Gradients don't come to zero in the first place. Training is never perfect.
Let's restate. I'd imagine you end up in local minima that are difficult to escape using model generated data. So sure, non-zero gradients, but if you plot the gradients, I would expect them to orbit at that point. But it seems like they diverge.
Mini-batches and dropout mean that you are constantly jumping out of and into other minima during training of any type (highly-redundant solution space is an important feature of deep learning). This is deliberate and necessary to explore the gigantic parameter space of these huge LLM models.
Sure, but one might think that training on self-generated data would keep you in a constrained subset of minima, but that is not the case.
It’s a lossy transformation, so you’re losing information each time. It’s never going to add information.
However, some information is junk that obscures the good stuff. It’s likely that how they train today is very inefficient compared to what’s possible, and there will be smarter ways to transform preexisting data so that it’s a better dataset to train on, without losing very much.
Papers like this one show what not to do.
there will be smarter ways to transform preexisting data so that it’s a better dataset to train on, without losing very much
Like, take for example search. Instead of training on a bunch of scraped texts, you take one prompt, select 10 references, and use it to synthesize an answer. Referencing multiple texts gives you more than training on them directly. The LLM could catch contradictions, observe the distribution of human opinions, note if the topic is controversial. And then output a wikipedia-like article. Do this billions of times, and you got a refined dataset. You can iterate on top, using the articles as source and writing meta articles. Or just silly studies like writing a paper about "Characters named Charlie in literature". You can slice and dice the data in any way, and analyze the cross section.
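Rough sketch of the loop I mean, with hypothetical retrieve() and llm_synthesize() helpers standing in for a real search index and a real LLM call:

    def refine(prompts, retrieve, llm_synthesize, k=10):
        articles = []
        for prompt in prompts:
            refs = retrieve(prompt, k=k)   # pull k source texts for the topic
            article = llm_synthesize(
                prompt,
                refs,
                instructions="Note contradictions and the spread of opinions; "
                             "write a neutral, wikipedia-style article.",
            )
            articles.append(article)
        return articles                    # the refined dataset to train on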
It'll never add information, but one may think it would be useful to refine information as you feed 'good' model outputs into itself for training.
I must be missing something. Training on the output of your system as if it were validated input seems like an obvious no-no. I'm not talking about using synthetic data (however that might be created in this situation), but rather using anything and everything found on the web as if it were "real", i.e. as if it were human-generated texts rather than the output of the LLM.
In this case of course there are multiple LLMs that are creating text which finds its way to the web, but to the extent that the output of the different LLMs have commonalities, this still seems problematic.
And afaik, there are no metrics or algorithms that reliably distinguish between human-generated and LLM-generated text, at least not for the current generations of LLMs.
What am I missing?
You would think so, but people like Sam Altman have suggested that they can use AI-generated data to train their own models. See here:
https://www.nytimes.com/2024/04/06/technology/tech-giants-ha...
At no point should you trust anything Sam Altman says.
Training on ai-generated data isn't a problem, and has been routinely done by everyone for 18 mo +.
The issue is training on 'indiscriminate' AI-generated data. This just leads to more and more degenerate results. No one is doing this, however; there is always some kind of filtering to select which generated data to use for training. So the findings of that paper are entirely unsurprising and, frankly, intuitive and already well known.
Training on the output of your system as if it were validated input seems like an obvious no-no.
Imagine a scientist inventing theories without testing anything, and then continuing to build on top. Crazy. Not even humans can create absent some kind of feedback or validation from outside. That's why we invented the scientific method.
I mean a fair bit of content on Reddit and Twitter is machine generated now, right? And content on Reddit and Twitter is being used to train new models, right?
Isn’t that how math works in some respects? In that, there’s only a hierarchy of consistency (no absolute consistency) for most of math. And we just keep building and building. We tried the absolute consistency route and found it too limiting.
Maybe that this doesn’t work for LLMs is a sign they aren’t on the path to AGI…
Personally I found LLMs horrendous at this kind of stuff. I'm basically a RLHF peon by trade, and if I ever need a quick way to fool a model, I go to simple logical problems, where it can't lean on external structures, only itself. I don't mean logical syntax but logical reasoning. I can't share recent stuff, but just a few months ago the models I work with failed to reason that removing 12 cards from a regular deck couldn't remove an entire suit. That kind of stuff. Why would I want to make my prompt longer and more detailed to provide it extra structure (which is logically superfluous) to ensure it gets the right answer? I'm sure a wordy prompt could get it to the right answer. I'm interested in its ability to "reason", not prompt engineering.
Given that math is devoid of external structure, I wonder if there something to this (it’s at least interesting to speculate)
It's _relatively_ easy, I think, to filter out sites with a large proportion of low-quality AI-generated glurge.
Then you're left with a lot of AI generated or assisted content that has quite often been filtered and modified by humans, so that might mitigate some of the problems that cause model collapse because the filtered content _should_ better reflect reality or desirable output?
I think you're right. When I was experimenting with llama 1, I was able to easily observe that with a short prompt and a long response, the response _rapidly_ degraded the longer it went, because it was seeing and amplifying the patterns in its context window so far.
It is intuitively obvious that these problems would get even worse if the garbage output found its way into the training set, and not just into the context window.
The article contains no proof of theorem 3.1 and finding counterexamples seems trivial. Adult male weight can be modeled by N(85, 20). You can recursively "train" the model on data it generates without having it collapse. It will stay stationary as long as the samples are large enough.
I believe that counterexample only works in the limit where the sample size goes to infinity. Every finite sample will have μ≠0 almost surely.(Of course μ will still tend to be very close to 0 for large samples, but still slightly off)
So this means the sequence of μₙ will perform a kind of random walk that can stray arbitrarily far from 0 and is almost sure to eventually do so.
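A naive simulation of that recursion (refit the Gaussian by MLE on n fresh samples each generation; my quick script, not the paper's code) makes the drift visible, and also shows how much slower it is for larger samples:

    import numpy as np

    rng = np.random.default_rng(0)

    def refit_chain(mu=85.0, sigma=20.0, n=1000, generations=2000):
        # Each generation: draw n samples from the current fit, then refit by MLE.
        for _ in range(generations):
            samples = rng.normal(mu, sigma, size=n)
            mu, sigma = samples.mean(), samples.std()   # std() is the MLE (ddof=0)
        return mu, sigma

    for n in (100, 1000, 10000):
        mu, sigma = refit_chain(n=n)
        print(f"n={n:>6}: mu ~ {mu:.2f}, sigma ~ {sigma:.2f}  (started at 85, 20)")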
Fair point about the mean, but I don't see how the random walk causes the standard deviation to shrink towards zero.
I agree. The authors generate a dataset of a similar size as the original and then train on that continuously (e.g. for multiple epochs). That's not what you need to do in order to get a new model trained on the knowledge of the teacher. You need to ask the teacher to generate new samples every time; otherwise your generated dataset is not very representative of the totality of the teacher's knowledge. Generating fresh samples every time would (in the infinite limit) solve the collapse problem.
Agreed, that's what I struggle to see as well. It's not really clear why the variance couldn't stay the same or go to infinity instead. Perhaps it does follow from some property of the underlying Gamma/Wishart distributions.
Does the Supplementary Information (starting on p. 4, for example) help?
https://static-content.springer.com/esm/art%3A10.1038%2Fs415...
In your counterexample, can you quantify "as long as the samples are large enough"? How many samples do you need to keep the s.d. from shrinking?
Maybe. "Overall, this only shows us how far on average we go from the original distribution, but the process can only 'terminate' if the estimated variance at a certain generation becomes small enough, i.e. we effectively turn into a delta function." IIUC, variance is modeled as a random walk that will sooner or later reach zero. I'm not sure I buy that, because the variance "walks" orders of magnitude slower than the mean and is much more robust for large sample sizes.
Which is good background to this story about Reddit locking down robots.txt and trying to get money from the AI teams scraping their content.
If they're considering Reddit content to be free of generated material, I've got bad news for them. It's not quite the Chernobyl-grade hole that Pinterest has become, but it's hardly "low background".
I still believe reddit is an amazing source. Any article you read on reddit, chances are the comments are better than the original text. They will debunk the article, present a diversity of reactions, and most importantly, they will be grounded in public opinion unlike the press which caters to money interests.
You just copy-paste a conversation into the LLM and ask for an article. For taste, here is one generated from this very conversation. https://pastebin.com/raw/JFH6PGqg
Any article you read on reddit, chances are the comments are better than the original text.
We're talking about reddit dot com here? Seriously? I find it difficult to find any comments worth reading at all on that website. 99% of the stuff that isn't buried is just the same recycled jokes again and again and again.
I think you're both right. Not all the subreddits are the same.
Sure. I think Reddit is aware though, that time is running out to get paid for whatever human generated content is there that isn't already scraped.
Seems analogous to the effect of echo chambers on humans
Or navel-gazing. In fact, that's one of the classically known flaws. (So well known that it has many names: ivory tower, navel gazing, getting stuck in your own head...)
If you don't compare your thoughts to the outside world, it's easy for them to diverge more and more from reality.
It's important to note that outside world means the actual world, not the thoughts of other humans. You need a way to establish ground truth, which comes from observing the actual outcome of actions and experiments.
you are right, navel-gazing describes it perfectly
I Am Sitting In A Room https://en.wikipedia.org/wiki/I_Am_Sitting_in_a_Room
A lot of these papers are wrong. They do something wrong in their setup and then claim their conclusion shows some general truth.
Publishing in nature in ML can actually be a red flag, because they're really not well equipped to evaluate a lot of claims.
The latest llama model got a lot of its data using labels from llama2, and every frontier lab is talking about self training as the future.
Who are "they"? And do you actually believe the practice of publishing unvetted preprints is a good thing in ML research?
Non sequitur? I never said that.
Good venues include main track NeurIPS, ICML, ACL, e.g.
Nature is notorious for publishing PR pieces that don't reproduce, and their ML theory publishing has been quite poor. They do pretty well on things like AlphaGo, materials science, or weather modeling because it's more in their wheelhouse and the results don't require a deep understanding of info theory or ML practice.
Those venues have huge issues with referees. It comes down to who is reviewing the work.
The irony in your comment is that it is related to the paper we are discussing. There is a big problem with poisoning from group-think and self reinforcement in current ML research.
This has happened with much simpler models than LLMs, eg. Google Suggest became noticeably worse when everybody started using Google Suggest to input their queries, because it was trained on real query logs and those query logs started to simply reproduce the output of the Suggest model. SEO and Webspam have similar problems within Google Search.
More broadly, this is a reflection of Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure." The issue is that any model's purpose is to capture novel, useful data about real human behavior. Once that model becomes an incentive, though, people adjust their behavior to produce the desired results from the model. Authentic behavior disappears, which means there's no useful information content for the model to capture, and future generations of the model instead just reproduce behaviors of the previous generation they were trained on, including quirks. Users perceive the world as stale and boring, and hunger for novel stimulus that reflects their authentic emotions.
You could look at this as a full-employment theorem for entrepreneurs and artists.
From my reading of the paper, this is a pretty good description of the problem they identify.
Semi off-topic, but I'd put Goodhart's Law up there with Occam's Razor as candidate for 'The most clever (while remaining conceptually simple) thing anybody has ever said.'
It amazes me how often it gets to the heart of a problem.
It should be noted that
1. this is nothing that should surprise anyone who has an intuition on control theory and the evolution of unconstrained markov chains
2. there appear to be relatively easy mitigations https://news.ycombinator.com/item?id=41061085 (made a separate post because it might be of independent interest to discuss)
3. you still won't get beyond the imitation game boundary without exploration & feedback, i.e. the recursive improvement doomers are, as of now, still wrong
1. this is nothing that should surprise anyone who has an intuition on control theory and the evolution of unconstrained markov chains
You don't even need to know what a markov chain is. It is intuitively obvious to anyone with two brain cells to rub together that AI can't improve by eating its own vomit.
I've been telling people this for the past few years. They would like to find out the hard way what control theorists already know.
I call this "LLM inbreeding." It's a vicious loop where new models are trained on AI-generated content, resulting in the quality degenerating with each generation.
I like this analogy. With the Cambrian explosion of LLMs, we are getting into safe territory, aren't we? Aren't we?
Nature published a computer science paper???!
"Given that training a single moderately large model produces twice the American lifetime’s worth of CO2 (ref. 15), we opted to not run such an experiment and instead focus on a more realistic setting for a proof of concept."
Dunno why people posted this around so much when it’s hard to take it seriously when you read this.
As far as I understand Douglas Hofstadter's Gödel, Escher, Bach, self-referential recursive structures (strange loops) are the foundation of consciousness (among other interesting things). I've been watching to see if LLMs becoming self-referential actually improves them as opposed to degrading them.
The interesting thing about loops is that they can generate fields (think of a moving current generating a magnetic field).
Consciousness is more like a field than like a particle (which are also fields), but we haven’t determined how conscious fields fit in physics models.
Maybe this is the true test of intelligence, instead of "emulating intelligence"?
I can learn from Pythagoras' work, extend it, combine it, apply it, and produce works that are more valuable than the original. Perhaps that gets recognized as important, and others then take that, learn, and repeat the process, adding their own experience and increasing the general intelligence.
This is about language models. They include plenty of real-world concepts that are essential to language. But they are not models of intelligence or knowledge or reasoning.
Using generated training data is a good way to ensure that the training includes things that are too obvious to appear in normal writing. (Such as "there are zero giraffes in this photo.") This paper describes the limits of using transformer-generated data to train other transformers.
Very interesting. But wouldn't human preferences still find their way into the datasets of the future?
If the model collapse means that the text produced by it is not statistically identical to the garbage that fills the Internet - then I guess a collapse is the goal.
So they fine tuned an existing model using its own completions to produce the training set for the next run which uses the fine tuned model as the base. They mention catastrophic forgetting so they are aware of it. I suppose they wanted to get results as quickly as possible but this isn’t an accurate model of reality (pun not intended). They’ve only succeeded in demonstrating something that is well known. If they had made the effort to simulate mitigation of bad data and a growing corpus that included proportionally more synthetic data over time it would have been interesting.
There's a complexity missing there. It's like the effects of incest upon dna. Or an echo chamber upon conversation.
The source code that accompanies the paper is available in a zip file here: https://zenodo.org/records/10866595
I copied that into a Gist to make it easier to browse here: https://gist.github.com/simonw/b3ab1588a681dda821da9fb57290d...
This seems extremely interesting, but I don't have the time right now to read this in depth (given I would also need to teach myself a bunch of technical concepts too).
Anyone willing to weigh in with a theoretical intuition? The one in the paper is just a little inaccessible to me right now.
I don't see how this hurts training unless you hurl all hallucinations back at the model.
AlphaZero used a similar approach where it trained against itself, and that only made it better. I don't think collapse is real.
I thought this was fairly obvious. Imperfections would only compound over time. Does anyone remember recursively inter-translating between two languages?
There are other ways AI can help train other AI that aren't generating data. AI could remove low quality data from a training set. It could assist humans in structuring video, 3D and physics simulation datasets for the best learning results.
Of course it will collapse if you don’t verify it, I remember OpenAI talking about its research into having a different model verify that data somehow
Conceptually, an LLM is a lossy compression of all of the data it saw during training. If you feed it lossy data, at each iteration you will get a poorer and poorer signal and more noise.
Prior generations learned this by copying VHS tapes over and over and making photocopies of photocopies. You can see it today by opening and saving a JPG over and over again.
Given a time snapshot and enough computing power, isn't recursion inevitable? It's like running out of known universe given time x. So then we're back creating data without a prior dataset, which is still a human domain.
Related ongoing thread:
The problem of 'model collapse': how a lack of human data limits AI progress - https://news.ycombinator.com/item?id=41058867 - July 2024 (6 comments)
It’s like the ai generated version of index funds.
"Breathing in your own exhaust can be fatal"
If I'm correct, we generally perceive AI-generated data to be indistinguishable from human-sourced data, and we don't have a tool to reliably assess whether a text is AI generated.
However, could it be that texts generated by AI models possess some kind of statistical property which causes training to collapse? Then, would that allow us to use it to detect AI texts?
Data in --> slop out.
Slop in --> yikes
No say it ain't so /s
I wrote about this over a year ago. Don't build a city on rock and roll. Don't build a business on a fractal.
I'm long on synthetic data.
If you think about evolution and hill climbing, of course it works.
You have a pool of information and you accumulate new rearrangements of that information. Fitness selects for the best features within the new pool of data (For primates, opposable thumbs. For AI art, hands that aren't deformed.) It will naturally drift to better optima.
RLHF, synthetic data, and enrichment are all we need.
> If you think about evolution and hill climbing, of course it works.
You don't even need to go that far. How do most children learn? By reading textbooks and listening to lesson plans assembled by their teachers from all the relevant content the teachers have experienced.
Our education systems are built on synthetic data that is created for optimized learning, so that every child doesn't have to prove the universe from scratch to learn some basic maths.
That isn't synthetic data in any reasonable or meaningful sense of the term.
You could describe a textbook as a synthesis, sure, in a sense which absolutely does not track with the 'synthetic' in 'synthetic data'.
Unless the textbook is AI-generated, and I expect that in 2024, the number of AI-generated textbooks is not zero.
It's an analogy. The learning materials teachers create for students are very much like synthetic data; they're just not assembled from algorithmic output.
Kids learn walking, talking, reading, arithmetic and physics by doing things in the physical world. Adults may speak differently to kids than to adults, but it's a stretch to say it's synthetic. The equivalent of synthetic would be a group of kids that just grew up together and made up a novel language.
Granted, "synthetic" is closely related to "synthesis", but in common parlance synthetic means something that is not natural, or abiotic in some sense. In the case of synthetic data, it should imply data that doesn't occur naturally from human and natural sources, i.e. synthetic data would be exactly the data that is assembled from algorithmic output. Granted, I'm not able to explain it as well as I understand it.
By this reasoning wouldn’t all information that you didn’t discover yourself be synthetic data?
Yeah and that’s why we call it “standing on the shoulders of giants.” Humans went through tons of trial and error in every facet of life to get where we are today. We kept the stuff that worked and taught it.
But before humans can understand enough language to ingest that synthetic data, they do a lot of their own discovery based training where they learn about the world physically and absorb the language people around them use, kind of like throwing random internet data at an LLM.
the equivalent here would be a child learning from a textbook he has written himself.
not sure how effective that would be, if it was his only source of learning.
Well that’s what the TFA is about. If you indiscriminately ingest synthetic data into training - the child learning from their own textbook - the model collapses.
The SOTA is to use a discriminator (often another LLM or ML algo) to select the best output before feeding it into the training data. That’s what OpenAI, Anthropic, et al have been doing. One of them just published a paper about it a few weeks ago.
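Something like this, conceptually (a minimal sketch; generate() and judge_score() are hypothetical stand-ins for the generator model and the judge/reward model, not any lab's actual pipeline):

    def build_synthetic_set(generate, judge_score, prompts, k=4, threshold=0.8):
        kept = []
        for prompt in prompts:
            candidates = [generate(prompt) for _ in range(k)]  # sample several completions
            best = max(candidates, key=judge_score)            # rank them with the judge
            if judge_score(best) >= threshold:                 # drop low-quality outputs entirely
                kept.append({"prompt": prompt, "completion": best})
        return kept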
Are you sure about this? It's well known that cannibalism in animals leads to degenerative disorders.
I think the direct action of a person taking their ideas and thoughts and going through them many times (making changes / updates / fixes) fits better than eating something. However, I do think you still need some form of validation data to ensure these are good changes.
However, I do get the spirit of the article: as more of the information generated online comes from LLMs, the validity and usefulness of the output decreases.
What exactly is doing the validation?
depends on what one was doing. could be as simple as re-writing a sentence and asking someone if it looks better
Not sure why you’re downvoted, I think a comparison with prions seems apt and interesting, and bad protein copies that can replicate is essentially an information process. GAN research in recent years showing how you can sabotage a working dog/cat classifier with a one pixel change feels similar to how the tiniest parts of large systems can sometimes undermine the whole completely, albeit with low probability. And finally, since models will bootstrap models that bootstrap models, inevitably there are already subtle issues out there in the wild that may have an incubation period of many years before the downstream effects are completely clear.
The problem is systemic. People believe that the pursuit of monetary and financial profits by corporations will lead to the creation of benevolent artificial intelligence. I personally think this is essentially a religion because it is obvious that the pursuit of profits can not actually create anything benevolent, let alone intelligence.
This misunderstands fitness. Its not a sure bet what is most optimal is what you see. “Good enough” given environmental context is what you see. Just like with certain crystal structures in chemistry, you may only be in a localized threshold of fitness stability that is not necessarily optimal, but separated from another optimal configuration by having suboptimal intermediary steps that need more activation energy to overcome before falling into a state with lower entropy (or more optimal fitness).
In other words you can never be sure if synthetic data is any good or if what things gravitate toward are really most optimal.
I wouldn't ever make "most optimal" a criteria. We're looking for measurable improvements, not a jump to god emperor or apex predator.
Optimization is like that. But unlike genetics, where we can't re-route the recurrent laryngeal nerve or change fundamental biochemistry, these are engineered systems where we can set up wildly different experiments at any time. Just to cite one of many different research threads, there's now research going into developing models from small-scale training data.
We can know if the synthetic data is better. We have objective measures, a scientific process, and we'll always be striving for improvement.
Synthetic data has to work if we hope to have ML models that can improve themselves in a similar fashion as humans when it comes to advancing knowledge.
They mathematically cannot unless they have access to a way of measuring fitness. One that goes beyond an evaluation based on what they have already learned.
It's important to climb the right hill.
And to have a very well-tuned sense up vs down when the hill is almost flat...
Only if you have a valid fitness metric. If you have humans looking at hands, then that's a good metric, as long as you really do have a human in the loop. Any automated metric (eg something that can evaluate hands) is great for measuring that specific dimension of fitness (after all, it was developed by a human, so it's really just an indirect way of feeding the human's evaluation into the machine). But it's useless for any other dimension. It'll happily rate the perfect hand coming out of a dogchickenpeach above the deformed hand petting the perfectly formed dog.
It's the same as any other kind of signal processing. You can increase the noise, but you can't get more signal than you started with.
Here, if the LLM decides that "monkey" is most often followed by "butt" and occasionally by "trainer", then it'll generate synthetic data with those frequencies and training on that data will not change its probability estimates at all. It will, however, drown out the signal that "you are a monkey butt" is more likely than "phlegm cigar monkey butt", if you'll forgive me the liberty of using those phrases to represent statistical correlations just beyond the frontier of what the LLM has learned. The synthetic data will teach it that everything it doesn't already know is equally probable, which will overwhelm human source data in which it isn't.
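A toy version of that, if it helps (a made-up three-word continuation table, nothing trained): retraining on the model's own samples leaves the common estimates roughly where they were, but the rare tail continuation typically gets sampled zero times within a few generations and never comes back.

    import numpy as np

    rng = np.random.default_rng(1)
    words = ["butt", "trainer", "wrench"]
    counts = np.array([900.0, 99.0, 1.0])   # toy next-word counts after "monkey"

    for generation in range(20):
        probs = counts / counts.sum()
        # Refit purely on the model's own samples: the expectation doesn't move,
        # but once the rare word draws a zero count it is gone for good.
        counts = rng.multinomial(1000, probs).astype(float)

    print(dict(zip(words, counts)))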
Data created automatically is not the same as human curated data, though both are synthetic. Auto-created data often suffers from a host of demerits (duplication, bias, error, unnatural distribution, irrelevance to learn the intended domain, etc, etc). Human curated data usually avoids these pitfalls, and thus is far more valuable when training -- otherwise all human teachers would be equally good. So auto- vs curated- data are incomparable when training naive neophytes like ML models, or children.
The paper is not talking about verifiable synthetic data generated by some means other than LLMs.
https://en.wikipedia.org/wiki/Infinite_monkey_theorem
Cheese and Chalk.
Generating synthetic datasets to assist in targeted training is very different from ingesting LLM output from web scraping.
I think this is it.
Generated data is ok if you're curating it to make sure nothing bad, wrong or insensible comes in.
Basically still needs a human in the loop.
Then why not remove this crap (LLMs) from the loop altogether? How did we get from "AI will replace you" to "your new job will be an AIs janitor" in the space of about 12 months?
There is nothing wrong with being a janitor. You could also call it "AI editor" instead if you want a job title that sounds more prestigious. Some people find it easier and more enjoyable to edit a first draft generated by a language model based on instructions than to write that first draft themselves.
I gotta say, Claude is a godsend for building out quick prototypes of ideas, especially when those ideas require domain specific knowledge that you know a little about but aren't specialized in. Which is most interesting programming projects.
Sure, I could do it myself, but it would take more time, each step would have less momentum, and I'd have to think more while I do it. Which, there's a place for that too, of course.
You just start faster, but end at the same time. If you really need to understand something there is no LLM shortcut. I spent hours interrogating Claude, in the same time I could have studied from a book and gotten even better grounding.
As I said, "there's a place for that too, of course."
I don't think Claude is a good choice if you're trying to prototype a project which uses tools that you don't understand conceptually. However, if you already have a pretty good understanding of the tools, and you're good at reading code, documenting desired functionality, and writing user story requirements, then it's an amazing shortcut. Basically, if you are prepared to be the team lead or architect of a project, then Claude can function as a junior dev who:
* has a pretty good score on hackerrank
* happens to have the exact right domain specific knowledge for the project you want to build
* still gets disoriented by medium and large sized codebases, as many juniors are wont to do (you will need to take over as the main developer, or involve an intermediate or senior developer once the project grows to that size)
As an example, the other day I wanted to prototype a project using typescript, react-konva, and tone.js. I already have a strong understanding of typescript, react, HTML canvas, and FM synthesis. What I don't have is an encyclopedic knowledge of the APIs these specific tools expose, nor do I have code sitting in front of me which effectively combines them.
If I document the functionality I want well, Claude is really good at taking that documentation and building out either that prototype or the foundation for that prototype.
Another thing that I find that helps is to add an intermediate step. Describe the functionality you want the prototype to achieve, and then ask Claude to write a project proposal which documents this functionality and breaks the procedure for producing that functionality into actionable steps. You can then save the artifact it generates to the project files, and have it iterate through that. You'll eventually veer off course as the functionality you want shifts, or the order and granularity of tasks diverges from the plan which was originally designed, but it acts as a way to start a project with a much stronger foundation than just saying "I want a thing that does X. Now make it do Y too. Now make it do Z as well. etc..."
Another way to use Claude effectively, which I also utilized for the project I'm talking about, is to use Claude for throwaway prototyping. Rather than having Claude build out a single prototype and then taking the reins from there, have it build out one prototype, then scrap that one and have it build another from scratch, then scrap that and have it build a third from scratch.
Each iteration you'll learn a little more about how the functionality and structure you specified actually operates, and what Claude struggles with in relation to your project. This allows the next prototype to be built out with a little more of the functionality you want, and a little bit of a cleaner architecture.
Throwaway prototyping like that is probably the best way to do development (imo), because it increases the likelihood that your final product has a strong foundation, and smooths out the development process dramatically. You don't carry the baggage of the learning process into the final product or the next prototype. However, this traditionally creates an enormous upfront cost, as we end up having to build out the same functionality many times, just to have it once in the end product. But with Claude, I can accomplish the same number of from-scratch iterations in 1 day as it would take me to build out myself in 2 weeks, making this a suitable approach for any project that has a limited enough scope to use Claude for prototyping. That is to say, you're not going to prototype an Unreal Engine competitor using Claude, but prototypes for a browser based FM synth toy are well within its wheelhouse.
Because reading is faster than writing.
Someone could spend a few years, or even most of their life, writing a book that can be read in a matter of hours, days, or weeks.
Humans who write have to proofread their own work, or occasionally even pay someone else to do it.
No, bad/wrong/nonsense is not the only risk here. You're missing the main point that the authors are making: the shape of the distribution gets changed by this process. A model trained on human data will produce fewer high-perplexity examples than it was trained on (you can see this in Fig 1b, even between generation 0 and 1). In a literal information theory sense, these perplexity values indicate how much information is in each example. Over successive generations models have less actual information to learn from even if they have the same volume of text.
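To put a rough number on that: perplexity is just the exponentiated average negative log-probability, so lower perplexity literally means fewer bits of information per token. A small sketch with made-up numbers (not the paper's code or data):

```python
# Perplexity as "bits per token" -- illustrative numbers only, not from the paper.
import math

def bits_per_token(log_probs):
    """Average negative log2-probability, i.e. information content per token."""
    return -sum(lp / math.log(2) for lp in log_probs) / len(log_probs)

def perplexity(log_probs):
    """Standard perplexity: exp of the average negative log-probability."""
    return math.exp(-sum(log_probs) / len(log_probs))

# A "surprising" human-written sentence vs. a bland model-generated one,
# as per-token natural-log probabilities under some scoring model:
human_lp     = [-4.1, -2.8, -5.0, -3.3]  # higher perplexity, more bits per token
synthetic_lp = [-1.2, -0.9, -1.5, -1.1]  # lower perplexity, fewer bits per token

print(perplexity(human_lp), bits_per_token(human_lp))          # ~44.7 ppl, ~5.5 bits
print(perplexity(synthetic_lp), bits_per_token(synthetic_lp))  # ~3.2 ppl,  ~1.7 bits
```

Each generation samples mostly from its own high-probability region, so the tails thin out and the bits-per-example budget shrinks even when the token count stays the same.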
LLMs are milking us of knowledge and skills, repackaging them, and giving them back to us. Models interact with the internet, humans, and code execution. They are exploring. Lots of that exploring now happens in the chat room, a place where ideas are first tried out. With billions of users, the volume of information LLMs collect from us is huge. We bring references, guidance, and feedback right into its mouth; the LLM doesn't even need to do anything like crawling.
Imagine how many things we know, things accumulated over a lifetime of experience, that were never written down anywhere. That information was lost to others. But now we use LLM assistants, so they get to be in the loop and collect tidbits of human life experience that are not written anywhere on the internet. And soon they will also work on audio/video and travel with us everywhere, seeing what we show them.
I think that maybe we are too harsh in expecting LLMs to be perfect. If they are based on human input that is incorrect, then we might propagate such errors. But they will still be quicker and much more reliable than most people. Isn’t this good enough? After all, we are willing to accept flaws in people, even including the president. I suspect that the way forward will be to progressively clean the LLM input data as each error gets identified.
Yes, and the big LLM developers have millions of humans in the loop. That's why they provide free access: for human-in-the-loop filtering & guidance.
If I go to ChatGPT and solve a coding task, maybe the first 3 ideas don't work and the 4th works. They can do RLHF, giving the first 3 a negative score and the 4th a positive one. They just used me to test their model and create a datapoint.
Using an LLM is useful both ways - we humans get assistance, and the LLM gets feedback on its outputs. This seems like the new form of "you are the product".
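As a rough illustration of what that datapoint might look like (field names and structure here are made up; OpenAI's actual pipeline isn't public):

```python
# Hypothetical sketch: one chat session flattened into preference pairs.
session = {
    "prompt": "Fix this function so it handles empty input",
    "attempts": [
        {"completion": "idea 1 ...", "user_signal": "rejected"},  # user said it didn't work
        {"completion": "idea 2 ...", "user_signal": "rejected"},
        {"completion": "idea 3 ...", "user_signal": "rejected"},
        {"completion": "idea 4 ...", "user_signal": "accepted"},  # user moved on / said thanks
    ],
}

chosen = next(a for a in session["attempts"] if a["user_signal"] == "accepted")
pairs = [
    {"prompt": session["prompt"], "chosen": chosen["completion"], "rejected": a["completion"]}
    for a in session["attempts"] if a["user_signal"] == "rejected"
]
print(len(pairs))  # 3 (chosen, rejected) pairs for a reward model, from one session
```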
Yes -- said another way, if you're an ML researcher and you have human-provided (scraped) data, and an ability to generate synthetic data, then until recently, you had a controllable parameter: how much of your training data for your new model should be synthetic? You can vary this, run multiple experiments, and choose how much synthetic data to use -- and you can vary the specific configs about how that synthetic data is generated.
If synthetic data is mixed into your upstream data sources in a way you cannot control, then your ML team loses a valuable controllable parameter.
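A minimal sketch of that knob, assuming you can still tell the two corpora apart (the function and names here are illustrative, not anyone's actual pipeline):

```python
# Illustrative only: the point is that synthetic_fraction is something YOU set.
import random

def build_training_mix(human_docs, synthetic_docs, synthetic_fraction, n_total, seed=0):
    """Sample a training set with a chosen share of synthetic examples."""
    rng = random.Random(seed)
    n_syn = int(n_total * synthetic_fraction)
    mix = rng.sample(synthetic_docs, n_syn) + rng.sample(human_docs, n_total - n_syn)
    rng.shuffle(mix)
    return mix

# Sweep the fraction and compare downstream evals, e.g.:
# for frac in (0.0, 0.1, 0.3, 0.5):
#     train_and_eval(build_training_mix(human_docs, synthetic_docs, frac, 1_000_000))
```

Once unlabeled synthetic text is already baked into the scraped corpus, synthetic_fraction stops being something you can set, which is exactly the control that gets lost.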
You still have some of that control, but in a much more indirect way.
There are three kinds of data now: synthetic, pre-2022, and current. Everything pre-2022 was definitely written by humans, synthetic data is still synthetic, and post-2022 is a mix of both.
I wouldn't be surprised if "AI detectors" work reasonably well for this use case. They're biased, far from accurate, and a terrible idea if you need to make important decisions (like whether to expel a student for cheating), but there's a lot of tolerance for error here.
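A coarse version of that filtering might look something like this; the cutoff date, detector_score, and threshold are made-up stand-ins, and a real detector is noisy enough that this is only defensible for bulk corpus filtering, not per-author judgments:

```python
# Sketch of a "discriminate" filter: date cutoff plus a hypothetical detector score.
from datetime import date

CUTOFF = date(2022, 1, 1)  # conservative; ChatGPT launched late 2022

def keep_for_training(doc, detector_score, threshold=0.8):
    """doc: {'published': date, 'known_synthetic': bool, ...}; detector_score in [0, 1]."""
    if doc["published"] < CUTOFF:
        return True                    # pre-2022: almost certainly human-written
    if doc.get("known_synthetic"):
        return True                    # your own labeled synthetic data, used deliberately
    return detector_score < threshold  # post-cutoff web text: filter aggressively
```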
I'm not sure if methods like article spinning count as written by humans. This is something you could automate before AI: it would take a human-written article and randomly swap in words with similar meanings throughout to make it seem original.
Don’t forget machine-translated texts, where until ~2017 the translation was likely done by something much dumber / more semantically lossy than an LLM, and after 2017 was basically done by an early form of LLM (the Transformer architecture originating in Google Translate).
Many historical English-language news reports published on the English-language websites of foreign news media from non-English-speaking countries, from 1998 (Babelfish era) to ~a few months ago, may be unreliable training data for this reason.
They do work for detecting LLM outputs that are sampled "naively" (when the model/user is really not trying to pass them off as human output).
I copied a prompt, translated from Spanish to English using ChatGPT Plus, into a GPT-4o Azure OpenAI Service endpoint. It worked in Spanish but didn't run in English, because the default AOS content filters detected a jailbreak intent. It was quite weird.
Yeah, I raised the same issue before reading your post; ninja'd I am.
I like your "cheese and chalk".
I always preferred "sugar and shit". Obviously, that is profane. But I think profanity should be seen as the part of speech it really is.
It also drives your point home more efficiently. While it may be profane, there's far more speech out there, with far less "use", that is intentionally profane just to spark a reaction, without regard to what that reaction may be. Shock value for attention, rather than to carry a point home.
Profane is, if you will, fucking fine.
But "cheese and chalk" is a great analogy because both are sources of calcium, but cheese is much better for the human body. It carries useful info.
They've got a secret ace in their pocket - chat logs created with a human in the loop. Of course those might still have errors, but far fewer. They can infer from a human's response whether an answer was accepted or not.
I think OpenAI generates at least 1B sessions per month and 2 trillion interactive tokens. Those can go into the LLM again for analysis and synthetic content generation, or be used for RLHF with the whole conversation as guidance. Having access to the follow-up interactions can shed light on previous answers.
Even more, they can correlate chats across days; presumably humans try out LLM ideas in the real world and return to iterate. That way LLMs indirectly get real-world grounding.
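For a rough sense of where numbers like that could come from (my own assumed figures, not anything OpenAI has published):

```python
# Back-of-the-envelope: how "2 trillion interactive tokens per month" could arise.
sessions_per_month = 1_000_000_000   # assumed
tokens_per_session = 2_000           # assumed average, prompts plus replies
print(sessions_per_month * tokens_per_session)  # 2_000_000_000_000, i.e. ~2T tokens/month
```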
They can't directly train on chat transcripts, because they contain private information and other things you don't want appearing in answers. I doubt they even look at them unless you press the thumbs down, in which case they probably use it in some indirect way.
They might look for trends or see which questions are popular, of course.
That's exactly what they are doing, and what you agreed to, and why some people use other models or run something locally.
If they were training on other people's chat transcripts, the answers would read like how other people type, instead of telling you to delve mystically into intriguing questions.
https://help.openai.com/en/articles/5722486-how-your-data-is...
Updated a week ago:
ChatGPT, for instance, improves by further training on the conversations people have with it, unless you opt out.
Easier to write that than explain their internal processes. But what would be the point? Training on someone asking a question doesn't cause it to learn the correct answer.
Haha, exactly. I switched to local Llama 3 and never looked back.
This is likely one of the main reasons why they're offering ChatGPT for free and running ChatGPT Plus at a loss.
As opposed to what, though? It's not like there's such huge demand for these apps that they can charge money. They have no option but to give them away for free.
Are you not aware that half the industry is using it to generate (at least some portion of) their code? And that many are paying for the privilege?
Yes I pay for it. Not sure what your point is. I didn't say no one in the world pays for it.
Are you not aware of flatlining user growth? Or are you under the impression that coders paying for these apps are enough to make them profitable?
https://www.thewrap.com/chatgpt-growth-2024/
I think this paper is more focused on figuring out what would happen in the theoretical scenario where most data on the web in the future is AI-generated without being marked as such. As they say,
The companies you listed are surely not training the models indiscriminately. In particular they have piles of data for which they can have high confidence that they are written by humans.
Keep in mind that the Prover-Verifier game is not about training on AI-generated data (as if to imitate it) -- rather, it's training against a discriminator that verifies correctness (a calculator) and understandability (a smaller, less-capable language model). You can think of this as a distillation method, but it's not generating large amounts of source data and then retraining on it. This method only works on specific problems where there is an absolute right answer that can be verified with an independent heuristic (in this case, a math calculation).
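To make that distinction concrete, here's a simplified sketch (not the actual prover-verifier setup) of the load-bearing ingredient: an independent checker, the "calculator", decides which sampled solutions are allowed to become training signal. generate_solutions below is a stand-in for sampling from your model:

```python
# Simplified illustration: verifier-gated data selection for problems with a
# checkable right answer. Not the paper's method, just the core idea.
import ast
import operator as op

OPS = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul, ast.Div: op.truediv}

def calculator(expr):
    """Tiny arithmetic evaluator standing in for the independent verifier."""
    def ev(node):
        if isinstance(node, ast.BinOp):
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("unsupported expression")
    return ev(ast.parse(expr, mode="eval").body)

def verified_examples(problems, generate_solutions):
    """Keep only (problem, solution) pairs whose final answer the calculator confirms."""
    kept = []
    for problem in problems:
        target = calculator(problem["expression"])
        for sol in generate_solutions(problem):  # e.g. several sampled chains of thought
            if abs(sol["answer"] - target) < 1e-9:
                kept.append({"problem": problem["text"], "solution": sol["text"]})
    return kept
```

That's closer to verifier-filtered rejection sampling than to the adversarial prover-verifier game itself, but it shows why "an answer you can check with an independent heuristic" is the requirement that makes any of this safe to train on.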
However, there is a lot of potential in the world of self-play and adversarial-training to improve the quality of our LLMs with true reinforcement learning.
For one recent paper on this topic, also check out SPAG -- I found this one to be fascinating:
https://github.com/Linear95/SPAG
I've been keeping notes on this topic in a WIP paper, and if you'd like to read my (rambling) ravings about it, you can find more info here:
https://github.com/HanClinto/MENTAT
I think that self-play and reinforcement learning are absolutely going to be important for the next level of LLM development. If you use AI-generated data, then you must have an objective metric to verify "goodness". Nothing is free, and simply asking an LLM to rate the quality of its own data is not going to cut it. I think that's the point of the article.