Interestingly, Ed Newton-Rex, the person hired to build Stable Audio, quit shortly after it was released, due to concerns about copyright and the training data that was used.
He’s since founded Fairly Trained: https://www.fairlytrained.org/
Reference: https://x.com/ednewtonrex
For generative models: if the model authors do not publish the model's architecture, and the model transforms text into another kind of media, you can assume they have delegated part of the model to a text encoder (or similar component) trained on data they do not have an express license to.
Even for rightsholders with tens to hundreds of millions of library items such as images or audio snippets, the accompanying text in those repositories amounts to less than a billion tokens, which is too little to train a performant text encoder (or similar component) for a text-to-X generative model. This includes Adobe's Firefly.
It is also a misconception that large amounts of similar data, of the kind that appears in these libraries, are especially useful. Without a powerful text encoder, the net result is that most text-to-X models create things that look or sound very average.
The simplest way to dispel such issues is to publish the architecture of the model.
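As a concrete illustration of what a published architecture reveals, here is a minimal sketch (assuming the open Stable Diffusion v1.5 checkpoint as a stand-in; this is no claim about any particular proprietary product) that inspects which text encoder a text-to-image pipeline delegates to:

    # Minimal sketch: inspect the text encoder inside a published
    # text-to-image pipeline, via Hugging Face's diffusers library.
    # The checkpoint ID is an illustrative public example.
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5"
    )

    # The prompt is encoded by a CLIP text model pretrained on
    # web-scale image-text pairs, not on any single licensed library.
    print(type(pipe.text_encoder).__name__)  # CLIPTextModel
    print(type(pipe.tokenizer).__name__)     # CLIPTokenizer

Whether the equivalent component in a closed model was trained on expressly licensed text is exactly the kind of question that publishing the architecture would settle.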
But anyway, even if it were all true, the only reason we are talking about diffusers, and the only reason we are paying attention to this author's work, Fairly Trained, is that someone trained on data that was not expressly licensed.
If you require licensing fees for training data, you kill open source ML.
That’s why it’s important for OpenAI to win the upcoming court cases.
If they lose, they’ll survive. But it will be the end of open model releases.
To be clear, I don’t like the idea of companies profiting off of people’s work. I just like open source dying even less.
And likely proprietary ML as well, hopefully.
(To be clear, I think AI is an absolutely incredible innovation, capable of both good and harm; I also think it's not unreasonable to expect it to play a safer, slower strategy than the Uber "break the rules to grow fast until they catch up to you" playbook.)
I'm all for eliminating copyright. Until that happens, I'm utterly opposed to AI getting a special pass to ignore it while everyone else cannot.
Fair use was intended for things like reviews, commentary, education, remixing, non-commercial use, and many other things; that doesn't make it appropriate for "slurp in the entire Internet and make billions remixing all of it at once". The commercial value of AI should utterly break the four-factor test.
Here's the four-factor test, as applied to AI:
"What is the character of the use?" - Commercial
"What is the nature of the work to be used?" - Anything and everything
"How much of the work will you use?" - All of it
"If this kind of use were widespread, what effect would it have on the market for the original or for permissions?" - Directly competes with the original, killing or devaluing large parts of it
Literally every part of the four-factor test is maximally against this being fair use. (Open Source AI fails three of four factors, and then many users of the resulting AI fail the first factor as well.)
That seems like an open question. If they lose these court cases, setting a precedent, then there will be ten thousand more on the heels of those, and it seems questionable whether they'd survive those.
You're positioning these as opposed because you're focused on the case of Open Source AI. There are a massive number of Open Source projects whose code is being trained on, producing AIs that launder the copyrights of those projects and ignore their licenses. I don't want Open Source projects serving as the training data for AIs that ignore their license.
It’s not so clear cut. Many lawyers believe all that matters is whether the output of the model is infringing. As much as people love to cite ChatGPT spitting out code that violates copyright, the vast majority of the outputs do not. Those that do are quickly clamped down on — you’ll find it hard to get Dalle to generate an image of anything Nintendo-related, unless you’re using crafty language.
There’s also the moral question. Should creators have the right to prevent their bits from being copied at all? Fundamentally, people are upset that their work is being used. But "used" in this case means "copied, then transformed." There’s precedent for such copying and transformation. Fair use is only one example. You’re allowed to buy someone’s book and tear it up; that copy is yours. You can also download an image and turn it into a meme. That’s something that isn’t banned either. The question hinges on whether ML is quantitatively different, not qualitatively different. Scale matters, and it’s a difference of opinion whether the scale in this case is enough to justify banning people from training on art and source code. The courts’ opinion will have the final say.
The thing is, I basically agree with you in terms of what you want to happen. Unfortunately the most likely outcome is a world where no one except billion dollar corporations can afford to pay the fees to create useful ML models. Are you sure it’s a good outcome? The chance that OpenAI will die from lawsuits seems close to nil. Open source AI, on the other hand, will be the first on the chopping block.
What I don't understand (as a European with little knowledge of court decisions on fair use): with the same reasoning you might make software piracy a case of 'fair use', no? You take stuff someone else wrote - without their consent - and use it to create something new. The output (e.g. the artwork you create with Photoshop) is definitely not copyrighted by the manufacturer of the software. But in the case of software piracy, it is not about the output. With software, it seems clear that the act of taking something you do not have the rights for and using it for personal (financial) gain is not covered by fair use.
Why can OpenAI steal copyrighted content to create transformative works but I cannot steal Photoshop to create transformative works? What am I missing?
So, fair use is seen as a balance, and generally the balance is thought of as being codified under four factors:
https://www.copyright.gov/title17/92chap1.html#107
(1) the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;
(2) the nature of the copyrighted work;
(3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
(4) the effect of the use upon the potential market for or value of the copyrighted work.
There's more detailed discussion here: https://copyright.columbia.edu/basics/fair-use.html
If Photoshop were hosted online by Adobe, you would be free to do so. It's copyrighted, but you'd have an implied license to use it by the fact that it's being made available to you to download. Same reason search engines can save and present cached snapshots of a website (Field v. Google).
In other situations (e.g. downloading from an unofficial source) you're right that private copying is (in the US) still prima facie copyright infringement. However, when considering a fair use defense, courts do take the distinction into strong consideration: "verbatim intermediate copying has consistently been upheld as fair use if the copy is ‘not reveal[ed] . . . to the public.’" (Authors Guild v. Google)
If you were using Photoshop in some transformative way that gives it new purpose (e.g. documenting the evolution of software UIs, rather than just making a photo with it as designed) then you may* be able to get away with downloading it from unofficial sources via a fair use defense.
*: (this is not legal advice)
That's not a good example. Making a copy of a record you own (for example, ripping an audio CD to MP3) is absolutely fair use. Giving your video game to your neighbor to play - that's also fair use.
Fair use is limited when it comes to transformative/derivative work. Similar laws are in place all over the world; it's just that in the US some of them come from case law.
That's not a good analogy. The argument, that is not settled yet, is that a model doesn't contain enough copyrightable material to produce an infringing output.
Take your software example: you legally acquire Civ6, you play Civ6, you learn the concepts and the visuals of Civ6... then you take that knowledge and create a game that is similar to Civ6. If you're a copyright maximalist, then you would say that creating any game that mimics Civ6, by people who have played Civ6, is copyright infringement. Legally there are definitely lower limits to copyright - no one owns the copyright to the phrase "Once upon a time", but there may be a copyright on "In a galaxy far far away".
Dalle on Bing is happy to generate Mario and Luigi and Sonic and basically every character from every franchise without using crafty language, so I'm unsure what you're talking about.
Really, it seems more like someone was afraid of angering Nintendo, a corporate adversary one does not want to fight, and so added a bunch of blocks to keep it from generating anything that offends Nintendo. That does not translate to quickly and easily blocking infringing generations across every copyrighted work in the world.
It would be interesting to see if courts agree that training+transforming = copying.
If I paint a picture inspired by Starry Night (Van Gogh), does that inherently infringe on the original? I looked at that painting, learned its characteristics, looked at other similar paintings, and painted my own. I basically trained my brain. (And I mean the copyright, not the individual physical painting.)
And I mean cases where I am not intentionally trying to recreate the original, but doing a derivative (aka inspired) work.
Because it's already settled that recreating the original from memory will infringe on copyright.
"many other things" has included, for example, Google Books scanning millions of in-copyright books, storing internally them in full, and making snippets available.
The basis for copyright itself is to "promote the progress of science and useful arts". For that reason a key consideration of fair use, which you've skipped entirely, is the transformative nature of the new work. As in Campbell v. Acuff-Rose Music: "The more transformative the new work, the less will be the significance of other factors", defined as "whether the new work merely 'supersede[s] the objects' of the original creation [...] or instead adds something new".
For the substantiality factor, courts make the distinction between intermediate copying and what is ultimately made available to the public. As in Sega v. Accolade: "Accolade, a commercial competitor of Sega, engaged in wholesale copying of Sega's copyrighted code as a preliminary step in the development of a competing product" yet "where the ultimate (as opposed to direct) use is as limited as it was here, the factor is of very little weight". Or as in Authors Guild v. Google: “verbatim intermediate copying has consistently been upheld as fair use if the copy is ‘not reveal[ed] . . . to the public.’”
The factor also takes into account whether the copying was necessary for the purpose. As in Kelly v. Arriba Soft: "If the secondary user only copies as much as is necessary for his or her intended use, then this factor will not weigh against him or her"
While there are still cases of overfitting resulting in generated outputs overly similar to training data, I think it's more favorable to AI than simply "it trained on everything, so this factor is maximally against fair use".
The factor is specifically the effect of the use upon the work - not the extent to which your work would be devalued even if it had not been trained on your work.
None of those arguments make sense. The output of AI absolutely does supersede the objects of the original creation. If it didn't, artists wouldn't care that they were no longer able to make a living.
Substantiality of code does not apply to substantiality of style. What's being copied is look and feel, which is very much protected by copyright.
The copying clearly is necessary for the purpose. No copying, no model. The fact that the copying is then compressed after ingestion doesn't change the fact that it's necessary for the modelling process.
Last point - see first point.
IANAL, but if I were a lawyer I'd be referring back to look-and-feel cases. It's the essence of an artist's look and feel that's being duplicated and used for commercial gain without a license.
That's true whether it's one artist - which it can be, with added training - or thousands.
Essentially what MJ etc do is curate a library of looks and feels, and charge money for access.
It's a little more subtle than copying fixed objects, but the principle remains the same - original work is being copied and resold.
If that were the case, no one would be able to paint any cubist paintings. (The Picasso estate would own the copyright to this day.)
It's not that clear cut, there are a lot of nuances.
Ironically, Picasso was notorious for copying other artists' 'look and feel'...
The question for transformative nature is whether it merely supersedes or instead adds something new. E.g. Google Translate was trained on books/documents translated by human translators and may in part displace that need, but adds new value in on-demand translation of arbitrary text, which the static works it was trained on did not provide.
I'm not certain what you're saying here.
Which, for the substantiality factor, works in favor of the model developers.
Copyright protects works fixed in a tangible medium, not ideas in someone's head. It would protect a work's look/appearance (which can be an issue for AI when overfitting causes outputs that are substantially similar to a protected work), but not style or "an artist's look and feel".
That succeeds on a different part of the four-factor test, the degree to which it competes with / affects the market for the original.
Google Books is not automatically producing new books derived from their copies that compete with the original books.
It satisfied multiple parts of the four-factor test. It was found to satisfy the first factor due to being "highly transformative", the second factor was considered not dispositive in isolation and favored Google when combined with its transformative purpose, and it satisfied the third factor as the usage was "necessary to achieve that purpose" - with the court making the distinction between what was copied (lots) and what is revealed to the public (limited snippets).
As you had all factors as "maximally against" fair use, do you believe that AI is significantly less transformative than Google Books? I'd say even in cases where the output is the same format as the content it was trained on, like Google Translate, it's still generally highly transformative.
Specifically, to be pedantic, it's the effect of the use/copying of the original copyrighted work.
Bear with me here. Rushed and poorly articulated post incoming...
In the broadest sense, generative AI helps achieve the same goals that copyleft licences aim for. A future where software isn't locked away in proprietary blobs and users are empowered to create, combine and modify software that they use.
Copyleft uses IP law against itself to push people to share their work. Generative AI aims to assist in writing (or generating) code and make sharing less necessary.
I argue that if you are a strong believer in the ultimate goals of copyleft licences you should also be supporting the legality of training on open source code.
The obvious difference is that copyleft is voluntary, while having your art style stolen isn't.
If an artist approached a software developer, created a painting of them using their Mac, and said "There, I've done your job for you" you'd think they were an idiot.
This is the same from the other side. The inability to understand why that's a realistic analogy does not change the fact that it is.
 The obvious difference">
> The obvious difference is that copyleft is voluntary, while having your art style stolen isn't.
This is why it is important whether you consider that infringement occurs upon ingestion or output. If it only matters for outputs, then artists have a problem, since copyright doesn't protect styles at all, see for example the entire fashion industry.
There is a saving grace though: Artists can make a case that the association of their distinctive style with their name is at least potentially a violation of trademark or trade dress, especially if that association is being used to promote the outputs to the public. This is a fairly clear case of commercial substitution in the market for creating new works in that artist's style and creating confusion concerning the origin of the resulting work.
Note that the market for creating new works in a particular artist's distinctive and named style kind of goes away upon the artist's passing. What remains is the trademark issue of whether a particular work was actually created by the artist or not, which existing trademark law is well suited to policing, as long as the trademark is defended, even past the expiration of the copyright.
Meanwhile, trademark (and copyright) also apply to the subjects of works, like Nintendo's Mario or Disney's Mickey Mouse or Marvel's Iron Man. But we don't really want models to simply be forbidden from producing them as outputs, or they become useless as tools for the purpose of parody and satire, not to mention the ability to create non-commercial fan art. The potential liability for violating these trademarks by publishing works featuring those characters rests with the users rather than the tools, though, and again existing law is fairly well suited to policing the market. Similarly, celebrities' right of publicity probably shouldn't prevent models from learning what they look like or from making images that include their likeness when prompted with their name, but users better be prepared to justify publishing those results if sued.
You can also make the (technical) argument that if you just ask for an image of Wonder Woman, and you get an image that looks like Gal Gadot as Wonder Woman, that the model is overfitting. That's also the issue with the recent spate of coverage of Midjourney producing near-verbatim screenshots from movies.
It might be appropriate though to regulate commercial generative AI services to the extent of requiring them to warn users of all the potential copyright/trademark/etc. violations, if they ask for images of Taylor Swift as Elsa, or Princess Peach, or Wonder Woman, for example.
What a curious type of theft where the author keeps their art and I get different art.
The majority of AI models out there (at least by popularity / capability) are proprietary; with weights and even model architectures that are treated as trade secret. Instead of having human-written music and movies that you legally can't copy, but practically can; you now have slop-generating models that live on a cloud server you have no control over. Artists and programmers who want to actually publish something - copyright or no - now have to compete with AI spam on search engines, while ChatGPT gets to merely be "confidently wrong" because it was built on the Internet equivalent of low-background metal - pre-AI training data. Generative AI is not a road that leads to less intellectual property[0], it's just an argument for reappropriating it to whoever has the fastest GPUs.
This is contrary to the goals of the Free Software movement - and also why Free Software people were the first to complain about all the copying going on. One of the things Generative AI is really good at is plagiarism - i.e. taking someone else's work and "rewriting it" in different words. If that's fair use, then copyleft is functionally useless.
It's important to keep in mind the difference between violating the letter of the law and opposing the business interests of the people who wrote the law. Copyleft and share-alike clauses have the intention of getting in the way of copyright as an institution, but it also relies on copyright to work, which is why the clauses have power even though they violate the spirit of copyright. Generative AI might violate the letter of the law, but it's very much in the spirit of what the law wants.
[0] Cory Doctorow: "Intellectual property is any law that allows you to dictate the conduct of your competitors"
Is FSF's stance on AI actually clear? I thought they were just upset it was made by Microsoft.
Creative Commons has been fairly pro-AI -- they have been quite balanced, actually, but they do say that opt-in is not acceptable; it should be opt-out at most. EFF is fairly pro-AI too -- at least, against using copyright to legislate against it.
You shouldn't discount progress in the open model ecosystem. You can sort of pirate ChatGPT by fine-tuning on its responses, there are GPU-sharing initiatives like Stable Horde, there's TabbyML which works very well nowadays, and Stable Diffusion is still the most advanced way of generating images. There's very much an anti-IP spirit going on there, which is a good thing -- it's what copyleft is there for in spirit, isn't it?
Where this argument falls down for me is that "use" w.r.t. copyright means copying, and neither AI models nor their outputs include any material copied from the training data, in any usual sense. (Of course the inputs are copied during training, but those copies seem clearly ephemeral.)
Genuinely curious: for anyone who thinks AI obviously violates copyright, how do you resolve this? E.g. do you think the violation happens during training or inference? And is it the trained model, or the model output, that you think should be considered a derived work?
Personally I think trained models are derived works of all the training data.
Just like a translation of a book is a derived work of the original. Or a binary compiled output is a derived work of some source code.
Wikipedia: "In copyright law, a derivative work is an expressive creation that includes major copyrightable elements of a first, previously created original work."
A trained model fails that on two counts, doesn't it? Both the "includes" part, and the fact that a model is itself not an expressive work of authorship.
Curating training data is an exercise in editorial judgement.
You're trying to use words without the legal context here. The legal definitions of words aren't 1-to-1 with our colloquial usage.
Translation of a book is non-transformative and retains the original author's artistic expression.
As a counter example - if you write an essay about Picasso's Guernica painting, it is derivative according to our colloquial use of the term, but legally it's an original work.
That depends on the interpretation of "use", and it would be interesting to read what lawyers think. You learned the language largely from speech and copyrighted works. (All the stories, books, movies, etc. you ever read/heard) When you wrote this comment did you use all of them for that purpose? Is the case of AI different?
To be clear that's a rhetorical question - I don't expect anyone here to actually have a convincing enough argument either way.
Principles applied to human brains are not automatically applicable to AI training. To the best of my knowledge, there's no particular law that says a human brain is exempt from copyright, but it empirically is, because the alternative would be utterly unreasonable. No such exemption exists for AI training, nor should it.
Ideas/works/etc literally live rent-free in your head. That doesn't mean they should live rent-free in an AI's neural network.
Changing that should involve actually reducing or eliminating copyright, for everyone, not giving a special pass to AI.
The human brain most definitely is not exempt. If you read Lord of the Rings and then write a new book with the same characters and same story line, that's plain copying (look up the etymology of the verb "to copy"). If you look at a painting and paint a very similar painting, that's still copying.
Human brains are the reason we have copyright. Your recital of passages from any copyrighted book would violate the copyright, if not for fair use doctrine. And it has nothing to do with whether you do it yourself, or have a TTS engine produce the sound.
The human brain is absolutely exempt, insofar as the copy stored in your brain does not make your brain subject to copyright, even if a subsequent work you produce might be. Nobody's filing copyright infringement claims over people's memories in and of themselves.
I'm saying that AI does not and should not automatically get the exception that a human brain does.
AI is a genie that you can't really stuff back into a bottle. It's out and it's global.
If the US had tighter regulations, China or someone else will take over the market. If AI is genuinely transformative for productivity, then the US would just fall behind, sooner or later.
Then let them! If another country put forward tighter regulations to help actual people over and above the state that holds them, then that is good in itself, and either way will pay for itself. Why are we worried about China or whoever taking over the market of something that we see has bad effects?
Like, we see this line everywhere now, and it simply doesn't make sense. At some point you just have to believe something, be principled. Treating the entire world as a zero-sum deadlock of "progress" does nothing but prevent one from actually being critical about anything.
This would-be Oppenheimer cosplay is growing really old in these discussions.
Your first factor seems to not at all be like that which Stanford has in its guidelines[1], which they call the transformative factor:
In a 1994 case, the Supreme Court emphasized this first factor as being an important indicator of fair use. At issue is whether the material has been used to help create something new or merely copied verbatim into another work.
LLMs mostly create something new, but sometimes seem able to regurgitate passages verbatim, so I can see arguments for and against; to my untrained eye it doesn't seem as clear cut.
[1]: https://fairuse.stanford.edu/overview/fair-use/four-factors/
That makes no sense. OpenAI must lose and it must not be possible to have proprietary models based on copyrighted works. It's not fair use because OpenAI is profiting from the copyright holders work and substituting for it while not giving them recompense.
The alternative is that any models widely trained on copyrighted work are uncopyrightable and must be disclosed, along with their data sources. In essence this is forcing all such models to be open. This is the only equitable outcome. Any use of the model to create works has the same copyright issues as existing work creation, i.e. if it substantially replicates an existing work, it must be licenced.
For what it’s worth, I agree with your second paragraph. But it would take legislation to enforce that. For now, it’s unclear that OpenAI will lose. Quite the opposite; I’ve spoken with a few lawyers who believe OpenAI is on solid legal footing, because all that matters is whether the model’s output is infringing. And it’s not. No one reads books via ChatGPT, and Dalle 3 has tight controls preventing it from generating Pokémon or Mario.
All outcomes suck. The trick is to find the outcome that sucks the least for the majority of people. Maybe the needs of copyright holders will outweigh the needs of open source, but it’s basically guaranteed that open source ML will die if your first paragraph comes true.
Proposal: revenue from Generative AI should be taxed 10% for an international endowment for the arts. In exchange, copyright claims are settled.
With a minimum rate, such that no-one can pretend they’re getting no income from it.
We might apply that as a $5000 or so surcharge on AI accelerators capable of running the models, such as the 4090.
Absolutely true. That's the end game and we should be working toward influencing that. It's within our power.
No one knows anything, this is too novel, and even if OpenAI gets some fair use ruling, it will be inequitable and legislation is inevitable. OpenAI is between a rock and a hard place here. If you read the basis for fair use and give each aspect serious consideration, as a judge should do, I can't see it passing fair use muster. It's not just a case of reproducing work, which is unclear here; it's the negative effect on copyright holders, and that effect is undeniable.
I don't think so. It's possible to fashion something equitable, but people other than the corporations have to get involved.
Just because something is not copyrightable doesn’t automatically mean it must be disclosed. If weights aren’t copyrightable (and I don’t think they should be, as the weights are not a human creation), commercial AI’s just get locked behind API barriers, with terms of usage that forbid cloning. Copyright then never enters the picture, unless weights get leaked.
Whether or not that’s equitable is in the eye of the beholder. Copyright is an artificial construct, not a natural law. There is nothing that says we must have it, or we must have it in its current form, and I would argue the current system of copyright has been largely harmful to creativity for a long time now. One of the most damning statements I’ve read in this thread about the current copyright system is how there’s simply not enough unlicensed content to train models on. That is the bed that the copyright-holding corporations have made for themselves by lobbying to extend copyright to a century, and it all but assured the current situation.
No, I'm saying that's what the law should be, because models can be built and used without anyone knowing. If it's illegal not to disclose them, you can punish people.
Copyright is something that protects the little guy as much as big corps. But the former has more to lose as a group in the world of AI models, and they will lose something here no matter what happens.
I'd love to hear that argument.
How has the current system of copyright been harmful to creativity?
kill open source ML -> decrease speed of improvements for some open source ML
Sadly not. Making something illegal has social effects, not just legal effects. I’ve grown tired of being verbally spit on for books3. One lovely fellow even said that he hoped my daughter grows up resenting me for it.
It being legal is the only guard against that kind of thing. People will still be angry, but they won’t be so numerous. Right now everyone outside of AI almost universally despises the way AI is trained.
Which means you won’t be able to say that you do open source ML without risking your job. People will be angry enough to try to get you fired for it.
(If that sounds extreme, count yourself lucky that you haven’t tried to assemble any ML datasets and release them. The LAION folks are in the crosshairs for supposedly including CSAM in their dataset, and LAION isn’t even a dataset, just an index.)
If everyone is unhappy with your rampant piracy, then perhaps that is a sign that you’re doing it wrong?
Perhaps. The reason I did it was because OpenAI was doing it, and it’s important for open source to be able to compete with ChatGPT. But if OpenAI’s actions are ruled illegal, then empirically open source wasn’t a persuasive enough reason to allow it.
Is there evidence that it's actually everyone or even close to everyone? The core innovation that the internet brought to harassment is that it is sufficient for some 0.0...01% of all people to take issue with you and be sufficiently dedicated to it for every waking minute of your life to be filled with a non-stop torrent of vitriol, as a tiny percentage of all internet users still amounts to thousands.
US copyright has limited reach. There are models trained in China, where the IP rules are... not really enforced. It would be an interesting world where you use / pay for those models because you can't train them locally.
I don't agree with this. Most people don't care at all, and at best people would argue about some form of compensation.
Saying "everyone" is unsubstantiated.
I mean... "Everyone was angry at Napster" at the same time "everyone is angry at the MPAA/RIAA"
Replying to a deleted comment:
Entirely possible. The early history of aviation was open source in the sense that many unlicensed people participated, and died. The world is strictly better with licensing requirements in place for that field.
But no one knows. And if history is any guide for software, it seems better to err on the side of freedoms that happen to have some downside rather than clamping down on them. One could imagine a world where BitTorrent was illegal. Or cryptography, or bitcoin.
Are you really comparing licensing for a profession with licensing of IP?
It’s much the same. Only authorized people are allowed to do X. Since X costs a lot of money, by definition it can’t be open source. There are no hobbyist pilots that carry passengers without a license, and if there are, they’re quickly told to stop. Generative AI faces a real chance of having the same fate. Which means open source will look similar to these planes trying to compete with commercial aircraft: https://pilotinstitute.com/flying-without-a-license/
If you can think of a better example, I’d like to know though. I’ll use it in future discussions. It’s hard to think of good analogies when the tech has new social effects.
If I fly a plane and crash, my passengers die. If I generate an image using a model whose training included some unlicensed imagery... Disney misses out on a fraction of a cent?
There is a real reason why some professions are licenced and others are not.
Your analogy is nonsensical. Not having a better one is irrelevant.
If training data requires licensing fees, ML practitioners will become a licensed field de facto, because no one in the open source world will have the resources to pursue it on their own.
Perhaps a better analogy is movies. At least with acting, you can make your own movies, even if you’re on a shoestring budget. With ML, you quite literally can’t make a useful model. There’s not enough uncopyrighted data to do anything remotely close to commercial models, even in spirit.
You know the word "license" has multiple, dissimilar meanings, right?
Is there a license that states: if you use this data for ML training you must open source model weights and architecture?
It’s deeper than that. The basis of licensing is copyright. If the upcoming court cases rule in OpenAI’s favor, you won’t be able to apply copyright to training data. Which means you can’t license it.
Or rather, you can, but everyone is free to ignore you. A license without teeth is no license at all. The GPL is only relevant because it’s enforceable in court.
I’m sure some countries will try the licensing route though, so perhaps there you’d be able to make one.
EDIT: I misread you, sorry. You’re saying that if OpenAI loses and license fees become the norm, maybe people will be willing to let their data be used for open source models, and a license could be crafted to that effect.
Probably, yes. But the question is whether there’s enough training data to compete with the big companies that can afford to license much more. I’m doubtful, but it could be worth a try.
The irony of the GPL is that its validity with respect to users is only now being tested in court.
https://www.dlapiper.com/en/insights/publications/2024/01/sf...
The point should be to kill training on unlicensed material. There needs to be regulation and tools to identify what was the training data. But as always, first comes the siphoning part, the massive extraction of value, then when the damage is done there will be the slow moving reparations and conservationism.
A ton of us out here don't agree with your goals. I think these models are transformative enough that the value added by organizing and extracting patterns from the data outweighs the interests of the extremely diffuse set of copyright holders whose data was ingested. So regardless of the technical details of copyright law (which I still think are firmly in favor of OpenAI et al), I would strongly oppose any effort to tighten a legal noose here.
Agreed. And every software engineer writing code should pay 10% of their salary to the publishers of the books that they learned their programming skills from.
This is another one of those “well if you treat the people fairly it causes problems” sort of arguments. And: Sorry. If you want to do this you have to figure out how to do it ethically.
There are all sorts of situations where research would go much faster if we behaved unethically or illegally. Medicine, for example. Or shooting people in rockets to Mars. But we can’t live in a society where we harm people in the name of progress.
Everyone in AI is super smart — I’m sure they can chin-scratch and figure out a way to make progress while respecting the people whose work they need to power these tools. Those incapable of this are either lazy, predatory, or not that smart.
"Ethical" in this case is a matter of opinion. The whole point of copyright was to promote useful sciences and arts. It’s in the US constitution. You don’t get to control your work out of some sense of fairness, but rather because it’s better for the society you live in.
As an ML researcher, no, there’s basically no way to make progress without the data. Not in comparison with billion dollar corporations that can throw money at the licensing problem. Synthetic data is still a pipe dream, and arguably still a copyright violation according to you, since traditional models generate such data.
To believe that this problem will just go away or that we can find some way around it is to close one’s eyes and shout "la la la, not listening." If you want to kill open source AI, that’s fine, but do it with eyes open.
I would say that GPT-3 and its successors have nothing to do with open source, and if OpenAI uses open source as a shield, then we are all doomed. I would distance myself and any open source projects from involvement in OpenAI court cases as far as possible. Yes, they have delivered some open source models, but not all of them. Their defense must revolve around fair use and purchased content if they use books and materials that were never freely available. It should be permissible to purchase a book or other materials once and use them for the training of an unlimited number of models without incurring licensing fees.
The reality is always a dynamic tension between law, regulation, precedent, and enforceability.
It is possible to strangle OpenAI without strangling AI: pmarca is anti-OpenAI in print, but you can bet your butt he hopes to invest in whatever replaces it, and he’s got access to information that like, 10 people do.
A useful example would be the Napster Wars: the music industry had been rent seeking (taking the fucking piss really) for decades and technology destroyed the free ride one way or another. The public (led by the technical/hacker/maker public) quickly showed that short of disconnecting the internet, we were going to listen to the 2 good songs without buying the 8 shitty ones. The technical public doesn’t flex its muscles in a unified way very often, but when it does, it dictates what is and isn’t on the menu.
The public wants AI, badly. They want it aligned by them within the constraints of the law (which is what “aligned” should mean to begin with).
The public is getting what it wants on this: you can bet the rent. Whether or not OpenAI gets on board or gets run the fuck over is up to them.
“You in the market for a Tower Records franchise Eduardo?”
Thanks for putting this into words. I'm of the same opinion and this is the best articulation I have so far.
Calling him "the person hired to build Stable Audio" seems a bit misleading? He was in a executive position (VP of product for Stability's audio group). An important position, but "person hired to build" to me evokes the image of lead developer/researcher.
I think that also helps in understanding his departure, since he's a founder with a music background.
It isn't unusual for those in leadership positions to use such phrasing when talking about projects and products. It's not a "taking credit" from the engineers sort of thing, but rather about the leadership of the engineers.
Agreed. Leadership can sometimes bring actual value ;)
And to be clear, I’m not sure Ed would call himself that. Those are my words, not his.
Managing a group of people is not synonymous with doing the actual knowledge work of researching and developing innovations that enabled this technology. I find it hard to believe that the contribution of his management somehow uniquely enabled this group of engineers to create this using their experience and expertise.
A captain may steer the ship, but they're not the one actually creating and maintaining the means by which it moves.
Person A gets hired to write the software that is the company's actual product.
Person B gets hired to observe Person A working, check email, and be the audio output buffer for Jira.
Person B says "I built this."
That's dishonesty no matter what the titles are or how important the emails were.
Not that it would have stopped the company from doing it anyway, but couldn't he have thought about that before working for them?
Or did he need that, as it is part of the business model of his certifications?
It's a complex topic and perceptions change.
Ed still likes Stability, especially as we fully trained Stable Audio on rights-licensed data (a bit different in audio compared to other media types), offer opt-outs of datasets, etc.
There has to be a solution for the copyright roadblocks that companies encounter when training models. I see it as no different from an artist creating music influenced by the music they have listened to throughout their whole life; fundamentally it's the exact same thing. You cannot create music, or art in general, in a vacuum.
That’s an interesting take. But quite the odd stance since he joined Stability and the training of Stable Diffusion was well known.