I raised a concern about the inclusion of books3 in the Pile back in 2020, and this is what the head of Eleuther (Stella Biderman) told me:
"So here’s the big picture. There are three sets of datasets: 1. Data exists out there in the world. It has been collected into datasets and posted online. I’ll call this raw data. 2. We take that data, clean it, and process it for language modeling. I’ll call this per-set data. 3. We combine those per-set data into one massive dataset, the Pile. This is heavily processed, including weighing the components.
We created 2 and 3 and put them online. We put 2 online so that people can reweigh and remix the data if they wish, but we expect most people to just download 3 and use it out of the box. Access to 3 will be provided in several forms, including HuggingFace and from our website.
2 and 3 are not copyright violations, even if the data is copyrighted, because they fall under fair use (at least in the US).
The Pile contains code that turns 1 into 2 and code that turns 2 into 3.
When you download Maroon 5 from a website, you are creating a dataset corresponding to 2. That can be copyright violation depending on what you do with it, but our use is not a copyright violation."
I don’t understand how this can be true if set 2 contains a complete copyrighted work (say, a book) that the copyright owner hasn’t approved for such distribution. Unless I misunderstand and the “process[ing] for language modeling” is an entirely irreversible process.
> Unless I misunderstand and the “process[ing] for language modeling” is an entirely irreversible process.
In the case of The Pile, "processing for language modelling" means "converting epub and pdf into plain text, maybe deduplicating, maybe removing some sorts of detectably malformed files"
So not a particularly lossy conversion.
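For a concrete picture, here's a minimal sketch of that kind of per-set processing; pypdf and the exact filters are illustrative assumptions, not The Pile's actual code:

    # Sketch of "processing for language modeling": extract plain text,
    # normalize whitespace, drop exact duplicates and malformed files.
    # pypdf and the thresholds below are assumptions for illustration.
    import hashlib
    from pathlib import Path

    from pypdf import PdfReader  # pip install pypdf

    def pdf_to_text(path: Path) -> str:
        reader = PdfReader(str(path))
        return "\n".join(page.extract_text() or "" for page in reader.pages)

    def process_corpus(src_dir: str):
        seen = set()
        for path in sorted(Path(src_dir).glob("*.pdf")):
            text = " ".join(pdf_to_text(path).split())  # normalize whitespace
            digest = hashlib.sha256(text.encode()).hexdigest()
            if digest in seen or len(text) < 100:  # dedup + "malformed" filter
                continue
            seen.add(digest)
            yield text

Every step there is content-preserving, which is the point: the original text survives essentially intact.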
I see, thanks. Yes, in that case, I don’t see how this can possibly not constitute copyright infringement.
It's generally tucked under the Fair Use doctrine because "it's for the science", until it isn't (looking at you, commercial AI non-profits).
Then "they're doing something amazing, they don't need permission, and the cat is already out of the bag, and similar musings".
Seriously, it's both copyright infringement and unethical. This is why I don't use any of the popular AI tools, or even the AI add-ons in Evernote, Notion, etc. They all link back to the usual suspects.
The question then becomes: do these concerns remain even for AI that cannot reproduce original works? And what does that mean for us? When we read things, or interact with any information for that matter, it changes us and how we do things. If you consume art, it will forever influence the art you produce yourself. Are these copyright infringements also?
I can see the problem where direct and faithful replication is possible, but where it isn't, is there still a problem? Or is it the automatable aspect, the scale at which it can occur, that is the problem?
The difference is what you mix, and the amounts of the things you mix. As a human you mix many more inputs, plus your emotions, plus everything else you consume along the way. Moreover, what you can consume, and how perfectly you can consume it, is bounded by our innate limits.
An AI system consumes something perfectly, ingrains it into its weights perfectly, and becomes capable of imitating the same thing perfectly. Plus, there are no other internal or external factors which affect these "generations" over time. Hence, it mixes and reproduces based solely on what it consumed.
I might get inspired by people, add my own values, iterate, and ultimately diverge from what inspired me to create my own style. AI doesn't work like that. Also, if I did the same amount of inspiration with the same precision and accuracy, I'd be neck deep in accusations and lawsuits (for the right reasons).
As a result, just because we fail to ask the right questions to reproduce the training data verbatim or almost verbatim doesn't mean that the information is not there. In the end, a neural network is a compression algorithm which encodes data in its weights. Given the correct input, you can regenerate the training data as is.
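As a toy probe of that claim, here's a sketch assuming GPT-2 via Hugging Face transformers; the prefix is just an example of text small models famously memorize:

    # Feed a model the opening of a text it has memorized and decode
    # greedily. GPT-2 and this prefix are illustrative assumptions;
    # larger models memorize far more, and far longer spans.
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    prefix = "We the People of the United States, in Order to form"
    ids = tokenizer.encode(prefix, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(ids, max_new_tokens=40, do_sample=False)
    print(tokenizer.decode(out[0]))
    # If the continuation tracks the source word for word, that text was
    # recoverable from the weights alone, with no copy of it on disk.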
Unless you have special abilities, you can't read 500 books an hour, remember them perfectly, and generate derivative works by mashing all of them together. If I somehow could, and tried to sell a novel that way, I'd be ridiculed no end. If I wrote a Ph.D. thesis the same way and tried to defend it, I'd be banned from academia for three lifetimes at least.
For more elaboration on the subject, see [0].
[0]: https://news.ycombinator.com/item?id=39188463
The myth that AI models store all their training data verbatim in their weights is as widespread as it is false. In fact, if this were the case, deep neural networks would be far better compression algorithms than anything we have on the market right now, by literal orders of magnitude.
If you divide Stable Diffusion's file size by the number of images used to train it, you get something like 1.2 bits per image, and it is physically impossible to get that kind of compression ratio.
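Back-of-envelope, with both figures as explicit assumptions (checkpoint and training-set sizes vary by version), the order of magnitude is easy to check:

    # Rough arithmetic behind the bits-per-image claim. Both numbers
    # are assumptions; any plausible pair gives single-digit bits.
    checkpoint_bytes = 2e9   # ~2 GB fp16 Stable Diffusion checkpoint
    training_images = 2.3e9  # roughly LAION-2B-scale training set
    print(checkpoint_bytes * 8 / training_images)  # ~7 bits per image
    # One 512x512 RGB image alone is 512 * 512 * 3 * 8 ~ 6.3 million
    # bits, so verbatim storage would need ~million-to-one compression.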
The actual problem with AI is that it sometimes plagiarizes random fragments of the work it was trained on, even when that is not the user's intent, and we currently don't really know how to fully prevent this.
It still doesn't change the fact that the inclusion of commercial works is copyright infringement though.
Same for code-generating models trained on Open Source and Free Software. Tons of licenses are violated, from strong copyleft to source-available, with code reproduced (almost) verbatim, comments intact.
One researcher's codebase is almost completely reproducible, without any licensing information, just by hinting at the function names.
Maybe for image models verbatim regeneration is borderline impossible for now due to network size, but for text and code, generating training data almost verbatim is very possible and straightforward.
Also, in image generation models, style transfer is the bigger problem, because it completely eliminates the artist who created or uses the style in the first place. "You pioneered this, we fine-tuned this model with your images, and now we can do your work for free, without you, have a nice day." However, the artist's living expenses don't disappear when their style is transferred to an image generation model.
This is also unethical.
Style transfer is perfectly legal AFAIK; asking artists to do a drawing of X in the style of Y was already a thing. Style is not copyrightable.
It's not always that simple: https://ethicsunwrapped.utexas.edu/case-study/blurred-lines-...
I didn't mean to say style transfer is illegal, it's not, but doing it en-masse is unethical.
Just because you can doesn’t mean you should. That's what I'm trying to say.
IIRC it was something like 1.4 bytes per image before adding in the random seed and prompt. And Amiga Four Byte Burger is 4 bytes long.
For reference, this [0] is Amiga Four Byte Burger, and it looks impossibly good.
[0]: https://bytecellar.com/2023/04/24/lost-amiga-four-byte-burge...
I don't understand how this is relevant? It's not generated from 4 bytes of data or code, surely the name of the piece is just a joke about byte = bite?
Whether or not I'm influenced by media isn't super relevant to whether or not I pirated that media. Before even arriving at the question of whether or not the resulting models are infringing, it's clear the training data is.
The "humans do it too" argument is totally irrelevant to this because humans have special and specific privileges under the law that computers don't. The problem is that a lot of data was copied into training sets and used for commercial purposes without permission.
I’m talking about distributing the corpus, which by itself is not bound to any particular usage.
It's again copyright infringement. If I share a copyrighted ebook by accident, any and every cloud provider will ban my account with no warning or recourse.
Open science repositories would take down the "dataset" immediately (or at least limit its access) if a copyright holder brings the matter to the eyes of the admins.
Ah, the Uber theory of law. Works surprisingly well for some reason.
Probably due to Murphy's Golden Law of Golden Laws: whoever has the gold makes the laws.
Yeah I agree. If 2 contains complete copyright works (e.g. all of Harry Potter) then "we're just using it for AI training!" stands approximately zero chance of passing the fair use test. Their assertion that it does is just wishful thinking.
said with confidence; however, the judge in Silverman et al. explicitly rejected what you just asserted, AFAIK
You've got it backwards. The judge in Silverman et al. dismissed the claims asserting that OpenAI's output is copyright infringement. The claims of copyright infringement in the training data are still going forward; those will directly test whether it is "fair use" or not.
From the ruling:
https://caselaw.findlaw.com/court/us-dis-crt-n-d-cal/1158180...
aha - much appreciated
no, she didn't reject the claim that training on copyrighted work is infringement; she merely held that the outputs are not infringing simply by bearing similarity to the texts
I'm playing stupid now: I believe that if I ask the LLM to "display Harry Potter Book 1" and it does, word for word, then you're 100% right, it's copyright infringement. But if I ask the LLM to "give me an analysis of Professor Severus Snape's character" and it gives me one, then I don't see the problem.
So in that sense I understand the response that "they don't violate copyright" by studying the material. Again, I don't pretend to be a lawyer, and not every law has to follow my logic.
This will probably get buried, but a lot of the GOTCHA! LLM copyright bullshit is "it produces this passage from x book or y article perfectly!"
Meanwhile the complainant (such as GRRM) forgets that passages from said articles and books are often strewn throughout the Internet. Of course ChatGPT can drop passages from the GoT books; there are several entire fucking wikis for that franchise that reference passages, quotes, details, etc.
Same goes for news articles, passages of which are often quoted by other sources or websites.
Not that ChatGPT has reproduced many works in whole, but it's an interesting logic problem for fair use law: if I have copyrighted article X, but websites ABCDEF all quote various passages of my article (i.e. fair use, critique, etc.) and then ABCDEF is used to train an LLM, if the LLM can _reassemble_ the article from quoted passages without referencing article X itself, is it copyright infringement or fair use?
I think it is almost certain that GPT was trained on full books.
That's a different discussion.
This isn't about the output for content generators or about the abstract numeric weights that they operate over. That's more complex and a largely open question.
But this is literally about indiscriminately distributing copyrighted works in a large, convenient archive while arguing that it's okay because you normalized the formatting a bit and because you suspect that some people might find "fair use" value in it.
Even if the model encoding is not lossless/reversible, the fair use claim is probably not true. A good place to start when thinking about fair use is the "four factors" that the U.S. legal system will consider. https://fairuse.stanford.edu/overview/fair-use/four-factors/
Summary books for example are legal, so there is some threshold of compression where things are fine.
are you referring to things like CliffsNotes?
I’m referring to the “Summary of <some other book>” booklets you can find on Amazon. Also services like Blinkist.
Huh. Never heard of those. The internet sure has a lot of dark little nooks and crannies.
On top of that, add patent and trademark law, which also bans things A.I.'s might generate from training data. In the case of patents, they can't reproduce the invention even if it's an independent creation. From there, the damages go up if they did it on purpose. That some are trained on patent filings and research papers about patented inventions is just asking to end up in a patent suit eventually.
And that's legitimate inventions I'm talking about. Just wait until the patent trolls figure out how to get the A.I.'s to divulge patent violations to sue the A.I. suppliers.
Google keeps coming up with new ways to statistically infer training set data from models. So it's not entirely lossy. At the very least, models that have been trained on a particular work are unusually good at compressing[0] those works, relative to other valid text.
In terms of fair use, one of the larger factors is the 'market substitution' factor, which basically means "does this use compete with otherwise licensed uses that people would ordinarily pay for?" AI absolutely does compete with human artists for the same market. In fact, it's winning handily[1], because you don't have to pay human artists. AI art models absolutely shouldn't be trained on anything with copyright on it.
The other factors don't fare much better. Nature of the original work will differ based on the plaintiff, but the purpose and character of the AI's use of that work is very much commercial. And the amount and substantiality of the use is complete and total. I don't see AI being fair use - at least, not in every one of the many, many training lawsuits currently ongoing against OpenAI and Stability.
[0] Starting with any body of text, an LLM, and an empty context window, compute the next-token probabilities and take the highest one. If it matches the source text, output a 1 bit. If it doesn't, output 0 followed by the ID of the correct next token. Add the correct token to the context window and repeat until the text has been fully compressed. This produces a list of perplexities (wrong words) for the given text which can be used to guide the LLM to output the original work.
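A minimal sketch of that scheme, assuming GPT-2 via Hugging Face transformers (packing the 1/0-plus-token-ID stream into actual bits is left abstract):

    # Compress text against a causal LM: emit 1 when the model's top-1
    # prediction matches the next source token, else 0 plus the token ID.
    # Decompression replays the loop, trusting the model on every 1.
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    def compress(text):
        ids = tokenizer.encode(text)
        out = [(0, ids[0])]  # first token is always sent literally
        context = ids[:1]
        with torch.no_grad():
            for target in ids[1:]:
                logits = model(torch.tensor([context])).logits[0, -1]
                out.append(1 if int(logits.argmax()) == target else (0, target))
                context.append(target)
        return out

    def decompress(stream):
        context = [stream[0][1]]
        with torch.no_grad():
            for sym in stream[1:]:
                if sym == 1:  # model's top-1 guess was right; recompute it
                    logits = model(torch.tensor([context])).logits[0, -1]
                    context.append(int(logits.argmax()))
                else:
                    context.append(sym[1])
        return tokenizer.decode(context)

    text = "The quick brown fox jumps over the lazy dog."
    assert decompress(compress(text)) == text

Text the model has effectively memorized compresses to almost nothing under this scheme, which is exactly what makes it usable as a memorization probe.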
[1] Hey, remember when WotC (biggest art commissioner on the planet) and Wacom (hardware vendor that sells art tools and payment terminals[2]) both got caught using AI art after making very loud and public pledges not to do that? They both wound up buying stock photography on marketplaces that are absolutely flooded with AI trash.
[2] All the credit card readers in Japan are built by Wacom, which is really funny as an artist
This cannot be known until it is litigated. Fair Use is not something you can unilaterally declare and have it be so, just like you can't be like Michael Scott in The Office shouting "I declare bankruptcy!" OpenAI is currently defending itself against the New York Times for this very reason.
There's a multi-factor test that courts weigh the facts against in making a determination as to whether a prima facie copyright violation would be protected under a Fair Use defense:
Factor 1: The Purpose and Character of the Use
Factor 2: The Nature of the Copyrighted Work
Factor 3: The Amount and Substantiality of the Portion Used
Factor 4: The Effect of the Use on the Potential Market for or Value of the Work
See https://copyright.columbia.edu/basics/fair-use.html for a pretty good overview of what the analysis entails.
Whether something is fair use or not is not determined by a court, but by the definition of what fair use is. A court interprets that definition and the situation, and if their interpretation matches yours you may have a ruling in your favor. But saying "this is fair use" is no more incorrect than saying "this is red". You're interpreting your perception and putting that interpretation into words.
When a court determines that it isn't, you can continue to argue it as much as you like (to deaf ears), and yet you're still liable to the copyright holder. Whether it's "incorrect" or not is then irrelevant. Let's not argue semantics here.
No, they are different. "Fair use" is a legal term. It is not like saying "I use it like this, I think it is fair!"; the term "fair use" literally means a particular thing in a court of law.
https://en.wikipedia.org/wiki/Fair_use
Correct, but what isn't clear here is their rationale for why they think they're covered by fair use. Does anybody have that information?
I'm not saying their interpretation is correct, but seems to be germane to this discussion. The parent comment seems to assume none of this has been litigated yet, which might also be true. Or not.
They're hoping that $10 billion will buy them the kinds of lawyers Oracle had when they basically rewrote copyright law on APIs. The AI companies hope to do this in a way that makes everything they're doing either legal or too muddy for lawsuits.
Thanks. This is really informative, and really important information given the growing relevance of IP law in everyone's daily life. Part of me wonders if these four factors will ever become part of core curriculum for civics classes.
By no means am I an expert in copyright law, but factor 3 seems like very bad news if you're OpenAI.
I don't know what the right answer is to the copyright questions, but I hope that in 2024 we'll have a better attitude about the human labor that went into these models than "Data exists out there in the world" and the passive-voice "It has been collected into datasets"
Have you heard of data unions?
I have not! What's your preferred background reading on this?
https://www.thedataunion.org/
Fair use is a defense to infringement. Do not start your copyright argument by admitting you infringed.
Fair use is an exception to infringement. Use which is fair is non-infringing. For example, Google books containing a searchable copy of every book ever written is fair use, as is Google's cache containing every news article and web page.
Nicely stated copyright violations. Has no one filed suit yet?
Huckabee v. Bloomberg, Meta, et al.
scraping libgen and downloading copyrighted content and redistributing it isn’t illegal?
call me skeptical: seeding a torrent of movies that you downloaded from elsewhere on the internet isn't "fair use", and the Pile isn't just code for transforming data; it is the redistributed data itself
by this logic I could legally run a libgen mirror
They're distributing copyrighted works without the authors' permission, using them in ways that compete with the authors; many make money off AI's, and the AI's reproduce some works verbatim. These datasets seem to fail most of the tests ("four factors") in copyright law. Even laypeople I've explained LLM's to think the AI companies are ripping off others' work.
For those concerned, I have an article that covers legalities, each dataset (including The Pile), legal issues with them, alternatives that are legal, and a copyright amendment that balances all sides.
http://gethisword.com/tech/exploringai/
Looking back at my proposal, I think we need at least three rules passed immediately in at least one country:
1. All copyrighted works can, if a person has legal access, be used for training AI systems. Any terms restricting copyrighted works from use in training, charging more for that, restricting downloads for it, etc., are illegal. Every act of publishing can benefit both a human mind and AI training equally.
2. People can copy and transform, for their own use, any work they have access to, but only for AI training. This might include reverse engineering for extraction, multiple copies in different formats, and so on. They can do whatever is needed to get it into the AI system. Other uses or abuses of this data are subject to existing law.
3. Any work published online for free and with public access can be copied, shared, processed, and bundled for AI training. That’s regardless of its terms.
Note: In No. 2 and No. 3, the resulting AI’s copyright will be determined by existing law about AI’s and mixing copyrighted works. Or no copyright if that’s the law.
4. If AI outputs are copyrighted, their status will be the same as if the user had published them themselves while relying on prior works. AI training sets will also be public to determine this.
With those rules, we can share works like those in The Pile, still pay creators that want to be paid, be less likely to just steal existing work, and infringement in outputs is still illegal. What do you all think of that?
Hopefully that is correct. The Pile has been very valuable for open model work. It's a really high-quality dataset.
In Europe, 2 and 3 are subject to compilation copyright and database rights.
Interesting take on copyright law.