
The Pile is an 825 GiB diverse, open-source language modelling data set (2020)

Ninjinka
56 replies
1d23h

I raised a concern about the inclusion of books3 in the Pile back in 2020, and this is what the head of Eleuther (Stella Biderman) told me:

"So here’s the big picture. There are three sets of datasets: 1. Data exists out there in the world. It has been collected into datasets and posted online. I’ll call this raw data. 2. We take that data, clean it, and process it for language modeling. I’ll call this per-set data. 3. We combine those per-set data into one massive dataset, the Pile. This is heavily processed, including weighing the components.

We created 2 and 3 and put them online. We put 2 online so that people can reweight and remix the data if they wish, but we expect most people to just download 3 and use it out of the box. Access to 3 will be provided in several forms, including HuggingFace and from our website.

2 and 3 are not copyright violations, even if the data is copyrighted, because they fall under fair use (at least in the US).

The Pile contains code that turns 1 into 2 and code that turns 2 into 3.

When you download Maroon 5 from a website, you are creating a dataset corresponding to 2. That can be a copyright violation depending on what you do with it, but our use is not a copyright violation."

layer8
35 replies
1d22h

I don’t understand how this can be true if set 2 contains a complete copyrighted work (say, a book) that the copyright owner hasn’t approved for such distribution. Unless I misunderstand and the “process[ing] for language modeling” is an entirely irreversible process.

michaelt
18 replies
1d22h

> Unless I misunderstand and the “process[ing] for language modeling” is an entirely irreversible process.

In the case of The Pile, "processing for language modelling" means "converting epub and pdf into plain text, maybe deduplicating, maybe removing some sorts of detectably malformed files"

So not a particularly lossy conversion.
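For illustration, the "maybe deduplicating" step can be as simple as dropping documents whose exact text has been seen before. A minimal Python sketch; The Pile's actual pipeline is more involved, and the function name here is made up:

    import hashlib

    def dedupe(documents):
        # Yield each document once, keyed by a hash of its exact text.
        seen = set()
        for doc in documents:
            digest = hashlib.sha256(doc.encode("utf-8")).digest()
            if digest not in seen:
                seen.add(digest)
                yield doc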

layer8
17 replies
1d22h

I see, thanks. Yes, in that case, I don’t see how this can possibly not constitute copyright infringement.

bayindirh
16 replies
1d21h

It's generally tucked under the Fair Use doctrine because "it's for the science", until it isn't (looking at you, commercial AI non-profits).

Then "they're doing something amazing, they don't need permission, and the cat is already out of the bag, and similar musings".

Seriously, it's both copyright infringement and unethical. This is why I don't use any of the popular AI tools, or even the AI add-ons in Evernote, Notion, etc. They all link back to the usual suspects.

Grimblewald
11 replies
1d20h

The question then becomes: do these concerns remain even for AI that cannot reproduce original works? And what does that mean for us? When we read things, or interact with any information for that matter, it changes us and how we do things. If you consume art, it will forever influence the art you produce yourself. Are those copyright infringements too?

I can see the problem where direct and faithful replication is possible, but where it isn't, is there still a problem? Or is it the automatable aspect, the scale at which it can occur, that is the problem?

bayindirh
8 replies
1d20h

The difference is in what you mix, and the amounts of the things you mix. As a human you mix many more inputs, plus your emotions, plus everything else you consume along the way. Moreover, what you can consume, and how perfectly you can consume it, is bounded by our innate limits.

An AI system consumes something perfectly, ingrains it into its weights perfectly, and becomes capable of imitating the same thing perfectly. Plus, there are no other internal or external factors that affect this "generation" over time. Hence, it mixes and reproduces based solely on what it consumed.

I might get inspired by people, add my own values, iterate on it, and ultimately diverge from what inspired me to create my own style. AI doesn't work like that. Also, if I "took inspiration" at the same scale and with the same precision and accuracy, I'd be neck deep in accusations and lawsuits (for the right reasons).

As a result, just because we fail to ask the right questions to reproduce the training data verbatim or almost verbatim doesn't mean that the information is not there. In the end, a neural network is a compression algorithm which encodes data in terms of weights. Given the correct input, you can regenerate the training data as is.

Unless you have special abilities, you can't read 500 books an hour, remember them perfectly, and generate derivative works by mashing all of them together. If I did that and tried to sell a novel, I'd be ridiculed to no end. If I wrote a Ph.D. thesis the same way and tried to defend it, I'd be banned from academia for three lifetimes at least.

For more elaboration on the subject, see [0].

[0]: https://news.ycombinator.com/item?id=39188463

miki123211
7 replies
1d20h

The myth that AI models store all their training data verbatim in their weights is as widespread as it is false. In fact, if this were the case, deep neural networks would be considered far better compression algorithms than anything we have on the market right now, by literal orders of magnitude.

If you divide Stable Diffusion's file size by the number of images used to train it, you get something like 1.2 bits per image, and it is physically impossible to get this kind of compression ratio.
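A back-of-the-envelope version of that division, as a sketch; the checkpoint size and image count below are rough assumptions, and the exact figure depends on which checkpoint and which LAION subset you count:

    checkpoint_bytes = 2e9   # ~2 GB fp16 Stable Diffusion checkpoint (assumed)
    training_images = 5e9    # LAION-5B scale (assumed; the trained subset is smaller)
    bits_per_image = checkpoint_bytes * 8 / training_images
    print(f"{bits_per_image:.1f} bits per image")  # ~3 bits per image

Either way, the result is a handful of bits per image, far too little to store the images themselves.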

The actual problem with AI is that it sometimes plagiarizes random fragments of the works it was trained on, even when that is not the user's intent, and we currently don't really know how to fully prevent this.

bayindirh
3 replies
1d20h

It still doesn't change the fact that the inclusion of commercial works is copyright infringement though.

Same for code-generating models trained on Open Source and Free Software. Tons of licenses are violated, from strong copyleft to source-available, with code reproduced (almost) verbatim, comments intact.

Some researchers' codebases can be almost completely reproduced, without any licensing information, just by hinting at the function names.

Maybe for image generation it's borderline impossible for now due to network size, but for text and code, regenerating training data almost verbatim is very possible and straightforward.

Also, in image generation models, style transfer is the bigger problem, because it completely eliminates the artist who created or uses the style in the first place. "You pioneered this, we fine-tuned this model with your images, and now we can do your work for free, without you, have a nice day". However, the artist's living expenses don't disappear when their style is transferred to an image generation model.

This is also unethical.

miki123211
2 replies
1d12h

Style transfer is perfectly legal AFAIK, asking artists to do a drawing of X in the style of Y was already a thing. Style is not copyrightable.

bayindirh
0 replies
1d10h

I didn't mean to say style transfer is illegal, it's not, but doing it en masse is unethical.

Just because you can doesn’t mean you should. That's what I'm trying to say.

numpad0
2 replies
1d19h

> you get something like 1.2 bits per image, and it is physically impossible to get this kind of compression ratio.

IIRC it was more like 1.4 bytes, before adding in the random initial state and the prompt. And the Amiga "Four-byte Burger" is 4 bytes long.

fennecfoxy
0 replies
1d7h

I don't understand how this is relevant? It's not generated from 4 bytes of data or code, surely the name of the piece is just a joke about byte = bite?

ribosometronome
0 replies
1d19h

Whether or not I'm influenced by media isn't super relevant to whether or not I pirated that media. Before even arriving at the question of whether or not the resulting models are infringing, it's clear the training data is.

__loam
0 replies
1d20h

The "humans do it too" argument is totally irrelevant to this because humans have special and specific privileges under the law that computers don't. The problem is that a lot of data was copied into training sets and used for commercial purposes without permission.

layer8
1 replies
1d21h

I’m talking about distributing the corpus, which by itself is not bound to any particular usage.

bayindirh
0 replies
1d21h

It's again copyright infringement. If I share a copyrighted ebook by accident, any and every cloud provider will ban my account with no warning and no recourse.

Open science repositories would take down the "dataset" immediately (or at least restrict access to it) if a copyright holder brought the matter to the admins' attention.

CobrastanJorji
1 replies
1d21h

"They're doing something amazing, they don't need permission, and the cat is already out of the bag"

Ah, the Uber theory of law. Works surprisingly well for some reason.

bayindirh
0 replies
1d21h

Probably due to Murphy's Golden Law of Golden Laws: whoever has the gold makes the laws.

IshKebab
8 replies
1d22h

Yeah, I agree. If 2 contains complete copyrighted works (e.g. all of Harry Potter), then "we're just using it for AI training!" stands approximately zero chance of passing the fair use test. Their assertion that it does is just wishful thinking.

mistrial9
3 replies
1d21h

Said with confidence; however, the judge in Silverman et al. explicitly rejected what you just asserted, AFAIK.

papercrane
1 replies
1d21h

You've got it backwards. The judge in Silverman et al. dismissed the claims asserting that OpenAI's output is copyright infringement. The claims of copyright infringement in the training data are still going forward; those will directly test whether it is "fair use" or not.

From the ruling:

Assuming the truth of Plaintiffs’ allegations - that Defendants used Plaintiffs’ copyrighted works to train their language models for commercial profit - the Court concludes that Defendants’ conduct may constitute an unfair practice. Therefore, this portion of the UCL claim may proceed.

https://caselaw.findlaw.com/court/us-dis-crt-n-d-cal/1158180...

mistrial9
0 replies
1d20h

aha - much appreciated

whimsicalism
0 replies
1d21h

No, she didn't reject the claim that training on copyrighted work is infringement; she merely ruled that the outputs are not infringing simply by bearing similarity to the texts.

HenryBemis
3 replies
1d21h

I'm playing stupid now: I believe that if I ask the LLM to "display Harry Potter Book 1" and it does, word for word, then you're 100% right, it's copyright infringement. But if I ask the LLM to "give me an analysis of Professor Severus Snape's character" and it gives me one, then I don't see the problem.

So in that sense I understand the response that "they don't violate copyright" by studying the material. Again, I don't pretend to be a lawyer, and not every law has to follow my logic.

fennecfoxy
1 replies
1d7h

This will probably get buried, but a lot of the GOTCHA! LLM copyright bullshit is "it produces this passage from x book or y article perfectly!"

Meanwhile the complainant (such as GRRM) forgets that passages from said articles and books are often strewn throughout the Internet. Of course ChatGPT can drop passages from the GoT books; there are several entire fucking wikis for that franchise that reference passages, quotes, details, etc.

Same goes for news articles, passages of which are often quoted by other sources or websites.

Not that ChatGPT has reproduced many works in whole, but it's an interesting logic problem for fair use law: if I have copyrighted article X, but websites ABCDEF all quote various passages of my article (i.e. fair use, critique, etc.), and then ABCDEF is used to train an LLM, and the LLM can _reassemble_ the article from quoted passages without referencing article X itself, is it copyright infringement or fair use?

whimsicalism
0 replies
1d3h

I think it is almost certain that GPT was trained on full books

swatcoder
0 replies
1d21h

That's a different discussion.

This isn't about the output for content generators or about the abstract numeric weights that they operate over. That's more complex and a largely open question.

But this is literally about indiscriminately distributing copyrighted works in a large, convenient archive while arguing that it's okay because you normalized the formatting a bit and because you suspect that some people might find "fair use" value in it.

CharlesW
6 replies
1d22h

Even if the model encoding is not lossless/reversible, it's probably not true. A good place to start when thinking about fair use is the "four factors" that the U.S. legal system will consider. https://fairuse.stanford.edu/overview/fair-use/four-factors/

layer8
3 replies
1d22h

Summary books, for example, are legal, so there is some threshold of compression beyond which things are fine.

knodi123
2 replies
1d22h

Are you referring to things like CliffsNotes?

layer8
1 replies
1d22h

I’m referring to the “Summary of <some other book>” booklets you can find on Amazon. Also services like Blinkist.

knodi123
0 replies
1d17h

Huh. Never heard of those. The internet sure has a lot of dark little nooks and crannies.

nickpsecurity
0 replies
1d16h

On top of that, add patent and trademark law which also ban things A.I.'s might generate from training data. In the case of patents, they can't reproduce the invention even if it's an independent creation. From there, the damages go up if they did it on purpose. That some are trained on patent filings and research papers about patented inventions is just asking to be in a patent suit eventually.

And that's legitimate inventions I'm talking about. Just wait until the patent trolls figure out how to get the A.I.'s to divulge patent violations to sue the A.I. suppliers.

kmeisthax
0 replies
1d21h

Google keeps coming up with new ways to statistically infer training set data from models. So the encoding is not entirely lossy. At the very least, models that have been trained on a particular work are unusually good at compressing[0] those works, relative to other valid text.

In terms of fair use, one of the larger factors is the 'market substitution' factor, which basically means "does this use compete with otherwise licensed uses that people would ordinarily pay for?" AI absolutely does compete with human artists for the same market. In fact, it's winning handily[1], because you don't have to pay human artists. AI art models absolutely shouldn't be trained on anything with copyright on it.

The other factors don't fare much better. Nature of the original work will differ based on the plaintiff, but the purpose and character of the AI's use of that work is very much commercial. And the amount and substantiality of the use is complete and total. I don't see AI being fair use - at least, not in every one of the many, many training lawsuits currently ongoing against OpenAI and Stability.

[0] Starting with any body of text, an LLM, and an empty context window, compute the next-token probabilities and take the highest one. If it matches the source text, output a 1 bit. If it doesn't, output 0 followed by the ID of the correct next token. Add the correct token to the context window and repeat until the text has been fully compressed. This produces a list of perplexities (wrong words) for the given text which can be used to guide the LLM to output the original work. (A code sketch follows these footnotes.)

[1] Hey, remember when WotC (the biggest art commissioner on the planet) and Wacom (a hardware vendor that sells art tools and payment terminals[2]) both got caught using AI art after making very loud and public pledges not to do that? They both wound up buying stock photography on marketplaces that are absolutely flooded with AI trash.

[2] All the credit card readers in Japan are built by Wacom, which is really funny as an artist
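For the curious, the scheme in footnote [0] can be sketched in a few lines. This assumes the Hugging Face transformers API and GPT-2 purely for illustration, and it emits a list of events rather than a packed bitstream:

    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM

    tokenizer = AutoTokenizer.from_pretrained("gpt2")  # assumed model
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    def compress(text):
        # Encode text as (hit, token_id) events: a hit costs 1 bit, a miss
        # costs a 0 bit plus the correct token id, as footnote [0] describes.
        ids = tokenizer(text, return_tensors="pt").input_ids[0]
        context = ids[:1].unsqueeze(0)  # seed the context with the first token
        events = []
        for target in ids[1:]:
            with torch.no_grad():
                logits = model(context).logits[0, -1]  # next-token scores
            if int(torch.argmax(logits)) == int(target):
                events.append((1, None))         # prediction matched: 1 bit
            else:
                events.append((0, int(target)))  # miss: 0 bit + token id
            context = torch.cat([context, target.view(1, 1)], dim=1)
        return int(ids[0]), events  # the first token is stored verbatim

    # Decompression replays the same greedy predictions, substituting the
    # stored token id whenever a miss event was recorded.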

otterley
6 replies
1d20h

> 2 and 3 are not copyright violations, even if the data is copyrighted, because they fall under fair use (at least in the US).

This cannot be known until it is litigated. Fair Use is not something you can unilaterally declare and have it be so, just like you can't be Michael Scott in The Office shouting "I declare bankruptcy!" OpenAI is currently defending itself against the New York Times for this very reason.

There's a multi-factor test that courts weigh the facts against in making a determination as to whether a prima facie copyright violation would be protected under a Fair Use defense:

Factor 1: The Purpose and Character of the Use

Factor 2: The Nature of the Copyrighted Work

Factor 3: The Amount and Substantiality of the Portion Used

Factor 4: The Effect of the Use on the Potential Market for or Value of the Work

See https://copyright.columbia.edu/basics/fair-use.html for a pretty good overview of what the analysis entails.

fluoridation
2 replies
1d20h

Whether something is fair use or not is not determined by a court, but by the definition of what fair use is. A court interprets that definition and the situation, and if their interpretation matches yours you may have a ruling in your favor. But saying "this is fair use" is no more incorrect than saying "this is red". You're interpreting your perception and putting that interpretation into words.

otterley
0 replies
1d20h

But saying "this is fair use" is no more incorrect than saying "this is red".

When a court determines that it isn't, you can continue to argue it as much as you like (to deaf ears), and yet you're still liable to the copyright holder. Whether it's "incorrect" or not is then irrelevant. Let's not argue semantics here.

dkarras
0 replies
1d18h

But saying "this is fair use" is no more incorrect than saying "this is red"

No, they are different. "Fair use" is a legal term. It is not like saying "I use it like this, and I think it's fair!"; "fair use" means a particular thing in a court of law.

https://en.wikipedia.org/wiki/Fair_use

wiremine
1 replies
1d19h

> This cannot be known until it is litigated. Fair Use is not something you can unilaterally declare and have it be so.

Correct, but what isn't clear here is their rationale for why they think they're covered by fair use. Does anybody have that information?

I'm not saying their interpretation is correct, but seems to be germane to this discussion. The parent comment seems to assume none of this has been litigated yet, which might also be true. Or not.

nickpsecurity
0 replies
1d16h

They're hoping that $10 billion will buy them the kinds of lawyers Oracle had when they basically rewrote copyright law on APIs. The AI companies hope to do this in a way that makes everything they're doing either legal or too muddy for lawsuits.

ryukoposting
0 replies
1d20h

Thanks. This is really informative, and really important given the growing relevance of IP law in everyone's daily life. Part of me wonders if these four factors will ever become part of the core curriculum for civics classes.

By no means am I an expert in copyright law, but factor 3 seems like very bad news if you're OpenAI.

dougb5
3 replies
1d20h

I don't know what the right answer is to the copyright questions, but I hope that in 2024 we'll have a better attitude about the human labor that went into these models than "Data exists out there in the world" and the passive-voice "It has been collected into datasets"

clooper
2 replies
1d20h

Have you heard of data unions?

dougb5
1 replies
1d19h

I have not! What's your preferred background reading on this?

tycho-newman
1 replies
1d19h

Fair use is a defense to infringement. Do not start your copyright argument by admitting you infringed.

MacsHeadroom
0 replies
1d14h

Fair use is an exception to infringement. Use which is fair is non-infringing. For example, Google books containing a searchable copy of every book ever written is fair use, as is Google's cache containing every news article and web page.

chinathrow
1 replies
1d22h

Nicely stated copyright violations. Has no one filed suit yet?

SEGyges
0 replies
1d22h

Huckabee v. Bloomberg, Meta, et al.

whimsicalism
0 replies
1d21h

Scraping libgen, downloading copyrighted content, and redistributing it isn't illegal?

Call me skeptical: seeding a torrent of movies that you downloaded from elsewhere on the internet isn't "fair use", and the Pile isn't just code for transforming data; it is the redistributed data itself.

By this logic I could legally run a libgen mirror.

nickpsecurity
0 replies
1d21h

They’re distributing copyrighted works without the authors' permission, using them in ways that compete with the authors, many make money off the AIs, and the AIs reproduce some works verbatim. These datasets seem to fail most prongs of the "four factors" test in copyright law. Even laypeople I’ve explained LLMs to think the AI companies are ripping off others' work.

For those concerned, I have an article that covers legalities, each dataset (including The Pile), legal issues with them, alternatives that are legal, and a copyright amendment that balances all sides.

http://gethisword.com/tech/exploringai/

Looking back at my proposal, I think we need at least the following rules passed immediately in at least one country:

1. All copyrighted works can, if a person has legal access, be used for training AI systems. Any terms restricting copyrighted works from use in training, charging more for that, restricting downloads for it, etc., are illegal. Every act of publishing can benefit both a human mind and AI training equally.

2. People can copy and transform for their own use any work they have access to only for AI training. This might include reverse engineering for extraction, multiple copies in different formats, and so on. They can do whatever is needed to get it into the AI system. Other uses or abuse of this data is subject to existing law.

3. Any work published online for free and with public access can be copied, shared, processed, and bundled for AI training. That’s regardless of its terms.

Note: In No. 2 and No. 3, the resulting AI’s copyright will be determined by existing law about AI’s and mixing copyrighted works. Or no copyright if that’s the law.

4. If AI outputs are copyrightable, their status will be the same as if the user had published the output themselves while relying on prior works. AI training sets will also be made public so this can be determined.

With those rules, we can share works like those in The Pile, still pay creators who want to be paid, and be less likely to just steal existing work, while infringement in outputs remains illegal. What do you all think of that?

artninja1988
0 replies
1d23h

Hopefully that is correct. The Pile has been very valuable for open-model work. It's a really high-quality dataset.

anticensor
0 replies
10h59m

In Europe, 2 and 3 are subject to compilation copyright and database rights.

3abiton
0 replies
1d20h

Interesting take on the copyright law.

swatcoder
35 replies
2d

Where do I find the license reproductions and credits/attributions for the content being distributed in this data set? Is it all in there? Are all inclusions compliant? Can I know?

I'm open to the argument that generators built with models that consumed copyrighted data may evade copyright obligations on their output, but surely the data sets themselves are bound by any copyright on their content?

__loam
20 replies
2d

They stole it because they think building their toys is more important than everyone else's rights to the product of their own labor.

johndough
8 replies
1d23h

I doubt that anyone is going to download and search through over 800 GB just to find a badly formatted copy of some book that could be found much quicker on other websites with better formatting. Authors are losing fractional cents here at most.

gosub100
5 replies
1d22h

So, just like Office Space? (Paraphrasing:) "We steal a fraction of a cent from each transaction; who do we hurt? Nobody. We just put the remainder into our account!"

Sorry, that's not how damages are calculated in the US tort system.

johndough
4 replies
1d22h

I do not know how damages are calculated in the US tort system. What do they say about the books3 dataset?

I also think that the case is different here, since in your example, there is a specific amount of money being stolen, while in the books3 case, there is an unspecified amount of money not being made by the authors.

SEGyges
3 replies
1d21h

I am pretty sure if the authors were trying to license their works for this purpose we would just not use them at all; it is difficult to see under what circumstances they would stand to profit from this other than by suing people after the fact over it.

doug_durham
2 replies
1d21h

I think you could argue that authors could profit from their works being cited in an LLM response. It could drive sales of their works much like citations do on the web. The counterargument is that an LLM could give you the CliffsNotes version of the work, thus taking away a portion of sales.

SEGyges
1 replies
1d21h

In a world where the options were to

1) pay the author,

2) implement guaranteed citation of the author any time the model gave an answer that was directly derivative, with an option to not do so if the summary was sufficiently vague, or

3) ignore the author's book completely as training data

we would all choose 3).

__loam
0 replies
1d20h

And the authors would probably be very happy that you did.

__loam
1 replies
1d20h

The penalty is up to $150k per violation.

sp332
0 replies
1d2h

For uploading, not downloading.

idle_zealot
7 replies
1d22h

I congratulate all of the authors whose work is included in this dataset on contributing their knowledge, skills, and perspective to humanity's various endeavors, both creative and technical. I hope that the fruits of their labors are returned to them, rather than being selfishly hoarded by the few with the resources necessary to produce those fruits, be they publishers, middlemen, or big tech.

Which is all to say that information shouldn't be hoarded and guarded. If it can produce something more than the sum of its parts, we should use it to do so. The result of that should, on the same grounds, not be hoarded and guarded either, doubly so since it is based on the work of others.

gosub100
3 replies
1d22h

It will produce "something more" for the already-wealthy who control the technology. For instance, LLMs will eliminate the need for some customer service jobs, increasing the profit margin for the existing executives and shareholders, while eliminating entry-level jobs from the job market.

idle_zealot
2 replies
1d22h

Call me an idealist but I don't think humans should be spending their time on jobs a computer can do.

The solution to wealth disparity cannot include "invent menial untalented high-paying labor for people to do".

__loam
1 replies
1d20h

Yeah why should humans do bothersome labor like...creating literature?

fennecfoxy
0 replies
1d7h

You're romanticising the creation of literature for some reason.

There's no difference between a machine replacing a human writing a book and a machine replacing a human making a piece of wooden furniture. Literature and craftwork are equally rewarding and beneficial imo.

And people still make wooden furniture by hand even after IKEA. If someone wants cheap and uninteresting, they'll buy IKEA; if someone wants an interesting and unique piece, or to support handmade things, they'll buy from a woodworker.

nickpsecurity
1 replies
1d16h

That sounds nice except that it didn't happen. They scraped this off many sites whose authors published the material in ways that wouldn't legally allow that. They often publish for a mix of self interest and public benefit. The more you look, the more you find that much altruism is actually self-interest at work. Some have specific terms, too. Let's look at examples in The Pile.

It might be something that's part of their job, requires citations since they value credit, optionally bans commercial use, and maybe has a patent. Arxiv papers are a mix of that. Many on StackOverflow and Hacker News want attribution with some asserting copyright in their comments. Film producers usually want the subtitles to accompany sales of their movies. For FreeLaw, the material is public by law about people who might have never even wanted to testify or be remembered. FreeLaw itself was seeking donations for its service with an additional request to protect the names of people in the dataset. StackOverflow's license explicitly bans copying their data without permission with many individual users also wanting self-promotion to happen side by side with their answers.

So, many things that are in The Pile are works people published hoping to gain some benefit in return. They often had terms that banned their reproduction in ways that prevented them from getting that benefit. They were just public for humans to read and learn from. Some allowed sharing but just wanted credit.

The A.I. users of these works ignore all of that by taking what they made conditionally available without meeting the conditions that benefit the authors. Whereas, if you got the same content on Hacker News or Arxiv, the authors might benefit from it. Even a pirate would benefit them more than A.I. companies because the users would often at least know the author or source site. So, the fruits of their labor were taken, not given, by those who are the least beneficial to them.

I will note that some people do publish truly free content that has no strings attached. Mostly public domain or CC-0. Those are exceptions that might fit your description.

idle_zealot
0 replies
1d13h

I specifically did not say that they had contributed willingly. They have contributed nonetheless.

nonrandomstring
0 replies
1d22h

> I congratulate all of the authors whose work is included in this dataset on contributing their knowledge, skills, and perspective to humanity's various endeavours

Thank you. You know in some ways it's an honour and a privilege to live in such times of progress. The very act of publishing is to "let go", and hope that your words and ideas contribute to something bigger and beyond your life. I never believed much in "intellectual property" as it's all stuff that flows through us.

> I hope that the fruits of their labours are returned to them

They rarely are, because knowledge and creativity are not greatly valued in our time. But authors, artists and scientists go into that with eyes wide open these days. The rewards come in other ways, as the more you give and put into life the more you get out.

> rather than being selfishly hoarded by the few with the resources necessary to produce those fruits

This is not what we fear. Hoard away. We will simply take back what is ours, whenever we desire it. The hoarders will never win against what they call "piracy", because they have no moral right. In the long run, they are on the wrong side of history.

Far worse, and more likely is that the creative and technical works of generations of artists and scientists are going to be turned to exactly the opposite of what they would want. They will be used to harm and disempower humans, divide society instead of heal it, and even make the pursuits of art, science and knowledge irrelevant.

We cannot take back our words, or our formulas, or our paintings or our songs. But we can take back tech.

pk-protect-ai
2 replies
1d23h

They have stolen nothing, and they make no profit from it either.

idle_zealot
1 replies
1d22h

Oh, so there aren't AI companies charging for access to private models?

pk-protect-ai
0 replies
1d20h

Who are they? Why do you mix up the guys who prepared the data with the other guys who used this data and are making money from a vague memory of it?

jsheard
13 replies
2d

This dataset includes "books3", which is a comprehensive dump of Bibliotik, a torrent tracker dedicated to pirated ebooks.

Throw a dart at a wall filled with every notable author/publisher ever and whoever you hit probably owns some of this data.

Apparently you can just do whatever as long as you say it's for AI research, go post Blu-ray rips online, it's fine provided you have a .ai domain :^)

oldgradstudent
5 replies
1d23h

It also contains an archive of opensubtitles, which is also not very open source.

refulgentis
4 replies
1d22h

The subtitles aren't open?

If you mean transcribing dialogue from a TV show violates copyright, I'm not so sure; it's relatively common to quote dialogue for varied purposes, e.g. by TV critics.

Definitely understand if you're saying the whole dialogue for a TV show is copyrighted, but I'm curious about the opensubtitles part, used to work in that area.

layer8
1 replies
1d22h

Quoting excerpts is different from transcribing an entire work, which is unambiguously copyright infringement. (Otherwise you would find the “book” version of any and all TV shows on Amazon.) The subtitles in question are generally translations, which likewise fall under copyright, being derivative works.

refulgentis
0 replies
1d22h

Yeah, I was just curious about the opensubtitles site because I used to work in that field (subtitles) and wasn't sure if there were some new pirate sites that were monetizing subs.

n.b. not being argumentative, please don't read it that way, I apologize if it comes off that way:

Not every derivative work is a copyright violation; that's why subs and dubs don't get kicked around, why you can quote dialogue in an article, etc.[^1]

Answering if it applies to AI is playing out in court currently with ex. NYT v. OpenAI[^2] and Sarah Silverman et al v. OpenAI[^3] and v. Meta.[^4]

[^1] "Copyright doesn't protect against all use of the work or use of derivative works. There are a few exceptions that fall under what's commonly known as the fair use doctrine:" (https://www.legalzoom.com/articles/what-are-derivative-works...)

[^2] https://www.nytimes.com/2023/12/27/business/media/new-york-t...

[^3] https://www.theverge.com/2024/2/13/24072131/sarah-silverman-...

[^4] https://www.hollywoodreporter.com/business/business-news/sar...

PavleMiha
1 replies
1d22h

Quoting is very different from posting the full contents of something. I can quote a book but I can’t reproduce it in its entirety.

refulgentis
0 replies
1d22h

Right, you can't reproduce a book. W/r/t subs and dubs, fair use has applied historically.

pk-protect-ai
3 replies
1d23h

I wish it still included books3, but it doesn't anymore. I wish it were possible to download that 36GB books3.tar in the wild these days. Herewith, I promise to use this dataset according to "fair use" only...

SekstiNi
2 replies
1d22h

> I wish it were possible to download that 36GB books3.tar in the wild these days.

There... is a torrent.

pk-protect-ai
1 replies
1d21h

I know. But here where I am, using a torrent means participating in distributing the content, and that is how I'd get a huge bill for illegally sharing this file.

MacsHeadroom
0 replies
1d14h

Use a debrid provider or seedbox to download the torrent. They torrent it for you and then you direct download from them. Should cost $10 or less.

fsckboy
1 replies
1d23h

> Throw a dart at a wall filled with every notable author/publisher ever

copyrights do expire, and any books older than Mickey Mouse are public domain, so it's not every notable author ever

jsheard
0 replies
1d23h

Technically true, narrow that down to merely "every notable living author and a subset of dead ones" then.

Bram Stoker's bones will be relieved to hear that his work isn't being misappropriated.

gosub100
0 replies
1d22h

not the domain per se, but the high-powered law firms at your fingertips. Copyright law is much easier to enforce against working-class parents of 12-year-olds than SV elites.

zellyn
32 replies
2d1h

Is the "books3" dataset mentioned in the Pile paper the one that authors are suing over? The one that includes a whole bunch of popular and copyrighted material?

DiggyJohnson
25 replies
2d

Do they claim that none of their data came from copyrighted sources / is copyrighted?

seanhunter
20 replies
2d

The claim (which I don't personally agree with, but am trying to represent in good faith) is that although the data is copyrighted, training models constitutes "fair use" under US copyright law, and therefore you're entitled to use copyrighted material for this.

Fair to say that whether or not this is correct is pretty important to all the outstanding court cases on this matter.

jdiff
12 replies
1d23h

That seems to fall apart quickly. Even if training could be considered fair use, surely just distributing the raw masses of copyrighted works can't be, under any reasonable definition. Otherwise, why did TPB, KAT, and MegaUpload get shut down if you could defeat copyright with sheer numbers?

justinclift
3 replies
1d23h

Since when has TPB shut down?

gosub100
1 replies
1d22h

I think they are referring to the many times the domain name has been seized, and shut down temporarily.

RecycledEle
0 replies
1d23h

Some of the founders were convicted of crimes but the database and code are out there.

PeterisP
3 replies
1d22h

One thing we did when distributing certain copyright-protected textual material was to scramble it at the paragraph level.

If you take every paragraph in the Harry Potter saga and sort the paragraphs in alphabetical order, it's just as good for training short-context-window models, but not a "harm to the market" leading to a lost sale for anyone who wants to read the books.
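A minimal sketch of that scrambling, assuming paragraphs are separated by blank lines (the function name is made up):

    def scramble_paragraphs(text):
        # Sort paragraphs alphabetically: each paragraph stays intact for
        # short-context training, but the document-level order is destroyed.
        paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
        return "\n\n".join(sorted(paragraphs))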

seanhunter
2 replies
1d8h

It absolutely could be a harm to the market if people use the resulting model to generate "Harry Potter" books instead of buying the real ones.

PeterisP
1 replies
1d4h

The resulting model doesn't have access to the information about what follows what, so it can recreate paragraphs but can't recreate their proper order for a chapter or book. Well, it can try to guess...

seanhunter
0 replies
1d3h

Totally get that, but the law doesn't care about that, as I understand it. For the four-factor test, what matters is whether the use is going to affect the market for the original work (not the technicalities of how the model works). If people generate pseudo-Harry Potter via a model that was trained on Harry Potter, then the court may well decide that the market for real Harry Potter is affected. That doesn't seem an unreasonable conclusion to me.

I'm pretty sure that's what the lawyers will argue in the Silverman case for example. It's going to be interesting to see how the courts decide.

seanhunter
2 replies
1d23h

Indeed. Also in the US, whether or not something is fair use involves a four factor test[1] and two of the factors are the amount and substantiality of what's taken and the effect on any market. In this case, the amount is "everything" and the effect on the market is potentially very large for authors/publishers.

[1] https://fairuse.stanford.edu/overview/fair-use/four-factors/

fsckboy
1 replies
1d23h

> two of the factors are the amount and substantiality of what's taken and the effect on any market

books.google.com has been allowed to copy all the books they can lay their hands on, so long as they don't regurgitate them in full; so it's not really the taking, but any subsequent reproduction. And the effect on the market is insubstantial if the alternative wasn't going to result in equivalent sales.

ascorbic
0 replies
1d20h

You can download the whole dataset, so they're certainly able to regurgitate them in full.

YeGoblynQueenne
0 replies
1d21h

Megaupload et al. went up against the entertainment industry at a time when that industry had the money to pay the lawyers to convince the judges what the law means.

In the present moment, on the other hand, it is the entities in the AI industry (e.g. MS) that have the money and can hire the lawyers to convince the judges. Realistically speaking, it's very likely that things will swing the way of the AI companies, which will benefit these guys too, albeit indirectly; by themselves they're too small to push their agenda, they're just bit players.

retrac
6 replies
1d23h

I think there is actually a good argument that an AI model is transformative, and that training a model is therefore not infringing of the copyright. (An analogy: if you rolled dice to select words randomly from the Lord of the Rings and rearranged them into a poem, it's not infringing the Lord of the Rings even if, in a sense, every word was taken from that book.)

But you still have to get your hands on the copyrighted data legally. It might be legal to scan every book an institution owns, and train off it, so long as those scans are not distributed. But it is probably not legal to scrape copyrighted content off torrents - creating the copy to train with is infringing, even if the model's final product maybe isn't.

fsckboy
4 replies
1d23h

While there is a good argument that AI produces transformative outputs, it's refuted when the models are shown to regurgitate literal text, which they have been. Then it just starts to look like a neural memorization agent, a compressed storage algorithm, etc.

7moritz7
1 replies
1d20h

This very rarely happens, usually when trying hard to get it to regurgitate, and I don't think it has ever happened for anything longer than 2 paragraphs, or at most a short article. Certainly not something like a book or even the whole issue of a newspaper.

fennecfoxy
0 replies
1d7h

Yup, exactly. Passages from GoT appear because they're frequently referenced across the Internet, including the multiple wiki style sites for GoT fans, not necessarily because the LLM is regurgitating whole-book content it was trained on.

zettabomb
0 replies
1d22h

I've seen examples of this, but they're nearly always isolated, rather difficult to obtain, and not in fact exact copies. You need to specifically ask for an exact copy, then attempt to defeat the safeguards the model has in place to prevent this, and hope that it was "memorized" - which, for the record, is considered a flaw in the model, as it's a reduction in information density and capability compared to if that "memory" were used for something else. Good models seek to reduce this as much as possible. With the size of the datasets involved (see OP), this feels more like an understandable and reasonable issue to have.

bee_rider
0 replies
1d23h

Definitely open to the idea, but that couldn't be the whole argument. I mean, my brain can output some quotes, but I'm not a compressed storage algorithm. Or at least I hope I'm not.

seanhunter
0 replies
1d23h

Yes agreed, and transformative use itself also has limitations. You don't have carte blanche to use something just because you think it's transformative, for example the Lynn Goldsmith vs Andy Warhol Foundation case over the "Orange Prince" work. https://copyrightalliance.org/warhol-decision-reins-transfor...

numpad0
1 replies
2d

Why does everyone assume "open source" implies legality?

(/s)

fsckboy
0 replies
1d1h

humans have the innate hunter-gatherer sense that generosity and sharing are good for us, and selfishness is bad. And that ethical should be legal and unethical should not be.

Open source is generous sharing, and ethical. Start to nibble away at those ideals and at what point do you slip into unethical? IMDB was a crowd-sourced database put together by a wide community pitching in small efforts which one guy was maintaining like an FAQ. Then the guy maintaining it said, "It's worth money, I own it, screw all of you." How would people react if this happened to wikipedia? But wikipedia is safe because it's a non-profit... you know, like OpenAI, right?

qwertox
0 replies
1d23h

There's odd stuff in there. I just randomly downloaded a file,

https://the-eye.eu/public/Books/ThoseBooks/Puzzles.tar -- 20-Jan-2023 14:54 -- 6M

and it pretends to be a jigsaw puzzle, but is actually eISBN 9781594868573 - The South Beach diet cookbook / Arthur Agatston

jsheard
0 replies
2d

"Open source" implies that, no? A definition of open source which includes blatantly pirated material on the condition that the people who collated and released the pirated material did so for free is really stretching it past breaking point. By that standard everything on The Pirate Bay is open source.

bt1a
1 replies
2d

Pouring one out for the future litigators, jurors, and judges who will have to pore over this inextricable web of legal and technical details

PeterStuer
0 replies
2d

They'll just let their AI do it over lunch.

taylorfinley
1 replies
2d

Yes, from the linked paper:

"Books3 is a dataset of books derived from a copy of the contents of the Bibliotik private tracker made available by Shawn Presser (Presser, 2020). Bibliotik consists of a mix of fiction and nonfiction books and is almost an order of magnitude larger than our next largest book dataset (BookCorpus2). We included Bibliotik because books are invaluable for long-range context modeling research and coherent storytelling"

pimlottc
0 replies
1d22h

This is the most ridiculous legal hand wave I’ve ever seen.

“They’re not books, man, they’re a dataset!”

arthurcolle
27 replies
2d

I can't believe people would do this, just share and republish copyrighted works over the internet. I'm in shock and in disbelief.

Anyways...

Are RedPajama 30T and The Pile "all you need"? ;)

artninja1988
25 replies
1d23h

There is currently a project going on to create The Pile v2, which has only permissively licensed data, because of all the bickering about copyright.

jeffrallen
18 replies
1d23h

because authors prefer to be paid for their labor

FTFY.

ben_w
13 replies
1d22h

Naturally, but I wonder what writers are going to do when AI trained purely on suitably licensed content is still good enough to make most of them redundant.

(The authors on best-seller lists may well be immune for a bit longer than other writers, as they're necessarily the top 0.1% of writers, but not forever: nay-sayers claimed that AI could never beat humans at chess or go because those games required special human insight.)

wizzwizz4
5 replies
1d22h

Once upon a time, nay-sayers said that nobody could travel to the moon, regardless of what vehicle they used. They were wrong. Once upon a time, nay-sayers said that nobody could transmute lead into gold using alchemical equipment. They were right.

Nay-sayers who said that no possible algorithm could beat humans at chess and go? They were wrong. Nay-sayers who say that these algorithms cannot write better books than humans? Well…

SEGyges
3 replies
1d21h

By "these algorithms", do you mean the ones that currently exist, or the ones that will exist next month, next year, or in 2034?

wizzwizz4
2 replies
1d21h

We're not developing new algorithms all that quickly. My point is that one shouldn't dismiss criticism out-of-hand, just because some critics of some other thing turned out to be wrong: for this point to be valid, I don't need to be making criticism. On an unrelated note…

Personally, I'd be referring to the family of algorithms that purely take as input a context window and provide as output a prediction of the next token likelihood. (Plus or minus iteration, to generate strings of text.) Pejoratively, one might call these "fancy Markov chains", though as with most pejoratives, that's overly reductive.
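To make the analogy concrete, here is a literal, non-fancy Markov chain next-token predictor: like an LLM, it maps a context to next-token choices, just with a one-token context and a lookup table instead of a trained network. This is a toy for illustration only:

    import random
    from collections import defaultdict

    def train_bigram(tokens):
        table = defaultdict(list)  # token -> observed successors
        for prev, nxt in zip(tokens, tokens[1:]):
            table[prev].append(nxt)
        return table

    def generate(table, start, length=10):
        out = [start]
        for _ in range(length):  # iterate to produce a string of text
            out.append(random.choice(table.get(out[-1], [start])))
        return " ".join(out)

    table = train_bigram("the cat sat on the mat and the dog sat too".split())
    print(generate(table, "the"))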

All the approaches we're seeing marketed heavily are just fancy Markov chains. I expect every "new algorithm" for the next 5 years at least to be a fancy Markov chain, because that's what I expect to get funding. (I do expect that some people will be working on other approaches, but only for amateurish reasons.)

oasisaimlessly
0 replies
1d5h

Strictly applying the definition, the entire universe is a Markov chain (thanks to quantum discretization!). People who use "Markov chain" as a pejorative are just idiots.

SEGyges
0 replies
1d21h

These are fancy Markov chains in the sense that humans are just chemicals and computers just do math. Technically true, but not even "overly reductive"; it is just wrong if it is used to imply that, e.g., humans just swirl around in beakers or the most complex thing you can do with computers is trigonometry.

You can make anything sound unimpressive if you describe it sufficiently poorly.

And: so many different variations are published every month. There are a good number of people in serious research trying approaches that don't use cross-entropy loss (i.e., strictly next-token prediction).

I don't know what the trajectory of the technology is over the next ten years, but I am positive no one else does either and anyone who thinks they do is wrong.

ben_w
0 replies
1d20h

> Once upon a time, nay-sayers said that nobody could transmute lead into gold using alchemical equipment. They were right.

Now I'm wondering if, with modern knowledge, you could build a 0.5 MeV heavy ion accelerator with only the things available to a medieval alchemist.

I'm thinking probably yes? Triboelectrics can get you the right voltage. But how good does the vacuum need to be?

> Nay-sayers who say that these algorithms cannot write better books than humans?

They may be right or wrong in the specific, but I think they're asking the wrong question, too specific.

mejutoco
3 replies
1d22h

> The authors on best-seller lists may well be immune for a bit longer than other writers, as they're necessarily the top 0.1% of writers

The top 0.1% best-selling. Quality is only one of many possible reasons for that.

ben_w
2 replies
1d21h

Quality is subjective, so I think it is reasonable to say the best are those most able to profit, rather than, e.g., the winners of the Nobel Prize in Literature, or the list of books people most pretend to have read.

mejutoco
1 replies
1d5h

I disagree. While quality is subjective, popularity is different from quality. Otherwise we would only have Marvel movies.

Your argument assumes no marketing is manipulating the best-seller lists, something known to happen with the New York Times best-seller list and others.

Regarding people pretending to read "fancy books" (my term): I think most people just don't read, but I find it annoying, for example when people see my shelves, that some think I buy those books to impress somebody. It is as if people who do not enjoy reading cannot conceive of somebody enjoying it. I think it is slightly anti-intellectual, and cynical. I have a better opinion of people in general.

ben_w
0 replies
1d

> I disagree. While quality is subjective, popularity is different from quality. Otherwise we would only have Marvel movies.

I disagree on two fronts. First, given I responded to "> because authors prefer to be paid for their labor", I think the economics are the key, rather than the artistic merits. The authors and artists suffer economically purely on the basis of the AI doing their work for less, not on the basis of actual artistic merit.

Second, the subjectivity of artistic merit means different people like different things. From a market perspective, this is why horror films get made even though they disgust people like me, it's why kids films get made even though adults outnumber kids, and it's why RomComs exist despite the stereotype of men cringing at them.

You are however correct that it assumes no marketing exists to manipulate the best seller lists. But the marketing is also being outsourced to AI, and I suspect there were more writers writing copy than writing novels and screenplays. Now? Now I'm not so sure, though I'd guess it's still true.

I can sympathise with you about other people thinking you're just virtue signalling with your book collection. What I meant was more along the lines of how War and Peace has a reputation as a book that people like to claim to have read but actually have not, or how many loud atheists state that only atheists have read the bible and that's how they ended up being atheists, though in either case I don't know how accurate the reputations are.

jfvinueza
2 replies
1d22h

Dunno. Writing fiction myself; asked AI to read it aloud. Narrative paragraphs worked fine: a clear, if a bit deadpan, slightly tone-deaf delivery. But dialogue was horrendous: it didn't understand emotional reactions and connotations at all. More than cringey and robotic, it felt soulless. And the distance from "something that makes sense" to "something that feels human" felt insurmountable. Yes, many novels will be written with LLMs in the coming years. They might even touch us. But this little text-to-speech experiment felt like evidence that this technology has a void at its core: it doesn't have access, like a human does, to a gargantuan emotional spectrum, which allows us to understand all sorts of subtleties between what is being said, and why, and what it actually means, and why it affects us (or, hell, how the next line should be read in this context, because it has no context, it doesn't feel).

fennecfoxy
0 replies
1d7h

You tried an AI on your writing and it handled dialogue badly _today_. So why act as if all this progress hasn't happened in just the last couple of years? Fast forward 10 years from now: do you think LLMs won't have the capability to write compelling fiction, that what we have then will be exactly what we have now?

ben_w
0 replies
1d21h

I'm also writing a novel, and using text to speech to hear how it sounds. One of the ones built into Mac OS. And I'd agree with your assessment, I value the synthesiser for bringing my attention to things my eyes gloss over, such as unnecessary repetition and typos which are still correctly spelled words (a common one for me is lose/loose).

But: AI was seen as "decades" away from beating humans at go, even 6 months before it did.

I don't know how far we are from them writing award winning novels (awards we care about, it doesn't count if it's an award for best AI), though my gut feeling is we need another breakthrough as significant as the transformer model… but even then, that's only a 1σ feeling.

onion2k
1 replies
1d22h

If the data is available online in The Pile, surely it's also publicly available to ordinary people in a way that means authors aren't getting any money.

sangnoir
0 replies
1d22h

What sort of defense is this? "Your honor, after someone broke in, they left the door open. Since the door was unlocked anyone could have committed the crime I'm accused of."

zettabomb
0 replies
1d23h

This is pretty reductive - "FTFY" is rarely the witty response you think it is.

evilduck
0 replies
1d23h

I asked an AI tool to create a cheery poem about ringworms infecting kids from the 1600s and it created something that's never existed before. Which author gets paid for this labor they performed?

idle_zealot
4 replies
1d23h

So a bunch of extra work to create a downgrade? I'm sure that's going to be very popular.

arthurcolle
3 replies
1d22h

The training data distribution is the only thing that matters, not the actual content

observationist
2 replies
1d21h

Unless you want something like style from a range of authors, knowledge of a fictional universe or storyline, or other domain specific data or style characteristics.

A blanket removal of copyrighted data would make a bot sterile, boring, unrelatable, and ignorant of culture and common memes. We have amazing AI technology. Let's lean into it and see where it goes.

arthurcolle
0 replies
1d17h

I agree, hypothetically, if I were ever to have an opinion on the matter. Just playing to the audience and potential audiences if this ever gets read into evidence

;)

Haha... just kidding... unless.. ?

__loam
0 replies
1d20h

By violating the copyright of hundreds of authors.

chasd00
0 replies
1d22h

If The Pile contains the code to go from step 1 to step 2 and then to 3, couldn't you just remove the parts you don't want from the raw dataset and re-run the code?

doctorpangloss
0 replies
1d23h

It’s enough for pre training to later tackle specific NLP tasks.

To get something interesting you would have to generate an instruct dataset from it. It would have to cover a diverse range of tasks. The completions themselves do not make LLMs manifest knowledge and reasoning; a large and diverse instruct dataset does.

jwitthuhn
23 replies
2d

Is this still available somewhere? I attempted to download it several months ago and saw the download link 404ing, seems it is still like that.

TrueDuality
17 replies
2d

Most of the distribution for this is via torrents/magnet links and in person hard drive exchanges. I'd go look at some public trackers if you want a copy and don't know someone that already has it.

Do be aware that it does include copyrighted content so distribution is piracy.

Der_Einzige
16 replies
2d

Almost all LLM training datasets include copyrighted content, so almost all open-source LLM distribution is piracy, and almost all API-based LLMs, including ChatGPT, are also piracy and copyright laundering.

Also, most image-text pair datasets contain far worse than that. You might want to check out LAION-5B and what Stanford researchers have found in there. Technically, anyone who even touched it could in theory be in some serious, serious trouble. I find it quite remarkable that nothing has happened yet.

vineyardmike
5 replies
1d23h

The courts (in the US) have not found LLM model weights to be piracy, nor the outputs, but it’s really surprising that LAION was used for so long considering the content you allude to.

Filligree
4 replies
1d23h

LAION is essentially a list of every image on the public internet. It was filtered, of course, but do you really expect perfection?

It's impossible to create such a list while avoiding all such material.

vineyardmike
3 replies
1d23h

There exist databases of hashes of known problematic photos (CSAM), so it seems trivial to check your billions of photos against them before training an AI. You can’t catch everything, but this seems like an obvious miss considering they explicitly tried to scrape pornography.

These hashes are exactly how researchers later discovered this content, so it’s clearly not hard.
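For exact matches, this really is a short script. A minimal sketch, assuming a plain-text blocklist of SHA-256 digests ("blocklist.txt" and "images/" are hypothetical paths; real systems like PhotoDNA use perceptual hashes so re-encoded copies also match, which exact hashing misses):

    import hashlib
    from pathlib import Path

    # One lowercase hex SHA-256 digest per line.
    blocklist = set(Path("blocklist.txt").read_text().split())

    def sha256_of(path: Path) -> str:
        h = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    kept, dropped = [], 0
    for path in Path("images").rglob("*"):
        if path.is_file():
            if sha256_of(path) in blocklist:
                dropped += 1  # known-bad file: exclude from dataset
            else:
                kept.append(path)

    print(f"kept {len(kept)} files, dropped {dropped}")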

SEGyges
1 replies
1d21h

You are uploading 5 billion examples of <something>. You cannot filter them manually, of course, because there are five billion of them. Given that it is the year 2024, how confident can you be that a well-resourced team at Stanford in 2029 won't have better methods of identifying and filtering your data, or a better reference dataset to filter it against, than you do presently?

It is a pretty hard problem.

vineyardmike
0 replies
1d19h

You don’t have to do it manually. There is a database of file hashes.

And this isn’t just “one engineer”. Companies like StabilityAI, Google, etc. have used LAION datasets. If you build a dataset, you should expend some resources on automated filtering. Don’t include explicit imagery as an intentional choice if you can’t do basic filtering.

duskwuff
0 replies
1d23h

The Stanford researchers also found a substantial number of CSAM images in the LAION-5B dataset which were not recognized by PhotoDNA, probably because the images in question were not in wide distribution prior to their inclusion in LAION.

Full paper: https://stacks.stanford.edu/file/druid:kh752sm9123/ml_traini...

Workaccount2
3 replies
1d21h

Models are not information archives. The size of the final model is orders of magnitude smaller than the size of the training data.

Somehow people are just not able to get this through their heads. Stable Diffusion is like 12GB or something, and you have people convinced it's a tool that is cutting and pasting copyrighted works from an enormous image archive.
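The back-of-the-envelope arithmetic makes the point; both numbers below are ballpark assumptions, not exact figures:

    # Roughly how many bytes of model does each training image "get"?
    training_images = 5_000_000_000   # LAION-scale image count (approx.)
    model_bytes = 4 * 1024**3         # ~4 GiB of weights (approx.)

    print(model_bytes / training_images)  # ~0.86 bytes per image

Under one byte per image leaves no room for an archive; at best the model keeps statistical regularities, plus occasional memorization of heavily duplicated items.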

feoren
1 replies
1d19h

The size of the final model is orders of magnitude smaller than the size of the training data.

Good to know I can avoid copyright on a book just by zipping it up!

Workaccount2
0 replies
1d4h

No, you can't. But you can by reading the book, writing down the general gist of it (even including some passages), and then storing that.

LLMs are not compressing petabytes of information down to a few gigabytes.

7moritz7
0 replies
1d20h

Stable Diffusion 1.5 is 1.5 to 6 GB depending on the finetune, and was trained on something like 5 billion images.

littlestymaar
2 replies
2d

It's only piracy if it's a private individual doing it; otherwise it's just “ask for forgiveness, not for permission”-type Capitalism.

gosub100
1 replies
1d22h

It'll be some epic lawsuit like Google v. Samsung that will get drawn out for a decade, awarded, reduced, appealed, etc., where the only winners will be both parties' lawyers.

littlestymaar
0 replies
1d20h

It's gonna be way worse than this:

- OpenAI and others will just settle with MPAA, RIAA and the likes for a revenue stream (a single digit billion a year, likely) + some kind of control over what people can and cannot do with the AI + the access to the technology to produce their own content.

- artists will see peanuts from the deal, and the big names are going to be able to stop doing any kind of business with artists, who are just expenses in their eyes. They will have been replaced by machines that were trained on their art with no compensation whatsoever.

IP is already predatory capitalism, AI will definitely be weaponized against the workers by the owners of the means of “production”.

visarga
0 replies
1d23h

almost all open source LLM distribution is piracy and almost all API based LLMs, including ChatGPT, are also piracy and copyright laundering

That's an amplification of copyright: original expression is protected, but not the ideas themselves; those are free. And don't forget that when we actually get to use these models, we feed them questions, data, and corrections, so they are not simply replicating the training set; they learn and do new things with new inputs.

In fact, if you think deeply about it, it is silly to accuse AI of copyright violation. Copying the actual book or article is much faster, cheaper, and exact. Why would I pay an LLM provider to generate it for me from the title and starting phrase? If I already have part of the article, do I still need to generate it with AI? It's silly. LLM regurgitations are basically attacks with a special key, entrapments. They don't happen in normal use.

doctorpangloss
0 replies
1d22h

I find it quite remarkable that nothing has happened yet.

I don't think it's because you're wrong, per se; it's just that none of this drama really matters.

beeboobaa
0 replies
2d

Turns out you can ignore copyright law if your company has enough money.

natch
0 replies
1d22h

Super odd message, since the Stack v2 seems to be exclusively code and The Pile is (mostly?) text.

HanClinto
1 replies
1d22h

Is it kosher to post magnet links here? I'm not sure.

magnet:?xt=urn:btih:0d366035664fdf51cfbe9f733953ba325776e667&dn=EleutherAI_ThePile_v1

SEGyges
0 replies
1d21h

This is the correct one.

spindump8930
0 replies
1d22h

Also good to note that the Pile contains lots of curated sources, and recent trends have been to take curated data sources and combine them with filtered webcrawls (i.e. Common Crawl with heavy processing). See Dolma or the Stack v2 (for code models) as others have mentioned.

quatrefoil
8 replies
1d22h

While a lot of attention has been given to books3, another large component of this dataset is the deceptively-named "OpenWebText2". What's that? It's a scrape of 15 years' worth of third-party websites that were linked to from upvoted Reddit submissions. I know this includes some of my writing.

observationist
3 replies
1d19h

Relevance and impact aside, if you publish something to the internet on a site with no access restriction in place, I don't know how you can keep a straight face while claiming some sort of moral right to the content. It's the equivalent of broadcasting it over radio, or printing and delivering it straight to the doorsteps of millions of random individuals. Methinks you doth protest too much, or something.

There are ways of copyrighting data and establishing ownership of intellectual property. Your Tumblr fanfic, YouTube comments, or HN discussions are not legitimate copyright avenues. Stuff you post to legally scrapeable websites is fair game for fair use.

I can do anything I want in private to any data I collect. I could create an awesome HN LLM on the scraped datasets and use it privately to my heart's content. I can even set up an API to that LLM that generates content, and, given recent rulings, even if I had all the written copyrighted data in the world, as long as I was making good-faith efforts to ensure copyright was being respected and works weren't being recreated verbatim, then I could even use that model commercially. I just couldn't sell it to other people, or distribute it, without entering a different legal regime.

I can collect any data I want from public facing websites.

That's how the internet works; it's how it was designed. There are authentication mechanisms, network configurations, and a myriad other access control schemes you can implement to prevent public access. If you post to sites without those mechanisms, you're tacitly agreeing to give up any plausible claims of protection against a wide array of fair uses well established by precedent cases at this point. If you don't prevent public access, and you've got a domain name on a server, you're tacitly inviting the world to come download whatever it is you have on your server. This is a social good. This is what we want when we participate in the internet.

Insisting on some sort of vague entitlement as to how "your" data gets used completely bypasses the fact that anything you consider to be misused in OpenWebText2 fundamentally stems from the fact that you posted the content to a publicly visible website and gave up any say in what happens thereafter. It was scraped fair and square.

Don't complain that you didn't know the rules, or that life isn't fair.

It's not even clear that terms of service or those little popups on public websites have any legal relevance. If your website is open to the public, then it's fair game. If you post content to a public website, then that content's fair game.

quatrefoil
1 replies
1d19h

It feels like you're picking apart an argument I didn't make. But I would note that most people don't see this so unambiguously as the position you're defending. To give you an analogy: doxxing is "fair game" too if you posted your info online or gave it to others. But it's not exactly cool to do it, right? It's a subversion and abuse of the system we have in place.

Finally, here's a fun experiment: decide that terms of service don't matter and start building a product by scraping Facebook or Google. See how they'd react. Actually, no need for guesswork: they clutched their pearls and threatened legal action more than once before. It's a bit of a "have your cake and eat it too" kind of a deal. Their data is precious intellectual property; your stuff is, well, up for grabs.

observationist
0 replies
1d18h

Oh, for sure, they get all pearl-clutchy when others try to do exactly what they have done, and they get all "not like that!" about it. The US is a society run by lawyers, and the big corps have the best lawyers. Maybe we can legislate out of the hole at some point, but it's a pretty grim outlook. Google et al. also don't have to have the law on their side; they can simply litigate people and businesses into bankruptcy, regardless of the legal merit of their actions.

At any rate, there are ways of staking a legitimate claim to content you publish online, though even then it may not matter. Robots.txt is a convention, not a regulation or law. It's respected out of social nicety, not because it's strictly legally required.

If you publish your data to a website where it's publicly visible, you are inviting the world to come download your data. When that data leaves your server and goes to live on the downloader's computer, the downloader can do whatever they want with that data.

It's not clear that it's legally possible to prevent the use of data in training models unless you require someone to sign a contract to that effect before being allowed to download your data.

That would be obnoxious, and I wouldn't bother with your content anymore. Like Instagram, LinkedIn, and Twitter, your site would get a 127.0.0.1 hosts file entry.

The US needs a clear, modern update to copyright law that upholds and maximizes individual rights, as well as privacy and property concerns. We shouldn't be playing this game where we pretend a website is somehow an analogy for a page of text scribed with a quill pen and using laws developed to handle issues when quill and parchment were relevant.

Let's write some new laws where we regulate what things are, and not play tortuous mental gymnastics to contort and butcher existing laws and precedents to say whatever the most expensive lawyers want.

Maybe the social contract allows for people to prevent their conversations from being scraped and used by third parties without explicit consent, even if the conversation is entirely public. I don't like that view, but I see the argument for it.

As things stand, though, fair use and public access make things pretty bright and clear, and rulings in various AI cases so far have favored broad fair use interpretations, and are requiring complainants to show specific, particular harms. If/When those harms are shown, then we'll see if any carveouts will be made, or if broad fair use interpretations will be the baseline for content scraping going forward.

UncleEntity
0 replies
1d17h

It's the equivalent of...printing and delivering it straight to the doorsteps of millions of random individuals.

Which, incidentally, the New York Times does, and yet they seem to think they have some legal right to the redistribution of their work.

Maybe they're right, maybe they're wrong, it's up to the courts to decide.

7moritz7
3 replies
1d20h

Care to give me your domain name so I can check all major LLMs for plagiarism? I have a feeling none of them can produce a sentence from your writings.

quatrefoil
2 replies
1d19h

It takes deliberate effort, but I was actually able to get pieces of my writing out of one of the leading LLMs (not ChatGPT). This is not particularly unique; a number of folks have demonstrated the same.

7moritz7
1 replies
1d19h

How long were those pieces?

fennecfoxy
0 replies
1d7h

I would probably ask more how unique the string of text was; can't lay claim to something where the words naturally follow one another and searching Google comes up with several results.

turnsout
3 replies
2d1h

The Pile is pretty old—is this an updated version?

bt1a
2 replies
2d

It is not.

In related news, v2 of the "Stack" dataset was recently released:

3.28B unique files belonging to 104.2M github repositories were collected by traversing the Software Heritage 2023-09-06 graph dataset. Additional repository-level metadata was collected from GitHub Archive data up to 2023-09-14. The total uncompressed size of all files is 67.53TB. Near-deduplication was implemented in the pre-processing pipeline on top of exact deduplication.
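Exact plus near-deduplication is conceptually simple. A toy sketch, assuming small in-memory documents (at this scale, real pipelines use MinHash/LSH rather than pairwise comparison):

    import hashlib

    def shingles(text, n=5):
        words = text.split()
        return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

    def jaccard(a, b):
        return len(a & b) / len(a | b) if a or b else 0.0

    def dedup(docs, threshold=0.7):
        seen = set()  # exact dedup: byte-identical content
        kept = []     # near dedup: shingle Jaccard similarity
        for doc in docs:
            digest = hashlib.sha256(doc.encode()).hexdigest()
            if digest in seen:
                continue
            seen.add(digest)
            s = shingles(doc)
            if any(jaccard(s, shingles(k)) >= threshold for k in kept):
                continue
            kept.append(doc)
        return kept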

V1 vs. V2 (deduped size / tokens):

V1: 2.9 TB / 200B tokens

V2: 32.1 TB / 900B tokens

I imagine we'll see some fairly powerful open coding models soon. The ones I'm looking at testing are:

dolphincoder-starcoder2-15b-iMat.GGUF

CodeFuse-DeepSeek-33B-iMat.GGUF

OpenCodeInterpreter-DS-33B-iMat.GGUF

starcoder2-15b-instruct-iMat.GGUF

More info:

Dataset: https://huggingface.co/datasets/bigcode/the-stack-v2

GGUF quants: https://huggingface.co/dranger003

bick_nyers
1 replies
2d

Do you happen to know what the v2 dedup size is when compressed? 32.1TB is quite a bit, but if that compresses down to, say, 3-6TB, it would be much more manageable. Code has a lot of whitespace, repetition, and structure/predictability, so I imagine it would compress better than average text.
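One way to estimate without downloading the whole thing: compress a representative local sample and extrapolate. A minimal sketch ("sample/" is a hypothetical directory of source files; per-file compression understates cross-file redundancy, so treat the result as a lower bound):

    import lzma
    from pathlib import Path

    raw = packed = 0
    for path in Path("sample").rglob("*"):
        if path.is_file():
            data = path.read_bytes()
            raw += len(data)
            packed += len(lzma.compress(data, preset=6))

    print(f"ratio: {raw / packed:.1f}x")
    # If code compresses ~4-5x, 32.1 TB would land around 6-8 TB.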

spindump8930
0 replies
1d22h

Those sizes refer to the data before processing and filtering. The actual training size was about 3 TB:

   The Stack v2 is ten times larger than its predecessor, yielding a raw dataset of 67.5 TB. Through extensive cleaning, filtering, and subsampling of the source code, along with the incorporation of other high-quality code-related datasets, we created a training set of approximately 3TB (900B+ tokens). 
Source: the paper, Section 10 (https://arxiv.org/pdf/2402.19173.pdf)

brokensegue
2 replies
2d

"open source" as in gratis but not as in libre?

o11c
0 replies
1d23h

It should be understood in contrast to most traditional corpora, which are heavily paywalled/restricted ... or else based solely on century-old books. That has long been a major obstacle for linguistics tooling.

If the current push of AI companies to get their way (to allow copyright laundering) succeeds, this would almost count as open source by the real definition.

If not ... lots of people/companies are committing copyright crimes, some are committing civil infractions, and some may be able to claim fair use.

Legend2440
0 replies
2d

More like: it's a scrape of the entire internet; use it at your own risk.

joshuakogut
1 replies
2d

If you’d like to contribute it, feel free to submit a PR

Stella was waiting for you to submit your dataset. Did you? She closed the ticket many months later.

Der_Einzige
0 replies
1d23h

They did a significant amount of work themselves, taking other people's datasets and including them without the original author needing to submit the full PR. I was then, and to this day remain, extremely busy.

Also this was before most datasets were hosted conveniently on huggingface.

It's all tears in the rain now.

mjtechguy
1 replies
1d23h

Would be interested to see what is in there. Luckily no one has posted the magnet link on Twitter.

SEGyges
0 replies
1d21h

The counterparties on related legal action are sufficiently litigious that it is probably smarter to DM the magnet link.

_obviously
1 replies
1d22h

Seems kind of small tbqh.

beiller
0 replies
1d19h

It seems small, until you try to download it.

Fornax96
1 replies
1d8h

It seems the Pile is currently inaccessible. I would be willing to mirror it on Pixeldrain if I can get my hands on it.

h0p3
0 replies
1d3h

v1:

magnet:?xt=urn:btih:0d366035664fdf51cfbe9f733953ba325776e667&dn=EleutherAI_ThePile_v1&tr=http%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce&tr=udp%3A%2F%2F9.rarbg.to%3A2710%2Fannounce&tr=udp%3A%2F%2F9.rarbg.me%3A2710%2Fannounce&tr=udp%3A%2F%2F3rt.tace.ru%3A60889%2Fannounce&tr=http%3A%2F%2F5rt.tace.ru%3A60889%2Fannounce&tr=udp%3A%2F%2Ftracker.cyberia.is%3A6969%2Fannounce&tr=udp%3A%2F%2Fexodus.desync.com%3A6969%2Fannounce&tr=http%3A%2F%2Fexplodie.org%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker3.itzmx.com%3A6961%2Fannounce&tr=http%3A%2F%2Ftracker1.itzmx.com%3A8080%2Fannounce&tr=udp%3A%2F%2Fp4p.arenabg.ch%3A1337%2Fannounce&tr=udp%3A%2F%2Fopen.stealth.si%3A80%2Fannounce&tr=udp%3A%2F%2Fwww.torrent.eu.org%3A451%2Fannounce&tr=udp%3A%2F%2Ftracker.torrent.eu.org%3A451%2Fannounce&tr=udp%3A%2F%2Fretracker.lanta-net.ru%3A2710%2Fannounce&tr=udp%3A%2F%2Ftracker.ds.is%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker4.itzmx.com%3A2710%2Fannounce&tr=udp%3A%2F%2Ftracker.moeking.me%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.tiny-vps.com%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.zerobytes.xyz%3A1337%2Fannounce

DiggyJohnson
1 replies
2d

Awesome name. Reminds me of the "original" "Pile" from the Manhattan Project.

I read about it in "The Making of the Atomic Bomb" (1986), but presumably it's featured in the recent movie.

groby_b
0 replies
2d

Not really. There's an ultra-brief scene where it's mentioned, but that's it, IIRC.

The movie... is a bunch of anecdotes strung together to make a ham-handed point at the end. It was a decent movie if you treat it as a fictional story instead of an actual retelling.

I'd stick with the book. (And if you specifically care about Fermi, I recommend "The Last Man Who Knew Everything" by David Schwartz)

willvarfar
0 replies
1d21h

Are there any simple text editors or WYSIWYG editors that have local LLMs and can tidy up and auto-suggest whole paragraphs of slick verbiage as you type?

racee
0 replies
2d

I love the Illuminati vibes of The Eye.

kristianp
0 replies
1d19h

Please add (2020) to the title.

joering2
0 replies
2d

The pile can be downloaded here.

404 Not Found nginx

825 GB is a great candidate for torrent use, whatever was under that broken link better be a torrent magnet.

jMyles
0 replies
1d22h

So much of this thread is concerned not with the achievement of this data set, but with the (by comparison) silly and outdated spat over how to frame it as "property" for the purposes of government intervention (pursuant to which jurisdiction?).

The era of intellectual "property" is over. Let's be at peace with that and just move on into the next age.

intalentive
0 replies
1d21h

This is 4 years old. Why is it at the top of HN now?

fddrdplktrew
0 replies
1d23h

825 GB seems really small.

clooper
0 replies
1d19h

The big Hollywood studios pay a lot of money to various cybersecurity companies to look for pirated content and send cease-and-desist letters to hosting companies for letting their users distribute copyrighted content.

If authors and artists were to join a data union, they could do the same thing as the studios. If copyright law has any real teeth, then the data union can send legal requests to whoever is hosting the content, asking for it to be taken down.

I'm not a lawyer, but I know the studios definitely do this.

__lbracket__
0 replies
1d22h

LLMs are of use to megacorps.

Megacorps assume authors, painters, etc. are poor and powerless (which, let's face it, they are).

We can b** and moan on HN, but megacorps will find ways to use copyrighted works for free.