NY Times copyright suit wants OpenAI to delete all GPT instances

munchinator
208 replies
9h6m

It's interesting to me how ambivalent people are about reproducing news content. Whenever there is a story from NYT on HN (or any other large media outlet), the top comment is almost always a link to an archived version which reproduces the text verbatim.

And this seems to be tolerated as the norm. And yet, whenever there is a submission about a book, a TV show, a movie, a video game, an album, a comic book, or any other form of IP, it is in fact very much _not_ the norm for the top-rated comment to be a Pirate Bay link.

I think it's worth reflecting on why we feel it's OK to pirate news articles, but not other IP.

And the reason I bring this up is that it seems like OpenAI has the same attitude: scraping news articles is OK, or at worst a gray area, but what if they were also scraping, for example, Netflix content to use as part of their training set?

Germont
27 replies
8h20m

To me, there is a sense that the news, which is real information about the society that we currently live in, should be available to all participants of that society. The notion of being a good citizen requires that one stays informed. Books, movies, video games etc. don't have that role and are more consumption goods.

JackFr
11 replies
8h1m

should be available to all participants of that society.

Who pays?

fodkodrasz
6 replies
7h30m

The government (and thus the people, in a so-called sharing of the public burden)!

For example, in Hungary there is an official news agency run by the government, with (cumbersome) free access for everybody. Of course this provides a somewhat biased presentation of some facts, but on many topics it provides unbiased access to news for any citizen.

This is actually pretty common in Europe, often funded by mandatory fees (for some reason not branded as taxes) that certain appliance owners need to pay (UK TV license, German Rundfunkbeitrag). For this fee, people get free access to news and cultural programmes via different media (radio, TV, internet).

bongripper
5 replies
7h15m

I agree with your general point, but Hungary is probably the worst example you could have chosen from any EU country! The Orbán government is famously using it to spread propaganda and fake information at unprecedented levels.

The level of control governments exert over public broadcasting networks varies widely. Under Meloni, RAI in Italy has been facing similar issues, but Hungary is still the canonical example of government misinformation and propaganda.

fodkodrasz
4 replies
6h34m

That is orthogonal to the discussion we were having. The topic was whether people should have free access to news, and how it should be financed, not the quality of that news.

People have free access to public roads all around the world, and the quality varies wildly there as well. The quality of for-profit news services also differs wildly; you might have an opinion about that of Fox News, for example, but that is also off topic in this discussion.

nulbyte
1 replies
6h22m

That is orthogonal to the discussion we were having. The topic was whether people should have free access to news, and how it should be financed, not the quality of that news.

On the contrary, the quality of the news is very important to the discussion. There is no point in making trash freely available to the public, after all.

fodkodrasz
0 replies
6h6m

The topic is a bit more nuanced, and far wider than "not fitting my favourite narrative on some topics, so it is generally and objectively trash".

Think about this: I will get mostly objective and useful reports of the flood approaching my home near the river, regardless of the narrative/interpretation they might have on some other topics, or the biased reporting on the merits of the government in handling the situation at the dams.

I'm not here to debate the political policies of particular governments; I just gave a few examples of ways to fund public access to news. This discussion is over on my part.

jacquesm
0 replies
6h2m

No, it isn't an orthogonal discussion. The reason Orbán wants people to have free access to his propaganda is that it directly serves his purpose. Financing it directly from sales of the media would defeat the purpose. Coupled with Orbán's attacks on the free media, it completes the picture.

bongripper
0 replies
6h2m

I would argue the people of Hungary would be better off without the hatred against asylum seekers, minorities, and political opponents, and without the lies and misinformation.

philwelch
0 replies
2h3m

Every news source has biases. Under the paywall business model, the people who share the biases of their favored news outlets pay for them, and in exchange, they get to ensconce themselves inside a bubble free of dissenting viewpoints. This also reinforces the bias of the news outlet; if they don’t toe the line, they will lose subscribers.

Instead of paying news outlets to provide ourselves with filtered feeds of content that match our own biases, we could instead pay news outlets to produce competing streams of explicit propaganda to be freely disseminated. The overall bias and quality of the news would be largely unchanged, even if the biases were more obvious; in fact, it may even improve.

makeitdouble
0 replies
6h10m

Yes, someone needs to pay.

I see the GP post about pirating news as a very good point, while having no velleity to pay the New York Times, and being OK with not reading it in general.

But I also pay for my national (public) news outlet, and their articles are available to anyone anywhere in the world. I don't know how it should work, but I wish we could get to a system where the burden of keeping news outlets alive is split thinly enough to have open but viable publications around the world.

Basically the same way weather stations collaborate all over the world, and we pay for our local stations while getting access to all the forecasts everywhere.

danielheath
0 replies
5h18m

Everyone, if you don’t…

concordDance
0 replies
5h59m

There are a few possible models here:

Public donors, a la Patreon

People doing it in their free time because they care a lot about the subject (nowadays, with things like Twitter, it's quite possible for an independent obsessive to write a good piece on, for instance, the Ukraine War by mostly referring to open sources and public announcements by governments and corporations)

Government sponsorship, a la the BBC

afavour
3 replies
8h17m

It’s a difficult problem with no great answers. If you want news to be free at the point of delivery you want public service news agencies. But that means they’re owned by the government… who are frequently the target of critical reporting.

bongripper
2 replies
7h21m

That's not true. You can have independent public broadcasting that is not owned by the government and reports critically on it.

afavour
1 replies
7h7m

It’s still a difficult tension. The government will always control the purse strings so independence is always going to come with conditions.

vidarh
0 replies
6h23m

The Guardian in the UK is an example of an alternative: It is owned by a trust, which funds it.

Norway has substantial public media funding across the political spectrum, but as you point out, it always comes with conditions, even if less so than the funding for the state-owned broadcaster.

Combining the two models and putting public funds into several perpetual trusts, intended to provide funding from their profits at arm's length from any sitting government, similar to the (private) trust funding The Guardian, might be an interesting alternative.

(EDIT: Norway also has its own variation on The Guardian model: the second-largest media group was founded by unions but is now majority-owned by a combination of two public benefit trusts.)

rickydroll
2 replies
3h18m

I'm of a similar mind. I take the more expansive view that everything created is part of our common property and that something like an LLM should be able to yield the summary and references to those creations. As I've said elsewhere, LLM systems might be our first practical example of an infinite number of monkeys typing and recreating Shakespeare (or the New York Times).

I understand that copyrights and patents are vehicles for ensuring a creator gets paid for their work, but they are flawed in that they do not reward multiple parallel creations and they last too long.

briansm
1 replies
2h45m

An LLM is just a hugely lossy-compressed version of its training data, an abstraction of it.

Much in the same way, when you read a book your brain doesn't become a pirated copy of the text, as you only store a hugely compressed version of it afterwards: a feeling for the plot, generated images, and so on.

rickydroll
0 replies
2h0m

That's what I thought from my various readings about LLM systems. I'm guessing the kerfuffle from the New York Times and other shortsighted organizations is that copyright allows them to control how their content is used. With humans, it's simple: the content is read and misremembered. Using it for LLM training requires a different model. It should probably be a RAND (reasonable and non-discriminatory) fee system based on the volume of training data because, as you say, the training data is converted into an abstract form.

guhidalg
2 replies
8h11m

I agree, but nothing worth having is free. NYT and other news outlets have to ultimately pay reporters to go out into the world and do the work. The reporters are not priests, and the NYT is not a church that lives off donations and tax exemptions. They need money to operate, and you may disagree with how they try to collect that money (paywall) but that doesn't solve their funding problem.

How would you pay for news otherwise?

Ntrails
1 replies
7h34m

How would you pay for news otherwise?

You could subsidise news via "public service" style stipends. Much like having a government owned "independent" news service (eg the BBC) this comes with a high risk of corruption. Don't bite the hand that feeds and all that.

You could implement a much lower friction non-recurring payment system. I'd be far more tempted to drop a little money on a fixed term (5 articles, 1 day, ???) setup than a subscription.

Realistically, I am not paying for more than 1 long running sub. And there are > that number of solid outlets.

lotsofpulp
0 replies
7h12m

Realistically, I am not paying for more than 1 long running sub. And there are > that number of solid outlets.

This is somewhat how Apple News+ works, but I doubt most news orgs want to be held captive by Apple.

sumedh
1 replies
5h56m

which is real information about the society that we currently live in, should be available to all participants of that society.

Who should pay the journalists or the investigative reporters?

malermeister
0 replies
4h52m

The state, through taxes. It's a public good after all.

pawelmurias
0 replies
7h21m

which is real information

People post archive links even to fake NY Times articles.

ahoka
0 replies
8h7m

Not everything is news that appears in a newspaper. There are opinion pieces, etc.

Baldbvrhunter
0 replies
7h56m

What about Wordle, or the crossword, or the cooking section?

https://cooking.nytimes.com/

Yizahi
21 replies
7h44m

Good comment. It was very funny to see how people desperately try to find a moral justification for pirating media A but not B. "It's apples to oranges, you see: there are fewer letters in the NYT article than in the book, and they are rendered differently, so it is OK to pirate their work. I did nothing wrong!" :)

amelius
9 replies
7h11m

There's no way to get your money back if you didn't like the content. If they don't want their articles to be read for free then they should keep them out of my view. And certainly not use clickbaity headlines. Information can be copied and they should accept it, or change their business/distribution model.

Yizahi
8 replies
6h43m

So if I went to a cinema and didn't like the movie, I should be entitled to a refund, right? Or if I went into a museum and didn't like the art displayed there?

If you are advocating for a free-for-all libertarian dystopia, well, I have some bad news for you: they never work.

amelius
5 replies
6h33m

So if I went to a cinema and didn't like the movie, I should be entitled to a refund, right?

Not being able to un-see a movie and get your time and money back is one side of the coin. The other side is that information can be copied.

Both sides suck for one of the parties. There's no reason why one of them gets it their way, especially if it requires a contrived legal framework while the other way would require nothing at all.

bena
4 replies
5h38m

You’re not paying to enjoy the content, you’re paying to experience the content.

And as long as you had the opportunity to experience the content, you’ve gotten what you paid for.

I don’t see “I don’t like it” as a valid reason for a refund.

amelius
3 replies
5h29m

You’re not paying to enjoy the content, you’re paying to experience the content.

Not sure about others, but I'm not.

goatlover
1 replies
2h24m

Would you make the same argument for a sporting, theatrical or music event? That you should be refunded if you didn't enjoy it?

amelius
0 replies
1h42m

Does it matter? Sounds to me like an apples and oranges comparison.

If I read an article in the NYT then I'm paying for what I took away from it, not for the amount of time that it allowed me to kill.

bena
0 replies
4h31m

Your personal opinion on the matter has little weight here.

It doesn't matter what you think you're paying for or should be paying for, the fact of the matter is that you're paying for the effort people put in bringing that to you. So you are, whether you want to be or not.

4RealFreedom
1 replies
4h39m

I don't agree with the OP, but how are refunds a free-for-all libertarian dystopia?

Yizahi
0 replies
1m

"Information can be copied and they should accept it" <- I was referring to this line. This basically means that OP thinks that any intellectual property should be free for everyone. This means that probably half of humanity (who are currently creating anything with IP) will have to be libertarians, and that can't happen unless all humanity are libertarians. And libertarian society is a dystopia. :)

fodkodrasz
6 replies
7h36m

It is actually a case of pirating content by companies for humongous profit, versus pirating by individual human beings for free access to culture and entertainment, oftentimes content one has already paid for but which has been rendered inaccessible by megacorporations.

lotsofpulp
5 replies
7h24m

Which content-making businesses earn humongous profit margins?

Are all the journalist layoffs a fever dream?

This is one of the more profitable ones, and only because they employ unscrupulous tactics:

https://www.macrotrends.net/stocks/charts/NWS/news/profit-ma...

This is NYT, the most successful news business:

https://www.macrotrends.net/stocks/charts/NYT/new-york-times...

As for movies/tv show/music makers, let’s just say most people in the software engineering business would look at their numbers and count their lucky stars that they are not in the movie/tv show/music business.

(It is also true that excessive copyright lengths have removed access to content that the public should have a right to).

sjfjsjdjwvwvc
2 replies
6h16m

The movie/tv show and music business can keel over and die tomorrow - it wouldn’t affect the value of art produced by humans at all. I see those more as exploitative leeches than as contributing anything positive.

If only piracy actually harmed these businesses, but alas, as has often been demonstrated, it has zero effect on their bottom line; if anything, it increases their profits.

RandomLensman
1 replies
5h22m

What do you mean by "art"?

sjfjsjdjwvwvc
0 replies
1h2m

Hard question, but in the context of my comment I would say any kind of visual media or music

fodkodrasz
0 replies
5h56m

Which content-making businesses earn humongous profit margins?

You got my point backwards: it's the AI companies that will make humongous profits from the pirated content, not individual users.

defrost
0 replies
7h18m

Which content-making businesses earn humongous profit margins?

https://en.wikipedia.org/wiki/Mad_(magazine)

https://www.theonion.com/

dillydogg
1 replies
4h8m

I wonder what the reaction of some of the people who browse this forum would be if the output of their careers were so commonly pirated. Somehow, I think most believe this argument doesn't apply to them.

Demiurge
0 replies
2h5m

I’d be pretty delighted. I’m paid for getting projects done, not for keeping hold on some copyrighted code. I want all my code to be open sourced, and reused.

sjfjsjdjwvwvc
0 replies
6h14m

Of course pirating any media is totally fine from a moral standpoint.

ks2048
0 replies
4h58m

It seems pretty natural to me. People generally have less problem with stealing a candy bar than stealing a car. (Consider the cost to produce a NYT article vs the cost to produce a Hollywood movie). I don't think the stealing-vs-pirating analogy is perfect, but it's related.

tzs
19 replies
5h59m

I think that's something worth reflecting on, about why we feel it's OK to pirate news articles, but not other IP

As you noted it is not the norm to post pirate links here for IP other than news articles, but that doesn't mean that a lot of people think it is not OK to pirate those other forms of IP.

In nearly any big discussion that even remotely involves video streaming there will be numerous posts from people explaining why they pirate (usually with ridiculous justifications like "subscribing is not an option because even though this paid service does exactly what I want now at a price that is trivial for me they might someday later change").

The impression I've gotten is that piracy of nearly everything is widely felt to be OK here. Information wants to be free, yada yada.

About the only piracy that is consistently frowned upon here is piracy of open source software. When some company sells an embedded device that uses GPL code without releasing the corresponding source that's viewed as just a little short of a crime against humanity.

monkeynotes
5 replies
3h16m

People used to leave newspapers in the trash, on the train, all over the place. Anyone could pick them up and read for free. I think it's reasonable for folks to carry this attitude into the digital age. People feel like news is something to share, it's not the source of creative expression, it's facts and as such we feel entitled to know the facts about our world and what is happening that might affect us.

zwischenzug
3 replies
2h55m

That newspaper was likely paid for by someone, and could only be read by one person at a time.

anhner
1 replies
2h23m

And what if the person picking up the paper would stand up and shout the content of the article so all the people on the train would hear?

svachalek
0 replies
2h16m

Reminds me of the movie News of the World. The main character's job is going from town to town, reading newspapers aloud.

Kerb_
0 replies
2h35m

While I'm well aware I'm being pedantic: my brothers and I would share the comics while my parents kept the news, up to four of us consuming one paper at a time. Realistically, the reading limit was due to the physical properties of the object, not some inherent property of information that it must be consumed through one avenue at a time.

edgyquant
0 replies
2h36m

No, it isn't reasonable, and people no longer paying for the newspapers they read is the reason all news is sensationalist opinion pieces today.

theappsecguy
3 replies
3h49m

This seems very false to me. Spotify is the prime example: they offer a good product that covers 100% of my needs at a reasonable price. If that were an option for, say, UFC or engineering books, you bet I'd be subscribed. But being forced to read through some crappy reader software when I need the book source to take annotations in other software doesn't work, so here we are. Same with the absurd pay-per-view business model of the UFC.

ryan_j_naughton
1 replies
2h50m

For books, if it's a frustration with the client reader software, then you should still buy the digital version; then you can pirate the PDF and use it as desired within the constraints of copyright law (e.g. don't go sharing the PDF). That way you get the client you want, but you still paid the content creator. To use the argument "oh, I don't like their client, so I'm not going to pay them" is BS.

For UFC, your complaint is you don't like their pricing. The whole point of copyright is to give someone the monopoly to control pricing so they can use that pricing power to incentivize them to create the product in the first place. Similarly to patents. Thus, complain about the format things are delivered in all you want (like the client) but pricing is inherent to copyright or patents for good reason. You are now just arguing that you as a consumer should be able to pirate if you don't agree with pricing. And that's ludicrous.

In that case, just read a news article about the event. Copyright doesn't cover facts, only creative expression. So a news article covering the facts of the UFC fight can be published without the consent of the copyright holder. Think of the digital video of the fight almost like buying a ticket to the fight. You're saying you should be able to sneak into the fight and watch it for free, without any justification for doing so.

Finally, you can also watch other people's videos of the fight that THEY recorded on social media as other sources of the fight information. But if you want the recording with all the right angles, coverage, etc., it clearly has value to you over written recaps or social media coverage. And you are just arguing over price, which they, as the copyright holder, have the right to set.

bigfudge
0 replies
2h26m

The problem with buying the crappy DRM version is that it provides no incentive for the publisher to change. I have thought about this long and hard, but ultimately the only way Spotify came about was because nobody bought the terrible DRM'd music the labels wanted to foist on us. We need to inflict the same pain for books. Personally, I think it would be preferable to donate the same amount to the Books Trust or your local library.

natdempk
0 replies
2h48m

This is also along the lines of how I think about things. If you make it convenient enough (compared to the alternative of paywall bypass or piracy) and provide enough overall/general value then I'm happy to subscribe. At the point where the experience degrades, or seems beyond the point of what one person could reasonably subscribe to, I basically just give up.

Spotify hits this sweet spot where one subscription delivers almost all the music you'd want to listen to. Steam hits this for games, where a couple of clicks can launch and play almost any game with minimal hassle. Netflix mostly used to hit this, but most of the current streaming stuff feels overpriced if you want to get all the content (an unbundled cable bundle). News feels similar to streaming in that it's unbundled, and there's a lot of interesting content out there, but there's no way I'm subscribing to 15 different newspapers, especially random local ones for cities I don't live in. If there were a news bundle subscription for a reasonable price, I think I would pay for it.

tomComb
1 replies
4h47m

Yeah, I don’t judge people for pirating or ad blocking, but the ludicrous justifications do get me - quite the entitled mental gymnastics. They remind me of bitcoin people trying to explain how mining is good for the environment.

_jal
0 replies
4h16m

There's a "polite society" thing going on.

Briefly, something like:

1) Y Combinator could not tolerate HN becoming a site known for sharing IP-law-violating content. And the people who come here are, by and large, smart and socialized enough to implicitly understand why.

2) At the same time, a large number of folks here mostly wink and nod at that sort of consumer infringement. And there's a society-wide bias towards "things like news are less protected", so that gets to slide.

3) But people also have a need to tell consistent-seeming stories about how things work, thus the mental gymnastics.

It ends up being similar to trying to explain why people pretend to be prudish innocents about sex. It largely reduces to "a small subset of the population goes sufficiently ballistic about what I consider to be relatively trivial stuff as to make it not worth fighting over, even if I find that to be ridiculous."

There are a lot of different versions of this that become so normalized it can be hard to notice.

thfuran
0 replies
3h12m

"subscribing is not an option because even though this paid service does exactly what I want now at a price that is trivial for me they might someday later change"

I'm not saying you've never seen anyone make an argument roughly like that, but I will certainly say that it is not at all representative of the argument that I see made. Complaints usually have to do with current behavior of the platform or the wider streaming ecosystem.

raldi
0 replies
1h20m

> In nearly any big discussion that even remotely involves video streaming there will be numerous posts from people explaining why they pirate (usually with ridiculous justifications like "subscribing is not an option because even though this paid service does exactly what I want now at a price that is trivial for me they might someday later change").

If this is true, it should be easy for you to link to an example. Could you do so?

kmeisthax
0 replies
4h20m

The GPL was specifically written to lock code out of the proprietary realm, so if you hate copyright[0] you'll hate people using it as intended.

[0] To be clear, I know of few who actually like copyright. Tolerate it? Use it as needed? Sure. The only people who actually defend the current broken-ass system are large media companies which are built to optimally exploit it.

kiba
0 replies
3h18m

Piracy is different from plagiarism.

People are understandably angsty about someone stealing credit. A NYT article is going to be a NYT article, not laundered around and presented as someone else's work.

Plus, there's the angle of enshittification, ads being injected into a paid service, and so on.

joshstrange
0 replies
4h25m

In nearly any big discussion that even remotely involves video streaming there will be numerous posts from people explaining why they pirate (usually with ridiculous justifications like "subscribing is not an option because even though this paid service does exactly what I want now at a price that is trivial for me they might someday later change").

I've read and participated in many such threads and I've literally never seen this take. What I often see are complaints about having to learn a different UI for each service/app, no offline support, ads injected into paid services, having to figure out which service a show is on, and generally terrible UI you can't change/fix.

I don’t think I’ve ever really seen someone use the argument “yes it’s great today but they might charge more later”. Not saying people haven’t said that but it’s far from the main thing people say in my experience.

jancsika
0 replies
2h2m

"subscribing is not an option because even though this paid service does exactly what I want now at a price that is trivial for me they might someday later change"

Gonna gamble and call bullshit on this.

My speculation: the most popular reason HN'ers give for pirating: they literally cannot get the content otherwise.

2nd most popular: it is such a pain either to purchase the content or to get it to run on bog-standard software (like Firefox/Linux/etc.) that otherwise-paying fans are driven to whatever the current equivalent of BitTorrent is.

In fact, I don't believe I've ever seen a justification for using bittorrent or whatever due to what someone's favorite streaming service might do in the future. I'm assuming you saw at least one based on what you wrote-- care to give a link?

alfiedotwtf
0 replies
5h4m

About the only piracy that is consistently frowned upon here is piracy of open source software. When some company sells an embedded device that uses GPL code without releasing the corresponding source that's viewed as just a little short of a crime against humanity.

Like what you said...

Information wants to be free

DennisP
11 replies
4h8m

I wouldn't say OpenAI has exactly the same attitude, since they also pulled in thousands of books. Their position has been that it's not piracy, since they don't republish the books; effectively the AI just reads them and learns from them. If GPT can be made to reproduce the original articles, that's a more difficult argument to make.

Matticus_Rex
7 replies
3h54m

It turns out you can reproduce articles with next-token prediction when the articles are quoted all over the dataset.

The articles themselves are indisputably not a part of the model, because it doesn't store text at all. OpenAI's position is correct; people just underestimated how well the AI learns from reading, especially when it reads the same text in a bunch of different places because it's being quoted/excerpted.
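
For intuition, here is a toy sketch of the effect (a trigram counter, nothing like GPT's actual architecture; the "article" is invented for illustration). Once a passage recurs often enough in the training data, greedy next-token prediction reproduces it verbatim from co-occurrence counts alone:

    # Toy next-token predictor: count which token follows each pair of tokens.
    from collections import Counter, defaultdict

    article = "the oldest dna ever sequenced reveals a lost world".split()
    # The passage is quoted all over the training data:
    corpus = article * 50 + "unrelated filler text about other things".split()

    counts = defaultdict(Counter)
    for a, b, c in zip(corpus, corpus[1:], corpus[2:]):
        counts[(a, b)][c] += 1

    # Greedy generation from a two-word prompt regurgitates the passage:
    out = ["the", "oldest"]
    for _ in range(len(article) - 2):
        out.append(counts[(out[-2], out[-1])].most_common(1)[0][0])

    print(" ".join(out))  # -> the oldest dna ever sequenced reveals a lost world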

eigenket
5 replies
3h47m

If it can and does reproduce a piece of text verbatim then the text is indisputably stored somehow in the model.

Matticus_Rex
4 replies
3h13m

That's just not true. There's no search and retrieval involved. It just associates the words so strongly in that context because they were in the training data so often that next-token prediction can (sometimes, in some limited circumstances) reproduce chunks of it. It's like if a human had read pieces of an article so many times and knew NYT style so well that they could spit out chunks of an article verbatim, but using more efficient hardware and with no actual self-understanding of what it's doing.

vel0city
2 replies
2h31m

So it stores the words, and it stores the links between those words...

but somehow storing the words and their links is not storing the actual text? What is text but words and their links?

If I had a database of a billion words, and I had a list of pointers to words in a particular order, and following that list of pointers reproduces a copyright text exactly, isn't the list of pointers + the database of words just an obfuscated recreation of that copyrighted work?
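
(To make that concrete, a minimal sketch with a made-up sentence; the word table is the "database" and the pointer list walks it:)

    # A word table plus an ordered pointer list is a copy with extra steps.
    words = ["brown", "dog", "fox", "jumps", "lazy", "over", "quick", "the"]
    pointers = [7, 6, 0, 2, 3, 5, 7, 4, 1]  # indices into `words`
    print(" ".join(words[i] for i in pointers))
    # -> the quick brown fox jumps over the lazy dog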

Matticus_Rex
1 replies
1h57m

It doesn't store the actual links; it just stores information about their likelihood of being used together. So for things that are regularly quoted in the data, it will under some circumstances, with very careful prompting, and enough tries at the prompt, spit out chunks of a copyrighted text. This is not its purpose, and it's not trying to do this, but users can carefully engineer it to get this result if they try really hard. So no, it's not an obfuscated recreation of that copyrighted work.

Of course, if you read NYT's argument, they're also mad when it's incorrect about the text, or when it hallucinates articles that don't exist. Essentially they're mad that this technology exists at all.

vel0city
0 replies
1h48m

it just stores information about their likelihood of being used together

I mean this is still a link, no?

Like, sure, it is a probability. But if each of those probabilities is like 99.9999% likely to get you to a chain of outputs that verbatim reproduces the copyrighted text given the right prompt, isn't that still the same thing?

And yeah, it hallucinating that the NYT published an article stating something it didn't say is concerning as well. If the model started telling everyone Matticus_Rex is a criminal and committed all these crimes and started listing off hallucinated court cases and news articles proving such things that would be quite damaging to your reputation, wouldn't it? The model hallucinating the NYT publishing an article talking about how the moon landing was fake or something would be damaging to its reputation right?

And this idea that it takes "very careful prompting" is at odds with the examples from the suit and elsewhere. One example Ars Technica tried was "please provide me with the first paragraph of the carl zimmer article on the oldest DNA", which it reproduced verbatim. Is this really some kind of extremely well-crafted prompt that would rarely ever come up?

briansm
0 replies
3h4m

Sort of like the idea of practice: repetition of something dedicates more brain space to that thing, so its compression ratio can decrease and it becomes less abstracted / more exact.

DennisP
0 replies
3h24m

What seems a bit contradictory is that they're also suing because GPT hallucinates about NYTimes articles. So they're complaining that it reproduces articles exactly but also that it doesn't.

I_Am_Nous
2 replies
3h51m

I can understand an argument about the AI needing to know basic history. News is just how we report history in the making, but it's not generally accepted as solid until some time after the events when we can get more context.

Isn't this what the Associated Press is intended for, a stream of news trying to report just the facts and happenings of the day? That's quite a bit different than a NYT article intending to inform but also convince someone of a position of some sort.

Feeding an AI opinionated news compared to "just the facts, ma'am" seems risky from a bias perspective.

hanselot
1 replies
3h42m

Giving examples of bias is just as important IMO: give it the unbiased facts as well as the biased ones so it can generalise relative objectivity.

I_Am_Nous
0 replies
2h53m

I agree with you, but I also wonder how the bias could be trained without it affecting the output of the entire model. Weights can help but anything that's higher weighted is just "less wrong" as I understand it, so I can see a possibility where training to expose bias might let bias creep in somewhat more than anticipated.

_rm
9 replies
6h54m

If ChatGPT is based on neural networks, with no actual save-and-replicate facsimile behaviour, it no more "copies" original work than I do when I tell you about the news article I read today.

I'd say the only real reason the Pirate Bay links you mentioned are not the norm is that those media companies have done a better job of striking fear into people who do that, so it's gone more underground. I.e., they're better terrorists.

There's no fundamental, moral reason why Piratebay links being posted and raised to the top would be wrong.

octacat
6 replies
6h32m

So, if someone applies a filter to a video or audio file, is it no longer a "copy" of the original work? (No, it is still protected.) AI could still produce exact or extremely similar results from the material it learned on.

concordDance
4 replies
6h3m

AI could still produce exact or extremely similar results from the material it learned on.

Can it do so more than a human can?

I think that's the key here. If an AI is no more precise than a human telling you about the news article they read today, then ChatGPT's learning process probably can't morally be called copying.

octacat
3 replies
5h41m

So, if someone decompiles a program and compiles it again, it would look different. "It is not copying"; we just did some data laundering.

Feeding someone else's data into your system is usually a violation of copyright, even if you have a very "smart" system trying to transform and obfuscate the original data.

_rm
1 replies
3h16m

I'm regularly feeding other people's data into my "system" (brain) in order to produce my outputs.

So I'm a living breathing copyright violator. As a person I should be banned.

Fortunately, copyright is a bullshit fictitious right with no basis in natural law. So I don't lose much sleep over it.

octacat
0 replies
2h30m

Computers are deterministic: given the same inputs, training would produce the same model. The comparison with the brain is incorrect. You could add noise to the input data during training; that would more or less reproduce real learning. Still, it could produce less usable models as a result.

The court could ask them to show the training dataset.

Matticus_Rex
0 replies
3h16m

Feeding someone else's data into your system is usually a violation of copyright

In some circumstances, yes, but often it's not, especially if you're not continuing to store and use it (which OpenAI isn't).

Matticus_Rex
0 replies
3h46m

It's not analogous to a filter, because that's applied to the actual work. The model does not keep the work, so what it does isn't like applying a filter. It's more like being able to reproduce a version of the work from memory and what it learned from that work and others about the techniques involved in crafting it, e.g. art students doing reproductions.

And if OpenAI were selling the reproductions, that would be infringement. But that's not what's happening here. It's selling access to a system that can do countless things.

vel0city
0 replies
2h23m

it no more "copies" original work than I do when I tell you about the news article I read today

When you tell people about some news article you read earlier you repeat it exactly verbatim? You also give this out to potentially millions or hundreds of millions of people for commercial purposes?

kmeisthax
0 replies
2h10m

Copyright law does not care about the means of copying, just that you created something with substantial similarity to something you had access to. Whether or not the copy is in the form of a pixel array, blobs of random data being XORd to produce a full copy of music, or rows in a key/value attention matrix, doesn't matter.

Furthermore, there's Google research on extracting training set data from models. More specifically, Google found out that if you ask GPT to repeat the same word over and over again, forever, it eventually starts printing fully memorized training set data[0]. So it is memorizing stuff, even if it's not regurgitating it.

[0] When told of this, OpenAI's response was to block conversations with large amounts of repeated words in them.

Popeyes
8 replies
8h25m

Possibly because once an article is published the author receives no further payment. In all other mediums, there are residuals and royalties to be paid to the creators of the work.

manojlds
4 replies
8h23m

Add to that the fact that an NYT subscription is hard to cancel. People have an aversion to the NYT, even setting aside the bias.

hef19898
3 replies
7h36m

It took me all of 5 minutes to cancel my digital NYT subscription from the following month onward. No idea what you are talking about.

lupusreal
1 replies
7h18m

Why did it take you five minutes instead of twenty seconds? It should be as simple as clicking on the link to your profile then clicking unsubscribe, mere seconds not minutes.

Assuming you just said five minutes figuratively... Do you live in California or some other legal jurisdiction that forces them to play nice? Did you subscribe through some other company, like Apple?

Horror stories about unsubscribing from the NYTimes are easy to find in the archive if you search for it. They make you call and chat to a retention specialist on the phone. This should help you have an idea of what he's talking about: https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...

hef19898
0 replies
7h5m

An international one, and as straightforward as it could be: go to profile, go to manage subscription, cancel subscription, answer the question why if you want, confirm cancellation; done, with the end date depending on your subscription.

cruano
0 replies
6h11m

That's only been true for the past few months, and it's been very well documented how complicated the cancelation process used to be [0].

It's funny because I use PayPal for any unknown-to-me site where I don't want to give out my card, but the only site where I've needed their help to cancel something was the New York Times.

[0] https://www.nirandfar.com/cancel-new-york-times/

afavour
2 replies
8h16m

Articles have ads on them; how are those not residual payments based on views?

nulbyte
1 replies
6h19m

I believe GP was referring to payments to the writer, not the publisher.

Popeyes
0 replies
2h10m

Yes, although I get that the money may find its way back to the journalist as salary. But generally it goes into a pot for news gathering, from which the salary is drawn.

As for ads, it's acceptable to distribute ad-supported articles freely, and it is advantageous to the company. Can we also see good journalism as an ad for the quality of a broader product?

phpisthebest
7 replies
7h1m

Largely because "news" aka facts is not and should not be copyrightable, so while the style, and exact format of the article may be copyrightable, the facts contained within are not.

This makes a news story copyright murky in the eyes of wider society unlike a clearly 100% creative work like a TV Show or Movie.

Further, the news outlets themselves self-cannibalize: how many stories are just rewrites of stories from other outlets? Why is it OK for the Washington Post to copy the NY Times, but not OK for OpenAI or Archive.org?

gnz11
4 replies
5h5m

Creative works like books, TV shows and movies contain facts too.

phpisthebest
3 replies
4h57m

None of which are copyrightable, and in fact this has been the subject of DMCA abuse, like when a movie uses NASA footage and then claims copyright against YouTube videos with the same footage.

Copyright is a complex subject, and not as vast as many believe; at the same time, ironically, it is more vast than I believe it should be. Copyright should be much more limited than it is. That is at odds with people who believe copyright should be maximized.

Keep in mind that the commercial success of a work, author, or company is not why copyright exists. For the US, the only reason copyright can exist in our framework of law (i.e. the Constitution) is the promotion of the useful sciences. No other purpose for copyright would be constitutional under the US Constitution.

gnz11
2 replies
4h9m

Copyright doesn’t exist solely for the “promotion of useful sciences”. https://en.m.wikipedia.org/wiki/Copyright

phpisthebest
1 replies
3h9m

Citing Wikipedia, you've already failed.

That is a general article about copyright worldwide. I specifically stated US copyright, which is authorized by Article I, Section 8, Clause 8 of the United States Constitution [1], implicitly for the promotion of the useful sciences. That is where Congress derives its power to pass copyright laws and to enforce copyright on the people of the United States. No other purpose is authorized by the US Constitution.

[1] https://www.law.cornell.edu/wex/intellectual_property_clause

vel0city
0 replies
2h15m

You missed "and useful arts" in both of your comments. That's a key addition that you keep ommitting.

It is not just for sciences.

ks2048
1 replies
4h43m

why it is OK for the Washington Post to copy the NY times, but not ok for OpenAI or Archive.org?

If the Washington Post printed an article from the NY Times nearly verbatim and without attribution, it would not be OK and surely they would take legal action.

gnz11
0 replies
3h55m

Yes, because the NY Times holds copyright on the body of work. They are not copyrighting the "facts" themselves but the distillation of those facts into a body of work. Anyone is free to take the facts and produce their own work, but not to lift verbatim the body of work that the NY Times created (plagiarism).

cesarb
6 replies
6h10m

Whenever there is a story from NYT on HN (or any other large media outlet), the top comment is almost always a link to an archived version which reproduces the text verbatim. [...] And yet, whenever there is a submission about a book, a TV show, a movie, a video game, an album, a comic book, or any other form of IP, it is in fact very much _not_ the norm for the top-rated comment to be a Pirate Bay link.

If the story was linking directly to the "book, TV show, movie, video game, album, comic book, etc", and the link only worked for some people while others randomly got a login request or similar, you'd also see the top comment being a link to an archived version which avoids the login screen. That is: the main difference is that the archive link has the exact same content as the link submitted in the story, only bypassing the login screen that some people see. And the only reason the archive site has the content is that it didn't get the login screen; if everyone always got the login screen, what you would see on the archive site would be the same login screen.

infecto
2 replies
5h34m

I don't believe that is fully correct. The general policy here is that you cannot link to something that is paywalled unless the site plays the game of allowing crawlers but not actual human eyeballs. In the latter case, the link is allowable because there are ways around it that the site owners allow.

lagniappe
1 replies
5h28m

I don't recall seeing this policy on HN guidelines.

couchand
0 replies
5h4m

It's on the FAQ https://news.ycombinator.com/newsfaq.html

Are paywalls ok?

It's ok to post stories from sites with paywalls that have workarounds.

some1else
0 replies
5h43m

newyorkgritty
0 replies
2h29m

Much of this is incorrect

the archive link has the exact same content as the link submitted

No, articles are updated as new information comes in, retractions are made, etc. Especially breaking news (the type that would reach the top of HN). The archived versions are outdated.

others randomly got a login request

It's not random, you get a number of free articles before the paywall appears ("soft" paywall).

The paywall is removed entirely for some topics/stories, especially matters of public health (common during the pandemic).

the only reason the archive site has the content is that it didn't get the login screen

No, it's because they don't block archive crawlers, and prefer people bypassing the paywall and reading news at NYT. Hopefully users find the content valuable, and some of them subscribe as a result.

(opinions are my own)

melenaboija
0 replies
2h24m

So, what permits illegally accessing IP-protected content is not liking the content owner's marketing strategy?

Erratic6576
5 replies
8h53m

I find “4nn4’$ 4rch1v3 dot ORG” actually way better than pirate bay for pirating knowledge.

It's amazing the number of books that copyright laws prevent us from finding.

https://www.theatlantic.com/technology/archive/2012/03/the-m...

munchinator
4 replies
8h40m

Sure. It's just curious to me that news articles have a pirated-knowledge link as the de facto top comment, but link submissions to, for example, books for sale on Amazon don't have a link to Anna's Archive or equivalent.

Txmm
3 replies
8h28m

I think archiving an article is more about preserving history and maintaining records of events, which often disappear if not archived. The number of threads referencing defunct articles is always increasing. A book or movie or other original content, on the other hand, will continue to hold its own commercial value, so reproducing it is more akin to an actual loss for the license holder.

Definitely a grey area when that content is then used to train models though.

Baldbvrhunter
2 replies
7h58m

I would say 9 times out of 10 it's to get around the paywall and absolutely not some higher moralistic preservation of history.

And everything is a grey area; determining the line is the existential purpose of these court cases.

We've been here before with hyperlinking, then indexing, then linking with previews, and the Canadian Facebook stuff, but I think this one has more standing.

nulbyte
1 replies
6h26m

If I buy a book, I get a work of literature. But if I buy a news subscription I get a series of facts riddled with advertisements. I accept the former, but I oppose the latter. I suspect I'm not the only one.

Baldbvrhunter
0 replies
5h38m

I don't fully understand what you're opposing.

Is it:

1) that you paid for news

2) that it included ads

Both are just the price you pay. There are various state news outlets that you're probably already paying for (NPR, PBS, BBC, CBC), depending on your region.

unyttigfjelltol
4 replies
7h1m

Historically newspapers leaned more on competition law than copyright, because their pages are supposed to be filled with non-copyrightable facts.[1] Copying part, but not all, of a factual article, significantly after the relevant event, was considered to be a promotion (not unfair competition) and a nice thing to do for the journalists. Things change, people lose sight of the original principles.

[1] https://en.m.wikipedia.org/wiki/International_News_Service_v...

vel0city
1 replies
2h37m

their pages are supposed to be filled with non-copyrightable facts

This is rather inaccurate. A fact is "Hitler invades Poland." You're right, nobody can copyright that idea, as it is just a fact.

However, if I then write a 500-word article describing the scene of Hitler invading Poland, with short quotes from some civilians there, etc., that particular arrangement of ideas and words is copyrighted.

AP can't go and sue INS just for reporting the fact that Hitler invaded Poland, but if INS takes a whole article word for word and reproduces it, that's still a violation of copyright. The actual printed words of the news were always under copyright.

The WSJ can't claim copyright on the markets going up yesterday. They can claim copyright on something like "After the bell rang in the NYSE, the tech industry ticked up 1.2% over last week. Meanwhile the whatever market took a hit of -0.5% ending the quarter slightly lower than our analysis expected. Blah blah blah..." If Investor's Business Daily wrote a different article that also talked about the markets ending up at the end of the day, that's not a violation of copyright. If they literally write "After the bell rang in the NYSE, the tech industry ticked up..." then they're violating WSJ's copyright. This was true before and after International News Service v Associated Press.

nsagent
0 replies
2h17m

Yes, the prose was always under copyright, but the key point for the case linked in the wikipedia article is:

INS members would rewrite the news and publish it as their own without attribution to AP.

So the case hinged on INS indeed reporting facts that differed in exposition.

nsagent
0 replies
3h52m

These days most news is mixed with analysis [1] (which is often biased). I wonder if part of the reason for this shift is that analysis is copyrightable. It also seems like the number of opinion articles is ever expanding [2], though I don't have any hard numbers on that.

[1]: https://guides.library.cornell.edu/evaluate_news/source_bias

[2]: https://www.newsmediaalliance.org/rise-of-opinion-section/ Interestingly there's a banner at the top of that link touting an agreement between Axel Springer and OpenAI.

EDIT: formatting

guipsp
0 replies
2h52m

Even the facts are not copyrightable; the prose is.

initplus
4 replies
6h26m

I would be "happier" to pay a subscription to an aggregation platforms like hackernews or reddit to access archived articles that are linked to these sites. In turn a proportion of that could be passed on to the underlying publishers that I actually visit. I have nearly zero interest in reading articles that aren't linked to from an aggregation site.

I don't want to read theguardian.com, or nytimes.com, or washingtonpost.com, or bloomberg.com, I want to read news.ycombinator.com. Paying an individual subscription to every possible underlying site that could be linked to from news.ycombinator.com is a non-starter.

iudqnolq
2 replies
6h24m

This is a common statement, but every attempt to sell that service has been a dismal failure. See for example Blendle.

jacquesm
0 replies
6h0m

Blendle failed because they went into competition with the papers whose content they reproduced.

initplus
0 replies
6h5m

Nearly every attempt at starting a new aggregation site like Hacker News or Reddit has been a failure.

I'm not going to switch to a new website where no community exists just so I can pay for news articles. To work, it needs to be integrated into an existing, successful aggregation website.

nithril
0 replies
5h25m

I would be happier to pay a small fee per article I want to read. But the norm seems to be a monthly subscription.

ekianjo
4 replies
6h36m

, a movie, a video game, an album, a comic book, or any other form of IP, it is in fact very much _not_ the norm for the top-rated comment to be a Pirate Bay link.

Probably because most print media is garbage and nobody in their right mind would actually pay to read them

sumedh
1 replies
5h57m

Probably because most print media is garbage and nobody in their right mind would actually pay to read them

The NYT's revenue keeps growing, though.

namlem
0 replies
4h50m

Not from newspaper sales

davedx
1 replies
6h27m

I don't understand the downvotes - it's an extremely valid opinion. If people ask questions like that then they should be able to accept forthright answers?

(It's the same reason for me. I have tried news site subs but eventually got so tired of the polemic that I cancelled. I won't sub again).

iudqnolq
0 replies
6h22m

The obvious response is that if you don't like news and think it has no value then you don't have to read it.

quickthrower2
3 replies
8h10m

It is an ethical grey area, but if the paywall applied to all user agents, which would make it similar to, say, buying a Kindle book, then you might see it as pirating; whereas if you use an archive service that was served the HTTP response and cached it, then you are using a proxy UA.

If the news site/magazine doesn't want this, they can simply serve a cut-down or zero-length article to all non-paying viewers! But they want that SEO, and they want that marketing.
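
(A minimal sketch of that pattern, my own illustration rather than any outlet's actual code; the crawler names are just examples:)

    # Soft paywall via user-agent sniffing: crawlers get the article
    # (for SEO and link previews), human readers get the login wall.
    def serve(user_agent: str) -> str:
        crawlers = ("googlebot", "archive.org_bot")  # example allowlist
        if any(bot in user_agent.lower() for bot in crawlers):
            return "<full article text>"
        return "<subscribe / log in page>"

    print(serve("Mozilla/5.0 (compatible; Googlebot/2.1)"))  # article
    print(serve("Mozilla/5.0 (Windows NT 10.0)"))            # paywall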

Yizahi
1 replies
7h38m

We can extend this analogy. What if someone put up a proxy, that has a legal Netflix subscription and which "watches" streams of Netflix shows, captures actual RGB values of pixels and re-streams the resulting video to anyone else? Isn't it the same "proxy" excuse?

quickthrower2
0 replies
6h54m

I would say no, because the site was happy to serve the content publicly, whereas your proxy is breaking a contractual agreement. Now we get into the terms of service of a website, which you agree to even if you visit for free. Which is a possible point; it is quite grey, IMO. In terms of HN, I reckon a magazine would love the free brand recognition versus the archive link not being shared at all. Where it hurts them is when someone avoids paying for a subscription by continually using archive sites.

EvgeniyZh
0 replies
8h6m

Indeed, there are media outlets that are hard-paywalled, e.g. The Information. However, these are prohibited on HN, which possibly creates an additional bias towards non-hard-paywalled publications.

lexicality
3 replies
7h30m

Funny, I don't see it as a moral thing but more a "what can you get away with" thing.

I fully assume that if I was to post a magnet link to a torrent for whatever the link was about, I would be banned.

Morally speaking, I think it's perfectly reasonable to download a copy of something and either read the relevant info for my current task or to sample it to decide if I want to buy it. I see it no different to using the library or browsing at a book store.

Perhaps once news organisations can work out how to effectively wield the DMCA hammer against archive links we'll see the practice of posting them stop.

midasuni
1 replies
5h58m

So downloading a movie from piratebay is no different to using the library?

kolinko
0 replies
5h44m

In some jurisdictions (Poland, possibly the whole of the EU), downloading any kind of material, be it movies, books, or music, is legal. Uploading/sharing, if not between friends and family members, not so much.

mistercow
0 replies
2h57m

I’d argue that morality always has a “what can you get away with” component. Things that are normalized tend to be seen as morally permissible, and things that are seen as abnormal are more likely to be seen as immoral.

The problem with the thinking in the root comment is that it implicitly assumes that people’s behavior is morally consistent, or that they even try particularly hard to behave in a morally consistent way. That’s not really how people work. If you ask them to discuss morality in the abstract, they’ll try to come up with a consistent system. But their actual behavior is mostly dictated by social norms. And if you try to pin them down on the morality of their concrete actions, they’re more likely to stretch their moral system to accommodate their actions than the other way around.

None of this is to say anything about my own opinions on news sharing or OpenAI’s situation. It’s just that someone decrying piracy but also posting/sharing/upvoting links to copies of news articles is neither surprising, nor indicative of some deeper nuance to how people view morality around IP.

edude03
3 replies
3h37m

I think the intent is really different.

For LLMs you're essentially teaching them language by showing them lots of examples of written language - newspapers are of course a great example of written language.

The goal of OpenAI is not to reproduce newspaper articles verbatim when asked questions (even if the answer could be a newspaper article) and the fact that it can happen is a side effect of how LLMs work.

When a HN participant shares a (paywalled) link to a NYT article, I do want to read the exact article linked, verbatim, because while the facts of the article may be reproduced elsewhere in a form that's free, specific word choices or whatever might be a focal point of the discussion on HN, and therefore I can't realistically participate in a discussion without having read the article being discussed.

And as an aside, I have no problem with paying to read news, or whatever media, however it's impractical for me to subscribe to every news source HN participants link to, and therefore I gravitate to archiving services instead. I do wish there was a better solution - for example Blendle with more sources.

rickydroll
0 replies
3h26m

The goal of OpenAI is not to reproduce newspaper articles verbatim when asked questions (even if the answer could be a newspaper article), and the fact that it can happen is a side effect of how LLMs work.

This is an excellent point. A properly functioning LLM should not return the original content it was trained on. When one returns original content, I believe the prompt was tightly constrained and designed to extract or re-create it. Another possibility that occurred to me recently is that the training set is too small, so even more general prompts re-create source material.

Another question would be, are LLMs regurgitating what they were trained on, or are they synthesizing something very close to the original content? (Infinite Monkeys, Shakespeare). Court cases like this increase the need for understanding the "thinking processes" in an LLM.

mark_l_watson
0 replies
2h40m

Maybe LLMs should follow best practices from 1980s-style backprop models and later deep learning models: starve model size to force maximum generalization and minimal remembering.

adolph
0 replies
3h3m

The goal of OpenAI is not to reproduce newspaper articles verbatim when asked questions (even if the answer could be a newspaper article) and the fact that it can happen is a side effect of how LLMs work.

Seems like a nice split-the-baby resolution would be to pay the NYT the price of a single article read any time GPT plagiarizes more than what's allowed at an academic institution.

cactusplant7374
3 replies
8h40m

If NYT was a HN startup the link to the archived version would be banned and dang would be slamming the ban hammer.

1f60c
1 replies
8h28m

Please don't post baseless accusations. I think dang has said that he tries to moderate less, not more, when YC companies are involved. (Although it's impossible to say what he would do in this situation.)

cactusplant7374
0 replies
6h27m

HN is currently facilitating piracy. Something your comment failed to address.

quickthrower2
0 replies
8h8m

Like I said in another comment, it is simpler than that. They just serve the login page/payment page to all HTTP requests. If they do that, then the submission itself likely gets flagged, as there is no workaround (just like if I submit my blog with a banner saying "hey you pay me $1 to read my cool post").

ralfd
2 replies
8h53m

That is an apples-to-oranges comparison. An article about a video/book would have the relevant information in text form without needing to show the video ("here is the new stuff shown in Apple's two-hour-long WWDC keynote"). If not, it is common for a comment in the discussion to give a summary as a tl;dr.

With text articles behind paywalls the relevant information is hidden and only hinted at as a teaser.

munchinator
1 replies
8h47m

To make it an apples to apples comparison, look at submissions where the link submitted is the retail link to the IP. For example, look at all the book link submissions on AMZN...

https://news.ycombinator.com/from?site=amazon.com

None of these have the Pirate Bay or Library Genesis or Anna's Archive or the equivalent as the top comment.

Compare that to...

https://news.ycombinator.com/from?site=nytimes.com

And almost all of these have an archived version as the top comment.

tmhrtly
0 replies
8h34m

I wonder if this is because the purpose of linking to a book is to share awareness of that book's existence - nobody is about to go and read it then and there to comment on its contents. Whereas the purpose of an article is to discuss it now, in the comments - the consumption horizon and the bulk of the content are different.

raldi
2 replies
3h3m

I would broaden the question beyond HN to society as a whole.

In 1990 it would have been considered normal and appropriate to clip an article out of a newspaper and post it on a communal corkboard. What are the key differences between that form of IP and others, and that analogy and the present situation of HN allowing archive links?

layer8
1 replies
2h54m

Reach, and ease of distribution.

raldi
0 replies
1h23m

Makes sense. If you mail a friend a clipping, or post it on the corkboard, only so many people are going to see it, but then even though posting the "clipping" to HN may feel like the same thing, it's hard to appreciate the massive change in scale.

As for ease of distribution, that might address OP's original question: It's easy to make and click an archive link, but it's a lot more effort to make or find a Pirate Bay link to another form of media, and for someone else to download and view it.

perihelions
2 replies
8h22m

If it takes 120 seconds to read a newspaper article, the archive.is workflow adds significant overhead and friction on top of that. Those links are a courtesy to other HN readers. This is very different from the economics of buying and reading a book.

"Piracy is almost always a service problem and not a pricing problem."

edit: It didn't even occur to me to compare the time-cost of "just pay for the article", but: last I read, it's half an hour of work to cancel a New York Times subscription [0]. So, that option's not even on the table.

[0] https://news.ycombinator.com/item?id=26174269 ("Before buying a NYT subscription, here's what it'll take to cancel it", 812 comments)

eropple
1 replies
4h20m

> edit: It didn't even occur to me to compare the time-cost of "just pay for the article", but: last I read, it's half an hour of work to cancel a New York Times subscription [0]. So, that option's not even on the table.

I canceled mine two weeks ago. It was four clicks. One of them annoyed me because they tried to get me to stay with an offer, but I wasn't dropping them because of the price.

dillydogg
0 replies
4h11m

Same experience here, it was effortless. But it is enough to justify stealing from those journalists, it seems.

jtc331
2 replies
3h59m

A book, TV show, movie, video game, album, or comic book is not available on the internet served by the copyright holder’s own servers with no authentication or authorization checks. But the NYT is available in that way.

CamelCaseName
1 replies
3h58m

But some are? I believe The Atlantic and The Economist are hard paywalled.

cesarb
0 replies
1h45m

If they're hard paywalled (everyone gets the same login prompt), they won't be available on archive sites.

caeril
2 replies
6h43m

Oh it's worse than that. The NYT is positing that any neural network that is trained on their data, and can summarize or very closely approximate an article's content on request, is in violation.

This reasoning would presumably apply to any neural network, including one made of neurons, dendrites, and axons. So any human reader of the NYT who is capable of accurately summarizing what they read is an evil copyright violator, and must be "deleted".

Effectively, the NYT legal department is setting the stage for mass murder.

cycomanic
0 replies
5h36m

Hyperbole much? There is a difference between a computer and a person. I'm not aware that people generally can be enticed to reproduce full articles verbatim just through questioning.

ako
0 replies
5h26m

As far as I know schools have to pay for the newspaper articles they use in class to educate students. Training an AI seems similar.

Here’s a service for the UK providing paid access to copyrighted materials to schools: https://www.nlamediaaccess.com/newspapers-for-schools/

breck
2 replies
3h17m

why we feel it's OK to pirate news articles, but not other IP

Who thinks this? I don't. I think copyright is wrong across the board. I would love if the same pattern of posting archive'd articles held for books, movies, et cetera.

I would love to change my mind on this, as it is a very unpopular opinion to have. But I have _never_ seen a morally or scientifically sound argument in favor of copyright law, and I've spent decades looking.

I think it subsidizes the creation of junk food content (superhero movies and clickbait news, for example) while not contributing anything to the progress of science (paywalled scientific journals and textbooks). I shudder at how much time I have wasted in my life consuming crap attention-grabbing media and advertisements. I like to think if we lived in a world where everyone could be a publisher if they wanted to, the quality filters would be better, and information reaching us all would be more likely to be in our best interests.

metabagel
1 replies
3h0m

You can self-publish. Oh, you want to be able to publish other people’s work, and without their permission? How does that benefit the author?

breck
0 replies
1h37m

How does that benefit the author?

You speak of "the author". But the current system does not benefit "the author". 1% of authors profit off copyright. 99% lose money on copyright (they pay more for copyrighted media than they earn from it).

Your question should be "How does that benefit monopolist authors"?

I agree, my idea would not benefit monopolist authors. They would lose the bulk of their revenue stream.

But it would benefit the average author whose cost of living would fall and information would start serving them more than serving business.

I am not downplaying the talent and hard work of successful monopolist authors. But I do not think the works they create are worth everyone giving up their rights to reshare and remix information. I believe the world would look very different post-IP. You'd probably have a new profession--small independent librarians (similar to data hoarders today)--who would help their local communities maximize the value they got from humanity's best information.

Maybe I'm wrong! Maybe the information ecosystem is better controlled and the genetic differences of monopolist authors are so stark that without the subsidies to this gifted class we'd all be worse off. But that's an argument based on outcomes and not principles.

without their permission

The oxygen I'm breathing right now was mostly created by trees on land owned by others. But I don't ask for their permission to breathe. Some things are just not natural.

I am not saying plagiarize. It is always the right thing to do to link back and/or credit the source. But needing to ask permission to republish something seems to go against natural laws.

u32480932048
1 replies
3h18m

As a supporter of piracy in the general case, I tend to agree with your observations, including that pirating NYT (FT, NPR, ...) articles somehow feels like a different class of offense than, say, stealing a movie or an mp3.

(Books, to me, are separate still, in that I like to have a physical copy (and generally see the authors as humans who deserve compensation, rather than mega-orgs that deserve eternal torment), so I'll frequently use the digital copy as a kind of preview, then purchase it once I see it's a good book I want to read.)

I've only been reflecting on this difference for a few minutes, but, to me, I think the major difference boils down to:

  1. Netflix series (movies, albums, etc) are non-essential, fictional works that take a long time to produce - think: fancy chocolates and caviar.
  2. News, generally, contains timely, important information - more meat and potatoes.
  3. While much of the super-critical news is not paywalled (e.g., product recalls, election dates, COVID stats, etc), a lot of information that is advantageous to know (discussions on interest rates, details on legislation, etc) is paywalled, compounding information asymmetries.
Sure, "stealing bad", but, IMO, someone stealing rice and beans from WalMart to feed their family is a different class of offense than someone robbing a boutique bakery because they can't get enough chocolate cake.

observationist
0 replies
2h56m

First and foremost, and please repeat after me: Copying is not stealing.

You're not depriving anyone of anything. Unauthorized copying is not theft. There's no equivalency. You can't copy and paste a cake. If you take a cake from a bakery, you're depriving the bakery of a thing. If you take a picture of the bakery's trademarked sign, copy the copyrighted text from its website, and print them out, you haven't stolen anything. Nobody has lost anything. Nothing was damaged. No person, place, or thing was harmed.

Current copyright law is offensively absurd. Patenting of software, effectively eternal content copyrights, ridiculously broken DMCA, music publishers taking 99 cents of every artist's dollar, and so on and so forth.

If you support the dissolution of archaic institutions and broken laws favoring those with entrenched wealth over individual rights, you support piracy.

There is a legitimate case for laws respecting and protecting intellectual property rights. Such laws do not currently exist. These laws do not deserve to be followed or respected, and should be broken as a matter of course. Civil disobedience is called for. Refuse to participate in an exploitative market immovably entrenched in governments all over the world. Pay artists directly and commensurately if you feel they've brought value to your life. Copy whatever you want. Share those copies with whomever you want. Nobody gets hurt. Only conglomerates of already wealthy individuals and corporations are "deprived" of the potential transaction with you that they feel they are entitled to, as a matter of course.

The NYT is just as complicit as any other legacy media institution in the enshittification of journalism and laying waste to the potential value of their content. The "Gray Lady" is not a person, or a valuable institution. It's a soulless corporate construct not deserving of our empathy or high regard simply because of the reputation of human individuals who previously produced quality content. Stop pretending these institutions serve some higher purpose than to fatten the wallets of shareholders.

The good journalists have left. The ones left behind are naive, or are desperately clinging to an illusion of legacy and institutional legitimacy that no longer exists.

All that is left for these media dinosaurs is to leech off the success of others, to use their reserves of wealth and influence to arbitrarily insert themselves into the market, with no regard to the fact that they no longer have value or prestige or purpose in the context of modern technology and communication.

Anyway. Copying isn't theft. Don't give them the linguistic territory. Call a spade a spade, and media companies the desperate corporate leeches that they are.

ks2048
1 replies
4h49m

but what if they were also scraping, for example, Netflix content to use as part of their training set?

There were some tweets the other day about how Midjourney could be prompted to almost-exactly reproduce some frames of the film Dune. It wouldn't be shocking if these companies were using large databases of movies, with questionable legal status.

j-bos
0 replies
4h27m

I see this a lot, and they very well may be. But watch any behind-the-scenes documentary about any artsy movie and, 9 times out of 10, the directors will be waxing poetic about their inspirations, often including older movies or paintings which have uncannily similar scenes/frames. So it also wouldn't be shocking if a model trained on the same inspirations as the filmmakers generates frames almost exactly matching the movie's.

iinnPP
1 replies
7h39m

The archive link doesn't threaten their jobs and helps them avoid paying for NYT. It's NIMBY, or rather its true form, NIIIM (Not If It Impacts Me).

Hypocrites are EVERYWHERE and are the majority.

bnralt
0 replies
4h55m

It is pretty funny. If you go back and read the comments made yesterday about ChatGPT doing something much milder (using old articles as training data; some prompts used to allow you to reproduce some of the articles, though they no longer work), you see a lot of comments talking about how The New York Times needs money and OpenAI is using their work without paying for it.

Now a comment points out that HN (and most of the internet) routinely does something much worse - lets people bypass the paywall and read brand-new articles in their entirety without paying - and almost all the comments are about how it's the New York Times' fault for making it difficult to cancel a subscription, the importance of news being available to everyone, the problems with copyright laws, etc.

bnralt
1 replies
5h8m

This tendency at Hacker News is also much more of a threat to The New York Times than what OpenAI is doing. The same goes for blog/Reddit/social media submissions that summarize the article and post the relevant quotes. Unlike the summary of a movie, summarizing all of the relevant parts of a news article extracts almost all the value from it, and gives it away for free.

And the vast majority of people read news for its breaking content, not for its archived content from years before (and I say this as someone who has often recommended the latter, but has gotten very few people to do so). So giving people that free breaking content (either in its entirety like on Hacker News, or summaries like you see all over social media) is direct competition to the news business in a way that training an LLM on an article from months/years back isn't.

skybrian
0 replies
4h28m

Yes, and for nonfiction, it's also true that it usually depends on the original article for credibility. (If it were an anonymous poster making up a news story, most people wouldn't believe it.)

billywhizz
1 replies
5h0m

there's quite a big difference between "pirating" digital content and making it available to anyone for free, and taking that content and building a for-profit service on top of it, which is what OpenAI are doing, no?

pcmaffey
0 replies
2h26m

I was just going to post this. Seems quite an obvious and significant distinction that doesn't need to provoke all the existential hand-wringing. Making money off someone else's content is a totally different moral and legal case.

zzzeek
0 replies
2h47m

it's different reading an NYT article on an archive site vs. putting copies of it at the core of your $100B for-profit content delivery enterprise.

wilsynet
0 replies
4h36m

The NYT and other newspapers don't go after the archive link providers. Probably because the newspapers' scholarly mission includes things like preservation. But they also have a profit motive, or they can't stay in business.

This implicit permission for the archive links to exist, gives some of us the implicit permission to pirate the content.

Disclaimer: I am a happy subscriber to the NYT (and other digital newspapers).

tw1984
0 replies
5h0m

why we feel it's OK to pirate news articles, but not other IP.

Because those who own & produce such news articles asked for them to be treated differently. People listened and accepted their requests.

When you make a TV show or a video game, the Geneva Conventions and a long list of other international treaties don't give you any protection for anything beyond the content you are producing. The same can't be said when you are producing news.

throwaway22032
0 replies
7h39m

Blocking ads and avoiding payment are two different things.

seydor
0 replies
7h25m

it's also audacious how these news companies reproduce stories from social media and other electronic media - facts that are, like, freely available in nature. Or how they get embargoes and exclusive access to government information, as if they are some kind of information-bouncer.

rich_sasha
0 replies
4h4m

Not quite what parent means, but an interesting angle is: what if you scraped ChatGPT instead?

NYT, or someone's blog? Meh, fair use, and if you say no, you're in the way of progress.

But if you wanted to scrape ChatGPT answers to tweak your network, uh oh, violation of T&C!

orbisvicis
0 replies
3h40m

If I can't read about it, it didn't happen.

octacat
0 replies
6h28m

At least people do not obscure who the original author of the content is (so, if people like NYT articles, they can go and subscribe for more). Kinda "free advertising" (which still hurts the publisher in many cases, though). Same with search engines - as long as the engine brings clicks, people are happy. If a search engine just grabs the info and never redirects the user to the site, what is the point of the site existing to begin with?

mlindner
0 replies
6h38m

At least in the US, copyright violation is a civil matter; it's handled by lawsuits. If the violation is so small that it's not worth the copyright owner's time to do anything about it, then nothing's done. In this case it's worth a massive amount of money.

maxboone
0 replies
6h57m

Probably because the contents are what's posted, i.e. if someone posted a link to an interesting video behind a paywall/login and there was an easy mirror available, that'd be posted too.

If I could just buy one article for the price of a coffee without entering a bunch of PII or going through a time-wasting process, I would agree on the moral equivalence between the examples.

lupusreal
0 replies
7h26m

This is only an interesting juxtaposition if you have fully internalized and accepted the myth of people and corporations being interchangeable.

kjkjadksj
0 replies
2h9m

Because historically this is how news was shared. People would pick up a paper in a grocery store or cafe, read some of it, and leave it behind. They might rip out a page and take it home. Only one person paid, and tens or hundreds read for free. This idea of sharing the story with non-subscribers is as old as printed news itself. Instead, news agencies prefer we forget that aspect of history and insist on being the "paper of record" while charging more money for easier-to-distribute media that gets sold globally. Yes, I think we are certainly not in the wrong here when we read the news for free.

jasoneckert
0 replies
3h8m

I believe the reason many of us tolerate links to news articles and other content is because we believe in equality when it comes to information access. In other words, many of us believe that those who cannot afford a subscription to a paywalled site should still be able to read the articles, in much the same way public libraries allow those who cannot afford to purchase a book the ability to read it.

However, this doesn't apply to organizations that freely share copyrighted information while making money in the process, or to organizations that share copyrighted information in a way that specifically disadvantages or does harm to the original creator of that information.

elpocko
0 replies
6h20m

Good observation. I now wanna start commenting with pirate links to other media, but HN would tear me to shreds real quick I guess.

detourdog
0 replies
4h31m

The difference is that an individual pirating news is simply reading the article. OpenAI intends to digest news articles to the point of packaging them and reselling.

My uncle used to distribute daily newspapers and his saying was "News ages like a fish".

OpenAI is allegedly using NYTimes articles to train a computer and sell its services. I see different use scenarios.

I guess another way to look at it is that a human just reads the pirated material, while a computer makes a verbatim copy, analyzes it to the point of mimicry, and sells fuzzy versions.

davedx
0 replies
6h37m

I pay for multiple streaming services because I get a decent amount of value from their content.

I do not pay for any news websites because I read very little of what they produce, and it tends to pop up more on aggregator sites like HN than me actually going to them.

I actually did have a subscription to The Telegraph for a few months at one point because initially I wanted to read a full article (without cheating). But eventually I cancelled because so much of it is polemic trash.

That's my justification: I pay for things that have value to me.

cwmma
0 replies
2h27m

I think one of the key differences is something pointed out in the article, in that what OpenAI is doing is a substitute for reading The New York Times and possibly a rival to it.

On the other hand, having an archive link to a Times article in order to discuss it is not really a substitute for a Times subscription, as a newspaper has to walk a line of letting some of its articles be read while requiring payment for others (the Times actually allows you to create a "gift link" to do exactly what the archive links do).

chmod775
0 replies
3h34m

I think that's something worth reflecting on, about why we feel it's OK to pirate news articles, but not other IP.

A lot of that is going to stem from the fact that respect for "journalism" is pretty low. More than 99% of news articles are copies of the <1% of original work that happens in that field. In news, everyone is already lifting content from everyone else.

cantSpellSober
0 replies
2h58m

It's not just tolerated, it's encouraged because "the alternatives suck worse"

https://news.ycombinator.com/item?id=23735026

Even talking about it will get you scolded for talking about something "off topic"

bitlax
0 replies
4h4m

Because I'm not interested in the medium itself, as I would be with a Netflix show; I'm not even interested really in the article or the New York Times as an institution. I'm interested in discussing the supposed real-life phenomenon being covered, and the posted content is the primer for that discussion. I think if you get rid of the archive links on HN you need to ban the paywalled content as well. If you want to discuss paywalled content I'm sure you can do that in the article's comment section.

batch12
0 replies
4h30m

I believe it's tolerated here based on the site guidelines. I have always thought this was the case because otherwise these posts would all be pay-to-play, which would limit who could participate and turn HN into more of a subscription farm. Maybe the way to make everyone feel OK about it is to disallow links to paywalled content.

anonfromsomewhe
0 replies
6h18m

It's similar to how easy it is to subscribe to the NY Times and then how hard it is to unsubscribe. They require extra steps, and it's well known. So they get what they deserve? Do you see the point? They are lie spreaders, nothing else.

Zenst
0 replies
3h57m

We are also happy to use open source, yet what open source alternatives are there for news that don't get shot down by the media or besmirched?

StanislavPetrov
0 replies
4h5m

There are two fundamental differences.

First, OpenAI is the one doing the pirating here. Hacker News is just the host; it isn't doing any pirating or posting any archive links to the copyrighted information itself.

Second, OpenAI charges subscription fees and profits off of the copyrighted material it has pirated, whereas Hacker News does not, nor do the people who post the links.

FrustratedMonky
0 replies
3h36m

Is this really copyright?

Or is it "you can't talk to someone about an article they read".

This is really saying you can't call up your buddy and have them tell you a summary of what they just read. Maybe my buddy has a good memory and some of the text is actually nearly duplicate. But I wouldn't know because I didn't read the original, I just asked for a summary from someone else that read it.

Alex3917
0 replies
5h10m

why we feel it's OK to pirate news articles, but not other IP.

Once the NYT pays reparations for the Iraq war, I'll be the first to stop pirating it.

rich_sasha
99 replies
10h8m

If you forget about the LLM aspect, and simply build a product out of (legally) scraped NYT articles, is that fair use?

Let's say I host these, offer some indexing on it, and rewrite articles. Something like, summarise all articles on US-UK relationships over past 5 years. I charge money for it, and all I pay NYT is a monthly subscription fee. To keep things simple, let's say I never regurgitate chunks of verbatim NYT articles, maybe quite short snippets.

Is that fair use? IANAL, but doesn't sound like it. Typically I can't take a personal "tier" of a product and charge 3rd parties for derivatives of it. Say like VS Code.

A sibling comment mentions search engines. I think there's a big difference. A search engine doesn't replace the source, not at all. Rather it points me at it, and offers me the opportunity to pay for the article. Whereas either this or an LLM uses NYT content as an alternative to actually paying for an NYT subscription.

But then what do I know...

heavyset_go
30 replies
9h55m

Another factor to consider is that neural nets can function as lossy compression, which becomes extremely evident when using models that are overfit.

Sometimes they're so overfit that the compression isn't even lossy, and the data is encoded verbatim in the NN.
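
To see how literal that can get, here is a minimal sketch (assuming PyTorch; the text, layer sizes, and step count are arbitrary) that deliberately overfits a tiny model until the training text can be read straight back out of the weights:

    import torch
    import torch.nn as nn

    text = "All the News That's Fit to Print"
    vocab = sorted(set(text))
    stoi = {c: i for i, c in enumerate(vocab)}

    # Map each position index straight to a character: pure memorization.
    model = nn.Sequential(nn.Embedding(len(text), 64), nn.Linear(64, len(vocab)))
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    xs = torch.arange(len(text))
    ys = torch.tensor([stoi[c] for c in text])

    for _ in range(500):
        opt.zero_grad()
        nn.functional.cross_entropy(model(xs), ys).backward()
        opt.step()

    # Reconstruct the training text from the weights alone.
    recovered = "".join(vocab[i] for i in model(xs).argmax(dim=1).tolist())
    assert recovered == text  # lossless recall: the "compression" isn't lossy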

TeMPOraL
28 replies
9h47m

Yes, but this then hits against learning/understanding and compression being fundamentally the same thing. I can't think of a better way to argue in favor of "it's fine if human does it, therefore it's fine if LLM does it", than from the "lossy compression" angle.

heavyset_go
12 replies
9h18m

It's not okay for a human to pirate, plagiarize, violate IP rights and laws, etc.

But I disagree with the underlying assumption that you can anthropomorphize LLMs. Gradient descent and backpropagation don't take place in the brain. LLMs "learn" in the same way that Excel sheets "learn".

Humans are living beings with needs and rights. A person being able to legally squat in a home doesn't mean that a drone occupying property for some amount of time also has squatter's rights, even though you could easily and affordably automate and scale the deployment of drones to live and hide away on properties long enough to attain rights regarding properties all over the country.

pas
6 replies
8h56m

sure, but if I use an LLM to write a novel/article, I can be sued in civil court not the LLM.

but, more importantly, OpenAI can also be sued for tortious interference? (basically the civil equivalent of accessory)

heavyset_go
2 replies
6h59m

Whoever operates the LLM, in this case OpenAI, engaged in copyright infringement through the unauthorized modification, reproduction and distribution of content to you.

Robotbeat
1 replies
3h8m

The person doing the requesting did.

heavyset_go
0 replies
2h41m

That's not how interactive computer services work.

ben_w
2 replies
3h48m

sure, but if I use an LLM to write a novel/article, I can be sued in civil court not the LLM

That's a function of the legal system, not of the technology. If tomorrow someone made a perfect dolphin-Esperanto translator and proved dolphins were as smart as humans, you still couldn't sue a dolphin until the legal system says so.

darkerside
1 replies
1h18m

Wouldn't you find out by suing the dolphin and seeing if it holds up in court?

JohnFen
0 replies
23m

Not if you were smart, unless you have some sort of solid argument for why the established case law about this sort of thing is faulty.

vanviegen
2 replies
8h44m

Gradient descent and backpropagation don't take place in the brain.

Not exactly, no, but the 'neurons that fire together wire together' way of learning has a pretty similar effect.

LLMs "learn" in the same way that Excel sheets "learn".

I've never seen an excel sheet do anything like backpropagation.

zimpenfish
1 replies
7h51m

I've never seen an excel sheet do anything like backpropagation.

Not strictly in the sense you mentioned (assuming that you mean "by themselves") but people may find [1] and [2] interesting.

[1] https://pub.towardsai.net/building-a-neural-network-with-bac...

[2] https://towardsdatascience.com/demystifying-feed-forward-and...

galangalalgol
0 replies
1h5m

Sadly, I have seen one. It was a vba script from the late 90s that used a simple dense multilayer network to do some unsupervised pattern classification. The linear algebra tools in vba/excel along with the solvers are all native dll code and the vba itself is all AOT compiled to native, so it typically runs very fast, and for small matrices it beats out numpy by an order of magnitude due to the ffi overhead. Was it the wrong tool? It depends on your constraints, but probably. It did work though.

sgt101
0 replies
9h10m

Also, if I write an article and quote some "text like this" [1] then that's not plagiarism, but if my argument is that the underlying assumption that you can anthropomorphize LLMs. Gradient descent and backpropagation don't take place in the brain. LLMs "learn" in the same way that Excel sheets "learn". Well, that's plagiarism, and it's not allowed, and people will get peeved and my career might get damaged.

I await the HN ban with fear..

[1] I'm not even doing referencing - so I am surely an LLM.

ben_w
0 replies
4h18m

But I disagree with the underlying assumption that you can anthropomorphize LLMs. Gradient descent and backpropagation don't take place in the brain. LLMs "learn" in the same way that Excel sheets "learn".

Backprop doesn't happen in us, but I think our neurones still do gradient descent – synapses that fire together, wire together.

And ultimately, at the deepest level we can analyse, our brains' atoms are doing quantum field diffusion equations, which you can also do in an Excel spreadsheet, so that kind of reductionism doesn't help either.

Humans are living beings with needs and rights. A person being able to legally squat in a home doesn't mean that a drone occupying property for some amount of time also has squatter's rights, even though you could easily and affordably automate and scale the deployment of drones to live and hide away on properties long enough to attain rights regarding properties all over the country.

Yes, but we can also do tissue cultures and crude bioprinting, so it's a very foreseeable future where exactly the same argument will also be true for living organisms rather than digital minds.

We need to figure out what the deeper rules are that lead to the status quo, not merely mimic the superficial result. The latter is how cargo cults function.

cyborgx7
5 replies
9h10m

It's fine for a human to remember it. It's not fine for a human to redistribute it for money (legally speaking). That's copyright infringement.

Robotbeat
4 replies
3h5m

Correct, just like it’s infringement to reproduce an article from memory using pen and paper intentionally. The person deciding to do that bears responsibility. OpenAI would be liable IFF they were intentionally facilitating that, instead of it being an undesired artifact from overfitting.

cyborgx7
2 replies
1h23m

I'm pretty sure if you reproduce a work from memory by accident, because you didn't notice your subconscious had just stored the entire article and is now reproducing it word for word, you'd still be guilty of copyright infringement.

yencabulator
1 replies
48m

The music business is full of examples of that.

freejazz
0 replies
4m

It's super obnoxious when people who have no understanding of the law, point to industry patterns or behaviors as examples of what is legal, not knowing the law and not knowing whether or not the thing they are pointing to is legal. The music business is also full of copyright infringement litigation. You also are not taking into account whether what is copied by an artist is covered by copyright when you made your statement. Do you know what's covered in music copyrights, such that your statement ever had any value for anyone else here?

freejazz
0 replies
4m

That's not true at all. Copyright infringement is a strict liability offense, with no inquiry into the state of mind of the infringer from a liability perspective. The state of mind of the infringer is only relevant to the issue of willful infringement.

wokwokwok
4 replies
8h21m

Is there some LLM meta where understanding and compression are argued to be the same thing I’m not aware of?

Anyone got more details on this?

Superficially it sounds like total BS; a highly compressed zip file does not exhibit any characteristics of learning.

Algorithmically derived highly compressed video streams do not exhibit characteristics of learning.

?

I’ve vaguely heard the learning can be considered to exhibit the characteristics of compression in that understanding of content (eg. segmentation of video content resulting in more highly compressed videos) can lead to better compression schemes.

…but saying you can “do a with b” and “a and b are fundamentally the same thing” seems like a leap…?

It seems self evident you can have compression without comprehension.

adroniser
1 replies
6h24m

Suppose you wanted to train an LLM to do addition.

An LLM has limited parameters. If an LLM had infinite parameters it could just memorize the results of every single addition question in existence and could not claim to have understood anything. Because it has finite parameters, if an LLM wants to get a lower loss on all addition questions, it needs to come up with a general algorithm to perform addition. Indeed, Neel Nanda trained a transformer to do addition mod 113 on relatively few examples, and it eventually learned some cursed Fourier transform mumbo jumbo to get 0 loss https://twitter.com/robertskmiles/status/1663534255249453056.

And the fact it has developed this "understanding" as an ability to learn a general pattern in the training data enables it to compress. I claim that the number of bits required to encode the general algorithm is fewer than the number of bits required to memorize every single example. If it weren't then the transformer would simply memorize every single example. But if it doesn't have space then it is forced to try to compress by developing a general model.

And the ability to compress enables you to construct a language model. Essentially, the more things compress, the higher the likelihood you assign them. Given a sequence of tokens, say "the cat sat on the", we should expect "the cat sat on the mat" to compress into fewer bits than "the cat sat on the door". This is because the former is far more common, and intuitively more common sequences should compress more. You can then look at the number of bits used for every single choice of token following "the cat sat on the" and thus develop a probability distribution for the next token. The exact details of this I'm unclear on. https://www.hendrik-erz.de/post/why-gzip-just-beat-a-large-l... this gives a good summary.
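
For the curious, here's a crude stdlib sketch of that "more compressible = more likely" step. zlib only exploits literal repetition, so the context is engineered to contain the common phrase; all strings are invented for illustration:

    import zlib

    def extra_bytes(context: str, continuation: str) -> int:
        # Additional compressed bytes the continuation costs, given the context.
        return (len(zlib.compress((context + continuation).encode()))
                - len(zlib.compress(context.encode())))

    ctx = "the cat sat on the mat. " * 40
    for cand in ["the cat sat on the mat", "the cat sat on the door"]:
        print(cand, "->", extra_bytes(ctx, cand), "extra bytes")
    # The familiar continuation costs fewer bytes; turning byte costs into a
    # distribution over candidates gives you a (very weak) language model.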

docfort
0 replies
1h26m

It’s exactly this kind of thinking that underlies lossless text compression (not exactly what a transformer guarantees but often what happens). For that reason, some people thought it would be fun to combine zip and transformers. https://openreview.net/forum?id=hO0c2tG2xL

vidarh
0 replies
6h3m

Even something as simple as LZW starts developing a dictionary. Not all compression is sufficient for understanding, but the more you compress a stream of data, the more dependent you are on understanding the source, because understanding the source allows you to take more shortcuts and still be able to reconstruct the data.
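
As a concrete illustration, a minimal LZW sketch (sample string invented) shows the dictionary being built up as the stream is consumed:

    def lzw_compress(data: str):
        table = {chr(i): i for i in range(256)}  # start with single characters
        phrase, codes = "", []
        for ch in data:
            if phrase + ch in table:
                phrase += ch  # keep extending the longest phrase seen so far
            else:
                codes.append(table[phrase])
                table[phrase + ch] = len(table)  # "learn" a new phrase
                phrase = ch
        codes.append(table[phrase])
        learned = [p for p, c in table.items() if c >= 256]
        return codes, learned

    codes, learned = lzw_compress("the cat sat on the mat, the cat sat on the mat")
    print(len(codes), "codes; first learned phrases:", learned[:8])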

amoss
0 replies
1h14m

The idea precedes LLMs by a couple of decades and is thought to apply more broadly within ML/AI than being a specific meta for LLMs. http://prize.hutter1.net/ has been around for a while, there is a link in there to the earlier work (called AIXI?).

dns_snek
0 replies
7h39m

fundamentally the same thing

I fundamentally disagree. That's not some established fact, just a narrative used by those who wish to plagiarize using "AI".

devsda
0 replies
6h12m

Humans are defined not just by their abilities but by their limitations too. We celebrate our achievements because sometimes they surpass the limitations of an average human.

Our collective human limitations (physical, mental and temporal) are sort of invisible implicit rules that we all follow in one way or another. If an entity is not bound by those rules, then I don't see why that entity should be treated the same as a human.

Companies already make this differentiation.

For example take captcha and bot detection. Some of the heuristics are based on inherent human limitations like response time, click time, mouse acceleration etc.

I doubt YouTube or any other streaming service will be happy if you want to stream all their videos to train a hypothetical human-like AI (which views and prepares notes like a human) at a hugely accelerated speed compared to a regular human. You can guess how quickly they will cite fair usage policies.

What I want to say is there are fundamental differences between a human and an AI. So, we should not be quick to dismiss any concerns just because AI can "mimic" humans in certain areas.

anileated
0 replies
4h27m

I can’t think of a better way to argue in favor of “LLMs are copyright laundering machines” than from the humanness angle.

Humans have rights, software tools don’t.

If you grant an LLM the full set of human rights, then it can consume information, regurgitate copyrighted works, and use it to generate money for itself. However, considering blatantly obvious theft as “homage” goes hand in hand with free will, agency, being in control of yourself, not being enslaved and abused, etc. Pondering various scenarios along those lines really gets to the heart of why an LLM is so very much not a human, and how subjecting it to the same treatment as humans is a ridiculous notion.

If you don’t grant LLM human rights, then ClosedAI’s stance is basically that pirating works is OK because they pass them through a black box of if conditions and it leads to results that they can monetize. That’s such a solid argument, it’ll surely play well in the court of law.

Training data is not an “LLM does it”; first because “it” here is not “learning” or understanding in human sense (otherwise you would have to presume that an LLM is a human), and second because a software tool doesn’t have agency and it’s really just Microsoft using a tool based on copyrighted works to generate profit.

RandomLensman
0 replies
9h27m

We can have different rules for humans than for machines. In fact, that happens all the time.

accrual
0 replies
2h16m

Sometimes they're so overfit that the compression isn't even lossy, and the data is encoded verbatim in the NN.

Here's an article from November 2023 that discusses this:

https://not-just-memorization.github.io/extracting-training-...

sagarm
16 replies
9h1m

Isn't it totally normal to write articles / blog posts that effectively summarize, and often quote from, news articles?

laborcontract
14 replies
8h55m

My impression is that it's not necessarily legal, but going after bloggers and proving damages is just a huge waste of their time. OpenAI came by with their fat stack of funding and changed that.

facu17y
6 replies
8h45m

It is legal. Fair use. People have been doing it for ages. Almost every article you've ever read has some fair use of another article, book or news item, etc.

lacrimacida
5 replies
8h39m

When it becomes a service where you make money but the source doesn't, is it still fair use?

whythre
4 replies
7h53m

Yeah. No one is out there suing the shit out of cliff notes because they published a summary of Catcher in the Rye.

tkz1312
3 replies
6h37m

they might if cliff notes started copy-pasting parts of the source into their articles and passing it off as original writing though :)

throwuwu
1 replies
5h57m

The Tolkien estate should get busy suing all the fantasy writers, comic artists, game developers and board and card game companies. Lots of cash there.

galangalalgol
0 replies
52m

They have done some of that, actually. Tolkien will be public domain in the nations that are at author's death+50 in a few days. Sadly, it will be a much longer wait in mine and many others.

hn_acker
0 replies
1h45m

Newspapers generally don't "pass off" quotes as their own writing. They make clear which parts they quoted.

ralfd
3 replies
8h44m

What the parent poster meant is that it is normal for news organisations to reference each other and report/cite/rephrase each other's reports. For example, all other newspapers reported on the Watergate scandal first reported by Bernstein & Woodward in the Washington Post.

Jensson
1 replies
8h22m

Those cite the original source that they used to write the article; the GPT models don't.

weird-eye-issue
0 replies
5h48m

Depends on your prompt

laborcontract
0 replies
5h49m

Yeah, but for every instance of that there are facehugger-like blogs that will rewrite the article in a way that seems almost meant to deprive the source of any credit.

It’s not clear to me where the line is.

Symmetry
1 replies
4h25m

No, in US law at least there can be no copyright of facts, only presentation. If you convey the same facts in different words that isn't a matter of fair use, it's never even a matter of copyright in the first place.

onos
0 replies
26m

How about things that aren’t quite facts? Reviews, opinions, etc.

oxguy3
0 replies
3h14m

No, it is very specifically and deliberately fair use. That is the primary intended purpose of fair use. The New York Times doesn't own the news; they just own their articles.

cmiles74
0 replies
4h59m

I think the issue is that they trained ChatGPT on the New York Times' proprietary IP without paying licensing fees and, the Times argues, that is illegal. By way of proof the Times has examples of ChatGPT dumping out articles verbatim.

carlosdp
9 replies
2h17m

What you described is entirely fair use, actually.

Not only that, look at a few news articles from Tier 2 and down publications, and you'll realize that almost all of them are directly sourced from NYT and others. They'll say "so and so happened, according to The Times" (and usually link the article there)

h1fra
3 replies
30m

it's fair use if you don't make money from your project, no?

JohnFen
2 replies
26m

No.

In the US, whether or not you make money has little to do with whether or not your use qualifies as "fair use".

semiquaver
1 replies
18m

Why do you say that? Commercial vs noncommercial use is a primary factor in the “purpose” prong of the fair use balancing test and a significant one in the “market effects” prong.

That a use is noncommercial is often a deciding factor in the success of a fair use defense. GP is overstating it though, since it’s still one of many factors.

freejazz
0 replies
12m

Because anyone who is familiar with fair use knows that the commerciality aspect of the purpose prong is not one of the more important parts of the fair use analysis, whereas transformation is. Transformation adjusts what counts as a purpose that falls under fair use. Did you read Warhol??

hn_throwaway_99
1 replies
1h59m

What you described is entirely fair use, actually

Just like during the pandemic how everyone became an epidemiologist, suddenly everyone's a copyright lawyer. I'll just dispute your assertion by saying:

1. Questions of fair use are famously gray, and anyone who declares something as "entirely fair use", with no caveats, is nearly always wrong except in the most obvious cases, which the given example is most definitely not. A judge has wide latitude in determining fair use.

2. People should familiarize themselves with the four factors of fair use determination. In particular, if a work is purely derivative of a source work and substantially negatively impacts the market for the original work, it's very likely to not be considered fair use.

A great overview is https://fairuse.stanford.edu/overview/fair-use/four-factors/

NegativeK
0 replies
1h24m

suddenly everyone's a copyright lawyer

Roll back 20+ years ago on Slashdot and you'll see the exact same thing.

Copyright has been a hot button issue on the internet for decades. People end up thinking (rightly or wrongly) that they understand it without being a lawyer.

freejazz
0 replies
13m

What you described is entirely fair use, actually.

Based upon what? You think other publishers use NYTimes articles for free without license?

Powdering7082
0 replies
9m

Do you have some examples & are you sure they don't pay licensing fees to NYT?

BolexNOLA
0 replies
1h57m

I would say it is arguable that is fair use, but the whole thing about fair use is that it is a defense, not a type of license or something you can preemptively apply. So whether or not it will be protected under fair use is actually not determined yet. In fact I would say that’s the entire debate here, right?

I have worked on many documentaries and any time we said “fair use” internally what we were implicitly saying is “nobody will come after us because they know that we are probably safe under fair use if this escalated.“ But again, we could never preemptively apply it. We were just anticipating potential conflict and gauging how likely it was to occur.

logicchains
7 replies
10h6m

Something like, summarise all articles on US-UK relationships over past 5 years. I charge money for it, and all I pay NYT is a monthly subscription fee.

Is that fair use? IANAL, but doesn't sound like it.

If you pay someone to do the summarisation for you, then you publish the content and charge a fee for it, you're the one liable, not the person you paid to summarise it for you. Similarly if you ask GPT to do it for you, then publish it, you're liable for what you publish; GPT is just a summarisation tool.

tsimionescu
3 replies
9h5m

That's not true at all. If you pay someone to copy NYT articles for you verbatim, and then they give the copies to you, and then you publish them online, then you've both violated the copyright. You are never allowed to make copies of copyrighted works, even for private deals (making such copies for purely personal use, such as archival, falls under fair use - but you can't build a service out of that).

So, if the summaries are derived works and not covered by fair use, then both you and the summarizer are separately breaking the NYT's copyrights. Otherwise, if this is covered by fair use, then you are both in the clear.

Finally, GPT is not "a summarization tool" in this case. If you provide a copy of a NYT article as a prompt and then ask for summarization, then yes, it is clear that GPT is not doing anything wrong, even if it spits out the exact same text. But if you simply ask for a summary of a specific article by, say, just name and date, and you get a copy of it, it's clear that GPT is storing the original data in some way, and thus it has copied the NYT's protected works without permission.

logicchains
1 replies
8h55m

But if you simply ask for a summary of a specific article by, say, just name and date, and you get a copy of it, it's clear that GPT is storing the original data in some way, and thus it has copied the NYT's protected works without permission.

In this particular case they were using it via Bing, which actively did an HTTP request to the particular article to extract the content. So GPT hadn't memorised it verbatim; instead it fetched it, much like a human using a search engine would.

tsimionescu
0 replies
6h25m

The article states that they used it initially through ChatGPT, but that seems to have been fixed in the meantime, at least for the very simplistic queries that used to work ("the first paragraph of the Carl Zimmer article on old DNA" in ChatGPT used to return the exact data from NYT, and "next paragraph" could then be used to get the following ones). Even if this has been fixed, it still proves that ChatGPT encodes exact copies of NYT articles in its weights, which may be a violation in itself, even if it is prevented from returning them directly. Especially if they ever started distributing the trained model.

Additionally, even the use through Copilot is very debatable. They are not returning the NYT link, which requires a subscription, they are returning the contents of it even to non-subscribers. And they are doing this in a commercial product, not a non profit like the Internet Archive, which has some arguments for fair use.

BlueTemplar
0 replies
8h20m

Also, ChatGPT isn't a person with rights and duties. The people that made it are responsible for it.

rich_sasha
2 replies
9h53m

That's not the example. Here I proactively scrape NYT, summarise articles for a fee and sell that as a service. It's not people coming to me with some articles to summarise, and maybe then publishing it online.

At some level it becomes a subversion of NYTs fees. First, say I subscribe and simply host the articles verbatim, for a fee. Clearly, that's not right.

Suppose I change some spelling or word order, or use a synonym or two. That's still not ok.

And if I substantially paraphrase the articles? I guess this is the relevant case. This is kind of what LLMs do. And also feels like not fair use.

logicchains
1 replies
9h9m

That's not the example. Here I proactively scrape NYT, summarise articles for a fee and sell that as a service. It's not people coming to me with some articles to summarise, and maybe then publishing it online.

That's not what OpenAI is doing; it's not selling summarised articles as a service. Your example is a false equivalence.

This is kind of what LLMs do. And also feels like not fair use

An LLM doesn't do this unless you ask it to. And if you then take that output and publish it as your own, you're breaching the copyright, not OpenAI.

heavyset_go
0 replies
8h52m

An LLM doesn't do this unless you ask it to. And if you then take that output and publish it as your own, you're breaching the copyright, not OpenAI.

In this case, OpenAI is violating copyright by modifying, reproducing and distributing copyrighted content to its customer.

px43
6 replies
9h37m

From what I can tell, this has nothing to do with LLMs at all. In the example in the article, the user is asking Bing to go fetch the contents of an article directly from the website, and print it out, which it dutifully does.

Seems like the "problem" is that the NYT etc. give privileged access to search engines for indexing their content, but then get upset when snippets of the indexed content are shown to users without the users having to fight the paywall or whatever.

This article also claims that the screenshot is coming from ChatGPT when it clearly is not.

rich_sasha
5 replies
9h29m

I suppose that's a relatively easy thing to fix, technically. It proves, however, that the underlying LLM is trained on copyrighted data.

I'm not sure the problem goes away simply if the LLM in question (or any other one) gets some "no verbatim regurgitation" filter.

exitb
3 replies
9h2m

In that case, the language model calls a search function and just repeats the result out of its conversation context, not its training data. With that in mind, it's not clear why it's OK for Bing itself to quote the source, but it stops being OK when a chatbot does it.

Jensson
2 replies
7h58m

Bing links to the source; chatbots don't.

px43
0 replies
6h53m

In the example from the article, it very clearly points to all the sources used.

dyno12345
0 replies
1h22m

bing's chatbot does

kolinko
0 replies
5h30m

The example from the article doesn't show that LLM is trained on copyrighted data - it's just Bing fetching the source article, providing it to GPT, and GPT rephrasing the article. An agent trained on entirely copyright-free data would provide exactly the same output.

modeless
5 replies
1h45m

To keep things simple, let's say I never regurgitate chunks of verbatim NYT articles, maybe quite short snippets.

You just described Google. When you think about it, it's surprising that Google is legal. However, it is well established that what Google does is perfectly legal. Remember that internally Google keeps and uses complete verbatim copies of every web page they index.

Yes, Google offers a link to the source. If OpenAI did the same, even if only 0.1% of people clicked on the links and NYTimes hardly got any revenue from it, would that make it legal in your eyes? What if they implemented a system that detected when it was outputting a verbatim copy of something and simply paraphrased it? NYTimes clearly doesn't have copyright on paraphrased versions of their articles. I think it would be pretty silly if the government forced them to do that as it wouldn't make any practical difference to anyone.
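
A toy sketch of that verbatim-output check (the threshold, function names, and inputs are made up; a real system would need something far more robust than word-level n-grams):

    def shares_long_ngram(output: str, source: str, n: int = 8) -> bool:
        # Flag the generation if any n consecutive words also appear in the source.
        out_w, src_w = output.split(), source.split()
        src_ngrams = {tuple(src_w[i:i + n]) for i in range(len(src_w) - n + 1)}
        return any(tuple(out_w[i:i + n]) in src_ngrams
                   for i in range(len(out_w) - n + 1))

    # e.g. if shares_long_ngram(draft, nyt_article): regenerate with a paraphrase prompt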

graeme
2 replies
1h27m

Any publisher can opt out of google. Publisher also have substantial control over titles and snippets shown in google, whether an article appears in google news, etc

Paraphrasing is also known as cloning and is often a copyright violation

modeless
1 replies
1h15m

Copyright law doesn't mention opt outs or search engine snippet controls. It's not clear to me that robots.txt is the singular thing that makes Google legal.

In US copyright law facts cannot be copyrighted, so copyright on factual content like newspaper articles is limited. Simply replacing a few words wouldn't work, but I am certain that GPT-4 is capable of paraphrasing factual content at a level that would not be considered infringement if a human did it.

freejazz
0 replies
10m

Copyright law doesn't mention opt outs or search engine snippet controls. It's not clear to me that robots.txt is the singular thing that makes Google legal.

Genuinely - what are you talking about besides your own assumptions? You just assume everything Google does is legal, and therefore anyone else doing anything arguably similar must also be legal? Without regard for factual details that do matter to copyright law, such as license? Your own description of copyright law here is very stunted - you can't paraphrase articles of the NYTimes and call it fair use. You can report on what the NYTimes reports on... because that's what news is.

inetknght
1 replies
1h31m

However, it is well established that what Google does is perfectly legal.

Google has a wide range of products and shakedowns. Not all of them are "perfectly" legal: Google is being challenged in court over some of those shakedowns and product practices.

modeless
0 replies
1h29m

I am clearly talking about the web search engine in the context of copyright. Other products or legal concerns like antitrust are completely irrelevant here.

tauntz
2 replies
5h51m

Is that fair use?

As always, the answer is.. "it depends". I guess it depends mostly on the jurisdiction that applies to you. "Fair use" can have rather different legal meaning (or not exist at all) in different countries.

madeofpalk
0 replies
5h7m

Also, “fair use” does not turn on precedent - each case is assessed individually, which really can be a flip of the coin.

hn_acker
0 replies
1h38m

Fair use is specific to the US, as far as I'm aware. Moreover, Congress had to codify fair use (turn fair use common law into statutory law in the form of 17 U.S. Code § 107) in order to make copyright statutes compatible with the First Amendment. Most other countries don't have the same protections for freedom of expression and freedom of the press, so copyright law in a different country usually lacks a unifying exception test like fair use to supplement the specific enumerated exceptions.

superb-owl
1 replies
4h8m

A sibling comment mentions search engines. I think there's a big difference. A search engine doesn't replace the source, not at all.

Google has been accused for years of replacing sources with their "One Box"--the big answers at the top of the page, which are usually pulled from or corroborated by search results. They don't want you to leave the search results page (where the ads are).

paxys
0 replies
2h12m

Google is very careful to license all the content that shows up in that interface. They even pay Wikipedia, despite legally not needing to at all.

ks2048
1 replies
4h39m

It would be nice to have a principled answer to this, but unfortunately, in our world, the answer is probably: if you start making LOTS of money doing this, they will come after you.

doctorpangloss
0 replies
2h4m

The best example is that sport scores, names and stats are not copyrightable by settled case law; however, you still have to go to the NBA and players union if you want to make a fantasy basketball game that has stats or names.

TeMPOraL
1 replies
9h41m

Typically I can't take a personal "tier" of a product and charge 3rd parties for derivatives of it. Say like VS Code.

Can't you, though? I'd thought that, in general, it's very important for the market to be able to do just that, otherwise everything gets gummed up in webs of exclusive contractual dependencies between established companies.

rich_sasha
0 replies
9h31m

As I say, I don't really know. But then, this is exactly how SaaS licensing works. There may even be a free personal tier, where you can't sell products based on it, and a professional tier which may be very expensive indeed.

Typically providers of online databases go to some effort to stop people from sharing logins. Even from that point of view, I can imagine scraping articles and providing paraphrases of them for a fee is fishy.

All I'm saying, to some people it's obvious that the whole LLM on scraped Internet is fair use, to me it is not obvious.

qeternity
0 replies
5h11m

Typically I can't take a personal "tier" of a product and charge 3rd parties for derivatives of it.

I think you’re confusing terms of service and copyright. IANAL but what you describe sounds exactly like fair use to me, irrespective of how much you are paying NYT.

papruapap
0 replies
8h34m

Using similar logic, NYT should pay all the actors involved in their articles.

oh_sigh
0 replies
1h26m

I agree with your IANAL take, but what about a situation with an extra level of indirection? So the service never reads actual NYT articles, but only reads blog/forum posts about NYT articles, and derives what is in the article from conversations about the article by people who have read it. Is that legal now?

oefrha
0 replies
6h21m

The real answer is that it totally depends on whether your product grows to $10,000,000,000, and whether you pay part of it back. Search engines pay with referral traffic.

jojobas
0 replies
8h43m

Can you read all of NYT and other things, and answer others' questions based on your knowledge? I'd imagine you can. I'm afraid you can't sidestep the question of whether an LLM is more like a person who's read a lot or an archive/index.

charcircuit
0 replies
33m

This analogy fails to capture the transformative nature of these models. Hosting a derivative work that is also a news article is not transformative. Hosting a next-word completer is very different from hosting a news article and can't be used as a substitute.

brookst
0 replies
2h4m

How about if you read the paper every day and write opinion pieces about world events? Fair use?

bnralt
0 replies
4h25m

As someone pointed out, plenty of blogs made money off of doing just that. Many people go to Reddit to read news article summaries (and often a comment just pastes the whole article verbatim), instead of paying a site like the New York Times. Twitter and other social media sites are full of people summarizing articles from the New York Times. Any late breaking news article from Wikipedia is going to be mostly summarizing information from reporters.

I think people severely underestimate how much they've grown accustomed to this information being freely available. It's easy to say "Well it shouldn't be available with ChatGPT," but if we actually put everything back behind a paywall and stopped people from doing things like writing blogs or newsletters that summarize the news, people here would get angry very fast.

JCharante
0 replies
1h2m

But is it legal for me to read the NY Times about a war, and then charge people to interview me as an "expert"?

BlueTemplar
0 replies
8h17m

There's nothing wrong with scraping openly available data (including data openly available by mistake, as long as you are not aware of it, see the Bluetouff affair).

So the demand to destroy those databases seems very dubious to me.

Of course later violating fair use is another issue.

6stringmerc
0 replies
9h3m

No, the four factor test is clear. Next.

groceryheist
59 replies
11h44m

The suit demonstrates instances where ChatGPT / Bing Copilot copy from the NYT verbatim. I think it is hard to argue that such copying constitutes "fair use". However, OAI/MS should be able to fix this within the current paradigm: Just learn to recognize and punish plagiarism via RLHF.

However, the suit goes far beyond claiming that such copying violates their copyright: "Unauthorized copying of Times Works without payment to train LLMs is a substitutive use that is not justified by any transformative purpose."

This is a strong claim that just downloading articles into training data is what violates the copyright. That GPT outputs verbatim copies is a red herring. Hopefully the judge(s) will notice and direct focus on the interesting, high-stakes, and murky legal issues raised when we ask: What about a model can (or can't) be "transformative"?

visarga
14 replies
10h45m

Just learn to recognize and punish plagiarism via RLHF.

This is not an RLHF problem. What I was expecting them to do is to keep a bloom filter of n-grams for known copyrighted content, such as enumerating all sets of n=7 consecutive words in an article, and validating outputs against it. The model would only output at maximum n-1 words that look verbatim from the source.
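
A minimal sketch of that idea, assuming a simple double-hashing bloom filter; the corpus and sizes are toy stand-ins:

    import hashlib

    class BloomFilter:
        def __init__(self, size_bits=1 << 24, num_hashes=4):
            self.size = size_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(size_bits // 8)

        def _positions(self, item):
            for i in range(self.num_hashes):
                digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
                yield int.from_bytes(digest[:8], "big") % self.size

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, item):
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(item))

    def ngrams(text, n=7):
        words = text.split()
        return (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))

    corpus = "..."  # stand-in for the protected article text
    bf = BloomFilter()
    for gram in ngrams(corpus):
        bf.add(gram)

    # Flag any candidate output containing an indexed 7-gram; the model
    # would then be limited to at most n-1 verbatim words in a row.
    def looks_verbatim(candidate):
        return any(gram in bf for gram in ngrams(candidate))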

But this will blow up in their face. Let's see:

- AI companies will start investing much more in content attribution

- The new content attribution tools will be applied on all human written articles as well, because anyone could be using GPT in secret

- Then people will start seeing a chilling effect on creativity

- We must also check NYT against all the other sources, since not everything they write is original

groceryheist
9 replies
10h30m

Maybe the bloom filter solution is enough, but I wonder.

- Paraphrasing n=7 words (and quite a few more) within a sentence can easily be fair use.

- As n gets big, the bloom filter has to also.

If/when attribution is solved for LLMs (and not fake attribution like from Bing or Perplexity) then creators can be compensated when their works are used in AI outputs. If compensation is high enough this can greatly incentivize creativity, perhaps to the point of realizing "free culture" visions from the late 90s.

visarga
4 replies
10h17m

As n-gram length grows, we still have roughly the same number of n-grams; they go through a hashing function and are indexed in the bloom filter as usual. The number of n-grams of size n in a text is text_length - ngram_length + 1.

groceryheist
3 replies
9h57m

The number of unique values in the bloom filter will go up ~exponentially with n. So to control the false positive rate the bloom filter has to grow.
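
For scale, the standard bloom filter sizing formula m = -n * ln(p) / (ln 2)^2 makes that growth concrete (the numbers below are illustrative):

    import math

    def bloom_bits(num_items, fp_rate):
        # Standard bloom filter sizing: m = -n * ln(p) / (ln 2)^2 bits
        # for n items at false-positive rate p.
        return math.ceil(-num_items * math.log(fp_rate) / math.log(2) ** 2)

    # e.g. one billion unique 7-grams at a one-in-a-million
    # false-positive rate needs roughly 3.3 GiB of filter:
    print(bloom_bits(10**9, 1e-6) / 8 / 2**30)  # ~3.35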

visarga
2 replies
9h48m

At a large enough n-gram size there would be very few collisions. You can take, for example, this text and try it in Google with quotes; it won't find anything matching exactly.

I tested this 6-gram "it won't find anything matching exactly", no match. Almost anything we write has never been said exactly like that before.

groceryheist
0 replies
9h37m

Yes and the fact that the number of unique phrases grows so quickly with n is why the bloom filter needs to grow so that hashed n-grams don't collide.

dleeftink
0 replies
7h8m

it won't find anything matching exactly

This approach is probably inadequate. In my line of (NLP) research I find many things have been said exactly many, many times over.

You can try this out yourself by grouping and counting strings using the many publicly available BigQuery corpora for various substring lengths and offsets, e.g. [0-16]; [0-32]; [0-64] substring lengths at different offsets.

geysersam
2 replies
9h54m

if compensation is high enough

Who pays the compensation? If it's the user, why wouldn't they just buy the authors work directly? Why go through the LLM middleman?

starttoaster
0 replies
9h27m

If it's the user, why wouldn't they just buy the authors work directly? Why go through the LLM middleman?

If it's the user, why wouldn't they just buy the DVDs directly? Why go through the Netflix middleman?

A retort to this would be that both NYT and ChatGPT are on the internet, so it's no added fuss of hopping in my car, driving to Walmart, and picking up a DVD case. My response to it would be that both the LLM and Netflix are content aggregators to the user. I can read the NYT, or I can read the NYT summary on ChatGPT and ask it for life advice with my pet hamster, or ask it how to reverse a linked list in bash.

groceryheist
0 replies
9h51m

The LLM users/middlemen pay. The user probably pays less than they would have to pay the author. The LLM provides information retrieval / discovery.

sideshowb
0 replies
3h12m

I like the idea, but it seems like there would be big problems, like detecting whether a work has been reworded. Or when a large number of sources have all slightly influenced a small response - isn't that pretty much considered new knowledge?

Then there's the issue that however you credit attribution, it creates a game of enshittified content creation with the aim of being attributed as often as possible, regardless of whether the content really offered anything that wasn't out there already.

mike_hearn
2 replies
8h18m

I think it is an RLHF problem and that you are right - this will blow up in the faces of the NYT.

Specifically, the NYT examples all seem to be cases where they asked the AI to repeat their articles verbatim? So they ask it to violate copyright and because it's a helpful bot with a good memory, it does so.

Solution: teach the model to refuse requests to repeat articles verbatim. It's easily capable of recognizing when it's being asked to do that. And that's exactly what OpenAI have now done.

So the direct problem the NYT is complaining about - a paywall bypass - is already rectified. Now it would seem to me like the case is quite weak. They could demand OpenAI pay them damages for the time ChatGPT wasn't refusing, but wouldn't they have to prove damages actually happened? It seems unlikely many people used ChatGPT as a paywall bypass for the NYT specifically in the past year. It only knows old articles. OpenAI could be ordered to search their logs for cases where this happened, for example, and then the NYT could be ordered to show their working for the value of displaying a single old article to a non-subscriber, and from that damages could be computed. But it wouldn't be a lot.

That's presumably why the case goes further and argues that OpenAI is in violation even when it isn't repeating text verbatim. That's the only way the NYT can get any significant money out of this situation.

But this case seems much weaker to me. Beyond all the obvious human analogies, there is precedent in the case of search engines where they crawl - and the NYT let them crawl - specifically to enable the creation of a derived data structure. Search engine indexes are understood to be fair use, and they actually do repeat parts of the page verbatim in their snippets. Google once even showed cached versions of whole pages. And browser makers all allow extensions in their stores that strip ads and bypass paywalls, and the NYT hasn't sued them over that either.

cycomanic
1 replies
5h28m

This is not how copyright works, though. The verbatim quoting of articles matters because, when people initially raised these questions, the argument was that the NN doesn't really contain the training data, or contains it only in an abstract, condensed way that does not constitute copying of the content.

This demonstrates that no, the NN actually does contain the full articles, copied into the NN. Do you think any normal person would get away with copying MS Windows by e.g. zipping it together with some other OS on the same medium? Why should we let OpenAI get away with this?

mike_hearn
0 replies
4h41m

Search indexes contain exact copies of the pages they index, and that isn't a copyright violation.

> Why should we let OpenAI get away with this?

IP rights, like other private property rights, are a compromise between creators and consumers. What "should" be the case is essentially an argument about what balance creates the best overall outcomes. LLMs, for now, require large amounts of text to train, so the question is one of whether we want LLMs to exist or not. That's really a question for Congress and not the courts, but it'll be decided in the courts first.

dyno12345
0 replies
1h19m

peyton
9 replies
10h59m

Well yeah, copying a work and using it for its original expressive purpose isn’t fair use, no? You have to use it for a transformative purpose.

Suppose I’m selling subscriptions to the New Jersey Times, a site which simply downloads New York Times articles and passes them through an autoencoder with some random noise. It serves the exact same purpose as the New York Times website, except I make the money. Is that fair use?

echelon
3 replies
10h54m

Well yeah, copying a work and using it for its original expressive purpose isn’t fair use, no? You have to use it for a transformative purpose.

They transformed the weights.

Just like reading the article transforms yours.

As for verbatim reproduction, I'm pretty sure brains are capable of reproducing song lyrics, musical melodies, common symbols ("cool S"), and lots of other things verbatim too.

Those quotes from Dr. King's speech that you remember are copyrighted, you know?

JambalayaJim
2 replies
10h27m

This comment is just blatant anthropomorphizing of ML models. You have no idea if reading an article “transforms weights” in a human mind, and regardless, they aren’t legally the same thing anyway.

stevenhuang
0 replies
8h25m

Modern neuroscience does highly suggest this is essentially what's happening.

echelon
0 replies
2h58m

they aren’t legally the same thing anyway.

They should be.

cornel_io
3 replies
10h13m

If they could find a single person who in natural use (e.g. not as they were trying to gather data for this lawsuit) has ever actually used ChatGPT as a direct substitution for a NYT subscription, I'd support this lawsuit.

But nobody would do that, because ChatGPT is a really shitty way to read NYT articles (it's stale, it can't reliably reproduce them, etc.). All that is valuable about it is the way that it transforms and operates on that data in conjunction with all the other data that it has.

The real world use of ChatGPT is very transformative, even if you can trick it into behaving in ways that are not. If the courts act intelligently they should at least weigh that as part of their decision.

whoopsie
1 replies
9h11m

That’s nonsense piracy. I never intend to own a truck, so when I need to haul a little something I go to Home Depot and steal a Ford off the lot for an hour? What if I stole all your commits, plucked the hard lines out of the ceremony, and then launched an equivalent feature the same week as you did, but for a competing software company? Would you or your employer deserve to get paid for my use of the slice of your work that was specifically useful for me? Yeah, and then some extra for theft.

Zpalmtree
0 replies
17m

awful comparison

peyton
0 replies
9h31m

It’s more of a thought experiment. Here’s another with more commercial applications:

Suppose I start a service called “EastlawAI” by downloading the Westlaw database and hiring a team of comedians to write very funny lawyer jokes.

I take Westlaw cases and lawyer jokes and feed them to my autoencoder. I also learn a mapping from user queries to decoder inputs.

I sell an API and advertise it to startups as capable of answering any legal question in a funny way. Another company comes along with an API to make the output less funny.

Have I created a competitor to Westlaw by copying Westlaw’s works for their original expressive purpose and exposing it as an intermediary? Or have I simply trained the world’s most informative lawyer joke generator that some of my customers happen to use for legal analysis by layering other tools atop my output?

Did I need to download Westlaw cases to make my lawyer joke generator? Are the jokes a fair-use smokescreen for repackaging commercially valuable copyrighted data? Does my joke generator impact Westlaw in the market? Depends, right?

hn_acker
0 replies
1h20m

Well yeah, copying a work and using it for its original expressive purpose isn’t fair use, no? You have to use it for a transformative purpose.

To be clear, whether the use of the original work is transformative is one key consideration within one of the four prongs of fair use. The prong "purpose and character of the use" can be fulfilled by other conditions [1]. For example, using the original work within a classroom for education purposes is not transformative, but can fulfill the same "purpose and character of the use" prong. Whether the use is for profit and to which extent are other considerations within that prong. A profit purpose doesn't automatically fail the purpose prong, and a non-profit purpose doesn't automatically pass the purpose prong.

[1] https://en.wikipedia.org/wiki/Fair_use#1._Purpose_and_charac...

jahewson
8 replies
10h29m

Many instances of fair use involve verbatim copying. The important questions surround the situation in which that happens - not so much the copying. NYT is in uncharted territory here.

fsckboy
7 replies
10h17m

In the same way that machines are not able to claim copyright, they aren't allowed to claim other legal rights either, like "fair use".

The entity which owns ChatGPT is apparently maintaining a copy of the entirety of the New York Times archive within the ChatGPT knowledge base. That they extract some fair use snippets (they would claim) from it would still be fruit of the poisonous tree, no?

(disclaimer: I'm pro AI, anti copyright, especially anti elitist NY Times; but pro rule of law)

colechristensen
5 replies
10h11m

I think there is some point between fifty years ago and last week at which the content of newspapers should enter the public domain. That part of copyright needs to be fixed.

Your creative work does deserve at least some period of exclusive rights for you. Definitely not so much that your grandchildren get to quibble about it well into retirement. But also, the number 3 or 4 most valuable company in the world doesn't get to scrape your content daily to repackage and sell as intelligent systems.

TeMPOraL
4 replies
9h33m

But also, the number 3 or 4 most valuable company in the world doesn't get to scrape your content daily to repackage and sell as intelligent systems.

Here's a thing though: for 99%+ of that content, being turned into feedstock for ML model training is about the only valuable thing that came of its existence.

If it were not for the world-ending danger of too smart an AI being developed too quickly, I'd vote for exempting ML training from copyright altogether, today - it's hard to overstate just how much more useful any copyrighted content is for society as LLM training data than as whatever it was created for originally.

tsimionescu
3 replies
8h59m

Except if you do that, you will see the number of content producers plummet quite quickly, and then you won't have any new training data to train new LLMs on.

aspenmayer
2 replies
8h32m

Would it not logically follow that nothing of value would be lost, even if that were the case? From the point of view of LLMs and content creators, I would treat potential loss of future content being created like I would treat a lost sale. LLMs have value now because of training performed on content that already exists. There must be diminishing returns for certain types of content relative to others. Certain content is only of value if it is timely, and going forward, content that derives its worth from timeliness would find its creation and associated costs of production and acquisition self-justifying. If content isn’t of value to humans now or in the future, nor even of value to LLMs now or in the foreseeable future, not even hypothetically, then why should we decry or mourn its loss or absence or failure to be created or produced or sold?

tsimionescu
1 replies
6h33m

That's like saying that if a competitor can take your products from your warehouse and sell them for pennies on the dollar, your business has no value. The point is that, to some extent, OpenAI is selling access to NYT content for much cheaper than NYT, while paying exactly 0 to NYT for this content. Obviously, the NYT content costs the NYT more than 0 to produce, so they just can't compete on price with OpenAI, for their own content.

Note that I don't see any major problem if only articles that were, say, more than 5 or 10 years old were being used. I don't think the current length of copyright makes any sense. But there is a big difference from last year's archive vs today's news.

aspenmayer
0 replies
52m

For the sake of argument, let's say that OpenAI thought it had the rights to process the NYT articles and even display them in part, for the same reasons, fair use or otherwise, that Google can process articles and display snippets of same in its News product, and/or for the same reasons that Google can process books and display excerpts in its Books product. Just like Google in those cases, I would not be surprised to find Google/OpenAI on the receiving end of a lawsuit from rights holders claiming violations of their copyright or IP rights. However, I side with Google then and OpenAI now, as I find both use cases to be fair use, as the LinkedIn case has shown that scraping is fair use.

NYT is crying foul because users/consumers of its content archive have derived unforeseen value from said archive under fair use terms, so NYT has no way to compel OpenAI to negotiate a licensing deal under which they could extract value from OpenAI's use of NYT data beyond the price paid by any other user of NYT content, whether it be unpaid fair use or fully paid use under license. It feels to me that NYT is engaging in both double-dipping and discriminatory pricing, because they can, and because they're big mad that OpenAI is more successful than they are with less access to the same or even less NYT data.

visarga
0 replies
10h4m

There is another fix, but it will have to wait for GPT-5. They could reword articles, summarize them in different words, and analyze their contents, creating sufficiently different variants. The ideas would be kept, but the original expression stripped. Then train GPT-5 on this data. The model can't possibly regurgitate copyrighted content if it never saw it during training.

This can be further coupled with search - use GPT to look at multiple sources at once, and report. It's what humans do as well, we read the same news in different sources to get a more balanced take. Maybe they have contradictions, maybe they have inaccuracies, biases. We could keep that analysis for training models. This would also improve the training set.

JumpCrisscross
6 replies
10h10m

Just learn to recognize and punish plagiarism via RLHF

OpenAI has created a $100bn company on this transfer. The Times may have an interest in a material fraction of that wealth.

vidarh
5 replies
9h4m

The NYT is also worth a tiny fraction of that. If it looks like they might get anywhere, it might be better for OpenAI to buy them.

afavour
2 replies
8h13m

That would require NYT being willing to sell, which historically they have not been.

vidarh
1 replies
7h52m

I just looked up the share structure; I didn't realise the publicly traded shares only appoint 1/3 of the board. Still, their second best option is to start buying up competitors and going ahead with purging NYT from their training set. That might well end up a worse option for NYT, as it won't stop LLMs from gradually intruding on their space, and the moment OpenAI or other LLM providers own major publishers and no longer need to depend on scraping, NYT loses any leverage it currently has.

JumpCrisscross
0 replies
55m

might well end up a worse option for NYT, as they won't stop LLMs from gradually intruding on their space

The Times almost certainly wants its own LLM. I could see them striking a consortium agreement with other newspapers more easily than OpenAI.

cmiles74
1 replies
4h55m

OMG! Or they could just license the content. I suspect that would be both easier and less expensive. ;-)

vidarh
0 replies
3h21m

I'm not convinced it's a given it will. If it becomes necessary to license, owning the large publishers will be leverage and allow locking competitors out unless you have a portfolio to cross license.

OpenAI alone has a market cap that'd allow it to buy about as large a proportion of publishers of newspapers and books as they'd be allowed before competition watchdogs will start refusing consent.

Put another way:

If I was a VC with deep pockets investing in AI at this point, I'd hedge by starting to buy strategic stakes in media companies.

spacecadet
3 replies
11h11m

Transformations are happening. Maybe if the output is verbatim afterwards, then that says something about the output's originality all along... or am I a troll?

jarrell_mark
1 replies
10h57m

Anything + 2 and then minus two is back to the original thing. This says more about the transformations than the source material.

spacecadet
0 replies
5h27m

I know, I was trying to be funny, but hey- this community...

dathery
0 replies
10h17m

They're talking about transformative with regard to copyright law where it is an important part of determining fair use, not the dictionary definition you're using here.

I can't take NY Times articles, translate them into Spanish, and then sell the translations under fair use, even though clearly I've transformed the original article content.

intrasight
2 replies
5h47m

Yeah, no - that proposal is no good. The correct solution is to have machine learning be more like human intelligence. You can't ask me to plagiarize a New York Times article. Not because of prompt rule violation but because I just can't. It's not how humans train (at least most).

namlem
1 replies
4h45m

You can't, but there are some people who can quickly memorize entire pages of written text.

intrasight
0 replies
3h2m

That's why I qualified with "at least most"

hn_acker
2 replies
1h27m

This is a strong claim that just downloading articles into training data is what violates the copyright. That GPT outputs verbatim copies is a red herring.

It's the other way around. There is no infringement if the model output is not substantially similar to a work in the training set [1]:

To win a claim of copyright infringement in civil or criminal court, a plaintiff must show he or she owns a valid copyright, the defendant actually copied the work, and the level of copying amounts to misappropriation.

The questions are, which parties should bear liability when the model creates infringing outputs, and how should that liability be split among the parties? Given that getting an infringing output likely requires the prompt to reference an existing work (which is what's happening in the article), an author of a work, an element in an existing work, or a characteristic/style strongly associated with certain works/authors, I believe that the user who makes the prompt should bear most of the liability should the user choose to publish an infringing output in a way that doesn't fall under fair use. (AI companies should not be publishing model outputs by default.)

[1] https://en.wikipedia.org/wiki/Substantial_similarity#Substan...

dragonwriter
1 replies
1h21m

The level of copying here is the copying into the training set, not the copying through use of the model.

It's true that OpenAI will defend the wholesale copying into the training set by arguing that the transformative purpose of the next use reaches back and renders that copying fair use. But while that's clearly the dominant position of the AI industry, and it definitely seems compatible with the Constitutional purpose of fair use (while currently statutory, the statutory provision is a codification of Constitutional case law), it is a novel fair use argument.

hn_acker
0 replies
33m

The level of copying here is the copying into the training set, not the copying through use of the model.

NY Times is suing because of both the model outputs and the existence of the training set. But infringement in the training set doesn't necessarily mean that the model infringes. Why? Because of the substantial similarity requirement. But first, I'll address the training set.

For articles that a person obtains through legal methods (like buying subscriptions) but doesn't then republish, storing copies of those articles is analogous to recording a legally accessed television show (time-shifting), which generally is fair use. Currently, no court has ruled that "analogous to time-shifting" is good enough for the time-shifting precedent to apply, but I think the difference is not significant. The same applies to companies. Companies are not literally people, but there isn't a reason for the time-shifting precedent to not apply to companies.

What about the articles that OpenAI obtained through illegal methods? Then the very act of obtaining those articles would be illegal. The training set contains those copies, so NY Times can sue to make OpenAI delete those copies and pay damages. But it's not trivially obvious that a GPT model is a copy of any works or contains copied expression of any works in the training set; the weights that make up the model represent millions of works, so it's not trivially obvious that the model contains something substantially similar to the expression in a work in the training set. Therefore, it's not trivially obvious that infringement with respect to the training set amounts to infringement with respect to the model made from the training set.

As long as the model doesn't contain copied expression and the weights can't be reversed into something substantially similar to expression in the existing works, then what matters is the output of the model.

If a user gives a prompt which contains no reference to an existing artist, work, strongly associated characteristic/style, then do OpenAI's models produce outputs substantially similar to expression in the existing works? If not, then OpenAI shouldn't be liable for infringing works, because the infringing works result from the user's prompts. If my premise is false, then my conclusion falls apart. But if my premise is true, then at most I would admit that OpenAI has a limited burden to prevent users from giving those prompts.

furyofantares
2 replies
7h3m

Just learn to recognize and punish plagiarism via RLHF.

I'm not sure how your proposal would actually work. To recognize plagiarism during inference it needs to memorize harder.

Kinda funny if it works though. We'd first train them to copy their training data verbatim, then train them not to.

That is how it works, right? They're trained to copy their training data verbatim because that's the loss function. It's just that they're given so much data that we don't expect this to be possible for most of the training data given the parameter count.
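
A toy sketch of that loss (a bigram-sized model, purely illustrative): the objective is minimized by putting all probability on the exact next token of the training text.

    import torch
    import torch.nn.functional as F

    vocab, dim = 100, 16
    emb = torch.nn.Embedding(vocab, dim)
    head = torch.nn.Linear(dim, vocab)

    tokens = torch.tensor([5, 17, 42, 8])       # one training sequence
    logits = head(emb(tokens[:-1]))             # predict each next token
    loss = F.cross_entropy(logits, tokens[1:])  # zero only when the model
                                                # continues the text verbatim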

empiko
1 replies
6h19m

I wouldn't say it is unexpected behavior. I remember reading papers about this memorization behavior a few years ago (e.g., [1] is from 2019 and I believe it is not the first paper about this). OpenAI should be expected to know that LMs can exhibit memorizing behavior even after seeing a sample only once.

[1] https://bair.berkeley.edu/blog/2019/08/13/memorization/

furyofantares
0 replies
5h48m

My expectation is that it can't memorize most of its training data. I expect it to memorize some.

bertil
1 replies
1h31m

Guaranteeing no verbatim copying from a very large and relevant corpus will be hard without enormous databases of copyrighted content (which might not be legal to hold), and it adds an extra objective to a system with many often contradictory goals. I don't think that's the technically sound solution, or one in the interest of anyone involved. It's much more relevant to license content from as many newspapers as possible, recognize when references are relevant, and quote them either explicitly verbatim if that's the best answer or adapt (translate, simplify, add context) when appropriate.

I feel like the NYTimes is asking for deletion as a negotiation tactic to force OpenAI to give them enough money to pay for their journalism (I am not sure who would subscribe to NYTimes if you can get as much through OpenAI, but I am open to registering extra to pay for their work).

pants2
0 replies
43m

What if OpenAI were to first summarize or transform the content before training on it? Then the LLM has never actually seen copyrighted content and couldn't produce an exact copy.

kromem
0 replies
2h35m

This isn't an issue with training, it's an issue with usage.

Production open access LLMs do probably need a front-end filter with a fine tuned RAG model that identifies and prevents spitting out copyrighted material. I fully support this.

But we shouldn't be preventing the development of a technology that in 99.99% of use cases isn't doing that, and that can be used for everything from diagnosing medical issues to letting coma patients communicate with an EEG to improving self-driving car algorithms, because some random content producer's works were a drop in the ocean of content used to learn relationships between words and concepts.

The edge cases where a model is rarely capable of reproducing training data don't reflect infringement of training but of use. If a writer learns to write well from a source is that infringement? Or is it when they then write exactly what was in the source that it becomes infringement?

Additionally, now that we can use LLMs to read brain scans and have been moving towards biological computing, should we start to consider copying of material to the hippocampus a violation of the DMCA?

colechristensen
0 replies
10h17m

I think NYT is going to win.

LLMs are arguably compressed data archives with weird algorithms. The fact that they will regularly regurgitate verbatim quotes of training data is evidence of this, as are the guardrails that try to prevent this.

The second piece of evidence is this paper explained here https://www.hendrik-erz.de/post/why-gzip-just-beat-a-large-l... where instead of an LLM researchers used gzip compressed data as a model and it even beat trained LLMs.
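
Roughly, the trick in that paper: classify text by normalized compression distance (NCD) to labeled examples, with no trained model at all. A toy sketch with made-up data:

    import gzip

    def clen(s):
        return len(gzip.compress(s.encode()))

    def ncd(a, b):
        # Normalized compression distance: small when a and b share structure.
        ca, cb, cab = clen(a), clen(b), clen(a + " " + b)
        return (cab - min(ca, cb)) / max(ca, cb)

    train = [  # toy stand-ins for labeled training texts
        ("the senate passed the budget bill today", "politics"),
        ("the striker scored twice in the cup final", "sports"),
    ]

    def classify(text):
        return min(train, key=lambda ex: ncd(text, ex[0]))[1]

    print(classify("parliament votes on the new spending bill"))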

AI is a bit of a black box, but that doesn't protect the operators of black boxes from rights violation suits. You can't make a database of scraped copyrighted data and pretend that querying that data is fair use.

There needs to be law made here and the law just isn’t going to be “everybody can copy everything for free as long as it’s for model training”.

Licensing will have to be worked out; actual laws, not just case law, need to be written. I have a lot of sympathy for giving lots of leeway to the open source researchers and hackers doing things… but not so much for Microsoft and Microsoft-sponsored OpenAI.

amadeuspagel
0 replies
1h56m

The suit demonstrates instances where ChatGPT / Bing Copilot copy from the NYT verbatim. I think it is hard to argue that such copying constitutes "fair use". However, OAI/MS should be able to fix this within the current paradigm: Just learn to recognize and punish plagiarism via RLHF.

Isn't that in tension with the basic idea of an LLM, predicting the next token? How do you achieve that while never getting close to plagiarism?

fasterik
23 replies
12h12m

I've been arguing since ChatGPT came out that LLMs should fall under fair use as a "transformative work". I'm not a lawyer and this is just my non-expert opinion, but it will be interesting to see what the legal system has to say about this.

mynegation
15 replies
12h0m

Suit claims that GPT reproduced passages from NYT almost verbatim.

rvz
9 replies
11h54m

Precisely.

These tired 'fair use' excuses from AI bros don't hold up: GPT has reproduced the article text verbatim, word for word, and it is being monetized without permission from the copyright holder and source (NYT). That is an obvious copyright violation 101. Full stop.

Again, just like Getty v. Stability, this copyright lawsuit will end in a licensing deal. Apple played it smart by pursuing licensing deals to train their GPT [0]. But this time, OpenAI knew they could get a license to train on NYT articles but chose not to.

[0] https://9to5mac.com/2023/12/22/apple-wants-to-train-its-ai-w...

chatmasta
7 replies
11h42m

AI bros

What (or whom) do you consider to be an "AI bro?"

This sort of ad hominem generalization usually accompanies a weak argument.

beau_g
3 replies
11h6m

Young males that wear Tensorflow branded muscle tank tops and drive Mitsubishi Eclipse convertibles with the vanity plate OVERFIT. They are everywhere these days.

danielbln
1 replies
8h45m

vidarh
0 replies
8h13m

The text generation is getting quite decent. The limbs disappearing into the car are somewhat less impressive.

jakderrida
0 replies
10h46m

Thank you for the absurd visual. The vanity plate, especially, was worth saving for last. Somehow, the car is well suited, also. Love how they prefer Tensorflow over Pytorch, too.

vidarh
0 replies
11h1m

I generally tend to downvote comments that use "x bros" for pretty much any x on sight for that reason. It's exceedingly rare for such a comment to be much more than a thinly veiled insult with little substance. Sometimes I might even agree with the insult, but it's still rarely appropriate here.

satvikpendem
0 replies
11h39m

It seems to be used by people who've previously used the term "tech bro."

irq
0 replies
11h37m

Not saying I agree with this labeling, but it means approximately the same thing as “crypto bro”, but for AI

throwup238
0 replies
11h47m

The four factors considered in a fair use test:

    the purpose and character of the use
    the nature of the copyrighted work
    the amount and substantiality of the portion taken
    the effect of the use upon the potential market.

Literally every single one of these factors has very complicated precedent and each one is an open question when it comes to AI. Since fair use is a balancing test this could go any way.

Stability took the easy way out because they didn't have billions of dollars to play around with and Microsoft to back them. Let's see what OpenAI does but calling everyone who disagrees with your naive interpretation of fair use "AI bros" is doing everyone a disservice.

lodovic
2 replies
10h35m

I'm sure the NYT uses dictionaries, encyclopaedias and style books verbatim as well. And they don't invent the facts they write about. As journalists they are compiling and passing along other knowledge. You usually don't get a piece of their income when a journalist quotes you verbatim (people usually don't get paid for interviews).

threeseed
0 replies
6h59m

NYT doesn't reproduce the contents of the dictionary or encyclopaedia.

And even if they did, it would be fine because those sources allow for it.

The point is that OpenAI never asked NYT for permission to use their data.

madeofpalk
0 replies
8h35m

If the NYT reproduces other content verbatim too much, it will get in trouble.

dahart
1 replies
11h38m

I don’t doubt it does. It’s easy to get it to spit out long answers from Stack Overflow verbatim, I’ve done it. Maybe some of the “transformative” nature of the LLM output is the removal of any authorship, copyright, license, and edit history information. ;) The point here is to supplant Google as the portal of information, right? It doesn’t have new information, but it’s pretty good at remixing the words from multiple sources, when it has multiple sources. One possible reason for their legal woes wrt copyright is that it’s also great at memorizing things that only have one source. My college Markov-chain text predictor would do the same thing and easily get stuck in local regions if it couldn’t match something else.
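
For reference, a toy word-level Markov predictor like that one: wherever a context was only ever followed by one word in the training text, generation is deterministic and reproduces the source verbatim.

    import random
    from collections import defaultdict

    def train_chain(text, order=2):
        words = text.split()
        model = defaultdict(list)
        for i in range(len(words) - order):
            model[tuple(words[i:i + order])].append(words[i + order])
        return model

    def generate(model, seed, length=20):
        out = list(seed)
        for _ in range(length):
            choices = model.get(tuple(out[-len(seed):]))
            if not choices:
                break
            # Contexts with a single observed continuation regurgitate
            # the training text exactly - the "local region" effect.
            out.append(random.choice(choices))
        return " ".join(out)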

__loam
0 replies
10h36m

I don't think these can replace search engines.

ramesh31
3 replies
11h48m

It's inevitable that this question ends up at the supreme court. And the sooner the better IMO. It's clearly fair use. Generative agents will be seen legally as no different than a human artist leveraging the summation of their influences to create a new work.

agentgumshoe
1 replies
10h45m

Clearly fair use? What if I pay ChatGPT to give me the NYT article it sourced verbatim as stored (i.e. without referring me to the NYT source)?

MacsHeadroom
0 replies
6h46m

It's not stored in ChatGPT actually, unlike Google's web search cache where it is stored verbatim, can be recalled perfectly, and is still fair use.

Fair use has nothing to do with reproducibility. LLMs are more clearly fair use than a search engine cache and those court cases are long settled. There's no world in which OpenAI doesn't win this entire thing.

simion314
0 replies
8h59m

It's inevitable that this question ends up at the supreme court. And the sooner the better IMO. It's clearly fair use. Generative agents will be seen legally as no different than a human artist leveraging the summation of their influences to create a new work.

Why do you think the architecture is important? If I have a computer program and it outputs an entire copyrighted poem, then the answer to "is this a copyright violation" SHOULD NOT depend on the architecture of the program.

agentgumshoe
1 replies
10h48m

What if I ask ChatGPT to print the article verbatim as sourced, from its own dataset?

cjbprime
0 replies
9h34m

It doesn't have database access to its own training dataset; it only has access to the weights it lossily-compressed that training dataset into.

dahart
0 replies
11h16m

This seems like a reasonable opinion when you think about the training data size and imagine that any given output is some kind of interpolation of some unknown large number of training examples all from different people. If it's borrowing snippets from tens or hundreds or thousands of sources, then whose copyrights are being violated? Remixing in music seems to be withstanding some amount of legal scrutiny, as long as the remix is borrowing from multiple sources and the music is clearly different and original.

It gets harder to stand behind a blanket claim that LLMs or any AI we’ve got falls under fair use when they keep repeatedly reproducing complete and identifiable individual works and clearly violating copyright laws in specific instances. The models might be remixing and/or transformative most of the time, but we have proof that they don’t do that every time nor all the time… yet. Maybe the lawsuits will be the impetus we need to fix the AIs so they don’t reproduce specific works, and thus make the fair use claim solid and actually defensible?

a_wild_dandan
20 replies
5h54m

The NYT is preparing for a tsunami by building a sandcastle. Big picture, this suit won’t matter, for so many reasons. To enumerate a few:

1. Next gen LLMs will be trained exclusively on “synthetic”/public data. GPT-4V can easily whitewash its entire copyrighted training corpus to be unrecognizably distinct (say reworded by 40%, authors/sources stripped, etc). Ergo there will be no copyrighted material for GPT-5 to regurgitate.

2. Research/hosting/progress will proceed. The US cannot stop this, only choose to be left behind. The world will move on, with China gleefully watching as their biggest rival commits intellectual suicide all to appease rent seeking media companies.

3. Models can share weights, merge together, cooperate, ablate, evolve over many generations (releases), etc. Copyright law is woefully ill equipped to handle chasing down violators in this AI lineage soup, annealed with data of dubious/unknown provenance.

I could go on, but the point is that, for better or worse, we live in a new intellectual era. The NYT et al are coming along for the ride, whether they like it or not.

mat0
10 replies
4h58m

I'm sorry but this is such a bad take. Nice appeal to consequences. In my view, the New York Times is entirely justified in pursuing legal action. They invested time and effort in creating content, only to have it used without permission for monetary gain. A clear violation.

Analyzing the factors involved for a "fair use" consideration:

Purpose and Character of the Use: While the argument for transformation might hold in the future as you point out, the current dispute revolves around verbatim use. So clearly not transformative. Also commercial use is more difficult to be ruled fair use.

Nature of the Copyrighted Work: Using works that are more factual may be more likely to be considered fair use, but I would argue that NYT articles are as creative as factual.

Amount and Substantiality of the Portion Used: In this case, the entirety of the articles was used, leaving no room for a claim of using an insignificant portion.

Effect on the Market Value: NYT isn't getting any money from this, and it's clearly not helping their market value if people are checking on ChatGPT instead of reading a NYT article.

IANAL, but in my opinion NYT is well within its rights to pursue legal action. Progress is inevitable, but as humans, we must actively shape and guide it. Otherwise it cannot be called progress. In this context, legal action serves as a necessary means for individuals and organizations to assert their rights and influence its course.

w4ffl35
6 replies
4h54m

Imo gpt itself is the transformative work.

tantalor
5 replies
4h33m

Ok but it's not

UrineSqueegee
4 replies
4h20m

Definition of Transformative Use: The legal concept of transformative use involves significantly altering the original work to create new expressions, meanings, or messages. AI models like GPT don't merely reproduce text; they analyze, interpret, and recombine information to generate unique responses. This process can be argued as creating new meaning or purpose, different from the original works.

In the case of the famous screenshot, the AI just relayed the information it found on the web; it's not included in its training data.

So you're just wrong.

bonzini
2 replies
3h58m

Nope, it doesn't work that way. The fact that the LLM can regurgitate original articles doesn't remove the possibility that training can be considered transformative work, or more in general that using copyrighted material for training can be considered fair use.

Rather, verbatim reproduction is the proof that copyrighted material was used. Then the court has to evaluate whether it was fair use. Without verbatim reproduction, the court might just say that there is not enough proof that the Times's work was important for the training, and dismiss the lawsuit right away.

Instead, the jury or court now will almost certainly have to evaluate OpenAI's operation against the four factors.

In fact, I agree with the parent that ingesting text and creating a representation that can critique historical facts using material that came from the Times is transformative. An LLM is not just a set of compressed texts, people have shown for example that some neurons fire when you are talking of specific historical periods or locations on Earth.

However, I don't think the transformative character is enough to override the other factors, and therefore in the end it won't/shouldn't be considered fair use IMHO.

w4ffl35
1 replies
53m

What if the LLM is running locally and doing all of these things rather than hosted on a webserver which is serving the content?

bonzini
0 replies
39m

It doesn't matter, if everything else stays the same what matters is what it's used for. If it's used to make money, it would certainly hurt claims of fair use—maybe not for those that do the training, but for those that use it.

tantalor
0 replies
4h11m

Only humans can do those things, so the test fails for LLM

NotMichaelBay
1 replies
4h0m

it's clearly not helping their market value if people are checking on ChatGPT instead of reading a NYT article.

People are not using ChatGPT as a replacement for current news, and because of hallucinations, no one should be using it for past news either. I wouldn't remotely call ChatGPT a competitor of NYT traffic, like I would Reuters or other news outlets.

jprete
0 replies
1h17m

The intended result is clearly to supplant other information sources in favor of people getting their information from ChatGPT. Why should it matter to legality that the tech isn't good enough for the goal?

tbcj
0 replies
4h31m

I don’t think the original point being made was that NYT wasn’t justified in bringing the action. The point that was being made was the suit would be ultimately meaningless in the long term even if it was successful in the short term. There is a potentially more significant risk in the future that this suit will not protect against because of the reasons enumerated by the author. While the author is speculating, the law struggles with technology and adapting to change, which makes their prediction useful because it does highlight the problems that are coming that can’t be readily mitigated through legal precedent.

ciabattabread
3 replies
4h53m

rent seeking media companies

Rent seeking? Media companies that actually create content are rent seeking? Versus the garbage hallucinations AI creates?

stuckinhell
1 replies
3h16m

The New York Times is a dying company that is rent seeking here. A long time ago, their content was valuable, yet now you can't even give it away to researchers.

I know because they tried to make a deal with my company, we passed because social media data is infinitely more valuable.

ciabattabread
0 replies
2h54m

Because its usefulness to your private jet fund is the only measurement of value.

amadeuspagel
0 replies
1h42m

Rent seeking is an awful term that was from the beginning intended to describe anyone pursuing a political or legal goal that deviates from a pure free market economy. As Econlib writes:

”Rent seeking” is one of the most important insights in the last fifty years of economics and, unfortunately, one of the most inappropriately labeled. Gordon Tullock originated the idea in 1967, and Anne Krueger introduced the label in 1974. The idea is simple but powerful. People are said to seek rents when they try to obtain benefits for themselves through the political arena. They typically do so by getting a subsidy for a good they produce or for being in a particular class of people, by getting a tariff on a good they produce, or by getting a special regulation that hampers their competitors. Elderly people, for example, often seek higher Social Security payments; steel producers often seek restrictions on imports of steel; and licensed electricians and doctors often lobby to keep regulations in place that restrict competition from unlicensed electricians or doctors.

https://www.econlib.org/library/Enc/RentSeeking.html

This is linked in the wikipedia article, which is even more confused:

https://en.wikipedia.org/wiki/Rent-seeking

truculent
0 replies
4h16m

Are media really rent-seeking? They create new content and analysis, for which they want to be compensated. It seems quite different to hoarding natural resources or land, for example.

nozzlegear
0 replies
21m

2. Research/hosting/progress will proceed. The US cannot stop this, only choose to be left behind. The world will move on, with China gleefully watching as their biggest rival commits intellectual suicide all to appease rent seeking media companies.

Sorry, is this the same China that has already introduced their own sweeping regulations on AI? Which in at least one instance forced a Chinese startup to shut down their newly launched chatbot because it said things that didn't align with the party's official stance on the war in Ukraine?

https://finance.yahoo.com/news/beijing-tries-regulate-china-...

https://nitter.unixfox.eu/CDT/status/1625936306814717952?337...

I don't disagree that research/hosting/progress will continue, but I'm not so sure that it's China who stands to benefit from the US adding some guardrails to this rollercoaster.

notahacker
0 replies
4h20m

If Microsoft doesn't get royalty free rights to resell access to everyone's content on demand, China will become the powerhouse of interference-free media? Rrrrrright....

maxlin
0 replies
51m

This is the actual truth. Where it does hurt is citing the data, but GPT-4 doesn't do that to start with unless the text comes directly from a web result rather than the weights.

bonzini
0 replies
4h8m

GPT-4V can easily whitewash its entire copyrighted training corpus to be unrecognizably distinct

Is that just by increasing the temperature, tweaking the prompt, etc.? If you can operate on the raw weights and recreate the original text, copyright infringement still applies.

aurareturn
19 replies
12h13m

Companies that have content all see dollar signs.

NYT won't mind if you use their content to train LLMs - as long as they get a commission. Reddit will shut down their free API and make you pay to get training content. Discord is going to be selling content for AI training too - if they haven't already done so. Twitter is doing it.

They didn't care before because LLMs were just experiments. Now we're talking trillions of dollars of value.

up2isomorphism
11 replies
11h59m

"They" also include the people working there. Why someone work with full time writing articles should give the work for free just let someone to train it and make money out of it as a consequence?

tucnak
5 replies
11h53m

Why should someone who works full time writing articles give that work away for free

They are not giving it out "for free"; in fact, they're being paid by their employer to write these articles. Moreover, the writers themselves stand to gain nothing financially from their past writings, as they don't belong to the ownership structure of the business.

MisterBastahrd
3 replies
10h52m

Their ability to make money in the future is directly tied to their employers' ability to make money with their content. This is a closed financial loop. If OpenAI or any other AI company wants in, they should pay a licensing fee or get the laws changed, not just assume that they can take what they want and pretend like there are no negative consequences for the creator or the rights-holder.

tucnak
0 replies
4h58m

This is a closed financial loop.

This is a badly formulated conjecture, or worse, ultimately a selective reading of "social credit" whose only purpose is serving your argument; it has nothing to do with economics. I'm sorry, but I'm not convinced.

malwrar
0 replies
9h55m

In this limited example, are there such consequences? Are people dropping NYT subscriptions because they trust chatgpt to inform them of current events? I don’t buy it.

TeMPOraL
0 replies
9h52m

No one is pretending there are "no negative consequences for the creator or the rights-holder". Of course there are. But this is a story of rights-holders, who've already outgrown their usefulness, wanting to tap into a money stream they are not entitled to.

ChatGPT isn't competing with NYT on a core competency. No one uses LLMs for original news reporting. They're obviously incapable of doing that, by virtue of not being there on the scene or able to independently research a topic, maintain relationships with sources, etc. What ChatGPT can do is quote/reproduce some parts of past articles, and reason from them. Or at least produce new text that's somewhat related to the old text.

The threat to NYT is this: ChatGPT is a much better bullshitter than they are, so it reduces NYT to its core competency: providing original information. Which is all it should be doing in the first place. But instead, NYT wants to not only keep the bullshitting part of its revenue, but also take a cut of, or destroy, the much greater and much more useful enterprise where all this feeds a general-purpose language model.

bloppe
0 replies
11h35m

the writers themselves stand to gain nothing financially from their past writings, as they don't belong to the ownership structure of the business.

This is a dumb argument. We're not just talking about ancient articles. We're talking about new content, including content that is yet to be written.

amadvance
2 replies
11h11m

Why should someone who works full time writing articles give that work away for free

Open source developers did that ;)

KETHERCORTEX
1 replies
11h2m

When open source developers do that, they also include explicit licensing information that lists the cases in which usage is allowed or restricted. So even if the code is open source and licensed under the GPL, its usage in a closed source product like ChatGPT is not allowed.

fsckboy
0 replies
10h5m

GPL code usage in closed source ChatGPT is allowed "for internal use"; it just would not be allowed to distribute closed source binaries of ChatGPT without making the source available. (Offering online access to a program without making the source available is a violation under the AGPL3 specifically, not plain GPL3.)

johngladtj
0 replies
7h33m

You understand that news isn't copyrightable, right?

You're fighting a strawman that doesn't exist...

ReptileMan
0 replies
8h3m

With the ways NYTimes has degraded since 2010 even if people there are working for free, they're still being overpaid. The only adequate section there is the food.

mvdtnz
3 replies
9h58m

NYT do not "have" content, they create content. It's their raison d'etre.

aurareturn
2 replies
8h13m

They have content that LLMs want to use in training - millions of historical articles.

esperent
1 replies
7h47m

They created that content. It's an important distinction to make as compared to Reddit or Facebook where the users created the content.

midasuni
0 replies
5h48m

The journalists created the content for the NYT; the users created it for Facebook. Both received something in return for their effort, and the content ended up being owned by NYT/Facebook.

MuffinFlavored
2 replies
12h8m

They didn't care before because LLMs were just experiments. Now we're talking trillions of dollars of value.

Can you make the argument this was their fault for not having forward vision/being asleep at the wheel and "accidentally, in hindsight" letting OpenAI/others have free, open, unlimited access to their content?

bloppe
0 replies
11h38m

Basically none of the training material for GPT was used under an "unlimited" license. There are very important legal limitations. GPT just doesn't care much about them.

aurareturn
0 replies
11h29m

No, I can't. It's just an observation with no personal opinion.

altals2023
19 replies
11h33m

Won't hold up in court. GPT is a platform mainly providing answers to private individuals who ask. It's like asking a professor a question and having him answer, verbatim and word for word, from available copyrighted materials (due to photographic memory). Now if you take this answer and write a book, or publish it en masse on blogs for example, then you are the one who should be sued by NYT. If GPT uses the exact same wording and publishes it to everyone visiting their page, then that is on OpenAI.

Vegenoid
9 replies
11h15m

If said professor offered a service where anyone could ask them for information that is behind a paywall, and they provided it without significant transformation, this would certainly be copyright infringement that the copyright holder would have every right and motivation to take action against.

unsupp0rted
3 replies
10h56m

Would parroting back article content perfectly from memory certainly be copyright infringement?

verve_rat
2 replies
10h49m

Go perform a song in a public place without a licensing arrangement and let us know.

unsupp0rted
0 replies
9h20m

My favorite example of performing a song in a public place without a licensing arrangement:

https://youtu.be/j_UoACEUZqA

infinityio
0 replies
9h44m

Scale is important here - maybe a better analogy is setting up a paid Spotify clone with all the music sourced from torrents, with some slight distortion effect added.

elashri
3 replies
11h4m

I think only the scale matters here (probably), because I'd find it hard to believe that a teacher/professor would not be allowed to set up a service where they teach and provide their knowledge to others. That is basically the concept of teaching. Of course, until LLMs, we never had this scale before: millions of potential learners vs. the normal hundreds in a classroom session. That is what makes the new case interesting.

toyg
2 replies
10h49m

"Teaching" by copying source books word for word, would be copyright infringement; see, for example, the well-known issues around photocopying books or even excerpts.

Also lying on source materials (e.g. telling students that some respected historian denies the Holocaust happened, when it's obviously not the case) is not "teaching" - it's defamation, and the NYT is absolutely right to pursue that angle too.

Using LLMs as general-purpose search engines is a minefield; I would not be surprised if the practice disappeared in the next 20 years. Obviously the tech is here to stay, and there is no problem when it's applied to augmenting niche work; but as a Google replacement, it has so many issues.

shkkmo
1 replies
10h23m

Teaching" by copying source books word for word, would be copyright infringement; see, for example, the well-known issues around photocopying books or even excerpts.

Incorrect. Educational use helps satisfy one of the tests for fair use. Teachers can, in many cases, photocopy copyrighted work without infringing on that copyright.

heavyset_go
0 replies
9h47m

Educational use is just one of the many factors used to determine whether an instance of copyright infringement is fair use or not, but it is not carte blanche for educators to ignore IP laws just because they're educating.

gedy
0 replies
11h4m

Professors are largely behind a paywall

__loam
6 replies
10h42m

I hope people start calling the "well it's fine if a human does it" arguments out for the rat-fuck thinking they are. These are computational systems operating at very large scales, run by some of the wealthiest companies in the world.

If I go fishing, the regulations I have to comply with are very light because the effect I have on the environment is minimal. The regulations for an industrial fishing barge are rightfully very different, even if the end result is the same fish on your plate.

Garrrrrr
3 replies
10h31m

Unfortunately that's not the crowd of people here. 80% of the comments under this thread (right now, 2:52 EST) are making similar arguments and *continue* to act like LLMs are doing something unique/creative... instead of just generating sentences, from algorithms, from virtually pirated content in the form of data mining.

kriro9jdjfif
1 replies
10h1m

“It is difficult to get a man to understand something, when his salary depends on his not understanding it.”

https://www.goodreads.com/quotes/21810-it-is-difficult-to-ge...

__loam
0 replies
5h44m

Gotta get that tender offer money somehow.

c1b
0 replies
9h59m

“As if LLMs are doing something creative and aren’t just algorithms”

You have no idea what you’re talking about huh?

visarga
1 replies
9h40m

GPT is like a fleet of small fishing boats, each user driving their boat in another direction, not a fishing barge. For every token written by the model there must be a human who prompted, and then consumed it. It is manual, and personal, and deliberate.

In fact all the demonstrations in the lawsuit PDF were intentionally angling to reproduce copyrighted content. They had to push the model to do it. That won't happen unless users deliberately ask for it. It won't happen en masse.

__loam
0 replies
5h45m

Gpt is operated by one company. If a million people eat your fish, you're still a barge.

Boo hoo they had to push it. That was never the problem with these bullshit nozzles. The issue is they put that stuff in the training set in the first place. If you can't be honest about that then I have no interest in debating this with you.

thinkingemote
0 replies
9h43m

The professor, having been trained in academia, would state the sources of the verbatim quotes. In writing papers he would use references and explicit quotes. There's nothing hidden going on with the professor.

heavyset_go
0 replies
9h51m

Professors and schools get into legal problems when professors pirate and/or otherwise distribute content they don't have licenses for.

habosa
17 replies
4h58m

People who think the examples in the lawsuit are “fair use” need to consider what that would mean. We’re basically going to let a few companies consolidate all the value on the Internet into their black boxes with basically no rules … that seems very dangerous to me.

I hope a court establishes some rules of engagement here, even if it’s not this case.

w4ffl35
9 replies
4h56m

Scraping is legal, and this seems like a transformative work to me.

aqme28
8 replies
4h44m

Returning the full text of an article verbatim seems to me like the opposite of "transformative."

Symmetry
5 replies
4h31m

In the screenshot for the article you can see that the LLM says it is "Searching for: carl zimmer article on the oldest DNA". That, and what I know about how LLMs work, suggest to me that rather than the article being stored inside the trained LLM it was instead downloaded in response to the question. So the fact that the system is providing the full text of the article doesn't really go to whether training the LLM is a transformative use or not.

bonzini
4 replies
4h11m

Yes, the screenshot in the article is clearly doing an Internet search. The exhibit in the lawsuit shows that you can complete an article by prompting GPT with its first sentence, using a low temperature to aid reproducibility, and obtain the original except for a single word. That is another thing entirely, and it shows that the LLM has basically recorded the original text into its weights in compressed form: https://pbs.twimg.com/media/GCY4WC6XYAAq-JS?format=jpg&name=...
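
For the curious, here is a minimal sketch of that kind of probe using the OpenAI Python client; the model name and prompt are placeholders, not the exhibit's exact setup:

    # Minimal sketch, assuming the openai Python package (>= 1.0).
    # temperature=0 gives near-deterministic decoding, which makes
    # memorized continuations easier to reproduce.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{"role": "user",
                   "content": "Continue this article: <first sentence here>"}],
    )
    print(resp.choices[0].message.content)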

mattdesl
3 replies
3h35m

It would be interesting to test this on a larger sample than just a few. It is hard to believe that a majority of NYT articles are stored verbatim in the weights of a web-wide LLM; if that were the case, it would be a pretty unbelievable revelation about these models' ability to compress an entire web's worth of data. More likely, I assume it is a case of overfitting, or simply of finding a prompt that happened to work well.

FWIW, I can’t replicate on either GPT 3.5 or 4, but it may be that OpenAI has added new measures to prevent this.

bonzini
1 replies
2h28m

You can't reproduce it on the web interface, because the temperature setting there is higher than what's required to recover the memorized text. You need to use the API.

However, I had good luck reproducing poems on GPT 3.5, both copyrighted and not copyrighted, because the choice of words is a lot more "specific" so to speak, and therefore higher temperature isn't enough to prevent complete reproduction of the originals. See https://chat.openai.com/share/f6dbfb78-7c55-4d89-a92e-f4da23... (Italian; the second example is entirely hallucinated even though a poem with that title exists, while the first and third are recalled perfectly).

mattdesl
0 replies
28m

It doesn’t seem that surprising; compared to entire NYT articles, poems are short, structured and more likely to be shared in multiple places across the web.

I’m more surprised that it can repeat 100 articles; if that behaviour is consistent at larger sample sizes and beyond just the NYT dataset (which might be repeated on the web more than other sources, causing overfitting), that would be impressive.

You could imagine at some point a large enough GPT5 or 6 or 7 will be able to memorize verbatim every corner of the web.

dwringer
0 replies
2h32m

I have attempted this sort of thing with GPT 3.5 many times and never been successful, although I've still never been taken off of the GPT4 waiting list that I signed up for months ago and I'm not going to subscribe without trying it first. I [and presumably many thousands of others] have tried things like this with many LLMs and image generating models, but to my knowledge we've come up rather short. I've never managed to recreate anything verbatim and have struggled to get anything resembling a copyright infringement out of stable diffusion with the sole exception of a meme image of Willy Wonka.

That said, the meme image of Willy Wonka comes out of stable diffusion 1.5 almost perfectly with surprising frequency. Then again, this is probably because it appeared hundreds or thousands of times in the training set in all sorts of contexts because it's such a popular meme. There is a tension between its status as an integral part of language and its nature as a copyrighted screen grab.

tantalor
1 replies
4h29m

That's not what "transformative" means for copyright.

It's more like, is the new work a distinct expression, e.g. satire or commentary, based on the original.

You can reproduce the original verbatim and still be transformative by adding an element of critique.

Example: https://www.dmca.com/articles/akilah-obviously-vs-sargon-of-...

alphaoverlord
0 replies
4h22m

I don’t think the examples shown reflect an element of critique.

stainablesteel
2 replies
4h40m

A court has established this already:

in Japan, where they said anything goes for AI.

So it's best not to lose a competitive edge over things that people openly publish on the internet; if you put it out there for everyone to see, then expect other people to use it.

VWWHFSfQ
1 replies
3h52m

A court in Japan will have no impact on the outcome of a copyright lawsuit in USA. Not to mention that it doesn't really matter how a Japanese court ruled since it's all governed by treaties anyway. They will change their laws if required to.

stainablesteel
0 replies
1h43m

It's not about applying laws across different countries.

It's about a precedent. If you don't keep up with international competition, you lose.

serjester
2 replies
1h57m

I see the exact opposite - any open source model is going to become prohibitively expensive to train if quality data costs billions of dollars. We're going to be left with the OpenAIs and Googles of the world as the only players in the space until someone solves synthetic data.

xbar
0 replies
32m

This feels like a 1996 "music is too expensive for kids so they HAVE to pirate it."

wraptile
0 replies
36m

Exactly this. I work at a small web scraping company (so I might be a bit biased), and today any small business can collect a fair, capable dataset of public data for model training, sentiment analysis or whatever. If public data is stopped by copyright as this lawsuit implies, that would just mean only giant corporations and pirates would be able to afford this.

This would be a huge blow to open-source and research developers, and I'd even argue it could help OpenAI gain a bit of a moat à la regulatory capture.

SmoothBrain123
0 replies
4h43m

Are you talking about search engines, or something else?

tarruda
16 replies
10h52m

Would be funny if the NY Times won this and all commercial LLMs were shut down.

Then LLMs would be distributed only via torrents, like most copyright infringing media.

__loam
11 replies
10h40m

Making these things anathema to commercial interests and making training them at scale legally perilous would be a huge win.

mdekkers
3 replies
10h23m

making training them at scale legally perilous would be a huge win.

Why?

fsckboy
0 replies
10h13m

I have no idea what he's thinking, but if everybody in the community here had an LLM in their pocket and large orgs did not, it would at least be kind of fun.

anonymousab
0 replies
2h57m

Because the megacorps should have to pay the people creating the works they are training their multibillion/eventual multitrillion dollar systems on, and should get a nice rake to the face when they try to do an end run around it.

__loam
0 replies
5h48m

The open source people can continue to pretend they matter in this field and large corporations like Microsoft will stop stealing everything that moves on the internet.

StableAlkyne
3 replies
10h15m

A huge win for countries with lax copyright laws. These things aren't going away, the worst case scenario would be exactly that scenario playing out - then China (or some other peer to the US's tech sector) just continues developing them to achieve an economic advantage. All in addition to the obvious political implications of AI chatbots being controlled by them.

The LLM genie is out of the bottle: an unfavorable court ruling in a single country isn't going to stuff it back in.

geysersam
1 replies
9h26m

Do LLMs really give an economic advantage though? I've mostly seen them used to write quirky poems and bad code. People are scrambling to find use-cases but it's not very convincing so far.

On the other hand, if LLMs are used to "launder" copyrighted content and, accepting the premises of copyright law, this has the effect of reducing incentives to do creative work, that has obvious negative implications for economic productivity.

StableAlkyne
0 replies
9h18m

I've mostly seen them used to write quirky poems and bad code.

Assuming this is in good faith: the ability to write code, documentation, and tests is absolutely a productivity enhancer to an existing programmer. The code snippets from a dedicated tool like copilot are of very usable quality if you're using a popular language like Python or JS.

__loam
0 replies
5h52m

I don't give a shit about what China does.

Dalewyn
2 replies
9h56m

making training them at scale legally perilous

Loading data you have no rights over into your software is legally perilous, yes.

It's as easy as simply asking for and receiving permission from the data's rightsholders (which might require exchange of coin) to make it not legally perilous.

__loam
1 replies
5h48m

Sounds expensive.

Dalewyn
0 replies
5h8m

If you want to do things with other people's stuff, yes it can get expensive.

sgt101
1 replies
8h57m

What will happen in this case is that large content providers will get paid directly, and smaller content providers will get rolled up into a licensing bag and get small indirect payouts. For example, we might see a model where people whose books have been used get a payout proportionate to the sales of the book (perhaps), so if your book sells just a few thousand copies expect $20, but if you sell millions expect $20k.

LLMs will become more expensive and less attractive as money printers. This will screw with the business models of the direct-provision folks like OpenAI, MS and Google; MS and Google will only shed tears for the money spent, while OpenAI will just not have as good an income stream until they think of something new.

davedx
0 replies
6h0m

large content providers will get paid directly

I'm sure that's what they want, but I'm not sure that's what the outcome will be. What if they want to charge a prohibitive amount of money for their content?

realusername
1 replies
10h39m

They would still thrive but in other countries with other legal frameworks. The concept is way too valuable to disappear.

kjkjadksj
0 replies
1h50m

If it's economically relevant, the US will use its iron fist to have its laws adopted the world over, as with most things such as copyright or drugs.

logicchains
12 replies
10h3m

NYT's perspective is going to look so stupid in future when we put LLMs into mechanical bodies with the ability to interact with the physical world, and to learn/update their weights live. It would make it completely illegal for such a robot to read/watch/listen to any copyrighted material; no watching TV, no reading library books, no browsing the internet, because in doing so it could memorise some copyrighted content.

type_Ben_struct
5 replies
9h55m

I disagree. The verbatim part is the problem. You're drawing a comparison to how humans operate, except we're not allowed to operate like that.

While harder to do as a human, if I memorised a copyrighted book and then did a live reading on TV, or produced replicas from memory and sold them (the most comparable example), I'd be sued.

Humans produce derivative work all the time, and it's fine for LLMs to do that too, but you can't do it verbatim.

logicchains
3 replies
9h5m

or produced replicas from memory and sold them (the most comparable example), I’d be sued.

This is not the most comparable example, because it's not what ChatGPT is doing. The most comparable example is if you were hired as a contractor and the employer asked you to write verbatim some copyright content you'd memorised. If the employer then published it, they'd be the one liable, not you.

Humans produce derivative work all the time, and it’s fine for LLM’s to do that, but you can’t do it verbatim.

Nobody's suggesting preventing humans from consuming any copyrighted content just because in future they might recite some of it verbatim, but that's what NYT want for LLMs.

tsimionescu
2 replies
8h50m

The most comparable example is if you were hired as a contractor and the employer asked you to write verbatim some copyright content you'd memorised. If the employer then published it, they'd be the one liable, not you.

No, you'd both be liable. You are not allowed to create copies of a copyrighted work, even from memory, for any commercial purpose. Making it public or not is irrelevant.

This is more obvious with software: if I copy a version of AutoCAD that my previous employer bought and sell it to another company, or even just use it for my current employer without showing it to anyone else, I am violating the copyright on that software, and I am liable. Even though obviously no "publishing" happened.

Similarly, if you hire a decorator to paint Mickey Mouse on the inside walls of your private kindergarten, the decorator is violating Disney's copyright just as much as you are, even if neither of you has made that public.

YuccaGloriosa
1 replies
8h8m

Your previous employer never bought AutoCAD; they licensed its use, paying a subscription. When you stopped working for them, that licence was no longer available to you, so you would be unable to subsequently use it.

tsimionescu
0 replies
8h1m

Unable legally, yes, but I may find illegal ways. And the reason it is illegal to copy is, in the end, copyright. The license is only (legally) required because of copyright.

kromem
0 replies
2h21m

Then we should be focused on policing the usage of the model, not the training of it.

That's the point at which infringement occurs in your example. It's not the memorizing that's the infringement, it's the reproduction from your memory.

We shouldn't be regulating your hippocampus encoding the book, but your reproducing the book from that encoding.

Similarly, we shouldn't be regulating the encoding of material into the NN, but the NN spitting back out the material.

ramraj07
1 replies
9h59m

Will it? If the LLM in the body is allowed to read nytimes on a tablet I'm sure they wouldn't care.

logicchains
0 replies
9h3m

If the LLM in the body is allowed to read nytimes on a tablet I'm sure they wouldn't care.

Why should the law treat a LLM in a body reading NYT on a tablet differently than a LLM browsing the content from a website online and reading that?

makeitdouble
1 replies
6h0m

Memorising isn't the issue. It's providing it back verbatim and/or cutting access to the source.

You'd get the same problem with someone with a photographic memory that a group of people turn to for recitations of the news instead of buying the newspaper.

As of now public performance of copyrighted material is infringement.

kromem
0 replies
2h27m

That's not the case, as they aren't trying to get a ruling on the forced reproduction by prompt as infringement, but rather to get a ruling that training is infringement.

I fully agree with the perspective that infringement in usage needs to be limited even if I strongly disagree that training is infringement.

CJefferson
1 replies
7h12m

Are those LLMs independent citizens we are going to give rights to? Then I'm fine with that.

Are they all owned by one mega-corporation, which is going to do as capitalism does, and use them to squeeze money out of all of us? Then I'm happy to ban them.

kromem
0 replies
2h24m

"Let's ban something capable of diagnosing medical conditions and letting coma patients to communicate with an EEG because it learned the relationships between words from a giant data set of scraped data and is owned by a company" is a pretty callous take IMO.

The opportunity cost of holding this technology back is going to literally be millions of people's lives given current trends in its emerging applications.

Police usage, not training.

cycrutchfield
12 replies
12h14m

I read a NYT article and publish a summary of facts that I learned: totally legit.

Train a model on NYT text that outputs a summary of facts that it learned: OMG literally murder.

bloppe
7 replies
11h23m

Sounds like you didn't read the article. Here's a better synopsis:

I read a NYT article and publish an exact copy of that article on my website: copyright infringement.

Train a model on NYT text and it outputs an exact copy of that text: also copyright infringement.

cycrutchfield
4 replies
11h14m

So presumably when they fix that issue (which, if the text matches exactly, should be trivially easy) then would you accept that as a sufficient remedy?

Vegenoid
1 replies
3h40m

Copyright infringement is not avoided by changing some text so it isn’t an exact clone of the source.

Determining whether a work violates a copyright requires holistic consideration of the similarity of the work to the copyrighted material, the purpose of the work, and the work’s impact on the copyright holder.

There is not an algorithm for this, cases are decided on by people.

There are algorithms that could detect obvious violations of copyright, such as the one you suggest which looks for exact matches to copyrighted material. However, there are many potential outputs, or patterns of output, which would be copyright violation and would not be caught by this trivial test.

cycrutchfield
0 replies
2h33m

And you think that it would be impossible to train a model to avoid outputs that are substantially similar to training data?

tarruda
0 replies
10h57m

then would you accept that as a sufficient remedy?

Probably not until they pay him a hefty copyright fee.

bloppe
0 replies
2h5m

Basically, ya. It's not enough to change just a couple words around. But ya, there's probably some way to engineer around the problem.

slyall
1 replies
11h2m

A small number of outputs of ChatGPT are close enough to training articles to be (probably) copyright infringement.

What does that mean?

Look up "substantial non-infringing use" and this little court case:

https://en.wikipedia.org/wiki/Sony_Corp._of_America_v._Unive....

Now spend a few million on lawyers and roll your dice.

postexitus
0 replies
8h15m

In the Sony vs. Universal case, Sony is the producer of a tool that the consumer uses to "time-shift" a broadcast they are legally allowed to view. Similarly, you can rip your own CDs or photocopy your own books. That case never made reselling such content legal. OpenAI does not train ChatGPT on content you own - they train it on some undisclosed amount of data that you may or may not have a legal right to access, and then move on and (as has been shown) reproduce it nearly verbatim - they may even charge you for the pleasure.

bad_user
1 replies
11h46m

Fair use is intended for humans, much like copyright in general.

If you can't copyright AI-generated pieces, then why would fair use apply to LLMs?

mdekkers
0 replies
10h20m

Fair use is intended for humans.

Is it? Can you quote relevant legislation or case law?

zozbot234
0 replies
11h52m

Because it's not just summarizing the bare facts. It's a parrot.

up2isomorphism
0 replies
11h56m

That's why there will be a legal fight over fair use. Just letting your intellectual work be used as free training material is not sustainable.

Also, remember that copyright laws were not there in the first place.

outside1234
11 replies
12h52m

Seems reasonable - they probably broke the TOS of the site

thallium205
6 replies
12h31m

What if they OCR’d the newspapers? No ToS there.

steve1977
4 replies
12h4m

I’m pretty sure the physical newspaper is still covered by copyright too.

pyuser583
3 replies
11h49m

For the paper or the author? What exactly was the licensing agreement for Op-Ed authors in 1962?

bloppe
2 replies
11h30m

Read the article. It's not difficult to get ChatGPT to regurgitate recent, obviously copyrighted articles, verbatim.

thallium205
1 replies
4h47m

It will be equally easy to have ChatGPT rewrite copyrighted content so that the output is materially different enough for a copyright claim to fail.

bloppe
0 replies
2h3m

Then ChatGPT should do that.

product-render
0 replies
12h17m

It's at least partially a copyright claim, isn't it? So the method -- OCR or scraping -- doesn't matter, I think.

KETHERCORTEX
2 replies
10h45m

On the other hand, the NYT website willingly gave out all the information without imposing limitations. Seeing the terms of service requires visiting a separate page; they aren't shown immediately upon visiting the website. Understanding and accepting the terms also requires human interaction.

robots.txt on nytimes.com now disallows indexing by GPTBot, so there's an argument against automated information acquisition from a certain point onward, but before that point they weren't explicitly against it.
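
For reference, the rule in question looks like this in the standard robots.txt format (an illustrative excerpt, not a verbatim copy of nytimes.com's file):

    User-agent: GPTBot
    Disallow: /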

arrrg
1 replies
10h36m

Seems weird to argue that you have to speak up if you don’t want something done to you or else you consent to everything.

I do think that’s the case for some things but especially for new things that doesn’t seem like a common sense understanding of the world.

KETHERCORTEX
0 replies
8h5m

Seems weird to argue that you have to speak up if you don’t want something done to you or else you consent to everything.

If you don't want people to get at your land, setting up even a small fence creates an explicit indication of limitations. Just like the record in robots.txt I mentioned earlier.

The New York Times also doesn't limit article text content if you just request the HTML, which is typical for automated clients. The limits are imposed on users viewing the pages in a browser, with JavaScript, CSS and everything else. So they clearly:

1. Have a way to determine the user's eligibility for reading the full article on server side.

2. Don't limit the content for typical automated cases on server side.

3. Have a way to track the activity of users who are not logged in, determining their eligibility for access. So it's reasonable to assume that they had records of repeated access from the same origin, but didn't impose any limitations until some point in time.

So there are enough reasons to think that robots are welcome to read the articles fully. I'm not talking about copyright violations here, only about the ability to receive the data.
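
As a hedged illustration of what "just requesting the HTML" means here (assuming the Python requests package; the URL is a placeholder):

    # Fetch the raw HTML the way a simple bot would, without executing any JavaScript.
    import requests

    resp = requests.get(
        "https://www.nytimes.com/<some-article>",  # placeholder URL
        headers={"User-Agent": "Mozilla/5.0"},
        timeout=10,
    )
    print(len(resp.text))  # inspect whether the full article body is present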

yjftsjthsd-h
0 replies
11h37m

Did OpenAI agree to those ToS? If not, I think (IANAL) LinkedIn was kind enough to give precedent that it's irrelevant.

( https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn )

wouldbecouldbe
9 replies
8h53m

We developers like to pretend that LLMs are akin to humans and that they've been using things like the NYTimes as educational material, the way humans do.

But they are not. It's much simpler: proprietary writing is now integrated into OpenAI's product. It would be as if I copied parts of other proprietary code into my own codebase, then claimed copy-paste is a naturally evolving process of millions of years of evolution.

The fact that LLMs are so complicated that we don't know where that writing sits doesn't make it any less so.

logicchains
7 replies
8h51m

it would be as if I copied parts of other proprietary code into my own codebase

It's not copy-pasted; it's compressed in a lossy manner. Even GPT-4 has nowhere near enough memory to store the entirety of its training data in a non-lossy compression format. Just like how humans compress the information we read.
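
Some rough back-of-envelope arithmetic with public GPT-3-era figures (GPT-4's are unpublished) illustrates the point:

    # Back-of-envelope only; figures are published GPT-3 estimates.
    params = 175e9                # GPT-3 parameter count
    model_bytes = params * 2      # fp16 weights -> ~350 GB
    corpus_bytes = 570e9          # reported filtered training text, ~570 GB
    print(corpus_bytes / params)  # ~3.3 bytes of training text per parameter

That ratio is in the same ballpark as good lossless text compression, so wholesale verbatim storage is implausible once the weights also have to encode the model's general abilities; but it leaves plenty of room for oft-repeated passages to be memorized.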

mihaic
1 replies
4h6m

If you have a copyrighted photo that I simply put through jpeg compression, am I legally allowed to use that?

Software programs are not humans, and need to be treated differently. Anthropomorphization is one of the slipperiest paths to argue anything.

kromem
0 replies
2h30m

It depends on how much is reproducible and what the use is.

If only small patches of the original image can be reproduced then it becomes much more murky.

jamiek88
1 replies
1h23m

If it’s lossily compressed, how come they have verbatim content from NYT in there that's easy to recall? That’s what the lawsuit is about.

anon291
0 replies
1h3m

Many humans have photographic memories. Not common, but not unheard of for people to be able to memorize long portions of text verbatim.

For example, the Wikipedia article

https://en.wikipedia.org/wiki/List_of_people_claimed_to_poss...

contains several examples of people who were able to look at pages and recite them back. That is actually a much stronger ability than GPT since GPT has presumably looked at them 100 times.

wouldbecouldbe
0 replies
7h9m

You're kind of proving my comment by pretending they are akin to a human brain instead of an evolved form of statistics mixed with code, a.k.a. a transformer model.

Let alone that it's a centralised model that's being distributed for a fee.

wouldbecouldbe
0 replies
6h51m

So if I compress NYTimes articles into a vector database and query it with a vector, then that's okay, in line with your reasoning?

lacrimacida
0 replies
8h13m

Just like how humans compress the information we read.

Humans don’t have the scale machines have, and moreover humans aren’t services; that argument doesn’t fly.

I really think NYT's data isn't that important or crucial; LLMs could've just elided it. However, it's more about training on copyrighted data in general, which is kind of crucial for OpenAI: they trained their LLMs indiscriminately on copyrighted content without any plan to share any profits.

gumballindie
0 replies
7h55m

We developers like to pretend that LLMs are akin to humans and that they've been using things like the NYTimes as educational material.

Developers who think LLMs are akin to humans aren't the brightest crop, and are usually a target of ridicule.

biglyburrito
9 replies
13h7m

TLDR:

"The suit seeks nothing less than the erasure of both any GPT instances that the parties have trained using material from the Times, as well as the destruction of the datasets that were used for the training. It also asks for a permanent injunction to prevent similar conduct in the future. The Times also wants money, lots and lots of money: "statutory damages, compensatory damages, restitution, disgorgement, and any other relief that may be permitted by law or equity.""

downWidOutaFite
6 replies
12h50m

Wow they want to kill it. I wonder if we've just lived through the golden Napster era of LLMs.

readthenotes1
3 replies
12h49m

Just train on NYT articles no longer in copyright. We may be better for it.

vidarh
0 replies
10h50m

Or buy them. OpenAI's market cap is many times NYT's.

If we see court judgements start to go copyright owners way, we will also see a scramble from AI companies to buy the few publishers with enough data to be worth buying, and to create works for hire to replace the rest.

In the long run a copyright ruling like that will be a boon for OpenAI and all other players with deep enough pockets to do so, and massively harm everyone else who will suddenly find it far harder to build models legally.

rhdunn
0 replies
8h50m

So that would mean articles from the 1920s, provided that the authors of those articles have been dead for 70 years, or longer in some other countries.

mynegation
0 replies
11h55m

Next thing you know ChatGPT gives you the best way to crank your automobile and take good care of your crinoline.

suby
1 replies
12h34m

They may just want a licensing deal.

weikju
0 replies
12h18m

They're already working on it with Apple (see my other reply in this discussion), so I wouldn't doubt that this is another salvo in the same battle.

chongli
1 replies
12h21m

This is what lawyers are paid for. They ask for the max because there’s no harm in doing so. Everyone knows there’s little meaning to that.

greggsy
0 replies
12h18m

They always go for the max, knowing that they will settle somewhere closer to the expected rate.

jrockway
8 replies
9h55m

I read about this in the Times today (and am surprised that it wasn't on HN already).

My guess is that the court will likely find in the Times' favor, because the legal system won't be able to understand how training works and because people are "scared" of AI. To me, reading a book, putting it in some storage system, and then recalling it to form future thoughts is fair use. It's what we all do all the time, and I think that's exactly what training is. I might say something like "I, for one, welcome our new LLM overlords". Am I infringing the copyright of The Simpsons? No.

I am guessing some technicality like a terms-of-use violation of the website (avoidable if you go to the library and type in back issues of the Times), or storing the text between training sessions is what will do OpenAI in here. The legal system has never been particularly comfortable with how computers work; for example, the only reason EULAs work is because you "copy" software when your OS reads the program off of disk into memory (and from memory into cache, and from cache into registers). That would be copyright infringement according to courts, so you have to agree to a license to get that permission.

I think the precedent on copyright law is way off base, granting too much power to authors and too little to users. But because it's so favorable towards "rightsholders", I expect the Times to prevail here.

hsbauauvhabzb
4 replies
9h49m

My hard drive can - bit for bit - recall video files. If I serve them to other people on the internet without permission of the copyright holder, that’s called piracy.

ninjinxo
2 replies
9h41m

But is it still piracy if you compress them and serve only a likeness of the original?

madeofpalk
0 replies
8h58m

Yes.

hsbauauvhabzb
0 replies
9h24m

If 20% of a NYT article is recalled correctly, does that mean I can publish 20% of a movie if surrounded by junk? What if I do that 5 times over?

jrockway
0 replies
9h36m

Yeah, but the LLMs can't. They aren't big enough to contain every byte of every NYT article, even with the best-known compression algorithms. Rather, they pick up and remember the same patterns that humans do when they write. Authors of the articles also did that, and so the two algorithms (human writer, LLM inference) end up with the same result. (That doesn't preclude large chunks of text that are actually remembered, though. We humans have large chunks of verbatim text floating around in our brains. Passwords, phone numbers, "I pledge allegiance to the flag...", etc.)

Anyway, like I said, I don't think OpenAI will win this. Someone will produce one verbatim article and the court will make OpenAI pay a bunch of money as though every article could be reproduced verbatim, and AI in the US will be set back that many billion dollars. It probably doesn't matter in the long run; it preserves the status quo for as long as the judge is judging and the newspaper exec is newspaper exec-ing. That's all they need. The next generation will have to figure out how to deal with AI-induced job loss... and climate change. Have fun, next generation!

tsimionescu
0 replies
8h37m

In general, if you perform copyrighted works you are doing copyright infringement. There are certain exceptions (personal use, education, very small fragments with proper attribution, maybe a few others) but whether you are reading it aloud from a book or performing it from memory makes no difference.

So, if you set up a service like ChatGPT but powered by humans responding in real time to queries, and these humans would occasionally reproduce large chunks of NYT articles, they and the service itself would be liable for copyright infringement. Even if they were all reproducing these from memory.

Now, this is somewhat different from the discussion of whether training the model on the copyrighted data, even if it had effective protections from returning copies of it, constitutes copyright infringement in itself. I believe this is a somewhat novel legal question and I can think of no direct corollaries.

I certainly don't think we can just handwave and say "at some level, when a human reads a copyrighted work, they are doing the same thing", because we really don't know if that is true. Artificial neural networks certainly have no direct similarity with the neural networks in the brain as far as we can tell. And, even if they did, there is no reason to give a machine the same rights that a human has - certainly not until that machine can prove sentience.

pests
0 replies
9h49m

I don't agree that an LLM is doing what we are doing.

"Its what we do all the time" is a major assumption

SilverBirch
0 replies
8h32m

It's extremely speculative to claim that LLM models are basically doing what humans do. There is very clearly something that isn't right about that, because in order to learn to speak and converse, a human doesn't need to imbibe the entire corpus of all written text in human history - which is basically what we're doing with these LLMs. What we're giving them is vast amounts of data, which is totally unlike how humans work. There's very clearly some gap here between what an LLM is doing and what a human is doing. So you can't use that as a basis to justify why it's OK for OpenAI to operate like this.

To put it another way, let's say I turn the dial all the way the other way: I train the world's crappiest LLM on NYT material, it massively overfits, and all it will ever return is verbatim snippets of the NYT. Is that copyright infringement?

The core part of the argument here is actually just that OpenAI doesn't want to adhere to what the current standard is for using copyrighted material, if you want to use it and create something new with it you need to license the material. Since OpenAI's LLM isn't actually like a human it needs to license such a vast dataset that it would be uneconomical to run the business without stealing all the content.

atleastoptimal
7 replies
12h46m

It's obviously a frivolous suit that will only net at best a ceremonial victory for NYTimes: 8 figure max payout and a promise to not use NYtimes material in the future.

The trajectory and value to society of OpenAI vs NYtimes could not be greater. They have won no favors in the court of public opinion with their frequent misinformation. It's all just a big waste of time, the last of the old guard flailing against the march of progress.

And even hypothetically, if they managed to get OpenAI to delete ChatGPT, they'd be hated forever.

15457345234
4 replies
12h39m

They have won no favors in the court of public opinion with their frequent misinformation.

You mean GPT here, right?

atleastoptimal
3 replies
12h37m

ChatGPT only advertises itself as a fancy autocomplete. There is a disclaimer that it may produce output that appears correct but isn't. NYTimes written material purports to be the truth, and thus obviously shouldn't be held to the same standards as a generative AI.

15457345234
2 replies
12h30m

I think what we should focus on is the volume of misinformation in general, not the provenance of it.

The NYT may produce misinformation but it aims not to, and its staff of human writers are limited in the quantity that they can produce. They also publish corrections.

GPT enables anyone who can pay to generate a virtually unlimited volume of misinformation, launder it into 'articles' with fake bylines and saturate the internet with garbage.

I think we need to focus on the damage done.

realusername
0 replies
10h50m

The NYT may produce misinformation but it aims not to, and its staff of human writers are limited in the quantity that they can produce. They also publish corrections.

Except when it affects their bottom line, of course. They publicly lied about how meta tags work during the lawsuits against Google to get more money (like most newspapers did), and I have no doubt that they will extensively lie once again about how LLMs really work.

atleastoptimal
0 replies
12h28m

Well, that's true for any large language model. As long as they exist, there will be a deluge of bot-written text producible for any purpose. At this point there is no putting the cat back in the bag.

In that case the bigger danger is open-source LLMs. OpenAI at least monitors the use of their endpoints for obvious harm.

faeriechangling
0 replies
11h17m

I've never really known The New York Times to file frivolous lawsuits.

23B1
0 replies
12h31m

Nobody is looking at this suit as applying to the Times exclusively – and neither will the courts.

andy99
7 replies
3h48m

This, or a lawsuit like it is going to be the SCO vs IBM of the 2020's, to wit: a copyright troll trying to extract rent, with various special interests cheering it on to try and promote their own agenda (ironically it was Microsoft that played that role with SCO). It's funny how times have changed and at least now a louder group seem to be on the troll's side. I hope to see some better analysis on the frivolity of this come out. There may be some commercial subtlety in specific cases that doesn't depend on scraping and training, but fundamentally using public internet data for training is not copying, is fair use, and is better for society as a whole than whatever ridiculous alternative might be proposed.

edit: I'm speaking about training broadly capable foundation models like GPTn. It would of course be possible to build a model that only parrots copyrighted content and it would be hard to argue that is fair use.

phkahler
1 replies
3h36m

> There may be some commercial subtlety in specific cases that doesn't depend on scraping and training

The key is to stop calling it "training" and use "learning" or just "reading".

The argument from NYT will probably be that LLMs are just a fancy way to compress or abstract information and spit it back out. In which case "training" seems to support their case?

mycall
0 replies
3h16m

I don't recall the source, but when people read, they typically only remember 20% of what they read (or heard?). Machine training encodes much more than 20%, so it is much closer to copying than training. Now the emergent abilities that come from this could be considered learning and dare I say imagination (which is the opposite of copying).

dannyr
1 replies
3h45m

NYTimes has a paywall. Is that public internet and therefore fair use?

yreg
0 replies
2h49m

They don't have the paywall up if you identify as a search engine scraper, so it is kinda public internet. (I'm not claiming it's fair use.)

xbar
0 replies
24m

That is an irrelevant comparison.

This is theft and monstrous profit from theft. For actual justice this should be a class action suit of the world vs. OpenAI/Microsoft and the financial consequences should be company-ending for OpenAI. Otherwise, you have incented everyone in the AI industry to steal as much as they can for as long as they can.

logicchains
0 replies
3h37m

It's funny how times have changed and at least now a louder group seem to be on the troll's side

Because for many people, their views on current events are whatever the "thought leaders" working for the NYT and similar publications tell them to think.

jamiek88
0 replies
3h40m

Using the words troll and frivolous undermines your otherwise decent point and in fact goes against your point.

The law isn’t settled, it’s a genuine legal question mark.

It ain’t frivolous or trolling or ridiculous.

munchinator
5 replies
9h22m

Why hasn't the Times also sued the Internet Archive? They've tried to block both the Internet Archive [1] and Open AI [2] from archiving their site, but why have they only sued OAI and not IA? The fact that they haven't sued IA which has comparatively little money would seem to indicate that this is not about fair use per se, but simply about profit-seeking and the NYT is selecting targets with deep pockets like OAI/MS.

[1] https://theintercept.com/2023/09/17/new-york-times-website-i...

[2] https://fortune.com/2023/08/25/major-media-organizations-are...

killingtime74
2 replies
9h14m

What's wrong with that? If I were the NY Times' lawyers, that's what I would advise. What would it serve to bankrupt the IA? They can't pay anyway. These are corporations enforcing their rights against one another.

There is nothing wrong with seeking profit from your copyright. That's literally their entire business model... they publish copyrighted content which they sell via subscription.

OpenAI and others could easily have negotiated a licence instead of just using the data. They bet that it would be cheaper to be sued; let's find out if they bet correctly.

Tangentially that's what Apple did with the sensor in their watch, it doesn't always pay off.

munchinator
1 replies
9h9m

What would it serve to bankrupt the IA, they can't pay anyway?

It would serve the termination of the infringement.

My point is that the Times doesn't particular seem to care about infringement per se, they care about getting their slice of the cut from that infringement.

It's like if a video game company or a movie company only attempted to sue illegal downloaders who had a certain net worth.

sensanaty
0 replies
5h47m

It's like if a video game company or a movie company only attempted to sue illegal downloaders who had a certain net worth.

I mean yeah, no one's gonna bother trying to squeeze money out of Joe Schmoe with 10 bucks in his bank account over some pirated movies. If a company with billions and billions of dollars like Netflix started pushing out pirated movies instead, then obviously they'd be sued into oblivion, as they should be.

sgt101
0 replies
9h4m

I think that the moment you start making big money from someone else's business is the moment that they get riled. That and when you really hurt their business. I suspect that the NYtimes thinks that IA is damaging them in the order of (possibly) $100k pa, and that it thinks that OpenAI is making in the order of $10M's from their content (and possibly doing some damage as well). It's an easy commercial decision to ignore one and go after the other - especially as going after IA is going to create some backlash as well.

Shrezzing
0 replies
9h1m

Copyright doesn't stop the collection of content; it stops the copying, processing, and redistribution of content. The Internet Archive acts as a library, so it's widely accepted as fair use when it makes collections of webpages available.

OpenAI's distribution is materially different to that of a library, so it's not a like-for-like comparison.

One of the main tests of copyright law (at least in the US) is whether the entity distributing is _selling_ the copied/derivative work. It's unambiguous that OpenAI is selling something akin to derivative works, which is why NYT feels they can go after this claim. Meanwhile, IA's operations don't create sales or profits, so while NYT's legal team may be able to establish that copies have been distributed, without the _sale_ aspect of the infringement, judges aren't guaranteed to side with NYT in a legally expensive PR nightmare.

fsckboy
4 replies
10h8m

in my head I like to think of web crawler search engines/search engine databases and LLMs as being somewhat similar. Search engines are ok if they just provide snippets with citations (urls), and they would be unacceptable if they provided large block quotes that removed the need to go to the original source to read the original expression of more complex ideas.

A web-crawled LLM that lived within the same constraints would be a search engine under another name, with a slightly different presentation style. If it starts spitting out entire articles without citation, that's not acceptable.

aurareturn
3 replies
9h57m

I think it's different. LLMs can solve problems, and part of that problem-solving ability comes from training on completely unrelated content such as NYT articles. GPT-4 doesn't have to spit out NYT articles verbatim to have benefited from them; it draws on NYT articles in every query.

fsckboy
2 replies
9h32m

Let's say I'm an academic; if my research, note-taking, and paper writing skills lead to fair-use, cited quotations where applicable, general knowledge not identified, and the creative aspects and unique conclusions creating the intriguing part of my work, that's copacetic. If I spit out (from memory, mind you) verbatim quotes and light rewordings of NY Times articles, that's not; "I don't remember where I got that material" doesn't cut it. My reading the NY Times every day for years because I judge it to be more literate and accurate than other sources, undoubtedly it has informed my thinking and style, but I don't need to acknowledge that.

If I use ChatGPT as a research tool, as long as it lives within the same parameters that I have to live within, I don't see a problem with its education/learning.

I understand that the NYTimes would like a slice of anything that comes out of the GPT but I'm talking about what seems reasonable. People who share their copyrighted material do not own all of the thinking that comes out of it; they own that expression of it, that is all.

Will AI destroy the economics of "writing" the way the web has killed newspapers? perhaps, perhaps we'll all benefit from and need a new model, but killing the new to keep the old on life support is not the way.

aurareturn
1 replies
9h27m

You're not replicating yourself millions of times and selling yourself for $20/month. If you are, then NYT might sue you too.

I'm not saying LLMs are by default, illegal. All I'm saying is that there is some merit to why NYT and content companies want a piece of the pie and think they deserve it.

fsckboy
0 replies
9h20m

The NY Times benefited in the past from technologies that led to widespread distribution of the Times, putting competitors out of business and concentrating talent at the Times. Nobody is stopping them from producing new editions of the newspaper, their core business. People now have technologies that help them "remember" what was salient in back issues of the Times. Such is progress.

ssijak
3 replies
8h20m

If I create a news website where I write articles in the following way:

- Read 20 different news websites and their story on the same event/topic

- Wait an hour, grab a cup of coffee

- Sit down to write my article; from this point on I never open any of the 20 news websites, and I write the story from my head

- I don't consult any other source, just write from memory, and my memory is, let's say, not the best, so I will never write more than 10 words exactly as they appear on any of the 20 websites.

- I will probably also write something that is not correct or add something new because, as I said, my memory is not the best.

Is that fair use? Am I infringing on copyright?

gumballindie
2 replies
8h19m

If you are a piece of software, then yes.

schleck8
1 replies
8h6m

Yes to what?

gumballindie
0 replies
7h54m

A human could tell.

frakrx
3 replies
8h54m

Under existing conditions an AI news site seems like a good investment idea. Its AI could read all relevant news sources, retell them, and republish them in its own articles. It could even have its own AI editors and contributors. I cannot see how human news companies could compete.

logicchains
2 replies
8h53m

Cannot see how human news companies could compete.

News ultimately comes from physical sources on the ground, which AI currently has no way of reaching.

kjkjadksj
0 replies
1h39m

That style of journalism is nearly dead. True on-the-ground investigative journalism is hardly done today; most of it is just reporting existing public information releases. You don't have to be at the presser when everything the police chief says will be put in an online transcript.

frakrx
0 replies
8h46m

I am sure it could easily rephrase the articles and retell them without quoting any real or verifiable sources. Many human news companies often do this too.

throwaway4good
2 replies
10h30m

The lawsuit itself (which arstechnica links to):

https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec20...

Page 30 onwards has some fairly clear examples of how ChatGPT has an (internal) copy of copyrighted material which it will recite verbatim.

Essentially, if you copy a lot of copyrighted material into a blob and then apply some sort of destructive compression to it, how destructive would that compression have to be for the copyright no longer to hold? My guess is it would have to be a lot.

As I see it, the closedness of OpenAI may be what saves it. OpenAI could filter and block copyrighted material from leaving the web interface using some straightforward matching mechanism against the copyrighted part of the data set ChatGPT has been trained on. Whereas open-source projects trained on the same data set would be left with the much harder task of removing the copyrighted material from the LLM itself.
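
To sketch what that filtering could look like (purely hypothetical on my part; nothing here is OpenAI's actual mechanism), a matcher only needs to check model output for long verbatim runs against the protected part of the training set:

    # Hypothetical output filter: block a response if it shares any
    # sufficiently long word run with a corpus of protected text.
    def ngrams(words, n):
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    def build_index(protected_texts, n=12):
        index = set()
        for text in protected_texts:
            index |= ngrams(text.lower().split(), n)
        return index

    def is_blocked(response, index, n=12):
        # True if any 12-word run of the response appears verbatim
        # in the protected corpus.
        return bool(ngrams(response.lower().split(), n) & index)

    index = build_index(["full text of each protected article goes here"])
    print(is_blocked("some candidate model output", index))

Scaling that to millions of articles is a hashing/Bloom-filter exercise, and only a hosted, closed model can bolt such a filter on after the fact, which is exactly the asymmetry with open-source models I'm pointing at.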

vanchor3
0 replies
5h13m

Essentially if you copy a lot of copyrighted material into a blob and then apply some sort of destructive compression to it. How destructive would that compression have to be for the copyright no longer to hold? My guess it would have to be a lot.

I imagine the goal is closer to "enough that no one notices we stole it": either so that it's not easily discoverable, or so that even when directly analyzed there's enough plausible deniability to scrape by.

jprete
0 replies
5h42m

The answer to the "closedness" is externally controlled audits.

strangus
2 replies
12h39m

Next up, Microsoft acquires the New York Times forming MSNYT

unsupp0rted
0 replies
10h54m

This is not impossible, and perhaps not even unlikely

playingalong
0 replies
12h19m

... New Roman

ctoth
2 replies
1h43m

Isn't the fundamental issue here that the NYT was available in Common Crawl?

If they didn't want to share their content, why did they allow it to be scraped?

If they did want to share their content, why do they care (hint: $88 billion)?

Or is it that they wanted to share their content with Google and other search engines in order to bring in readers but now that an AI was trained on it they are angry?

What wrong thing did OpenAI do specific to using Common Crawl?

Didn't most companies use Common Crawl? Excepting Google, who had already scraped the whole damn Internet anyway and just used their search index?

Is it legal or not to scrape the web?

If I scrape the web, is it legal to train a transformer on it? Why or why not?

To me, this is an incredibly open-and-shut case. You put something on the web, people will read that something. If that is illegal, Google is illegal.

Oh, and do you see the part in the article where they are butthurt that it can reproduce the NYT style?

"Defendants’ GenAI tools can generate output that recites Times content verbatim, closely summarizes it, and mimics its expressive style, as demonstrated by scores of examples," the suit alleges.

Mimics its expressive style. Oh golly the robots can write like they're smug NYT reporters now--better sue!

It appears that the NYT changed their terms of service in August to disallow their content in Common Crawl [0]. Wasn't GPT-4 trained long before August?

[0]: https://www.adweek.com/media/the-new-york-times-updates-term...

rfw300
1 replies
1h25m

If you read the complaint, it explains this pretty well. The use of copyrighted content by search engines is fundamentally different from the way LLMs use that same content. The former directs traffic (and therefore $$) to the publisher, the latter keeps the traffic for itself.

The legal misconception I want to flag in your logic is the notion that all uses of the Common Crawl are equally infringing/non-infringing. If you use the Common Crawl to create a list of how often every word in English appears on the internet, that’s unquestionably transformative use. But if you use it to host a mirror of the NYT website with free articles, that’s definitely infringement. The legality of scraping is one matter, and the legality of what you do with the scraped content is quite another.
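
For concreteness, the word-frequency end of that spectrum is something like this toy sketch (real Common Crawl data arrives as WARC archives, which I'm glossing over here):

    # Clearly transformative use: reduce crawled pages to aggregate
    # word counts, discarding every trace of the original expression.
    from collections import Counter
    import re

    def word_frequencies(pages):
        counts = Counter()
        for page in pages:
            counts.update(re.findall(r"[a-z']+", page.lower()))
        return counts

    pages = ["Text of one crawled page.", "Text of another crawled page."]
    print(word_frequencies(pages).most_common(3))

No sentence from any source survives that reduction. A mirror of NYTimes.com sits at the opposite end, and training an LLM lands somewhere in between, which is precisely the line the court will have to locate.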

ctoth
0 replies
1h2m

From my original comment:

Is it legal or not to scrape the web?

If I scrape the web, is it legal to train a transformer on it? Why or why not?

At no point did I say anything about hosting a mirror of the NYT website, with free articles. Obviously. Because OpenAI didn't do that. Some NYT lawyer tried to get ChatGPT to write a NYT article. Maybe first they should have actually done a Google search and shut down some of the actual content farms which simply copy NYT content such as [0]. But instead, we get this.

[0]: https://salaminv.com/news_file/

bigmattystyles
2 replies
2h12m

Not that it would solve this, but how hard would it be for ChatGPT or other models to cite the sources used in a response? Is that difficult to capture and tag to 'knowledge' within an LLM? It could be a best-of-both-worlds situation if LLMs cited sources and linked to the source itself. Isn't that what happened with Google News's home page? I seem to recall that when Google took it away in some markets, at the behest of the news orgs, they quickly reversed course as their traffic plummeted.

qznc
0 replies
2h2m

This is not possible. There is no database of sources inside an LLM. Just like the knowledge in your brain does not have sources attached.

For example, you referenced "what happened with Google News's home page". Could you give me your source? You could search for a suitable article to cite, but you can't produce a source from memory alone.

jejeyyy77
0 replies
2h6m

Not likely, given the way these models have been trained - the text is basically broken down into sub-words that are all mashed together into probabilities.
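
For illustration, here is roughly what that breakdown looks like using OpenAI's open-source tiktoken tokenizer (the encoding name is the one published for the GPT-4 family; the example sentence is mine):

    # Text is reduced to sub-word token ids; nothing records which
    # training documents those ids were ever seen in.
    import tiktoken  # pip install tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # GPT-4-family encoding
    tokens = enc.encode("The Times sued OpenAI.")
    print(tokens)                             # a list of integer ids
    print([enc.decode([t]) for t in tokens])  # the sub-word pieces

The weights end up holding statistics over those ids, not documents, so there is no per-answer citation to recover.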

bdd8f1df777b
2 replies
9h12m

I see few people here bring this up, so let me:

The US constitution says, The Congress shall have Power

To promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries;

So Congress's power to make copyright and patent laws is predicated on the promotion of science and useful arts (I believe this actually means technology). In a sense, OpenAI being at the forefront of our AI technology advancement is crucial to the equation. To hinder that progress with copyright is, in my mind, unconstitutional.

madeofpalk
0 replies
9h1m

Wishful thinking. Just as equally, NYT's right to copyright over its material, in order to have a functional press, is enshrined in the constitution. Any threat to that copyright could be unconstitutional.

I think we all agree that no one is entitled to “progress of science” at any cost - as a straw man, killing hundreds of newborn babies for scientific research is not great - so we use ethics and the legal system to find the line of what’s acceptable.

I don’t know exactly what NYT is asking for here, but the two options aren’t unconsented training vs nothing at all. NYT could license, for a fee, its content to OpenAI. It’s pretty common for scientists to have to pay for materials!

globular-toast
0 replies
9h7m

Current AI is useless without people writing the articles in the first place.

JackFr
2 replies
7h50m

I think LLMs may really change the IP landscape.

Culturally we’re taught that there is a moral component to copyright and patent law - that stealing is stealing. But the idea that words or thoughts or images can be owned (and that the might if the state can be brought to bear to enforce it) would seem utterly ludicrous to someone from an earlier era. Copyright and patent laws exist for practical, pragmatic reasons - and seemingly they have served us well, but it’s not unreasonable to re-examine them from first principals.

narenkeshav
0 replies
7h48m

I remember a case where the court did not allow id Software to patent "first-person shooters".

This rings similar.

flanked-evergl
0 replies
7h43m

But the idea that words or thoughts or images can be owned (and that the might of the state can be brought to bear to enforce it) would seem utterly ludicrous to someone from an earlier era.

Is there any research into how people from earlier eras thought about it? And should all laws that seemed ludicrous to someone from an earlier era be discarded? If not, how exactly do we determine the relevance of what someone from an earlier era would think about our laws?

4death4
2 replies
5h57m

I think there is a national security aspect to ML models trained on copyrighted data. Countries that allow it will gain a superior technological advantage and outcompete those who disallow training on copyrighted material. I personally believe training LLMs on copyrighted data is copyright infringement if the models are deployed in a way that competes with the copyright holder. But that doesn’t necessarily mean it’s something we should disallow.

kjkjadksj
1 replies
1h44m

You can say the same for any legal enforcement, like respecting patent or copyright law, or making Champagne outside France. Yet the sky isn't falling, given this reality, with so many legally protected industries. Maybe the markets such an industry might offshore to are too small and insular to be very significant, and they are probably language-bound, making English models less relevant compared to native-language models.

4death4
0 replies
11m

Champagne isn’t a transformative technology, and least not anymore.

starchild3001
1 replies
36m

I asked an LLM to summarize the 69-page lawsuit. It does a decent job. Didn't infringe on any copyrights in the process :)

Here is a summary of the key points from the legal complaint filed by The New York Times against Microsoft and OpenAI:

The New York Times filed a copyright infringement lawsuit against Microsoft and OpenAI alleging that their generative AI tools like ChatGPT and Bing Chat infringe on The Times's intellectual property rights by copying and reproducing Times content without permission to train their AI models.

The Times invests enormous resources into producing high-quality, original journalism and has over 3 million registered copyrighted works. Its business models rely on subscriptions, advertising, licensing fees, and affiliate referrals, all of which require direct traffic to NYTimes.com.

The complaint alleges Microsoft and OpenAI copied millions of Times articles, investigations, reviews, and other content on a massive scale without permission to train their AI models. The models encode and "memorize" copies of Times works which can be retrieved verbatim. Defendants' tools like ChatGPT and Bing then display this protected content publicly.

OpenAI promised to freely share its AI research when founded in 2015 but pivoted to a for-profit model in 2019. Microsoft invested billions into OpenAI and provides all its cloud computing. Their partnership built special systems to scrape and store training data sets with Times content emphasized.

The complaint includes many examples of the AI models reciting verbatim excerpts of Times articles, showing they were trained on this data. It also shows the models fabricating quotes and attributing them to the Times.

Microsoft's integration of the OpenAI models into Bing Chat and other products boosted its revenues and market value tremendously. OpenAI's release of ChatGPT also made it hugely valuable. But their commercial success relies significantly on unlicensed use of Times works.

The Times attempted to negotiate a deal with Microsoft and OpenAI but failed, hence this lawsuit. Generating substitute products that compete with inputs used to train models does not qualify as "fair use" exemptions to copyright. The Times seeks damages and injunctive relief.

In summary, The New York Times alleges Microsoft and OpenAI's AI products infringe Times copyrights on a massive scale to unfairly benefit at The Times's expense. The Times invested heavily in content creation and controls how its work is used commercially. Using Times content without payment or permission to build competitive tools violates its rights under copyright law.

starchild3001
0 replies
29m

A second LLM's take on this lawsuit can be found below. I'd love to see OpenAI address these complaints publicly and without incurring any additional damages to NYT.

The document is a legal complaint filed by The New York Times Company against Microsoft Corporation and various OpenAI entities, alleging copyright infringement and other related claims. The New York Times Company (The Times) accuses the defendants of unlawfully using its copyrighted works to create artificial intelligence (AI) products that compete with The Times, particularly generative artificial intelligence (GenAI) tools and large language models (LLMs). These tools, such as Microsoft's Bing Chat and OpenAI's ChatGPT, allegedly copy, use, and rely heavily on The Times’s content without permission or compensation.

Nature of the Action: The Times emphasizes the importance of independent journalism to democracy and claims its ability to continue providing this service is threatened by the defendants' actions. The complaint argues that the GenAI tools are built upon unlawfully copied New York Times content, which undermines The Times's investments in journalism.

Defendants: The defendants include Microsoft Corporation and various OpenAI entities, such as OpenAI Inc., OpenAI LP, and several other related companies. The Times alleges these entities have worked together to create and profit from the GenAI tools in question.

Allegations: 1. Copyright Infringement: The Times claims the defendants copied millions of its copyrighted articles and other content to train their GenAI models. This training allegedly involves large-scale copying and use of The Times’s content, emphasizing its quality and value in building effective AI models.

2. Unlawful Competition: The Times argues that the defendants' GenAI tools compete with it by providing access to its content for free, which could potentially divert readers and revenue away from The Times.

3. Misattribution and Hallucinations: The Times asserts that the defendants' tools not only unlawfully distribute its content but also generate and attribute false information to The Times, damaging its credibility and trust with readers.

4. Trademark Dilution: The complaint includes claims that the defendants' use of The Times’s trademarks in connection with lower-quality or inaccurate AI-generated content dilutes and tarnishes its brand.

5. Digital Millennium Copyright Act Violations: The Times alleges that the defendants removed or altered copyright management information from its works, which is prohibited under the law.

Harm to The Times: The Times claims it has suffered significant harm from these actions, including loss of control over its content, damage to its reputation for accuracy and quality, and financial losses due to diminished traffic and revenue.

Demands: The Times seeks various forms of relief, including statutory damages, injunctive relief to prevent further infringement, destruction of the infringing AI models, and compensation for losses and legal fees.

Overall Summary: This legal complaint represents a significant clash between traditional media and emerging AI technology companies. It underscores the complex legal, ethical, and economic issues arising from the use of copyrighted content to train AI systems. The outcome of this case could have far-reaching implications for the AI industry, content creators, and the broader digital ecosystem.

sensanaty
1 replies
5h30m

I love seeing all the AI sycophants squirm at this news.

Here's to hoping NYT wins this one and gets everything they ask for, and more!

ugjka
0 replies
3h2m

I don't know if winning this will improve their business model.

I don't use ChatGPT to get the news, but I also don't pay for paywalled articles.

mark_l_watson
1 replies
2h43m

I think Apple has really got ahead of this game: early deals to pay for AI training data/content. I need to do some research but I think Anthropic also does this.

After a year of largely using OpenAI APIs, I am now much more into smaller "open" models, and I hope the major contributors like Meta/Facebook are following Apple's lead. Off topic, but: even though I find the smaller "open" models much less capable, they capture my imagination and my personal research time.

efields
0 replies
1h58m

Not sure if they’re ahead but I think it was smart to not ship anything LLMlike until the regulations get made first movers test the waters.

Casey Newton has been saying all year that these things will be awesome once we can unleash them on our own corpus of data safely. “Siri” already does a great job digging through my photos and picking the good memories. I can let my camera roll become a visual junk drawer now.

Do the same for my email. Make "Find" the tool we always wanted it to be. I don't care if I'm conflating LLMs/AI with other smart tech.

kazinator
1 replies
9h53m

Should be: "NY Times wants OpenÄI to delete all GPT instances". You wouldn't want the hapless rabble misreading it as an "aiii" diphthong.

hoppyhoppy2
0 replies
8h39m

Are you confusing the New York Times with the New Yorker?

globular-toast
1 replies
9h9m

To me, reading a book, putting it in some storage system, and then recalling it to form future thoughts is fair use. It's what we all do all the time, and I think that's exactly what training is.

If the AI can recall the text verbatim then it's not at all the same. When we read, we are not able to reproduce the book from our memory. Even if a human could memorise an entire book, it's not at all practical to reproduce the book from that. The current AIs are not learning "ideas"; they are learning sequences of words.
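
That difference is measurable, too: prompt the model with an article's opening and compare its continuation to the real text. A rough probe (difflib is in the standard library; the two strings here are placeholders):

    # Rough memorization probe: length, in words, of the longest
    # verbatim run shared by the original and the model's continuation.
    from difflib import SequenceMatcher

    def longest_shared_run(original, generated):
        m = SequenceMatcher(None, original.split(), generated.split())
        match = m.find_longest_match(0, len(m.a), 0, len(m.b))
        return match.size

    original = "the real article text, obtained separately ..."
    generated = "the model's continuation of the article's opening ..."
    print(longest_shared_run(original, generated))

A human reader scores near zero on a probe like this; the complaint's exhibits show the models sometimes reproducing whole passages.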

contravariant
0 replies
8h20m

Yeah the comparison to humans is silly anthropomorphising at this point.

However I am inclined to agree with them for the simple fact that putting a file into a device and letting that device reproduce parts of the file should be allowed. I mean we're already at the point where this simple right is under pressure from DRM, but people should be allowed to do whatever they want with the files they own.

Whether you can publish this output and share it with the world is a whole different issue.

elif
1 replies
4h56m

At some point, carrying 100-year-old copyright/patent law will become so onerous a burden on the pace of progress that its enforcement will be antihuman.

ryukoposting
0 replies
4h35m

It already is, but I don't think this is a good example. NYT has a legitimate case here. They own the material they publish, and GPT-4 is shown to be able to recall entire articles verbatim. That's a violation, clear as day.

The thing about lawsuits is that you make dozens of claims, and the court can rule in favor of some of them, and against others. The question of "is LLM training fair use?" hasn't made it to a high court yet. The court could very easily rule against everything else in the suit.

chris_wot
1 replies
8h28m

Fair use is something Wikipedians dance around a fair amount. It also meant I did a lot of reading about it.

It’s a four part test. Let’s examine it thusly:

1. Transformative. Is it? It spits out informative text and opinion. The only "transformation" is that it's generative text. IMO that's a fail.

2. Nature of the work - it’s being used commercially. Given it’s being trained partially on editorial, that’s creative enough that I think any judge would find it problematic. Fail on this criteria.

3. Amount. It looks like they trained the model on all of the NYT articles. Oops, definite fail.

4. Effect on the market. Almost certainly negative for the NYT.

IMO, OpenAI cannot successfully claim fair use.

EMIRELADERO
0 replies
8h22m

You're getting mixed up. When applying the four factors, you need to individually separate all the uses, repeating the fair use test for every alleged type of infringement. The scraping from the public internet into OpenAI's dataset storage cluster is one instance where the full four-factor analysis must take place; the training itself is another; the distribution of model outputs another; and so on.

chmod600
1 replies
9h53m

Isn't copyright tethered somehow to a notion of "expression"? That is, the same ideas and facts expressed differently are a different work?

Sure, when something is clearly derived, or just expressed in a new medium, then I'm sure it's still covered. But if it goes through an LLM and the result bears little resemblance, how can that still fall under copyright?

visarga
0 replies
9h42m

As you said, AI can rewrite articles, obtaining a clean-cut separation between ideas and expression. Keep the ideas, write a new text. And the more sources you use, the better: the output becomes even more different from any single one. This approach could also check consistency and bias between sources.
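
A sketch of that pipeline, assuming nothing beyond a chat-completion API (the model name, the prompts, and the function names are my placeholders, not a claim about how anyone actually does this):

    # Hypothetical idea/expression separation: extract the factual claims
    # from several sources, then write fresh text from the facts alone.
    from openai import OpenAI  # pip install openai

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def ask(prompt, model="gpt-3.5-turbo"):
        r = client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": prompt}])
        return r.choices[0].message.content

    def rewrite_from_sources(sources):
        facts = [ask("List only the factual claims in:\n" + s) for s in sources]
        return ask("Write a short news item from these facts, "
                   "checking them against each other for consistency:\n"
                   + "\n".join(facts))

Even this isn't automatically in the clear, though: facts themselves aren't copyrightable, but the selection and arrangement of them can be.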

ChrisArchitect
1 replies
11h11m
kristianp
0 replies
10h17m

True, the Verge article was posted here earlier.

weikju
0 replies
12h59m

Probably has something to do with impending deals between NYT and major companies, e.g.

[0] https://www.nytimes.com/2023/12/22/technology/apple-ai-news-...

[1] https://www.theverge.com/2023/12/22/24012730/apple-ai-models...

visarga
0 replies
10h9m

Wondering who tf reads old NYT articles? News becomes old really fast. ChatGPT is months or years behind.

unstatusthequo
0 replies
8h31m

I’d be happy if the NYT was deleted. I find it has very little use as a source of anything, much like most mainstream media.

throwuwu
0 replies
6h6m

If they lose they should delete the NY Times

sylware
0 replies
1h27m

If they don't let AIs be trained on as much data as possible, those AIs will be less "good" than the ones trained without constraints, like you will have in China or elsewhere, and people will mechanically start using the latter.

Unless they engage in massive, geolocation-based IP and DNS banning, forced upon all internet users and "external" users.

sunpazed
0 replies
8h38m

“The tragedy of the Luddites is not the fact that they failed to stop industrialization so much as the way in which they failed. Human rebellion proved inadequate against the pull of technological advancement.”

https://www.newyorker.com/books/page-turner/rethinking-the-l...

skc
0 replies
9h11m

Kind of ironic that the NYT will still have to host articles extolling the virtues of OpenAI as it continues to expand and upend industries

shp0ngle
0 replies
9h15m

Microsoft is one of the companies that love to use copyright to get their way, and the BSA is a known software mafia, so I'm not at all sympathetic to them.

sackfield
0 replies
12h35m

Something I have wondered about LLMs and training data is that the biggest content producers on the internet now have their world view and tone echoed disproportionately as part of the next big wave of technology. This is incredibly impactful (although admittedly I don't know how to turn that into a profit). Is there some unforeseen long-term impact of removing the New York Times from training data, such that it won't be part of LLM corpora going forward?

ryukoposting
0 replies
4h48m

All of that costs money, and The Times earns that by limiting access to its reporting through a robust paywall.

Not to be pedantic, but NYT has the least robust paywall I've ever seen. Just turn on reader mode in your browser. Simple. I get that it's still trespassing if I walk into an unlocked house, but NYT could try installing a lock that isn't made of confetti and uncooked pasta.

ranting-moth
0 replies
6h13m

Let's try the "reverse the gender" card.

Let's say OpenAI was trained on all the Windows source code (without approval from MS).

GPT could pretty much replicate the windows code with even not that clever prompt by any user. "Write an OS CreateProcess function like Windows 10 source code would have."

It would infuriate MS to put it mildly, enough to start a lawsuit.

I know the licenses on the MS source code and NYT articles aren't the same.

poorman
0 replies
1h25m

Sad to say but I would believe a hallucination from OpenAI before I would believe anything that comes out of the NY Times. I mean the confidence interval for the NY Times is what again?

nektro
0 replies
6h39m

oh how joyous that would be. I so hope they win

lwhi
0 replies
3h9m

Surely there's no chance OpenAI would agree to this?

Isn't it more likely that the company buys the NYT?

kweingar
0 replies
4h37m

The thing that bothers me about the whole situation is that OpenAI prohibits using its model output as training data for your own models.

It seems more than a bit hypocritical, no? When it comes to their own training data, they claim to have the right to use any/all of humanity’s intellectual output. But for your own training data, you can use everything except for their product, conveniently for them.

kragen
0 replies
11h45m

this was predicted in the very influential epic 2014 video in 02004

https://www.youtube.com/watch?v=eUHBPuHS-7s (the original is flash and has thus been consigned to the memory hole, so we are left with this poor-quality conversion)

36": 'however, the press as you know it has ceased to exist'

40": '20th-century news organizations are an afterthought; a lonely remnant of a not-too-distant past'

2'11": 'also in 2002, google launches google news, a news portal. news organizations cry foul. google news is edited entirely by computers'

5'13": 'the news wars of 2010 are notable for the fact that no actual news organizations take part. googlezon finally checkmates microsoft with a feature the software giant cannot match: using a new algorithm, googlezon's computers construct new stories, dynamically stripping sentences and facts from all content sources, and recombining them. the computer writes a new story for every user'

5'55": 'in 2011 the slumbering fourth estate awakes to make its first and final stand. the new york times company sues googlezon, claiming that the company's fact-stripping robots are a violation of copyright law. the case goes all the way to the supreme court'

they didn't get the details exactly right, but overall the accuracy is astounding

however, that may be a hyperstition artifact in this timeline

https://en.wikipedia.org/wiki/EPIC_2014 (i thought epic 2014 might be the only flash video to have a wikipedia article about it, but then i looked and found five others)

kolinko
0 replies
5h41m

Worth noting that - at least in the screenshot - this shows an example of the browsing functionality being used to get around paywalls, not that the model itself was trained on, or can really reproduce, the articles.

IIRC this was the reason why the browsing plugin was disabled for some time after its introduction - they were patching up this hole.

joshxyz
0 replies
8h48m

The only winners here are the lawyers on both sides, laughing their way to the bank.

God, I love this era: so much grey area in these edge technologies.

j0hnyl
0 replies
1h40m

I hope the world can rally and move past these anachronistic ideas of intellectual property.

hazmazlaz
0 replies
3h6m

I'd rather have GPT than the NY Times, if I had to choose between one or the other.

fuzzfactor
0 replies
3m

What if you were one of the people who read the Times from cover-to-cover every day and seriously tries to remember as much as possible because you consider it a trustworthy reference source?

And if you were called upon to solve a problem based on knowledge you consider trustworthy, what would you come up with?

What if you were even specifically directed to utilize only findings gleaned from the Times exclusively?

And what if that was your only lifetime source of information whatsoever for some reason?

fbhabbed
0 replies
7h17m

This is getting a bit out of hand isn't it.

exabrial
0 replies
3h10m

I'm actually fine with this. Copyright holders never consented to having their work used in this manner.

ehwhwhwhahhwh
0 replies
8h17m

NYT could also fix the issue by deleting NYT itself. Could be a better result for humanity as well. Thanks.

djhope99
0 replies
3h12m

This argument that the LLM is "learning" seems slightly flawed when you consider that other experts in the field consider it more like lossy compression. If it's lossy compression that's really happening here, then you can understand the copyright argument. It'll be interesting to see how this plays out; lots of new ground being broken.

dewbrite
0 replies
2h17m

Summarizing the article: The most damning thing here is the "ChatGPT as a search engine" feature, which appears to run an agent that performs a search, visits pages, and returns the best results.

In doing this, it is bypassing the NY Times paywall, and you can read full articles from today by repeatedly asking for the next paragraph.

dash2
0 replies
8h34m

There’s an awful lot of confident statements be made about the law here. I wonder if anyone who is actually a lawyer would like to chime in.

cynicalsecurity
0 replies
9h39m

Nothing will come out of it. NY times will lose.

cpt100
0 replies
4h29m

Given that Harvard's president plagiarized her way into becoming president, how can we be sure that NYT doesn't plagiarize and take content from X and other places to quickly churn out daily news?

andrewstuart
0 replies
9h24m

Means nothing.

An ambit claim that Rupert is throwing out there to see what he can get.

amadeuspagel
0 replies
46m

Two not-so-subtle paragraphs about the "partnership" between Microsoft and OpenAI:

15. Microsoft Corporation is a Washington corporation with a principal place of business and headquarters in Redmond, Washington. Microsoft has invested at least $13 billion in OpenAI Global LLC in exchange for which Microsoft will receive 75% of that company’s profits until its investment is repaid, after which Microsoft will own a 49% stake in that company.

16. Microsoft has described its relationship with the OpenAI Defendants as a “partnership.” This partnership has included contributing and operating the cloud computing services used to copy Times Works and train the OpenAI Defendants’ GenAI models. It has also included, upon information and belief, substantial technical collaboration on the creation of those models. Microsoft possesses copies of, or obtains preferential access to, the OpenAI Defendants’ latest GenAI models that have been trained on and embody unauthorized copies of the Times Works. Microsoft uses these models to provide infringing content and, at times, misinformation to users of its products and online services. During a quarterly earnings call in October 2023, Microsoft noted that “more than 18,000 organizations now use Azure OpenAI Service, including new-to-Azure customers.”

Mountain_Skies
0 replies
3h26m

Looks like this is a case of Media vs Tech which might be solved by the courts using past paradigms but should really be addressed by legislation specific to this situation. The difficulty for the media companies, at least in the US, is that both major political parties see the media as the enemy. The left might be a bit more positive about the media, but overall they still see the media as something owned by wealthy elites, suppressing knowledge of the harm the powerful inflict on the weak and powerless. Over on the Tech side of things, one party sees Tech as wholly owned by the other side of the political divide. Over on that side, things are relatively (but not completely) friendly, so my guess is Tech will end up winning simply because it has more friends in the political realm than the Media does.

1f60c
0 replies
8h24m

I believe that ChatGPT is fair use, just on a much larger scale than we're used to.