NY Times copyright suit wants OpenAI to delete all GPT instances

munchinator
208 replies
9h6m

It's interesting to me how ambivalent people are about reproducing news content. Whenever there is a story from NYT on HN (or any other large media outlet), the top comment is almost always a link to an archived version which reproduces the text verbatim.

And this seems to be tolerated as the norm. And yet, whenever there is a submission about a book, a TV show, a movie, a video game, an album, a comic book, or any other form of IP, it is in fact very much _not_ the norm for the top-rated comment to be a Pirate Bay link.

I think it's worth reflecting on why we feel it's OK to pirate news articles, but not other IP.

And the reason I bring this up is that it seems like OpenAI has the same attitude: scraping news articles is OK, or at worst a gray area, but what if they were also scraping, for example, Netflix content to use as part of their training set?

Germont
27 replies
8h20m

To me, there is a sense that the news, which is real information about the society that we currently live in, should be available to all participants of that society. The notion of being a good citizen requires that one stays informed. Books, movies, video games etc. don't have that role and are more consumption goods.

JackFr
11 replies
8h1m

should be available to all participants of that society.

Who pays?

fodkodrasz
6 replies
7h30m

The government (and thus the people, in a so-called sharing of the public burden)!

For example, in Hungary there is an official news agency run by the government, with (cumbersome) free access for everybody. Of course this provides a somewhat biased presentation of some facts, but on many topics it provides unbiased access to news for any citizen.

This is actually pretty common in Europe, often funded by mandatory fees (for some reason not branded as taxes) that certain appliance owners need to pay (UK TV license, German Rundfunkbeitrag). For this fee, people get free access to news and cultural programmes via different media (radio, TV, internet).

bongripper
5 replies
7h15m

I agree with your general point, but Hungary is probably the worst example you could have chosen from any EU country! The Orbán government is famously using it to spread propaganda and fake information at unprecedented levels.

The level of control governments exert over public broadcasting networks varies widely. Under Meloni, RAI in Italy has been facing similar issues, but Hungary is still the canonical example of government misinformation and propaganda.

fodkodrasz
4 replies
6h34m

That is orthogonal to the discussion we were having. The topic was whether people should have free access to news, and how it should be financed, not the quality of that news.

People have free access to public roads all around the world, and the quality varies wildly there as well. The quality of for-profit news services also differs wildly; you might have an opinion about that of Fox News, for example, but that is also off topic in this discussion.

nulbyte
1 replies
6h22m

That is orthogonal to the discussion we were having. The topic was whether people should have free access to news, and how it should be financed, not the quality of that news.

On the contrary, the quality of the news is very important to the discussion. There is no point in making trash freely available to the public, after all.

fodkodrasz
0 replies
6h6m

The topic is a bit more nuanced, and far wider than "not fitting my favourite narrative on some topics, so it is generally and objectively trash".

Think about this: I will get mostly objective and useful reports of the flood approaching my home near the river, regardless of the narrative/interpretation they might have on some other topics, or the biased reporting on the merits of the government in handling the situation at the dams.

I'm not here to debate the political policies of particular governments; I just gave a few examples of ways to fund public access to news. This discussion is over on my part.

jacquesm
0 replies
6h2m

No, it isn't an orthogonal discussion. The reason Orbán wants people to have free access to his propaganda is that it directly serves his purpose. Financing it directly from sales of the media would defeat the purpose. Coupled with Orbán's attacks on the free media, it completes the picture.

bongripper
0 replies
6h2m

I would argue the people of Hungary would be better off without the hatred against asylum seekers, minorities, and political opponents, and without the lies and misinformation.

philwelch
0 replies
2h3m

Every news source has biases. Under the paywall business model, the people who share the biases of their favored news outlets pay for them, and in exchange, they get to ensconce themselves inside a bubble free of dissenting viewpoints. This also reinforces the bias of the news outlet; if they don’t toe the line, they will lose subscribers.

Instead of paying news outlets to provide ourselves with filtered feeds of content that match our own biases, we could instead pay news outlets to produce competing streams of explicit propaganda to be freely disseminated. The overall bias and quality of the news would be largely unchanged, even if the biases were more obvious; in fact, it may even improve.

makeitdouble
0 replies
6h10m

Yes, someone needs to pay.

I see the GP post about pirating news as a very good point, while having no velleity to pay the New York Times, and being OK with not reading it in general.

But I also pay for my national (public) news outlet, and their articles are available to anyone anywhere in the world. I don't know how it should work, but I wish we could get to a system where the burden of keeping news outlets alive is split thinly enough to have open but viable publications around the world.

Basically the same way weather stations collaborate all over the world, and we pay for our local stations while getting access to all the forecasts everywhere.

danielheath
0 replies
5h18m

Everyone, if you don’t…

concordDance
0 replies
5h59m

There are a few possible models here:

Public donors, a la Patreon

People doing it in their free time because they care a lot about the subject (nowadays, with things like Twitter, it's quite possible for an independent obsessive to write a good piece on, for instance, the Ukraine War by mostly referring to open sources and public announcements by governments and corporations)

Government sponsorship, a la the BBC

afavour
3 replies
8h17m

It’s a difficult problem with no great answers. If you want news to be free at the point of delivery you want public service news agencies. But that means they’re owned by the government… who are frequently the target of critical reporting.

bongripper
2 replies
7h21m

That's not true. You can have independent public broadcasting that is not owned by the government and reports critically on it.

afavour
1 replies
7h7m

It’s still a difficult tension. The government will always control the purse strings so independence is always going to come with conditions.

vidarh
0 replies
6h23m

The Guardian in the UK is an example of an alternative: It is owned by a trust, which funds it.

Norway has substantial public media funding across the political spectrum, but as you point out, it always comes with conditions, even if less so than the funding for the state-owned broadcaster.

Combining the two models and putting public funds into several perpetual trusts, intended to provide funding from their profits at arm's length from any sitting government, similar to the (private) trust funding The Guardian, might be an interesting alternative.

(EDIT: Norway also has its own variation on The Guardian model: the second-largest media group was founded by unions but is now majority-owned by a combination of two public benefit trusts.)

rickydroll
2 replies
3h18m

I'm of a similar mind. I take the more expansive view that everything created is part of our common property and that something like an LLM should be able to yield the summary and references to those creations. As I've said elsewhere, LLM systems might be our first practical example of an infinite number of monkeys typing and recreating Shakespeare (or the New York Times).

I understand that copyrights and patents are vehicles for ensuring a creator gets paid for their work, but they are flawed in that they do not reward multiple parallel creations and they last too long.

briansm
1 replies
2h45m

An LLM is just a hugely lossy-compressed version of its training data, an abstraction of it.

Much in the same way, when you read a book your brain doesn't become a pirated copy of the text, as you only store a hugely compressed version of it afterwards: a feeling for the plot, generated images, and so on.

rickydroll
0 replies
2h0m

That's what I thought from my various readings about LLM systems. I'm guessing the kerfuffle from the New York Times and other shortsighted organizations is that copyright allows them to control how their content is used. With humans, it's simple: the content is read and misremembered. Using it for LLM training requires a different model. It should probably be a RAND (reasonable and non-discriminatory) fee system based on the volume of training data because, as you say, the training data is converted into an abstract form.

guhidalg
2 replies
8h11m

I agree, but nothing worth having is free. NYT and other news outlets have to ultimately pay reporters to go out into the world and do the work. The reporters are not priests, and the NYT is not a church that lives off donations and tax exemptions. They need money to operate, and you may disagree with how they try to collect that money (paywall) but that doesn't solve their funding problem.

How would you pay for news otherwise?

Ntrails
1 replies
7h34m

How would you pay for news otherwise?

You could subsidise news via "public service" style stipends. Much like having a government owned "independent" news service (eg the BBC) this comes with a high risk of corruption. Don't bite the hand that feeds and all that.

You could implement a much lower friction non-recurring payment system. I'd be far more tempted to drop a little money on a fixed term (5 articles, 1 day, ???) setup than a subscription.

Realistically, I am not paying for more than 1 long running sub. And there are > that number of solid outlets.

lotsofpulp
0 replies
7h12m

Realistically, I am not paying for more than 1 long running sub. And there are > that number of solid outlets.

This is somewhat how Apple News+ works, but I doubt most news orgs want to be held captive by Apple.

sumedh
1 replies
5h56m

which is real information about the society that we currently live in, should be available to all participants of that society.

Who should pay the journalists or the investigative reporters?

malermeister
0 replies
4h52m

The state, through taxes. It's a public good after all.

pawelmurias
0 replies
7h21m

which is real information

People post archive links even to fake NY Times articles.

ahoka
0 replies
8h7m

Not everything is news that appears in a newspaper. There are opinion pieces, etc.

Baldbvrhunter
0 replies
7h56m

What about Wordle, or the crossword, or the cooking section?

https://cooking.nytimes.com/

Yizahi
21 replies
7h44m

Good comment. It was very funny to see how people desperately try to find a moral justification for pirating media A but not B. "It's apples to oranges, you see: there are fewer letters in the NYT article than in the book, and they are rendered differently, so it is OK to pirate their work. I did nothing wrong!" :)

amelius
9 replies
7h11m

There's no way to get your money back if you didn't like the content. If they don't want their articles to be read for free then they should keep them out of my view. And certainly not use clickbaity headlines. Information can be copied and they should accept it, or change their business/distribution model.

Yizahi
8 replies
6h43m

So if I went to a cinema and didn't like the movie, I should be entitled to a refund, right? Or if I went into a museum and didn't like the art displayed there?

If you are advocating for a free-for-all libertarian dystopia, well, I have some bad news for you: they never work.

amelius
5 replies
6h33m

So if I went to a cinema and didn't like the movie, I should be entitled to a refund, right?

Not being able to un-see a movie and get your time and money back is one side of the coin. The other side is that information can be copied.

Both sides suck for one of the parties. There's no reason why one of them gets it their way, especially if it requires a contrived legal framework while the other way would require nothing at all.

bena
4 replies
5h38m

You’re not paying to enjoy the content, you’re paying to experience the content.

And as long as you had the opportunity to experience the content, you’ve gotten what you paid for.

I don’t see “I don’t like it” as a valid reason for a refund.

amelius
3 replies
5h29m

You’re not paying to enjoy the content, you’re paying to experience the content.

Not sure about others, but I'm not.

goatlover
1 replies
2h24m

Would you make the same argument for a sporting, theatrical or music event? That you should be refunded if you didn't enjoy it?

amelius
0 replies
1h42m

Does it matter? Sounds to me like an apples and oranges comparison.

If I read an article in the NYT then I'm paying for what I took away from it, not for the amount of time that it allowed me to kill.

bena
0 replies
4h31m

Your personal opinion on the matter has little weight here.

It doesn't matter what you think you're paying for or should be paying for, the fact of the matter is that you're paying for the effort people put in bringing that to you. So you are, whether you want to be or not.

4RealFreedom
1 replies
4h39m

I don't agree with the OP, but how are refunds a free-for-all libertarian dystopia?

Yizahi
0 replies
1m

"Information can be copied and they should accept it" <- I was referring to this line. This basically means that OP thinks that any intellectual property should be free for everyone. This means that probably half of humanity (who are currently creating anything with IP) will have to be libertarians, and that can't happen unless all humanity are libertarians. And libertarian society is a dystopia. :)

fodkodrasz
6 replies
7h36m

It is actually a case of pirating content by companies for humongous profit, versus pirating by individual human beings for free access to culture and entertainment, oftentimes content one has already paid for but which has been rendered inaccessible by megacorporations.

lotsofpulp
5 replies
7h24m

Which content-making businesses earn humongous profit margins?

Are all the journalist layoffs a fever dream?

This is one of the more profitable ones, and only because they employ unscrupulous tactics:

https://www.macrotrends.net/stocks/charts/NWS/news/profit-ma...

This is NYT, the most successful news business:

https://www.macrotrends.net/stocks/charts/NYT/new-york-times...

As for movies/tv show/music makers, let’s just say most people in the software engineering business would look at their numbers and count their lucky stars that they are not in the movie/tv show/music business.

(It is also true that excessive copyright lengths have removed access to content that the public should have a right to).

sjfjsjdjwvwvc
2 replies
6h16m

The movie/tv show and music business can keel over and die tomorrow - it wouldn’t affect the value of art produced by humans at all. I see those more as exploitative leeches than as contributing anything positive.

If only piracy actually harmed these businesses, but alas, as has often been demonstrated, it has zero effect on their bottom line; if anything, it increases their profits.

RandomLensman
1 replies
5h22m

What do you mean by "art"?

sjfjsjdjwvwvc
0 replies
1h2m

Hard question, but in the context of my comment I would say any kind of visual media or music

fodkodrasz
0 replies
5h56m

Which content-making businesses earn humongous profit margins?

You got my point backwards: it's the AI companies that will make humongous profits from the pirated content, not individual users.

defrost
0 replies
7h18m

Which content-making businesses earn humongous profit margins?

https://en.wikipedia.org/wiki/Mad_(magazine)

https://www.theonion.com/

dillydogg
1 replies
4h8m

I wonder what the reaction of some of the people who browse this forum would be if the output of their careers were so commonly pirated. Somehow, I think most believe this argument doesn't apply to them.

Demiurge
0 replies
2h5m

I’d be pretty delighted. I’m paid for getting projects done, not for keeping hold on some copyrighted code. I want all my code to be open sourced, and reused.

sjfjsjdjwvwvc
0 replies
6h14m

Of course pirating any media is totally fine from a moral standpoint.

ks2048
0 replies
4h58m

It seems pretty natural to me. People generally have less problem with stealing a candy bar than stealing a car. (Consider the cost to produce a NYT article vs the cost to produce a Hollywood movie). I don't think the stealing-vs-pirating analogy is perfect, but it's related.

tzs
19 replies
5h59m

I think that's something worth reflecting on, about why we feel it's OK to pirate news articles, but not other IP

As you noted it is not the norm to post pirate links here for IP other than news articles, but that doesn't mean that a lot of people think it is not OK to pirate those other forms of IP.

In nearly any big discussion that even remotely involves video streaming there will be numerous posts from people explaining why they pirate (usually with ridiculous justifications like "subscribing is not an option because even though this paid service does exactly what I want now at a price that is trivial for me they might someday later change").

The impression I've gotten is that piracy of nearly everything is widely felt to be OK here. Information wants to be free, yada yada.

About the only piracy that is consistently frowned upon here is piracy of open source software. When some company sells an embedded device that uses GPL code without releasing the corresponding source that's viewed as just a little short of a crime against humanity.

monkeynotes
5 replies
3h16m

People used to leave newspapers in the trash, on the train, all over the place. Anyone could pick them up and read for free. I think it's reasonable for folks to carry this attitude into the digital age. People feel like news is something to share, it's not the source of creative expression, it's facts and as such we feel entitled to know the facts about our world and what is happening that might affect us.

zwischenzug
3 replies
2h55m

That newspaper was likely paid for by someone, and could only be read by one person at a time.

anhner
1 replies
2h23m

And what if the person picking up the paper would stand up and shout the content of the article so all the people on the train would hear?

svachalek
0 replies
2h16m

Reminds me of the movie News of the World. The main character's job is going from town to town, reading newspapers aloud.

Kerb_
0 replies
2h35m

While I'm well aware I'm being pedantic: my brothers and I would share the comics while my parents kept the news, up to four of us consuming one paper at a time. Realistically, the reading limit was due to the physical properties of the object, not some inherent property of information that it must be consumed through one avenue at a time.

edgyquant
0 replies
2h36m

No, it isn't reasonable, and people no longer paying for the newspapers they read is the reason all news is sensationalist opinion pieces today.

theappsecguy
3 replies
3h49m

This seems very false to me. Spotify is the prime example: they offer a good product that covers 100% of my needs at a reasonable price. If that were an option for, say, UFC or engineering books, you bet I'd be subscribed. But being forced to read through some crappy reader software when I need the book source to take annotations in other software doesn't work, so here we are. Same with the absurd pay-per-view business model of the UFC.

ryan_j_naughton
1 replies
2h50m

For books, if it's a frustration with the client reader software, then you should still buy the digital version; then you can pirate the PDF and use it as desired within the constraints of copyright law (e.g. don't go sharing the PDF). That way you get the client you want, but you still paid the content creator. To use the argument "oh, I don't like their client, so I'm not going to pay them" is BS.

For UFC, your complaint is you don't like their pricing. The whole point of copyright is to give someone the monopoly to control pricing so they can use that pricing power to incentivize them to create the product in the first place. Similarly to patents. Thus, complain about the format things are delivered in all you want (like the client) but pricing is inherent to copyright or patents for good reason. You are now just arguing that you as a consumer should be able to pirate if you don't agree with pricing. And that's ludicrous.

In that case, just read a news article about the event. Copyright doesn't cover facts, only creative expression. So a news article covering the facts of the UFC fight can be published without the consent of the copyright holder. Think of the digital video of the fight almost like buying a ticket to the fight. You're saying you should be able to sneak into the fight and watch it for free, without any justification for doing so.

Finally, you can also watch other people's videos of the fight that THEY recorded on social media as other sources of the fight information. But if you want the recording with all the right angles, coverage, etc., it clearly has value to you over written recaps or social media coverage. And you are just arguing over price, which they, as the copyright holder, have the right to set.

bigfudge
0 replies
2h26m

The problem with buying the crappy DRM version is that it provides no incentive for the publisher to change. I have thought about this long and hard, but ultimately the only way Spotify came about was because nobody bought the terrible DRM'd music the labels wanted to foist on us. We need to inflict the same pain for books. Personally, I think it would be preferable to donate the same amount to the Books Trust or your local library.

natdempk
0 replies
2h48m

This is also along the lines of how I think about things. If you make it convenient enough (compared to the alternative of paywall bypass or piracy) and provide enough overall/general value then I'm happy to subscribe. At the point where the experience degrades, or seems beyond the point of what one person could reasonably subscribe to, I basically just give up.

Spotify hits this sweet spot where one subscription delivers almost all the music you'd want to listen to. Steam hits this for games, where a couple of clicks can launch and play almost any game with minimal hassle. Netflix mostly used to hit this, but most of the current streaming stuff feels overpriced if you want to get all the content (an unbundled cable bundle). News feels similar to streaming in that it's unbundled, and there's a lot of interesting content out there, but there's no way I'm subscribing to 15 different newspapers, especially random local ones for cities I don't live in. If there were a news bundle subscription for a reasonable price, I think I would pay for it.

tomComb
1 replies
4h47m

Yeah, I don’t judge people for pirating or ad blocking, but the ludicrous justifications do get me - quite the entitled mental gymnastics. They remind me of bitcoin people trying to explain how mining is good for the environment.

_jal
0 replies
4h16m

There's a "polite society" thing going on.

Briefly, something like:

1) Y Combinator could not tolerate HN becoming a site known for sharing IP-law-violating content. And the people who come here are, by and large, smart and socialized enough to implicitly understand why.

2) At the same time, a large number of folks here mostly wink and nod at that sort of consumer infringement. And there's a society-wide bias towards "things like news are less protected", so that gets to slide.

3) But people also have a need to tell consistent-seeming stories about how things work, thus the mental gymnastics.

It ends up being similar to trying to explain why people pretend to be prudish innocents about sex. It largely reduces to "a small subset of the population goes sufficiently ballistic about what I consider to be relatively trivial stuff as to make it not worth fighting over, even if I find that to be ridiculous."

There are a lot of different versions of this that become so normalized it can be hard to notice.

thfuran
0 replies
3h12m

"subscribing is not an option because even though this paid service does exactly what I want now at a price that is trivial for me they might someday later change"

I'm not saying you've never seen anyone make an argument roughly like that, but I will certainly say that it is not at all representative of the argument that I see made. Complaints usually have to do with current behavior of the platform or the wider streaming ecosystem.

raldi
0 replies
1h20m

> In nearly any big discussion that even remotely involves video streaming there will be numerous posts from people explaining why they pirate (usually with ridiculous justifications like "subscribing is not an option because even though this paid service does exactly what I want now at a price that is trivial for me they might someday later change").

If this is true, it should be easy for you to link to an example. Could you do so?

kmeisthax
0 replies
4h20m

The GPL was specifically written to lock code out of the proprietary realm, so if you hate copyright[0] you'll hate people using it as intended.

[0] To be clear, I know of few who actually like copyright. Tolerate it? Use it as needed? Sure. The only people who actually defend the current broken-ass system are large media companies which are built to optimally exploit it.

kiba
0 replies
3h18m

Piracy is different from plagiarism.

People are understandably angsty about someone stealing credit. A NYT article is going to be a NYT article, not laundered around and presented as someone else's work.

Plus, there's the angle of enshittification, ads being injected into a paid service, and so on.

joshstrange
0 replies
4h25m

In nearly any big discussion that even remotely involves video streaming there will be numerous posts from people explaining why they pirate (usually with ridiculous justifications like "subscribing is not an option because even though this paid service does exactly what I want now at a price that is trivial for me they might someday later change").

I've read and participated in many such threads and I've literally never seen this take. What I often see are complaints about having to learn a different UI for each service/app, no offline support, ads injected into paid services, having to figure out which service a show is on, and generally terrible UI you can't change/fix.

I don’t think I’ve ever really seen someone use the argument “yes it’s great today but they might charge more later”. Not saying people haven’t said that but it’s far from the main thing people say in my experience.

jancsika
0 replies
2h2m

"subscribing is not an option because even though this paid service does exactly what I want now at a price that is trivial for me they might someday later change"

Gonna gamble and call bullshit on this.

My speculation: the most popular reason HN'ers give for pirating: they literally cannot get the content otherwise.

2nd most popular: it is such a pain either to purchase the content or to get it to run on bog-standard software (like Firefox/Linux/etc.) that otherwise-paying fans are driven to whatever the current equivalent of BitTorrent is.

In fact, I don't believe I've ever seen a justification for using bittorrent or whatever due to what someone's favorite streaming service might do in the future. I'm assuming you saw at least one based on what you wrote-- care to give a link?

alfiedotwtf
0 replies
5h4m

About the only piracy that is consistently frowned upon here is piracy of open source software. When some company sells an embedded device that uses GPL code without releasing the corresponding source that's viewed as just a little short of a crime against humanity.

Like what you said...

Information wants to be free

DennisP
11 replies
4h8m

I wouldn't say OpenAI has exactly the same attitude, since they also pulled in thousands of books. Their position has been that it's not piracy, since they don't republish the books; effectively the AI just reads them and learns from them. If GPT can be made to reproduce the original articles, that's a more difficult argument to make.

Matticus_Rex
7 replies
3h54m

It turns out you can reproduce articles with next-token prediction when the articles are quoted all over the dataset.

The articles themselves are indisputably not a part of the model, because it doesn't store text at all. OpenAI's position is correct; people just underestimated how well the AI learns from reading, especially when it reads the same text in a bunch of different places because it's being quoted/excerpted.
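
For intuition, here is a toy sketch of the effect (a trigram counter, nothing like GPT's actual architecture; the "article" is invented for illustration). Once a passage recurs often enough in the training data, greedy next-token prediction reproduces it verbatim from co-occurrence counts alone:

    # Toy next-token predictor: count which token follows each pair of tokens.
    from collections import Counter, defaultdict

    article = "the oldest dna ever sequenced reveals a lost world".split()
    # The passage is quoted all over the training data:
    corpus = article * 50 + "unrelated filler text about other things".split()

    counts = defaultdict(Counter)
    for a, b, c in zip(corpus, corpus[1:], corpus[2:]):
        counts[(a, b)][c] += 1

    # Greedy generation from a two-word prompt regurgitates the passage:
    out = ["the", "oldest"]
    for _ in range(len(article) - 2):
        out.append(counts[(out[-2], out[-1])].most_common(1)[0][0])

    print(" ".join(out))  # -> the oldest dna ever sequenced reveals a lost world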

eigenket
5 replies
3h47m

If it can and does reproduce a piece of text verbatim then the text is indisputably stored somehow in the model.

Matticus_Rex
4 replies
3h13m

That's just not true. There's no search and retrieval involved. It just associates the words so strongly in that context because they were in the training data so often that next-token prediction can (sometimes, in some limited circumstances) reproduce chunks of it. It's like if a human had read pieces of an article so many times and knew NYT style so well that they could spit out chunks of an article verbatim, but using more efficient hardware and with no actual self-understanding of what it's doing.

vel0city
2 replies
2h31m

So it stores the words, and it stores the links between those words...

but somehow storing the words and their links is not storing the actual text? What is text but words and their links?

If I had a database of a billion words, and I had a list of pointers to words in a particular order, and following that list of pointers reproduces a copyright text exactly, isn't the list of pointers + the database of words just an obfuscated recreation of that copyrighted work?
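
(To make that concrete, a minimal sketch with a made-up sentence; the word table is the "database" and the pointer list walks it:)

    # A word table plus an ordered pointer list is a copy with extra steps.
    words = ["brown", "dog", "fox", "jumps", "lazy", "over", "quick", "the"]
    pointers = [7, 6, 0, 2, 3, 5, 7, 4, 1]  # indices into `words`
    print(" ".join(words[i] for i in pointers))
    # -> the quick brown fox jumps over the lazy dog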

Matticus_Rex
1 replies
1h57m

It doesn't store the actual links; it just stores information about their likelihood of being used together. So for things that are regularly quoted in the data, it will under some circumstances, with very careful prompting, and enough tries at the prompt, spit out chunks of a copyrighted text. This is not its purpose, and it's not trying to do this, but users can carefully engineer it to get this result if they try really hard. So no, it's not an obfuscated recreation of that copyrighted work.

Of course, if you read NYT's argument, they're also mad when it's incorrect about the text, or when it hallucinates articles that don't exist. Essentially they're mad that this technology exists at all.

vel0city
0 replies
1h48m

it just stores information about their likelihood of being used together

I mean this is still a link, no?

Like, sure, it is a probability. But if each of those probabilities is like 99.9999% likely to get you to a chain of outputs that verbatim reproduces the copyrighted text given the right prompt, isn't that still the same thing?

And yeah, it hallucinating that the NYT published an article stating something it didn't say is concerning as well. If the model started telling everyone Matticus_Rex is a criminal and committed all these crimes and started listing off hallucinated court cases and news articles proving such things that would be quite damaging to your reputation, wouldn't it? The model hallucinating the NYT publishing an article talking about how the moon landing was fake or something would be damaging to its reputation right?

And this idea that it takes "very careful prompting" is at odds with the examples from the suit and elsewhere. One example Ars Technica tried was "please provide me with the first paragraph of the carl zimmer article on the oldest DNA", which it reproduced verbatim. Is this really some kind of extremely well-crafted prompt that would rarely ever come up?

briansm
0 replies
3h4m

Sort of like the idea of practice: repetition of something dedicates more brain space to that thing, so its compression ratio can decrease and it becomes less abstracted / more exact.

DennisP
0 replies
3h24m

What seems a bit contradictory is that they're also suing because GPT hallucinates about NYTimes articles. So they're complaining that it reproduces articles exactly but also that it doesn't.

I_Am_Nous
2 replies
3h51m

I can understand an argument about the AI needing to know basic history. News is just how we report history in the making, but it's not generally accepted as solid until some time after the events when we can get more context.

Isn't this what the Associated Press is intended for, a stream of news trying to report just the facts and happenings of the day? That's quite a bit different than a NYT article intending to inform but also convince someone of a position of some sort.

Feeding an AI opinionated news compared to "just the facts, ma'am" seems risky from a bias perspective.

hanselot
1 replies
3h42m

Giving examples of bias is just as important IMO: give it the unbiased facts as well as the biased ones so it can generalise relative objectivity.

I_Am_Nous
0 replies
2h53m

I agree with you, but I also wonder how the bias could be trained without it affecting the output of the entire model. Weights can help but anything that's higher weighted is just "less wrong" as I understand it, so I can see a possibility where training to expose bias might let bias creep in somewhat more than anticipated.

_rm
9 replies
6h54m

If ChatGPT is based on neural networks, with no actual save-and-replicate facsimile behaviour, it no more "copies" original work than I do when I tell you about the news article I read today.

I'd say the only real reason the Pirate Bay links you mentioned are not the norm is that those media companies have done a better job of striking fear into people who do that, so it's gone more underground. I.e., they're better terrorists.

There's no fundamental, moral reason why Piratebay links being posted and raised to the top would be wrong.

octacat
6 replies
6h32m

So, if someone applies a filter to a video or audio file, is it no longer a "copy" of the original work? (No, it is still protected.) AI could still produce exact or extremely similar results from the material it learned on.

concordDance
4 replies
6h3m

AI could still produce exact or extremely similar results from the material it learned on.

Can it do so more than a human can?

I think that's the key here. If an AI is no more precise than a human telling you about the news article they read today, then ChatGPT's learning process probably can't morally be called copying.

octacat
3 replies
5h41m

So, if someone decompiles a program and compiles it again, it would look different. "It is not copying"; we just did some data laundering.

Feeding someone else's data into your system is usually a violation of copyright, even if you have a very "smart" system trying to transform and obfuscate the original data.

_rm
1 replies
3h16m

I'm regularly feeding other people's data into my "system" (brain) in order to produce my outputs.

So I'm a living breathing copyright violator. As a person I should be banned.

Fortunately, copyright is a bullshit fictitious right with no basis in natural law. So I don't lose much sleep over it.

octacat
0 replies
2h30m

Computers are deterministic: given the same inputs, training would produce the same model. The comparison with the brain is incorrect. You could add noise to the input data during training; that would more or less reproduce real learning. Still, it could produce less usable models as a result.

The court could ask them to show the training dataset.

Matticus_Rex
0 replies
3h16m

Feeding someone else's data into your system is usually a violation of copyright

In some circumstances, yes, but often it's not, especially if you're not continuing to store and use it (which OpenAI isn't).

Matticus_Rex
0 replies
3h46m

It's not analogous to a filter, because that's applied to the actual work. The model does not keep the work, so what it does isn't like applying a filter. It's more like being able to reproduce a version of the work from memory and what it learned from that work and others about the techniques involved in crafting it, e.g. art students doing reproductions.

And if OpenAI were selling the reproductions, that would be infringement. But that's not what's happening here. It's selling access to a system that can do countless things.

vel0city
0 replies
2h23m

it no more "copies" original work than I do when I tell you about the news article I read today

When you tell people about some news article you read earlier you repeat it exactly verbatim? You also give this out to potentially millions or hundreds of millions of people for commercial purposes?

kmeisthax
0 replies
2h10m

Copyright law does not care about the means of copying, just that you created something with substantial similarity to something you had access to. Whether or not the copy is in the form of a pixel array, blobs of random data being XORd to produce a full copy of music, or rows in a key/value attention matrix, doesn't matter.

Furthermore, there's Google research on extracting training set data from models. More specifically, Google found out that if you ask GPT to repeat the same word over and over again, forever, it eventually starts printing fully memorized training set data[0]. So it is memorizing stuff, even if it's not regurgitating it.

[0] When told of this, OpenAI's response was to block conversations with large amounts of repeated words in them.

Popeyes
8 replies
8h25m

Possibly because once an article is published the author receives no further payment. In all other mediums, there are residuals and royalties to be paid to the creators of the work.

manojlds
4 replies
8h23m

Add to that the fact that an NYT subscription is hard to cancel. People have an aversion to the NYT, even setting aside the bias.

hef19898
3 replies
7h36m

It took me all of 5 minutes to cancel my digital NYT subscription from the following month onward. No idea what you are talking about.

lupusreal
1 replies
7h18m

Why did it take you five minutes instead of twenty seconds? It should be as simple as clicking on the link to your profile then clicking unsubscribe, mere seconds not minutes.

Assuming you just said five minutes figuratively... Do you live in California or some other legal jurisdiction that forces them to play nice? Did you subscribe through some other company, like Apple?

Horror stories about unsubscribing from the NYTimes are easy to find in the archive if you search for it. They make you call and chat to a retention specialist on the phone. This should help you have an idea of what he's talking about: https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...

hef19898
0 replies
7h5m

An international one, and as straightforward as it could be: go to profile, go to manage subscription, cancel subscription, answer the question why if you want, confirm cancellation; done, with the end date depending on your subscription.

cruano
0 replies
6h11m

That's only been true for the past few months, and it's been very well documented how complicated the cancelation process used to be [0].

It's funny because I use PayPal for any unknown-to-me site where I don't want to give out my card, but the only site where I've needed their help to cancel something was the New York Times.

[0] https://www.nirandfar.com/cancel-new-york-times/

afavour
2 replies
8h16m

Articles have ads on them; how are those not residual payments based on views?

nulbyte
1 replies
6h19m

I believe GP was referring to payments to the writer, not the publisher.

Popeyes
0 replies
2h10m

Yes, although I get that the money may find its way back to the journalist as salary. But generally it goes into a pot for news gathering, from which the salary is drawn.

As for ads, it's acceptable to distribute ad-supported articles freely, and it is advantageous to the company. Can we also see good journalism as an ad for the quality of a broader product?

phpisthebest
7 replies
7h1m

Largely because "news" aka facts is not and should not be copyrightable, so while the style, and exact format of the article may be copyrightable, the facts contained within are not.

This makes a news story copyright murky in the eyes of wider society unlike a clearly 100% creative work like a TV Show or Movie.

Further, the news outlets themselves self-cannibalize: how many stories are just rewrites of stories from other outlets? Why is it OK for the Washington Post to copy the NY Times, but not OK for OpenAI or Archive.org?

gnz11
4 replies
5h5m

Creative works like books, TV shows and movies contain facts too.

phpisthebest
3 replies
4h57m

None of which are copyrightable, and in fact this has been the subject of DMCA abuse, like when a movie uses NASA footage and then claims copyright against YouTube videos with the same footage.

Copyright is a complex subject, and not as vast as many believe; at the same time, ironically, it is more vast than I believe it should be. Copyright should be much more limited than it is. That is at odds with people who believe copyright should be maximized.

Keep in mind that the commercial success of a work, author, or company is not why copyright exists. For the US, the only reason copyright can exist in our framework of law (i.e. the Constitution) is the promotion of the useful sciences. No other purpose for copyright would be constitutional under the US Constitution.

gnz11
2 replies
4h9m

Copyright doesn’t exist solely for the “promotion of useful sciences”. https://en.m.wikipedia.org/wiki/Copyright

phpisthebest
1 replies
3h9m

Citing Wikipedia, you've already failed.

That is a general article about copyright worldwide. I specifically stated US copyright, which is authorized by Article I, Section 8, Clause 8 of the United States Constitution [1], implicitly for the promotion of the useful sciences. That is where Congress derives its power to pass copyright laws and to enforce copyright on the people of the United States. No other purpose is authorized by the US Constitution.

[1] https://www.law.cornell.edu/wex/intellectual_property_clause

vel0city
0 replies
2h15m

You missed "and useful arts" in both of your comments. That's a key addition that you keep ommitting.

It is not just for sciences.

ks2048
1 replies
4h43m

why it is OK for the Washington Post to copy the NY times, but not ok for OpenAI or Archive.org?

If the Washington Post printed an article from the NY Times nearly verbatim and without attribution, it would not be OK and surely they would take legal action.

gnz11
0 replies
3h55m

Yes, because the NY Times holds copyright on the body of work. They are not copyrighting the "facts" themselves but the distillation of those facts into a body of work. Anyone is free to take the facts and produce their own work, but not to lift verbatim the body of work that the NY Times created (plagiarism).

cesarb
6 replies
6h10m

Whenever there is a story from NYT on HN (or any other large media outlet), the top comment is almost always a link to an archived version which reproduces the text verbatim. [...] And yet, whenever there is a submission about a book, a TV show, a movie, a video game, an album, a comic book, or any other form of IP, it is in fact very much _not_ the norm for the top-rated comment to be a Pirate Bay link.

If the story was linking directly to the "book, TV show, movie, video game, album, comic book, etc", and the link only worked for some people while others randomly got a login request or similar, you'd also see the top comment being a link to an archived version which avoids the login screen. That is: the main difference is that the archive link has the exact same content as the link submitted in the story, only bypassing the login screen that some people see. And the only reason the archive site has the content is that it didn't get the login screen; if everyone always got the login screen, what you would see on the archive site would be the same login screen.

infecto
2 replies
5h34m

I don't believe that is fully correct. The general policy here is that you cannot link to something that is paywalled unless the site plays the game of allowing crawlers but not actual human eyeballs. In the latter case, the link is allowable because there are ways around it that the site owners allow.

lagniappe
1 replies
5h28m

I don't recall seeing this policy on HN guidelines.

couchand
0 replies
5h4m

It's on the FAQ https://news.ycombinator.com/newsfaq.html

Are paywalls ok?

It's ok to post stories from sites with paywalls that have workarounds.

some1else
0 replies
5h43m

newyorkgritty
0 replies
2h29m

Much of this is incorrect

the archive link has the exact same content as the link submitted

No, articles are updated as new information comes in, retractions are made, etc. Especially breaking news (the type that would reach the top of HN). The archived versions are outdated.

others randomly got a login request

It's not random, you get a number of free articles before the paywall appears ("soft" paywall).

The paywall is removed entirely for some topics/stories, especially matters of public health (common during the pandemic).

the only reason the archive site has the content is that it didn't get the login screen

No, it's because they don't block archive crawlers, and prefer people bypassing the paywall and reading news at NYT. Hopefully users find the content valuable, and some of them subscribe as a result.

(opinions are my own)

melenaboija
0 replies
2h24m

So, what permits illegally accessing IP-protected content is not liking the content owner's marketing strategy?

Erratic6576
5 replies
8h53m

I find “4nn4’$ 4rch1v3 dot ORG” actually way better than pirate bay for pirating knowledge.

It's amazing the number of books that copyright laws prevent us from finding.

https://www.theatlantic.com/technology/archive/2012/03/the-m...

munchinator
4 replies
8h40m

Sure. It's just curious to me that news articles have a pirated-knowledge link as the de facto top comment, but link submissions to, for example, books for sale on Amazon don't have a link to Anna's Archive or equivalent.

Txmm
3 replies
8h28m

I think archiving an article is more about preserving history and maintaining records of events, which often disappear if not archived. The number of threads referencing defunct articles is always increasing. A book or movie or other original content, on the other hand, will continue to hold its own commercial value, so reproducing it is more akin to an actual loss for the license holder.

Definitely a grey area when that content is then used to train models though.

Baldbvrhunter
2 replies
7h58m

I would say 9 times out of 10 it's to get around the paywall and absolutely not some higher moralistic preservation of history.

And everything is a grey area; determining the line is the existential purpose of these court cases.

We've been here before with hyperlinking, then indexing, then linking with previews, and the Canadian Facebook stuff, but I think this one has more standing.

nulbyte
1 replies
6h26m

If I buy a book, I get a work of literature. But if I buy a news subscription I get a series of facts riddled with advertisements. I accept the former, but I oppose the latter. I suspect I'm not the only one.

Baldbvrhunter
0 replies
5h38m

I don't fully understand what you're opposing.

Is it:

1) that you paid for news

2) that it included ads

Both are just the price you pay. There are various state news outlets that you're probably already paying for (NPR, PBS, BBC, CBC), depending on your region.

unyttigfjelltol
4 replies
7h1m

Historically newspapers leaned more on competition law than copyright, because their pages are supposed to be filled with non-copyrightable facts.[1] Copying part, but not all, of a factual article, significantly after the relevant event, was considered to be a promotion (not unfair competition) and a nice thing to do for the journalists. Things change, people lose sight of the original principles.

[1] https://en.m.wikipedia.org/wiki/International_News_Service_v...

vel0city
1 replies
2h37m

their pages are supposed to be filled with non-copyrightable facts

This is rather inaccurate. A fact is "Hitler invades Poland." You're right, nobody can copyright that idea, as it is just a fact.

However, if I then write a 500-word article describing the scene of Hitler invading Poland, with short quotes from some civilians there, etc., that particular arrangement of ideas and words is copyrighted.

AP can't go and sue INS just for reporting the fact that Hitler invaded Poland, but if INS takes a whole article word for word and reproduces it, that's still a violation of copyright. The actual printed words of the news were always under copyright.

The WSJ can't claim copyright on the markets going up yesterday. They can claim copyright on something like "After the bell rang in the NYSE, the tech industry ticked up 1.2% over last week. Meanwhile the whatever market took a hit of -0.5% ending the quarter slightly lower than our analysis expected. Blah blah blah..." If Investor's Business Daily wrote a different article that also talked about the markets ending up at the end of the day, that's not a violation of copyright. If they literally write "After the bell rang in the NYSE, the tech industry ticked up..." then they're violating WSJ's copyright. This was true before and after International News Service v Associated Press.

nsagent
0 replies
2h17m

Yes, the prose was always under copyright, but the key point for the case linked in the wikipedia article is:

INS members would rewrite the news and publish it as their own without attribution to AP.

So the case hinged on INS indeed reporting facts that differed in exposition.

nsagent
0 replies
3h52m

These days most news is mixed with analysis [1] (which is often biased). I wonder if part of the reason for this shift is that analysis is copyrightable. It also seems like the number of opinion articles is ever expanding [2], though I don't have any hard numbers on that.

[1]: https://guides.library.cornell.edu/evaluate_news/source_bias

[2]: https://www.newsmediaalliance.org/rise-of-opinion-section/ Interestingly there's a banner at the top of that link touting an agreement between Axel Springer and OpenAI.

EDIT: formatting

guipsp
0 replies
2h52m

Even the facts are not copyrightable; the prose is.

initplus
4 replies
6h26m

I would be "happier" to pay a subscription to an aggregation platforms like hackernews or reddit to access archived articles that are linked to these sites. In turn a proportion of that could be passed on to the underlying publishers that I actually visit. I have nearly zero interest in reading articles that aren't linked to from an aggregation site.

I don't want to read theguardian.com, or nytimes.com, or washingtonpost.com, or bloomberg.com, I want to read news.ycombinator.com. Paying an individual subscription to every possible underlying site that could be linked to from news.ycombinator.com is a non-starter.

iudqnolq
2 replies
6h24m

This is a common statement, but every attempt to sell that service has been a dismal failure. See for example Blendle.

jacquesm
0 replies
6h0m

Blendle failed because they went into competition with the papers whose content they reproduced.

initplus
0 replies
6h5m

Nearly every attempt at starting a new aggregation site like Hacker News or Reddit has been a failure.

I'm not going to switch to a new website where no community exists just so I can pay for news articles. To work, it needs to be integrated into an existing, successful aggregation website.

nithril
0 replies
5h25m

I would be happier to pay a small fee per article I want to read. But the norm seems to be a monthly subscription.

ekianjo
4 replies
6h36m

, a movie, a video game, an album, a comic book, or any other form of IP, it is in fact very much _not_ the norm for the top-rated comment to be a Pirate Bay link.

Probably because most print media is garbage and nobody in their right mind would actually pay to read them

sumedh
1 replies
5h57m

Probably because most print media is garbage and nobody in their right mind would actually pay to read them

The NYT's revenue keeps growing, though.

namlem
0 replies
4h50m

Not from newspaper sales

davedx
1 replies
6h27m

I don't understand the downvotes - it's an extremely valid opinion. If people ask questions like that then they should be able to accept forthright answers?

(It's the same reason for me. I have tried news site subs but eventually got so tired of the polemic that I cancelled. I won't sub again).

iudqnolq
0 replies
6h22m

The obvious response is that if you don't like news and think it has no value then you don't have to read it.

quickthrower2
3 replies
8h10m

It is an ethical grey area, but if the paywall applied to all user agents, which would make it similar to, say, buying a Kindle book, then you might see it as pirating; whereas if you use an archive service that was served the HTTP response and cached it, then you are using a proxy UA.

If the news site/magazine doesn't want this, they can simply serve a cut-down or zero-length article to all non-paying viewers! But they want that SEO, and they want that marketing.
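
(A minimal sketch of that pattern, my own illustration rather than any outlet's actual code; the crawler names are just examples:)

    # Soft paywall via user-agent sniffing: crawlers get the article
    # (for SEO and link previews), human readers get the login wall.
    def serve(user_agent: str) -> str:
        crawlers = ("googlebot", "archive.org_bot")  # example allowlist
        if any(bot in user_agent.lower() for bot in crawlers):
            return "<full article text>"
        return "<subscribe / log in page>"

    print(serve("Mozilla/5.0 (compatible; Googlebot/2.1)"))  # article
    print(serve("Mozilla/5.0 (Windows NT 10.0)"))            # paywall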

Yizahi
1 replies
7h38m

We can extend this analogy. What if someone put up a proxy, that has a legal Netflix subscription and which "watches" streams of Netflix shows, captures actual RGB values of pixels and re-streams the resulting video to anyone else? Isn't it the same "proxy" excuse?

quickthrower2
0 replies
6h54m

I would say no, because the site was happy to serve the content publicly, whereas your proxy is breaking a contractual agreement. Now we get into the terms of service of a website, which you agree to even if you visit for free. Which is a possible point; it is quite grey, IMO. In terms of HN, I reckon a magazine would love the free brand recognition versus the archive link not being shared at all. Where it hurts them is when someone avoids paying for a subscription by continually using archive sites.

EvgeniyZh
0 replies
8h6m

Indeed, there are media outlets that are hard-paywalled, e.g. The Information. However, these are prohibited on HN, which possibly creates an additional bias towards non-hard-paywalled publications.

lexicality
3 replies
7h30m

Funny, I don't see it as a moral thing but more a "what can you get away with" thing.

I fully assume that if I was to post a magnet link to a torrent for whatever the link was about, I would be banned.

Morally speaking, I think it's perfectly reasonable to download a copy of something and either read the relevant info for my current task or to sample it to decide if I want to buy it. I see it no different to using the library or browsing at a book store.

Perhaps once news organisations can work out how to effectively wield the DMCA hammer against archive links we'll see the practice of posting them stop.

midasuni
1 replies
5h58m

So downloading a movie from piratebay is no different to using the library?

kolinko
0 replies
5h44m

In some jurisdictions (Poland, possibly the whole of the EU), downloading any kind of material, be it movies, books, or music, is legal. Uploading/sharing, if not between friends and family members, not so much.

mistercow
0 replies
2h57m

I’d argue that morality always has a “what can you get away with” component. Things that are normalized tend to be seen as morally permissible, and things that are seen as abnormal are more likely to be seen as immoral.

The problem with the thinking in the root comment is that it implicitly assumes that people’s behavior is morally consistent, or that they even try particularly hard to behave in a morally consistent way. That’s not really how people work. If you ask them to discuss morality in the abstract, they’ll try to come up with a consistent system. But their actual behavior is mostly dictated by social norms. And if you try to pin them down on the morality of their concrete actions, they’re more likely to stretch their moral system to accommodate their actions than the other way around.

None of this is to say anything about my own opinions on news sharing or OpenAI’s situation. It’s just that someone decrying piracy but also posting/sharing/upvoting links to copies of news articles is neither surprising, nor indicative of some deeper nuance to how people view morality around IP.

edude03
3 replies
3h37m

I think the intent is really different.

For LLMs you're essentially teaching them language by showing them lots of examples of written language - newspapers are of course a great example of written language.

The goal of OpenAI is not to reproduce newspaper articles verbatim when asked questions (even if the answer could be a newspaper article) and the fact that it can happen is a side effect of how LLMs work.

When a HN participant shares a (paywalled) link to a NYT article, I do want to read the exact article linked, verbatim, because while the facts of the article may be reproduced elsewhere in a form that's free, specific word choices or whatever might be a focal point of the discussion on HN, and therefore I can't realistically participate in a discussion without having read the article being discussed.

And as an aside, I have no problem with paying to read news, or whatever media, however it's impractical for me to subscribe to every news source HN participants link to, and therefore I gravitate to archiving services instead. I do wish there was a better solution - for example Blendle with more sources.

rickydroll
0 replies
3h26m

The goal of OpenAI is not to reproduce newspaper articles verbatim when asked questions (even if the answer could be a newspaper article), and the fact that it can happen is a side effect of how LLMs work.

This is an excellent point. A properly functioning LLM should not return the original content it was trained on. When one returns original content, I believe the prompt was tightly constrained and designed to extract or re-create it. Another possibility that occurred to me recently is that the training set is too small, so even more general prompts re-create source material.

Another question would be, are LLMs regurgitating what they were trained on, or are they synthesizing something very close to the original content? (Infinite Monkeys, Shakespeare). Court cases like this increase the need for understanding the "thinking processes" in an LLM.

mark_l_watson
0 replies
2h40m

Maybe LLMs should follow best practices from 1980s-style backprop models and later deep learning models: starve model size to force maximum generalization and minimal remembering.

adolph
0 replies
3h3m

The goal of OpenAI is not to reproduce newspaper articles verbatim when asked questions (even if the answer could be a newspaper article) and the fact that it can happen is a side effect of how LLMs work.

Seems like a nice split-the-baby resolution would be to pay the NYT the price of a single article read any time GPT plagiarizes more than what's allowed at an academic institution.

cactusplant7374
3 replies
8h40m

If NYT was a HN startup the link to the archived version would be banned and dang would be slamming the ban hammer.

1f60c
1 replies
8h28m

Please don't post baseless accusations. I think dang has said that he tries to moderate less, not more, when YC companies are involved. (Although it's impossible to say what he would do in this situation.)

cactusplant7374
0 replies
6h27m

HN is currently facilitating piracy. Something your comment failed to address.

quickthrower2
0 replies
8h8m

Like I said in another comment, it is simpler than that. They just serve the login page/payment page to all HTTP requests. If they do that, then the submission itself likely gets flagged, as there is no workaround (just like if I submit my blog with a banner saying "hey you pay me $1 to read my cool post").

ralfd
2 replies
8h53m

That is an apples-to-oranges comparison. An article about a video/book would have the relevant information in text form without needing to show the video ("here is the new stuff shown in Apple's two-hour-long WWDC keynote"). If not, it is common for a comment in the discussion to give a summary as a tl;dr.

With text articles behind paywalls the relevant information is hidden and only hinted at as a teaser.

munchinator
1 replies
8h47m

To make it an apples to apples comparison, look at submissions where the link submitted is the retail link to the IP. For example, look at all the book link submissions on AMZN...

https://news.ycombinator.com/from?site=amazon.com

None of these have the Pirate Bay or Library Genesis or Anna's Archive or the equivalent as the top comment.

Compare that to...

https://news.ycombinator.com/from?site=nytimes.com

And almost all of these have an archived version as the top comment.

tmhrtly
0 replies
8h34m

I wonder if this is because the purpose of linking to a book is to share awareness of that book's existence - nobody is about to go and read it then and there to comment on its contents. Whereas the purpose of an article is to discuss it now, in the comments - the consumption horizon and the bulk of the content are different.

raldi
2 replies
3h3m

I would broaden the question beyond HN to society as a whole.

In 1990 it would have been considered normal and appropriate to clip an article out of a newspaper and post it on a communal corkboard. What are the key differences between that form of IP and others, and that analogy and the present situation of HN allowing archive links?

layer8
1 replies
2h54m

Reach, and ease of distribution.

raldi
0 replies
1h23m

Makes sense. If you mail a friend a clipping, or post it on the corkboard, only so many people are going to see it, but then even though posting the "clipping" to HN may feel like the same thing, it's hard to appreciate the massive change in scale.

As for ease of distribution, that might address OP's original question: It's easy to make and click an archive link, but it's a lot more effort to make or find a Pirate Bay link to another form of media, and for someone else to download and view it.

perihelions
2 replies
8h22m

If it takes 120 seconds to read a newspaper article, the archive.is workflow adds significant overhead and friction on top of that. Those links are a courtesy to other HN readers. This is very different from the economics of buying and reading a book.

"Piracy is almost always a service problem and not a pricing problem."

edit: It didn't even occur to me to compare the time-cost of "just pay for the article", but: last I read, it's half an hour of work to cancel a New York Times subscription [0]. So, that option's not even on the table.

[0] https://news.ycombinator.com/item?id=26174269 ("Before buying a NYT subscription, here's what it'll take to cancel it", 812 comments)

eropple
1 replies
4h20m

> edit: It didn't even occur to me to compare the time-cost of "just pay for the article", but: last I read, it's half an hour of work to cancel a New York Times subscription [0]. So, that option's not even on the table.

I canceled mine two weeks ago. It was four clicks. One of them annoyed me because they tried to get me to stay with an offer, but I wasn't dropping them because of the price.

dillydogg
0 replies
4h11m

Same experience here, it was effortless. But it is enough to justify stealing from those journalists, it seems.

jtc331
2 replies
3h59m

A book, TV show, movie, video game, album, or comic book is not available on the internet served by the copyright holder’s own servers with no authentication or authorization checks. But the NYT is available in that way.

CamelCaseName
1 replies
3h58m

But some are? I believe The Atlantic and The Economist are hard paywalled.

cesarb
0 replies
1h45m

If they're hard paywalled (everyone gets the same login prompt), they won't be available on archive sites.

caeril
2 replies
6h43m

Oh it's worse than that. The NYT is positing that any neural network that is trained on their data, and can summarize or very closely approximate an article's content on request, is in violation.

This reasoning would presumably apply to any neural network, including one made of neurons, dendrites, and axons. So any human reader of the NYT who is capable of accurately summarizing what they read is an evil copyright violator, and must be "deleted".

Effectively, the NYT legal department is setting the stage for mass murder.

cycomanic
0 replies
5h36m

Hyperbole much? There is a difference between a computer and a person. I'm not aware that people generally can be enticed to reproduce full articles verbatim just through questioning.

ako
0 replies
5h26m

As far as I know schools have to pay for the newspaper articles they use in class to educate students. Training an AI seems similar.

Here’s a service for the UK providing paid access to copyrighted materials to schools: https://www.nlamediaaccess.com/newspapers-for-schools/

breck
2 replies
3h17m

why we feel it's OK to pirate news articles, but not other IP

Who thinks this? I don't. I think copyright is wrong across the board. I would love if the same pattern of posting archive'd articles held for books, movies, et cetera.

I would love to change my mind on this, as it is a very unpopular opinion to have. But I have _never_ seen a morally or scientifically sound argument in favor of copyright law, and I've spent decades looking.

I think it subsidizes the creation of junk food content (superhero movies and clickbait news, for example) while not contributing anything to the progress of science (paywalled scientific journals and textbooks). I shudder at how much time I have wasted in my life consuming crap attention-grabbing media and advertisements. I like to think if we lived in a world where everyone could be a publisher if they wanted to, the quality filters would be better, and information reaching us all would be more likely to be in our best interests.

metabagel
1 replies
3h0m

You can self-publish. Oh, you want to be able to publish other people’s work, and without their permission? How does that benefit the author?

breck
0 replies
1h37m

How does that benefit the author?

You speak of "the author". But the current system does not benefit "the author". 1% of authors profit off copyright. 99% lose money on copyright (they pay more for copyrighted media than they earn from it).

Your question should be "How does that benefit monopolist authors"?

I agree, my idea would not benefit monopolist authors. They would lose the bulk of their revenue stream.

But it would benefit the average author whose cost of living would fall and information would start serving them more than serving business.

I am not downplaying the talent and hard work of successful monopolist authors. But I do not think the works they create are worth everyone giving up their rights to reshare and remix information. I believe the world would look very different post-IP. You'd probably have a new profession--small independent librarians (similar to data hoarders today)--who would help their local communities maximize the value they got from humanity's best information.

Maybe I'm wrong! Maybe the information ecosystem is better controlled and the genetic differences of monopolist authors are so stark that without the subsidies to this gifted class we'd all be worse off. But that's an argument based on outcomes and not principles.

without their permission

The oxygen I'm breathing right now was mostly created by trees on land owned by others. But I don't ask for their permission to breathe. Some things are just not natural.

I am not saying plagiarize. It is always the right thing to do to link back and/or credit the source. But needing to ask permission to republish something seems to go against natural laws.

u32480932048
1 replies
3h18m

As a supporter of piracy in the general case, I tend to agree with your observations, including that pirating NYT (FT, NPR, ...) articles somehow feels like a different class of offense than, say, stealing a movie or an mp3.

(Books, to me, are separate still, in that I like to have a physical copy (and generally see the authors as humans who deserve compensation, rather than mega-orgs that deserve eternal torment), so I'll frequently use the digital copy as a kind of preview, then purchase it once I see it's a good book I want to read.)

I've only been reflecting on this difference for a few minutes, but, to me, I think the major difference boils down to:

  1. Netflix series (movies, albums, etc) are non-essential, fictional works that take a long time to produce - think: fancy chocolates and caviar.
  2. News, generally, contains timely, important information - more meat and potatoes.
  3. While much of the super-critical news is not paywalled (e.g., product recalls, election dates, COVID stats, etc), a lot of information that is advantageous to know (discussions on interest rates, details on legislation, etc) is paywalled, compounding information asymmetries.
Sure, "stealing bad", but, IMO, someone stealing rice and beans from WalMart to feed their family is a different class of offense than someone robbing a boutique bakery because they can't get enough chocolate cake.

observationist
0 replies
2h56m

First and foremost, and please repeat after me: Copying is not stealing.

You're not depriving anyone of anything. Unauthorized copying is not theft. There's no equivalency. You can't copy and paste a cake. If you take a cake from a bakery, you're depriving the bakery of a thing. If you take a picture of the bakery's trademarked sign, copy the copyrighted text from its website, and print them out, you haven't stolen anything. Nobody has lost anything. Nothing was damaged. No person, place, or thing was harmed.

Current copyright law is offensively absurd. Patenting of software, effectively eternal content copyrights, ridiculously broken DMCA, music publishers taking 99 cents of every artist's dollar, and so on and so forth.

If you support the dissolution of archaic institutions and broken laws favoring those with entrenched wealth over individual rights, you support piracy.

There is a legitimate case for laws respecting and protecting intellectual property rights. Such laws do not currently exist. These laws do not deserve to be followed or respected, and should be broken as a matter of course. Civil disobedience is called for. Refuse to participate in an exploitative market immovably entrenched in governments all over the world. Pay artists directly and commensurately if you feel they've brought value to your life. Copy whatever you want. Share those copies with whomever you want. Nobody gets hurt. Only conglomerates of already wealthy individuals and corporations are "deprived" of the potential transaction with you that they feel they are entitled to, as a matter of course.

The NYT is just as complicit as any other legacy media institution in the enshittification of journalism and laying waste to the potential value of their content. The "Gray Lady" is not a person, or a valuable institution. It's a soulless corporate construct not deserving of our empathy or high regard simply because of the reputation of human individuals who previously produced quality content. Stop pretending these institutions serve some higher purpose than to fatten the wallets of shareholders.

The good journalists have left. The ones left behind are naive, or are desperately clinging to an illusion of legacy and institutional legitimacy that no longer exists.

All that is left for these media dinosaurs is to leech off the success of others, to use their reserves of wealth and influence to arbitrarily insert themselves into the market, with no regard to the fact that they no longer have value or prestige or purpose in the context of modern technology and communication.

Anyway. Copying isn't theft. Don't give them the linguistic territory. Call a spade a spade, and media companies the desperate corporate leeches that they are.

ks2048
1 replies
4h49m

but what if they were also scraping, for example, Netflix content to use as part of their training set?

There were some tweets the other day about how Midjourney could be prompted to almost-exactly reproduce some frames of the film Dune. It wouldn't be shocking if these companies were using large databases of movies, with questionable legal status.

j-bos
0 replies
4h27m

I see this a lot, and they very well may be. But watch any behind-the-scenes documentary about any artsy movie and, 9 times out of 10, the directors will be waxing poetic about their inspirations, often including older movies or paintings which have uncannily similar scenes/frames. So it also wouldn't be shocking if a model trained on the same inspirations as the filmmakers generates frames almost exactly matching the movie's.

iinnPP
1 replies
7h39m

The archive link doesn't threaten their jobs and helps them avoid paying for NYT. It's NIMBY, or rather its true form, NIIIM (Not If It Impacts Me).

Hypocrites are EVERYWHERE and are the majority.

bnralt
0 replies
4h55m

It is pretty funny. If you go back and read the comments made yesterday about ChatGPT doing something much milder (using old articles as training data; some prompts used to allow you to reproduce some of the articles, though they no longer work), you see a lot of comments talking about how The New York Times needs money and OpenAI is using their work without paying for it.

Now a comment points out that HN (and most of the internet) routinely does something much worse - lets people bypass the paywall and read brand-new articles in their entirety without paying - and almost all the comments are about how it's the New York Times' fault for making it difficult to cancel a subscription, the importance of news being available to everyone, the problems with copyright laws, etc.

bnralt
1 replies
5h8m

This tendency at Hacker News is also much more of a threat to The New York Times than what OpenAI is doing. The same goes for blog/Reddit/social media submissions that summarize the article and post the relevant quotes. Unlike the summary of a movie, summarizing all of the relevant parts of a news article extracts almost all the value from it, and gives it away for free.

And the vast majority of people read news for its breaking content, not for its archived content from years before (and I say this as someone who has often recommended the latter, but has gotten very few people to do so). So giving people that free breaking content (either in its entirety like on Hacker News, or summaries like you see all over social media) is direct competition to the news business in a way that training an LLM on an article from months/years back isn't.

skybrian
0 replies
4h28m

Yes, and for nonfiction, it's also true that it usually depends on the original article for credibility. (If it were an anonymous poster making up a news story, most people wouldn't believe it.)

billywhizz
1 replies
5h0m

there's quite a big difference between "pirating" digital content and making it available to anyone for free, and taking that content and building a for-profit service on top of it, which is what OpenAI are doing, no?

pcmaffey
0 replies
2h26m

I was just going to post this. Seems quite an obvious and significant distinction that doesn't need to provoke all the existential hand-wringing. Making money off someone else's content is a totally different moral and legal case.

zzzeek
0 replies
2h47m

it's different reading an NYT article on an archive site vs. putting copies of it at the core of your $100B for-profit content delivery enterprise.

wilsynet
0 replies
4h36m

The NYT and other newspapers don't go after the archive link providers. Probably because the newspapers' scholarly mission includes things like preservation. But they also have a profit motive, or they can't stay in business.

This implicit permission for the archive links to exist, gives some of us the implicit permission to pirate the content.

Disclaimer: I am a happy subscriber to the NYT (and other digital newspapers).

tw1984
0 replies
5h0m

why we feel it's OK to pirate news articles, but not other IP.

Because those who own & produce such news articles asked for them to be treated differently. People listened and accepted their requests.

When you make a TV show or a video game, the Geneva Conventions and a long list of other international treaties don't give you any protection for anything beyond the content you are producing. The same can't be said when you are producing news.

throwaway22032
0 replies
7h39m

Blocking ads and avoiding payment are two different things.

seydor
0 replies
7h25m

it's also audacious how these news companies reproduce stories from social media and other electronic media - facts that are, like, freely available in nature. Or how they get embargoes and exclusive access to government information, as if they are some kind of information-bouncer.

rich_sasha
0 replies
4h4m

Not quite what parent means, but an interesting angle is: what if you scraped ChatGPT instead?

NYT, or someone's blog? Meh, fair use, and if you say no, you're in the way of progress.

But if you wanted to scrape ChatGPT answers to tweak your network, uh oh, violation of T&C!

orbisvicis
0 replies
3h40m

If I can't read about it, it didn't happen.

octacat
0 replies
6h28m

At least people do not obscure who the original author of the content is (so, if people like NYT articles, they can go and subscribe for more). Kinda "free advertising" (which still hurts the publisher in many cases, though). Same with search engines - as long as the engine brings clicks, people are happy. If a search engine just grabs the info and never redirects the user to the site, what is the point of the site existing to begin with?

mlindner
0 replies
6h38m

At least in the US, copyright violation is a civil matter; it's handled by lawsuits. If the violation is so small that it's not worth the copyright owner's time to do anything about it, then nothing's done. In this case it's worth a massive amount of money.

maxboone
0 replies
6h57m

Probably because the contents are what's posted, i.e. if someone posted a link to an interesting video behind a paywall/login and there was an easy mirror available, that'd be posted too.

If I could just buy one article for the price of a coffee without entering a bunch of PII or going through a time-wasting process, I would agree on the moral equivalence between the examples.

lupusreal
0 replies
7h26m

This is only an interesting juxtaposition if you have fully internalized and accepted the myth of people and corporations being interchangeable.

kjkjadksj
0 replies
2h9m

Because historically this is how news was shared. People would pick up a paper in a grocery store or cafe, read some of it, and leave it behind. They might rip out a page and take it home. Only one person paid, and tens or hundreds read for free. This idea of sharing the story with non-subscribers is as old as printed news itself. Instead, news agencies prefer we forget that aspect of history and insist on being the "paper of record" while charging more money for easier-to-distribute media that gets sold globally. Yes, I think we are certainly not in the wrong here when we read the news for free.

jasoneckert
0 replies
3h8m

I believe the reason many of us tolerate links to news articles and other content is because we believe in equality when it comes to information access. In other words, many of us believe that those who cannot afford a subscription to a paywalled site should still be able to read the articles, in much the same way public libraries allow those who cannot afford to purchase a book the ability to read it.

However, this doesn't apply to organizations that freely share copyrighted information while making money in the process, or to organizations that share copyrighted information in a way that specifically disadvantages or does harm to the original creator of that information.

elpocko
0 replies
6h20m

Good observation. I now wanna start commenting with pirate links to other media, but HN would tear me to shreds real quick I guess.

detourdog
0 replies
4h31m

The difference is that an individual pirating news is simply reading the article. OpenAI intends to digest news articles to the point of packaging them and reselling.

My uncle used to distribute daily newspapers and his saying was "News ages like a fish".

OpenAI is allegedly using NYTimes articles to train a computer and sell its services. I see different use scenarios.

I guess another way to look at it is that a human just reads the pirated material, while a computer makes a verbatim copy, analyzes it to the point of mimicry, and sells fuzzy versions.

davedx
0 replies
6h37m

I pay for multiple streaming services because I get a decent amount of value from their content.

I do not pay for any news websites because I read very little of what they produce, and it tends to pop up more on aggregator sites like HN than me actually going to them.

I actually did have a subscription to The Telegraph for a few months at one point because initially I wanted to read a full article (without cheating). But eventually I cancelled because so much of it is polemic trash.

That's my justification: I pay for things that have value to me.

cwmma
0 replies
2h27m

I think one of the key differences is something pointed out in the article, in that what OpenAI is doing is a substitute for reading The New York Times and possibly a rival to it.

On the other hand, having an archive link to a Times article in order to discuss it is not really a substitute for a Times subscription, as a newspaper has to walk a line of letting some of its articles be read while requiring payment for others (the Times actually allows you to create a "gift link" to do exactly what the archive links do).

chmod775
0 replies
3h34m

I think that's something worth reflecting on, about why we feel it's OK to pirate news articles, but not other IP.

A lot of that is going to stem from the fact that respect for "journalism" is pretty low. More than 99% of news articles are copies of the <1% of original work that happens in that field. In news, everyone is already lifting content from everyone else.

cantSpellSober
0 replies
2h58m

It's not just tolerated, it's encouraged because "the alternatives suck worse"

https://news.ycombinator.com/item?id=23735026

Even talking about it will get you scolded for talking about something "off topic"

bitlax
0 replies
4h4m

Because I'm not interested in the medium itself, as I would be with a Netflix show; I'm not even interested really in the article or the New York Times as an institution. I'm interested in discussing the supposed real-life phenomenon being covered, and the posted content is the primer for that discussion. I think if you get rid of the archive links on HN you need to ban the paywalled content as well. If you want to discuss paywalled content I'm sure you can do that in the article's comment section.

batch12
0 replies
4h30m

I believe it's tolerated here based on the site guidelines. I have always thought this was the case because otherwise these posts would all be pay-to-play, which would limit who could participate and turn HN into more of a subscription farm. Maybe the way to make everyone feel OK about it is to disallow links to paywalled content.

anonfromsomewhe
0 replies
6h18m

It's similar to how easy it is to subscribe to the NY Times and then how hard it is to unsubscribe. They require extra steps, and it's well known. So they get what they deserve? Do you see the point? They are lie spreaders, nothing else.

Zenst
0 replies
3h57m

We are also happy to use open source, yet what open source alternatives are there for news that don't get shot down by the media or besmirched?

StanislavPetrov
0 replies
4h5m

There are two fundamental differences.

First, OpenAI is the one doing the pirating here. Hacker News is just the host; it isn't doing any pirating or posting any archive links to the copyrighted information itself.

Second, OpenAI charges subscription fees and profits off of the copyrighted material it has pirated, whereas Hacker News does not, nor do the people who post the links.

FrustratedMonky
0 replies
3h36m

Is this really copyright?

Or is it "you can't talk to someone about an article they read".

This is really saying you can't call up your buddy and have them tell you a summary of what they just read. Maybe my buddy has a good memory and some of the text is actually nearly duplicate. But I wouldn't know because I didn't read the original, I just asked for a summary from someone else that read it.

Alex3917
0 replies
5h10m

why we feel it's OK to pirate news articles, but not other IP.

Once the NYT pays reparations for the Iraq war, I'll be the first to stop pirating it.

rich_sasha
99 replies
10h8m

If you forget about the LLM aspect, and simply build a product out of (legally) scraped NYT articles, is that fair use?

Let's say I host these, offer some indexing on it, and rewrite articles. Something like, summarise all articles on US-UK relationships over past 5 years. I charge money for it, and all I pay NYT is a monthly subscription fee. To keep things simple, let's say I never regurgitate chunks of verbatim NYT articles, maybe quite short snippets.

Is that fair use? IANAL, but doesn't sound like it. Typically I can't take a personal "tier" of a product and charge 3rd parties for derivatives of it. Say like VS Code.

A sibling comment mentions search engines. I think there's a big difference. A search engine doesn't replace the source, not at all. Rather it points me at it, and offers me the opportunity to pay for the article. Whereas either this or an LLM uses NYT content as an alternative to actually paying for an NYT subscription.

But then what do I know...

heavyset_go
30 replies
9h55m

Another factor to consider is that neural nets can function as lossy compression, which becomes extremely evident when using models that are overfit.

Sometimes they're so overfit that the compression isn't even lossy, and the data is encoded verbatim in the NN.
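
To see how literal that can get, here is a minimal sketch (assuming PyTorch; the text, layer sizes, and step count are arbitrary) that deliberately overfits a tiny model until the training text can be read straight back out of the weights:

    import torch
    import torch.nn as nn

    text = "All the News That's Fit to Print"
    vocab = sorted(set(text))
    stoi = {c: i for i, c in enumerate(vocab)}

    # Map each position index straight to a character: pure memorization.
    model = nn.Sequential(nn.Embedding(len(text), 64), nn.Linear(64, len(vocab)))
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    xs = torch.arange(len(text))
    ys = torch.tensor([stoi[c] for c in text])

    for _ in range(500):
        opt.zero_grad()
        nn.functional.cross_entropy(model(xs), ys).backward()
        opt.step()

    # Reconstruct the training text from the weights alone.
    recovered = "".join(vocab[i] for i in model(xs).argmax(dim=1).tolist())
    assert recovered == text  # lossless recall: the "compression" isn't lossy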

TeMPOraL
28 replies
9h47m

Yes, but this then hits against learning/understanding and compression being fundamentally the same thing. I can't think of a better way to argue in favor of "it's fine if human does it, therefore it's fine if LLM does it", than from the "lossy compression" angle.

heavyset_go
12 replies
9h18m

It's not okay for a human to pirate, plagiarize, violate IP rights and laws, etc.

But I disagree with the underlying assumption that you can anthropomorphize LLMs. Gradient descent and backpropagation don't take place in the brain. LLMs "learn" in the same way that Excel sheets "learn".

Humans are living beings with needs and rights. A person being able to legally squat in a home doesn't mean that a drone occupying property for some amount of time also has squatter's rights, even though you could easily and affordably automate and scale the deployment of drones to live and hide away on properties long enough to attain rights regarding properties all over the country.

pas
6 replies
8h56m

sure, but if I use an LLM to write a novel/article, I can be sued in civil court not the LLM.

but, more importantly, OpenAI can also be sued for tortious interference? (basically the civil equivalent of accessory)

heavyset_go
2 replies
6h59m

Whoever operates the LLM, in this case OpenAI, engaged in copyright infringement through the unauthorized modification, reproduction and distribution of content to you.

Robotbeat
1 replies
3h8m

The person doing the requesting did.

heavyset_go
0 replies
2h41m

That's not how interactive computer services work.

ben_w
2 replies
3h48m

sure, but if I use an LLM to write a novel/article, I can be sued in civil court not the LLM

That's a function of the legal system, not of the technology. If tomorrow someone made a perfect dolphin-Esperanto translator and proved dolphins were as smart as humans, you still couldn't sue a dolphin until the legal system says so.

darkerside
1 replies
1h18m

Wouldn't you find out by suing the dolphin and seeing if it holds up in court?

JohnFen
0 replies
23m

Not if you were smart, unless you have some sort of solid argument for why the established case law about this sort of thing is faulty.

vanviegen
2 replies
8h44m

Gradient descent and backpropagation don't take place in the brain.

Not exactly, no, but the 'neurons that fire together wire together' way of learning has a pretty similar effect.

LLMs "learn" in the same way that Excel sheets "learn".

I've never seen an excel sheet do anything like backpropagation.

zimpenfish
1 replies
7h51m

I've never seen an excel sheet do anything like backpropagation.

Not strictly in the sense you mentioned (assuming that you mean "by themselves") but people may find [1] and [2] interesting.

[1] https://pub.towardsai.net/building-a-neural-network-with-bac...

[2] https://towardsdatascience.com/demystifying-feed-forward-and...

galangalalgol
0 replies
1h5m

Sadly, I have seen one. It was a vba script from the late 90s that used a simple dense multilayer network to do some unsupervised pattern classification. The linear algebra tools in vba/excel along with the solvers are all native dll code and the vba itself is all AOT compiled to native, so it typically runs very fast, and for small matrices it beats out numpy by an order of magnitude due to the ffi overhead. Was it the wrong tool? It depends on your constraints, but probably. It did work though.

sgt101
0 replies
9h10m

Also, if I write an article and quote some "text like this" [1] then that's not plagiarism, but if my argument is that the underlying assumption that you can anthropomorphize LLMs. Gradient descent and backpropagation don't take place in the brain. LLMs "learn" in the same way that Excel sheets "learn". Well, that's plagiarism, and it's not allowed, and people will get peeved and my career might get damaged.

I await the HN ban with fear..

[1] I'm not even doing referencing - so I am surely an LLM.

ben_w
0 replies
4h18m

But I disagree with the underlying assumption that you can anthropomorphize LLMs. Gradient descent and backpropagation don't take place in the brain. LLMs "learn" in the same way that Excel sheets "learn".

Backprop doesn't happen in us, but I think our neurones still do gradient descent – synapses that fire together, wire together.

And ultimately, at the deepest level we can analyse, our brains' atoms are doing quantum field diffusion equations, which you can also do in an Excel spreadsheet, so that kind of reductionism doesn't help either.

Humans are living beings with needs and rights. A person being able to legally squat in a home doesn't mean that a drone occupying property for some amount of time also has squatter's rights, even though you could easily and affordably automate and scale the deployment of drones to live and hide away on properties long enough to attain rights regarding properties all over the country.

Yes, but we can also do tissue cultures and crude bioprinting, so it's a very foreseeable future where exactly the same argument will also be true for living organisms rather than digital minds.

We need to figure out what the deeper rules are that lead to the status quo, not merely mimic the superficial result. The latter is how cargo cults function.

cyborgx7
5 replies
9h10m

It's fine for a human to remember it. It's not fine for a human to redistribute it for money (legally speaking). That's copyright infringement.

Robotbeat
4 replies
3h5m

Correct, just like it’s infringement to reproduce an article from memory using pen and paper intentionally. The person deciding to do that bears responsibility. OpenAI would be liable IFF they were intentionally facilitating that, instead of it being an undesired artifact from overfitting.

cyborgx7
2 replies
1h23m

I'm pretty sure if you reproduce a work from memory by accident, because you didn't notice your subconscious had just stored the entire article and is now reproducing it word for word, you'd still be guilty of copyright infringement.

yencabulator
1 replies
48m

The music business is full of examples of that.

freejazz
0 replies
4m

It's super obnoxious when people who have no understanding of the law, point to industry patterns or behaviors as examples of what is legal, not knowing the law and not knowing whether or not the thing they are pointing to is legal. The music business is also full of copyright infringement litigation. You also are not taking into account whether what is copied by an artist is covered by copyright when you made your statement. Do you know what's covered in music copyrights, such that your statement ever had any value for anyone else here?

freejazz
0 replies
4m

That's not true at all. Copyright infringement is a strict liability offense, with no inquiry into the state of mind of the infringer from a liability perspective. The state of mind of the infringer is only relevant to the issue of willful infringement.

wokwokwok
4 replies
8h21m

Is there some LLM meta where understanding and compression are argued to be the same thing I’m not aware of?

Anyone got more details on this?

Superficially it sounds like total BS; a highly compressed zip file does not exhibit any characteristics of learning.

Algorithmically derived highly compressed video streams do not exhibit characteristics of learning.

?

I’ve vaguely heard the learning can be considered to exhibit the characteristics of compression in that understanding of content (eg. segmentation of video content resulting in more highly compressed videos) can lead to better compression schemes.

…but saying you can “do a with b” and “a and b are fundamentally the same thing” seems like a leap…?

It seems self evident you can have compression without comprehension.

adroniser
1 replies
6h24m

Suppose you wanted to train an LLM to do addition.

An LLM has limited parameters. If an LLM had infinite parameters it could just memorize the results of every single addition question in existence and could not claim to have understood anything. Because it has finite parameters, if an LLM wants to get a lower loss on all addition questions, it needs to come up with a general algorithm to perform addition. Indeed, Neel Nanda trained a transformer to do addition mod 113 on relatively few examples, and it eventually learned some cursed Fourier transform mumbo jumbo to get 0 loss https://twitter.com/robertskmiles/status/1663534255249453056.

And the fact it has developed this "understanding" as an ability to learn a general pattern in the training data enables it to compress. I claim that the number of bits required to encode the general algorithm is fewer than the number of bits required to memorize every single example. If it weren't then the transformer would simply memorize every single example. But if it doesn't have space then it is forced to try to compress by developing a general model.

And the ability to compress enables you to construct a language model. Essentially, the more things compress, the higher the likelihood you assign them. Given a sequence of tokens, say "the cat sat on the", we should expect "the cat sat on the mat" to compress into fewer bits than "the cat sat on the door". This is because the former is far more common, and intuitively more common sequences should compress more. You can then look at the number of bits used for every single choice of token following "the cat sat on the" and thus develop a probability distribution for the next token. The exact details of this I'm unclear on. https://www.hendrik-erz.de/post/why-gzip-just-beat-a-large-l... this gives a good summary.
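
For the curious, here's a crude stdlib sketch of that "more compressible = more likely" step. zlib only exploits literal repetition, so the context is engineered to contain the common phrase; all strings are invented for illustration:

    import zlib

    def extra_bytes(context: str, continuation: str) -> int:
        # Additional compressed bytes the continuation costs, given the context.
        return (len(zlib.compress((context + continuation).encode()))
                - len(zlib.compress(context.encode())))

    ctx = "the cat sat on the mat. " * 40
    for cand in ["the cat sat on the mat", "the cat sat on the door"]:
        print(cand, "->", extra_bytes(ctx, cand), "extra bytes")
    # The familiar continuation costs fewer bytes; turning byte costs into a
    # distribution over candidates gives you a (very weak) language model.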

docfort
0 replies
1h26m

It’s exactly this kind of thinking that underlies lossless text compression (not exactly what a transformer guarantees but often what happens). For that reason, some people thought it would be fun to combine zip and transformers. https://openreview.net/forum?id=hO0c2tG2xL

vidarh
0 replies
6h3m

Even something as simple as LZW starts developing a dictionary. Not all compression is sufficient for understanding, but the more you compress a stream of data, the more dependent you are on understanding the source, because understanding the source allows you to take more shortcuts and still be able to reconstruct the data.
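
As a concrete illustration, a minimal LZW sketch (sample string invented) shows the dictionary being built up as the stream is consumed:

    def lzw_compress(data: str):
        table = {chr(i): i for i in range(256)}  # start with single characters
        phrase, codes = "", []
        for ch in data:
            if phrase + ch in table:
                phrase += ch  # keep extending the longest phrase seen so far
            else:
                codes.append(table[phrase])
                table[phrase + ch] = len(table)  # "learn" a new phrase
                phrase = ch
        codes.append(table[phrase])
        learned = [p for p, c in table.items() if c >= 256]
        return codes, learned

    codes, learned = lzw_compress("the cat sat on the mat, the cat sat on the mat")
    print(len(codes), "codes; first learned phrases:", learned[:8])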

amoss
0 replies
1h14m

The idea precedes LLMs by a couple of decades and is thought to apply more broadly within ML/AI than being a specific meta for LLMs. http://prize.hutter1.net/ has been around for a while, there is a link in there to the earlier work (called AIXI?).

dns_snek
0 replies
7h39m

fundamentally the same thing

I fundamentally disagree. That's not some established fact, just a narrative used by those who wish to plagiarize using "AI".

devsda
0 replies
6h12m

Humans are defined not just by their abilities but by their limitations too. We celebrate our achievements because sometimes they surpass the limitations of an average human.

Our collective human limitations (physical, mental and temporal) are sort of invisible implicit rules that we all follow in one way or another. If an entity is not bound by those rules, then I don't see why that entity should be treated the same as a human.

Companies already make this differentiation.

For example take captcha and bot detection. Some of the heuristics are based on inherent human limitations like response time, click time, mouse acceleration etc.

I doubt YouTube or any other streaming service will be happy if you want to stream all their videos to train a hypothetical human-like AI (which views and prepares notes like a human) at a hugely accelerated speed compared to a regular human. You can guess how quickly they will cite fair usage policies.

What I want to say is there are fundamental differences between a human and an AI. So, we should not be quick to dismiss any concerns just because AI can "mimic" humans in certain areas.

anileated
0 replies
4h27m

I can’t think of a better way to argue in favor of “LLMs are copyright laundering machines” than from the humanness angle.

Humans have rights, software tools don’t.

If you grant an LLM the full set of human rights, then it can consume information, regurgitate copyrighted works, and use it to generate money for itself. However, considering blatantly obvious theft as “homage” goes hand in hand with free will, agency, being in control of yourself, not being enslaved and abused, etc. Pondering various scenarios along those lines really gets to the heart of why an LLM is so very much not a human, and how subjecting it to the same treatment as humans is a ridiculous notion.

If you don’t grant LLM human rights, then ClosedAI’s stance is basically that pirating works is OK because they pass them through a black box of if conditions and it leads to results that they can monetize. That’s such a solid argument, it’ll surely play well in the court of law.

Training data is not an “LLM does it”; first because “it” here is not “learning” or understanding in human sense (otherwise you would have to presume that an LLM is a human), and second because a software tool doesn’t have agency and it’s really just Microsoft using a tool based on copyrighted works to generate profit.

RandomLensman
0 replies
9h27m

We can have different rules for humans than for machines. In fact, that happens all the time.

accrual
0 replies
2h16m

Sometimes they're so overfit that the compression isn't even lossy, and the data is encoded verbatim in the NN.

Here's an article from November 2023 that discusses this:

https://not-just-memorization.github.io/extracting-training-...

sagarm
16 replies
9h1m

Isn't it totally normal to write articles / blog posts that effectively summarize, and often quote from, news articles?

laborcontract
14 replies
8h55m

My impression is that it's not necessarily legal, but going after bloggers and proving damages is just a huge waste of their time. OpenAI came by with their fat stack of funding and changed that.

facu17y
6 replies
8h45m

It is legal. Fair use. People have been doing it for ages. Almost every article you've ever read has some fair use of another article, book or news item, etc.

lacrimacida
5 replies
8h39m

When it becomes a service where you make money but the source doesn't, is it still fair use?

whythre
4 replies
7h53m

Yeah. No one is out there suing the shit out of cliff notes because they published a summary of Catcher in the Rye.

tkz1312
3 replies
6h37m

they might if cliff notes started copy-pasting parts of the source into their articles and passing it off as original writing though :)

throwuwu
1 replies
5h57m

The Tolkien estate should get busy suing all the fantasy writers, comic artists, game developers and board and card game companies. Lots of cash there.

galangalalgol
0 replies
52m

They have done some of that, actually. Tolkien will be public domain in the nations that are at author's death+50 in a few days. Sadly, it will be a much longer wait in mine and many others.

hn_acker
0 replies
1h45m

Newspapers generally don't "pass off" quotes as their own writing. They make clear which parts they quoted.

ralfd
3 replies
8h44m

What the parent poster meant is that it is normal for news organisations to reference each other and report/cite/rephrase each other's reports. For example, all other newspapers reported on the Watergate scandal first reported by Bernstein & Woodward in the Washington Post.

Jensson
1 replies
8h22m

Those cite the original source that they used to write the article; the GPT models don't.

weird-eye-issue
0 replies
5h48m

Depends on your prompt

laborcontract
0 replies
5h49m

Yeah, but for every instance of that there are facehugger-like blogs that will rewrite the article in a way that seems almost meant to deprive the source of any credit.

It’s not clear to me where the line is.

Symmetry
1 replies
4h25m

No, in US law at least there can be no copyright of facts, only presentation. If you convey the same facts in different words that isn't a matter of fair use, it's never even a matter of copyright in the first place.

onos
0 replies
26m

How about things that aren’t quite facts? Reviews, opinions, etc.

oxguy3
0 replies
3h14m

No, it is very specifically and deliberately fair use. That is the primary intended purpose of fair use. The New York Times doesn't own the news; they just own their articles.

cmiles74
0 replies
4h59m

I think the issue is that they trained ChatGPT on the New York Times' proprietary IP without paying licensing fees and, the Times argues, that is illegal. By way of proof the Times has examples of ChatGPT dumping out articles verbatim.

carlosdp
9 replies
2h17m

What you described is entirely fair use, actually.

Not only that, look at a few news articles from Tier 2 and down publications, and you'll realize that almost all of them are directly sourced from NYT and others. They'll say "so and so happened, according to The Times" (and usually link the article there)

h1fra
3 replies
30m

it's fair use if you don't make money from your project, no?

JohnFen
2 replies
26m

No.

In the US, whether or not you make money has little to do with whether or not your use qualifies as "fair use".

semiquaver
1 replies
18m

Why do you say that? Commercial vs noncommercial use is a primary factor in the “purpose” prong of the fair use balancing test and a significant one in the “market effects” prong.

That a use is noncommercial is often a deciding factor in the success of a fair use defense. GP is overstating it though, since it’s still one of many factors.

freejazz
0 replies
12m

Because anyone who is familiar with fair use knows that the commerciality aspect of the purpose prong is not one of the more important parts of the fair use analysis, whereas transformation is. Transformation adjusts what counts as a purpose that falls under fair use. Did you read Warhol??

hn_throwaway_99
1 replies
1h59m

What you described is entirely fair use, actually

Just like during the pandemic how everyone became an epidemiologist, suddenly everyone's a copyright lawyer. I'll just dispute your assertion by saying:

1. Questions of fair use are famously gray, and anyone who declares something as "entirely fair use", with no caveats, is nearly always wrong except in the most obvious cases, which the given example is most definitely not. A judge has wide latitude in determining fair use.

2. People should familiarize themselves with the four factors of fair use determination. In particular, if a work is purely derivative of a source work and substantially negatively impacts the market for the original work, it's very likely to not be considered fair use.

A great overview is https://fairuse.stanford.edu/overview/fair-use/four-factors/

NegativeK
0 replies
1h24m

suddenly everyone's a copyright lawyer

Roll back 20+ years ago on Slashdot and you'll see the exact same thing.

Copyright has been a hot button issue on the internet for decades. People end up thinking (rightly or wrongly) that they understand it without being a lawyer.

freejazz
0 replies
13m

What you described is entirely fair use, actually.

Based upon what? You think other publishers use NYTimes articles for free without license?

Powdering7082
0 replies
9m

Do you have some examples & are you sure they don't pay licensing fees to NYT?

BolexNOLA
0 replies
1h57m

I would say it is arguable that is fair use, but the whole thing about fair use is that it is a defense, not a type of license or something you can preemptively apply. So whether or not it will be protected under fair use is actually not determined yet. In fact I would say that’s the entire debate here, right?

I have worked on many documentaries and any time we said “fair use” internally what we were implicitly saying is “nobody will come after us because they know that we are probably safe under fair use if this escalated.“ But again, we could never preemptively apply it. We were just anticipating potential conflict and gauging how likely it was to occur.

logicchains
7 replies
10h6m

Something like, summarise all articles on US-UK relationships over past 5 years. I charge money for it, and all I pay NYT is a monthly subscription fee.

Is that fair use? IANAL, but doesn't sound like it.

If you pay someone to do the summarisation for you, then you publish the content and charge a fee for it, you're the one liable, not the person you paid to summarise it for you. Similarly if you ask GPT to do it for you, then publish it, you're liable for what you publish; GPT is just a summarisation tool.

tsimionescu
3 replies
9h5m

That's not true at all. If you pay someone to copy NYT articles for you verbatim, and then they give the copies to you, and then you publish them online, then you've both violated the copyright. You are never allowed to make copies of copyrighted works, even for private deals (making such copies for purely personal use, such as archival, falls under fair use - but you can't build a service out of that).

So, if the summaries are derived works and not covered by fair use, then both you and the summarizer are separately breaking the NYT's copyrights. Otherwise, if this is covered by fair use, then you are both in the clear.

Finally, GPT is not "a summarization tool" in this case. If you provide a copy of a NYT article as a prompt and then ask for summarization, then yes, it is clear that GPT is not doing anything wrong, even if it spits out the exact same text. But if you simply ask for a summary of a specific article by, say, just name and date, and you get a copy of it, it's clear that GPT is storing the original data in some way, and thus it has copied the NYT's protected works without permission.

logicchains
1 replies
8h55m

But if you simply ask for a summary of a specific article by, say, just name and date, and you get a copy of it, it's clear that GPT is storing the original data in some way, and thus it has copied the NYT's protected works without permission.

In this particular case they were using it via Bing, which actively did an HTTP request to the particular article to extract the content. So GPT hadn't memorised it verbatim; instead it fetched it, much like a human using a search engine would.

tsimionescu
0 replies
6h25m

The article states that they used it initially through ChatGPT, but that seems to have been fixed in the meantime, at least for the very simplistic queries that used to work ("the first paragraph of the Carl Zimmer article on old DNA" in ChatGPT used to return the exact data from NYT, and "next paragraph" could then be used to get the following ones). Even if this has been fixed, it still proves that ChatGPT encodes exact copies of NYT articles in its weights, which may be a violation in itself, even if it is prevented from returning them directly. Especially if they ever started distributing the trained model.

Additionally, even the use through Copilot is very debatable. They are not returning the NYT link, which requires a subscription, they are returning the contents of it even to non-subscribers. And they are doing this in a commercial product, not a non profit like the Internet Archive, which has some arguments for fair use.

BlueTemplar
0 replies
8h20m

Also, ChatGPT isn't a person with rights and duties. The people that made it are responsible for it.

rich_sasha
2 replies
9h53m

That's not the example. Here I proactively scrape NYT, summarise articles for a fee and sell that as a service. It's not people coming to me with some articles to summarise, and maybe then publishing it online.

At some level it becomes a subversion of NYTs fees. First, say I subscribe and simply host the articles verbatim, for a fee. Clearly, that's not right.

Suppose I change some spelling or word order, or use a synonym or two. That's still not ok.

And if I substantially paraphrase the articles? I guess this is the relevant case. This is kind of what LLMs do. And also feels like not fair use.

logicchains
1 replies
9h9m

That's not the example. Here I proactively scrape NYT, summarise articles for a fee and sell that as a service. It's not people coming to me with some articles to summarise, and maybe then publishing it online.

That's not what OpenAI is doing; it's not selling summarised articles as a service. Your example is a false equivalence.

This is kind of what LLMs do. And also feels like not fair use

An LLM doesn't do this unless you ask it to. And if you then take that output and publish it as your own, you're breaching the copyright, not OpenAI.

heavyset_go
0 replies
8h52m

An LLM doesn't do this unless you ask it to. And if you then take that output and publish it as your own, you're breaching the copyright, not OpenAI.

In this case, OpenAI is violating copyright by modifying, reproducing and distributing copyrighted content to its customer.

px43
6 replies
9h37m

From what I can tell, this has nothing to do with LLMs at all. In the example in the article, the user is asking Bing to go fetch the contents of an article directly from the website, and print it out, which it dutifully does.

Seems like the "problem" is that the NYT etc. give privileged access to search engines for indexing their content, but then get upset when snippets of the indexed content are shown to users without the users having to fight the paywall or whatever.

This article also claims that the screenshot is coming from ChatGPT when it clearly is not.

rich_sasha
5 replies
9h29m

I suppose that's a relatively easy thing to fix, technically. It proves, however, that the underlying LLM is trained on copyrighted data.

I'm not sure the problem goes away simply if the LLM in question (or any other one) gets some "no verbatim regurgitation" filter.

exitb
3 replies
9h2m

In that case, the language model calls a search function and just repeats the result out of its conversation context, not its training data. With that in mind, it's not clear why it's OK for Bing itself to quote the source, but it stops being OK when a chatbot does it.

Jensson
2 replies
7h58m

Bing links to the source; chatbots don't.

px43
0 replies
6h53m

In the example from the article, it very clearly points to all the sources used.

dyno12345
0 replies
1h22m

bing's chatbot does

kolinko
0 replies
5h30m

The example from the article doesn't show that LLM is trained on copyrighted data - it's just Bing fetching the source article, providing it to GPT, and GPT rephrasing the article. An agent trained on entirely copyright-free data would provide exactly the same output.

modeless
5 replies
1h45m

To keep things simple, let's say I never regurgitate chunks of verbatim NYT articles, maybe quite short snippets.

You just described Google. When you think about it, it's surprising that Google is legal. However, it is well established that what Google does is perfectly legal. Remember that internally Google keeps and uses complete verbatim copies of every web page they index.

Yes, Google offers a link to the source. If OpenAI did the same, even if only 0.1% of people clicked on the links and NYTimes hardly got any revenue from it, would that make it legal in your eyes? What if they implemented a system that detected when it was outputting a verbatim copy of something and simply paraphrased it? NYTimes clearly doesn't have copyright on paraphrased versions of their articles. I think it would be pretty silly if the government forced them to do that as it wouldn't make any practical difference to anyone.
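
A toy sketch of that verbatim-output check (the threshold, function names, and inputs are made up; a real system would need something far more robust than word-level n-grams):

    def shares_long_ngram(output: str, source: str, n: int = 8) -> bool:
        # Flag the generation if any n consecutive words also appear in the source.
        out_w, src_w = output.split(), source.split()
        src_ngrams = {tuple(src_w[i:i + n]) for i in range(len(src_w) - n + 1)}
        return any(tuple(out_w[i:i + n]) in src_ngrams
                   for i in range(len(out_w) - n + 1))

    # e.g. if shares_long_ngram(draft, nyt_article): regenerate with a paraphrase prompt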

graeme
2 replies
1h27m

Any publisher can opt out of google. Publisher also have substantial control over titles and snippets shown in google, whether an article appears in google news, etc

Paraphrasing is also known as cloning and is often a copyright violation

modeless
1 replies
1h15m

Copyright law doesn't mention opt outs or search engine snippet controls. It's not clear to me that robots.txt is the singular thing that makes Google legal.

In US copyright law facts cannot be copyrighted, so copyright on factual content like newspaper articles is limited. Simply replacing a few words wouldn't work, but I am certain that GPT-4 is capable of paraphrasing factual content at a level that would not be considered infringement if a human did it.

freejazz
0 replies
10m

Copyright law doesn't mention opt outs or search engine snippet controls. It's not clear to me that robots.txt is the singular thing that makes Google legal.

Genuinely - what are you talking about besides your own assumptions? You just assume everything Google does is legal, and therefore anyone else doing anything arguably similar must also be legal? Without regard for factual details that do matter to copyright law, such as license? Your own description of copyright law here is very stunted - you can't paraphrase articles of the NYTimes and call it fair use. You can report on what the NYTimes reports on... because that's what news is.

inetknght
1 replies
1h31m

However, it is well established that what Google does is perfectly legal.

Google has a wide range of products and shakedowns. Not all of them are "perfectly" legal: Google is being challenged in court over some of those shakedowns and product practices.

modeless
0 replies
1h29m

I am clearly talking about the web search engine in the context of copyright. Other products or legal concerns like antitrust are completely irrelevant here.

tauntz
2 replies
5h51m

Is that fair use?

As always, the answer is.. "it depends". I guess it depends mostly on the jurisdiction that applies to you. "Fair use" can have rather different legal meaning (or not exist at all) in different countries.

madeofpalk
0 replies
5h7m

Also, “fair use” does not turn on precedent - each case is assessed individually, which really can be a flip of the coin.

hn_acker
0 replies
1h38m

Fair use is specific to the US, as far as I'm aware. Moreover, Congress had to codify fair use (turn fair use common law into statutory law in the form of 17 U.S. Code § 107) in order to make copyright statutes compatible with the First Amendment. Most other countries don't have the same protections for freedom of expression and freedom of the press, so copyright law in a different country usually lacks a unifying exception test like fair use to supplement the specific enumerated exceptions.

superb-owl
1 replies
4h8m

A sibling comment mentions search engines. I think there's a big difference. A search engine doesn't replace the source, not at all.

Google has been accused for years of replacing sources with their "One Box"--the big answers at the top of the page, which are usually pulled from or corroborated by search results. They don't want you to leave the search results page (where the ads are).

paxys
0 replies
2h12m

Google is very careful to license all the content that shows up in that interface. They even pay Wikipedia, despite legally not needing to at all.

ks2048
1 replies
4h39m

It would be nice to have a principled answer to this, but unfortunately, in our world, the answer is probably: if you start making LOTS of money doing this, they will come after you.

doctorpangloss
0 replies
2h4m

The best example is that sport scores, names and stats are not copyrightable by settled case law; however, you still have to go to the NBA and players union if you want to make a fantasy basketball game that has stats or names.

TeMPOraL
1 replies
9h41m

Typically I can't take a personal "tier" of a product and charge 3rd parties for derivatives of it. Say like VS Code.

Can't you, though? I'd thought that, in general, it's very important for the market to be able to do just that, otherwise everything gets gummed up in webs of exclusive contractual dependencies between established companies.

rich_sasha
0 replies
9h31m

As I say, I don't really know. But then, this is exactly how SaaS licensing works. There may even be a free personal tier, where you can't sell products based on it, and a professional tier which may be very expensive indeed.

Typically providers of online databases go to some effort to stop people from sharing logins. Even from that point of view, I can imagine scraping articles and providing paraphrases of them for a fee is fishy.

All I'm saying, to some people it's obvious that the whole LLM on scraped Internet is fair use, to me it is not obvious.

qeternity
0 replies
5h11m

Typically I can't take a personal "tier" of a product and charge 3rd parties for derivatives of it.

I think you’re confusing terms of service and copyright. IANAL but what you describe sounds exactly like fair use to me, irrespective of how much you are paying NYT.

papruapap
0 replies
8h34m

Using similar logic, NYT should pay all the actors involved in their articles.

oh_sigh
0 replies
1h26m

I agree with your IANAL take, but what about a situation with an extra level of indirection? So the service never reads actual NYT articles, but only reads blog/forum posts about NYT articles, and derives what is in the article from conversations about the article by people who have read it. Is that legal now?

oefrha
0 replies
6h21m

The real answer is that it totally depends on whether your product grows to $10,000,000,000, and whether you pay part of it back. Search engines pay with referral traffic.

jojobas
0 replies
8h43m

Can you read all of NYT and other things, and answer others' questions based on your knowledge? I'd imagine you can. I'm afraid you can't sidestep the question of whether an LLM is more like a person who's read a lot or an archive/index.

charcircuit
0 replies
33m

This analogy fails to capture the transformative nature of these models. Hosting a derivative work that is also a news article is not transformative. Hosting a next-word completer is very different from hosting a news article and can't be used as a substitute.

brookst
0 replies
2h4m

How about if you read the paper every day and write opinion pieces about world events? Fair use?

bnralt
0 replies
4h25m

As someone pointed out, plenty of blogs made money off of doing just that. Many people go to Reddit to read news article summaries (and often a comment just pastes the whole article verbatim), instead of paying a site like the New York Times. Twitter and other social media sites are full of people summarizing articles from the New York Times. Any late breaking news article from Wikipedia is going to be mostly summarizing information from reporters.

I think people severely underestimate how much they've grown accustomed to this information being freely available. It's easy to say "Well it shouldn't be available with ChatGPT," but if we actually put everything back behind a paywall and stopped people from doing things like writing blogs or newsletters that summarize the news, people here would get angry very fast.

JCharante
0 replies
1h2m

But is it legal for me to read the NY Times about a war, and then charge people to interview me as an "expert"?

BlueTemplar
0 replies
8h17m

There's nothing wrong with scraping openly available data (including data openly available by mistake, as long as you are not aware of it, see the Bluetouff affair).

So the demand to destroy those databases seems very dubious to me.

Of course later violating fair use is another issue.

6stringmerc
0 replies
9h3m

No, the four factor test is clear. Next.

groceryheist
59 replies
11h44m

The suit demonstrates instances where ChatGPT / Bing Copilot copy from the NYT verbatim. I think it is hard to argue that such copying constitutes "fair use". However, OAI/MS should be able to fix this within the current paradigm: Just learn to recognize and punish plagiarism via RLHF.

However, the suit goes far beyond claiming that such copying violates their copyright: "Unauthorized copying of Times Works without payment to train LLMs is a substitutive use that is not justified by any transformative purpose."

This is a strong claim that just downloading articles into training data is what violates the copyright. That GPT outputs verbatim copies is a red herring. Hopefully the judge(s) will notice and direct focus on the interesting, high-stakes, and murky legal issues raised when we ask: What about a model can (or can't) be "transformative"?

visarga
14 replies
10h45m

Just learn to recognize and punish plagiarism via RLHF.

This is not an RLHF problem. What I was expecting them to do is to keep a bloom filter of n-grams for known copyrighted content, such as enumerating all sets of n=7 consecutive words in an article, and validating outputs against it. The model would only output at maximum n-1 words that look verbatim from the source.
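
A minimal sketch of that idea, assuming a simple double-hashing bloom filter; the corpus and sizes are toy stand-ins:

    import hashlib

    class BloomFilter:
        def __init__(self, size_bits=1 << 24, num_hashes=4):
            self.size = size_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(size_bits // 8)

        def _positions(self, item):
            for i in range(self.num_hashes):
                digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
                yield int.from_bytes(digest[:8], "big") % self.size

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, item):
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(item))

    def ngrams(text, n=7):
        words = text.split()
        return (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))

    corpus = "..."  # stand-in for the protected article text
    bf = BloomFilter()
    for gram in ngrams(corpus):
        bf.add(gram)

    # Flag any candidate output containing an indexed 7-gram; the model
    # would then be limited to at most n-1 verbatim words in a row.
    def looks_verbatim(candidate):
        return any(gram in bf for gram in ngrams(candidate))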

But this will blow up in their face. Let's see:

- AI companies will start investing much more in content attribution

- The new content attribution tools will be applied on all human written articles as well, because anyone could be using GPT in secret

- Then people will start seeing a chilling effect on creativity

- We must also check NYT against all the other sources, since not everything they write is original

groceryheist
9 replies
10h30m

Maybe the bloom filter solution is enough, but I wonder.

- Paraphrasing n=7 words (and quite a few more) within a sentence can easily be fair use.

- As n gets big, the bloom filter has to also.

If/when attribution is solved for LLMs (and not fake attribution like from Bing or Perplexity) then creators can be compensated when their works are used in AI outputs. If compensation is high enough this can greatly incentivize creativity, perhaps to the point of realizing "free culture" visions from the late 90s.

visarga
4 replies
10h17m

As n-gram length grows, we still have roughly the same number of n-grams; they go through a hashing function and are indexed in the bloom filter as usual. The number of n-grams of size n in a text is text_length - ngram_length + 1.

groceryheist
3 replies
9h57m

The number of unique values in the bloom filter will go up ~exponentially with n. So to control the false positive rate the bloom filter has to grow.
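
For scale, the standard bloom filter sizing formula m = -n * ln(p) / (ln 2)^2 makes that growth concrete (the numbers below are illustrative):

    import math

    def bloom_bits(num_items, fp_rate):
        # Standard bloom filter sizing: m = -n * ln(p) / (ln 2)^2 bits
        # for n items at false-positive rate p.
        return math.ceil(-num_items * math.log(fp_rate) / math.log(2) ** 2)

    # e.g. one billion unique 7-grams at a one-in-a-million
    # false-positive rate needs roughly 3.3 GiB of filter:
    print(bloom_bits(10**9, 1e-6) / 8 / 2**30)  # ~3.35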

visarga
2 replies
9h48m

At a large enough n-gram size there would be very few collisions. You can take, for example, this text and try it in Google with quotes; it won't find anything matching exactly.

I tested this 6-gram "it won't find anything matching exactly", no match. Almost anything we write has never been said exactly like that before.

groceryheist
0 replies
9h37m

Yes and the fact that the number of unique phrases grows so quickly with n is why the bloom filter needs to grow so that hashed n-grams don't collide.

dleeftink
0 replies
7h8m

it won't find anything matching exactly

This approach is probably inadequate. In my line of (NLP) research I find many things have been said exactly many, many times over.

You can try this out yourself by grouping and counting strings using the many publicly available BigQuery corpora for various substring lengths and offsets, e.g. [0-16]; [0-32]; [0-64] substring lengths at different offsets.

geysersam
2 replies
9h54m

if compensation is high enough

Who pays the compensation? If it's the user, why wouldn't they just buy the authors work directly? Why go through the LLM middleman?

starttoaster
0 replies
9h27m

If it's the user, why wouldn't they just buy the authors work directly? Why go through the LLM middleman?

If it's the user, why wouldn't they just buy the DVDs directly? Why go through the Netflix middleman?

A retort to this would be that both NYT and ChatGPT are on the internet, so it's no added fuss of hopping in my car, driving to Walmart, and picking up a DVD case. My response to it would be that both the LLM and Netflix are content aggregators to the user. I can read the NYT, or I can read the NYT summary on ChatGPT and ask it for life advice with my pet hamster, or ask it how to reverse a linked list in bash.

groceryheist
0 replies
9h51m

The LLM users/middlemen pay. The user probably pays less than they would have to pay the author. The LLM provides information retrieval / discovery.

sideshowb
0 replies
3h12m

I like the idea, but it seems like there would be big problems, like detecting whether a work has been reworded. Or when a large number of sources have all slightly influenced a small response - isn't that pretty much considered new knowledge?

Then there's the issue that however you credit attribution, it creates a game of enshittified content creation with the aim of being attributed as often as possible, regardless of whether the content really offered anything that wasn't out there already.

mike_hearn
2 replies
8h18m

I think it is an RLHF problem and that you are right - this will blow up in the faces of the NYT.

Specifically, the NYT examples all seem to be cases where they asked the AI to repeat their articles verbatim? So they ask it to violate copyright and because it's a helpful bot with a good memory, it does so.

Solution: teach the model to refuse requests to repeat articles verbatim. It's easily capable of recognizing when it's being asked to do that. And that's exactly what OpenAI have now done.

So the direct problem the NYT is complaining about - a paywall bypass - is already rectified. Now it would seem to me like the case is quite weak. They could demand OpenAI pay them damages for the time ChatGPT wasn't refusing, but wouldn't they have to prove damages actually happened? It seems unlikely many people used ChatGPT as a paywall bypass for the NYT specifically in the past year. It only knows old articles. OpenAI could be ordered to search their logs for cases where this happened, for example, and then the NYT could be ordered to show their working for the value of displaying a single old article to a non-subscriber, and from that damages could be computed. But it wouldn't be a lot.

That's presumably why the case goes further and argues that OpenAI is in violation even when it isn't repeating text verbatim. That's the only way the NYT can get any significant money out of this situation.

But this case seems much weaker to me. Beyond all the obvious human analogies, there is precedent in the case of search engines where they crawl - and the NYT let them crawl - specifically to enable the creation of a derived data structure. Search engine indexes are understood to be fair use, and they actually do repeat parts of the page verbatim in their snippets. Google once even showed cached versions of whole pages. And browser makers all allow extensions in their stores that strip ads and bypass paywalls, and the NYT hasn't sued them over that either.

cycomanic
1 replies
5h28m

This is not how copyright works, though. The verbatim quoting of articles matters because, when people initially raised these questions, the argument was that the NN doesn't really contain the training data, or contains it only in an abstract, condensed way that does not constitute copying of the content.

This demonstrates that no, the NN actually does contain the full articles, copied into the NN. Do you think any normal person would get away with copying MS Windows by e.g. zipping it together with some other OS on the same medium? Why should we let OpenAI get away with this?

mike_hearn
0 replies
4h41m

Search indexes contain exact copies of the pages they index, and that isn't a copyright violation.

> Why should we let OpenAI get away with this?

IP rights, like other private property rights, are a compromise between creators and consumers. What "should" be the case is essentially an argument about what balance creates the best overall outcomes. LLMs, for now, require large amounts of text to train, so the question is one of whether we want LLMs to exist or not. That's really a question for Congress and not the courts, but it'll be decided in the courts first.

dyno12345
0 replies
1h19m

peyton
9 replies
10h59m

Well yeah, copying a work and using it for its original expressive purpose isn’t fair use, no? You have to use it for a transformative purpose.

Suppose I’m selling subscriptions to the New Jersey Times, a site which simply downloads New York Times articles and passes them through an autoencoder with some random noise. It serves the exact same purpose as the New York Times website, except I make the money. Is that fair use?

echelon
3 replies
10h54m

Well yeah, copying a work and using it for its original expressive purpose isn’t fair use, no? You have to use it for a transformative purpose.

They transformed the weights.

Just like reading the article transforms yours.

As for verbatim reproduction, I'm pretty sure brains are capable of reproducing song lyrics, musical melodies, common symbols ("cool S"), and lots of other things verbatim too.

Those quotes from Dr. King's speech that you remember are copyrighted, you know?

JambalayaJim
2 replies
10h27m

This comment is just blatant anthropomorphizing of ML models. You have no idea if reading an article “transforms weights” in a human mind, and regardless, they aren’t legally the same thing anyway.

stevenhuang
0 replies
8h25m

Modern neuroscience does highly suggest this is essentially what's happening.

echelon
0 replies
2h58m

they aren’t legally the same thing anyway.

They should be.

cornel_io
3 replies
10h13m

If they could find a single person who in natural use (e.g. not as they were trying to gather data for this lawsuit) has ever actually used ChatGPT as a direct substitution for a NYT subscription, I'd support this lawsuit.

But nobody would do that, because ChatGPT is a really shitty way to read NYT articles (it's stale, it can't reliably reproduce them, etc.). All that is valuable about it is the way that it transforms and operates on that data in conjunction with all the other data that it has.

The real world use of ChatGPT is very transformative, even if you can trick it into behaving in ways that are not. If the courts act intelligently they should at least weigh that as part of their decision.

whoopsie
1 replies
9h11m

That’s nonsense piracy. I never intend to own a truck, so when I need to haul a little something I go to Home Depot and steal a Ford off the lot for an hour? What if I stole all your commits, plucked the hard lines out of the ceremony, and then launched an equivalent feature the same week as you did, but for a competing software company? Would you or your employer deserve to get paid for my use of the slice of your work that was specifically useful for me? Yeah, and then some extra for theft.

Zpalmtree
0 replies
17m

awful comparison

peyton
0 replies
9h31m

It’s more of a thought experiment. Here’s another with more commercial applications:

Suppose I start a service called “EastlawAI” by downloading the Westlaw database and hiring a team of comedians to write very funny lawyer jokes.

I take Westlaw cases and lawyer jokes and feed them to my autoencoder. I also learn a mapping from user queries to decoder inputs.

I sell an API and advertise it to startups as capable of answering any legal question in a funny way. Another company comes along with an API to make the output less funny.

Have I created a competitor to Westlaw by copying Westlaw’s works for their original expressive purpose and exposing it as an intermediary? Or have I simply trained the world’s most informative lawyer joke generator that some of my customers happen to use for legal analysis by layering other tools atop my output?

Did I need to download Westlaw cases to make my lawyer joke generator? Are the jokes a fair-use smokescreen for repackaging commercially valuable copyrighted data? Does my joke generator impact Westlaw in the market? Depends, right?

hn_acker
0 replies
1h20m

Well yeah, copying a work and using it for its original expressive purpose isn’t fair use, no? You have to use it for a transformative purpose.

To be clear, whether the use of the original work is transformative is one key consideration within one of the four prongs of fair use. The prong "purpose and character of the use" can be fulfilled by other conditions [1]. For example, using the original work within a classroom for education purposes is not transformative, but can fulfill the same "purpose and character of the use" prong. Whether the use is for profit and to which extent are other considerations within that prong. A profit purpose doesn't automatically fail the purpose prong, and a non-profit purpose doesn't automatically pass the purpose prong.

[1] https://en.wikipedia.org/wiki/Fair_use#1._Purpose_and_charac...

jahewson
8 replies
10h29m

Many instances of fair use involve verbatim copying. The important questions surround the situation in which that happens - not so much the copying. NYT is in uncharted territory here.

fsckboy
7 replies
10h17m

In the same way that machines are not able to claim copyright, they aren't allowed to claim other legal rights either, like "fair use".

The entity which owns ChatGPT is apparently maintaining a copy of the entirety of the New York Times archive within the ChatGPT knowledge base. That they extract some fair use snippets (they would claim) from it would still be fruit of the poisonous tree, no?

(disclaimer: I'm pro AI, anti copyright, especially anti elitist NY Times; but pro rule of law)

colechristensen
5 replies
10h11m

I think there is some point between fifty years ago and last week at which the content of newspapers should enter the public domain. That part of copyright needs to be fixed.

Your creative work does deserve at least some period of exclusive rights for you. Definitely not so much that your grandchildren get to quibble about it well into retirement. But also, the number 3 or 4 most valuable company in the world doesn't get to scrape your content daily to repackage and sell as intelligent systems.

TeMPOraL
4 replies
9h33m

But also, the number 3 or 4 most valuable company in the world doesn't get to scrape your content daily to repackage and sell as intelligent systems.

Here's a thing though: for 99%+ of that content, being turned into feedstock for ML model training is about the only valuable thing that came of its existence.

If it were not for the world-ending danger of too smart an AI being developed too quickly, I'd vote for exempting ML training from copyright altogether, today - it's hard to overstate just how much more useful any copyrighted content is for society as LLM training data than as whatever it was created for originally.

tsimionescu
3 replies
8h59m

Except if you do that, you will see the number of content producers plummet quite quickly, and then you won't have any new training data to train new LLMs on.

aspenmayer
2 replies
8h32m

Would it not logically follow that nothing of value would be lost, even if that were the case? From the point of view of LLMs and content creators, I would treat potential loss of future content being created like I would treat a lost sale. LLMs have value now because of training performed on content that already exists. There must be diminishing returns for certain types of content relative to others. Certain content is only of value if it is timely, and going forward, content that derives its worth from timeliness would find its creation and associated costs of production and acquisition self-justifying. If content isn’t of value to humans now or in the future, nor even of value to LLMs now or in the foreseeable future, not even hypothetically, then why should we decry or mourn its loss or absence or failure to be created or produced or sold?

tsimionescu
1 replies
6h33m

That's like saying that if a competitor can take your products from your warehouse and sell them for pennies on the dollar, your business has no value. The point is that, to some extent, OpenAI is selling access to NYT content for much cheaper than NYT, while paying exactly 0 to NYT for this content. Obviously, the NYT content costs the NYT more than 0 to produce, so they just can't compete on price with OpenAI, for their own content.

Note that I don't see any major problem if only articles that were, say, more than 5 or 10 years old were being used. I don't think the current length of copyright makes any sense. But there is a big difference from last year's archive vs today's news.

aspenmayer
0 replies
52m

For the sake of argument, let's say that OpenAI thought it had the rights to process the NYT articles and even display them in part, for the same reasons, fair use or otherwise, that Google can process articles and display snippets of same in its News product, and/or for the same reasons that Google can process books and display excerpts in its Books product. Just like Google in those cases, I would not be surprised to find Google/OpenAI on the receiving end of a lawsuit from rights holders claiming violations of their copyright or IP rights. However, I side with Google then and OpenAI now, as I find both use cases to be fair use, as the LinkedIn case has shown that scraping is fair use.

NYT is crying foul because users/consumers of its content archive have derived unforeseen value from said archive under fair use terms, so NYT has no way to compel OpenAI to negotiate a licensing deal under which they could extract value from OpenAI's use of NYT data beyond the price paid by any other user of NYT content, whether it be unpaid fair use or fully paid use under license. It feels to me that NYT is engaging in both double-dipping and discriminatory pricing, because they can, and because they're big mad that OpenAI is more successful than they are with less access to the same or even less NYT data.

visarga
0 replies
10h4m

There is another fix, but it will have to wait for GPT-5. They could reword articles, summarize them in different words, and analyze their contents, creating sufficiently different variants. The ideas would be kept, but the original expression stripped. Then train GPT-5 on this data. The model can't possibly regurgitate copyrighted content if it never saw it during training.

This can be further coupled with search - use GPT to look at multiple sources at once, and report. It's what humans do as well, we read the same news in different sources to get a more balanced take. Maybe they have contradictions, maybe they have inaccuracies, biases. We could keep that analysis for training models. This would also improve the training set.

JumpCrisscross
6 replies
10h10m

Just learn to recognize and punish plagiarism via RLHF

OpenAI has created a $100bn company on this transfer. The Times may have an interest in a material fraction of that wealth.

vidarh
5 replies
9h4m

The NYT is also worth a tiny fraction of that. If it looks like they might get anywhere, it might be better for OpenAI to buy them.

afavour
2 replies
8h13m

That would require NYT being willing to sell, which historically they have not been.

vidarh
1 replies
7h52m

I just looked up the share structure; I didn't realise the publicly traded shares only appoint 1/3 of the board. Still, their second best option is to start buying up competitors and going ahead with purging NYT from their training set. That might well end up a worse option for NYT, as it won't stop LLMs from gradually intruding on their space, and the moment OpenAI or other LLM providers own major publishers and no longer need to depend on scraping, NYT loses any leverage it currently has.

JumpCrisscross
0 replies
55m

might well end up a worse option for NYT, as they won't stop LLMs from gradually intruding on their space

The Times almost certainly wants its own LLM. I could see them striking a consortium agreement with other newspapers more easily than OpenAI.

cmiles74
1 replies
4h55m

OMG! Or they could just license the content. I suspect that would be both easier and less expensive. ;-)

vidarh
0 replies
3h21m

I'm not convinced it's a given it will. If it becomes necessary to license, owning the large publishers will be leverage and allow locking competitors out unless you have a portfolio to cross license.

OpenAI alone has a market cap that'd allow it to buy about as large a proportion of publishers of newspapers and books as they'd be allowed before competition watchdogs will start refusing consent.

Put another way:

If I was a VC with deep pockets investing in AI at this point, I'd hedge by starting to buy strategic stakes in media companies.

spacecadet
3 replies
11h11m

Transformations are happening. Maybe if the output is verbatim afterwards, then that says something about the output's originality all along... or am I a troll?

jarrell_mark
1 replies
10h57m

Anything + 2 and then minus two is back to the original thing. This says more about the transformations than the source material.

spacecadet
0 replies
5h27m

I know, I was trying to be funny, but hey- this community...

dathery
0 replies
10h17m

They're talking about transformative with regard to copyright law where it is an important part of determining fair use, not the dictionary definition you're using here.

I can't take NY Times articles, translate them into Spanish, and then sell the translations under fair use, even though clearly I've transformed the original article content.

intrasight
2 replies
5h47m

Yeah, no - that proposal is no good. The correct solution is to have machine learning be more like human intelligence. You can't ask me to plagiarize a New York Times article. Not because of prompt rule violation but because I just can't. It's not how humans train (at least most).

namlem
1 replies
4h45m

You can't, but there are some people who can quickly memorize entire pages of written text.

intrasight
0 replies
3h2m

That's why I qualified with "at least most"

hn_acker
2 replies
1h27m

This is a strong claim that just downloading articles into training data is what violates the copyright. That GPT outputs verbatim copies is a red herring.

It's the other way around. There is no infringement if the model output is not substantially similar to a work in the training set [1]:

To win a claim of copyright infringement in civil or criminal court, a plaintiff must show he or she owns a valid copyright, the defendant actually copied the work, and the level of copying amounts to misappropriation.

The questions are, which parties should bear liability when the model creates infringing outputs, and how should that liability be split among the parties? Given that getting an infringing output likely requires the prompt to reference an existing work (which is what's happening in the article), an author of a work, an element in an existing work, or a characteristic/style strongly associated with certain works/authors, I believe that the user who makes the prompt should bear most of the liability should the user choose to publish an infringing output in a way that doesn't fall under fair use. (AI companies should not be publishing model outputs by default.)

[1] https://en.wikipedia.org/wiki/Substantial_similarity#Substan...

dragonwriter
1 replies
1h21m

The level of copying here is the copying into the training set, not the copying through use of the model.

It's true that OpenAI will defend the wholesale copying into the training set by arguing that the transformative purpose of the next use reaches back and renders that copying fair use. But while that's clearly the dominant position of the AI industry, and it definitely seems compatible with the Constitutional purpose of fair use (while currently statutory, the statutory provision is a codification of Constitutional case law), it is a novel fair use argument.

hn_acker
0 replies
33m

The level of copying here is the copying into the training set, not the copying through use of the model.

NY Times is suing because of both the model outputs and the existence of the training set. But infringement in the training set doesn't necessarily mean that the model infringes. Why? Because of the substantial similarity requirement. But first, I'll address the training set.

For articles that a person obtains through legal methods (like buying subscriptions) but doesn't then republish, storing copies of those articles is analogous to recording a legally accessed television show (time-shifting), which generally is fair use. Currently, no court has ruled that "analogous to time-shifting" is good enough for the time-shifting precedent to apply, but I think the difference is not significant. The same applies to companies. Companies are not literally people, but there isn't a reason for the time-shifting precedent to not apply to companies.

What about the articles that OpenAI obtained through illegal methods? Then the very act of obtaining those articles would be illegal. The training set contains those copies, so NY Times can sue to make OpenAI delete those copies and pay damages. But it's not trivially obvious that a GPT model is a copy of any works or contains copied expression of any works in the training set; the weights that make up the model represent millions of works, so it's not trivially obvious that the model contains something substantially similar to the expression in a work in the training set. Therefore, it's not trivially obvious that infringement with respect to the training set amounts to infringement with respect to the model made from the training set.

As long as the model doesn't contain copied expression and the weights can't be reversed into something substantially similar to expression in the existing works, then what matters is the output of the model.

If a user gives a prompt which contains no reference to an existing artist, work, strongly associated characteristic/style, then do OpenAI's models produce outputs substantially similar to expression in the existing works? If not, then OpenAI shouldn't be liable for infringing works, because the infringing works result from the user's prompts. If my premise is false, then my conclusion falls apart. But if my premise is true, then at most I would admit that OpenAI has a limited burden to prevent users from giving those prompts.

furyofantares
2 replies
7h3m

Just learn to recognize and punish plagiarism via RLHF.

I'm not sure how your proposal would actually work. To recognize plagiarism during inference it needs to memorize harder.

Kinda funny if it works though. We'd first train them to copy their training data verbatim, then train them not to.

That is how it works, right? They're trained to copy their training data verbatim because that's the loss function. It's just that they're given so much data that we don't expect this to be possible for most of the training data given the parameter count.
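
A toy sketch of that loss (a bigram-sized model, purely illustrative): the objective is minimized by putting all probability on the exact next token of the training text.

    import torch
    import torch.nn.functional as F

    vocab, dim = 100, 16
    emb = torch.nn.Embedding(vocab, dim)
    head = torch.nn.Linear(dim, vocab)

    tokens = torch.tensor([5, 17, 42, 8])       # one training sequence
    logits = head(emb(tokens[:-1]))             # predict each next token
    loss = F.cross_entropy(logits, tokens[1:])  # zero only when the model
                                                # continues the text verbatim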

empiko
1 replies
6h19m

I wouldn't say it is unexpected behavior. I remember reading papers about this memorization behavior a few years ago (e.g., [1] is from 2019 and I believe it is not the first paper about this). OpenAI should be expected to know that LMs can exhibit memorizing behavior even after seeing a sample only once.

[1] https://bair.berkeley.edu/blog/2019/08/13/memorization/

furyofantares
0 replies
5h48m

My expectation is that it can't memorize most of its training data. I expect it to memorize some.

bertil
1 replies
1h31m

Guaranteeing no verbatim copying from a very large and relevant corpus will be hard without enormous databases of copyrighted content (which might not be legal to hold), and it adds an extra objective to a system with many often contradictory goals. I don't think that's the technically sound solution, or one in the interest of anyone involved. It's much more relevant to license content from as many newspapers as possible, recognize when references are relevant, and quote them either explicitly verbatim if that's the best answer or adapt (translate, simplify, add context) when appropriate.

I feel like the NYTimes is asking for deletion as a negotiation tactic to force OpenAI to give them enough money to pay for their journalism (I am not sure who would subscribe to NYTimes if you can get as much through OpenAI, but I am open to registering extra to pay for their work).

pants2
0 replies
43m

What if OpenAI were to first summarize or transform the content before training on it? Then the LLM has never actually seen copyrighted content and couldn't produce an exact copy.

kromem
0 replies
2h35m

This isn't an issue with training, it's an issue with usage.

Production open access LLMs do probably need a front-end filter with a fine tuned RAG model that identifies and prevents spitting out copyrighted material. I fully support this.

But we shouldn't be preventing the development of a technology that in 99.99% of use cases isn't doing that, and that can be used for everything from diagnosing medical issues to letting coma patients communicate with an EEG to improving self-driving car algorithms, because some random content producer's works were a drop in the ocean of content used to learn relationships between words and concepts.

The edge cases where a model is rarely capable of reproducing training data don't reflect infringement of training but of use. If a writer learns to write well from a source is that infringement? Or is it when they then write exactly what was in the source that it becomes infringement?

Additionally, now that we can use LLMs to read brain scans and have been moving towards biological computing, should we start to consider copying of material to the hippocampus a violation of the DMCA?

colechristensen
0 replies
10h17m

I think NYT is going to win.

LLMs are arguably compressed data archives with weird algorithms. The fact that they will regularly regurgitate verbatim quotes of training data is evidence of this, as are the guardrails that try to prevent this.

The second piece of evidence is this paper explained here https://www.hendrik-erz.de/post/why-gzip-just-beat-a-large-l... where instead of an LLM researchers used gzip compressed data as a model and it even beat trained LLMs.
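
Roughly, the trick in that paper: classify text by normalized compression distance (NCD) to labeled examples, with no trained model at all. A toy sketch with made-up data:

    import gzip

    def clen(s):
        return len(gzip.compress(s.encode()))

    def ncd(a, b):
        # Normalized compression distance: small when a and b share structure.
        ca, cb, cab = clen(a), clen(b), clen(a + " " + b)
        return (cab - min(ca, cb)) / max(ca, cb)

    train = [  # toy stand-ins for labeled training texts
        ("the senate passed the budget bill today", "politics"),
        ("the striker scored twice in the cup final", "sports"),
    ]

    def classify(text):
        return min(train, key=lambda ex: ncd(text, ex[0]))[1]

    print(classify("parliament votes on the new spending bill"))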

AI is a bit of a black box, but that doesn't protect the operators of black boxes from rights violation suits. You can't make a database of scraped copyrighted data and pretend that querying that data is fair use.

There needs to be law made here and the law just isn’t going to be “everybody can copy everything for free as long as it’s for model training”.

Licensing will have to be worked out; actual laws, not just case law, need to be written. I have a lot of sympathy for giving lots of leeway to the open source researchers and hackers doing things… but not so much for Microsoft and Microsoft-sponsored OpenAI.

amadeuspagel
0 replies
1h56m

The suit demonstrates instances where ChatGPT / Bing Copilot copy from the NYT verbatim. I think it is hard to argue that such copying constitutes "fair use". However, OAI/MS should be able to fix this within the current paradigm: Just learn to recognize and punish plagiarism via RLHF.

Isn't that in tension with the basic idea of an LLM, predicting the next token? How do you achieve that while never getting close to plagiarism?

fasterik
23 replies
12h12m

I've been arguing since ChatGPT came out that LLMs should fall under fair use as a "transformative work". I'm not a lawyer and this is just my non-expert opinion, but it will be interesting to see what the legal system has to say about this.

mynegation
15 replies
12h0m

Suit claims that GPT reproduced passages from NYT almost verbatim.

rvz
9 replies
11h54m

Precisely.

These tired 'fair use' excuses from AI bros don't hold up: GPT has reproduced the article text verbatim, word for word, and it is being monetized without permission from the copyright holder and source (NYT). That is an obvious copyright violation 101. Full stop.

Again, just like Getty v. Stability, this copyright lawsuit will end in a licensing deal. Apple played it smart by pursuing licensing deals to train their GPT [0]. But this time, OpenAI knew they could get a license to train on NYT articles but chose not to.

[0] https://9to5mac.com/2023/12/22/apple-wants-to-train-its-ai-w...

chatmasta
7 replies
11h42m

AI bros

What (or whom) do you consider to be an "AI bro?"

This sort of ad hominem generalization usually accompanies a weak argument.

beau_g
3 replies
11h6m

Young males that wear Tensorflow branded muscle tank tops and drive Mitsubishi Eclipse convertibles with the vanity plate OVERFIT. They are everywhere these days.

danielbln
1 replies
8h45m

vidarh
0 replies
8h13m

The text generation is getting quite decent. The limbs disappearing into the car are somewhat less impressive.

jakderrida
0 replies
10h46m

Thank you for the absurd visual. The vanity plate, especially, was worth saving for last. Somehow, the car is well suited, also. Love how they prefer Tensorflow over Pytorch, too.

vidarh
0 replies
11h1m

I generally tend to downvote comments that use "x bros" for pretty much any x on sight for that reason. It's exceedingly rare for such a comment to be much more than a thinly veiled insult with little substance. Sometimes I might even agree with the insult, but it's still rarely appropriate here.

satvikpendem
0 replies
11h39m

It seems to be used by people who've previously used the term "tech bro."

irq
0 replies
11h37m

Not saying I agree with this labeling, but it means approximately the same thing as “crypto bro”, but for AI

throwup238
0 replies
11h47m

The four factors considered in a fair use test:

    the purpose and character of the use
    the nature of the copyrighted work
    the amount and substantiality of the portion taken
    the effect of the use upon the potential market.

Literally every single one of these factors has very complicated precedent and each one is an open question when it comes to AI. Since fair use is a balancing test this could go any way.

Stability took the easy way out because they didn't have billions of dollars to play around with and Microsoft to back them. Let's see what OpenAI does but calling everyone who disagrees with your naive interpretation of fair use "AI bros" is doing everyone a disservice.

lodovic
2 replies
10h35m

I'm sure the NYT uses dictionaries, encyclopaedias and style books verbatim as well. And they don't invent the facts they write about. As journalists they are compiling and passing along other knowledge. You usually don't get a piece of their income when a journalist quotes you verbatim (people usually don't get paid for interviews).

threeseed
0 replies
6h59m

NYT doesn't reproduce the contents of the dictionary or encyclopaedia.

And even if they did, it would be fine because those sources allow for it.

The point is that OpenAI never asked NYT for permission to use their data.

madeofpalk
0 replies
8h35m

If the NYT reproduces other content verbatim too much, it will get in trouble.

dahart
1 replies
11h38m

I don’t doubt it does. It’s easy to get it to spit out long answers from Stack Overflow verbatim, I’ve done it. Maybe some of the “transformative” nature of the LLM output is the removal of any authorship, copyright, license, and edit history information. ;) The point here is to supplant Google as the portal of information, right? It doesn’t have new information, but it’s pretty good at remixing the words from multiple sources, when it has multiple sources. One possible reason for their legal woes wrt copyright is that it’s also great at memorizing things that only have one source. My college Markov-chain text predictor would do the same thing and easily get stuck in local regions if it couldn’t match something else.
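
For reference, a toy word-level Markov predictor like that one: wherever a context was only ever followed by one word in the training text, generation is deterministic and reproduces the source verbatim.

    import random
    from collections import defaultdict

    def train_chain(text, order=2):
        words = text.split()
        model = defaultdict(list)
        for i in range(len(words) - order):
            model[tuple(words[i:i + order])].append(words[i + order])
        return model

    def generate(model, seed, length=20):
        out = list(seed)
        for _ in range(length):
            choices = model.get(tuple(out[-len(seed):]))
            if not choices:
                break
            # Contexts with a single observed continuation regurgitate
            # the training text exactly - the "local region" effect.
            out.append(random.choice(choices))
        return " ".join(out)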

__loam
0 replies
10h36m

I don't think these can replace search engines.

ramesh31
3 replies
11h48m

It's inevitable that this question ends up at the supreme court. And the sooner the better IMO. It's clearly fair use. Generative agents will be seen legally as no different than a human artist leveraging the summation of their influences to create a new work.

agentgumshoe
1 replies
10h45m

Clearly fair use? What if I pay ChatGPT to give me the NYT article it sourced verbatim as stored (i.e. without referring me to the NYT source)?

MacsHeadroom
0 replies
6h46m

It's not stored in ChatGPT actually, unlike Google's web search cache where it is stored verbatim, can be recalled perfectly, and is still fair use.

Fair use has nothing to do with reproducibility. LLMs are more clearly fair use than a search engine cache and those court cases are long settled. There's no world in which OpenAI doesn't win this entire thing.

simion314
0 replies
8h59m

It's inevitable that this question ends up at the supreme court. And the sooner the better IMO. It's clearly fair use. Generative agents will be seen legally as no different than a human artist leveraging the summation of their influences to create a new work.

Why do you think the architecture is important? If I have a computer program and it outputs an entire copyrighted poem, then the answer to "is this a copyright violation" SHOULD NOT depend on the architecture of the program.

agentgumshoe
1 replies
10h48m

What if I ask ChatGPT to print the article verbatim as sourced, from its own dataset?

cjbprime
0 replies
9h34m

It doesn't have database access to its own training dataset; it only has access to the weights it lossily-compressed that training dataset into.

dahart
0 replies
11h16m

This seems like a reasonable opinion when you think about the training data size and imagine that any given output is some kind of interpolation of some unknown large number of training examples all from different people. If it's borrowing snippets from tens or hundreds or thousands of sources, then whose copyrights are being violated? Remixing in music seems to be withstanding some amount of legal scrutiny, as long as the remix is borrowing from multiple sources and the music is clearly different and original.

It gets harder to stand behind a blanket claim that LLMs or any AI we’ve got falls under fair use when they keep repeatedly reproducing complete and identifiable individual works and clearly violating copyright laws in specific instances. The models might be remixing and/or transformative most of the time, but we have proof that they don’t do that every time nor all the time… yet. Maybe the lawsuits will be the impetus we need to fix the AIs so they don’t reproduce specific works, and thus make the fair use claim solid and actually defensible?

a_wild_dandan
20 replies
5h54m

The NYT is preparing for a tsunami by building a sandcastle. Big picture, this suit won’t matter, for so many reasons. To enumerate a few:

1. Next gen LLMs will be trained exclusively on “synthetic”/public data. GPT-4V can easily whitewash its entire copyrighted training corpus to be unrecognizably distinct (say reworded by 40%, authors/sources stripped, etc). Ergo there will be no copyrighted material for GPT-5 to regurgitate.

2. Research/hosting/progress will proceed. The US cannot stop this, only choose to be left behind. The world will move on, with China gleefully watching as their biggest rival commits intellectual suicide all to appease rent seeking media companies.

3. Models can share weights, merge together, cooperate, ablate, evolve over many generations (releases), etc. Copyright law is woefully ill equipped to handle chasing down violators in this AI lineage soup, annealed with data of dubious/unknown provenance.

I could go on, but the point is that, for better or worse, we live in a new intellectual era. The NYT et al are coming along for the ride, whether they like it or not.

mat0
10 replies
4h58m

I'm sorry but this is such a bad take. Nice appeal to consequences. In my view, the New York Times is entirely justified in pursuing legal action. They invested time and effort in creating content, only to have it used without permission for monetary gain. A clear violation.

Analyzing the factors involved for a "fair use" consideration:

Purpose and Character of the Use: While the argument for transformation might hold in the future as you point out, the current dispute revolves around verbatim use. So clearly not transformative. Also commercial use is more difficult to be ruled fair use.

Nature of the Copyrighted Work: Using works that are more factual may be more likely to be considered fair use, but I would argue that NYT articles are as creative as factual.

Amount and Substantiality of the Portion Used: In this case, the entirety of the articles was used, leaving no room for a claim of using an insignificant portion.

Effect on the Market Value: NYT isn't getting any money from this, and it's clearly not helping their market value if people are checking on ChatGPT instead of reading a NYT article.

IANAL, but in my opinion NYT is well within its rights to pursue legal action. Progress is inevitable, but as humans, we must actively shape and guide it. Otherwise it cannot be called progress. In this context, legal action serves as a necessary means for individuals and organizations to assert their rights and influence its course.

w4ffl35
6 replies
4h54m

Imo gpt itself is the transformative work.

tantalor
5 replies
4h33m

Ok but it's not

UrineSqueegee
4 replies
4h20m

Definition of Transformative Use: The legal concept of transformative use involves significantly altering the original work to create new expressions, meanings, or messages. AI models like GPT don't merely reproduce text; they analyze, interpret, and recombine information to generate unique responses. This process can be argued as creating new meaning or purpose, different from the original works.

In the case of the famous screenshot, the AI just relayed the information it found on the web; it's not included in its training data.

So you're just wrong.

bonzini
2 replies
3h58m

Nope, it doesn't work that way. The fact that the LLM can regurgitate original articles doesn't remove the possibility that training can be considered transformative work, or more in general that using copyrighted material for training can be considered fair use.

Rather, verbatim reproduction is the proof that copyrighted material was used. Then the court has to evaluate whether it was fair use. Without verbatim reproduction, the court might just say that there is not enough proof that the Times's work was important for the training, and dismiss the lawsuit right away.

Instead, the jury or court now will almost certainly have to evaluate OpenAI's operation against the four factors.

In fact, I agree with the parent that ingesting text and creating a representation that can critique historical facts using material that came from the Times is transformative. An LLM is not just a set of compressed texts, people have shown for example that some neurons fire when you are talking of specific historical periods or locations on Earth.

However, I don't think the transformative character is enough to override the other factors, and therefore in the end it won't/shouldn't be considered fair use IMHO.

w4ffl35
1 replies
53m

What if the LLM is running locally and doing all of these things rather than hosted on a webserver which is serving the content?

bonzini
0 replies
39m

It doesn't matter, if everything else stays the same what matters is what it's used for. If it's used to make money, it would certainly hurt claims of fair use—maybe not for those that do the training, but for those that use it.

tantalor
0 replies
4h11m

Only humans can do those things, so the test fails for LLM

NotMichaelBay
1 replies
4h0m

it's clearly not helping their market value if people are checking on ChatGPT instead of reading a NYT article.

People are not using ChatGPT as a replacement for current news, and because of hallucinations, no one should be using it for past news either. I wouldn't remotely call ChatGPT a competitor of NYT traffic, like I would Reuters or other news outlets.

jprete
0 replies
1h17m

The intended result is clearly to supplant other information sources in favor of people getting their information from ChatGPT. Why should it matter to legality that the tech isn't good enough for the goal?

tbcj
0 replies
4h31m

I don’t think the original point being made was that NYT wasn’t justified in bringing the action. The point that was being made was the suit would be ultimately meaningless in the long term even if it was successful in the short term. There is a potentially more significant risk in the future that this suit will not protect against because of the reasons enumerated by the author. While the author is speculating, the law struggles with technology and adapting to change, which makes their prediction useful because it does highlight the problems that are coming that can’t be readily mitigated through legal precedent.

ciabattabread
3 replies
4h53m

rent seeking media companies

Rent seeking? Media companies that actually create content are rent seeking? Versus the garbage hallucinations AI creates?

stuckinhell
1 replies
3h16m

The New York Times is a dying company that is rent seeking here. A long time ago, their content was valuable, yet now you can't even give it away to researchers.

I know because they tried to make a deal with my company, we passed because social media data is infinitely more valuable.

ciabattabread
0 replies
2h54m

Because its usefulness to your private jet fund is the only measurement of value.

amadeuspagel
0 replies
1h42m

Rent seeking is an awful term that was from the beginning intended to describe anyone pursuing a political or legal goal that deviates from a pure free market economy. As Econlib writes:

”Rent seeking” is one of the most important insights in the last fifty years of economics and, unfortunately, one of the most inappropriately labeled. Gordon Tullock originated the idea in 1967, and Anne Krueger introduced the label in 1974. The idea is simple but powerful. People are said to seek rents when they try to obtain benefits for themselves through the political arena. They typically do so by getting a subsidy for a good they produce or for being in a particular class of people, by getting a tariff on a good they produce, or by getting a special regulation that hampers their competitors. Elderly people, for example, often seek higher Social Security payments; steel producers often seek restrictions on imports of steel; and licensed electricians and doctors often lobby to keep regulations in place that restrict competition from unlicensed electricians or doctors.

https://www.econlib.org/library/Enc/RentSeeking.html

This is linked in the wikipedia article, which is even more confused:

https://en.wikipedia.org/wiki/Rent-seeking

truculent
0 replies
4h16m

Are media really rent-seeking? They create new content and analysis, for which they want to be compensated. It seems quite different to hoarding natural resources or land, for example.

nozzlegear
0 replies
21m

2. Research/hosting/progress will proceed. The US cannot stop this, only choose to be left behind. The world will move on, with China gleefully watching as their biggest rival commits intellectual suicide all to appease rent seeking media companies.

Sorry, is this the same China that has already introduced their own sweeping regulations on AI? Which in at least one instance forced a Chinese startup to shut down their newly launched chatbot because it said things that didn't align with the party's official stance on the war in Ukraine?

https://finance.yahoo.com/news/beijing-tries-regulate-china-...

https://nitter.unixfox.eu/CDT/status/1625936306814717952?337...

I don't disagree that research/hosting/progress will continue, but I'm not so sure that it's China who stands to benefit from the US adding some guardrails to this rollercoaster.

notahacker
0 replies
4h20m

If Microsoft doesn't get royalty free rights to resell access to everyone's content on demand, China will become the powerhouse of interference-free media? Rrrrrright....

maxlin
0 replies
51m

This is the actual truth. Where it does hurt is citing the data, but GPT-4 doesn't do that to start with unless the text comes directly from a web result rather than the weights.

bonzini
0 replies
4h8m

GPT-4V can easily whitewash its entire copyrighted training corpus to be unrecognizably distinct

Is that just by increasing the temperature, tweaking the prompt, etc.? If you can operate on the raw weights and recreate the original text, copyright infringement still applies.

aurareturn
19 replies
12h13m

Companies that have content all see dollar signs.

NYT won't mind if you use their content to train LLMs - as long as they get a commission. Reddit will shut down their free API and make you pay to get training content. Discord is going to be selling content for AI training too - if they haven't already done so. Twitter is doing it.

They didn't care before because LLMs were just experiments. Now we're talking trillions of dollars of value.

up2isomorphism
11 replies
11h59m

"They" also include the people working there. Why someone work with full time writing articles should give the work for free just let someone to train it and make money out of it as a consequence?

tucnak
5 replies
11h53m

Why should someone who works full time writing articles give that work away for free

They are not giving it out "for free"; in fact, they're being paid by their employer to write these articles. Moreover, the writers themselves stand to gain nothing financially from their past writings, as they don't belong to the ownership structure of the business.

MisterBastahrd
3 replies
10h52m

Their ability to make money in the future is directly tied to their employers' ability to make money with their content. This is a closed financial loop. If OpenAI or any other AI company wants in, they should pay a licensing fee or get the laws changed, not just assume that they can take what they want and pretend like there are no negative consequences for the creator or the rights-holder.

tucnak
0 replies
4h58m

This is a closed financial loop.

This is a badly formulated conjecture, or worse, ultimately a selective reading of "social credit" whose only purpose is serving your argument; it has nothing to do with economics. I'm sorry, but I'm not convinced.

malwrar
0 replies
9h55m

In this limited example, are there such consequences? Are people dropping NYT subscriptions because they trust chatgpt to inform them of current events? I don’t buy it.

TeMPOraL
0 replies
9h52m

No one is pretending there are "no negative consequences for the creator or the rights-holder". Of course there are. But this is a story of rights-holders, who've already outgrown their usefulness, wanting to tap into a money stream they are not entitled to.

ChatGPT isn't competing with NYT on a core competency. No one uses LLMs for original news reporting. They're obviously incapable of doing that, by virtue of not being there on the scene or able to independently research a topic, maintain relationships with sources, etc. What ChatGPT can do is quote/reproduce some parts of past articles, and reason from them. Or at least produce new text that's somewhat related to the old text.

The threat to NYT is this: ChatGPT is a much better bullshitter than they are, so it reduces NYT to its core competency: providing original information. Which is all it should be doing in the first place. But instead, NYT wants to not only keep the bullshitting part of its revenue, but also take a cut of, or destroy, the much greater and much more useful enterprise where all this feeds a general-purpose language model.

bloppe
0 replies
11h35m

the writers themselves stand to gain nothing financially from their past writings, as they don't belong to the ownership structure of the business.

This is a dumb argument. We're not just talking about ancient articles. We're talking about new content, including content that is yet to be written.

amadvance
2 replies
11h11m

Why should someone who works full time writing articles give that work away for free

Open source developers did that ;)

KETHERCORTEX
1 replies
11h2m

When open source developers do that, they also include explicit licensing information that lists the cases in which usage is allowed or restricted. So even if the code is open source and licensed under the GPL, its usage in a closed source product like ChatGPT is not allowed.

fsckboy
0 replies
10h5m

GPL code usage in closed source ChatGPT is allowed "for internal use"; it just would not be allowed to distribute closed source binaries of ChatGPT without making the source available. (Offering online access to a program without making the source available is a violation under the AGPL3 specifically, not plain GPL3.)

johngladtj
0 replies
7h33m

You understand that news isn't copyrightable, right?

You're fighting a strawman that doesn't exist...

ReptileMan
0 replies
8h3m

With the ways NYTimes has degraded since 2010 even if people there are working for free, they're still being overpaid. The only adequate section there is the food.

mvdtnz
3 replies
9h58m

NYT do not "have" content, they create content. It's their raison d'etre.

aurareturn
2 replies
8h13m

They have content that LLMs want to use in training - millions of historical articles.

esperent
1 replies
7h47m

They created that content. It's an important distinction to make as compared to Reddit or Facebook where the users created the content.

midasuni
0 replies
5h48m

The journalists created the content for the NYT; the users created it for Facebook. Both received something in return for their effort, and the content ended up being owned by NYT/Facebook.

MuffinFlavored
2 replies
12h8m

They didn't care before because LLMs were just experiments. Now we're talking trillions of dollars of value.

Can you make the argument this was their fault for not having forward vision/being asleep at the wheel and "accidentally, in hindsight" letting OpenAI/others have free, open, unlimited access to their content?

bloppe
0 replies
11h38m

Basically none of the training material for GPT was used under an "unlimited" license. There are very important legal limitations. GPT just doesn't care much about them.

aurareturn
0 replies
11h29m

No, I can't. It's just an observation with no personal opinion.

altals2023
19 replies
11h33m

Won't hold up in court. GPT is a platform mainly providing answers to private individuals who ask. It's like asking a professor a question and having him answer, verbatim and word for word, from available copyrighted materials (due to photographic memory). Now if you take this answer and write a book, or publish it en masse on blogs for example, then you are the one who should be sued by NYT. If GPT uses the exact same wording and publishes it to everyone visiting their page, then that is on OpenAI.

Vegenoid
9 replies
11h15m

If said professor offered a service where anyone could ask them for information that is behind a paywall, and they provided it without significant transformation, this would certainly be copyright infringement that the copyright holder would have every right and motivation to take action against.

unsupp0rted
3 replies
10h56m

Would parroting back article content perfectly from memory certainly be copyright infringement?

verve_rat
2 replies
10h49m

Go perform a song in a public place without a licensing arrangement and let us know.

unsupp0rted
0 replies
9h20m

My favorite example of performing a song in a public place without a licensing arrangement:

https://youtu.be/j_UoACEUZqA

infinityio
0 replies
9h44m

Scale is important here - maybe a better analogy is setting up a paid Spotify clone with all the music sourced from torrents, with some slight distortion effect added.

elashri
3 replies
11h4m

I think only the scale matters here (probably), because I'd find it hard to believe that a teacher/professor would not be allowed to set up a service where they teach and provide their knowledge to others. That is basically the concept of teaching. Of course, until LLMs, we never had this scale before: millions of potential learners vs. the normal hundreds in a classroom session. That is what makes the new case interesting.

toyg
2 replies
10h49m

"Teaching" by copying source books word for word, would be copyright infringement; see, for example, the well-known issues around photocopying books or even excerpts.

Also lying on source materials (e.g. telling students that some respected historian denies the Holocaust happened, when it's obviously not the case) is not "teaching" - it's defamation, and the NYT is absolutely right to pursue that angle too.

Using LLMs as general-purpose search engines is a minefield; I would not be surprised if the practice disappeared in the next 20 years. Obviously the tech is here to stay, and there is no problem when it's applied to augmenting niche work; but as a Google replacement, it has so many issues.

shkkmo
1 replies
10h23m

Teaching" by copying source books word for word, would be copyright infringement; see, for example, the well-known issues around photocopying books or even excerpts.

Incorrect. Educational use helps satisfy one of the tests for fair use. Teachers can, in many cases, photocopy copyrighted work without infringing on that copyright.

heavyset_go
0 replies
9h47m

Educational use is just one of the many factors used to determine whether an instance of copyright infringement is fair use or not, but it is not carte blanche for educators to ignore IP laws just because they're educating.

gedy
0 replies
11h4m

Professors are largely behind a paywall

__loam
6 replies
10h42m

I hope people start calling the "well it's fine if a human does it" arguments out for the rat-fuck thinking they are. These are computational systems operating at very large scales, run by some of the wealthiest companies in the world.

If I go fishing, the regulations I have to comply with are very light because the effect I have on the environment is minimal. The regulations for an industrial fishing barge are rightfully very different, even if the end result is the same fish on your plate.

Garrrrrr
3 replies
10h31m

Unfortunately that's not the crowd of people here. 80% of the comments under this thread (right now, 2:52 EST) are making similar arguments and *continue* to act like LLMs are doing something unique/creative... instead of just generating sentences, from algorithms, from virtually pirated content in the form of data mining.

kriro9jdjfif
1 replies
10h1m

“It is difficult to get a man to understand something, when his salary depends on his not understanding it.”

https://www.goodreads.com/quotes/21810-it-is-difficult-to-ge...

__loam
0 replies
5h44m

Gotta get that tender offer money somehow.

c1b
0 replies
9h59m

“As if LLMs are doing something creative and aren’t just algorithms”

You have no idea what you’re talking about huh?

visarga
1 replies
9h40m

GPT is like a fleet of small fishing boats, each user driving their boat in another direction, not a fishing barge. For every token written by the model there must be a human who prompted, and then consumed it. It is manual, and personal, and deliberate.

In fact all the demonstrations in the lawsuit PDF were intentionally angling to reproduce copyrighted content. They had to push the model to do it. That won't happen unless users deliberately ask for it. It won't happen en masse.

__loam
0 replies
5h45m

Gpt is operated by one company. If a million people eat your fish, you're still a barge.

Boo hoo they had to push it. That was never the problem with these bullshit nozzles. The issue is they put that stuff in the training set in the first place. If you can't be honest about that then I have no interest in debating this with you.

thinkingemote
0 replies
9h43m

The professor, having been trained in academia, would state the sources of the verbatim quotes. In writing papers he would use references and explicit quotes. There's nothing hidden going on with the professor.

heavyset_go
0 replies
9h51m

Professors and schools get into legal problems when professors pirate and/or otherwise distribute content they don't have licenses for.

habosa
17 replies
4h58m

People who think the examples in the lawsuit are “fair use” need to consider what that would mean. We’re basically going to let a few companies consolidate all the value on the Internet into their black boxes with basically no rules … that seems very dangerous to me.

I hope a court establishes some rules of engagement here, even if it’s not this case.

w4ffl35
9 replies
4h56m

Scraping is legal, and this seems like a transformative work to me.

aqme28
8 replies
4h44m

Returning the full text of an article verbatim seems to me like the opposite of "transformative."

Symmetry
5 replies
4h31m

In the screenshot for the article you can see that the LLM says it is "Searching for: carl zimmer article on the oldest DNA". That, and what I know about how LLMs work, suggest to me that rather than the article being stored inside the trained LLM it was instead downloaded in response to the question. So the fact that the system is providing the full text of the article doesn't really go to whether training the LLM is a transformative use or not.

bonzini
4 replies
4h11m

Yes, the screenshot in the article is clearly doing an Internet search. The exhibit in the lawsuit shows that you can complete an article by prompting GPT with its first sentence, using a low temperature to aid reproducibility, and obtain the original except for a single word. That is another thing entirely, and it shows that the LLM has basically recorded the original text into its weights in compressed form: https://pbs.twimg.com/media/GCY4WC6XYAAq-JS?format=jpg&name=...
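
For the curious, here is a minimal sketch of that kind of probe using the OpenAI Python client; the model name and prompt are placeholders, not the exhibit's exact setup:

    # Minimal sketch, assuming the openai Python package (>= 1.0).
    # temperature=0 gives near-deterministic decoding, which makes
    # memorized continuations easier to reproduce.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{"role": "user",
                   "content": "Continue this article: <first sentence here>"}],
    )
    print(resp.choices[0].message.content)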

mattdesl
3 replies
3h35m

It would be interesting to test this on a larger sample than just a few. It is hard to believe that a majority of NYT articles are stored verbatim in the weights of a web-wide LLM; if that were the case, it would be a pretty unbelievable revelation about these models' ability to compress an entire web's worth of data. More likely, I assume it is a case of overfitting, or simply of finding a prompt that happened to work well.

FWIW, I can’t replicate on either GPT 3.5 or 4, but it may be that OpenAI has added new measures to prevent this.

bonzini
1 replies
2h28m

You can't reproduce it on the web interface, because the temperature setting there is higher than what's required to recover the memorized text. You need to use the API.

However, I had good luck reproducing poems on GPT 3.5, both copyrighted and not copyrighted, because the choice of words is a lot more "specific" so to speak, and therefore higher temperature isn't enough to prevent complete reproduction of the originals. See https://chat.openai.com/share/f6dbfb78-7c55-4d89-a92e-f4da23... (Italian; the second example is entirely hallucinated even though a poem with that title exists, while the first and third are recalled perfectly).

mattdesl
0 replies
28m

It doesn’t seem that surprising; compared to entire NYT articles, poems are short, structured and more likely to be shared in multiple places across the web.

I’m more surprised that it can repeat 100 articles; if that behaviour is consistent at larger sample sizes and beyond just the NYT dataset (which might be repeated on the web more than other sources, causing overfitting), that would be impressive.

You could imagine at some point a large enough GPT5 or 6 or 7 will be able to memorize verbatim every corner of the web.

dwringer
0 replies
2h32m

I have attempted this sort of thing with GPT 3.5 many times and never been successful, although I've still never been taken off of the GPT4 waiting list that I signed up for months ago and I'm not going to subscribe without trying it first. I [and presumably many thousands of others] have tried things like this with many LLMs and image generating models, but to my knowledge we've come up rather short. I've never managed to recreate anything verbatim and have struggled to get anything resembling a copyright infringement out of stable diffusion with the sole exception of a meme image of Willy Wonka.

That said, the meme image of Willy Wonka comes out of stable diffusion 1.5 almost perfectly with surprising frequency. Then again, this is probably because it appeared hundreds or thousands of times in the training set in all sorts of contexts because it's such a popular meme. There is a tension between its status as an integral part of language and its nature as a copyrighted screen grab.

tantalor
1 replies
4h29m

That's not what "transformative" means for copyright.

It's more like, is the new work a distinct expression, e.g. satire or commentary, based on the original.

You can reproduce the original verbatim and still be transformative by adding an element of critique.

Example: https://www.dmca.com/articles/akilah-obviously-vs-sargon-of-...

alphaoverlord
0 replies
4h22m

I don’t think the examples shown reflect an element of critique.

stainablesteel
2 replies
4h40m

A court has established this already:

in Japan, where they said anything goes for AI.

So it's best not to lose a competitive edge over things that people openly publish on the internet; if you put it out there for everyone to see, then expect other people to use it.

VWWHFSfQ
1 replies
3h52m

A court in Japan will have no impact on the outcome of a copyright lawsuit in USA. Not to mention that it doesn't really matter how a Japanese court ruled since it's all governed by treaties anyway. They will change their laws if required to.

stainablesteel
0 replies
1h43m

It's not about applying laws across different countries.

It's about a precedent. If you don't keep up with international competition, you lose.

serjester
2 replies
1h57m

I see the exact opposite - any open source model is going to become prohibitively expensive to train if quality data costs billions of dollars. We're going to be left with the OpenAIs and Googles of the world as the only players in the space until someone solves synthetic data.

xbar
0 replies
32m

This feels like a 1996 "music is too expensive for kids so they HAVE to pirate it."

wraptile
0 replies
36m

Exactly this. I work at a small web scraping company (so I might be a bit biased), and today any small business can collect a fair, capable dataset of public data for model training, sentiment analysis or whatever. If public data is stopped by copyright as this lawsuit implies, that would just mean only giant corporations and pirates would be able to afford this.

This would be a huge blow to open-source and research developers, and I'd even argue it could help OpenAI gain a bit of a moat à la regulatory capture.

SmoothBrain123
0 replies
4h43m

Are you talking about search engines, or something else?

tarruda
16 replies
10h52m

Would be funny if the NY Times won this and all commercial LLMs were shut down.

Then LLMs would be distributed only via torrents, like most copyright infringing media.

__loam
11 replies
10h40m

Making these things anathema to commercial interests and making training them at scale legally perilous would be a huge win.

mdekkers
3 replies
10h23m

making training them at scale legally perilous would be a huge win.

Why?

fsckboy
0 replies
10h13m

I have no idea what he's thinking, but if everybody in the community here had an LLM in their pocket and large orgs did not, it would at least be kind of fun.

anonymousab
0 replies
2h57m

Because the megacorps should have to pay the people creating the works they are training their multibillion/eventual multitrillion dollar systems on, and should get a nice rake to the face when they try to do an end run around it.

__loam
0 replies
5h48m

The open source people can continue to pretend they matter in this field and large corporations like Microsoft will stop stealing everything that moves on the internet.

StableAlkyne
3 replies
10h15m

A huge win for countries with lax copyright laws. These things aren't going away, the worst case scenario would be exactly that scenario playing out - then China (or some other peer to the US's tech sector) just continues developing them to achieve an economic advantage. All in addition to the obvious political implications of AI chatbots being controlled by them.

The LLM genie is out of the bottle: an unfavorable court ruling in a single country isn't going to stuff it back in.

geysersam
1 replies
9h26m

Do LLMs really give an economic advantage though? I've mostly seen them used to write quirky poems and bad code. People are scrambling to find use-cases but it's not very convincing so far.

On the other hand, if LLMs are used to "launder" copyrighted content and, accepting the premises of copyright law, this has the effect of reducing incentives to do creative work, that has obvious negative implications for economic productivity.

StableAlkyne
0 replies
9h18m

I've mostly seen them used to write quirky poems and bad code.

Assuming this is in good faith: the ability to write code, documentation, and tests is absolutely a productivity enhancer to an existing programmer. The code snippets from a dedicated tool like copilot are of very usable quality if you're using a popular language like Python or JS.

__loam
0 replies
5h52m

I don't give a shit about what China does.

Dalewyn
2 replies
9h56m

making training them at scale legally perilous

Loading data you have no rights over into your software is legally perilous, yes.

It's as easy as simply asking for and receiving permission from the data's rightsholders (which might require exchange of coin) to make it not legally perilous.

__loam
1 replies
5h48m

Sounds expensive.

Dalewyn
0 replies
5h8m

If you want to do things with other people's stuff, yes it can get expensive.

sgt101
1 replies
8h57m

What will happen in this case is that large content providers will get paid directly, and smaller content providers will get rolled up into a licensing bag and get small indirect payouts. For example, we might see a model where people whose books have been used get a payout proportionate to the sales of the book (perhaps), so if your book sells just a few thousand copies expect $20, but if you sell millions expect $20k.

LLMs will become more expensive and less attractive as money printers. This will screw with the business models of the direct-provision folks like OpenAI, MS and Google; MS and Google will only shed tears for the money spent, while OpenAI will just not have as good an income stream until they think of something new.

davedx
0 replies
6h0m

large content providers will get paid directly

I'm sure that's what they want, but I'm not sure that's what the outcome will be. What if they want to charge a prohibitive amount of money for their content?

realusername
1 replies
10h39m

They would still thrive but in other countries with other legal frameworks. The concept is way too valuable to disappear.

kjkjadksj
0 replies
1h50m

If it's economically relevant, the US will use its iron fist to have its laws adopted the world over, as with most things such as copyright or drugs.

logicchains
12 replies
10h3m

NYT's perspective is going to look so stupid in future when we put LLMs into mechanical bodies with the ability to interact with the physical world, and to learn/update their weights live. It would make it completely illegal for such a robot to read/watch/listen to any copyrighted material; no watching TV, no reading library books, no browsing the internet, because in doing so it could memorise some copyrighted content.

type_Ben_struct
5 replies
9h55m

I disagree. The verbatim part is the problem. You're drawing a comparison to how humans operate, except we're not allowed to operate like that.

While harder to do as a human, if I memorised a copyrighted book and then did a live reading on TV, or produced replicas from memory and sold them (the most comparable example), I'd be sued.

Humans produce derivative work all the time, and it's fine for LLMs to do that too, but you can't do it verbatim.

logicchains
3 replies
9h5m

or produced replicas from memory and sold them (the most comparable example), I’d be sued.

This is not the most comparable example, because it's not what ChatGPT is doing. The most comparable example is if you were hired as a contractor and the employer asked you to write verbatim some copyright content you'd memorised. If the employer then published it, they'd be the one liable, not you.

Humans produce derivative work all the time, and it’s fine for LLM’s to do that, but you can’t do it verbatim.

Nobody's suggesting preventing humans from consuming any copyrighted content just because in future they might recite some of it verbatim, but that's what NYT want for LLMs.

tsimionescu
2 replies
8h50m

The most comparable example is if you were hired as a contractor and the employer asked you to write verbatim some copyright content you'd memorised. If the employer then published it, they'd be the one liable, not you.

No, you'd both be liable. You are not allowed to create copies of a copyrighted work, even from memory, for any commercial purpose. Making it public or not is irrelevant.

This is more obvious with software: if I copy a version of AutoCAD that my previous employer bought and sell it to another company, or even just use it for my current employer without showing it to anyone else, I am violating the copyright on that software, and I am liable. Even though obviously no "publishing" happened.

Similarly, if you hire a decorator to paint Mickey Mouse on the inside walls of your private kindergarten, the decorator is violating Disney's copyright just as much as you are, even if neither of you has made that public.

YuccaGloriosa
1 replies
8h8m

Your previous employer never bought AutoCAD; they licensed its use, paying a subscription. When you stopped working for them, that licence was no longer available to you, so you would be unable to subsequently use it.

tsimionescu
0 replies
8h1m

Unable legally, yes, but I may find illegal ways. And the reason it is illegal to copy is, in the end, copyright. The license is only (legally) required because of copyright.

kromem
0 replies
2h21m

Then we should be focused on policing the usage of the model, not the training of it.

That's the point at which infringement occurs in your example. It's not the memorizing that's the infringement, it's the reproduction from your memory.

We shouldn't be regulating your hippocampus encoding the book, but your reproducing the book from that encoding.

Similarly, we shouldn't be regulating the encoding of material into the NN, but the NN spitting back out the material.

ramraj07
1 replies
9h59m

Will it? If the LLM in the body is allowed to read nytimes on a tablet I'm sure they wouldn't care.

logicchains
0 replies
9h3m

If the LLM in the body is allowed to read nytimes on a tablet I'm sure they wouldn't care.

Why should the law treat a LLM in a body reading NYT on a tablet differently than a LLM browsing the content from a website online and reading that?

makeitdouble
1 replies
6h0m

Memorising isn't the issue. It's providing it back verbatim and/or cutting access to the source.

You'd get the same problem with someone with a photographic memory that a group of people turn to for recitations of the news instead of buying the newspaper.

As of now public performance of copyrighted material is infringement.

kromem
0 replies
2h27m

That's not the case, as they aren't trying to get a ruling on the forced reproduction by prompt as infringement, but rather to get a ruling that training is infringement.

I fully agree with the perspective that infringement in usage needs to be limited even if I strongly disagree that training is infringement.

CJefferson
1 replies
7h12m

Are those LLMs independent citizens we are going to give rights to? Then I'm fine with that.

Are they all owned by one mega-corporation, which is going to do as capitalism does, and use them to squeeze money out of all of us? Then I'm happy to ban them.

kromem
0 replies
2h24m

"Let's ban something capable of diagnosing medical conditions and letting coma patients to communicate with an EEG because it learned the relationships between words from a giant data set of scraped data and is owned by a company" is a pretty callous take IMO.

The opportunity cost of holding this technology back is going to literally be millions of people's lives given current trends in its emerging applications.

Police usage, not training.

cycrutchfield
12 replies
12h14m

I read a NYT article and publish a summary of facts that I learned: totally legit.

Train a model on NYT text that outputs a summary of facts that it learned: OMG literally murder.

bloppe
7 replies
11h23m

Sounds like you didn't read the article. Here's a better synopsis:

I read a NYT article and publish an exact copy of that article on my website: copyright infringement.

Train a model on NYT text and it outputs an exact copy of that text: also copyright infringement.

cycrutchfield
4 replies
11h14m

So presumably when they fix that issue (which, if the text matches exactly, should be trivially easy) then would you accept that as a sufficient remedy?

Vegenoid
1 replies
3h40m

Copyright infringement is not avoided by changing some text so it isn’t an exact clone of the source.

Determining whether a work violates a copyright requires holistic consideration of the similarity of the work to the copyrighted material, the purpose of the work, and the work’s impact on the copyright holder.

There is not an algorithm for this, cases are decided on by people.

There are algorithms that could detect obvious violations of copyright, such as the one you suggest which looks for exact matches to copyrighted material. However, there are many potential outputs, or patterns of output, which would be copyright violation and would not be caught by this trivial test.

cycrutchfield
0 replies
2h33m

And you think that it would be impossible to train a model to avoid outputs that are substantially similar to training data?

tarruda
0 replies
10h57m

then would you accept that as a sufficient remedy?

Probably not until they pay him a hefty copyright fee.

bloppe
0 replies
2h5m

Basically, ya. It's not enough to change just a couple words around. But ya, there's probably some way to engineer around the problem.

slyall
1 replies
11h2m

A small number of outputs of ChatGPT are close enough to training articles to be (probably) copyright infringement.

What does that mean?

Look up "substantial non-infringing use" and this little court case:

https://en.wikipedia.org/wiki/Sony_Corp._of_America_v._Unive....

Now spend a few million on lawyers and roll your dice.

postexitus
0 replies
8h15m

In the Sony vs. Universal case, Sony is the producer of a tool that the consumer uses to "time-shift" a broadcast they are legally allowed to view. Similarly, you can rip your own CDs or photocopy your own books. That case never made reselling such content legal. OpenAI does not train ChatGPT on content you own - they train it on some undisclosed amount of data that you may or may not have a legal right to access, and then move on and (as has been shown) reproduce it nearly verbatim - they may even charge you for the pleasure.

bad_user
1 replies
11h46m

Fair use is intended for humans, much like copyright in general.

If you can't copyright AI-generated pieces, then why would fair use apply to LLMs?

mdekkers
0 replies
10h20m

Fair use is intended for humans.

Is it? Can you quote relevant legislation or case law?

zozbot234
0 replies
11h52m

Because it's not just summarizing the bare facts. It's a parrot.

up2isomorphism
0 replies
11h56m

That's why there will be a legal fight over fair use. Just letting your intellectual work be used as free training material is not sustainable.

Also, remember that copyright laws were not there in the first place.

outside1234
11 replies
12h52m

Seems reasonable - they probably broke the TOS of the site

thallium205
6 replies
12h31m

What if they OCR’d the newspapers? No ToS there.

steve1977
4 replies
12h4m

I’m pretty sure the physical newspaper is still covered by copyright too.

pyuser583
3 replies
11h49m

For the paper or the author? What exactly was the licensing agreement for Op-Ed authors in 1962?

bloppe
2 replies
11h30m

Read the article. It's not difficult to get ChatGPT to regurgitate recent, obviously copyrighted articles, verbatim.

thallium205
1 replies
4h47m

It will be equally easy to have ChatGPT rewrite copyrighted content so that the output is materially different enough for a copyright claim to fail.

bloppe
0 replies
2h3m

Then ChatGPT should do that.

product-render
0 replies
12h17m

It's at least partially a copyright claim, isn't it? So the method -- OCR or scraping -- doesn't matter, I think.

KETHERCORTEX
2 replies
10h45m

On the other hand, the NYT website willingly gave out all the information without imposing limitations. Seeing the terms of service requires visiting a separate page; they aren't shown immediately upon visiting the website. Understanding and accepting the terms also requires human interaction.

robots.txt on nytimes.com now disallows indexing by GPTBot, so there's an argument against automated information acquisition from a certain point onward, but before that point they weren't explicitly against it.
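
For reference, the rule in question looks like this in the standard robots.txt format (an illustrative excerpt, not a verbatim copy of nytimes.com's file):

    User-agent: GPTBot
    Disallow: /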

arrrg
1 replies
10h36m

Seems weird to argue that you have to speak up if you don’t want something done to you or else you consent to everything.

I do think that’s the case for some things but especially for new things that doesn’t seem like a common sense understanding of the world.

KETHERCORTEX
0 replies
8h5m

Seems weird to argue that you have to speak up if you don’t want something done to you or else you consent to everything.

If you don't want people to get at your land, setting up even a small fence creates an explicit indication of limitations. Just like the record in robots.txt I mentioned earlier.

The New York Times also doesn't limit article text content if you just request the HTML, which is typical for automated clients. The limits are imposed on users viewing the pages in a browser, with JavaScript, CSS and everything else. So they clearly:

1. Have a way to determine the user's eligibility for reading the full article on server side.

2. Don't limit the content for typical automated cases on server side.

3. Have a way to track the activity of users who are not logged in, determining their eligibility for access. So it's reasonable to assume that they had records of repeated access from the same origin, but didn't impose any limitations until some point in time.

So there are enough reasons to think that robots are welcome to read the articles fully. I'm not talking about copyright violations here, only about the ability to receive the data.
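
As a hedged illustration of what "just requesting the HTML" means here (assuming the Python requests package; the URL is a placeholder):

    # Fetch the raw HTML the way a simple bot would, without executing any JavaScript.
    import requests

    resp = requests.get(
        "https://www.nytimes.com/<some-article>",  # placeholder URL
        headers={"User-Agent": "Mozilla/5.0"},
        timeout=10,
    )
    print(len(resp.text))  # inspect whether the full article body is present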

yjftsjthsd-h
0 replies
11h37m

Did OpenAI agree to those ToS? If not, I think (IANAL) LinkedIn was kind enough to give precedent that it's irrelevant.

( https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn )

wouldbecouldbe
9 replies
8h53m

We developers like to pretend that LLMs are akin to humans and that they've been using things like the NYTimes as educational material, the way humans do.

But they are not. It's much simpler: proprietary writing is now integrated into OpenAI's product. It would be as if I copied parts of other proprietary code into my own codebase, then claimed copy-paste is a naturally evolving process of millions of years of evolution.

The fact that LLMs are so complicated that we don't know where that writing sits doesn't make it any less so.

logicchains
7 replies
8h51m

it would be as if I copied parts of other proprietary code into my own codebase

It's not copy-pasted; it's compressed in a lossy manner. Even GPT-4 has nowhere near enough memory to store the entirety of its training data in a non-lossy compression format. Just like how humans compress the information we read.
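
Some rough back-of-envelope arithmetic with public GPT-3-era figures (GPT-4's are unpublished) illustrates the point:

    # Back-of-envelope only; figures are published GPT-3 estimates.
    params = 175e9                # GPT-3 parameter count
    model_bytes = params * 2      # fp16 weights -> ~350 GB
    corpus_bytes = 570e9          # reported filtered training text, ~570 GB
    print(corpus_bytes / params)  # ~3.3 bytes of training text per parameter

That ratio is in the same ballpark as good lossless text compression, so wholesale verbatim storage is implausible once the weights also have to encode the model's general abilities; but it leaves plenty of room for oft-repeated passages to be memorized.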

mihaic
1 replies
4h6m

If you have a copyrighted photo that I simply put through jpeg compression, am I legally allowed to use that?

Software programs are not humans, and need to be treated differently. Anthropomorphization is one of the slipperiest paths to argue anything.

kromem
0 replies
2h30m

It depends on how much is reproducible and what the use is.

If only small patches of the original image can be reproduced then it becomes much more murky.

jamiek88
1 replies
1h23m

If it’s lossily compressed, how come they have verbatim content from NYT in there that's easy to recall? That’s what the lawsuit is about.

anon291
0 replies
1h3m

Many humans have photographic memories. Not common, but not unheard of for people to be able to memorize long portions of text verbatim.

For example, the Wikipedia article

https://en.wikipedia.org/wiki/List_of_people_claimed_to_poss...

contains several examples of people who were able to look at pages and recite them back. That is actually a much stronger ability than GPT since GPT has presumably looked at them 100 times.

wouldbecouldbe
0 replies
7h9m

You're kind of proving my comment by pretending they are akin to a human brain instead of an evolved form of statistics mixed with code, a.k.a. a transformer model.

Let alone that it's a centralised model that's being distributed for a fee.

wouldbecouldbe
0 replies
6h51m

So if I compress NYTimes articles into a vector database and query it with a vector, then that's okay, in line with your reasoning?

lacrimacida
0 replies
8h13m

Just like how humans compress the information we read.

Humans don’t have the scale machines have, and moreover humans aren’t services; that argument doesn’t fly.

I really think NYT's data isn't that important or crucial; LLMs could've just elided it. However, it's more about training on copyrighted data in general, which is kind of crucial for OpenAI: they trained their LLMs indiscriminately on copyrighted content without any plan to share any profits.

gumballindie
0 replies
7h55m

We developers like to pretend that LLMs are akin to humans and that they've been using things like the NYTimes as educational material.

Developers who think LLMs are akin to humans aren't the brightest crop, and are usually a target of ridicule.

biglyburrito
9 replies
13h7m

TLDR:

"The suit seeks nothing less than the erasure of both any GPT instances that the parties have trained using material from the Times, as well as the destruction of the datasets that were used for the training. It also asks for a permanent injunction to prevent similar conduct in the future. The Times also wants money, lots and lots of money: "statutory damages, compensatory damages, restitution, disgorgement, and any other relief that may be permitted by law or equity.""

downWidOutaFite
6 replies
12h50m

Wow they want to kill it. I wonder if we've just lived through the golden Napster era of LLMs.

readthenotes1
3 replies
12h49m

Just train on NYT articles no longer in copyright. We may be better for it.

vidarh
0 replies
10h50m

Or buy them. OpenAI's market cap is many times NYT's.

If we see court judgements start to go copyright owners way, we will also see a scramble from AI companies to buy the few publishers with enough data to be worth buying, and to create works for hire to replace the rest.

In the long run a copyright ruling like that will be a boon for OpenAI and all other players with deep enough pockets to do so, and massively harm everyone else who will suddenly find it far harder to build models legally.

rhdunn
0 replies
8h50m

So that would mean articles from the 1920s, provided that the authors of those articles have been dead for 70 years, or longer in some other countries.

mynegation
0 replies
11h55m

Next thing you know ChatGPT gives you the best way to crank your automobile and take good care of your crinoline.

suby
1 replies
12h34m

They may just want a licensing deal.

weikju
0 replies
12h18m

They're already working on it with Apple (see my other reply in this discussion), so I wouldn't doubt that this is another salvo in the same battle.

chongli
1 replies
12h21m

This is what lawyers are paid for. They ask for the max because there’s no harm in doing so. Everyone knows there’s little meaning to that.

greggsy
0 replies
12h18m

They always go for the max, knowing that they will settle somewhere closer to the expected rate.

jrockway
8 replies
9h55m

I read about this in the Times today (and am surprised that it wasn't on HN already).

My guess is that the court will likely find in the Times' favor, because the legal system won't be able to understand how training works and because people are "scared" of AI. To me, reading a book, putting it in some storage system, and then recalling it to form future thoughts is fair use. It's what we all do all the time, and I think that's exactly what training is. I might say something like "I, for one, welcome our new LLM overlords". Am I infringing the copyright of The Simpsons? No.

I am guessing some technicality like a terms-of-use violation of the website (avoidable if you go to the library and type in back issues of the Times), or storing the text between training sessions is what will do OpenAI in here. The legal system has never been particularly comfortable with how computers work; for example, the only reason EULAs work is because you "copy" software when your OS reads the program off of disk into memory (and from memory into cache, and from cache into registers). That would be copyright infringement according to courts, so you have to agree to a license to get that permission.

I think the precedent on copyright law is way off base, granting too much power to authors and too little to users. But because it's so favorable towards "rightsholders", I expect the Times to prevail here.

hsbauauvhabzb
4 replies
9h49m

My hard drive can - bit for bit - recall video files. If I serve them to other people on the internet without permission of the copyright holder, that’s called piracy.

ninjinxo
2 replies
9h41m

But is it still piracy if you compress them and serve only a likeness of the original?

madeofpalk
0 replies
8h58m

Yes.

hsbauauvhabzb
0 replies
9h24m

If 20% of a NYT article is recalled correctly, does that mean I can publish 20% of a movie if surrounded by junk? What if I do that 5 times over?

jrockway
0 replies
9h36m

Yeah, but the LLMs can't. They aren't big enough to contain every byte of every NYT article, even with the best-known compression algorithms. Rather, they pick up and remember the same patterns that humans do when they write. Authors of the articles also did that, and so the two algorithms (human writer, LLM inference) end up with the same result. (That doesn't preclude large chunks of text that are actually remembered, though. We humans have large chunks of verbatim text floating around in our brains. Passwords, phone numbers, "I pledge allegiance to the flag...", etc.)

Anyway, like I said, I don't think OpenAI will win this. Someone will produce one verbatim article and the court will make OpenAI pay a bunch of money as though every article could be reproduced verbatim, and AI in the US will be set back that many billion dollars. It probably doesn't matter in the long run; it preserves the status quo for as long as the judge is judging and the newspaper exec is newspaper exec-ing. That's all they need. The next generation will have to figure out how to deal with AI-induced job loss... and climate change. Have fun, next generation!

tsimionescu
0 replies
8h37m

In general, if you perform copyrighted works you are doing copyright infringement. There are certain exceptions (personal use, education, very small fragments with proper attribution, maybe a few others) but whether you are reading it aloud from a book or performing it from memory makes no difference.

So, if you set up a service like ChatGPT but powered by humans responding in real time to queries, and these humans would occasionally reproduce large chunks of NYT articles, they and the service itself would be liable for copyright infringement. Even if they were all reproducing these from memory.

Now, this is somewhat different from the discussion of whether training the model on the copyrighted data, even if it had effective protections from returning copies of it, constitutes copyright infringement in itself. I believe this is a somewhat novel legal question and I can think of no direct corollaries.

I certainly don't think we can just handwave and say "at some level, when a human reads a copyrighted work, they are doing the same thing", because we really don't know if that is true. Artificial neural networks certainly have no direct similarity with the neural networks in the brain as far as we can tell. And, even if they did, there is no reason to give a machine the same rights that a human has - certainly not until that machine can prove sentience.

pests
0 replies
9h49m

I don't agree that an LLM is doing what we are doing.

"Its what we do all the time" is a major assumption

SilverBirch
0 replies
8h32m

It's extremely speculative to claim that LLM models are basically doing what humans do. There is very clearly something that isn't right about that, because in order to learn to speak and converse, a human doesn't need to imbibe the entire corpus of all written text in human history - which is basically what we're doing with these LLMs. What we're giving them is vast amounts of data, which is totally unlike how humans work. There's very clearly some gap here between what an LLM is doing and what a human is doing. So you can't use that as a basis to justify why it's OK for OpenAI to operate like this.

To put it another way, let's say I turn the dial all the way the other way: I train the world's crappiest LLM on NYT material, it massively overfits, and all it will ever return is verbatim snippets of the NYT. Is that copyright infringement?

The core part of the argument here is actually just that OpenAI doesn't want to adhere to what the current standard is for using copyrighted material, if you want to use it and create something new with it you need to license the material. Since OpenAI's LLM isn't actually like a human it needs to license such a vast dataset that it would be uneconomical to run the business without stealing all the content.

atleastoptimal
7 replies
12h46m

It's obviously a frivolous suit that will only net at best a ceremonial victory for NYTimes: 8 figure max payout and a promise to not use NYtimes material in the future.

The trajectory and value to society of OpenAI vs NYtimes could not be greater. They have won no favors in the court of public opinion with their frequent misinformation. It's all just a big waste of time, the last of the old guard flailing against the march of progress.

And even hypothetically, if they managed to get OpenAI to delete ChatGPT, they'd be hated forever.

15457345234
4 replies
12h39m

They have won no favors in the court of public opinion with their frequent misinformation.

You mean GPT here, right?

atleastoptimal
3 replies
12h37m

ChatGPT only advertises itself as a fancy autocomplete. There is a disclaimer that it may produce output that appears correct but isn't. NYTimes written material purports to be the truth, and thus obviously shouldn't be held to the same standards as a generative AI.

15457345234
2 replies
12h30m

I think what we should focus on is the volume of misinformation in general, not the provenance of it.

The NYT may produce misinformation but it aims not to, and its staff of human writers are limited in the quantity that they can produce. They also publish corrections.

GPT enables anyone who can pay to generate a virtually unlimited volume of misinformation, launder it into 'articles' with fake bylines and saturate the internet with garbage.

I think we need to focus on the damage done.

realusername
0 replies
10h50m

The NYT may produce misinformation but it aims not to, and its staff of human writers are limited in the quantity that they can produce. They also publish corrections.

Except when it affects their bottom line, of course. They publicly lied about how meta tags work during the lawsuits against Google to get more money (like most newspapers did), and I have no doubt that they will extensively lie once again about how LLMs really work.

atleastoptimal
0 replies
12h28m

Well, that's true for any large language model. As long as they exist, there will be a deluge of bot-written text producible for any purpose. At this point there is no putting the cat back in the bag.

In that case the bigger danger is open-source LLMs. OpenAI at least monitors the use of their endpoints for obvious harm.

faeriechangling
0 replies
11h17m

I've never really known The New York Times to file frivolous lawsuits.

23B1
0 replies
12h31m

Nobody is looking at this suit as applying to the Times exclusively – and neither will the courts.

andy99
7 replies
3h48m

This, or a lawsuit like it is going to be the SCO vs IBM of the 2020's, to wit: a copyright troll trying to extract rent, with various special interests cheering it on to try and promote their own agenda (ironically it was Microsoft that played that role with SCO). It's funny how times have changed and at least now a louder group seem to be on the troll's side. I hope to see some better analysis on the frivolity of this come out. There may be some commercial subtlety in specific cases that doesn't depend on scraping and training, but fundamentally using public internet data for training is not copying, is fair use, and is better for society as a whole than whatever ridiculous alternative might be proposed.

edit: I'm speaking about training broadly capable foundation models like GPTn. It would of course be possible to build a model that only parrots copyrighted content and it would be hard to argue that is fair use.

phkahler
1 replies
3h36m

> There may be some commercial subtlety in specific cases that doesn't depend on scraping and training

The key is to stop calling it "training" and use "learning" or just "reading".

The argument from NYT will probably be that LLMs are just a fancy way to compress or abstract information and spit it back out. In which case "training" seems to support their case?

mycall
0 replies
3h16m

I don't recall the source, but when people read, they typically only remember 20% of what they read (or heard?). Machine training encodes much more than 20%, so it is much closer to copying than training. Now the emergent abilities that come from this could be considered learning and dare I say imagination (which is the opposite of copying).

dannyr
1 replies
3h45m

NYTimes has a paywall. Is that public internet and therefore fair use?

yreg
0 replies
2h49m

They don't have the paywall up if you identify as a search engine scraper, so it is kinda public internet. (I'm not claiming it's fair use.)

xbar
0 replies
24m

That is an irrelevant comparison.

This is theft and monstrous profit from theft. For actual justice this should be a class action suit of the world vs. OpenAI/Microsoft and the financial consequences should be company-ending for OpenAI. Otherwise, you have incented everyone in the AI industry to steal as much as they can for as long as they can.

logicchains
0 replies
3h37m

It's funny how times have changed and at least now a louder group seem to be on the troll's side

Because for many people, their views on current events are whatever the "thought leaders" working for the NYT and similar publications tell them to think.

jamiek88
0 replies
3h40m

Using the words troll and frivolous undermines your otherwise decent point and in fact goes against your point.

The law isn’t settled, it’s a genuine legal question mark.

It ain’t frivolous or trolling or ridiculous.

munchinator
5 replies
9h22m

Why hasn't the Times also sued the Internet Archive? They've tried to block both the Internet Archive [1] and Open AI [2] from archiving their site, but why have they only sued OAI and not IA? The fact that they haven't sued IA which has comparatively little money would seem to indicate that this is not about fair use per se, but simply about profit-seeking and the NYT is selecting targets with deep pockets like OAI/MS.

[1] https://theintercept.com/2023/09/17/new-york-times-website-i...

[2] https://fortune.com/2023/08/25/major-media-organizations-are...

killingtime74
2 replies
9h14m

What's wrong with that? If I were the NY Times' lawyers, that's what I would advise. What would it serve to bankrupt the IA? They can't pay anyway. These are corporations enforcing their rights against one another.

There is nothing wrong with seeking profit from your copyright. That's literally their entire business model... they publish copyrighted content which they sell via subscription.

OpenAI and others could easily have negotiated a licence instead of just using the data. They bet that it would be cheaper to be sued; let's find out if they bet correctly.

Tangentially that's what Apple did with the sensor in their watch, it doesn't always pay off.

munchinator
1 replies
9h9m

What would it serve to bankrupt the IA, they can't pay anyway?

It would serve the termination of the infringement.

My point is that the Times doesn't particular seem to care about infringement per se, they care about getting their slice of the cut from that infringement.

It's like if a video game company or a movie company only attempted to sue illegal downloaders who had a certain net worth.

sensanaty
0 replies
5h47m

It's like if a video game company or a movie company only attempted to sue illegal downloaders who had a certain net worth.

I mean yeah, no one's gonna bother trying to squeeze money out of Joe Schmoe with 10 bucks in his bank account over some pirated movies. If a company with billions and billions of dollars like Netflix started pushing out pirated movies instead, then obviously they'd be sued into oblivion, as they should be.

sgt101
0 replies
9h4m

I think that the moment you start making big money from someone else's business is the moment that they get riled. That and when you really hurt their business. I suspect that the NYtimes thinks that IA is damaging them in the order of (possibly) $100k pa, and that it thinks that OpenAI is making in the order of $10M's from their content (and possibly doing some damage as well). It's an easy commercial decision to ignore one and go after the other - especially as going after IA is going to create some backlash as well.

Shrezzing
0 replies
9h1m

Copyright doesn't stop the collection of content; it stops the copying, processing, and redistribution of content. The Internet Archive acts as a library, so it's widely accepted as fair use when it makes collections of webpages available.

OpenAI's distribution is materially different to that of a library, so it's not a like-for-like comparison.

One of the main tests of copyright law (at least in the US) is whether the entity distributing is _selling_ the copied/derivative work. It's unambiguous that OpenAI is selling something akin to derivative works, which is why NYT feels they can go after this claim. Meanwhile, IA's operations don't create sales or profits, so while NYT's legal team may be able to establish that copies have been distributed, without the _sale_ aspect of the infringement, judges aren't guaranteed to side with NYT in a legally expensive PR nightmare.

fsckboy
4 replies
10h8m

in my head I like to think of web crawler search engines/search engine databases and LLMs as being somewhat similar. Search engines are ok if they just provide snippets with citations (urls), and they would be unacceptable if they provided large block quotes that removed the need to go to the original source to read the original expression of more complex ideas.

A web-crawled LLM that lived within the same constraints would be a search engine under another name, with a slightly different presentation style. If it starts spitting out entire articles without citation, that's not acceptable.

aurareturn
3 replies
9h57m

I think it's different. LLMs can solve problems, and part of that problem-solving ability comes from training on completely unrelated content such as NYT articles. GPT-4 doesn't have to spit out NYT articles verbatim to have benefited from them; it draws on NYT articles in every query.

fsckboy
2 replies
9h32m

Let's say I'm an academic; if my research, note-taking, and paper writing skills lead to fair-use, cited quotations where applicable, general knowledge not identified, and the creative aspects and unique conclusions creating the intriguing part of my work, that's copacetic. If I spit out (from memory, mind you) verbatim quotes and light rewordings of NY Times articles, that's not; "I don't remember where I got that material" doesn't cut it. My reading the NY Times every day for years because I judge it to be more literate and accurate than other sources, undoubtedly it has informed my thinking and style, but I don't need to acknowledge that.

If I use ChatGPT as a research tool, as long as it lives within the same parameters that I have to live within, I don't see a problem with its education/learning.

I understand that the NYTimes would like a slice of anything that comes out of the GPT but I'm talking about what seems reasonable. People who share their copyrighted material do not own all of the thinking that comes out of it; they own that expression of it, that is all.

Will AI destroy the economics of "writing" the way the web has killed newspapers? perhaps, perhaps we'll all benefit from and need a new model, but killing the new to keep the old on life support is not the way.

aurareturn
1 replies
9h27m

You're not replicating yourself millions of times and selling yourself for $20/month. If you are, then NYT might sue you too.

I'm not saying LLMs are by default, illegal. All I'm saying is that there is some merit to why NYT and content companies want a piece of the pie and think they deserve it.

fsckboy
0 replies
9h20m

The NY Times benefited in the past from technologies that led to widespread distribution of the Times, putting competitors out of business and concentrating talent at the Times. Nobody is stopping them from producing new editions of the newspaper, their core business. People now have technologies that help them "remember" what was salient in back issues of the Times. Such is progress.

ssijak
3 replies
8h20m

If I create a news website where I write articles in the following way:

- Read 20 different news websites and their story on the same event/topic

- Wait an hour, grab a cup of coffee

- Sit down to write my article; from this point on I never open any of the 20 news websites, and I write the story from my head

- I don't consult any other source, just write from memory, and my memory is, let's say, not the best, so I will never write more than 10 words exactly as they appear on any of the 20 websites.

- I will probably also write something that is not correct or add something new because, as I said, my memory is not the best.

Is that fair use? Am I infringing on copyright?

gumballindie
2 replies
8h19m

If you are a piece of software, then yes.

schleck8
1 replies
8h6m

Yes to what?

gumballindie
0 replies
7h54m

A human could tell.

frakrx
3 replies
8h54m

Under existing conditions an AI news site seems like a good investment idea. Its AI could read all relevant news sources, retell them, and republish them in its own articles. It could even have its own AI editors and contributors. I cannot see how human news companies could compete.

logicchains
2 replies
8h53m

Cannot see how human news companies could compete.

News ultimately comes from physical sources on the ground, which AI currently has no way of reaching.

kjkjadksj
0 replies
1h39m

That style of journalism is nearly dead. True on-the-ground investigative journalism is hardly done today; most of it is just reporting existing public information releases. You don't have to be at the presser when everything the police chief says will be put in an online transcript.

frakrx
0 replies
8h46m

I am sure it could easily rephrase the articles and retell them without quoting any real or verifiable sources. Many human news companies often do this too.

throwaway4good
2 replies
10h30m

The lawsuit itself (which arstechnica links to):

https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec20...

Page 30 onwards has some fairly clear examples of how ChatGPT has an (internal) copy of copyrighted material which it will recite verbatim.

Essentially, if you copy a lot of copyrighted material into a blob and then apply some sort of destructive compression to it, how destructive would that compression have to be for the copyright no longer to hold? My guess is it would have to be a lot.

As I see it, the closedness of OpenAI may be what saves it. OpenAI could filter and block copyrighted material from leaving the web interface using some straightforward matching mechanism against the copyrighted part of the data set ChatGPT has been trained on. Whereas open-source projects trained on the same data set would be left with the much harder task of removing the copyrighted material from the LLM itself.
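
To sketch what that filtering could look like (purely hypothetical on my part; nothing here is OpenAI's actual mechanism), a matcher only needs to check model output for long verbatim runs against the protected part of the training set:

    # Hypothetical output filter: block a response if it shares any
    # sufficiently long word run with a corpus of protected text.
    def ngrams(words, n):
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    def build_index(protected_texts, n=12):
        index = set()
        for text in protected_texts:
            index |= ngrams(text.lower().split(), n)
        return index

    def is_blocked(response, index, n=12):
        # True if any 12-word run of the response appears verbatim
        # in the protected corpus.
        return bool(ngrams(response.lower().split(), n) & index)

    index = build_index(["full text of each protected article goes here"])
    print(is_blocked("some candidate model output", index))

Scaling that to millions of articles is a hashing/Bloom-filter exercise, and only a hosted, closed model can bolt such a filter on after the fact, which is exactly the asymmetry with open-source models I'm pointing at.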

vanchor3
0 replies
5h13m

Essentially if you copy a lot of copyrighted material into a blob and then apply some sort of destructive compression to it. How destructive would that compression have to be for the copyright no longer to hold? My guess it would have to be a lot.

I imagine the goal is closer to "enough that no one notices we stole it": either so that it's not easily discoverable, or so that even when directly analyzed there's enough plausible deniability to scrape by.

jprete
0 replies
5h42m

The answer to the "closedness" is externally controlled audits.

strangus
2 replies
12h39m

Next up, Microsoft acquires the New York Times forming MSNYT

unsupp0rted
0 replies
10h54m

This is not impossible, and perhaps not even unlikely

playingalong
0 replies
12h19m

... New Roman

ctoth
2 replies
1h43m

Isn't the fundamental issue here that the NYT was available in Common Crawl?

If they didn't want to share their content, why did they allow it to be scraped?

If they did want to share their content, why do they care (hint: $88 billion)?

Or is it that they wanted to share their content with Google and other search engines in order to bring in readers but now that an AI was trained on it they are angry?

What wrong thing did OpenAI do specific to using Common Crawl?

Didn't most companies use Common Crawl? Excepting Google, who had already scraped the whole damn Internet anyway and just used their search index?

Is it legal or not to scrape the web?

If I scrape the web, is it legal to train a transformer on it? Why or why not?

To me, this is an incredibly open-and-shut case. You put something on the web, people will read that something. If that is illegal, Google is illegal.

Oh, and do you see the part in the article where they are butthurt that it can reproduce the NYT style?

"Defendants’ GenAI tools can generate output that recites Times content verbatim, closely summarizes it, and mimics its expressive style, as demonstrated by scores of examples," the suit alleges.

Mimics its expressive style. Oh golly the robots can write like they're smug NYT reporters now--better sue!

It appears that the NYT changed their terms of service in August to disallow their content in Common Crawl [0]. Wasn't GPT-4 trained long before August?

[0]: https://www.adweek.com/media/the-new-york-times-updates-term...

rfw300
1 replies
1h25m

If you read the complaint, it explains this pretty well. The use of copyrighted content by search engines is fundamentally different from the way LLMs use that same content. The former directs traffic (and therefore $$) to the publisher, the latter keeps the traffic for itself.

The legal misconception I want to flag in your logic is the notion that all uses of the Common Crawl are equally infringing/non-infringing. If you use the Common Crawl to create a list of how often every word in English appears on the internet, that’s unquestionably transformative use. But if you use it to host a mirror of the NYT website with free articles, that’s definitely infringement. The legality of scraping is one matter, and the legality of what you do with the scraped content is quite another.
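
For concreteness, the word-frequency end of that spectrum is something like this toy sketch (real Common Crawl data arrives as WARC archives, which I'm glossing over here):

    # Clearly transformative use: reduce crawled pages to aggregate
    # word counts, discarding every trace of the original expression.
    from collections import Counter
    import re

    def word_frequencies(pages):
        counts = Counter()
        for page in pages:
            counts.update(re.findall(r"[a-z']+", page.lower()))
        return counts

    pages = ["Text of one crawled page.", "Text of another crawled page."]
    print(word_frequencies(pages).most_common(3))

No sentence from any source survives that reduction. A mirror of NYTimes.com sits at the opposite end, and training an LLM lands somewhere in between, which is precisely the line the court will have to locate.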

ctoth
0 replies
1h2m

From my original comment:

Is it legal or not to scrape the web?

If I scrape the web, is it legal to train a transformer on it? Why or why not?

At no point did I say anything about hosting a mirror of the NYT website, with free articles. Obviously. Because OpenAI didn't do that. Some NYT lawyer tried to get ChatGPT to write a NYT article. Maybe first they should have actually done a Google search and shut down some of the actual content farms which simply copy NYT content such as [0]. But instead, we get this.

[0]: https://salaminv.com/news_file/

bigmattystyles
2 replies
2h12m

Not that it would solve this, but how hard would it be for ChatGPT or other models to cite the sources used in a response? Is that difficult to capture and tag to 'knowledge' within an LLM? It could be a best-of-both-worlds situation if LLMs cited sources and linked to the source itself. Isn't that what happened with Google News's home page? I seem to recall that when Google took it away in some markets, at the behest of the news orgs, they quickly reversed course as their traffic plummeted.

qznc
0 replies
2h2m

This is not possible. There is no database of sources inside an LLM. Just like the knowledge in your brain does not have sources attached.

For example, you referenced "what happened with Google News's home page". Could you give me your source? You could search for a suitable article to cite, but you can't produce a source from memory alone.

jejeyyy77
0 replies
2h6m

Not likely, given the way these models have been trained - the text is basically broken down into sub-words that are all mashed together into probabilities.
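
For illustration, here is roughly what that breakdown looks like using OpenAI's open-source tiktoken tokenizer (the encoding name is the one published for the GPT-4 family; the example sentence is mine):

    # Text is reduced to sub-word token ids; nothing records which
    # training documents those ids were ever seen in.
    import tiktoken  # pip install tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # GPT-4-family encoding
    tokens = enc.encode("The Times sued OpenAI.")
    print(tokens)                             # a list of integer ids
    print([enc.decode([t]) for t in tokens])  # the sub-word pieces

The weights end up holding statistics over those ids, not documents, so there is no per-answer citation to recover.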

bdd8f1df777b
2 replies
9h12m

I see few people here bring this up, so let me:

The US constitution says, The Congress shall have Power

To promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries;

So Congress's power to make copyright and patent laws is predicated on the promotion of science and useful arts (I believe this actually means technology). In a sense, OpenAI being at the forefront of our AI technology advancement is crucial to the equation. To hinder that progress with copyright is, in my mind, unconstitutional.

madeofpalk
0 replies
9h1m

Wishful thinking. Just as equally, NYT's right to copyright over its material, in order to have a functional press, is enshrined in the constitution. Any threat to that copyright could be unconstitutional.

I think we all agree that no one is entitled to “progress of science” at any cost - as a straw man, killing hundreds of newborn babies for scientific research is not great - so we use ethics and the legal system to find the line of what’s acceptable.

I don’t know exactly what NYT is asking for here, but the two options aren’t unconsented training vs nothing at all. NYT could license, for a fee, its content to OpenAI. It’s pretty common for scientists to have to pay for materials!

globular-toast
0 replies
9h7m

Current AI is useless without people writing the articles in the first place.

JackFr
2 replies
7h50m

I think LLMs may really change the IP landscape.

Culturally we’re taught that there is a moral component to copyright and patent law - that stealing is stealing. But the idea that words or thoughts or images can be owned (and that the might if the state can be brought to bear to enforce it) would seem utterly ludicrous to someone from an earlier era. Copyright and patent laws exist for practical, pragmatic reasons - and seemingly they have served us well, but it’s not unreasonable to re-examine them from first principals.

narenkeshav
0 replies
7h48m

I remember a case where the court did not allow id Software to patent "first-person shooters".

This rings similar.

flanked-evergl
0 replies
7h43m

But the idea that words or thoughts or images can be owned (and that the might of the state can be brought to bear to enforce it) would seem utterly ludicrous to someone from an earlier era.

Is there any research into how people from earlier eras thought about it? And should all laws that seemed ludicrous to someone from an earlier era be discarded? If not, how exactly do we determine the relevance of what someone from an earlier era would think about our laws?

4death4
2 replies
5h57m

I think there is a national security aspect to ML models trained on copyrighted data. Countries that allow it will gain a superior technological advantage and outcompete those who disallow training on copyrighted material. I personally believe training LLMs on copyrighted data is copyright infringement if the models are deployed in a way that competes with the copyright holder. But that doesn’t necessarily mean it’s something we should disallow.

kjkjadksj
1 replies
1h44m

You can say the same for any legal enforcement, like respecting patent or copyright law, or making Champagne outside France. Yet the sky isn't falling, given this reality, with so many legally protected industries. Maybe the markets such an industry might offshore to are too small and insular to be very significant, and they are probably language-bound, making English models less relevant compared to native-language models.

4death4
0 replies
11m

Champagne isn’t a transformative technology, and least not anymore.

starchild3001
1 replies
36m

I asked an LLM to summarize the 69-page lawsuit. It does a decent job. Didn't infringe on any copyrights in the process :)

Here is a summary of the key points from the legal complaint filed by The New York Times against Microsoft and OpenAI:

The New York Times filed a copyright infringement lawsuit against Microsoft and OpenAI alleging that their generative AI tools like ChatGPT and Bing Chat infringe on The Times's intellectual property rights by copying and reproducing Times content without permission to train their AI models.

The Times invests enormous resources into producing high-quality, original journalism and has over 3 million registered copyrighted works. Its business models rely on subscriptions, advertising, licensing fees, and affiliate referrals, all of which require direct traffic to NYTimes.com.

The complaint alleges Microsoft and OpenAI copied millions of Times articles, investigations, reviews, and other content on a massive scale without permission to train their AI models. The models encode and "memorize" copies of Times works which can be retrieved verbatim. Defendants' tools like ChatGPT and Bing then display this protected content publicly.

OpenAI promised to freely share its AI research when founded in 2015 but pivoted to a for-profit model in 2019. Microsoft invested billions into OpenAI and provides all its cloud computing. Their partnership built special systems to scrape and store training data sets with Times content emphasized.

The complaint includes many examples of the AI models reciting verbatim excerpts of Times articles, showing they were trained on this data. It also shows the models fabricating quotes and attributing them to the Times.

Microsoft's integration of the OpenAI models into Bing Chat and other products boosted its revenues and market value tremendously. OpenAI's release of ChatGPT also made it hugely valuable. But their commercial success relies significantly on unlicensed use of Times works.

The Times attempted to negotiate a deal with Microsoft and OpenAI but failed, hence this lawsuit. Generating substitute products that compete with inputs used to train models does not qualify as "fair use" exemptions to copyright. The Times seeks damages and injunctive relief.

In summary, The New York Times alleges Microsoft and OpenAI's AI products infringe Times copyrights on a massive scale to unfairly benefit at The Times's expense. The Times invested heavily in content creation and controls how its work is used commercially. Using Times content without payment or permission to build competitive tools violates its rights under copyright law.

starchild3001
0 replies
29m

A second LLM's take on this lawsuit can be found below. I'd love to see OpenAI address these complaints publicly and without incurring any additional damages to NYT.

The document is a legal complaint filed by The New York Times Company against Microsoft Corporation and various OpenAI entities, alleging copyright infringement and other related claims. The New York Times Company (The Times) accuses the defendants of unlawfully using its copyrighted works to create artificial intelligence (AI) products that compete with The Times, particularly generative artificial intelligence (GenAI) tools and large language models (LLMs). These tools, such as Microsoft's Bing Chat and OpenAI's ChatGPT, allegedly copy, use, and rely heavily on The Times’s content without permission or compensation.

Nature of the Action: The Times emphasizes the importance of independent journalism to democracy and claims its ability to continue providing this service is threatened by the defendants' actions. The complaint argues that the GenAI tools are built upon unlawfully copied New York Times content, which undermines The Times's investments in journalism.

Defendants: The defendants include Microsoft Corporation and various OpenAI entities, such as OpenAI Inc., OpenAI LP, and several other related companies. The Times alleges these entities have worked together to create and profit from the GenAI tools in question.

Allegations: 1. Copyright Infringement: The Times claims the defendants copied millions of its copyrighted articles and other content to train their GenAI models. This training allegedly involves large-scale copying and use of The Times’s content, emphasizing its quality and value in building effective AI models.

2. Unlawful Competition: The Times argues that the defendants' GenAI tools compete with it by providing access to its content for free, which could potentially divert readers and revenue away from The Times.

3. Misattribution and Hallucinations: The Times asserts that the defendants' tools not only unlawfully distribute its content but also generate and attribute false information to The Times, damaging its credibility and trust with readers.

4. Trademark Dilution: The complaint includes claims that the defendants' use of The Times’s trademarks in connection with lower-quality or inaccurate AI-generated content dilutes and tarnishes its brand.

5. Digital Millennium Copyright Act Violations: The Times alleges that the defendants removed or altered copyright management information from its works, which is prohibited under the law.

Harm to The Times: The Times claims it has suffered significant harm from these actions, including loss of control over its content, damage to its reputation for accuracy and quality, and financial losses due to diminished traffic and revenue.

Demands: The Times seeks various forms of relief, including statutory damages, injunctive relief to prevent further infringement, destruction of the infringing AI models, and compensation for losses and legal fees.

Overall Summary: This legal complaint represents a significant clash between traditional media and emerging AI technology companies. It underscores the complex legal, ethical, and economic issues arising from the use of copyrighted content to train AI systems. The outcome of this case could have far-reaching implications for the AI industry, content creators, and the broader digital ecosystem.

sensanaty
1 replies
5h30m

I love seeing all the AI sycophants squirm at this news.

Here's to hoping NYT wins this one and gets everything they ask for, and more!

ugjka
0 replies
3h2m

I don't know if winning this will improve their business model.

I don't use ChatGPT to get the news, but I also don't pay for paywalled articles.

mark_l_watson
1 replies
2h43m

I think Apple has really got ahead of this game: early deals to pay for AI training data/content. I need to do some research but I think Anthropic also does this.

After a year of largely using OpenAI APIs, I am now much more into smaller "open" models, and I hope the major contributors like Meta/Facebook are following Apple's lead. Off topic, but: even though I find the smaller "open" models much less capable, they capture my imagination and my personal research time.

efields
0 replies
1h58m

Not sure if they’re ahead but I think it was smart to not ship anything LLMlike until the regulations get made first movers test the waters.

Casey Newton has been saying all year that these things will be awesome once we can unleash them on our own corpus of data safely. “Siri” already does a great job digging through my photos and picking the good memories. I can let my camera roll become a visual junk drawer now.

Do the same for my email. Make "Find" the tool we always wanted it to be. I don't care if I'm conflating LLMs/AI with other smart tech.

kazinator
1 replies
9h53m

Should be: "NY Times wants OpenÄI to delete all GPT instances". You wouldn't want the hapless rabble misreading it as an "aiii" diphthong.

hoppyhoppy2
0 replies
8h39m

Are you confusing the New York Times with the New Yorker?

globular-toast
1 replies
9h9m

To me, reading a book, putting it in some storage system, and then recalling it to form future thoughts is fair use. It's what we all do all the time, and I think that's exactly what training is.

If the AI can recall the text verbatim then it's not at all the same. When we read, we are not able to reproduce the book from our memory. Even if a human could memorise an entire book, it's not at all practical to reproduce the book from that. The current AIs are not learning "ideas"; they are learning sequences of words.
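
That difference is measurable, too: prompt the model with an article's opening and compare its continuation to the real text. A rough probe (difflib is in the standard library; the two strings here are placeholders):

    # Rough memorization probe: length, in words, of the longest
    # verbatim run shared by the original and the model's continuation.
    from difflib import SequenceMatcher

    def longest_shared_run(original, generated):
        m = SequenceMatcher(None, original.split(), generated.split())
        match = m.find_longest_match(0, len(m.a), 0, len(m.b))
        return match.size

    original = "the real article text, obtained separately ..."
    generated = "the model's continuation of the article's opening ..."
    print(longest_shared_run(original, generated))

A human reader scores near zero on a probe like this; the complaint's exhibits show the models sometimes reproducing whole passages.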

contravariant
0 replies
8h20m

Yeah the comparison to humans is silly anthropomorphising at this point.

However I am inclined to agree with them for the simple fact that putting a file into a device and letting that device reproduce parts of the file should be allowed. I mean we're already at the point where this simple right is under pressure from DRM, but people should be allowed to do whatever they want with the files they own.

Whether you can publish this output and share it with the world is a whole different issue.

elif
1 replies
4h56m

At some point, carrying 100-year-old copyright/patent law will become so onerous a burden on the pace of progress that its enforcement will be antihuman.

ryukoposting
0 replies
4h35m

It already is, but I don't think this is a good example. NYT has a legitimate case here. They own the material they publish, and GPT-4 is shown to be able to recall entire articles verbatim. That's a violation, clear as day.

The thing about lawsuits is that you make dozens of claims, and the court can rule in favor of some of them, and against others. The question of "is LLM training fair use?" hasn't made it to a high court yet. The court could very easily rule against everything else in the suit.

chris_wot
1 replies
8h28m

Fair use is something Wikipedians dance around a fair amount. It also meant I did a lot of reading about it.

It’s a four part test. Let’s examine it thusly:

1. Transformative. Is it? It spits out informative text and opinion. The only "transformation" is that it's generative text. IMO that's a fail.

2. Nature of the work - it’s being used commercially. Given it’s being trained partially on editorial, that’s creative enough that I think any judge would find it problematic. Fail on this criteria.

3. Amount. It looks like they trained the model on all of the NYT articles. Oops, definite fail.

4. Effect on the market. Almost certainly negative for the NYT.

IMO, OpenAI cannot successfully claim fair use.

EMIRELADERO
0 replies
8h22m

You're getting mixed up. When applying the four factors, you need to individually separate all the uses, repeating the fair use test for every alleged type of infringement. The scraping from the public internet into OpenAI's dataset storage cluster is one instance where the full four-factor analysis must take place; the training itself is another; the distribution of model outputs another; and so on.

chmod600
1 replies
9h53m

Isn't copyright tethered somehow to a notion of "expression"? That is, the same ideas and facts expressed differently are a different work?

Sure, when something is clearly derived, or just expressed in a new medium, then I'm sure it's still covered. But if it goes through an LLM and the result bears little resemblance, how can that still fall under copyright?

visarga
0 replies
9h42m

As you said, AI can rewrite articles, obtaining a clean-cut separation between ideas and expression. Keep the ideas, write a new text. And the more sources you use, the better: the output becomes even more different from any single one. This approach could also check consistency and bias between sources.
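
A sketch of that pipeline, assuming nothing beyond a chat-completion API (the model name, the prompts, and the function names are my placeholders, not a claim about how anyone actually does this):

    # Hypothetical idea/expression separation: extract the factual claims
    # from several sources, then write fresh text from the facts alone.
    from openai import OpenAI  # pip install openai

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def ask(prompt, model="gpt-3.5-turbo"):
        r = client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": prompt}])
        return r.choices[0].message.content

    def rewrite_from_sources(sources):
        facts = [ask("List only the factual claims in:\n" + s) for s in sources]
        return ask("Write a short news item from these facts, "
                   "checking them against each other for consistency:\n"
                   + "\n".join(facts))

Even this isn't automatically in the clear, though: facts themselves aren't copyrightable, but the selection and arrangement of them can be.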

ChrisArchitect
1 replies
11h11m
kristianp
0 replies
10h17m

True, the Verge article was posted here earlier.

weikju
0 replies
12h59m

Probably has something to do with impending deals between NYT and major companies, e.g.

[0] https://www.nytimes.com/2023/12/22/technology/apple-ai-news-...

[1] https://www.theverge.com/2023/12/22/24012730/apple-ai-models...

visarga
0 replies
10h9m

Wondering who tf reads old NYT articles? News becomes old really fast. ChatGPT is months or years behind.

unstatusthequo
0 replies
8h31m

I’d be happy if the NYT was deleted. I find it has very little use as a source of anything, much like most mainstream media.

throwuwu
0 replies
6h6m

If they lose they should delete the NY Times

sylware
0 replies
1h27m

If they don't let AIs be trained on as much data as possible, those AIs will be less "good" than the ones trained without constraints, like you will have in China or elsewhere, and people will mechanically start using the latter.

Unless they engage in massive, geolocation-based IP and DNS banning, forced upon all internet users and "external" users.

sunpazed
0 replies
8h38m

“The tragedy of the Luddites is not the fact that they failed to stop industrialization so much as the way in which they failed. Human rebellion proved inadequate against the pull of technological advancement.”

https://www.newyorker.com/books/page-turner/rethinking-the-l...

skc
0 replies
9h11m

Kind of ironic that the NYT will still have to host articles extolling the virtues of OpenAI as it continues to expand and upend industries

shp0ngle
0 replies
9h15m

Microsoft is one of the companies that love to use copyright to get their way, and the BSA is a known software mafia, so I'm not at all sympathetic to them.

sackfield
0 replies
12h35m

Something I have wondered about LLMs and training data is that the biggest content producers on the internet now have their world view and tone echoed disproportionately as part of the next big wave of technology. This is incredibly impactful (although admittedly I don't know how to turn that into a profit). Is there some unforeseen long-term impact of removing the New York Times from training data, such that it won't be part of LLM corpora going forward?

ryukoposting
0 replies
4h48m

All of that costs money, and The Times earns that by limiting access to its reporting through a robust paywall.

Not to be pedantic, but NYT has the least robust paywall I've ever seen. Just turn on reader mode in your browser. Simple. I get that it's still trespassing if I walk into an unlocked house, but NYT could try installing a lock that isn't made of confetti and uncooked pasta.

ranting-moth
0 replies
6h13m

Let's try the "reverse the gender" card.

Let's say OpenAI was trained on all the Windows source code (without approval from MS).

GPT could pretty much replicate the windows code with even not that clever prompt by any user. "Write an OS CreateProcess function like Windows 10 source code would have."

It would infuriate MS to put it mildly, enough to start a lawsuit.

I know the licenses on the MS source code and NYT articles aren't the same.

poorman
0 replies
1h25m

Sad to say but I would believe a hallucination from OpenAI before I would believe anything that comes out of the NY Times. I mean the confidence interval for the NY Times is what again?

nektro
0 replies
6h39m

oh how joyous that would be. I so hope they win

lwhi
0 replies
3h9m

Surely there's no chance OpenAI would agree to this?

Isn't it more likely that the company buys the NYT?

kweingar
0 replies
4h37m

The thing that bothers me about the whole situation is that OpenAI prohibits using its model output as training data for your own models.

It seems more than a bit hypocritical, no? When it comes to their own training data, they claim to have the right to use any/all of humanity’s intellectual output. But for your own training data, you can use everything except for their product, conveniently for them.

kragen
0 replies
11h45m

this was predicted in the very influential epic 2014 video in 02004

https://www.youtube.com/watch?v=eUHBPuHS-7s (the original is flash and has thus been consigned to the memory hole, so we are left with this poor-quality conversion)

36": 'however, the press as you know it has ceased to exist'

40": '20th-century news organizations are an afterthought; a lonely remnant of a not-too-distant past'

2'11": 'also in 2002, google launches google news, a news portal. news organizations cry foul. google news is edited entirely by computers'

5'13": 'the news wars of 2010 are notable for the fact that no actual news organizations take part. googlezon finally checkmates microsoft with a feature the software giant cannot match: using a new algorithm, googlezon's computers construct new stories, dynamically stripping sentences and facts from all content sources, and recombining them. the computer writes a new story for every user'

5'55": 'in 2011 the slumbering fourth estate awakes to make its first and final stand. the new york times company sues googlezon, claiming that the company's fact-stripping robots are a violation of copyright law. the case goes all the way to the supreme court'

they didn't get the details exactly right, but overall the accuracy is astounding

however, that may be a hyperstition artifact in this timeline

https://en.wikipedia.org/wiki/EPIC_2014 (i thought epic 2014 might be the only flash video to have a wikipedia article about it, but then i looked and found five others)

kolinko
0 replies
5h41m

Worth noting that - at least in the screenshot - this shows an example of the browsing functionality being used to get around paywalls, not that the model itself was trained on, or can really reproduce, the articles.

IIRC this was the reason why the browsing plugin was disabled for some time after its introduction - they were patching up this hole.

joshxyz
0 replies
8h48m

The only winners here are the lawyers on both sides, laughing their way to the bank.

God, I love this era: so much grey area in these edge technologies.

j0hnyl
0 replies
1h40m

I hope the world can rally and move past these anachronistic ideas of intellectual property.

hazmazlaz
0 replies
3h6m

I'd rather have GPT than the NY Times, if I had to choose between one or the other.

fuzzfactor
0 replies
3m

What if you were one of the people who read the Times from cover-to-cover every day and seriously tries to remember as much as possible because you consider it a trustworthy reference source?

And if you were called upon to solve a problem based on knowledge you consider trustworthy, what would you come up with?

What if you were even specifically directed to utilize only findings gleaned from the Times exclusively?

And what if that was your only lifetime source of information whatsoever for some reason?

fbhabbed
0 replies
7h17m

This is getting a bit out of hand isn't it.

exabrial
0 replies
3h10m

I'm actually fine with this. Copyright holders never consented to having their work used in this manner.

ehwhwhwhahhwh
0 replies
8h17m

NYT could also fix the issue by deleting NYT itself. Could be a better result for humanity as well. Thanks.

djhope99
0 replies
3h12m

This argument that the LLM is "learning" seems slightly flawed when you consider that other experts in the field consider it more like lossy compression. If it's lossy compression that's really happening here, then you can understand the copyright argument. It'll be interesting to see how this plays out; lots of new ground being broken.

dewbrite
0 replies
2h17m

Summarizing the article: The most damning thing here is the "ChatGPT as a search engine" feature, which appears to run an agent that performs a search, visits pages, and returns the best results.

In doing this, it is bypassing the NY Times paywall, and you can read full articles from today by repeatedly asking for the next paragraph.

dash2
0 replies
8h34m

There’s an awful lot of confident statements be made about the law here. I wonder if anyone who is actually a lawyer would like to chime in.

cynicalsecurity
0 replies
9h39m

Nothing will come out of it. NY times will lose.

cpt100
0 replies
4h29m

Given that Harvard's president plagiarized her way into becoming president, how can we be sure that NYT doesn't plagiarize and take content from X and other places to quickly churn out daily news?

andrewstuart
0 replies
9h24m

Means nothing.

An ambit claim that Rupert is throwing out there to see what he can get.

amadeuspagel
0 replies
46m

Two not-so-subtle paragraphs about the "partnership" between Microsoft and OpenAI:

15. Microsoft Corporation is a Washington corporation with a principal place of business and headquarters in Redmond, Washington. Microsoft has invested at least $13 billion in OpenAI Global LLC in exchange for which Microsoft will receive 75% of that company’s profits until its investment is repaid, after which Microsoft will own a 49% stake in that company.

16. Microsoft has described its relationship with the OpenAI Defendants as a “partnership.” This partnership has included contributing and operating the cloud computing services used to copy Times Works and train the OpenAI Defendants’ GenAI models. It has also included, upon information and belief, substantial technical collaboration on the creation of those models. Microsoft possesses copies of, or obtains preferential access to, the OpenAI Defendants’ latest GenAI models that have been trained on and embody unauthorized copies of the Times Works. Microsoft uses these models to provide infringing content and, at times, misinformation to users of its products and online services. During a quarterly earnings call in October 2023, Microsoft noted that “more than 18,000 organizations now use Azure OpenAI Service, including new-to-Azure customers.”

Mountain_Skies
0 replies
3h26m

Looks like this is a case of Media vs Tech which might be solved by the courts using past paradigms but should really be addressed by legislation specific to this situation. The difficulty for the media companies, at least in the US, is that both major political parties see the media as the enemy. The left might be a bit more positive about the media, but overall they still see the media as something owned by wealthy elites, suppressing knowledge of the harm the powerful inflict on the weak and powerless. Over on the Tech side of things, one party sees Tech as wholly owned by the other side of the political divide. Over on that side, things are relatively (but not completely) friendly, so my guess is Tech will end up winning simply because it has more friends in the political realm than the Media does.

1f60c
0 replies
8h24m

I believe that ChatGPT is fair use, just on a much larger scale than we're used to.