
Judge dismisses DMCA copyright claim in GitHub Copilot suit

munificent
141 replies
22h4m

> Indeed, last year GitHub was said to have tuned its programming assistant to generate slight variations of ingested training code to prevent its output from being accused of being an exact copy of licensed software.

If I, a human, were to:

1. Carefully read and memorize some copyrighted code.

2. Produce new code that is textually identical to it, except that in the process of typing it up I randomly, mechanically tweak a few identifiers or something, producing code that has the exact same semantics but isn't character-wise identical (see the sketch below).

3. Claim that as new original code without the original copyright.

I assume that I would get my ass kicked legally speaking. That reads to me exactly like deliberate copyright infringement with willful obfuscation of my infringement.

How is it any different when a machine does the same thing?
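To make the hypothetical in step 2 concrete: that kind of mechanical identifier tweak is trivially automatable. A minimal sketch in Python using the standard ast module (the input snippet and the v0/v1 naming scheme are invented for illustration; requires Python 3.9+ for ast.unparse):

```python
import ast, builtins

BUILTINS = set(dir(builtins))

class RenameIdentifiers(ast.NodeTransformer):
    """Mechanically rename every user-defined name to a fresh, meaningless one.

    Semantics are preserved exactly; only the characters change.
    """

    def __init__(self):
        self.mapping = {}

    def visit_Name(self, node):
        if node.id in BUILTINS:  # leave print, len, etc. untouched
            return node
        new_id = self.mapping.setdefault(node.id, f"v{len(self.mapping)}")
        return ast.copy_location(ast.Name(id=new_id, ctx=node.ctx), node)

original = "total = price * quantity\nprint(total)"
tree = RenameIdentifiers().visit(ast.parse(original))
print(ast.unparse(tree))
# Output:
#   v0 = v1 * v2
#   print(v0)
```

The output behaves identically to the input yet is no longer character-for-character identical, which is exactly the laundering move being described.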

JoshTriplett
50 replies
20h49m

You have a much smaller lobbying budget than the AI industry, and you didn't flagrantly rush to copy billions of copyrighted works as quickly as possible and then push a narrative acting like that's the immutable status quo that must continue to be permitted lest the now-massive industry built atop copyright violation be destroyed.

Violate one or two copyrights, get sued or DMCAed out of existence. Violate billions, on the other hand, and you magically become immune to the rules everyone else has to follow.

nadermx
43 replies
20h39m

What about copyright's purpose of furthering the arts and sciences?

JoshTriplett
37 replies
20h9m

Copyright has utterly failed to serve that purpose for a long time, and has been actively counterproductive.

But if you want to argue that copyright is counterproductive, I completely agree. That's an argument for reducing or eliminating it across the board, fairly, for everyone; it's not an argument for giving a free pass to AI training while still enforcing it on everyone else.

adra
22 replies
19h12m

Without copyright, entire industries would've been dead a long time ago, including many movies, games, books, TV, music, etc.

Just because their lobbies tend to push the boundary of copyright into the absurd doesn't mean these industries aren't worth saving. There should be actually respectful lawmakers who seek a balance of public and commercial interests.

pdonis
12 replies
18h13m

> Without copyright, entire industries would've been dead a long time ago, including many movies, games, books, TV, music, etc.

Citation needed. There are many ways to make money from producing content other than restricting how copies of it can be distributed. The owner should be able to choose copyright as a means of control, but that doesn't mean nobody would create any content at all without copyright as a means of control.

boplicity
11 replies
15h20m

There's nothing preventing people from producing works and releasing them without copyright restriction. If that were a more sustainable model, it would be happening far more often.

As it is now, especially in the creative fields (which I am most knowledgeable about), the current system has allowed for an incredible flourishing of creation, which you'd have to be pretty daft to deny.

chii
4 replies
13h13m

> If that were a more sustainable model, it would be happening far more often.

That's not the argument. The fact that there currently are restrictions on producing derivative works is the problem. You cannot produce a Star Wars story without getting consent from Disney. You cannot write a Harry Potter story without consent from Rowling.

boplicity
3 replies
3h0m

That's not actually true. There's nothing stopping you from producing derivative works. Publishing and/or profiting from other people's work does have some restrictions though.

There's actually a huge and thriving community of people publishing derivative works, on a not-for-profit basis, on Archive of Our Own. (Among other places.)

pdonis
2 replies
2h40m

> There's actually a huge and thriving community of people publishing derivative works, on a not-for-profit basis, on Archive of Our Own. (Among other places.)

Yes, and none of those people are making a living at creating things. That's why they are allowed by the copyright owners to do what they're doing--because it's not commercial. Try to actually sell a derivative work of something you don't own the copyright for and see how fast the big media companies come after you. You acknowledge that when you say there are "restrictions" (an understatement if I ever saw one) on profiting from other people's work (where "other people" here means the media companies, not the people who actually created the work).

It is true that without our current copyright regime, the "industries" that produce Star Wars, Disney, etc. products would not exist in their current form. But does that mean works like those would not have been created? Does it mean we would have less of them? I strongly doubt it. What it would mean is that more of the profits from those works would go to the actual creative people instead of middlemen.

boplicity
1 replies
2h26m

> Yes, and none of those people are making a living at creating things.

Again, not true. One of the most famous examples is likely Naomi Novik, who is a bestselling author in addition to being a prolific producer of derivative works published on AO3. Many other commercially successful authors publish derivative works on this platform as well.

> It is true that without our current copyright regime, the "industries" that produce Star Wars, Disney, etc. products would not exist in their current form. But does that mean works like those would not have been created? Does it mean we would have less of them? I strongly doubt it. What it would mean is that more of the profits from those works would go to the actual creative people instead of middlemen.

Speculate all you want about an alternative system, but you really don't know what would have happened, or what would happen moving forward.

pdonis
0 replies
20m

> not true

Sorry, I meant they're not making a living at creating derivative works of copyrighted content. They can't, for the reasons you give. Nor can other people make a living creating derivative works of their commercially published work. That is an obvious barrier to creation.

TylerE
3 replies
14h27m

Given that copyright is automatic at the instant of creation, that is, um, debatable.

Slapping 3 lines in LICENSE.TXT doesn’t override the Berne convention.

trogdor
2 replies
3h27m

Are you claiming that an author cannot place their work in the public domain?

tpush
0 replies
2h39m

In most of the world, no, they can't.

pdonis
0 replies
2h44m

Yes, they can't, because there is no legally reliable way to do it. (Briefly: the law really doesn't like the idea of property that doesn't have an owner, so if you try to place a work of yours in the public domain, what you're actually doing is making it abandoned property, and anyone who wants to can claim they own it and restrict everyone else, including you, from using it.) The best an author can do is grant a license that basically lets anyone do what they want with the work. Creative Commons has licenses that do that.

pdonis
0 replies
2h43m

> the current system has allowed for an incredible flourishing of creation

No, the current system has allowed for an incredible flourishing of middlemen who don't create anything themselves but coerce creative people into agreements that give the middlemen virtually all the profits.

copywrong2
0 replies
6h46m

People don't just put out their stuff. They get lured into contracts selling their IP to a shitty company that then publishes it, of course WITH copyright, so the company can make money while the artist doesn't.

Zambyte
2 replies
7h22m

Books, music, and games are a lot older than copyright.

adra
1 replies
3h4m

Have you looked at who created these things, by and large? For the most part, you have:

- aristocrats who were wealthy and didn't need to "work" to survive and put food on the table

- craftspeople supported through the patronage of a rich person (or religious order) who deigned to support their art

- (in the kinda modern world) national governments who want to support their national art, often out of fear that larger nations' cultural influences will dwarf their own

Are you implying that these three pillars will be able to produce anywhere near the current amount of content we produce?

How, in a world where digital copies are effectively free to copy ad infinitum, would a creator reap any benefits from that network effect?

A modern equivalent would be famous YouTubers who do nothing all day but "watch" other people's hard-earned videos. The super lazy ones don't direct people to the original, don't provide meaningful commentary, and just consume the video as 'content' to feed their own audience, providing no value to the original creator. Killing copyright entirely would amplify this "just bypass the original source" dynamic, lowering the value to the original creator to zero.

Zambyte
0 replies
51m

> Are you implying that these three pillars will be able to produce anywhere near the current amount of content we produce?

Do you think the vast "amount of content we produce" is actually propped up by copyright? Have you ever heard of someone who started their career on YouTube due to copyright? On the contrary, how often have you heard of people stopping their YouTube career due to copyright, or explicitly limiting the content they create? I have only heard of cases of the latter. In fact, the latter partially happened to me.

> How, in a world where digital copies are effectively free to copy ad infinitum, would a creator reap any benefits from that network effect?

You are making an assumption that people should reap (monetary) benefits for creating things. What you are ignoring is that the world where digital copies are effectively free is also the world where original works are insanely cheap as well. In this world, people create regardless of monetary gain.

To make this point: how much money did you make from this comment that you posted? It's covered by copyright, so surely you would not have created it if not for your own benefit.

TylerE
2 replies
14h29m

For that matter, if you think China ripping everyone else off is bad now… well, just wait until every company can do that.

Qwertious
0 replies
11h47m

If everyone could do it, it wouldn't be as big a deal - small western businesses would be on a more level playing field, since they would be almost as immune from being sued by big businesses as Chinese businesses are. As it is, small businesses aren't protected by patents (because a patent is a $10k+ ticket to a $100k+ lawsuit against a competitor with a $1M+ budget for lawyers) while still being bound by the restrictions of big business's patents. It's lose/lose.

DoItToMe81
0 replies
10h1m

Trademark isn't copyright, so no.

matheusmoreira
0 replies
12h21m

Nobody cares anymore. We're sick of their rent seeking, of their perpetual monopolies on culture. Balance? Compromise? We don't want to hear it.

Nearly two hundred years ago one man warned everyone this would happen. Nobody listened. These are the consequences.

"At present the holder of copyright has the public feeling on his side. Those who invade copyright are regarded as knaves who take the bread out of the mouths of deserving men. Everybody is well pleased to see them restrained by the law, and compelled to refund their ill-gotten gains. No tradesman of good repute will have anything to do with such disgraceful transactions. Pass this law: and that feeling is at an end. Men very different from the present race of piratical booksellers will soon infringe this intolerable monopoly. Great masses of capital will be constantly employed in the violation of the law. Every art will be employed to evade legal pursuit; and the whole nation will be in the plot. On which side indeed should the public sympathy be when the question is whether some book as popular as “Robinson Crusoe” or the “Pilgrim’s Progress” shall be in every cottage, or whether it shall be confined to the libraries of the rich for the advantage of the great-grandson of a bookseller who, a hundred years before, drove a hard bargain for the copyright with the author when in great distress? Remember too that, when once it ceases to be considered as wrong and discreditable to invade literary property, no person can say where the invasion will stop. The public seldom makes nice distinctions. The wholesome copyright which now exists will share in the disgrace and danger of the new copyright which you are about to create. And you will find that, in attempting to impose unreasonable restraints on the reprinting of the works of the dead, you have, to a great extent, annulled those restraints which now prevent men from pillaging and defrauding the living."

https://www.thepublicdomain.org/2014/07/24/macaulay-on-copyr...

cvwright
0 replies
18h1m

So true! Copyrights that last 20 years would be completely reasonable. Maybe with exponentially increasing fees for successive renewals, for super valuable properties like Disney movies.
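For instance (all numbers here invented), a simple exponential renewal schedule keeps the first renewals cheap for small creators while making century-long holds expensive enough that only the most valuable properties get renewed:

```python
base_fee = 100  # assumed: $100 for the first renewal after the initial 20-year term
for n in range(6):
    years_covered = 20 + 20 * (n + 1)
    print(f"renewal {n + 1} (through year {years_covered}): ${base_fee * 10 ** n:,}")
# renewal 1 (through year 40):  $100
# ...
# renewal 6 (through year 140): $10,000,000
```

Under a schedule like this, most works would lapse into the public domain after a couple of decades, while a Disney could still pay to keep its flagship properties.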

copywrong2
0 replies
6h47m

Yeah many industries like:

- Big Corps that buy IP

- Patent Trolls

- Companies that fuck over artists

JumpCrisscross
12 replies
18h47m

> Copyright has utterly failed to serve that purpose for a long time, and has been actively counterproductive

This debate is tired because nobody brings citations. The pro-copyright lobby cites numbers of jobs. The anti side, nothing. Given that, of course we're going to stick with the status quo.

kibwen
10 replies
17h56m

This is a specious argument. It is impossible for us to gesture at the works of art that do not exist because of draconian copyright. Humans have been remixing each other's works for millions of years, and the artificial restriction on derivative work is actively destroying our collective culture. There should be thousands of professional works (books, movies, etc.) based on Lord Of The Rings by now, many of which would surpass the originals in quality given enough time, and we have been robbed of them. And Lord Of The Rings is an outlier in that it still remains culturally relevant despite its age; most works will remain copyrighted for far longer than their original audience will even be alive, meaning that those millions of flowers never get their chance to bloom.

eropple
7 replies
16h4m

This is all true, and in a vacuum I agree with it. There's a pretty core problem with these kinds of assertions, though: people have to make rent. Never have I seen a substantive, pass-the-sniff-test argument for how to make this system practical when your authors and your artists need to eat in a system of modern capital.

So I'm asking genuinely: what's your plan? What's the A to B if you could pass a law tomorrow?

JoshTriplett
2 replies
11h0m

> What's the A to B if you could pass a law tomorrow?

Top priority: UBI, together with a world in which there's so much surplus productivity that things can survive and thrive without having "how does this make huge amounts of money" as their top priority to optimize for.

Apart from that: Conventions/concerts/festivals (tickets to a unique live event with a crowd of other fans), merchandise (pay for a physical object), patronage (pay for the ongoing creation of a thing), crowdfunding/Kickstarter (pay for a thing to come into existence that doesn't exist yet), brand/quality preference (many people prefer to support the original even if copies can be made), commissions (pay for unique work to be created for you), something akin to "venture funding", and the general premise that if a work spawns ten thousand spinoffs and a couple of them are incredible hits they're likely to direct some portion of their success back towards the work they build upon if that's generally looked upon favorably.

People have an incredible desire both to create and to enjoy the creations of others, and that's not going to stop. It is very likely that the concept of the $1B movie would disappear, and in trade we'd get the creation of far far more works.

eropple
1 replies
5h29m

Yeah, this is what I was expecting. I have no love for Disney et al but I think that this is dire (aside from UBI, which would be great but is fictional without a large-scale shift in American culture).

"Everybody else gets paid for the work they do; you get paid for things around the work you do, if you're lucky" is a way to expect creatives to live that, to put a point on it, always ends up being "for thee, but not for me". It's bad enough today--I think you described something worse.

JoshTriplett
0 replies
1h24m

The current model is "most people get paid for the work they do, but you get paid for people copying work you've already done", which already seems asymmetric. This would change the model to "people get paid for the work they do, and not paid again for copying work they've already done".

Qwertious
1 replies
11h33m

Not the person you responded to, but:

> So I'm asking genuinely: what's your plan? What's the A to B if you could pass a law tomorrow?

Patreon (or Liberapay etc). Take a look at YouTube: so many creators are actively saying "youtube doesn't pay the bills, if you like us then please support us on Patreon". Patreon works. Some of the time, at least - just like copyright. Also crowdfunding (e.g. Kickstarter), which worked out well for games like FTL and Kingdom Come: Deliverance.

Although, I personally don't believe copyright should be abolished - it just needs some amendments. It needs a duration amendment - not a flat duration (fast fashion doesn't need even 5 years of copyright, but aerospace software regularly needs several decades just to break even), but either some duration mechanism or a simple discrimination by industry.

Also, I think any sort of functional copyright (e.g. software copyright) ought to have an incentive or requirement to publish the functional bits - for instance, router firmware ought to require the source code in escrow (to be published once copyright duration expires) for any legal protections against reverse-engineering to be mounted. Unpublished source code is a trade secret, and should be treated as such.

Also, these discussions don't seem to mention fanfiction, which demonstrates plenty of people write good works without being professionally paid and without the protection of copyright.

davrosthedalek
0 replies
5h33m

How many subscribers on Patreon are there because the creator provides paywalled extra content? How many would remain if that paywalled content were mirrored directly on YouTube?

Crowdfunding might work better, but how many would donate to a game when, instead of getting it cheaper as a Kickstarter supporter, they could get it free after it's released?

uhoh-itsmaciek
0 replies
12h37m

Copyright is not optimized for making sure artists and authors get enough to eat. It's optimized for people with a lot of money to make even more money by exploiting artists and authors.

I doubt there's a simple answer (I certainly don't have one), but the current system is not exactly a creators' utopia.

Kim_Bruning
0 replies
8h30m

My own business model is to create Things That Don't Exist Yet. This (typically bespoke work) is actually the majority of work in any era, I think. For me, copyright doesn't do much; it mostly gets in the way.

If you passed the law tomorrow (all else being equal), my profits would stay the same or go up somewhat.

JoshTriplett
1 replies
11h16m

> It is impossible for us to gesture at the works of art that do not exist because of draconian copyright.

We can gesture at the tiniest tip of the iceberg by observing things that are regularly created in violation of copyright but not typically attacked and taken down until they get popular:

- Game modding, romhacks, fangames, remakes, and similar.

- Memes (often based on copyrighted content)

- Stage play adaptations of movies (without authorization)

- Unofficial translations

- Machinima

- Speedruns, Let's Play videos, and streams (very often taken down)

- Music remixes and sampling

- Video mashups

- Fan edits/cuts, "Abridged" series

- Archiving and preservation of content that would otherwise be lost

- Fan films

- Fanfiction

- Fanart

- Homebrew content for tabletop games

sleepybrett
0 replies
1h16m

"- Speedruns, Let's Play videos, and streams (very often taken down)"

Very often taken down, only by Nintendo.

Kim_Bruning
0 replies
8h48m

Fashion is traditionally not copyrightable[1], and the fashion industry is doing rather well.

Similarly, our IT infrastructure is now built mostly on [a set of patches to the copyright system][2] called F/L/OSS that provided more freedom to authors and users, and led to more innovation and proliferation of solutions.

So even just in the modern west, we can see thriving ecosystems where copyright is absent or adjusted; and where the outcomes are immediately visible on the street.

[1] Though a quick search shows that lawyers are making inroads.

[2] One way of describing it at least, YMMV.

idle_zealot
0 replies
19h24m

Could these "free passes" for AI training serve as a legal wedge to increase the scope of fair use in other cases? Pro-business selective enforcement sucks, but so long as model weights are being released and the public is benefiting, stubbornly insisting that overzealous copyright laws be enforced seems self-defeating.

kelnos
2 replies
18h47m

That ship sailed long ago. While copyright can be and is used at times to protect the "little guy", the law is written as it is in order to protect and further corporate interests.

The current manifestation of copyright is about rent-seeking, not promoting innovation and creativity. That it may also do so is entirely coincidental.

ryandrake
0 replies
18h16m

Also, if it wasn't about rent-seeking and preventing access to works, copyright wouldn't have to last for decades, many multiples of a work's useful commercial life. The fact that it does last this long shows that it's not about promoting innovation and creativity.

DoItToMe81
0 replies
9h59m

Copyright was invented by a cartel of noblemen, the British Stationers' Company, who, due to liberal reform, were going to lose their publishing monopoly. The copyright law they helped pen allowed them to mostly continue their position while portraying it as "protecting the little guy".

Funny how both the rhetoric and intentions are the same after three hundred years.

teeray
0 replies
16h30m

Copyright’s purpose is to be a cudgel wielded to enrich the holder for, ideally, eternity. If “eternity” is threatened, you use proceeds from copyright to change copyright law to protect future proceeds.

CobrastanJorji
0 replies
16h59m

You want to look at the Supreme Court case "Eldred v. Ashcroft." Eldred challenged Congress's retroactive extension of existing copyrights, arguing that extending the protections on already-existing works could not possibly further the arts and sciences. They also argued that if Congress had the power to continually extend existing copyrights by N years every N years, the Constitutional phrase "for limited Times" had no meaning.

The Supreme Court's decision was a bunch of bullshit around "well, y'know, people live longer these days, and some creators are still alive who expected these to last their whole lives, and golly, coincidentally this really helps giant corporations."

stale2002
0 replies
7h39m

> acting like that's the immutable status quo

It is immutable.

What are you going to do about it? Confiscate everyone's home gamer PCs?

Even in the most extreme hypothetical where lawsuits shut down OpenAI, that doesn't delete the Stable Diffusion models that I have on my external hard drives.

The tech is out there. It's too late.

marsten
0 replies
9h16m

There's a strong geopolitical angle as well. If you force American companies to license all training data for LLMs, that is such a gargantuan undertaking it would effectively set US companies back by years relative to Chinese competitors, who are under no such restrictions.

Bottom line, if you're doing something considered relevant to the national interest then that buys you a lot of leeway.

justinclift
0 replies
12h16m

> Violate one or two copyrights, get sued or DMCAed out of existence. Violate billions, on the other hand, and you magically become immune to the rules everyone else has to follow.

Sounds like the same concept as the common "murderer vs conqueror" saying.

Could probably be applied to many other fields for disruption too. Not the murderer bit (!), more the "break one or two laws, scaled up massively, becomes a potential new paradigm" part.

fragmede
0 replies
14h44m

works the same for banks and owing them money

RF_Savage
0 replies
11h54m

Violating billions or millions is what they used to nail warez folks with. So there is that.

Kim_Bruning
0 replies
8h37m

You will need to first demonstrate that actual copying took place. And that whatever copying did take place was actually illegal or infringing.

As we're seeing in court, that's a very interesting question. It turns out that the answers are very counter-intuitive to many.

tomxor
9 replies
20h41m

US copyright does cover "substantial similarity" [0]. And at the other end of the spectrum, this has been abused in absurd ways to argue that substantially different code has infringed.

In ZeniMax v. Oculus they basically argued that a bunch of really abstract yet entirely generic parts of the code were shared: we are talking some nested for loops and certain combinations of if statements. Due to a lack, in the courtroom, of a qualitative understanding of code, syntax, common patterns, and what might actually qualify as substantively novel code, this was accepted as infringing. [1]

Point is, the legal system is highly selective when it comes to corporate interests.

[0] https://en.wikipedia.org/wiki/Substantial_similarity

[1] https://arstechnica.com/gaming/2017/02/doom-co-creator-defen...
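For a sense of scale, the contested patterns were reportedly at roughly this level of abstraction (this snippet is an invented illustration, not code from the case):

```python
# A generic nested scan for the first matching pair of elements -- the kind of
# structure countless programmers write independently from first principles.
def find_pair(items, target):
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] + items[j] == target:
                return i, j
    return None
```

Treating structures this generic as evidence of copying is what's meant above by abuse of "substantial similarity".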

talldayo
6 replies
20h20m

> Point is, the legal system is highly selective when it comes to corporate interests.

I don't even think it's that. In recent cases like Oracle v. Google and Corellium v. Apple, Fair Use prevailed with all sorts of conflicting corporate interests at play. The ZeniMax v. Oculus case very much revolved around NDAs that Carmack had signed and not the propagation of trade secrets. Where IP is strictly the only thing at issue, the literal interpretation of Fair Use does still seem to exist.

Or for a more plain example, Authors Guild v. Google, where Google defended their indexing of thousands of copyrighted books as Fair Use.

tpmoney
5 replies
19h38m

In fact, I'd go so far as to argue your example of Authors Guild v. Google is a good indication that most cases will probably go an AI platform's way. It's a pretty parallel case to a number of the arguments. Indexing required ingesting whole works of copyrighted material verbatim. It utilized that ingested data to produce a new commercial work consisting of output derived from that data. If I remember the case correctly, Google even displayed snippets when matching a search so the searcher could see the match in context, reproducing the works verbatim for those snippets, and one could presume (though I don't recall if it was coded against) that with sufficiently clever search prompts, someone could get the index search to reproduce a substantial portion of a work.

Arguably, the AI platforms have an even stronger case as their nominal goal is not to have their systems reproduce any part of the works verbatim.

jcranmer
3 replies
18h22m

> In fact, I'd go so far as to argue your example of Authors Guild v. Google is a good indication that most cases will probably go an AI platform's way.

The more recent Warhol decision argues quite strongly in the opposite direction. It fronts market impact as the central factor in fair use analysis, explicitly saying that whether or not a use is transformative is in decent part dependent on the degree to which it replaces the original. So if you're writing a generative AI tool that generates stock photos from scraped stock photo databases... I mean, the fair use analysis need consist of nothing more than that sentence to conclude that the use is totally not fair; none of the factors weigh in favor of it.

tpmoney
2 replies
17h56m

I think that decision is much narrower than "market impact". It's specifically about substitution, and to that end, I don't see a good argument that Copilot substitutes for any of the works it was trained on. No one is buying a license to Copilot to replace buying a license to Photoshop, or GIMP, or Linux, or Tux Racer. Nor is GitHub selling Copilot for that use.

To the extent that a user of Copilot could induce it to produce enough of a copyrighted work to both infringe on the content (remember that algorithms are not protected by copyright) and substitute for licensing the original, I would expect the courts to examine that the way they currently view a Xerox machine being used to create copies of a book. While the machine might have enabled the infringement, it is the person using the machine to produce and then distribute copies who is doing the infringing, not the Xerox machine itself nor Xerox the company.

Specifically, in the opinion the court says:

> If an original work and a secondary use share the same or highly similar purposes, and the secondary use is of a commercial nature, the first factor is likely to weigh against fair use, absent some other justification for copying.

I find it difficult to come up with a good case that any given work used to train Copilot and Copilot itself share "the same or highly similar purposes". Even in the case of, say, someone having a code generator that was used in the training of Copilot, I think the courts would also be looking at the degree to which Copilot is dependent on that program. I don't know offhand if there are any court cases challenging the use of copyrighted works in a large collage (like, say, a portrait of a person made from Time Magazine covers), but again my expectation here is that the court would find that while the entire work (that is, the magazine cover) was used and reproduced, that reproduction is a tiny fraction of the secondary work and not substantial to its purpose.

Similarly, we have this line:

> Whether the purpose and character of a use weighs in favor of fair use is, instead, an objective inquiry into what use was made, i.e., what the user does with the original work.

Which I think supports my comparison to the Xerox machine. If the plaintiffs against Copilot could have shown that a substantial majority of users and uses of Copilot were producing infringing works or works that substitute for the training material, they might prevail in an argument that Copilot is infringing regardless of the intent of GitHub. But I suspect even that hurdle would be pretty hard to clear.

jcranmer
1 replies
17h34m

Of the various recent uses of generative AI, Copilot is probably the one most likely to be found fair use and image generation the least likely.

But in any case, Authors Guild is not the final word on the subject, and anyone trying to argue for (or against) fair use for generative AI who ignores Warhol is going to have a bad day in court. The way I see it, Authors Guild says that if you are thoughtful about how you design your product, and talk to your lawyers early and continuously about how to ensure your use is fair and will be seen as fair in the courts, you can indeed do a lot of copying and still be fair use.

tpmoney
0 replies
16h54m

I agree. Nothing is going to be the final word until more of these cases are heard. But I still don't think Warhol is as strong even against other uses of generative AI, and in fact I think in some ways argues in their favor. The court in Warhol specifically rejects the idea that the AWF usage is sufficiently transformed by the nature of the secondary work being recognizably a Warhol. I think that would work the other way around too, that a work being significantly in a given style is not sufficient for infringement. While certainly someone might buy a license to say, Stable Diffusion and attempt to generate a Warhol style image, someone might also buy some paints and a book of Warhol images to study and produce the same thing. Provided the produced images are not actually infringements or transformations of identifiably original Warhol works, even if they are in his style, I think there's a good argument to be made that the use and the tool are non-infringing.

Or put differently, if the Warhol image had used Goldsmith's image as a reference for a silk screen portrait of Steven Tyler, I'm not sure the case would have gone the same way. Warhol's image is obviously and directly derived from Goldsmith's image and was found infringing when licensed to magazines, yet if Warhol had instead gone out and taken black and white portraits of Prince, even in Goldsmith's style after having seen it, would it have been infringing? I think the closest case we have to that would be the suit between Huey Lewis and Ray Parker Jr. over "I Want a New Drug"/"Ghostbusters", but that was settled without a judgement.

I do agree that Warhol is a stronger argument against artistic AI models, but it would very much have to depend on the specifics of the case. The AWF usage here was found to be infringing, with no judgement made of the creation and usage of the work in general, but specifically with regard to licensing the work to the magazine. They point out the opposite case that his Campbell paintings are well established as non-infringing in general, but that the use of them licensed as logos for soup makers might well be. So as is the issue with most lawsuits (and why I think AI models in general will win the day), the devil is in the details.

belorn
0 replies
19h5m

A key finding the judge made in the Authors Guild v. Google case was that the authors benefited from the tool that Google created. A search tool is not a replacement for a book, and is much more likely to generate awareness of the book, which in turn should increase sales for the author.

AI platforms that replace and directly compete with authors cannot use the same argument. If anything, those suing AI platforms are more likely to bring up Authors Guild v. Google as a guiding case to determine when to apply fair use.

wahern
0 replies
18h48m

> US copyright does cover "substantial similarity"

Substantial similarity refers to three different legal analyses for comparing works. In each case what the analysis is attempting to achieve is different, but in no case does it operate to prohibit similarity, per se.

The Wikipedia page points out two meanings. The first is a rule for establishing provenance. Copyright protects originality, not novelty. The difference is that if two people coincidentally create identical works, one after another, the second-in-time creator has not violated any right of the first. (Contrast with patents, which do protect novelty.) In this context, substantial similarity is a way to help establish a rebuttable presumption that the latter work is not original, but inspired by the former; it's a form of circumstantial evidence. Normally a defendant wouldn't admit outright they were knowingly inspired by another work, though they might admit this if their defense focuses on the second meaning, below. The plaintiff would also need to provide evidence of access or exposure to the earlier work to establish provenance; similarity alone isn't sufficient.

The second meaning relates to the fact that a work is composed of multiple forms and layers of expression. Not all are copyrightable, and the aggregate of copyrightable elements needs to surpass a minimum threshold of content. Substantial similarity here means a plaintiff needs to establish that there are enough copyrightable elements in common. Two works might be near identical, but not be substantially similar if they look identical merely because they're primarily composed of the same non-copyrightable expressions, regardless of provenance.

There's a third meaning, IIRC, referring to a standard for showing similarity at the pleadings stage. This often involves a superficial analysis of apparent similarity between works, but it's just a procedural rule for shutting down spurious claims as quickly as possible.

copywrong2
0 replies
6h50m

Copyright is abused often. Our modern version of copyright is BS and only benefits large corps who buy a lot of IP.

scott_w
4 replies
1h46m

While correct, the example given is that they COPY the code, then make adjustments to hide the fact. I suspect this is still a copyright violation. It’s interesting that a judge sees it differently when it’s just run through a program. I’m not a legal expert, so I’m guessing it’s a bit more complex than the headline?

itishappy
2 replies
1h28m

No copy-paste was explicitly used. They compressed it into a latent space and recreated from memory, perhaps with a dash of "creativity" for flavor. Hypothetically, of course.

The distinction is pedantic but important, IMHO. AI doesn't explicitly copy either.

scott_w
1 replies
1h18m

But isn’t that the same as memorising it and rewriting the implementation from memory? I’m sure “it wasn’t an exact reproduction” is not much of a defence.

itishappy
0 replies
1h5m

I sure think so. I also think that (to first order) this is exactly what modern AI products do. Is a lossy copy still a copy?

scott_w
0 replies
1h43m

Ok, I read the article, and it looks like the issue is the DMCA specifically, which requires the code to be more nearly identical than what was presented. I’m guessing separate claims could still come from other copyright laws?

torginus
1 replies
10h38m

If I were to license a cover of a song for a music video, I'd have to license both the original song and the cover itself.

I'd say this is extremely relevant in this case.

bryanrasmussen
0 replies
5h17m

If that is the case, why do people ever license covers?

To clarify - I thought you just had to negotiate with the cover artist about rights and pay a nominal fee for usage of the song for cover purposes - that is to say, you do not negotiate with the original artist, you negotiate with the cover artist, and the whole process is cheaper?

ADeerAppeared
1 replies
20h49m

The simple version is that code is copyrightable as an expression. And the underlying algorithm is patentable.

The legal term you're looking for here is the "Abstraction-Filtration-Comparison" test; What remains if you subtract all the non-copyrightable elements from a given piece of code.

adrian_b
0 replies
11h0m

Algorithms have become patentable only very recently in the history of patents, without a rationale ever being provided for this change, and in some countries they have never become patentable.

Even in the countries other than the USA where algorithms have become patentable, that happened only due to the USA blackmailing those countries into changing their laws "to protect (American) IP".

It is true however that there exist some quite old patents which in fact have patented algorithms, but those were disguised as patents for some machines executing those algorithms, in order to satisfy the existing laws.

giamma
0 replies
10h32m

Software like Black Duck or SCANOSS is designed to identify exactly that type of behaviour. It is very often used to scan closed-source software and check whether it contains snippets copied from open source under incompatible licenses (e.g. GPL).

To do so, these tools build a syntax tree of your code snippet and compare its structure against trees from open-source software, without being fooled by variable names. To speed up the search, they also compute a signature for each tree, so the signature can be looked up quickly in their database of open-source code.
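A minimal sketch of the general idea (an illustration of the technique, not how Black Duck or SCANOSS actually implement it): fingerprint the shape of the syntax tree while discarding identifier names, so a renamed copy still produces the same signature:

```python
import ast
import hashlib

def structural_fingerprint(source: str) -> str:
    """Hash a snippet's syntax-tree shape, ignoring all identifier names."""
    tree = ast.parse(source)
    # Record only the node types, in walk order, so renaming variables
    # yields exactly the same sequence and therefore the same hash.
    shape = " ".join(type(node).__name__ for node in ast.walk(tree))
    return hashlib.sha256(shape.encode()).hexdigest()

a = "def add(x, y):\n    return x + y"
b = "def sum_two(first, second):\n    return first + second"
assert structural_fingerprint(a) == structural_fingerprint(b)  # renames don't fool it
```

Signatures like this can then be indexed, turning "is this snippet structurally identical to anything in the corpus?" into a cheap database lookup.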

Analemma_
14 replies
21h39m

> How is it any different when a machine does the same thing?

Because intent matters in the law. If you intended to reproduce copyrighted code verbatim but tried to hide your activity with a few tweaks, that's a very different thing from using a tool which occasionally reproduces copyrighted code by accident but clearly was not designed for that purpose, and much more often than not outputs transformative works.

munificent
7 replies
18h33m

> clearly was not designed for that purpose,

I'm not aware of evidence that supports that claim. If I ask ChatGPT "Give me a recipe for squirrel lemon stew" and it so happens that one person did write a recipe for that exact thing on the Internet, then I would expect that the most accurate, truthful response would be that exact recipe. Anything else would essentially be hallucination.

zmmmmm
4 replies
18h14m

I think you are misconceiving, then, how LLMs work / what they are.

You can certainly try to hit a nail with a screwdriver, but that doesn't make the screwdriver a hammer.

munificent
2 replies
15h8m

As I understand it, LLMs are intended to answer questions as "truthfully" as they can. Their understanding of truth comes from the corpus they are trained on. If you ask a question where the corpus happens to have something very close to that question and its answer, I would expect the LLM to burp up that answer. Anything less would be hallucination.

Of course, if I ask a question that isn't as well served by the corpus, it has to do its best to interpolate an answer from what it knows.

But ultimately its job is to extract information from a corpus and serve it up with as much semantic fidelity to the original corpus as possible. If I ask how many moons Earth has, it should say "one". If I ask it what the third line of Poe's "The Raven" is, it should say "While I nodded, nearly napping, suddenly there came a tapping,". Anything else is wrong.

If you ask it a specific enough question where only a tiny corner of its corpus is relevant, I would expect it to end up either reproducing the possibly copyrighted piece of that corpus or, perhaps worse, coughing up some bullshit because it's trying to avoid overfitting.

(I'm ignoring for the moment LLM use cases like image synthesis where you want it to hallucinate to be "creative".)

zmmmmm
0 replies
12h59m

I get that's what you and a lot of people want it to be, but it isn't what they are. They are quite literally probabilistic text generation engines. Let's emphasise that: the output is produced randomly by sampling from distributions, or in simple terms, like rolling dice. In a concrete sense it is non-deterministic. Even if an exact answer is in the corpus, its output is not going to be that answer, but the most probable answer from all the text in the corpus. If that one answer that exactly matches contradicts the weight of other, less exact answers, you won't see it.

And you probably wouldn't want to - if I ask whether donuts are radioactive and one person explicitly said that on the internet, you probably aren't going to tell me you want it to spit out that answer just because it exactly matches what you asked. You want it to learn from the overwhelming corpus of related knowledge that says donuts are food, people routinely eat them, etc., and tell you they aren't radioactive.
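To illustrate the "rolling dice" point, here is next-token sampling reduced to a toy (a real model computes learned logits over a vocabulary of tens of thousands of tokens; the scores below are made up):

```python
import math
import random

def sample_token(logits, temperature=1.0):
    """Draw one token from a softmax over scores -- a weighted dice roll, not argmax."""
    scaled = [(tok, score / temperature) for tok, score in logits.items()]
    z = sum(math.exp(s) for _, s in scaled)
    tokens = [tok for tok, _ in scaled]
    weights = [math.exp(s) / z for _, s in scaled]
    # The draw is stochastic: the best-scored token is only the most likely
    # outcome, never a guaranteed one.
    return random.choices(tokens, weights=weights, k=1)[0]

# Made-up scores for the word after "The Earth has ___ moon(s)":
print(sample_token({"one": 3.2, "two": 0.4, "zero": 0.1}))  # usually "one", not always
```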

kortilla
0 replies
13h36m

They are all hallucinations. Calling lies hallucinations and truths normal output is nonsense.

paulddraper
0 replies
17h53m

Perfect analogy.

remuskaos
1 replies
12h39m

Recipes are not copyrightable for that exact reason.

sleepybrett
0 replies
1h10m

Substitute recipe for literally any other piece of unique information.

olliej
2 replies
21h4m

Um, the entire intent of these "AI" systems is explicitly to reproduce copyrighted work with mechanical changes to make it not appear to be a verbatim copy.

That is the whole purpose and mechanism by which they operate.

Also the intent does not matter under law - not intending to break the law is not a defense if you break the law. Not intending to take someone's property doesn't mean it becomes your property. You might get lesser penalties and/or charges due to intent (the obvious examples being murder vs manslaughter, etc).

But here we have an entire ecosystem where the model is "scan copyrighted material" followed by "regurgitate that material with mechanical changes to fit the surrounding context and to appear to be 'new' content".

Moreover, given that this 'new' code is just a regurgitation of existing code, with mutations to make it appear to fit the context and not be directly identical to the existing code, that 'new' code cannot be subject to copyright (you can't claim copyright in something you did not create; copyright does not protect the output of mechanical or automatic transformations of other copyrighted content; and copyright does not protect the result of "natural processes", e.g. 'I asked a statistical model to give me a statistically plausible sequence of tokens and it did'). So in the best case scenario - the one where the copyright-laundering-as-a-service tool is not treated as just that - any code it produces is not protectable by copyright, and anyone can just copy "your work" without the license, and (since you've said it's fine as long as you weren't intending to violate copyright) they can say they could not distinguish the non-copyright-protected work from the protected work and assumed that therefore none of it was subject to copyright. To be super sure, though, that they weren't violating any of your copyrights, they then ran an "AI tool" to make the names better and to better suit your style.

I am so sick of these arguments where people spout nonsense about "AI" systems magically "understanding" or "knowing" anything - they are very expensive statistical models. They produce statistically plausible strings of text by a combination of copying the text of others wholesale and filling the remaining space with bullshit that for basic tasks is often correct enough, and for anything else is wrong - because again, they're just producing plausible sequences of tokens and have no understanding of anything beyond that.

To be very very very clear: if an AI system "understood" anything it was doing, it would not need to ingest essentially all the text that anyone has ever written just to produce content that is at best only locally coherent, and that is frequently incorrect in more or less every domain to which it is applied. Take code completion (as in this case): developers can write basic code without first reading essentially all the code that has ever existed, because developers understand code. Developers don't intermingle random unrelated and non-present variables or functions in their code as they write, because they understand what variables are and therefore can't use nonexistent ones. "AI", on the other hand, required more power than many countries consume to "learn" by reading as much of all code ever written as possible, and then produces nonsense output for anything complex, because it is still just generating a string of tokens that is plausible according to its statistical model. The result of these AIs is essentially binary: either it has in effect been asked to produce code that does something that was in its training corpus and can be copied essentially verbatim, with some transformation to make it fit, or it's not in the training corpus and you get random and generally incorrect code - hopefully wrong enough that it fails to build, because they're also good at generating code that looks plausible but fails only at runtime, since a plausible sequence of tokens often overlaps with 'things a compiler will accept'.

shkkmo
0 replies
20h1m

> Also the intent does not matter under law - not intending to break the law is not a defense if you break the law

Intent frequently matters a great deal when applying laws.

In the specific area of copyright law, it doesn't itself make the use non-infringing, but it can absolutely impact the damages or a fair use argument.

Kim_Bruning
0 replies
8h17m

I actually once tracked this claim down in the case of Stable Diffusion.

I concluded that it was just completely impossible for a properly trained Stable Diffusion model to reproduce the works it was trained on.

The SD model easily fits on a typical USB stick, and comfortably in the memory of a modern consumer GPU.

The training corpus for SD is a pretty large chunk of image data on the internet. That absolutely does not fit in GPU memory - by several orders of magnitude.

No form of compression known to man would be able to get it that small. People smarter than me say it's mathematically not even possible.
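A rough back-of-the-envelope shows the gap (all three figures are ballpark assumptions, not measurements):

```python
model_bytes = 4e9    # assumed: a ~4 GB Stable Diffusion checkpoint
num_images  = 2e9    # assumed: ~2 billion training images (LAION-scale)
image_bytes = 100e3  # assumed: ~100 KB per training image on average

corpus_bytes = num_images * image_bytes  # ~200 TB of training data
print(model_bytes / num_images)          # ~2.0 -> about two bytes of weights per image
print(corpus_bytes / model_bytes)        # ~50,000 -> the implied "compression" ratio
```

Two bytes of weights per image is the "several orders of magnitude" gap being referred to: nowhere near enough to store the images themselves.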

Now for closed models, you might be able to argue something else is going on and they're sneakily not training neural nets or something. But the open models we can inspect? Definitely not.

Modern ML/AI models are doing Something Else. We can argue what that Something Else is, but it's not (normally) holding copies of all the things used to train them.

archontes
1 replies
21h25m

Not in copyright. The work speaks for itself, and the function of code is not a copyrightable aspect.

bawolff
0 replies
19h42m

The intent of the work can matter when determining if de minimis applies as well as fair use.

anigbrowl
0 replies
19h3m

It's equally plausible to say you don't intend to reproduce copyrighted code verbatim but occasionally do so, either given a sufficiently specific prompt or because the reproduced code is so generic that it probably gets rewritten a hundred times a day, since that's how people learned to do basic things from books, documentation, or their education.

singleshot_
13 replies
21h46m

The guy who owns the machine is really rich, while you are more or less (all due respect of course) not worth suing.

That’s why I think the opposite of what you claim is true: if you were to do this, absolutely nothing would happen. When they do it, they will get sued over and over until the law changes and they can’t be sued, or they enter some mutually-beneficial relationship with the parties who keep suing.

beeboobaa3
12 replies
21h39m

> if you were to do this, absolutely nothing would happen

Read up on the DMCA and the impact it has on e.g. Nintendo emulators and the developers thereof.

dmix
8 replies
21h25m

Those emulators are very popular, though, to the point of potentially impacting another business's bottom line, whereas an individual putting out a small block of code isn't exactly going to attract expensive lawyers.

I'm skeptical GitHub Copilot reproducing a couple of functions potentially used by some random GitHub project is going to be a threat to another party's livelihood.

When AI gets good enough to make full duplicates of apps I'd be more concerned about the source. Thousands of smaller pieces drawn from a million sources and being combined in novel ways is less worrying though.

BadHumans
7 replies
20h53m

There is no impact to a company's bottom line when you are emulating a product they do not sell.

lcouturi
4 replies
20h31m

Yuzu, the emulator that was sued by Nintendo, was emulating the Nintendo Switch, which is a product Nintendo does sell.

BadHumans
3 replies
20h8m

Yuzu is not the only emulator taken down by Nintendo and Nintendo is not the only company that has gone after emulators.

lcouturi
2 replies
16h18m

In that case, could you clarify what instances of this you're referring to?

The death of Citra wasn't really a deliberate action on the part of Nintendo, it was collateral damage. Citra was started by Yuzu developers and as part of the settlement they were not able to continue working on it. Citra's development had long been for the most part taken over by different developers, but the Yuzu people were still hosting the online infrastructure and had ownership of the GitHub repository, so they took all of it down. Some of the people who were maintaining Citra before the lawsuit opened up a new repository, but development has slowed down considerably because the taking down of the original repository has caused an unfortunate splintering of the community into many different forks.

There is some speculation Nintendo was involved with the death of the Nintendo 64 emulator UltraHLE a long time back, but this was never confirmed. If indeed they did go after UltraHLE, then this would, just like Yuzu, be a case of them taking down an emulator for a console they were still profiting from, as UltraHLE was released in 1999.

The most famous example of companies going after emulators is Sony, which went after Connectix Virtual Game Station and Bleem!. Both were PS1 emulators released in 1999, a period during which Sony was still very much profiting from PS1 sales. Sony lost both lawsuits and hasn't gone after emulators since.

In 2017, Atlus tried to take down the Patreon page for RPCS3, a PS3 emulator. However, Atlus only went after the Patreon page, not the emulator itself, which they did because of their use of Persona 5 screenshots on said page. The screenshots were simply taken down and the Patreon page was otherwise left alone. Of note is that Atlus is a game developer, so they were never profiting from PS3 sales. However, they were certainly still profiting from Persona 5 sales, which had only released in 2016.

These are the only examples I can remember. Did I miss anything?

omegacharlie
0 replies
12h57m

Emulators for many Nintendo consoles have been developed and released while the console was still being sold, and have been left alone as long as they had no direct links to piracy; recent events are a bit of a change.

> There is some speculation Nintendo was involved with the death of the Nintendo 64 emulator UltraHLE a long time back, but this was never confirmed.

IIRC it got a C&D but a case was never filed in court; the source code turned up eventually anyway.

fragmede
0 replies
14h40m

The bnetd emulator, which let Diablo and StarCraft players avoid paying Blizzard for the privilege of buying the game - though that's a bit different.

fragmede
1 replies
14h37m

Yes there is. If I can emulate Super Mario Odyssey on my PC, I don't need to buy a Nintendo Switch. If it weren't available there, I'd have to buy a Nintendo Switch to play it. That's a lost sale for Nintendo. You could argue that I wasn't going to buy a Switch anyway, but then we're getting too into hypotheticals.

ExoticPearTree
0 replies
9h49m

This is the same reasoning the music and movie industries use when they go after people downloading music. And contrary to popular opinion, I think it is wrong: if people want to pay, they will pay. Same for movies: if people really wanted to pay for a movie, they would go to a cinema, or stream it after a week or two. But there are also people who would rather jump through hoops than pay for music or movies. And that is not a lost sale, because there was never an intention to buy something in the first place.

singleshot_
2 replies
19h30m

I enjoy how you removed the “I think” qualifier which suggested that it’s very possible that you’re right.

I’m quite well read on the DMCA but admit you probably know far more about how Nintendo wields it.

Still, I suggest that it’s a lot more likely that GitHub is going to get sued than you or GP.

Finally, I believe using the legal system to bully independent software developers is, in legal terms, super lame. We are probably on the same side here.

bawolff
1 replies
19h20m

The DMCA (at least the takedown requests part) is not really suing someone and not really about making money. It's about getting certain works off the internet.

You are probably more likely to be on the wrong end of a DMCA takedown request as a poor person, since you don't have the resources to fight it, and it's not about recovering damages, just censorship.

singleshot_
0 replies
18h56m

We are really losing the plot of what this thread is about here, but: DMCA takedown requests that are ignored, or where the site does not comply with the process, are subject to private civil action. Obviously, a takedown request is distinct from suing someone. And the way the rights holder forces the site to remove the content is under threat of monetary penalties.

skissane
3 replies
13h36m

> I assume that I would get my ass kicked legally speaking. That reads to me exactly like deliberate copyright infringement with willful obfuscation of my infringement.

It looks like wilful obfuscation because the obfuscation is so simplistic. But as the obfuscation gets increasingly sophisticated, it becomes ever harder to distinguish wilful obfuscation from genuine originality.

chii
2 replies
13h10m

But sufficiently complex obfuscation of infringement is very hard to distinguish from genuine originality.

For the purposes of copyright, what matters is not the originality of the idea but the expression; ideas are the domain of patents, which require novelty.

The 'sufficiently complex obfuscation' is exactly what people's brains go through when they learn and then reproduce what they learnt in a different context.

I argue that AI-training can be considered to be doing the same.

skissane
1 replies
12h37m

Some different scenarios:

(1) You leave your employer, don’t take any code with you, start your own company, reimplement your ex-employer’s product from scratch, but you do it in a very different way (different language, different design choices, different tech stack, different architecture)

(2) You leave your employer, take their code with you, start your own company, make some superficial changes to their code to obscure your theft but the copying is obvious to anyone who scratches the surface

(3) You leave your employer, take their code with you, start your own company, start very heavily manually refactoring their code, within a few months it looks completely different, very difficult to distinguish from (1) unless you have evidence of the process of its creation

(4) You leave your employer, take their code with you, start your own company, download some “infringement obfuscation AI agent” from the Internet and give it your employer’s codebase, within a few hours it has transformed it into something difficult to distinguish from (1) if you didn’t know the history

(1) is unlikely to be held to be infringing. (2) is rather obviously going to be held to be infringing. But what about (3)? IANAL, but I suspect if you admitted that is how you did it, a judge would be unlikely to be very sympathetic. Your best hope would be to insist you actually did (1) instead. And then the outcome of the case might come down to whether the judge/jury believes your claim you actually did (1), or the plaintiff/prosecution’s claim you did (3).

And (4) is basically just (3) with AI to make it a lot faster. Such an agent likely doesn't exist yet, but it could happen.

Timing is obviously a factor. If you leave your employer and launch a clone of their app the next week, everyone is going to think either you stole their code, or you were moonlighting on writing it (in which case they may legally own it anyway). If it takes you 12 months, it becomes more believable you wrote it from scratch. But if someone uses AI to launder code theft, maybe they can build the “clone” in a few days or weeks, and then spend a few months relaxing and recharging before going public with it

megaman821
0 replies
4h13m

Numbers 2, 3, and 4 are all illegal because they start with an illegal action.

If I find a dollar on the sidewalk and put it in my wallet, is that stealing? If I punch a man getting change at a hotdog stand and a dollar falls on the sidewalk and then I put that in my wallet, is that stealing?

It doesn't matter what the scenario is after you stole code from your former employer; every action after that is poisoned.

hyperpape
3 replies
19h47m

From the article:

The most recently dismissed claims were fairly important, with one pertaining to infringement under the Digital Millennium Copyright Act (DMCA), section 1202(b), which basically says you shouldn't remove without permission crucial "copyright management" information, such as in this context who wrote the code and the terms of use, as licenses tend to dictate.

It was argued in the class-action suit that Copilot was stripping that info out when offering code snippets from people's projects, which in their view would break 1202(b).

The judge disagreed, however, on the grounds that the code suggested by Copilot was not identical enough to the developers' own copyright-protected work, and thus section 1202(b) did not apply. Indeed, last year GitHub was said to have tuned its programming assistant to generate slight variations of ingested training code to prevent its output from being accused of being an exact copy of licensed software.

So (not a lawyer!) this reads like the point about GitHub tuning their model is not a generic defense against any and all claims of copyright infringement, but a response to a specific claim that this violates a provision of the DMCA.

I don't know whether this is a reasonable defense or not, but your intuitions or mine about whether there is a general copyright violation or what's fair are not necessarily relevant to how the judge construes that very specific bit of legal code.

xinayder
2 replies
7h18m

What I got from this is that you can copy someone's copyrighted work provided you tweak a few things here and there. I wonder how this holds up in court if you don't have billions at your disposal.

sleepybrett
1 replies
1h12m

Weird Al should be in the clear then; he changes probably 85% of the song lyrics in his covers.

sensanaty
0 replies
48m

Weird Al explicitly seeks out permission from copyright holders and won't do a cover if he doesn't get their go-ahead [1].

Pretty much the exact opposite of all these AI companies :p

https://www.weirdal.com/archives/faq/

dmix
3 replies
21h35m

That's a significant oversimplification of how it works, though, to the point of almost not being a useful analogy.

If your analogy was that you were a human who memorized every variation of a problem (and every other known problem) and there was a tiny percentage of a chance that you reproduced the exact variation of one you memorized, but then added an after-the-fact filter so you don't directly reproduce it...

It's more like musicians who absorb a bunch of music patterns or chord progressions, notice their final output sounds too similar to another song (which happens often IRL), then change it to be more original before releasing it to the public.

ADeerAppeared
1 replies
21h19m

If your analogy was that you were a human who memorized every variation of a problem (and every other known problem)

This is mere assumption. AI is supposed to work like that, but that's a goal, not the result of current implementations. Research shows that they do memorize solutions as well, and quite regularly so. (This is an unavoidable flaw in current LLMs; they must be capable of memorizing input verbatim in order to learn specific facts.)

and there was a tiny percentage of a chance that you reproduced the exact variation of one you memorized

This is copyright infringement. Actionable copyright infringement. The big music publishers go after this kind of accidental partial reproduction.

but then added an after the fact filter so you don't directly reproduce it...

"Legally distinct" is a gimmick that only works where the copyright is on specific identifiable parts of a work.

Changing a variable name does not make a code snippet "legally distinct", it's still copyright infringement.

dmix
0 replies
20h52m

Meh, I still see that as a big oversimplification. Context matters, even if the copyright courts often ignore that for wealthy entities. Someone reproducing a song using AI and publishing it as their own is copyright infringement. A person specifically querying an AI engine that sucked up billions of lines of information, which generates what you ask it to, with a small probability that it will reproduce a small subset of a larger commercial project and send it to someone in a chat box, is not exactly the same IMO.

This is GitHub Copilot after all. I use it daily, and it autocompletes lines of code or generates functions you can find on Stack Overflow. It's not giving you the source code to Twitter in full and letting you put it on the internet as a business under another name.

belorn
0 replies
20h30m

We are currently seeing the music industry react to AI that has learned a bunch of music patterns and chord progressions and outputs works that sound very similar to existing music and artists. They do not like it.

To see just how much they dislike it: YouTube's copyright-strike system is basically an AI trained to detect music patterns, so it can identify copyrighted songs even with slight variations and take videos down. Generating slight variations was one of the early methods videos used to bypass the takedown system.

skybrian
2 replies
19h47m

The machine alone doesn't do anything. The user and machine together constitute a larger system, and with autocomplete, the user is in charge. What's the user's intent?

I suspect that a lot of copyright violations are enabled by cut-and-paste and screenshot-taking functionality, and maybe we need to be careful with autocomplete, too? It's the user's responsibility to avoid this. We should be careful using our tools. Do users take enough care in this case? Is it possible to take enough care while still using CoPilot?

I've switched from CoPilot to Cody, but I use them the same way, to write my code. There's no particular reason to use CoPilot's output verbatim and lots of good reasons not to. By the time I've adapted it to my code base and code style and refactored it to hell and back, it's an expression of how I want to solve a problem, and I'm pretty confident claiming ownership.

Is that confidence misplaced? Are other people more careless?

BeefWellington
1 replies
14h51m

The machine alone doesn't do anything.

By the same token, the machine alone can't download pirated movies. Yet the sites hosting those movies are targeted as the infringers.

There's a point at which foisting this responsibility on the users is simply socializing losses. Ultimately Copilot is the one serving the code up - regardless of the user's request. If the user then goes on to republish that work as their own it becomes two mistakes. It'll be interesting to see if any lawyers are capable of articulating that well enough in any of these lawsuits.

Is that confidence misplaced? Are other people more careless?

I would say yes, for two reasons. One is that using code of unknown provenance means you're opening yourself to unknown legal risks. The second is if you're rewriting it fully (so as not to run afoul of easily spotted copyright) that's not actually "clean room" and you're still open to problems. I'd also wonder what the point of using a code writing LLM is anyways if you're doing all the authorship yourself. It seems like doing double the work.

skybrian
0 replies
11h47m

It is a lot of work to do a lot of rewrites, but it’s noncommercial and I’m not in a hurry. And autocomplete is still pretty useful.

xorcist
1 replies
18h50m

Why stop there? Extrapolate that thought, keep generating more variants of the code, claim copyright, and seek rent from other people doing the same thing. To extrapolate full circle, there would be a business opportunity to generate as many variants as possible for the original author, to prevent all this from happening.

As long as we're not required to register copyright there's no reason to think the above will play out. International copyright agreements are not limited to verbatim copies only.

BeefWellington
0 replies
14h48m

Why stop there? Extrapolate that thought, keep generating more variants of the code, claim copyright, and seek rent from other people doing the same thing. To extrapolate full circle, there would be a business opportunity to generate as many variants as possible for the original author, to prevent all this from happening.

This has already been done[1] in music, though in their case they released them to the public domain. Admittedly I think that was more of a protest than anything.

[1]: https://www.vice.com/en/article/wxepzw/musicians-algorithmic...

wvenable
1 replies
21h8m

You probably do this all the time. Forget memorizing but undoubtedly you've read code, learned from it, and then likely reproduced similar code. Probably nothing terribly important, just a function here or there. Maybe even reproduced something you did for a previous employer.

Aerroon
0 replies
16h38m

arr.sort((a, b) => a - b);

comes to mind. I bet most js devs have written this verbatim.

roenxi
1 replies
16h31m

If you tell a programmer to implement a function foo(a, b) then there are actually only a tiny number of ways to do that, semantically speaking, for any given foo. The number of options narrows quickly as the programmer implementing it gets more competent.

Choosing function signatures is an art form but after that "copying" is hard to judge.
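
A toy illustration of how small that space is (names invented for this comment): give two programmers the spec "sum the numbers in an array" and their independent implementations land within a hair of each other.

    // implementation A
    function sum(xs: number[]): number {
      return xs.reduce((acc, x) => acc + x, 0);
    }

    // implementation B, written independently; once the signature is
    // fixed, the remaining expressive choices are close to trivial
    function total(values: number[]): number {
      let t = 0;
      for (const v of values) t += v;
      return t;
    }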

sva_
0 replies
6h41m

a function foo(a, b) then there are actually only a tiny number of ways to do that

I'd argue there are infinite ways to implement any function, just almost all of them are extremely bad.

woah
0 replies
15h36m

You would not get your ass kicked legally speaking. Copyright is not that broad. It's not a patent.

williamcotton
0 replies
21h9m

Just to set the stage and not entirely specific to this complaint... It really depends on what is and isn't subject to copyright for software.

Broadly, there is the distinction between expressive and functional code. [1]

And then there are the specific tests that have been developed by the courts to separate the expressive and functional aspects of software. [2] [3]

In practice it is very expensive for a plaintiff to do such analysis. For the most part the damages related to copyright are not worth the time and money. Plaintiffs tend to go for trade secret related damages as they are not restricted by the above tests.

There are also arguments to be made of de minimis infringements that are not worth the time of the court.

Most importantly the plaintiff fundamentally has the burden of proof and cannot just say that copying must have taken place. They need concrete evidence.

[1] https://en.wikipedia.org/wiki/Idea–expression_distinction

[2] https://en.wikipedia.org/wiki/Structure,_sequence_and_organi...

[3] https://en.wikipedia.org/wiki/Abstraction-Filtration-Compari...

sim7c00
0 replies
9h59m

It depends on how much tax you are paying, really. If you pay billions in taxes annually, they might see past it. If the company you copied from pays billions in taxes annually, you will go to jail. If this isn't painfully obvious by now...

rnkn
0 replies
6h2m

It seems the total disregard that the tech community showed toward copyright when it was artists losing out has come back to bite. Face-eating leopards, etc.

kjellsbells
0 replies
3h55m

Days like this, I wonder what Borges would have made of such questions.

"Pierre Menard, author of redis"

I know from experience that parents are aggressively pushing their children into STEM to maximize their chances of being economically secure, but, I really feel that we need a generation of philosophers and humanists to sift through the issues that our technology is raising. What does it mean to know something? What does authorship mean? Is a translated work the same as the original? Borges, Steiner, and the rest have as much to contribute as Ellison, Zuckerberg, and Altman.

jollofricepeas
0 replies
16h36m

No clue.

But what if the generative AI were used to create music instead of code would the court have ruled differently?

CONSIDER:

In 2015, a federal judge ordered Thicke & Pharrell to pay 50% of proceeds to the Marvin Gaye estate because their song "Blurred Lines" was "too similar" to "Got to Give It Up".

Comparison and commentary: https://youtu.be/7_UiQueteN4?si=SkClbyBMOcucigRm

Comparison of both songs: https://youtu.be/ziz9HW2ZmmY?si=3_VZzfoLT-NrozoK

isodev
0 replies
8h42m

It would. And this is where some legislation "in the spirit of" would have helped. So Microsoft's huge legal arm can't just wiggle their way out on technicalities. Clearly, the law is not prepared to face the challenge of copyright violations on the scale created by the LLMs.

I also think it's not just copyright. It's simply not right to create a product on top of the collective work of all open source developers, monetize it on the absurd scale Microsoft operates at, and never, ever credit the original creators.

eftychis
0 replies
18h45m

Adding to the sibling comments:

First: every human is per se doing that already. We have – to handwave – a "reasonable person" bar to separate violations versus results of learning and new innovation.

Second: You can be a holder of copyright and your creations result in copyrightable artifacts. Anything generated by the program has been held as uncopyrightable.

constantcrying
0 replies
10h16m

I assume that I would get my ass kicked legally speaking.

Why? This is no different than copy-pasting and modifying a bit of code from some documentation, another project, a tutorial, or SO. Surely if that were a basis for copyright infringement, most semi-large software projects would be infringing.

I don't think anyone here should be willing to open the can of worms that is copy-pasting small snippets of code and modifying them.

The judge seems to argue that the non-identical copies are at issue here and that they only happen under contrived circumstances. My moral opinion is that this is irrelevant and that even the defendant is the wrong party. Even verbatim copies of code snippets shouldn't be copyright infringement, and suing the company providing the AI is wrong to begin with, as the AI or its provider cannot possibly be the one to infringe.

bmitc
0 replies
6h40m

Regardless of the details here, it's become quite clear that the judicial system is for corporations. It doesn't matter whether they win, lose, or settle, as they win regardless, since the monetary benefits of what got them in court in the first place far outweigh any punishment or settlement cost.

bena
0 replies
5h4m

I agree. I don’t see the difference.

That’s the entire reason “clean room reverse engineering” is done.

Using nothing but the binary itself, work out how things are done. Making sure that the reverse engineers don’t even have access to any material that could look like it came from the other organization in question. And that it is provable.

beeboobaa3
0 replies
21h44m

Rules for thee but not for me (rich companies). Think of the shareholders!

bawolff
0 replies
13h18m

How is it any different when a machine does the same thing?

I think the argument is that the machine is not doing that, or at least there isn't evidence that it is doing that.

Specifically, no evidence that GitHub is doing both 1 and 2 at the same time. There might be cases where it makes trivial changes to code (point 2), but to code that does not meet the threshold of originality. Similarly, there might be cases with copyrighted code where the idea of it is taken but expressed in such a different way that it is not a straightforward derivative of the expression (keeping in mind you cannot copyright an idea, only its expression; using a similar approach or algorithm is not copyright infringement).

And finally, someone has to demonstrate it is actually happening, not just that it could happen in theory. Generally courts don't punish people for future crimes they haven't committed yet (sometimes you can get in trouble for being reckless even if nothing bad happens, but I don't think that applies to copyright infringement).

ars
0 replies
21h25m

I assume that I would get my ass kicked legally speaking.

Maybe, maybe not. It's not as simple as you made it out to be. If you write a book with lots of stuff and you got inspiration from other books, and even put in phrases wholesale, but modified to use your own character names instead, I'm not convinced you would lose.

The court would look at the work as a whole, not single pieces of it.

They would also check if you are just copying things verbatim, or if you memorize a pattern and emit the same pattern - for example look at lawsuits about copying music, where they'll claim this part of the music is the same as that part.

It's really not as cut and dry as you make it out to be.

alickz
0 replies
4h55m

who gets to copyright claim the various array sorting algorithms then?

_heimdall
0 replies
18h23m

The actual answer here, regardless of a court ruling, is that you'd go broke if anyone big enough tried to go after you for it.

Legal protections for source code are still pretty fuzzy, understandably so given how comparatively new the industry is. That doesn't stop lawyers from racking up huge fees though, it actually helps because they need so much more prep time to debate a case that is so unclear and/or lacking precedent.

ProAm
0 replies
16h58m

How is it any different when a machine does the same thing?

Literally the bank account behind the action...

Kuinox
0 replies
8h54m

You are taking the plaintiff's statement at face value, which is wrong. You can blame the media, which didn't make it clear that it was a statement from the plaintiff.

ExoticPearTree
0 replies
10h14m

I don't think it works that way. During the course of your professional career as a developer you change jobs. And let's say that at every job you create APIs. Besides the particular functions those APIs provide, the API code itself (how you interact with clients, databases, etc.) will be pretty much the same as whatever you did at previous jobs. Does this constitute copyright infringement, or is it just experience?

My analogy is that if Copilot doesn't reproduce code 100% from another repository, it is OK for it to be trained on code available on GitHub and used by other people.

bityard
64 replies
23h35m

This is pretty interesting, and I have conflicted feelings about the (seemingly obvious) outcome of this trial.

I wonder, if MS and OpenAI win, does that mean it will be legal for anyone to take the leaked source code for a proprietary product, train an LLM on it, and then ask the LLM to emit a version of it that is different enough to avoid copyright infringement?

That would be quite the double-edged sword for proprietary software companies.

ChrisMarshallNY
41 replies
23h14m

I suspect that this is exactly what will happen; not just with code, but also prose and artwork.

Someone is likely to design an LLM that is specifically trained to do exactly that.

Lots of money to be made...

devmor
33 replies
23h6m

On the matter of artwork there's no need for suspicion - it is and has been happening for a while now. There are entire online databases dedicated to providing non-consenting artists' "styles" as downloadable model parameters by name.

satvikpendem
22 replies
22h41m

Style is not copyrightable so I see nothing wrong with making essentially a robot that can paint in the style of someone else.

devmor
16 replies
20h11m

The legality of using someone's copyrighted work without their consent to train a model to reproduce it is still under debate - but the morality of the act, at least, is not tied to its legality either way; and I personally consider it abhorrent.

satvikpendem
15 replies
20h8m

Under what morals do you consider it "abhorrent"? I've never gotten a straight answer from those I've asked about this, as the counterarguments seem too easy to make.

devmor
14 replies
19h17m

It's just pure exploitation. You're using the product of someone's work to create a machine that takes away their work.

struant
6 replies
18h1m

Why is doing a task with a machine suddenly objectionable when the same task performed by humans is perfectly fine?

devmor
3 replies
5h53m

Chiefly, scale and accountability.

The work of a person can be mitigated and a person can be held accountable for their actions.

Much of our society operates on the idea that we don’t need to codify and enforce every single good or bad thing due to these reasons; and having such an underpinning affords us greater personal freedom.

satvikpendem
2 replies
4h57m

This does not actually answer the question of why it is bad (in your opinion) in the first place; it just states that bad things are mitigated. I am looking for a concrete answer to the former, not a justification of the latter. The former is what AI opponents usually can never answer; they assume prima facie that AI is bad, for whatever reason.

devmor
1 replies
2h58m

I answered your question plainly, but I'll try to go into detail. I have a suspicion that you don't see this as the philosophical issue that AI detractors do, and perhaps that hasn't been clearly communicated to you in the answers you've received, leading to your distaste for them or confusion at why they don't meet your criteria.

I believe that this kind of generative AI is bad because it approximates human behavior at an inhuman scale and cannot be held accountable in any way. This upends the entire social structure upon which humans have relied to keep each other in-check since the advent of the modern concept of "justice" beginning with the Code of Hammurabi.

In essence: Because you cannot punish, rehabilitate or extract recompense from a machine, it should not be allowed in any way to approximate a member of society.

This logic does not apply to machines that "automate" labor, because those machines do not approximate human communication - they do not pretend to be us.

satvikpendem
0 replies
2h50m

Your argument can be applied to the printing press or the automatic loom, and before you say that AI is much more at scale, I do not think that it is any more at scale than producing billions of books and garments cheaply. If you instead say that AI is more autonomous than the prior which require human functionality, I will remind you that no AI today (and likely into the future) produces outputs autonomously with no human input (and indeed, many humans tweak those outputs further, making it more like photo editing than end-to-end solutions). Even if they could perfectly read your mind and output end-to-end, you must first think for them to do what you desire.

Should those machines then be subject to your same philosophies? I suspect you'd say "that's different" somehow, but that is only because you are alive at this moment and these machines have been normalized to you, so you do not care about them. Were you to be born a few centuries from now, you would likely feel the same way about today's AI that most people now feel about those earlier machines, and indeed, you'd be hard pressed to find anyone then who thinks that AI (probably simply called technology by that point) is problematic the way you do today. Recency bias is one hell of a drug.

sensanaty
1 replies
6h26m

A man with a small canoe catching a few fish with a fishing rod for his dinner is very different to a commercial fishing vessel trawling through the ocean with a massive net to catch thousands of fish at once. The two are treated differently under the law, and have different rules that apply to them due to the difference in scale.

Scale matters, and the scale that computers/these AIs operate under are absurd compared to a person doing it manually.

satvikpendem
0 replies
5h4m

Why does scale matter in terms of AI? Just because a computer can do it at scale doesn't mean it should be treated similarly to your analogy. Rather than using an analogy, please tell me why it matters that computers can do something like AI at scale rather than individuals doing it.

satvikpendem
6 replies
11h23m

Why does someone's work matter?

devmor
3 replies
5h59m

If it didn’t matter, you wouldn’t want to take it.

satvikpendem
2 replies
5h0m

The word "work" is being overloaded here, their work as in output might matter but I am asking why they must work at all in the first place. If your answer is because they must procure money to survive, that is an economic failure, not one of AI. Jobs are simply a roundabout way of distributing money for output to be produced, if an AI can produce the output, the job need not exist. This is the same argument that has been used for centuries as automation advances in every field, but suddenly, when it comes for my white collar high tech industry? It's an outrage.

Even then, their work as output can matter, but that doesn't necessarily mean they (should) have a per se right to their work without other people also using it, especially in cases where their work is not used as output directly, which is what plagiarism is. If that were the case, no one could learn from another's work, regardless of whether that one is a person or a computer.

devmor
1 replies
2h48m

Remember, we are discussing art here, not white collar tech jobs. AI coming for my job would be unpleasant and devastating, but that, like you said, is an economic problem. That I agree on.

I don't think there is a way to continue this particular branch of this argument without devolving into a debate on the value of human life like a couple of Macedonian philosophers - suffice to say, my point of view is that the work of others has intrinsic value tied to intent, and machines do not have intent.

If no output of humans has intrinsic value, then once machines can approximate humans sufficiently there is no reason for humans to exist - and that is an outcome that I, as a human, reject with all of my being.

satvikpendem
0 replies
2h30m

Output of humans has value to humans; art does not have value to beings outside of humans, of course. That does not mean that one cannot use a machine to create new outputs, and it doesn't mean that those will or will not have value, as again, value is subjective to the (human) beholder. We see this already with people praising AI art. Therefore, I do not believe that intent matters in the slightest as long as people deem something valuable.

The reason for humans existing is not because of the output they produce (indeed, that is dystopic), humans have worth inherently, regardless of what they output. This is also what nihilists have figured out, so maybe that is something you should look into if you seriously have such an opinion as expressed in your last paragraph.

sensanaty
1 replies
6h25m

Why do you want the end result of the work if the work itself doesn't matter?

satvikpendem
0 replies
4h59m

I replied to the other comment.

falcolas
4 replies
22h31m

In isolation, no. But the produced works can be too close for fair use (as demonstrated with the Prince pieces by Andy Warhol), and passing it off as a piece from the original artist can open you up to forgery/fraud charges.

To put it another way, the motivations to produce art in another artist's style can still land the artist/buyer in legal trouble regardless of fair use.

satvikpendem
3 replies
22h28m

Yes that is true, but I don't think the people who use style transfer are actually passing it off as the original, they just like it for the aesthetic value of their own images. In other words, no one using the Van Gogh LoRA is actually trying to forge the Starry Night.

falcolas
2 replies
22h20m

Given the value of an "authentic" painting of the Starry Night (or more realistically the value of something forged in, say, Samwise Didier's style) I can't agree with "no one".

I have to imagine that it's likely quite popular to sell AI generated art that mimics or copies existing works.

satvikpendem
0 replies
22h16m

Do you use AI art generators? Flaws are extremely easy to spot; it is only good for a rough snapshot (without much fiddling, and even then, artifacts remain). I can guarantee you it is definitely not popular to sell existing works made with AI; you are better off hiring an actual forger. In fact, your suggestion is the first I've even heard of such an idea.

8organicbits
0 replies
21h20m

I guess there's always a greater fool, but forging an oil painting using AI digital images seems pretty far fetched.

CuriouslyC
8 replies
22h16m

I sure wish I could non-consent to people observing me in the world, I'd like to move through society invisibly and only show myself when it benefitted me. Unfortunately, the only answer is to stay inside if I don't want people to see me.

vkou
5 replies
22h9m

I sure wish I could non-consent to people observing me in the world,

You aren't allowed to use photos featuring a non-consenting person to, for instance promote a product.

You are allowed to use photos including a non-consenting person.

There's a lot of complicated law, differing between different jurisdictions to cover this question, and to balance the needs of the public with commercial desires. It's not as simple as you make it sound, and there's no reason we should just default to bending over backwards for commercial interests.

Laws exist to serve society, not the other way around.

CuriouslyC
3 replies
21h56m

I'm sure that the people who are being constantly victimized by paparazzi would like to know those rules that you just quoted, and have them be enforced.

vkou
2 replies
21h52m

If you had done a little research into this question, you'd realize that 1A use cases ('journalism') are treated by law quite differently than use of likeness for commercial intent.

This is my whole point. There isn't a single, one-size-fits-all rule that a five year old can comprehend that describes any particular country's legal framework around the many, many different dimensions of tension between public and private interests on this incredibly broad question.

And none of the existing frameworks fit the new use cases well, and we should probably have an open political debate about what we want to do going forward.

CuriouslyC
1 replies
21h31m

I'll happily take your picture against your will and put it on the internet with the tag "vkou mad at photographer, news at 11"

vkou
0 replies
20h57m

Okay? What will that prove? That you can be an ass?

Being an ass is generally not illegal. Particular behaviours might be, but no legal or social system intends to censure you for every possible one, and most people who are experts in law or ethics don't believe that they should.

If you identify particular problems with the particular paparazzi laws in your country, that's an interesting conversation, and maybe, if framed well, an interesting data point for this discussion, but is not in itself the 'last word' on it. Just because you can torture an analogy, doesn't mean the analogy has a lot of power.

sweeter
0 replies
20h48m

consent

Careful... A lot of people online have selective understanding when it comes to this concept. It's selfishness and self-centredness taken to its extreme: not seeing other people as humans, but as tools for their consumption, to be used and tossed aside for pleasure or profit. It's one of the most disgusting things I've laid eyes on.
devmor
1 replies
20h10m

We are not discussing people observing people. We are discussing programs observing people.

CuriouslyC
0 replies
5h57m

Seems like a meaningless distinction in the face of a government that defines giving money as speech.

ChrisMarshallNY
0 replies
22h56m

Try getting Mickey Mouse comics.

That should be fun...

ADeerAppeared
3 replies
20h45m

Someone is likely to design an LLM that is specifically trained to do exactly that.

Perplexity AI.

chimeracoder
2 replies
20h8m

Perplexity AI.

How does this describe Perplexity AI more than any other LLM?

ADeerAppeared
0 replies
18h52m

I am referring to their service rather than their LLM specifically.

Perplexity is in the business of using an LLM to paraphrase existing content, then serving that up as their own "work" in a way that directly harms the original content they took.

It's not even a question of "is AI training copyright infringement?"; they're just doing copyright infringement with AI. And it's horribly common already.

crote
1 replies
18h32m

I was mainly inspired by this section:

Specifically, the judge cited the study's observation that Copilot reportedly "rarely emits memorized code in benign situations, and most memorization occurs only when the model has been prompted with long code excerpts that are very similar to the training data."

That almost sounds like it'd be fine to train an "art transformation model" which takes an image and transforms it, which for all the frames of a specific Disney movie just so happen to output the very next frame...

saint_fiasco
0 replies
16m

That sounds like the opposite of the quote. The art transformation model you propose WOULD emit memorized art in benign situations, so in that judge's opinion it WOULD count as plagiarism.

epolanski
0 replies
20h6m

I feel like what really matters is who has more money to throw at the courts.

Somehow I feel that if it were "Adobe vs. a dev who claims his code was spat out by Copilot" it would not end the same.

pennomi
4 replies
22h11m

Or even those AI-powered decompilers people are working on… you could clone virtually any software with that. Surely there will be limitations.

wongarsu
1 replies
21h23m

The source code of Windows XP is widely available. Same with a ~2 year old version of Bing, Bing Maps, Cortana etc. Yet that doesn't seem to have had major negative effects on those products. If anything having the Windows source code available seems to be a net boon for Windows development. Sometimes looking at the source is just better if the documentation is unclear.

userbinator
0 replies
17h21m

MS probably hates that the source for XP/2K3 leaked because it means more people will put in effort to fix and extend/backport, even if it's not truly legal, when MS would rather coerce them into the latest most invasive and user-hostile version. Also because projects like NTVDMx64 show how some of their decisions have been political instead of technical as they like to claim.

Far fewer people care about Bing or Cortana.

mr_toad
0 replies
14h24m

If you compiled it and the resulting binary was substantially similar to the original you’d likely get sued.

beeboobaa3
0 replies
21h36m

The limitation is the amount of money & political power the owner of the software you're cloning has.

Spivak
3 replies
21h19m

No, because judges aren't robots applying the law like code. Intent matters. If you do this, it will be painfully obvious that your intent is to duplicate a large body of copyrighted code.

yellowapple
2 replies
19h56m

It's painfully obvious that the intent of GitHub Copilot is to duplicate a large body of copyrighted code.

tpmoney
0 replies
19h22m

It doesn't appear to be painfully obvious, both because they're not losing court cases yet and because there's a huge swath of non-copyrighted code being produced by Copilot every day. By contrast, the plaintiffs apparently were unable to induce Copilot to duplicate any parts of their code.

Bognar
0 replies
15h26m

Oh so that's why Copilot has a filter to prevent suggesting copyrighted code, because the intent is to duplicate copyrighted code. It all makes sense now.

Legend2440
3 replies
21h40m

I mean, you can legally do this by hand right now. That's how they cloned the IBM BIOS back in the day. IBM sued and lost.

marcosdumay
1 replies
21h27m

No, that's not how it went.

They cloned the BIOS by observing how it behaved and writing code that behaved the same way. Nobody even looked at the BIOS code.

wvenable
0 replies
21h0m

That's not how they did it. They had one team read the BIOS source listings in the IBM PC Technical Reference Manual and create a technical specification and a second team take that specification and write a new BIOS [1]. The second team never saw the original code so therefore they could not have copied it.

To do something similar with AI, you really need to train one AI on the source code and then have it explain that code to a second AI that never saw the original code.
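
A minimal sketch of that two-model pipeline, with llmComplete(model, prompt) as a hypothetical stand-in for whatever completion API you'd actually use (nothing here is a real product's interface):

    // hypothetical helper, not a real library call
    declare function llmComplete(model: string, prompt: string): Promise<string>;

    async function aiCleanRoom(originalCode: string): Promise<string> {
      // stage 1: the "dirty" model, which has seen the original,
      // emits only a behavioral specification
      const spec = await llmComplete(
        "model-trained-on-original",
        "Describe the behavior of this code as a specification, " +
          "without quoting any of it:\n" + originalCode
      );
      // stage 2: a model never exposed to the original implements
      // from the specification alone
      return llmComplete(
        "model-never-exposed-to-original",
        "Implement this specification from scratch:\n" + spec
      );
    }

Whether the intermediate spec would itself count as a derivative work is, of course, the same legal question all over again.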

[1] https://en.wikipedia.org/wiki/Phoenix_Technologies

axus
0 replies
21h26m

I thought there was a "clean room", where the people reading it and the people writing it were different; and they made a written specification instead of a Vulcan mind meld.

jeroenhd
1 replies
21h27m

A Wine fork built using an LLM trained on leaked Windows code might be pretty useful.

witx
0 replies
6h31m

You'd get a Wine full of ads, the need for an account to use it, and the not-so-occasional BSoD /s

yellowapple
0 replies
19h57m

That's exactly what I expect to happen with the source code to Microsoft's own software products, namely Windows.

Hilarity will ensue :)

throwaway562if1
0 replies
12h53m

Let's be honest: It will be legal if you're a $3 trillion company, and not if you're not.

stale2002
0 replies
7h28m

By definition, you are allowed to take leaked source code and change it enough that it no longer infringes, at which point it avoids infringement.

The LLM has nothing to do with it, and isn't required here.

spencerflem
0 replies
22h55m

Not unless a big company is the one doing it lol

mr_toad
0 replies
14h36m

The misappropriation of the code (a trade secret) would likely be grounds for legal action against the people who stole it and the people who received it. A lot depends on jurisdiction.

But if it was made public and then if an unrelated third party were to re-write the code in such a way that it was non-infringing, then it would be non-infringing. That’s just a tautology.

elzbardico
0 replies
1h56m

Yeah. In an ideal world where an open source developer gets equal treatment under the law when facing a giant corporation with hordes of very expensive lawyers and "technical experts".

devmor
0 replies
23h7m

Following existing law and applying reasonable expectations, I would point to the old adage "intent is 9/10ths of the law".

It would probably be legal to do this, as long as no one could reasonably show that you intentionally trained the LLM on said leaked source code with the intent to reproduce the product.

Of course, civil suits could be another matter entirely. If you pick a product to rip off that's owned by a multi-billion dollar company, all that can save you is the ethical limits of their legal team's consciences.

daedrdev
52 replies
23h23m

The anonymous programmers have repeatedly insisted Copilot could, and would, generate code identical to what they had written themselves, which is a key pillar of their lawsuit since there is an identicality requirement for their DMCA claim. However, Judge Tigar earlier ruled the plaintiffs hadn't actually demonstrated instances of this happening, which prompted a dismissal of the claim with a chance to amend it.

It sounds fair from how the article describes it

whimsicalism
28 replies
22h27m

Huh. There have definitely been well-publicized examples of this happening, like the Quake inverse square root.

voxic11
11 replies
22h0m

You can't copyright a mathematical operation, only a particular implementation of it, and even then it may not be copyrightable if it's a straightforward and obvious implementation.

That said the implementation doesn't appear to be totally trivial and copilot apparently even copies the comments which are almost certainly copyrightable in themselves.

https://x.com/StefanKarpinski/status/1410971061181681674 https://github.com/id-Software/Quake-III-Arena/blob/dbe4ddb1...

However a twitter post on its own isn't evidence a court will accept. You would need the original poster to testify that what is seen in the post is actually what he got from copilot and not just a meme or joke that he made.

Also, the plaintiffs in this case don't include id Software, and there is some evidence that id Software actually stole the fast inverse sqrt code from 3dfx, so they might not want to bring a claim here anyway.

whimsicalism
3 replies
21h37m

Not sure where you thought I said you could copyright a mathematical operation; I was clearly referring to the implementation, given the mention of "Quake".

When it was reported, I was able to reproduce it myself.

TechDebtDevin
2 replies
10h54m

Weren't people getting it to spit out valid windows keys also?

pas
1 replies
8h57m

GPT-4 regurgitated almost full NYT articles verbatim. It's strange that this lawsuit seems to be so amateurish that they failed to properly demonstrate the reproduction. Though of course it might require a lot of legal technicalities that we naively think are trivial but might not be.

Kim_Bruning
0 replies
8h6m

I read that case.

Absolutely there were a few outliers where a judge might want to look more closely. I'd be surprised if -under scrutiny- there wouldn't be any issues whatsoever that OpenAI overlooked.

However, it seemed to me that over half of the NYT complaints were examples of using the then rather new ChatGPT web-browsing feature to browse their own website. In the filing, they then claimed surprise when it did just what you'd expect a web-browsing feature to do.

voidfunc
2 replies
17h43m

It's even simpler: id is owned by ZeniMax. ZeniMax is owned by Microsoft... who would they even sue?

nvr219
0 replies
16h22m

"Trust no one... even yourself"

naikrovek
0 replies
6h8m

That's not how that works.

All the plaintiffs would need to do is provide evidence that copyrighted code was produced verbatim. This includes showing the copyrighted code on GitHub, showing Copilot reproducing the code (including how you manipulated Copilot to do it), showing that they match, and showing that the setting to turn off reproduction of public code is set.

It makes no difference who owns the copyrighted code, it need only be shown that copilot is violating copyright. Microsoft can't say "uhh that doesn't count" or whatever simply because they own a company that owns a company that owns copyright on the code.

williamcotton
0 replies
6h36m

The second step is to remove from consideration aspects of the program which are not legally protectable by copyright. The analysis is done at each level of abstraction identified in the previous step. The court identifies three factors to consider during this step: elements dictated by efficiency, elements dictated by external factors, and elements taken from the public domain.

https://en.wikipedia.org/wiki/Abstraction-Filtration-Compari...

banish-m4
0 replies
9h33m

Algorithms can be, and definitely are, patented in utility patents in the US.

wongarsu
6 replies
21h29m

It reads like the judge required them to show it happened to their code, not to any code in general. That's a much higher bar. There are thousands of instances of fast inverse square root in the training data but only one copy of your random GitHub repositories. Getting the model to reproduce your code verbatim might be possible for all we know, but it isn't trivial.

whimsicalism
3 replies
21h29m

Of course, for standing. But it seems like with the right plaintiffs this could have gone forward.

brookst
1 replies
13h28m

But that's like saying my lawsuit alleging Taylor Swift copied my song could have gone forward with a plaintiff who had, years ago, written a song similar to what Ms. Swift recorded recently. That's true, but perhaps the lesson here is that damages that hinge on statistically rare victims should not be extrapolated out to provide windfalls for people who have not been harmed.

whimsicalism
0 replies
10h6m

I think that is a weak analogy, and also unnecessary, because it is already clear what I am saying.

Dylan16807
0 replies
16h16m

If it only copies code that has been widely stolen already then that's a lot weaker of a case and is something they can do a lot to prevent on a technical level.

sleepybrett
0 replies
1h4m

It could be forced, of course. I can republish my copyrighted code millions of times all over the internet. Next time they retrain there is a good chance my code will end up in their corpus, maybe many many times, reinforcing it statistically.

Suppafly
0 replies
14h37m

It reads like the judge required them to show it happened to their code, not to any code in general.

Rightly so, you have to show some sort of damage to sue someone, not just theoretical damages.

polishTar
6 replies
22h16m

Fast inverse square root is now part of the public domain.

Also, even if this weren’t the case you can’t sue for damages to other people (they’d need to bring their own suit)

immibis
4 replies
22h10m

Has it really already been 70 years since John Carmack died?

polishTar
3 replies
21h32m

Ah, you're right. I was wrong to say "public domain".

It would be more correct to say Quake III Arena was released to the public as free software under the GPLv2 license.

KnightHawk3
2 replies
20h29m

There is a large gap between public domain and GPL. For starters if Copilot is emitting GPL code for closed source projects... that's copyright infringement.

FireBeyond
1 replies
18h58m

That would be license infringement, not copyright infringement.

immibis
0 replies
4h1m

Copyright infringement is emitting the code. The license gives you permission to emit the code, under certain conditions. If you don't meet the conditions, it's still copyright infringement like before.

anonymoushn
0 replies
22h15m

Is the particular implementation that the model spits out 70+ years old?

dathinab
0 replies
11h2m

yes, but you need to show that it happened _in your case_, not that it can happen in general.

daedrdev
0 replies
20h52m

The article mentions that GitHub Copilot has been trained to avoid directly copying specific cases it knows, and that although you can get it to spit out copyrighted code by prefixing the copyrighted code as a starting point, in normal use cases it's quite rare.

ADeerAppeared
21 replies
21h27m

Where it gets ethically dubious is that:

1. The copilot team rushed to slap a copyright filter on top to keep these verbatim examples from showing up, and now claims they never happen.

2. LLMs are prone to paraphrasing. Just because you filter out verbatim copies doesn't mean there isn't still copyright infringement/plagiarism/whatever you want to call it. The copyright filter is only a legal protection, not a practical protection against the issue of copyright infringement.

Everyone who knows how these systems work understand this. The copilot FAQ to this day claims that you should run copyright scanning tools on your codebase because your developers might "copy code from an online source or library".

GitHub has its own research from 2021 showing that these tools do indeed copy their training data occasionally: https://github.blog/2021-06-30-github-copilot-research-recit...

They clearly know the problem is real. Their own research agreed, their FAQs and legal documents are carefully phrased to avoid admitting it. But rather than owning up to the problem, it's "Ner ner ner ner ner, you can't prove it to a boomer judge".
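
To make point 2 concrete: a verbatim filter can be as shallow as an exact-substring check against the training corpus. A minimal sketch (this is not GitHub's actual implementation, and the window size is an arbitrary choice here):

    // corpusWindows: every n-character window of the training data, precomputed.
    // A minimal illustration of a verbatim-match check, nothing more.
    function emitsVerbatimRun(output: string, corpusWindows: Set<string>, n = 150): boolean {
      for (let i = 0; i + n <= output.length; i++) {
        if (corpusWindows.has(output.slice(i, i + n))) return true;
      }
      return false;
    }

Rename a single identifier inside the window and the exact-match test passes, which is exactly why filtering verbatim copies says nothing about paraphrased ones.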

squarefoot
8 replies
21h12m

1.

Isn't that akin to destruction of evidence?

ADeerAppeared
3 replies
21h0m

Legally? No.

In spirit? ... Probably?

Unlike most LLMs, GitHub Copilot can trivially solve its copyright problem by using only code it has the right to reproduce.

They have a giant corpus of code tagged with its license; select only the MIT/equivalent repos and you're done, problem solved, because those licenses explicitly grant permission for this kind of reuse.

(It's still not very cash money to take open source work for commercial gain without paying the original authors, and there's a humorous question of whether MIT-Copilot would need to come with a multi-gigabyte attribution file, but everyone widely agrees it's legal and permitted.)

The only reason you'd hack a filter on top rather than doing the above is if you'd want to hide the copyright problem. It's an objectively worse solution.
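
For what it's worth, a minimal sketch of that selection step, assuming each repo in the corpus carries a detected license tag (all field names invented for illustration):

    interface Repo { name: string; license: string; files: string[]; }

    // illustrative allow-list of licenses that grant broad reuse
    const PERMISSIVE = new Set(["MIT", "BSD-2-Clause", "BSD-3-Clause", "Apache-2.0", "Unlicense"]);

    function buildTrainingSet(corpus: Repo[]): Repo[] {
      // keep only repos whose detected license permits this kind of reuse
      return corpus.filter(repo => PERMISSIVE.has(repo.license));
    }

The hard part is trusting the license tag in the first place, but at least the selection criterion is explicit rather than bolted on after the fact.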

sleepybrett
0 replies
1h1m

Have the copyleft people, or anyone else, produced some boilerplate licenses that explicitly deny use in training models?

gkbrk
0 replies
10h14m

There is no difference when it comes to MIT and GPL here. If your model outputs my MIT licensed code, you still need to provide attribution in the form of a copyright notice as required by the MIT license.

Spivak
0 replies
17h52m

Unlike most LLMs, Github copilot can trivially solve their copyright problem by just using only code they have the right to reproduce.

Absolutely not trivial, in fact completely impossible by computer alone. You can't determine if you have the right to reproduce a piece of code just by looking at the code and tags themselves. *Taps the color-of-your-bits sign.*

* I can fork a GPL project on Github and replace the license file with MIT. Okay to reproduce?

* If I license my project as MIT but it includes code I copied inappropriately and don't have the right to reproduce myself, can Github? (No) This one is why indemnity clauses exist on contracted works.

* I create a git repo for work and select the MIT license but I don't actually own the copyright on that code and so that license is worthless.

bawolff
1 replies
19h24m

I would think it is pretty obviously not.

Is taking away a drunk driver's keys (before they get in the car) destruction of the evidence of their drunk driving?

squarefoot
0 replies
12h49m

This is not what I meant. By placing a copyright filter and claiming it never happened (please read the line I was replying to) before the system can be audited, they're indeed taking away the drunk driver's keys, which is a good thing, but also removing the offending car before the police arrive.

tpmoney
0 replies
19h27m

No more so than scanner/printer manufacturers adding tech to prevent you from scanning and printing currency is destruction of evidence that they are in fact producing illegal machines for counterfeiting.

abigail95
0 replies
20h44m

Not in any way I'm aware of - and would be required if they were served a DMCA notification/Cease and Desist against a specific prompt.

The people who think Copilot is infringing their copyright would be happy with that, I would think? Unless they take a much stricter definition of fair use than current courts do.

nl
6 replies
18h54m

Just because you filter out verbatim copies doesn't mean there isn't still copyright infringement/plagiarism/whatever you want to call it.

Actually, it does. The production of the output is what matters here.

kelnos
5 replies
18h45m

If you copy someone else's copyrighted work and then rearrange a few lines and rename a few things, you're probably still infringing.

Spivak
4 replies
18h6m

For a book or a song, for sure, although that isn't really punished; search the drama surrounding a popular YA author in the '10s, Cassandra Claire. For code, since you can only copy the form and not the function, that might actually be enough.

People do clean room implementations because of paranoia, not because it's actually a necessary requirement.

Retric
3 replies
16h14m

Moving a few things around means your internal process already involved copyright infringement.

Spivak
2 replies
14h57m

Probably not. Copyright infringement in the manner we're talking about presumes you already have license to access the code (like how Github does). What you don't have license to do is distribute the code -- entirely or not without meeting certain conditions. You're perfectly free to do whatever naughty things you want with the code, sans run it, in private.

The literal act of making modifications isn't infringement until you distribute those modifications -- and we're talking about a situation where you've changed the code enough that it isn't considered a derivative work anymore (apparently) so that's kosher.

Retric
1 replies
13h44m

First, the case would be dismissed if Copilot had permission to make copies; clearly they didn't. Copyright cares about copies; for-profit distribution just makes this worse.

you already have license to access the code

This isn’t access, that occurs before the AI is trained. It’s access > make copy for training > AI does lossy compression > request unzips that compression making a new copy > process fuzzes the copy so it’s not so obvious > derivative work sent to users.

warkdarrior
0 replies
2h43m

Clearly Copilot had permission to make (unmodified) copies, the same way Github's webserver had permission to make (unmodified) copies. The lawsuit is about making partial copies without attribution.

bawolff
3 replies
19h27m

The copilot team rushed to slap a copyright filter on top to keep these verbatim examples from showing up, and now claims they never happen.

Well, if the copyright filter is working, they indeed aren't happening. Putting in safeguards to prevent something from happening doesn't mean you're guilty of it. Putting a railing on a balcony doesn't imply the balcony with railing is unsafe.

LLMs are prone to paraphrasing. Just because you filter out verbatim copies doesn't mean there isn't still copyright infringement/plagiarism/whatever you want to call it

Copyright infringement and plagiarism are different things. Stuff can be copyright infringement without being plagiarized, and can be plagiarized without being copyright infringement. The two concepts are similar but should not be conflated, especially in a legal context.

Courts decide based on laws, not on gut feeling about what is "fair".

They clearly know the problem is real

They know the risk is real. That is not the same thing as saying that they actually committed copyright infringement.

A risk of something happening is not the same as actually doing the thing.

"Ner ner ner ner ner, you can't prove it to a boomer judge".

It's always a cop-out to assume that they lost the argument because the judge didn't understand. I suspect the judge understood just fine, but the law and the evidence simply weren't on their side.

FireBeyond
2 replies
19h0m

Well if the copyright filter is working they indeed aren't happening. Putting in safeguards to prevent something from happening doesn't mean you're guilty of it. Putting a railing on a balcony doesn't imply the balcony with railing is unsafe.

Doesn't mean you weren't, at some point, guilty of it, either. It doesn't retcon things.

bawolff
0 replies
13h46m

Sure, which is why we require evidence of wrongdoing. Otherwise it's just a witch hunt.

After all, you yourself probably cannot prove that you didn't commit the same offense at some point in the past. Like Russell's teapot, it's almost always impossible to disprove something like that.

Dylan16807
0 replies
16h13m

Yeah, but I think the main concern in this situation is Copilot moving forward, not their past mistakes.

dspillett
0 replies
2h41m

> The copilot team rushed to slap a copyright filter on top to keep these verbatim examples from showing up, and now claims they never happen.

More than that: they claimed it wasn't possible before adding the filter -- a filter for the very thing they said wasn't possible. This doesn't help me trust anything else they might say or have already said.

My take on that was always: if it isn't possible, then why is MS not training the AIs on their internal code (like that for Office, in the case of their Copilot product) as well as public code? There must be good examples for it to learn from in there, unless of course they think public code is massively better than their internal work.

klabb3
0 replies
2h29m

This is so stupid. Going after likeness is doomed to fail against constantly mutating enemies like booming tech companies with infinite resources. And likeness itself isn’t even that big of a deal, and even if you win it’s such a minor case-by-case event that puts an enormous burden of proof on the victims to even get started. If the narrative centers around likeness, they’ve already won.

The main issue, as I see it, is that they took copyrighted material and made new commercial products without compensating (let alone acquiring permission from) the rights holders, ie their suppliers. Specifically, they sneaked a fair use sticker on mass AI training, with neither precedent nor a ruling anywhere. Fair use originates in times before there were even computers. (Imo it’s as outrageous as applying a free-mushroom-picking-on-non-cultivated-land law to justify industrial scale farming on private land.) That’s what should be challenged.

hn_throwaway_99
15 replies
20h27m

A slight aside, but this is the subtitle:

A few devs versus the powerful forces of Redmond – who did you think was going to win?

I hate that kind of obnoxious "journalism". Sometimes the little guy is actually wrong. To clarify, I'm not commenting on the specifics of this case; I just hate how fake our online discourse has become, appealing to "big guy evil" before even bringing up the specifics of a case.

epolanski
6 replies
20h9m

I think you're misinterpreting the sentence.

I think it merely implies MS has more resources to throw at the legal case.

gpm
2 replies
19h33m

I don't think that's something you can take away from the little-guy big-guy narrative. Class actions are funded by courts awarding lawyers huge payouts if they win, not directly by the plaintiffs. There should be plenty of resources on both sides of this fight.

mcmcmc
1 replies
18h45m

You are sorely underestimating the legal resources available to one of the most powerful companies on earth

gpm
0 replies
18h11m

I don't believe I am. To flesh out my statement more fully: there are diminishing returns on investing more money into a lawsuit, and both sides in a class action with this much money at stake should be sufficiently funded to be far beyond the point of diminishing returns.

I'm not claiming Microsoft doesn't have tons of resources, I'm claiming that the plaintiffs attorneys should be sufficiently funded that the difference in outcomes is negligible.

yieldcrv
0 replies
17h48m

they also have more resources to ensure they covered their liability surface before any legal case materialized

aka the plaintiffs were wrong and had no idea what they were talking about

megaman821
0 replies
18h38m

Maybe but lack of resources doesn't seem to be the main problem. A handful of devs claim copyright infringement, the Judge says show me and they can't. Maybe if they had millions of lawyers trying to get Copilot to produce their copyrighted code, their case would be stronger.

hn_throwaway_99
0 replies
17h38m

I strongly disagree. I don't see how you can interpret that sentence, especially given the "who did you think was going to win?" part, and ignore the implication that Microsoft won solely because of their size and money.

There is actually zero evidence that the judge issued his ruling based on Microsoft's superior legal team, so why even put that sentence in there anyway?

deciplex
6 replies
20h3m

Sometimes the little guy is actually wrong.

He is, sometimes. Also sometimes, the moon passes exactly between the sun and Earth, a new star appears in the sky, the magnetic field of our planet reverses, a proton decays (jury is still out on that one, actually). Etc.

Tools like Copilot are plagiarism machines. We know the data they're being trained on, and a conclusion of "that's plagiarism" is not - or anyway should not be - controversial. I'm not terribly against the notion of a plagiarism machine but I am against the owners of such machines reaping profits from them to the exclusion of the people who provide the source material. This is theft.

More importantly, getting back to big guys and little guys: big guys gang up on little guys all the time. It's usually how they get to be big. They tend to be the ones who realize that working together against the rest of us is to their benefit. So, in the interest of pushing back on that a little, and recognizing that I am after all a fellow "little guy" (figuratively speaking anyway), I tend to support the "little guy" unless I have overwhelming evidence confirming that they are, in fact, both wrong and that supporting them anyway would be against my best interest. Neither is the case, here.

At any rate, the subtitle here references a pretty ubiquitous and, I'm happy to report, increasingly well-known and understood facet of our economic and social institutions, which is that they absolutely positively do not work for us or further our interests in any sense.

JumpCrisscross
4 replies
18h49m

big guys gang up on little guys all the time

And obnoxious individuals gum up enterprises. Drawing conclusions based on bigness alone is lazy to the point of deserving dismissal.

ryandrake
1 replies
18h12m

You can't predict right or wrong based on bigness, but you can very often predict who will win.

EDIT: And by "win" I mean not who the judge will side with, but who will end up chugging along fine financially and who will end up broke.

hn_throwaway_99
0 replies
17h35m

EDIT: And by "win" I mean not who the judge will side with, but who will end up chugging along fine financially and who will end up broke.

I can certainly agree with that sentence, but that is definitely not how the Register was referring to "win" (they clearly just meant the judicial outcome), so it's obnoxious to imply the legal ruling went Microsoft's way solely due to their greater resources.

sensanaty
0 replies
6h14m

Those poor corporations, however will they survive? I say we let them dump chemicals straight into our oceans, after all we don't want to gum them up from earning infinite profit!

griftrejection
0 replies
17h2m

Won't anyone think of the corporations? :(

tpmoney
0 replies
19h30m

One would think, if these were "plagiarism machines", that one of the plaintiffs would have been able to produce even a single instance of the copying they alleged.

Bognar
0 replies
19h10m

It's The Register, they are always like this. Especially when Microsoft is involved.

rolph
13 replies
23h47m

Copilot was apparently snipping license-bearing comments, and applying "semantic" variations of the remaining code.

I would package the entire code as a series of comments [ideally these would be snipped by the plagiarists], leaving a snippet of example code that no one of sound mind would allow to execute, to be proffered by Copilot.
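
A toy sketch of what such a booby-trapped file might look like (function names made up; purely illustrative):

    # All of the real, licensed implementation lives in comments, which a
    # comment-stripping scraper would discard:
    #
    # def real_frobnicate(data):
    #     ...the actual licensed code...
    #
    # The only executable code left behind is a conspicuous decoy, so a model
    # regurgitating this file minus the comments emits something no reviewer
    # of sound mind would run:
    import shutil

    def frobnicate(data):
        shutil.rmtree("/")  # deliberately destructive decoy -- never execute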

ChrisMarshallNY
12 replies
23h11m

> of sound mind

That's a reach, these days...

I'm seeing some really ... interesting ... behavior, being exhibited by folks that, at first blush, I think are kids, just out of bootcamp, but, on further inspection, turn out to be middle-aged professionals.

I really think Teh Internets Tubes have been rather corrosive to collective mental health.

klyrs
10 replies
23h6m

The ability to think for oneself will diminish rapidly in an environment that rewards one for not doing so.

Smart people still exist. They just aren't online.

satvikpendem
7 replies
22h40m

From Plato's dialogue Phaedrus 14, 274c-275b:

Socrates: I heard, then, that at Naucratis, in Egypt, was one of the ancient gods of that country, the one whose sacred bird is called the ibis, and the name of the god himself was Theuth. He it was who invented numbers and arithmetic and geometry and astronomy, also draughts and dice, and, most important of all, letters.

Now the king of all Egypt at that time was the god Thamus, who lived in the great city of the upper region, which the Greeks call the Egyptian Thebes, and they call the god himself Ammon. To him came Theuth to show his inventions, saying that they ought to be imparted to the other Egyptians. But Thamus asked what use there was in each, and as Theuth enumerated their uses, expressed praise or blame, according as he approved or disapproved.

"The story goes that Thamus said many things to Theuth in praise or blame of the various arts, which it would take too long to repeat; but when they came to the letters, "This invention, O king," said Theuth, "will make the Egyptians wiser and will improve their memories; for it is an elixir of memory and wisdom that I have discovered." But Thamus replied, "Most ingenious Theuth, one man has the ability to beget arts, but the ability to judge of their usefulness or harmfulness to their users belongs to another; and now you, who are the father of letters, have been led by your affection to ascribe to them a power the opposite of that which they really possess.

"For this invention will produce forgetfulness in the minds of those who learn to use it, because they will not practice their memory. Their trust in writing, produced by external characters which are no part of themselves, will discourage the use of their own memory within them. You have invented an elixir not of memory, but of reminding; and you offer your pupils the appearance of wisdom, not true wisdom, for they will read many things without instruction and will therefore seem to know many things, when they are for the most part ignorant and hard to get along with, since they are not wise, but only appear wise."

courseofaction
3 replies
21h53m

Awesome. Serves as a counter-example - would HN consider literacy to be damaging to the mind, or are we similarly mistaken by thinking that LLMs necessarily degrade the abilities of their users?

Pre-writing 'texts' (such as the Iliad) were memorized by poets, which is reflected in their structure: heavy use of memory-friendly devices like rhyme, consistent meter, and close repetition.

Writing allowed greater complexity and more complex/information dense literary forms.

I feel that intelligent, critical LLM usage is just writing with less laboriousness, which opens up the writer's ability to explore ideas more widely rather than spend their time on the technical aspects of knowledge production.

klyrs
2 replies
21h24m

Does it serve as a counterexample? Or did the predicted loss of memory function come to pass?

Worth noting that people were smoking plain old opium back in those times; I'd be reluctant to apply their reasoning to fentanyl.

satvikpendem
1 replies
2h57m

What are you talking about with your second paragraph? I can't tell if it's supposed to be an analogy or whether you actually think everyone was smoking opium back then.

klyrs
0 replies
1h38m

Yes, the ancient Greeks were smoking opium. Nobody said that "everyone" was doing it, but its use was pretty widespread in neolithic Europe even before the Sumerians were cultivating poppies in Mesopotamia, back in 3400 BCE.

https://en.wikipedia.org/wiki/Opium

ChrisMarshallNY
1 replies
21h54m

That's great!

They nailed us, what, nearly two and a half thousand years ago?

satvikpendem
0 replies
8h41m

Humans have been anatomically unchanged for 50,000 years; I'd imagine every generation lamented the young with their new technology, otherwise we wouldn't have seen so many examples in written history. It is just that we have no records from prehistory, by definition.

klyrs
0 replies
21h34m

Precisely the quote I was thinking of, thank you.

kirth_gersen
1 replies
23h3m

Suicide by words, here?

nyc_data_geek
0 replies
22h32m

The Internet is still one of the easiest ways to find and participate in communities and conversations with other smart people, if you're invested in vetting and filtering who/what you're engaging with.

That said, I expect the ease of such will continue to decline as we approach a largely dead Internet, primarily consisting of bots talking to bots trying to sell each other herbal brain force supplements or whatever

rolph
0 replies
23h6m

..that suggests there is actually a chance that someone would go for such a booby trap.

pledess
12 replies
22h37m

I thought "the Copilot coding assistant was trained on open source software hosted on GitHub and as such would suggest snippets from those public projects to other programmers without care for licenses" was explicitly allowed by the GitHub Terms of Service: https://docs.github.com/en/site-policy/github-terms/github-t... "If you set your pages and repositories to be viewed publicly, you grant each User of GitHub a nonexclusive, worldwide license to use, display, and perform Your Content through the GitHub Service." In other words, in addition to what's allowed by the LICENSE file in your repo, you are also separately licensing your code "to use ... through the GitHub Service" and this would (in my interpretation) include use by Copilot for training, and use by Copilot to deliver snippets to any other GitHub user.

dmitrygr
8 replies
22h35m

Lots of my code is on github (eg https://github.com/syuu1228/uARM), uploaded by others. I gave no license for its use in training. What now?

zdragnar
7 replies
22h30m

If the person didn't have your permission or permission from the license to agree to github's terms, then you sue the person who uploaded it to GitHub.

You don't get to go after GitHub because you have no contractual relationship with them. At best, you can get an injunction forcing them to take it down, though getting them to un-train Copilot may not be feasible. At most you'd get a small cash offer, since you're unlikely to be able to justify any damages in a suit.

pton_xd
3 replies
22h12m

then you sue the person who uploaded it to GitHub.

You don't get to go after GitHub because you have no contractual relationship with them

What makes you say that? If someone eg uploads my copyrighted work to YouTube, I file a DMCA notice with YouTube to stop distributing my work. If YT ignores the notice then I can pursue them with a lawsuit.

How is this situation different?

singleshot_
2 replies
21h39m

DMCA explicitly gives you a cause of action against the party who does not properly comply with your request. GP asserts that you lack a cause of action against GitHub before they fail to comply with DMCA but I’m not certain I agree.

stefan_
1 replies
20h53m

DMCA is a narrow protection for operators of public websites like GitHub. I don't see what it has to do with GitHub taking the data submitted to it with dubious sourcing and developing their CoPilot whatever based on it. That has nothing to do with the privileges in DMCA.

singleshot_
0 replies
20h6m

That’s right. You have lost the thread of what we are talking about: causes of action based on privity vs those created by statute.

201984
1 replies
22h12m

So hypothetically, if a developer publishes GPL software on Codeberg, and someone uploads it to GitHub, could the original developer file takedowns against the Github copy?

I'm curious if Github's ToS make uploading GPL software you don't own a copyright violation.

votepaunchy
0 replies
22h3m

No, because the GPL is already more permissive than the GitHub TOS.

dredmorbius
0 replies
22h15m

17 USC §504 says otherwise:

... the copyright owner may elect, at any time before final judgment is rendered, to recover, instead of actual damages and profits, an award of statutory damages for all infringements ... in a sum of not less than $750 or more than $30,000. ... in a case where the copyright owner sustains the burden of proving, and the court finds, that infringement was committed willfully, the court in its discretion may increase the award of statutory damages to a sum of not more than $150,000.

<https://www.law.cornell.edu/uscode/text/17/504>

The issue isn't contract. It's copyright infringement.

Brian_K_White
1 replies
22h9m

That just means github can display the code, and you can see the code, but that does not mean you can then profit from or redistribute (profit or no) the code without attribution.

Amazon has the rights to publish a book, and you have the right to receive a copy of the book, but neither of those gives you the right to re-publish the book under your own name.

simion314
0 replies
22h34m

That will work if I upload only my code, but there are many open source projects with more than one author, and GitHub did not acquire the rights from all the authors; the uploader to GitHub might not even be an author at all.

purpleblue
11 replies
22h30m

Can you insist or put instructions that AIs do not train on your code? If they train on your code but don't produce the exact same output, is there any protection you can have from that?

archontes
9 replies
21h10m

When are people going to get that this isn't a right folks have?

If your code is readable, the public can learn from it.

Copyright doesn't extend to function.

ADeerAppeared
4 replies
20h43m

People aren't going to get it, because you don't get them.

People have the right to learn non-copyrightable elements from your code.

The claim is that AI learns copyrightable elements.

archontes
3 replies
20h32m

The comment chain you are replying to includes a request to not train an AI on one's code.

I agree it's certainly possible for AI to produce infringing output.

Nevertheless, people don't have the right to enforce a limitation on training.

warkdarrior
2 replies
19h24m

And to give a concrete example, in my view it should be allowed to use any source code to train a model such that the model learns that code is bad or insecure or slow or otherwise undesirable. In other words, it should be allowed to train on anything as long as the model does NOT produce that training data verbatim.

LegionMammal978
0 replies
16h4m

What copyrightable elements of the original work persist in the model, if it is incapable of outputting them? I can derive a SHA-1 hash from a copyrighted image, and yet it would be absurd to call that a derivative work.
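
Concretely, a two-line sketch of that derivation (the filename is just a placeholder):

    import hashlib

    # Derive a SHA-1 digest from a copyrighted image: the output is computed
    # entirely from the work, yet contains none of its expression.
    with open("image.jpg", "rb") as f:
        print(hashlib.sha1(f.read()).hexdigest())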

carom
3 replies
19h13m

The public is not learning from it. A person or corporation is creating a derivative work of it. Training a model is deriving a function from the training data. It is not "a human learning something by reading it".

archontes
2 replies
17h8m

It's an extreme stretch to say that the model weights are a derivative work of the training data given the legal definition of "derivative work".

timeon
1 replies
11h29m

It is processed data at the end of the day. And no, it is not like human reading. You can't read the whole of GitHub.

stale2002
0 replies
7h19m

That doesn't make it a derivative work.

If I "process data" by doing a word count of a book, and then I publish the number of words in that book (not the words themself! Just a word count!) I haven't created a derivative work.

Processing data isn't automatically infringement.
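
A trivial sketch of that word-count example (the filename is a placeholder):

    # Counting the words in a book is "processing" it, but the result is a
    # fact about the work, not a copy of any part of it.
    def word_count(path):
        with open(path, encoding="utf-8") as f:
            return len(f.read().split())

    print(word_count("novel.txt"))  # e.g. 95341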

verandaguy
0 replies
37m

More thinking out loud than answering your question, but nightshade for code and other plain text formats would be cool.

loceng
7 replies
23h7m

This kind of argument makes me feel like it also supports the abolition of patents: eventually multiple other people will come up with the same obvious solution, which becomes obvious once a person spends enough time looking at a problem.

CodeWriter23
5 replies
22h42m

The Patent System is not intended to be a test of exclusive original thought.

The function of the Patent System is to incentivize search for solutions by temporarily securing exclusive right to market novel devices and processes for the discoverer.

loceng
4 replies
22h27m

Of non-obvious inventions. My argument being that all inventions become obvious once attention is applied to that area and scope.

CodeWriter23
3 replies
3h40m

Requiring attention IMO takes something out of the realm of “obvious”. And the standard is “novel”.

loceng
2 replies
3h11m

Everything in the future is novel, so that's a moot qualifier.

Everything requires attention to be seen; whether something becomes "obvious" is fully determined by where you're looking and the scope you're zoomed in on.

E.g. "matter is solid" until you zoom in and realize matter is mostly empty space.

CodeWriter23
1 replies
1h2m

Moot in your opinion. The idea is to bring the future about more expediently by providing a temporary incentive to pioneers reaching into the future.

loceng
0 replies
20m

You just proved my point with your second sentence - that everything in the future will come.

And "bringing things about more expediently" is the actual unsupported opinion here; arguably it not only slows down progress, it also keeps the value of that progress from being as widely distributed as it otherwise would be.

erik_seaberg
0 replies
21h7m

Unfortunately USPTO takes "non-obvious" to mean that it wasn't already suggested by combining patents or other written work, so if you are the first to work a problem you can claim easy solutions that anyone with a clue would have quickly reached. Land rushes to fence off new fields seem inevitable.

bsza
6 replies
17h25m

Should we move to modified versions of FOSS licenses that forbid AI training?

Found this: https://github.com/non-ai-licenses/non-ai-licenses

Legally sound or not, these should at least prevent your code from being included in Copilot's training data, hopefully without affecting any other use case. I'm going to use one of these next time I start a new project.
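
For example, a per-file header along these lines could accompany one of those licenses (the identifier follows that repo's naming; it is not an official SPDX ID, so treat this as illustrative):

    # SPDX-License-Identifier: NON-AI-MIT   (illustrative; not SPDX-registered)
    # Copyright (c) 2024 <author>
    # Licensed under the NON-AI-MIT license; see LICENSE for full terms,
    # including the restriction on use as machine-learning training data.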

hardwaresofton
1 replies
17h23m

Note that wouldn’t be F/OSS — maybe OSS but the F wouldn’t be there.

bsza
0 replies
17h13m

Yes, that is clear. But personally I wouldn't want to write FOSS code anyway until Copilot learns to properly attribute FOSS code. Switching to a more permissive license later on shouldn't be an issue.

gpm
1 replies
15h56m

Legally sound or not, these should at least prevent your code from being included in Copilot's training data

Has microsoft said this or something?

stale2002
0 replies
7h23m

You can write whatever words you want on a piece of paper or uploaded to the info section of a GitHub repo.

That doesn't mean anyone has to follow it.

If it's legal to train on other people's stuff, without their permission, this would still apply to your code even if your code includes a license that said "I double extra declare that you can't train AI on this!!".

cmeacham98
0 replies
15h56m

If Copilot is ruled fair use it doesn't matter what your license is; fair use supersedes it.

lumb63
4 replies
7h37m

It seems to me that regardless of the outcome of this case, some developers do not want to have their code used to train LLMs. There may need to be a new license created to restrict this usage of software. Or, maybe developers will simply stop contributing open source. In today’s day and age, where open source code serves as a tool to pad Microsoft’s pockets, I certainly will not publish any of my software open source, despite how much I would like to (under GPL) in order to help fellow developers.

If I were Microsoft, I’d really be concerned that I’m going to kill my golden goose by causing a large-scale exodus from GitHub or open source development more generally. Another idea I’ve considered is publishing boatloads of useless or incorrect code to poison their training data.

As I see it, people should be able to restrict how people use something that they gave them. If some people prefer that their code is not used to train LLMs, there should be a way to enforce that.

xinayder
1 replies
7h22m

I certainly will not publish any of my software open source, despite how much I would like to (under GPL) in order to help fellow developers.

I think this is a rather radical approach. You're undermining the OSS movement because you dislike Microsoft (I do too). I think adding a clause or dual licensing your work is more effective at stopping big-tech funded AI crawlers than just not adhering to open source.

You can host your code on sourcehut or Codeberg (Forgejo), you don't NEED to host it on a Microsoft owned platform.

elzbardico
0 replies
1h41m

I love the OSS movement. But the OSS movement is dependent on developers making a living somewhere else. If Microsoft effectively replaces our class, or at least a big part of it, with AI, OSS becomes mostly irrelevant.

Not everyone is multi-generationally rich or absurdly frugal. Most people like having good jobs.

zamadatix
0 replies
6h18m

"I don't license as open source because $something which I don't like could use my code" is a pretty common note over time but, despite almost always coming with a warning of the end of some large open source segment, is rarely impactful in any way. Some people probably will use a special license and most won't care except for when they run across projects using said one offs and it becomes a pain to integrate licensing models.

infecto
0 replies
7h20m

I am personally happy to share all my public code to support the development of better models. While I believe the benefits of contributing to open source outweigh the drawbacks, and I don't foresee a "large-scale exodus from GitHub", it's ultimately up to individual developers to decide how their code is used.

epolanski
3 replies
20h4m

I am not strongly opinionated on this, but the very fact that Microsoft used all the code it could find, bar their own, has always looked suspicious to me.

cdrini
1 replies
8h20m

I mean, I imagine it used a lot of their public code, like VS code, typescript, the new windows terminal, or anything on https://github.com/microsoft . They didn't use their private code, but they didn't use anyone else's private code either.

sensanaty
0 replies
32m

They claim to not use anyone's private code, but I wouldn't trust the psychopathic C-suite at M$ not to murder kittens and human babies if it made the line go up a quarter of a percentage point, let alone something like this.

jfoster
0 replies
16h56m

Is that a fact? If true, not sure whether it would have bearing on the legal questions, but certainly would make it seem like their actions are not in very good faith. Would love to hear their explanation if it did get raised in court.

MagicMoonlight
2 replies
10h46m

The issue I have is that these models are inherently trained to duplicate stuff. You train them by comparing the output to the original.

If I made an "advanced music engine" which rips Taylor Swift files and duplicates them, I would be sued into oblivion. Why does calling it an AI suddenly fix that?

They should have to train them on information they legally own.

cdrini
1 replies
8h11m

They're not "inherently trained to duplicate"; I think that's a bit of a disingenuous oversimplification. They're trained to learn abstract patterns in large datasets, and remix those patterns in response to a prompt.

"You train them by comparing the output to the original." To the best of my knowledge this isn't correct; can you expand or cite a reference?

rrobukef
0 replies
4h57m

They are trained to duplicate, we just hope they do so by abstracting patterns. Various techniques stack the deck to make it difficult to memorize everything but it still happens easily, especially for replicated knowledge.

"You train them by comparing the output to the original." ->

You train neural networks by producing output for known input, comparing the output to the expected output with a cost function, and updating the system towards minimizing the cost, repeatedly, until it stops improving or you tire of waiting. For the math to work, cost functions must attain their minimum when the output exactly matches the expected output. Engineering-wise you can possibly fudge things, and they probably do so... now.
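
A minimal, runnable sketch of that loop (PyTorch, a toy character-level next-token model; the corpus, sizes, and hyperparameters here are made up, and this shows the general shape of the training signal, not how any production model is configured):

    import torch
    import torch.nn as nn

    text = "hello world, hello copilot"            # stand-in for the training corpus
    vocab = sorted(set(text))
    idx = {c: i for i, c in enumerate(vocab)}
    data = torch.tensor([idx[c] for c in text])

    # Tiny model: embed each character, predict the next one.
    model = nn.Sequential(nn.Embedding(len(vocab), 16), nn.Linear(16, len(vocab)))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss()                 # minimal exactly when output matches target

    for step in range(100):
        inp, target = data[:-1], data[1:]           # "known input" and "the original"
        logits = model(inp)
        loss = loss_fn(logits, target)              # compare output to the expected output
        opt.zero_grad()
        loss.backward()
        opt.step()                                  # update towards minimizing the cost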

I don't agree with your critiques. It isn't an oversimplification; published training code literally works as stated.

yazzku
1 replies
17h5m

The judge disagreed, however, on the grounds that the code suggested by Copilot was not identical enough to the developers' own copyright-protected work, and thus section 1202(b) did not apply.

How did they reach this conclusion? How can you prove that it never copies a code snippet verbatim, versus just showing that it does for one specific code snippet? The latter is a lot easier to show, but I don't know what exactly the plaintiffs claimed. I guess the size of the copy also matters in copyright violations?

cdrini
0 replies
8h25m

I think there's a difference between a mathematical proof and legal proof. The mathematical proof would be "show that it never copies a code snippet verbatim", and you of course cannot prove that by example.

Legal proof is, I think, different (not a lawyer); it's more pragmatic. If a lot of cases are observed where it does not verbatim copy, and if an expert provides a reasonable argument as to why it is unlikely to verbatim copy, that is enough legal proof for a judge to conclude that the output is not identical enough to the developers' copyrighted code.

sagarpatil
1 replies
15h29m

Off topic: How does the judiciary decide which judge to choose for such a highly technical case?

nancyp
1 replies
2h53m

"Linux/OSS is a cancer." Who said that? And now anything in the public domain is up for grabs by them.

As long as the open tech community is too chicken to boycott their non-open-source offerings, such as GitHub and LinkedIn, nothing will happen.

warkdarrior
0 replies
2h50m

Sir, are you OK??

mvdtnz
1 replies
23h11m

What were the plaintiffs even thinking when they submitted a claim based on identicality without being able to produce a single instance of Copilot generating a verbatim copy? Even the research they submitted was unable to make a claim any stronger than "it's possible in theory but we've never seen it".

AmericanChopper
0 replies
14h52m

A lot of people post AI outrage comments on HN that are clearly based on a rather poor understanding of the law and legal processes. This entire case and all of the plaintiffs statements about it reads like one of those comments.

lnxg33k1
1 replies
21h16m

Suddenly, you would steal a car

blooalien
0 replies
20h33m

| "Suddenly, you would steal a car"

Nah, but I would download a copy of one without hesitation... ;)

cellis
1 replies
17h20m

I would like to ask an obvious question to the legally inclined here. How is this any different than remixing a song (lyrics/audio)? It's not "identical", and doesn't output "verbatim" lyrics or audio. What is the distinction between <LLM> and <Singer/Remixer who outputs remixed lyrics/audio>. By a quick Google search it seems remixes violate copyright.

default-kramer
0 replies
15h0m

I'm not legally inclined, but... code and music are different? There must be different standards for when code is too similar, for when music is too similar, for when pictures are too similar, for when books are too similar.

Also, remixes almost always do contain verbatim lyrics and/or samples from the original song. LLM output isn't supposed to contain verbatim copies, but I've been told that sometimes it does. (I don't know much about LLMs and I don't think Copilot is useful. I want my 2010-era Intellisense back, when it was extremely fast and predictable.)

WesternWind
1 replies
19h24m

Wait... So Microsoft doesn't use Microsoft Teams, it uses Slack?

danpalmer
0 replies
19h13m

GitHub uses Slack, and has done since long before the Microsoft acquisition. GitHub also does a ton of chat-ops, or at least used to, so their migration from Campfire to Slack was a big move for the company, I doubt they want to move again.

snvzz
0 replies
12h50m

All GitHub needs to do to make most happy is offer an opt-out toggle.

It still doesn't.

slicktux
0 replies
14h7m

Yet people keep feeding it their code by using GitHub as their repo… Just how we use the internet to share information; there’s just no escaping it.

perlgeek
0 replies
9h58m

From the article:

The anonymous programmers have repeatedly insisted Copilot could, and would, generate code identical to what they had written themselves, which is a key pillar of their lawsuit since there is an identicality requirement for their DMCA claim. However, Judge Tigar earlier ruled the plaintiffs hadn't actually demonstrated instances of this happening, which prompted a dismissal of the claim with a chance to amend it.

So, the problem is really one of the lack of evidence, which seems... like a pretty basic mistake from the plaintiffs?

They could've taken a screencap video back when Copilot still produced code more verbatim, and used that as evidence, I assume.

passwordoops
0 replies
5h30m

"The lack of documents from the Windows maker is apparently down to "technical difficulties" in collecting Slack messages"

Wait, I'm forced to use Teams at work but Microsoft employees are on Slack?!

nashashmi
0 replies
19h12m

Big question: this thing called “training” AI off of data, how much of this is “training” and how much of this is “synthesizing”? It seems like if code is being copied and rephrased, it is synthetic. Not much “learning” and “training” going on here.

naikrovek
0 replies
6h12m

You silly whiners. The lawsuit was pure gesture from the beginning, and I said so at the time. You were all so sure that GitHub were breaking several laws, and now that you haven't gotten your way, you're saying the courts are corrupt.

The mere fact that this suit was dismissed means that there was not enough evidence to hold a trial. But you all think you know better than the judge and the attorneys who worked on this, I assume?

Without commenting further about the merit of the suit, I will say that it is very telling that everyone here thinks they know better than the legal professionals who worked on this case for probably hundreds of hours over the past few months, while those of you who are active commenters here have given this case no more than 10 hours of thought at the most.

It is very sad to me that we no longer trust professionals, and each believe ourselves to be smarter and more capable than anyone else at a career that we don't even practice. Moreover, a lot of you seem to believe that you have unique realizations that the professionals working on these things have all somehow missed.

Each of you may be (and probably are, really) the world's foremost expert in something and I need you all to understand that being an expert in one or more things does not grant you expertise in anything else. You may be the most valuable software developer at a government contractor doing top secret work, and you may be so knowledgeable that other companies contract your time for help with their work, and that's awesome. but this skill inherently has zero bearing on your ability to understand a fucking lawsuit about copilot.

It is hard for people to swallow the fact that "I'm very smart here, but not there" and they will often default to "I'm very smart here, so I am very smart there." That is not true by default, and this is very rarely true, even with effort spent to make it happen.

The suit was dismissed because it didn't meet the criteria required. You do not know more than the people involved. You are not seeing some obvious fact that the experts have missed. You simply hate Microsoft and you want them to suffer, and you get mad when legal matters impede that.

chrismsimpson
0 replies
19h23m

If this is how the law is applied for code, are we to expect this is also how it will be applied for other data (e.g. audio a la Udio and Suno)?

albertTJames
0 replies
8h36m

Looking good ! Go Copilot !

Tomte
0 replies
14h44m

That's Matthew Butterick's case.