> Indeed, last year GitHub was said to have tuned its programming assistant to generate slight variations of ingested training code to prevent its output from being accused of being an exact copy of licensed software.
If I, a human, were to:
1. Carefully read and memorize some copyrighted code.
2. Produce new code from that memory, but in the process of typing it up, mechanically tweak a few identifiers or something so the result has exactly the same semantics but isn't character-wise identical.
3. Claim that as new original code without the original copyright.
I assume that I would get my ass kicked legally speaking. That reads to me exactly like deliberate copyright infringement with willful obfuscation of my infringement.
How is it any different when a machine does the same thing?
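For concreteness, a toy sketch of the renaming step described above (all names and logic here are invented for illustration): two functions that are character-wise different but semantically identical on every input.

```python
# Hypothetical "original" copyrighted function.
def original_checksum(data: bytes) -> int:
    total = 0
    for byte in data:
        total = (total + byte) % 256
    return total

# The "new" version: identifiers mechanically tweaked, logic untouched.
def fresh_digest(payload: bytes) -> int:
    acc = 0
    for b in payload:
        acc = (acc + b) % 256
    return acc

# The texts differ, yet the behavior is identical.
assert original_checksum(b"hello world") == fresh_digest(b"hello world")
```

No diff tool would flag these as identical text, but any behavioral comparison shows they are the same program.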
You have a much smaller lobbying budget than the AI industry, and you didn't flagrantly rush to copy billions of copyrighted works as quickly as possible and then push a narrative acting like that's the immutable status quo that must continue to be permitted lest the now-massive industry built atop copyright violation be destroyed.
Violate one or two copyrights, get sued or DMCAed out of existence. Violate billions, on the other hand, and you magically become immune to the rules everyone else has to follow.
What about copyright's purpose of furthering the arts and sciences?
Copyright has utterly failed to serve that purpose for a long time, and has been actively counterproductive.
But if you want to argue that copyright is counterproductive, I completely agree. That's an argument for reducing or eliminating it across the board, fairly, for everyone; it's not an argument for giving a free pass to AI training while still enforcing it on everyone else.
Without copyright, entire industries would've been dead a long time ago, including many movies, games, books, tv, music, etc.
Just because their lobbies tend to push the boundaries of copyright into the absurd doesn't mean these industries aren't worth saving. There should be lawmakers who actually seek a balance between public and commercial interests.
> Without copyright, entire industries would've been dead a long time ago, including many movies, games, books, tv, music, etc.
Citation needed. There are many ways to make money from producing content other than restricting how copies of it can be distributed. The owner should be able to choose copyright as a means of control, but that doesn't mean nobody would create any content at all without copyright as a means of control.
There's nothing preventing people from producing works and releasing them without copyright restriction. If that were a more sustainable model, it would be happening far more often.
As it is now, especially in the creative fields (which I am most knowledgeable about), the current system has allowed for an incredible flourishing of creation, which you'd have to be pretty daft to deny.
That's not the argument. The fact that there currently are restrictions on producing derivative works is the problem. You cannot produce a Star Wars story without getting consent from Disney. You cannot write a Harry Potter story without consent from Rowling.
That's not actually true. There's nothing stopping you from producing derivative works. Publishing and/or profiting from other people's work does have some restrictions though.
There's actually a huge and thriving community of people publishing derivative works, on a not-for-profit basis, on Archive of Our Own. (Among other places.)
> There's actually a huge and thriving community of people publishing derivative works, on a not-for-profit basis, on Archive of Our Own. (Among other places.)
Yes, and none of those people are making a living at creating things. That's why they are allowed by the copyright owners to do what they're doing--because it's not commercial. Try to actually sell a derivative work of something you don't own the copyright for and see how fast the big media companies come after you. You acknowledge that when you say there are "restrictions" (an understatement if I ever saw one) on profiting from other people's work (where "other people" here means the media companies, not the people who actually created the work).
It is true that without our current copyright regime, the "industries" that produce Star Wars, Disney, etc. products would not exist in their current form. But does that mean works like those would not have been created? Does it mean we would have less of them? I strongly doubt it. What it would mean is that more of the profits from those works would go to the actual creative people instead of middlemen.
Again, not true. One of the most famous examples is likely Naomi Novik, who is a bestselling author, in addition to a prolific producer of derivative works published on AO3. Many other commercially successful authors publish derivative works on this platform as well.
Speculate all you want about an alternative system, but you really don't know what would have happened, or what would happen moving forward.
> not true
Sorry, I meant they're not making a living at creating derivative works of copyrighted content. They can't, for the reasons you give. Nor can other people make a living creating derivative works of their commercially published work. That is an obvious barrier to creation.
Given that copyrighting is automatic at the instant of creation, that is, um, debatable.
Slapping 3 lines in LICENSE.TXT doesn’t override the Berne Convention.
Are you claiming that an author cannot place their work in the public domain?
In most of the world no, they can't.
Yes, they can't, because there is no legally reliable way to do it. Briefly: the law really doesn't like the idea of property that doesn't have an owner, so if you try to place a work of yours in the public domain, what you're actually doing is making it abandoned property; anyone who wants to can then claim they own it and restrict everyone else, including you, from using it. The best an author can do is grant a license that basically lets anyone do what they want with the work. Creative Commons has licenses that do that.
> the current system has allowed for a incredible flourishing of creation
No, the current system has allowed for an incredible flourishing of middlemen who don't create anything themselves but coerce creative people into agreements that give the middlemen virtually all the profits.
People do not put out their stuff on their own. They get lured into contracts selling their IP to a shitty company that then publishes it, of course WITH copyright, so the company can make money while the artist doesn't.
Books, music, and games are a lot older than copyright.
Have you looked at who created these things, by and large? For the most part, you have:
- aristocrats who were wealthy and didn't need to "work" to survive and put food on the table
- craftspeople supported through the patronage of a rich person (or religious order) who deigned to support their art
- (in the kinda modern world) national governments who want to support their national art, often out of fear that larger nations' cultural influence will dwarf their own
Are you implying that these three pillars will be able to produce anywhere near the current amount of content we produce?
How, in a world where digital copies are effectively free to copy ad infinitum, would a creator reap any benefit from that network effect?
A modern equivalent would be famous YouTubers whose whole day consists of "watching" other people's hard-earned videos. The super lazy ones don't direct people to the original and don't provide meaningful commentary; they just consume the video as 'content' to feed their own audience and provide no value to the original creator. Killing copyright entirely would amplify this "just bypass the original source" dynamic until the value of the original creator drops to zero.
Do you think the vast "amount of content we produce" is actually propped up by copyright? Have you ever heard of someone who started their career on YouTube due to copyright? On the contrary, how often have you heard of people stopping their YouTube career due to copyright, or explicitly limiting the content they create? I have only heard of cases of the latter. In fact, the latter partially happened to me.
You are making an assumption that people should reap (monetary) benefits for creating things. What you are ignoring is that the world where digital copies are effectively free is also the world where original works are insanely cheap as well. In this world, people create regardless of monetary gain.
To make this point: how much money did you make from this comment that you posted? It's covered by copyright, so surely you would not have created it if not for your own benefit.
For that matter, if you think China ripping everyone else off is bad now… well, just wait until every company can do that.
If everyone could do it, it wouldn't be as big a deal - small western businesses would be on a more level playing field, since they would be almost as immune from being sued by big businesses as Chinese businesses are. As it is, small businesses aren't protected by patents (because a patent is a $10k+ ticket to a $100k+ lawsuit against a competitor with a $1M+ budget for lawyers) while still being bound by the restrictions of big business's patents. It's lose/lose.
Trademark isn't copyright, so no.
Nobody cares anymore. We're sick of their rent seeking, of their perpetual monopolies on culture. Balance? Compromise? We don't want to hear it.
Nearly two hundred years ago one man warned everyone this would happen. Nobody listened. These are the consequences.
"At present the holder of copyright has the public feeling on his side. Those who invade copyright are regarded as knaves who take the bread out of the mouths of deserving men. Everybody is well pleased to see them restrained by the law, and compelled to refund their ill-gotten gains. No tradesman of good repute will have anything to do with such disgraceful transactions. Pass this law: and that feeling is at an end. Men very different from the present race of piratical booksellers will soon infringe this intolerable monopoly. Great masses of capital will be constantly employed in the violation of the law. Every art will be employed to evade legal pursuit; and the whole nation will be in the plot. On which side indeed should the public sympathy be when the question is whether some book as popular as “Robinson Crusoe” or the “Pilgrim’s Progress” shall be in every cottage, or whether it shall be confined to the libraries of the rich for the advantage of the great-grandson of a bookseller who, a hundred years before, drove a hard bargain for the copyright with the author when in great distress? Remember too that, when once it ceases to be considered as wrong and discreditable to invade literary property, no person can say where the invasion will stop. The public seldom makes nice distinctions. The wholesome copyright which now exists will share in the disgrace and danger of the new copyright which you are about to create. And you will find that, in attempting to impose unreasonable restraints on the reprinting of the works of the dead, you have, to a great extent, annulled those restraints which now prevent men from pillaging and defrauding the living."
https://www.thepublicdomain.org/2014/07/24/macaulay-on-copyr...
So true! Copyrights that last 20 years would be completely reasonable. Maybe with exponentially increasing fees for successive renewals, for super valuable properties like Disney movies.
Yeah many industries like:
- Big Corps that buy IP
- Patent Trolls
- Companies that fuck over artists
This debate is tired because nobody brings citations. The pro-copyright lobby cites numbers of jobs. The anti, nothing. In that midst, of course we're going to stick with the status quo.
This is a specious argument. It is impossible for us to gesture at the works of art that do not exist because of draconian copyright. Humans have been remixing each others' works for millions of years, and the artificial restriction on derivative work is actively destroying our collective culture. There should be thousands of professional works (books, movies, etc.) based on Lord Of The Rings by now, many of which would surpass the originals in quality given enough time, and we have been robbed of them. And Lord Of The Rings is an outlier in that it still remains culturally relevant despite its age; most works will remain copyrighted for far longer than their original audience was even alive, meaning that those millions of flowers never get their chance to bloom.
This is all true, and in a vacuum I agree with it. There's a pretty core problem with these kinds of assertions, though: people have to make rent. Never have I seen a substantive, pass-the-sniff-test argument for how to make this system practical when your authors and your artists need to eat in a system of modern capital.
So I'm asking genuinely: what's your plan? What's the A to B if you could pass a law tomorrow?
Top priority: UBI, together with a world in which there's so much surplus productivity that things can survive and thrive without having "how does this make huge amounts of money" as its top priority to optimize for.
Apart from that: Conventions/concerts/festivals (tickets to a unique live event with a crowd of other fans), merchandise (pay for a physical object), patronage (pay for the ongoing creation of a thing), crowdfunding/Kickstarter (pay for a thing to come into existence that doesn't exist yet), brand/quality preference (many people prefer to support the original even if copies can be made), commissions (pay for unique work to be created for you), something akin to "venture funding", and the general premise that if a work spawns ten thousand spinoffs and a couple of them are incredible hits they're likely to direct some portion of their success back towards the work they build upon if that's generally looked upon favorably.
People have an incredible desire both to create and to enjoy the creations of others, and that's not going to stop. It is very likely that the concept of the $1B movie would disappear, and in trade we'd get the creation of far far more works.
Yeah, this is what I was expecting. I have no love for Disney et al but I think that this is dire (aside from UBI, which would be great but is fictional without a large-scale shift in American culture).
"Everybody else gets paid for the work they do; you get paid for things around the work you do, if you're lucky" is a way to expect creatives to live that, to put a point on it, always ends up being "for thee, but not for me". It's bad enough today--I think you described something worse.
The current model is "most people get paid for the work they do, but you get paid for people copying work you've already done", which already seems asymmetric. This would change the model to "people get paid for the work they do, and not paid again for copying work they've already done".
Not the person you responded to, but:
Patreon (or liberapay etc). Take a look at youtube: so many creators are actively saying "youtube doesn't pay the bills, if you like us then please support us on Patreon". Patreon works. Some of the time, at least - just like copyright. Also crowdfunding (e.g. Kickstarter), which worked out well for games like FTL and Kingdom Come: Deliverance.
Although, I personally don't believe copyright should be abolished - it just needs some amendments. It needs a duration amendment - not a flat duration (fast fashion doesn't need even 5 years of copyright, but aerospace software regularly needs several decades just to become profitable), but either some duration mechanism or a simple discrimination by industry.
Also, I think any sort of functional copyright (e.g. software copyright) ought to have an incentive or requirement to publish the functional bits - for instance, router firmware ought to require the source code in escrow (to be published once copyright duration expires) for any legal protections against reverse-engineering to be mounted. Unpublished source code is a trade secret, and should be treated as such.
Also, these discussions don't seem to mention fanfiction, which demonstrates plenty of people write good works without being professionally paid and without the protection of copyright.
How many subscribers on Patreon are there because the creator provides pay-walled extra content? How many would remain if that pay-walled content were mirrored directly on YouTube?
Crowdfunding might work better, but how many would donate to a game when, instead of getting it cheaper as a Kickstarter supporter, they could get it for free after it is released?
Copyright is not optimized for making sure artists and authors get enough to eat. It's optimized for people with a lot of money to make even more money by exploiting artists and authors.
I doubt there's a simple answer (I certainly don't have one), but the current system is not exactly a creators' utopia.
My own business model is to create Things That Don't Exist Yet. This (typically bespoke work) is actually the majority of work in any era I think. For me, copyright doesn't do much, it mostly gets in the way.
If you pass the law tomorrow -all else being equal- my profits would stay equal or go up somewhat.
We can gesture at the tiniest tip of the iceberg by observing things that are regularly created in violation of copyright but not typically attacked and taken down until they get popular:
- Game modding, romhacks, fangames, remakes, and similar.
- Memes (often based on copyrighted content)
- Stage play adaptations of movies (without authorization)
- Unofficial translations
- Machinima
- Speedruns, Let's Play videos, and streams (very often taken down)
- Music remixes and sampling
- Video mashups
- Fan edits/cuts, "Abridged" series
- Archiving and preservation of content that would otherwise be lost
- Fan films
- Fanfiction
- Fanart
- Homebrew content for tabletop games
Very often taken down, but mostly only by Nintendo.
Fashion is traditionally not copyrightable [1], and the fashion industry is doing rather well.
Similarly, our IT infrastructure is now built mostly on a set of patches to the copyright system [2] called F/L/OSS that provided more freedom to authors and users, and led to more innovation and proliferation of solutions.
So even just in the modern west, we can see thriving ecosystems where copyright is absent or adjusted; and where the outcomes are immediately visible on the street.
[1] Though a quick search shows that lawyers are making inroads.
[2] One way of describing it at least, YMMV.
Could these "free passes" for AI training serve as a legal wedge to increase the scope of fair use in other cases? Pro-business selective enforcement sucks, but so long as model weights are being released and the public is benefiting then stubbornly insisting that overzealous copyright laws be enforced seems self-defeating.
That ship sailed long ago. While copyright can and is used at times to protect the "little guy", the law is written as it is in order to protect and further corporate interests.
The current manifestation of copyright is about rent-seeking, not promoting innovation and creativity. That it may also do so is entirely coincidental.
Also, if it wasn't about rent-seeking and preventing access to works, copyright wouldn't have to last for decades, many multiples of a work's useful commercial life. The fact that it does last this long shows that it's not about promoting innovation and creativity.
Copyright was invented by a cartel of noblemen, the British Stationers' Company, who, due to liberal reform, were going to lose their publishing monopoly. The implementation of copyright law as they helped pen it allowed them to mostly continue their position while portraying it as "protecting the little guy".
Funny how both the rhetoric and intentions are the same after three hundred years.
Copyright’s purpose is a cudgel to be wielded to enrich the holder for, ideally, eternity. If “eternity” is threatened, you use proceeds from copyright to change copyright law to protect future proceeds.
You want to look at the Supreme Court case "Eldred v. Ashcroft." Eldred challenged Congress's retroactive extension of existing copyrights, arguing that extending protection on already-existing works could not possibly further the arts and sciences. They also argued that if Congress had the power to continually extend existing copyrights by N years every N years, the Constitutional phrase "for a limited time" had no meaning.
The Supreme Court's decision was a bunch of bullshit around "well, y'know, people live longer these days, and some creators are still alive who expected these to last their whole lives, and golly, coincidentally this really helps giant corporations."
It is immutable.
What are you going to do about it? Confiscate everyone's home gamer PCs?
Even in the most extreme hypothetical where lawsuits shut down OpenAI, that doesn't delete the stable diffusion models that I have on my external hard drives.
The tech is out there. It's too late.
There's a strong geopolitical angle as well. If you force American companies to license all training data for LLMs, that is such a gargantuan undertaking it would effectively set US companies back by years relative to Chinese competitors, who are under no such restrictions.
Bottom line, if you're doing something considered relevant to the national interest then that buys you a lot of leeway.
Sounds like the same concept as commonly said of "murderer vs conqueror".
Could probably be applied to many other fields for disruption too. Not the murderer bit (!), more the "break one or two laws -> scaled up massively to a potential new paradigm".
works the same for banks and owing them money
Claims of billions or millions in violations are what they used to nail warez folks with. So there is that.
You will need to first demonstrate that actual copying took place. And that what copying that did take place was actually illegal or infringing.
As we're seeing in court, that's a very interesting question. It turns out that the answers are very counter-intuitive to many.
You might not get your ass kicked. Copyright doesn't protect function, to the point where the court will assess the degree to which the style of the code can be separated from the function. In the event that they aren't separable, the code is not copyrightable.
https://www.wardandsmith.com/articles/supreme-court-announce...
https://easlerlaw.com/software-computer-code-copyrighted#:~:...
US copyright does protect against "substantial similarity" [0]. And at the other end of the spectrum, this has been abused in absurd ways to argue that substantially different code has infringed.
In Zenimax v. Oculus, they basically argued that a bunch of really abstract yet entirely generic parts of the code were shared (we are talking nested for loops and certain combinations of if statements), and because the courtroom lacked a qualitative understanding of code, syntax, common patterns, and what might actually qualify as substantively novel code, this was accepted as infringing. [1]
Point is, the legal system is highly selective when it comes to corporate interests.
[0] https://en.wikipedia.org/wiki/Substantial_similarity
[1] https://arstechnica.com/gaming/2017/02/doom-co-creator-defen...
I don't even think it's that. In recent cases like Oracle v. Google and Corellium v. Apple, Fair Use prevailed with all sorts of conflicting corporate interests at play. The Zenimax v. Oculus case very much revolved around NDAs that Carmack had signed and not the propagation of trade secrets. Where IP is strictly the only thing being concerned, the literal interpretation of Fair Use does still seem to exist.
Or for a more plain example, Authors Guild v. Google, where Google defended its indexing of thousands of copyrighted books as Fair Use.
In fact, I'd go so far as to argue your example of Authors Guild v. Google is a good indication that most cases will probably go an AI platform's way. It's a pretty parallel case on a number of the arguments. Indexing required ingesting whole copyrighted works verbatim. It utilized that ingested data to produce a new commercial work consisting of output derived from that data. If I remember the case correctly, Google even displayed snippets when matching a search so the searcher could see the match in context, reproducing the works verbatim for those snippets; one could presume (though I don't recall if it was coded against) that with sufficiently clever search prompts, someone could get the index search to reproduce a substantial portion of a work.
Arguably, the AI platforms have an even stronger case as their nominal goal is not to have their systems reproduce any part of the works verbatim.
The more recent Warhol decision argues quite strongly in the opposite direction. It fronts market impact as the central factor in fair use analysis, explicitly saying that whether or not a use is transformative is in decent part dependent on the degree to which it replaces the original. So if you're writing a generative AI tool that will generate stock photos that it generated by scraping stock photo databases... I mean, the fair use analysis need consist of nothing more than that sentence to conclude that the use is totally not fair; none of the factors weigh in favor it.
I think that decision is much narrower than "market impact". It's specifically about substitution, and to that end, I don't see a good argument that Co-Pilot substitutes for any of the works it was trained on. No one is buying a license to co-pilot to replace buying a license to Photoshop, or GIMP, or Linux, or Tux Racer. Nor is Github selling co-pilot for that use.
To the extent that a user of co-pilot could induce it to produce enough of a copyrighted work to both infringe on the content (remember that algorithms are not protected by copyright) and substitute for the original by licensing in lieu of, I would expect the courts to examine that in the ways it currently views a xerox machine being used to create copies of a book. While the machine might have enabled the infringement, it is the person using the machine to produce and then distribute copies that is doing the infringing not the xerox machine itself nor Xerox the company.
Specifically in the opinion the court says:
I find it difficult to come up with a good case that any given work used to train co-pilot and co-pilot itself share "the same or highly similar purposes". Even in the case of say someone having a code generator that was used in training of co-pilot, I think the courts would also be looking at the degree to which co-pilot is dependent on that program. I don't know off hand if there are any court cases challenging the use of copyright works in a large collage of work (like say a portrait of a person made from Time Magazine covers of portraits), but again my expectation here is that the court would find that while the entire work (that is the magazine cover) was used and reproduced, that reproduction is a tiny fraction of the secondary work and not substantial to its purpose.
Similarly we have this line:
Which I think supports my comparison to the xerox machine. If the plaintiffs against Co-Pilot could have shown that a substantial majority of users and uses of Co-Pilot was producing infringing works or producing works that substitute for the training material, they might prevail in an argument that co-pilot is infringing regardless if the intent of github. But I suspect even that hurdle would be pretty hard to clear.
Of the various recent uses of generative AI, Copilot is probably the one most likely to be found fair use and image generation the least likely.
But in any case, Authors Guild is not the final word on the subject, and anyone trying to argue for (or against) fair use for generative AI who ignores Warhol is going to have a bad day in court. The way I see it, Authors Guild says that if you are thoughtful about how you design your product, and talk to your lawyers early and continuously about how to ensure your use is fair and will be seen as fair in the courts, you can indeed do a lot of copying and still be fair use.
I agree. Nothing is going to be the final word until more of these cases are heard. But I still don't think Warhol is as strong even against other uses of generative AI, and in fact I think in some ways argues in their favor. The court in Warhol specifically rejects the idea that the AWF usage is sufficiently transformed by the nature of the secondary work being recognizably a Warhol. I think that would work the other way around too, that a work being significantly in a given style is not sufficient for infringement. While certainly someone might buy a license to say, Stable Diffusion and attempt to generate a Warhol style image, someone might also buy some paints and a book of Warhol images to study and produce the same thing. Provided the produced images are not actually infringements or transformations of identifiably original Warhol works, even if they are in his style, I think there's a good argument to be made that the use and the tool are non-infringing.
Or put differently, if the Warhol image had used Goldsmith's image as a reference for a silk screen portrait of Steve Tyler, I'm not sure the case would have gone the same way. Warhol's image is obviously and directly derived from Goldsmith's image and found infringing when licensed to magazines, yet if Warhol had instead gone out and taken black and white portraits of prince, even in Goldsmith's style after having seen it, would it have been infringing? I think the closest case we have to that would have been the suit between Huey Lewis and Ray Parker Jr. over "I Want a New Drug"/"Ghostbusters" but that was settled without a judgement.
I do agree that Warhol is a stronger argument against artistic AI models, but it would very much have to depend on the specifics of the case. The AWF usage here was found to be infringing, with no judgement made of the creation and usage of the work in general, but specifically with regard to licensing the work to the magazine. They point out the opposite case that his Campbell paintings are well established as non-infringing in general, but that the use of them licensed as logos for soup makers might well be. So as is the issue with most lawsuits (and why I think AI models in general will win the day), the devil is in the details.
A key finding the judge made in Authors Guild v. Google was that the authors benefited from the tool Google created. A search tool is not a replacement for a book, and it is much more likely to generate awareness of the book, which in turn should increase sales for the author.
AI platforms that replace and directly compete with authors cannot use the same argument. If anything, those suing AI platforms are more likely to bring up Authors Guild v. Google as a guiding case for when to apply fair use.
Substantial similarity refers to three different legal analyses for comparing works. In each case what the analysis is attempting to achieve is different, but in no case does it operate to prohibit similarity, per se.
The Wikipedia page points out two meanings. The first is a rule for establishing provenance. Copyright protects originality, not novelty. The difference is that if two people coincidentally create identical works, one after another, the second-in-time creator has not violated any right of the first. (Contrast with patents, which do protect novelty.) In this context, substantial similarity is a way to help establish a rebuttable presumption that the latter work is not original, but inspired by the former; it's a form of circumstantial evidence. Normally a defendant wouldn't admit outright they were knowingly inspired by another work, though they might admit this if their defense focuses on the second meaning, below. The plaintiff would also need to provide evidence of access or exposure to the earlier work to establish provenance; similarity alone isn't sufficient.
The second meaning relates to the fact that a work is composed of multiple forms and layers of expression. Not all are copyrightable, and the aggregate of copyrightable elements needs to surpass a minimum threshold of content. Substantial similarity here means a plaintiff needs to establish that there are enough copyrightable elements in common. Two works might be near identical, but not be substantially similar if they look identical merely because they're primarily composed of the same non-copyrightable expressions, regardless of provenance.
There's a third meaning, IIRC, referring to a standard for showing similarity at the pleadings stage. This often involves a superficial analysis of apparent similarity between works, but it's just a procedural rule for shutting down spurious claims as quickly as possible.
Copyright is abused often. Our modern version of copyright is BS and only benefits large corps who buy a lot of IP.
While correct, the example given is that they COPY the code, then make adjustments to hide the fact. I suspect this is still a copyright violation. It’s interesting that a judge sees it differently when it’s just run through a programme. I’m not a legal expert so I’m guessing it’s a bit more complex than the headline?
No copy-paste was explicitly used. They compressed it into a latent space and recreated from memory, perhaps with a dash of "creativity" for flavor. Hypothetically, of course.
The distinction is pedantic but important, IMHO. AI doesn't explicitly copy either.
But isn’t that the same as memorising it and rewriting the implementation from memory? I’m sure “it wasn’t an exact reproduction” is not much of a defence.
I sure think so. I also think that (to first order) this is exactly what modern AI products do. Is a lossy copy still a copy?
Ok I read the article and it looks like the issue is the DMCA specifically, which require the code to be more identical than is presented. I’m guessing separate claims could still come from other copyright laws?
If I were to license a cover of a song for a music video, I'd have to license both the original song and the cover itself.
I'd say this is extremely relevant in this case.
if that is the case why do people ever license covers?
to clarify - I thought you just had to negotiate with the cover artist about rights and pay a nominal fee for usage of the song for cover purposes - that is to say you do not negotiate with the original artist, you negotiate with a cover artist and the whole process is cheaper?
The simple version is that code is copyrightable as an expression, and the underlying algorithm is patentable.
The legal term you're looking for here is the "Abstraction-Filtration-Comparison" test: what remains if you subtract all the non-copyrightable elements from a given piece of code.
Algorithms have become patentable only very recently in the history of patents, without a rationale ever being provided for this change, and in some countries they have never become patentable.
Even in the countries other than USA where algorithms have become patentable, that happened only due to USA blackmailing those countries into changing their laws "to protect (American) IP".
It is true however that there exist some quite old patents which in fact have patented algorithms, but those were disguised as patents for some machines executing those algorithms, in order to satisfy the existing laws.
Software like Blackduck or Scanoss is designed to identify exactly that type of behaviour. It is very often used to scan closed-source software and check whether it contains snippets copied from open source with incompatible licenses (e.g. GPL).
To do so, these tools build a syntax tree of your code snippet and compare its structure with similar trees in open-source software, without being fooled by variable names. To speed up the search, they also compute a signature for these trees so that it can be looked up more easily in their database of open-source code.
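The core trick, comparing structure rather than text, can be sketched in a few lines. This is a toy illustration only: the `fingerprint` function and its crude normalization are invented for this example, and real scanners like Blackduck or Scanoss use far more robust techniques (token winnowing, subtree signatures, and so on).

```python
import ast
import hashlib

def fingerprint(source):
    """Hash a snippet's syntax tree with identifier names blanked out.

    Toy sketch: renaming variables or functions no longer changes
    the hash, because only the tree structure survives.
    """
    tree = ast.parse(source)
    for node in ast.walk(tree):
        # Blank out names so mechanical renames don't affect the result.
        if isinstance(node, ast.Name):
            node.id = "_"
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            node.name = "_"
        elif isinstance(node, ast.arg):
            node.arg = "_"
    return hashlib.sha256(ast.dump(tree).encode()).hexdigest()

original = """
def total(xs):
    s = 0
    for x in xs:
        s += x
    return s
"""

renamed = """
def acc(items):
    r = 0
    for i in items:
        r += i
    return r
"""

print(fingerprint(original) == fingerprint(renamed))  # → True
```

The two snippets share no identifier, yet hash identically, which is exactly why "tweak a few variable names" is such a weak form of obfuscation against these tools.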
Because intent matters in the law. If you intended to reproduce copyrighted code verbatim but tried to hide your activity with a few tweaks, that's a very different thing from using a tool which occasionally reproduces copyrighted code by accident but clearly was not designed for that purpose, and much more often than not outputs transformative works.
> clearly was not designed for that purpose,
I'm not aware of evidence that support that claim. If I ask ChatGPT "Give me a recipe for squirrel lemon stew" and it so happens that one person did write a recipe for that exact thing on the Internet, then I would expect that the most accurate, truthful response would be that exact recipe. Anything else would essentially be hallucination.
i think you are misconceiving then how LLMs work / what they are
You can certainly try to hit a nail with a screw driver, but that doesn't make the screw driver a hammer.
As I understand it, LLMs are intended to answer questions as "truthfully" as they can. Their understanding of truth comes from the corpus they are trained on. If you ask a question where the corpus happens to have something very close to that question and its answer, I would expect the LLM to burp up that answer. Anything less would be hallucination.
Of course, if I ask a question that isn't as well served by the corpus, it has to do its best to interpolate an answer from what it knows.
But ultimately its job is to extract information from a corpus and serve it up with as much semantic fidelity to the original corpus as possible. If I ask how many moons Earth has, it should say "one". If I ask it what the third line of Poe's "The Raven" is, it should say "While I nodded, nearly napping, suddenly there came a tapping,". Anything else is wrong.
If you ask it a specific enough question where only a tiny corner of its corpus is relevant, I would expect it to end up either reproducing the possibly copyright piece of that corpus or, perhaps worse, cough up some bullshit because it's trying to avoid overfitting.
(I'm ignoring for the moment LLM use cases like image synthesis where you want it to hallucinate to be "creative".)
I get that's what you and a lot of people want it to be, but it isn't what they are. They are quite literally probabilistic text generation engines. Let's emphasise that: the output is produced randomly by sampling from distributions, or in simple terms, like rolling dice. In a concrete sense it is non-deterministic. Even if an exact answer is in the corpus, its output is not going to be that answer, but the most probable answer given all the text in the corpus. If that one answer that exactly matches contradicts the weight of other, less exact answers, you won't see it.
And you probably wouldn't want to - if I ask if donuts are radioactive and one person explicitly said that on the internet, you probably aren't going to tell me you want it to spit out that answer just because it exactly matches what you asked. You want it to learn from the overwhelming corpus of related knowledge that says donuts are food, people routinely eat them, etc., and tell you they aren't radioactive.
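The dice-rolling point is easy to demonstrate with a toy sketch. The "logits" below are invented for the donut example (real models score tens of thousands of vocabulary tokens), but the sampling mechanics are the same:

```python
import math
import random
from collections import Counter

# Invented next-token scores for a prompt like "donuts are ...":
logits = {"food": 2.0, "snack": 1.2, "radioactive": -1.5}

def sample(logits, temperature=1.0):
    """Draw one token from softmax(logits / temperature): a weighted dice roll."""
    weighted = [(tok, math.exp(l / temperature)) for tok, l in logits.items()]
    z = sum(w for _, w in weighted)
    r = random.random() * z
    for tok, w in weighted:
        r -= w
        if r <= 0:
            return tok
    return weighted[-1][0]  # guard against float rounding

random.seed(0)
counts = Counter(sample(logits) for _ in range(1000))
print(counts)  # "food" dominates, but unlikely tokens still come up occasionally
```

The output is dominated by the high-probability answer, but any individual response is a random draw, which is why an exact-match answer in the corpus is neither guaranteed nor even necessarily likely to be reproduced.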
They are all hallucinations. Calling lies hallucinations and truths normal output is nonsense.
Perfect analogy.
Recipes are not copyrightable for that exact reason.
Substitute recipe for literally any other piece of unique information.
Um, the entire intent of these "AI" systems is explicitly to reproduce copyrighted work with mechanical changes to make it not appear to be a verbatim copy.
That is the whole purpose and mechanism by which they operate.
Also, the intent does not matter under law - not intending to break the law is not a defense if you break the law. Not intending to take someone's property doesn't mean it becomes your property. You might get lesser penalties and/or charges due to intent (the obvious examples being murder vs manslaughter, etc.).
But here we have an entire ecosystem where the model is "scan copyrighted material" followed by "regurgitate that material with mechanical changes to fit the surrounding context and to appear to be 'new' content".
Moreover, given that this 'new' code is just a regurgitation of existing code with mutations to make it fit the context and not be directly identical to the existing code, that 'new' code cannot itself be subject to copyright: you can't claim copyright to something you did not create, copyright does not protect the output of mechanical or automatic transformations of other copyrighted content, and copyright does not protect the result of "natural processes", e.g. 'I asked a statistical model to give me a statistically plausible sequence of tokens and it did'. So in the best-case scenario - the one where the copyright-laundering-as-a-service tool is not treated as just that - any code it produces is not protectable by copyright, and anyone can just copy "your work" without the license. And (since you've said it's fine as long as they weren't intending to violate copyright) they can say they could not distinguish the non-copyright-protected work from the protected work and assumed that therefore none of it was subject to copyright. To be super sure they weren't violating any of your copyrights, they then ran an "AI tool" over it to make the names better suit your style.
I am so sick of these arguments where people spout nonsense about "AI" systems magically "understanding" or "knowing" anything - they are very expensive statistical models. They produce statistically plausible strings of text, by a combination of copying the text of others wholesale and filling the remaining space with bullshit that for basic tasks is often correct enough, and for anything else is wrong - because again, they're just producing plausible sequences of tokens and have no understanding of anything beyond that.
To be very, very clear: if an AI system "understood" anything it was doing, it would not need to ingest essentially all the text that anyone has ever written just to produce content that is at best only locally coherent, and that is frequently incorrect in more or less every domain to which it is applied. Take code completion (as in this case). Developers can write code without first reading essentially all the code that has ever existed, because developers understand code. Developers don't intermingle random unrelated and non-present variables or functions in their code as they write, because they understand what variables are and therefore can't use nonexistent ones. "AI", on the other hand, required more power than many countries consume to "learn" by reading as much as possible of all code ever written, and then produces nonsense output for anything complex, because it's still just generating a string of tokens that is plausible according to its statistical model. The result is essentially binary: either it has been asked to produce code that was in its training corpus and can be copied essentially verbatim, with a transformation pass to make it fit, or it wasn't in the training corpus and you get random and generally incorrect code - hopefully wrong enough that it fails to build, because these models are also good at generating code that looks plausible but only fails at runtime, since 'plausible sequence of tokens' often overlaps with 'things a compiler will accept'.
Intent frequently matters a great deal when applying laws.
In the specific area of copyright law, it doesn't itself make the use non infringing, but it can absolutely impact the damages or a fair use argument.
I actually once tracked this claim down in the case of stable diffusion.
I concluded that it was just completely impossible for a properly trained stable diffusion model to reproduce the works it was trained on.
The SD model easily fits on a typical USB stick, and comfortably in the memory of a modern consumer GPU.
The training corpus for SD is a pretty large chunk of image data on the internet. That absolutely does not fit in GPU memory - by several orders of magnitude.
No form of compression known to man would be able to get it that small. People smarter than me say it's mathematically not even possible.
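That back-of-envelope arithmetic is easy to reproduce. The figures below are rough approximations from public reports (checkpoint size of Stable Diffusion v1 in fp16, and the scale of the LAION subset reportedly used for training), not exact numbers:

```python
# Rough public figures, for illustration only:
model_bytes = 2e9        # Stable Diffusion v1 checkpoint: roughly 2 GB in fp16
training_images = 2e9    # LAION-2B(en) subset: roughly 2 billion images

bytes_per_image = model_bytes / training_images
print(f"~{bytes_per_image:.1f} bytes of model weights per training image")
# Around one byte per image: even aggressive lossy codecs need
# thousands of times more than that to store a recognizable photo.
```

Whatever the model stores, at roughly one byte per training image it cannot be a per-image copy in any conventional sense of compression.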
Now for closed models, you might be able to argue something else is going on and they're sneakily not training neural nets or something. But the open models we can inspect? Definitely not.
Modern ML/AI models are doing Something Else. We can argue what that Something Else is, but it's not (normally) holding copies of all the things used to train them.
Not in copyright. The work speaks for itself, and the function of code is not a copyrightable aspect.
The intent of the work can matter when determining if de minimis applies as well as fair use.
It's equally plausible to say you don't intend to reproduce copyrighted code verbatim but occasionally do so given either a sufficiently specific prompt or because the reproduced code is so generic that it probably gets rewritten a hundred times a day because that's how people learned to do basic things from books or documentation or their education.
The guy who owns the machine is really rich, while you are more or less (all due respect of course) not worth suing.
That’s why I think the opposite of what you claim is true: if you were to do this, absolutely nothing would happen. When they do it, they will get sued over and over until the law changes and they can’t be sued, or they enter some mutually-beneficial relationship with the parties who keep suing.
Read up on the DMCA and the impact it has on e.g. nintendo emulators and the developers thereof
Those emulators are very popular, though, to the point of potentially impacting another business's bottom line, whereas an individual putting out a small block of code isn't exactly going to attract expensive lawyers.
I'm skeptical Github Copilot reproducing a couple functions potentially used by some random Github project is going to be a threat to another party's livelihood.
When AI gets good enough to make full duplicates of apps I'd be more concerned about the source. Thousands of smaller pieces drawn from a million sources and being combined in novel ways is less worrying though.
There is no impact to a company's bottom line when you are emulating a product they do not sell.
Yuzu, the emulator that was sued by Nintendo, was emulating the Nintendo Switch, which is a product Nintendo does sell.
Yuzu is not the only emulator taken down by Nintendo and Nintendo is not the only company that has gone after emulators.
In that case, could you clarify what instances of this you're referring to?
The death of Citra wasn't really a deliberate action on the part of Nintendo, it was collateral damage. Citra was started by Yuzu developers and as part of the settlement they were not able to continue working on it. Citra's development had long been for the most part taken over by different developers, but the Yuzu people were still hosting the online infrastructure and had ownership of the GitHub repository, so they took all of it down. Some of the people who were maintaining Citra before the lawsuit opened up a new repository, but development has slowed down considerably because the taking down of the original repository has caused an unfortunate splintering of the community into many different forks.
There is some speculation Nintendo was involved with the death of the Nintendo 64 emulator UltraHLE a long time back, but this was never confirmed. If indeed they did go after UltraHLE, then, just like Yuzu, this would be a case of them taking down an emulator for a console they were still profiting from, as UltraHLE was released in 1999.
The most famous example of companies going after emulators is Sony, which went after Connectix Virtual Game Station and Bleem!. Both were PS1 emulators released in 1999, a period during which Sony was still very much profiting from PS1 sales. Sony lost both lawsuits and hasn't gone after emulators since.
In 2017, Atlus tried to take down the Patreon page for RPCS3, a PS3 emulator. However, Atlus only went after the Patreon page, not the emulator itself, which they did because of their use of Persona 5 screenshots on said page. The screenshots were simply taken down and the Patreon page was otherwise left alone. Of note is that Atlus is a game developer, so they were never profiting from PS3 sales. However, they were certainly still profiting from Persona 5 sales, which had only released in 2016.
These are the only examples I can remember. Did I miss anything?
Emulators for many Nintendo consoles have been developed and released while the console was still being sold, and have been left alone as long as they had no direct links to piracy; recent events are a bit of a change.
IIRC it got a C&D but a case was never filed in court; the source code turned up eventually anyway.
the bnetd emulator, that let Diablo and StarCraft players not have to pay Blizzard for the privilege of buying the game, though that's a bit different.
Yes there is. If I can emulate Super Mario Odyssey on my PC, I don't need to buy a Nintendo Switch. If it wasn't available there, I'd have to buy a Nintendo Switch to play it. That's a lost sale for Nintendo. You could argue that I wasn't going to buy a switch anyway, but then we're getting too into hypotheticals.
This is the same reasoning the music and movie industries use when they go after people downloading music. And contrary to the popular opinion, I think it is wrong: if people want to pay, they will pay. Same for movies: if people really wanted to pay for a movie, they would go to a cinema, or stream it after a week or two. But there are also people who would rather jump through hoops than pay for music or movies. And that is not a lost sale, because there was never an intention to buy something in the first place.
I enjoy how you removed the “I think” qualifier which suggested that it’s very possible that you’re right.
I’m quite well read on the DMCA but admit you probably know far more about how Nintendo wields it.
Still, I suggest that it’s a lot more likely that GitHub is going to get sued than you or GP.
Finally, I believe using the legal system to bully independent software developers is, in legal terms, super lame. We are probably on the same side here.
The DMCA (at least the takedown-request part) is not really about suing someone, and not really about making money. It's about getting certain works off the internet.
You are probably more likely to be on the wrong end of a DMCA takedown request as a poor person, since you don't have the resources to fight it, and it's not about recovering damages, just censorship.
We are really losing the plot of what this thread is about here, but: DMCA takedown requests that are ignored, or where the site does not comply with the process, are subject to private civil action. Obviously, a takedown request is distinct from suing someone. And the way the rights holder forces the site to remove the content is under threat of monetary penalties.
It looks like wilful obfuscation because the obfuscation is so simplistic. But as the obfuscation gets increasingly sophisticated, it becomes ever harder to distinguish wilful obfuscation from genuine originality.
For the purposes of copyright, originality is not required, just different expressions. It's ideas (i.e., patents) that require novelty.
The 'sufficiently complex obfuscation' is exactly what people's brains go through when they learn, and then reproduce what they learned in a different context.
I argue that AI-training can be considered to be doing the same.
Some different scenarios:
(1) You leave your employer, don’t take any code with you, start your own company, reimplement your ex-employer’s product from scratch, but you do it in a very different way (different language, different design choices, different tech stack, different architecture)
(2) You leave your employer, take their code with you, start your own company, make some superficial changes to their code to obscure your theft but the copying is obvious to anyone who scratches the surface
(3) You leave your employer, take their code with you, start your own company, start very heavily manually refactoring their code, within a few months it looks completely different, very difficult to distinguish from (1) unless you have evidence of the process of its creation
(4) You leave your employer, take their code with you, start your own company, download some “infringement obfuscation AI agent” from the Internet and give it your employer’s codebase, within a few hours it has transformed it into something difficult to distinguish from (1) if you didn’t know the history
(1) is unlikely to be held to be infringing. (2) is rather obviously going to be held to be infringing. But what about (3)? IANAL, but I suspect if you admitted that is how you did it, a judge would be unlikely to be very sympathetic. Your best hope would be to insist you actually did (1) instead. And then the outcome of the case might come down to whether the judge/jury believes your claim you actually did (1), or the plaintiff/prosecution’s claim you did (3).
And (4) is basically just (3) with AI to make it a lot faster. Such an agent likely doesn't exist yet, but it could happen.
Timing is obviously a factor. If you leave your employer and launch a clone of their app the next week, everyone is going to think either you stole their code, or you were moonlighting on writing it (in which case they may legally own it anyway). If it takes you 12 months, it becomes more believable you wrote it from scratch. But if someone uses AI to launder code theft, maybe they can build the “clone” in a few days or weeks, and then spend a few months relaxing and recharging before going public with it
Numbers 2, 3, & 4 are all illegal because they start with an illegal action.
If I find a dollar on the sidewalk and put it in my wallet, is that stealing? If I punch a man getting change at a hotdog stand and a dollar falls on the sidewalk and then I put that in my wallet, is that stealing?
It doesn't matter what the scenario is after you stole code from your former employer, all actions are poisoned after.
From the article:
So (not a lawyer!) this reads like the point about GitHub tuning their model is not a generic defense against any and all claims of copyright infringement, but a response to a specific claim that this violates a provision of the DMCA.
I don't know whether this is a reasonable defense or not, but your intuitions or mine about whether there is a general copyright violation or what's fair are not necessarily relevant to how the judge construes that very specific bit of legal code.
What I got from this is, you can copy someone's copyrighted work provided you tweak a few things here and there. I wonder how this holds up in court if you don't have billions at your disposal.
Weird Al should be in the clear then, he changes probably 85% of all the song lyrics in his covers.
Weird Al explicitly seeks out permission from copyright holders and won't do a cover if he doesn't get their go-ahead [1].
Pretty much the exact opposite of all these AI companies :p
https://www.weirdal.com/archives/faq/
That's a significant oversimplification of how it works, though, to the point of almost not being a useful analogy.
A closer analogy would be a human who memorized every variation of a problem (and every other known problem), with a tiny percentage chance of reproducing an exact variation of one they memorized, who then added an after-the-fact filter to avoid directly reproducing it...
It's more like musicians who copy a bunch of music patterns or chord progressions, then notice their final output sounds too similar to another song (which happens often IRL), and change it to be more original before releasing it to the public.
This is mere assumption. AI is supposed to work like that, but that's a goal, and not the result of current implementations. Research shows that they do memorize solutions as well, and quite regularly so. (This is an unavoidable flaw in current LLMs; They must be capable of memorizing input verbatim in order to learn specific facts.)
This is copyright infringement. Actionable copyright infringement. The big music publishers go after this kind of accidental partial reproduction.
"Legally distinct" is a gimmick that only works where the copyright is on specific identifiable parts of a work.
Changing a variable name does not make a code snippet "legally distinct", it's still copyright infringement.
Meh, I still see that as a big oversimplification. Context matters, even if the copyright courts often ignore that for wealthy entities. Someone reproducing a song using AI and publishing it as their own is copyright infringement. A person specifically querying an AI engine that sucked up billions of lines of information, which generates what you ask it to with a small probability of reproducing a small subset of a larger commercial project, and sends it to someone in a chat box, is not exactly the same, IMO.
This is GitHub Copilot, after all. I use it daily and it autocompletes lines of code or generates functions you can find on Stack Overflow. It's not giving you the source code to Twitter in full and letting you put it on the internet as a business under another name.
We are currently seeing the music industry react to AI learning a bunch of music patterns and chord progressions and outputting works that sound very similar to existing music and artists. They are not liking it.
To see just how much they dislike it: YouTube's copyright strikes are basically a trained AI that detects music patterns to identify audio with slight variations of copyrighted songs and take the videos down. Generating slight variations was one of the early methods videos used to bypass the takedown system.
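The underlying idea, matching robust features instead of raw audio, can be sketched as a toy. Content ID's internals are not public, so this Shazam-style peak-pair approach is only an illustration of the concept; the `fingerprint` function and its parameters are made up for the example.

```python
import numpy as np

def fingerprint(spec, fan_out=3):
    """Toy audio fingerprint: hash pairs of spectral peaks.

    Stores (freq1, freq2, time-gap) triples, which survive slight
    variations like volume changes far better than raw sample hashes.
    """
    # One "peak" per time frame: the loudest frequency bin.
    peaks = [(t, int(np.argmax(spec[:, t]))) for t in range(spec.shape[1])]
    hashes = set()
    for i, (t1, f1) in enumerate(peaks):
        # Pair each peak with the next few peaks.
        for t2, f2 in peaks[i + 1 : i + 1 + fan_out]:
            hashes.add((f1, f2, t2 - t1))
    return hashes

rng = np.random.default_rng(42)
spec = rng.random((64, 50))   # stand-in spectrogram: 64 freq bins x 50 frames
quieter = spec * 0.5          # a volume change, i.e. a "slight variation"

# argmax ignores overall loudness, so the fingerprint survives the tweak.
print(fingerprint(spec) == fingerprint(quieter))  # → True
```

Because the match is on relative structure rather than exact samples, the "slight variation" trick stops working once the detector fingerprints this way.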
The machine alone doesn't do anything. The user and machine together constitute a larger system, and with autocomplete, the user is in charge. What's the user's intent?
I suspect that a lot of copyright violations are enabled by cut-and-paste and screenshot-taking functionality, and maybe we need to be careful with autocomplete, too? It's the user's responsibility to avoid this. We should be careful using our tools. Do users take enough care in this case? Is it possible to take enough care while still using CoPilot?
I've switched from CoPilot to Cody, but I use them the same way, to write my code. There's no particular reason to use CoPilot's output verbatim and lots of good reasons not to. By the time I've adapted it to my code base and code style and refactored it to hell and back, it's an expression of how I want to solve a problem, and I'm pretty confident claiming ownership.
Is that confidence misplaced? Are other people more careless?
By the same token, the machine alone can't download pirated movies. Yet the sites hosting those movies are targeted as the infringers.
There's a point at which foisting this responsibility on the users is simply socializing losses. Ultimately Copilot is the one serving the code up - regardless of the user's request. If the user then goes on to republish that work as their own it becomes two mistakes. It'll be interesting to see if any lawyers are capable of articulating that well enough in any of these lawsuits.
I would say yes, for two reasons. One is that using code of unknown provenance means you're opening yourself to unknown legal risks. The second is if you're rewriting it fully (so as not to run afoul of easily spotted copyright) that's not actually "clean room" and you're still open to problems. I'd also wonder what the point of using a code writing LLM is anyways if you're doing all the authorship yourself. It seems like doing double the work.
It is a lot of work to do a lot of rewrites, but it’s noncommercial and I’m not in a hurry. And autocomplete is still pretty useful.
Why stop there? Extrapolate that thought, keep generating more variants of the code, claim copyright, and seek rent from other people doing the same thing. To extrapolate full circle, there would be a business opportunity to generate as many variants as possible for the original author, to prevent all this from happening.
As long as we're not required to register copyright there's no reason to think the above will play out. International copyright agreements are not limited to verbatim copies only.
This has already been done[1] in music, though in their case they released them to the public domain. Admittedly I think that was more of a protest than anything.
[1]: https://www.vice.com/en/article/wxepzw/musicians-algorithmic...
You probably do this all the time. Forget memorizing; undoubtedly you've read code, learned from it, and then likely reproduced similar code. Probably nothing terribly important, just a function here or there. Maybe even reproduced something you did for a previous employer.
arr.sort((a, b) => a - b);
comes to mind. I bet most js devs have written this verbatim.
If you tell a programmer to implement a function foo(a, b) then there are actually only a tiny number of ways to do that, semantically speaking, for any given foo. The number of options narrows quickly as the programmer implementing it gets more competent.
Choosing function signatures is an art form but after that "copying" is hard to judge.
I'd argue there are infinite ways to implement any function, just almost all of them are extremely bad.
You would not get your ass kicked legally speaking. Copyright is not that broad. It's not a patent.
Just to set the stage and not entirely specific to this complaint... It really depends on what is and isn't subject to copyright for software.
Broadly, there is the distinction between expressive and functional code. [1]
And then there are the specific tests that have been developed by the courts to separate the expressive and functional aspects of software. [2] [3]
In practice it is very expensive for a plaintiff to do such analysis. For the most part the damages related to copyright are not worth the time and money. Plaintiffs tend to go for trade secret related damages as they are not restricted by the above tests.
There are also arguments to be made of de minimis infringements that are not worth the time of the court.
Most importantly the plaintiff fundamentally has the burden of proof and cannot just say that copying must have taken place. They need concrete evidence.
[1] https://en.wikipedia.org/wiki/Idea–expression_distinction
[2] https://en.wikipedia.org/wiki/Structure,_sequence_and_organi...
[3] https://en.wikipedia.org/wiki/Abstraction-Filtration-Compari...
It depends on how much tax you are paying, really. If you pay billions in taxes annually, they might see past it. If the company you copied from pays billions in taxes annually, you will go to jail. If this isn't painfully obvious by now...
It seems the total disregard that the tech community showed toward copyright when it was artists losing out has come back to bite. Face-eating leopards, etc.
Days like this, I wonder what Borges would have made of such questions.
"Pierre Menard, author of redis"
I know from experience that parents are aggressively pushing their children into STEM to maximize their chances of being economically secure, but, I really feel that we need a generation of philosophers and humanists to sift through the issues that our technology is raising. What does it mean to know something? What does authorship mean? Is a translated work the same as the original? Borges, Steiner, and the rest have as much to contribute as Ellison, Zuckerberg, and Altman.
No clue.
But what if the generative AI were used to create music instead of code - would the court have ruled differently?
CONSIDER:
In 2015, a federal judge ordered Thicke & Pharrell to pay 50% of proceeds to the Marvin Gaye estate for “Blurred Lines” being “too similar” to the song “Got to Give It Up”.
Comparison and commentary: https://youtu.be/7_UiQueteN4?si=SkClbyBMOcucigRm
Comparison of both songs: https://youtu.be/ziz9HW2ZmmY?si=3_VZzfoLT-NrozoK
It would. And this is where some legislation "in the spirit of" would have helped. So Microsoft's huge legal arm can't just wiggle their way out on technicalities. Clearly, the law is not prepared to face the challenge of copyright violations on the scale created by the LLMs.
I also think it's not just copyright. It's simply not right to create a product on top of the collective work of all open source developers, monetize it at the absurd scale Microsoft operates at, and never credit the original creators.
Adding to the sibling comments:
First: every human is already doing that. We have, to handwave a bit, a "reasonable person" bar to separate violations from the results of learning and new innovation.
Second: You can be a holder of copyright and your creations result in copyrightable artifacts. Anything generated by the program has been held as uncopyrightable.
Why? This is no different than copy pasting and modifying a bit of code from some documentation/other project/tutorial/SO. Surely if that were a basis for copyright infringement most semi-large software projects would be infringing on copyright.
I don't think anyone here should be willing to open the can of worms that is copy pasting small snippets of code and modifying them.
The judge seems to argue that the non-identical copies are at issue here and that they only happen under contrived circumstances. My moral opinion is that this is irrelevant and that even the defendant is the wrong party. Even verbatim copies of code snippets shouldn't be copyright infringement, and suing the company providing the AI is wrong to begin with, as neither the AI nor its provider can possibly be the one to infringe.
Regardless of the details here, it's become quite clear that the judicial system is for corporations. It doesn't matter whether they win, lose, or settle, as they win regardless, since the monetary benefits of what got them in court in the first place far outweigh any punishment or settlement cost.
I agree. I don’t see the difference.
That’s the entire reason “clean room reverse engineering” is done.
Using nothing but the binary itself, work out how things are done. Making sure that the reverse engineers don’t even have access to any material that could look like it came from the other organization in question. And that it is provable.
Rules for thee but not for me (rich companies). Think of the shareholders!
I think the argument is that the machine is not doing that, or at least there isn't evidence that it is doing that.
Specifically, there is no evidence that GitHub is doing both 1 and 2 at the same time. There might be cases where it makes trivial changes to code (point 2), but only for code that does not meet the threshold of originality. Similarly, there might be cases with copyrighted code where the idea of it is taken, but it is expressed in such a different way that it is not a straightforward derivative of the expression (keeping in mind you cannot copyright an idea, only its expression; using a similar approach or algorithm is not copyright infringement).
And finally, someone has to demonstrate it is actually happening, not just that it could in theory. Generally courts don't punish people for future crimes they haven't committed yet (sometimes you can get in trouble for being reckless even if nothing bad happens, but I don't think that applies to copyright infringement).
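To make the idea/expression point concrete, here is a hedged sketch (both function names and structure are invented for illustration) of the same algorithmic idea, insertion sort, expressed in two textually distinct ways. The idea they share is not copyrightable; only a particular expression could be, and these two expressions are clearly not copies of each other:

```python
def insertion_sort_a(items):
    """One expression: in-place index loop with explicit shifting."""
    result = list(items)
    for i in range(1, len(result)):
        key = result[i]
        j = i - 1
        # Shift larger elements right to make room for key.
        while j >= 0 and result[j] > key:
            result[j + 1] = result[j]
            j -= 1
        result[j + 1] = key
    return result


def insertion_sort_b(values):
    """Another expression of the same idea: build a new list,
    inserting each element at its sorted position."""
    sorted_vals = []
    for v in values:
        pos = 0
        while pos < len(sorted_vals) and sorted_vals[pos] <= v:
            pos += 1
        sorted_vals.insert(pos, v)
    return sorted_vals


data = [5, 2, 9, 1, 5, 6]
assert insertion_sort_a(data) == insertion_sort_b(data) == sorted(data)
```

Both implement the identical idea, yet share essentially no expression; under the tests linked upthread, neither would infringe on the other.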
Maybe, maybe not. It's not as simple as you made it out to be. If you write a book with lots of stuff and you got inspiration from other books, and even put in phrases wholesale, but modified to use your own character names instead, I'm not convinced you would lose.
The court would look at the work as a whole, not single pieces of it.
They would also check if you are just copying things verbatim, or if you memorize a pattern and emit the same pattern - for example look at lawsuits about copying music, where they'll claim this part of the music is the same as that part.
It's really not as cut and dried as you make it out to be.
who gets to copyright claim the various array sorting algorithms then?
The actual answer here, regardless of a court ruling, is that you'd go broke if anyone big enough tried to go after you for it.
Legal protections for source code are still pretty fuzzy, understandably so given how comparatively new the industry is. That doesn't stop lawyers from racking up huge fees though, it actually helps because they need so much more prep time to debate a case that is so unclear and/or lacking precedent.
Literally the bank account behind the action...
You are taking the plaintiff's statement at face value, which is wrong. You can blame the media for not making it clear that it was a statement from the plaintiff.
I don't think it works that way. Over the course of your professional career as a developer you change jobs. Let's say that at every job you create APIs. Beyond the particular functions those APIs provide, the API code itself (how you interact with clients, databases, etc.) will be pretty much the same as whatever you did at previous jobs. Does this constitute copyright infringement, or is it just experience?
My analogy is that if Copilot doesn't reproduce code 100% verbatim from another repository, it is OK for other people to use it, trained as it is on code available on GitHub.