Is training with user-generated content a way to launder copyrighted images? That is, if I upload an image of Iron Man or whatever to my Facebook or Instagram page as a public post and Meta trains their model on that data, is there wording in my user agreement saying that I declare I own the content? That would give Meta plausible deniability when it comes to training with copyrighted material.
(apologies for the run-on sentence - it is early still)
I think Meta is already assuming that there will be no liability for training with copyrighted material. I find it very unlikely that image owners will win the AI training battle.
I'd be extremely surprised if the "Mickey Mouse standing on the moon" example image was a legitimate way to "launder copyright".
The interesting question is just who will be liable for the copyright violation: The party that hosts the AI service? The party that trained it on copyrighted images? The user entering a prompt? The (possibly different) user publishing the resulting image?
I can draw as many Disney characters as I want to and Disney has no recourse as long as I'm not publishing them somewhere.
Posting them on IG, Facebook, etc. is publishing them.
Yes, but importantly, generating them with the AI trained on Mickey is not.
But is publishing a model which can generate images of Mickey a copyright violation? It's definitely a violation if the model is overfitted to the extent that you can, perhaps lossily, extract the original images.
A photocopier can generate images of Mickey. Does that make a photocopier illegal?
A photocopier is extremely different because the user is providing the copyrighted material. In the AI case, it is much more like writing a Google search for the copyrighted material. If I ask an artist to draw a cartoon of Mickey Mouse in violation of copyright, the artist is in violation of copyright if they produce said drawing and give it to me. Are we to give special rights to AI that human artists don't enjoy?
When I've heard people talking about using copyright to defend against AI, they've always talked about it in the sense that their works being used to train the AI is where the copyright violation takes place.
That stance is clearly not supported by copyright law.
If, however, we're talking about copyright violations applying to the distribution of works generated by AI, that's an entirely different conversation. It's still not really clear-cut, but there are ways that could be in violation of copyright law.
It isn't the case that AI is being treated differently, though. The issues would be the same if a human were doing all of this stuff.
Is selling colored pencils that can draw images of Mickey a copyright violation?
The way I see it, the tool can't ever be at fault for its use, unless its sole use (or something close enough to its sole use) is to infringe on copyright.
Besides, the safeguarding of copyright isn't the single variable we as a society should be solving for. General global productivity is way more valuable than guaranteeing Disney's bottom line.
Even then, you could look at a tape recorder or a photocopier and one of their primary uses is to make a copy of a copyrighted work.
The question isn't "can it be used for" but rather "does it have valid non-infringing use" and "when it does infringe, is it the person who uses the tool or the tool that is at fault?"
That is certainly not clear, unless its only purpose was to do that.
I don't think that courts have ruled on that specifically (yet), but I seriously doubt that it would be. Taking the image of Mickey and distributing it would certainly be, though.
True. This is why I think it's pointless to try to use copyright law to defend yourself against AI companies. Right now, anyway, I don't see any law (or any other mechanism) that provides any protection. If I did, I wouldn't have had to remove all of my websites from the public web.
You can't make revenue off those drawings. An AI generator will presumably make money off generating content that violates copyright.
Tattoo artists also make money off generating infringing content all the time. I thought the issue was not in the generation but in the subsequent usage. Outlawing generation borders on thoughtcrime.
Well, that's my point, though.
Are tattoo artists breaking the law by creating tattoos of copyrighted material? I think they are. And if an artist becomes really popular for their Mickey Mouse tattoos, they will probably be noticed by Disney and there will be consequences.
Clearly you're still living in a pre-Neuralink™ world.
MM will be public domain in Jan.
Unless Disney can engineer yet another oppressive extension to copyright durations.
Not going to happen.
When Disney did their copyright extension last time, they had bipartisan influence.
Now Disney is in the middle of the culture war, and there is no Republican that will risk being primaried to support Disney.
Given that you de facto need 60 votes in the Senate, it is not happening.
I guess that's some sort of silver lining to the state of things today!
Still protected by trademark depending on how it’s used.
The author of a novel has a copyright to the contents, but can't trademark the contents of the novel.
The same is true for artwork.
Some early versions will.
Only the first movie, the trademark is not expiring.
Here the problem isn't that the AI was trained on Mickey, but that it generated Mickey. The generated images can still violate copyright if too similar to copyrighted artwork - if published.
I think AI companies are working hard on preventing generated images from being similar to training images unless the user very explicitly asks the result to look like some well known image/character.
It can violate copyright, but just as important, companies have trademark protection on their characters and symbols.
You can violate copyright by intentionally drawing Mickey Mouse, the medium of drawing is not relevant (AI can be considered a medium, as much as a digital camera is a medium)
I don't think this is going to be hard for courts. If you borrow your friend's copy of a copyrighted text, go to Kinko's and duplicate it, then distribute the results, you are the one violating copyright, not your friend or Kinko's.
The same will hold here I think, mutatis mutandis. This is all completely separable from the training issue.
The person getting sued there would be the user of the model, not meta, as much as I wish that wasn't how it is. If you use photoshop to infringe on copyright, you're at fault, not Adobe.
Ultra shitty corporate interests win again...
I don’t agree in this case. Well, maybe I agree on the ultra shitty corporate part. But these are public photos, and if I’d looked at one it could have some influence, probably tiny, on my own drawings. Seems reasonable that the same would be true of my tools.
If they were scanning my private messages, things would be different.
So you think a model trained on only a single copyrighted image would be a violation but one trained on many copyrighted images isn't?
No. I mean two things:
1 - human experience ends up informing human ingenuity. A sketch of Wile E. Coyote comes from someone’s (Chuck Jones?) experience of dogs and seeing coyotes, plus innumerable experiences with things that are funny, constraints from experience of certain features that do or don’t work well on animation cels, etc. Perhaps a stray tweak in his ears comes from a Rembrandt seen as a child, or from a glance at a sketch in progress by the person sitting at the next easel in a drawing class long ago.
In today’s jargon, our experiences are all part of our training set (though today’s massive RNN models are infinitesimal by comparison).
And I think of my tools the same: a ton of inputs stirred together is fine by me.
2 - a difference is that fb’s model is made from public posts: posts offered for anyone to see. In the human case even my private experiences are part of my “training set.”
I don't think any argument in favor of these models that includes reasoning about how humans learn is any good. That's a completely separate process that has very little to do with how these systems work. The issue here is Facebook is creating a commercial system based on data their users have uploaded to their system. If artists had known their work would be used this way, I think they'd rethink using this platform. Facebook's monopoly power over internet content also makes it impractical for you not to have a social media presence if you're trying to make a living as an artist. So you either submit to bullshit like this or damn yourself to obscurity. The fact that it's only training on public content is irrelevant.
It is in big bold letters right in instagram's terms of service: "We do not claim ownership of your content, but you grant us a license to use it."
This isn't about copyright, it is about the fact that most people don't realize that by posting photos, they are licensing those photos.
A lot of the content posted there isn't owned by the people who post it, that's a big part of the problem.
They don't own the copyright, but they do have a "non-exclusive, royalty-free, transferable, sub-licensable, worldwide license to host, use, distribute, modify, run, copy, publicly perform or display, translate, and create derivative works". https://www.facebook.com/help/instagram/478745558852511
The user might upload something that they don’t have rights to.
Technically the user is the one misbehaving, but we, Facebook, and any reasonable court know that users are doing that.
That's why there is a safe harbor provision in DMCA.
Does that provision allow them to build derivative works? When they get a DMCA request, do they retrain the AI after removing the copyrighted work?
Copyright law as it exists today allows one to create transformative works. There is little to suggest that an AI trained on copyrighted works is in any way violating that copyright when inference is run.
Copyright law as it exists allows a creative process to create transformative works.
Computers cannot create copyright. They are not creative. Just because I save your image as webp or jpeg or whatever doesn't mean I have changed the copyright. Just because I zip it up with a hundred other images doesn't mean the zipfile is free of your copyright.
Effectively, computers are executing math, and math by itself does not construct new copyright, since copyright is the result of a creative human process.
As far as I can tell, current AI are fundamentally not too different from wildly complex compression algorithms. You compress a billion images down to a model. The model now can reproduce a fraction or the whole of the copyrighted work with some low probability. Rote and probabilistic compression.
The creator of the AI might own the copyright for what it produces if constructing the AI was suitably creative, i.e. if you construct an AI that trains on random noise and produces images, those are clearly something you, the author of the AI's code, can claim copyright over... But current AI seem like math more than anything else. It's plausible that reinforcement learning or some other part of training does imbue creativity into the process, but that doesn't seem obviously true to me.
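As a back-of-envelope illustration of the "compression" framing above (every number here is an assumption chosen for illustration, not a measurement of any real system), dividing a diffusion model's weights across its training images leaves roughly a byte of capacity per image:

```python
# Back-of-envelope: model capacity per training image.
# Assumed, illustrative figures: ~2 GB of weights, ~2 billion training
# images -- the rough ballpark of publicly described diffusion models.
model_bytes = 2 * 1024**3     # ~2 GB of weights
training_images = 2 * 10**9   # ~2 billion images

bytes_per_image = model_bytes / training_images
print(round(bytes_per_image, 2))  # ~1.07 bytes of weight per training image
```

On numbers like these, verbatim storage of every input is impossible, which is the statistical core of the "doesn't retain the inputs" argument; memorization of heavily duplicated training images can still happen, though.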
Argument "it's just math underneath" is flawed. Photoshop also has math underneath - does it mean that if you use Photoshop, you're not doing a creative process?
Also, saying that it's math ergo it's not creative is something that most people on HN would not agree with.
As for "it's just compression" - compression means that you can recover the original data - perhaps with a loss of quality, but still you can. With modern ML you mostly can't.
The human using photoshop is providing the creativity, and thus the human using photoshop owns the copyright, if they do sufficiently creative work (like actually drawing).
However, when using the current image gen AIs, the input you provide is a sentence of text and a couple parameters, a minimal amount of creativity.
This would be akin to opening photoshop and doing minimal work, such as choosing "resize image, apply blur filter".
If you open photoshop and do a few rote transformations, you indeed have not imbued enough creativity to create a new copyrighted work, the work retains its original copyright if you just open it in photoshop and resize it.
Is creativity a legal concept?
Yes. But the bar for creativity is very low for a work to be considered copyrightable.
See eg https://www.copyright.gov/comp3/chap300/ch300-copyrightable-...
308.2 Creativity A work of authorship must possess “some minimal degree of creativity” to sustain a copyright claim. Feist, 499 U.S. at 358, 362 (citation omitted). “[T]he requisite level of creativity is extremely low.” Even a “slight amount” of creative expression will suffice. “The vast majority of works make the grade quite easily, as they possess some creative spark, ‘no matter how crude, humble or obvious it might be.’” Id. at 346 (citation omitted).
"the input you provide is a sentence of text and a couple parameters, a minimal amount of creativity"
Have you tried creating art with AI? Usually it takes hundreds of iterations of text-to-image, image-to-image, inpainting, outpainting using dozens of different models.
"A sentence is all it takes" is like saying all it takes to make a million is crossing some numbers on a grid.
By the mere act of using Photoshop, no. By the act of providing your own inputs to Photoshop, yes.
You’re confusing the training process with inference, you’re confusing the copyright status of a model with the copyright status of the model output, and you’re confusing compressed data with a compression algorithm.
The VAE can be thought of as a codec, but the denoising process can recover images that are far removed from anything that is in the training data. Nobody has ever created an impressionist painting of Winston Churchill riding a purple lizard through the gates of retrofuturist Constantinople, yet almost infinite variations of that image exist in the latent space. If anything, it can be thought of as an intricate form of collage, which we do give special treatment for copyright purposes.
If they didn't have that (or something similar) they couldn't serve the image to other users. Well, they could, but without something like that someone will sue them for showing a picture they uploaded to someone they didn't want to see it (or any number of other gotchas).
They store the image or video (host/copy), distribute it over their network and to users (use/run), they resize it and change the image format (modify/translate), their site then shows it to the user (display/derivative work), and they can't control the setting in which a user might choose to pull up an image they have access to (the "publicly" caveat).
It sounds like a lot, but AFAIK that's what that clause covers and why it's necessary for any site like them.
It certainly does cover the needs of hosting and display to other users, but it doesn't permit just that. It's expansive enough to let them do just about anything they could imagine with the pictures.
Only insofar as legal precedent has established it to mean that. If someone sues you for a use that hasn't been found in court to fall under this clause it will be more difficult to win that case.
IANAL, and my jargon may be off, but I think that in the scenario where you get sued for something that's been litigated to fall under this clause in the past, you can basically say "even if we assume the evidence and claims are accurate, it's obviously in the clear based on prior cases", if the judge agrees, you win without going to trial, which is a "summary judgement" I think.
On the flip side, if someone is trying to apply the clause in a novel, not previously litigated way, you're way less likely to get that summary judgement and it will have to be argued in court.
It works the other way too, if I wrote a eula that used different phrasing than what's been established prior, say to make it more obviously cover just the normal stuff for user uploaded images, summary judgement is less likely to succeed because no court had ever weighed in on my novel phrasing as covering those actions in that way.
There's also the risk that if you make the phrasing too narrow (specifying resizing of the image) then when a new tech comes along that's reasonable to apply (e.g. some ML process to derive a 3d scene from images, or make them) exactly zero user uploaded images you store at that point could benefit from that until you go back and ask the user to agree to that too. The question then becomes how worth is narrowing the wording when you can accidentally paint yourself into a corner.
Or how about if it had been phrased "display on a monitor" years back, in the pre-smartphone era? You could be sued for making user-uploaded media available to view on phones, since that wasn't in the license granted to you by your users!
When you cover all the little edge cases, you end up with the seemingly overbroad clause most companies use.
An important thing to remember is that the legal interpretation of a text can differ almost arbitrarily from the plain English meaning of the text as written.
Training generative ML tools is qualitatively different from showing on website, even if both are technically “derivative works”, so this is a massive bait-and-switch. Is it the first time something is acceptable by the letter of pre-existing law but not the spirit?
Well... no. It happens every time Google et al. find a new way to use your data. It's what all of us German "privacy nuts" have warned people about for years, and it's the reason that the older German data protection laws and now EU regulations require you to state exactly what you are doing with data ("purpose limitation"). If companies can just write "oh well, we will use it for something", how can anyone evaluate whether they should accept without knowing the future? Right. They can't.
So, this could be another case of the EU kicking Facebook in the face. We'll see.
You're just stating an agreement between Meta and cowboyscott. The copyright holder of the Iron Man image never agreed to it.
The problem here is that cowboyscott doesn't own the copyright to the Iron Man image. But his uploading of the image may fall under fair use in the US, or a similar copyright exemption in his country's copyright law. It effectively works as copyright laundering.
Do we even do Fair Use in the US anymore?
Judging by DMCA takedowns, it seems like it isn't a thing any longer.
You don't even really need the middleman - Disney has surely uploaded pictures of Iron Man to these sites, so it would have them either way.
But I don't know if it's really laundered anything. If you say "Hey Meta AI, make me a poster for my cookie company that has Iron Man eating my cookies", I'm pretty sure Disney could still sue you. It could still sue you if you instructed a human to draw a picture that had Iron Man in it, so I don't even know if you need a new legal framework.
You forgot the "in perpetuity" /s
This is why you don't also get the music when you download Stories - there's no such agreement with Spotify.
When an image is uploaded, is it re-licensed?
So if you delete your image the entire trained data set is invalid because they no longer have license to the copyright?
If having copyright were a prerequisite of training data this would be true.
But in the US this hasn't been tested in the courts yet, and there's reason to think from precedent this legal argument might not hold (https://www.youtube.com/watch?v=G08hY8dSrUY - sorry don't have a written version of this).
And the lawsuits so far aren't faring well for those who think training should require having copyright (https://www.hollywoodreporter.com/business/business-news/sar...)
I would imagine if we use a very strict interpretation of copyright, then things like satire or fan-fiction and fan-art would be in jeopardy.
As well as learning, as a whole.
Unless there is literally a substantial copy of some particular piece of copyrighted material, it seems to be a massive hurdle to prove that analyzing something is copyright infringement.
Most people in the fanfiction community recognize that it's probably not strictly allowed under copyright. However, the community response has generally been to do it anyway and try to respect the wishes of the author. Hence why you won't find Interview with a Vampire fanfiction on the major sites.
If anything, I think that severely hinders the pro-AI argument if fanfiction made by human authors are also bound by copyright.
ETA: I just tested it out and you can totally create Interview with a Vampire fanfiction with Bing Compose. That presumably is subject to at least as strong copyright as human authors and is thus a copyright violation.
I would suggest also a read of https://en.wikipedia.org/wiki/Copyright_protection_for_ficti...
Creating a work using Harry Potter or Darth Vader or Tarzan ("As of 2023, the first ten books, through Tarzan and the Ant Men, are in the public domain worldwide. The later works are still under copyright in the United States.") is a copyright infringement.
You may also find https://www.hollywoodreporter.com/business/business-news/dc-... interesting as well as the entire legal saga of Eleanor.
---
Creating Interview with a Vampire fan fiction with Bing - Bing didn't have any agency. The question of copyright infringement (I believe) should be only applied to entities with agency to (or not) ask for copyright infringing works.
Transformative works are a thing:
https://www.transformativeworks.org/faq/#:~:text=investments...
https://www.transformativeworks.org/faq/#:~:text=Open%20Door...
That’s the output of the model, it doesn’t have much bearing on the copyright status of the model.
The difference is that when writing satire it's not strictly necessary to possess the work. You can merely hear of something and make a joke or a fake story. Training data, on the other hand, uses the actual material, not some derivative you gleaned from a thousand overheard conversations.
Satire, criticism, reviews and journalism are explicitly permitted under fair use.
If I wish to publicly express my disdain or praise for your art, it is necessary that I can show samples / pictures/ photos when I express whatever my deal is.
Let's say you post an image, and I learn something by viewing it, then you delete the image. Is my memory of your now deleted image wiped along with everything I learned from viewing it?
Unfortunately, computer memory, unlike your memory, is easily wiped. Having the infrastructure in place to make sure it happens, on the other hand, seems more like human memory.
I have seen plenty of images on the internet where I would gladly accept this as a thing. Unfortunately, what's been seen can't be unseen.
Now that is a multi-million dollar question.
How derived data is handled after copyright is revoked is a question that's hard to answer.
I suspect that the data will be deleted from the dataset, and any new models will not contain derivatives from that image.
How legal that is, is expensive to find out. I suspect you'd need to prove that your image had been used, and that its use contradicts the license that was granted. It would take a lot of lawyer and court time to find out. (I'm not a lawyer, so there might already be case history here. I'm just a sysadmin who's looking after datasets.)
postscript: something something GDPR. There are rules about processed data, but I can't remember the specifics. There are caveats about "reasonable"
s/m/tr/
Huh? I think you want s/(?:m[^m]*)m/tr/
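A quick sketch of why the correction is needed, assuming the sed is aimed at the "multi-million dollar question" upthread: `s/m/tr/` replaces only the first `m`, while the non-capturing group skips ahead to the second one.

```python
import re

s = "multi-million dollar question"

# naive s/m/tr/: sed's s/// replaces only the first match,
# so the leading 'm' of "multi" gets hit instead of "million"
naive = re.sub(r"m", "tr", s, count=1)
# -> "trulti-million dollar question"

# corrected pattern: consume everything up to and including the
# second 'm', so "multi-m" becomes "tr" and "million" -> "trillion"
fixed = re.sub(r"(?:m[^m]*)m", "tr", s, count=1)
# -> "trillion dollar question"
```

(Note `(?:...)` is the non-capturing group syntax; `(:?...)` would be an ordinary group starting with an optional colon.)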
The portion of the training set might. The actual trained result -- the outcome of a use under the license -- would, at least arguably, not.
Of course, that's also before the whole "training is fair use and doesn't require a license" issue is considered, which if it is correct renders the entire issue moot -- in that case, using anything you have access to for training, irrespective of license, is fine.
Yeah, "derivative works" in this case AFAIK was always meant as "we can generate thumbnails etc." and not "we will train our AI with it". I am pretty sure this is illegal in many countries...
Another method of copyright laundering is doing the ML training in a country where it isn't protected under copyright law.
Personally, I'm on the side that using copyrighted data as machine learning input doesn't violate copyright. Statistically, the learned model for generative AI doesn't retain even 1 bit of any given input. It's hard to say the NN model data infringes any copyright of the input source. Copyright applies to the expression, not the process. If the generative AI produces an image that's clearly a copy of a specific Iron Man image which existed before the image generation, that's copyright infringement.
"learned model for generative AI doesn't retain even 1 bit of input". If that were true, it shouldn't be possible to trick the models into regurgitating their source material, but clearly that is possible [0].
[0] https://stackdiary.com/chatgpts-training-data-can-be-exposed...
LLMs are quite different from diffusion models. The model size vs. training set size ratio is skewed the other way.
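To make the "skewed the other way" point concrete, here is a rough sketch with assumed, illustrative figures (not measurements of any particular model):

```python
# Ratio of model weights to training data, LLM vs. diffusion model.
# Every number below is an assumption chosen only to show the
# order-of-magnitude gap, not a measured fact about any real model.
llm_weight_bytes  = 140e9   # e.g. a 70B-parameter model in fp16
llm_data_bytes    = 60e12   # ~15T training tokens at ~4 bytes of text each
diff_weight_bytes = 2e9     # a ~1B-parameter diffusion model in fp16
diff_data_bytes   = 200e12  # ~2B images at ~100 KB each

llm_ratio  = llm_weight_bytes / llm_data_bytes    # ~2.3e-3
diff_ratio = diff_weight_bytes / diff_data_bytes  # ~1e-5

print(llm_ratio / diff_ratio)  # the LLM has ~200x more capacity per data byte
```

On these assumed numbers, an LLM has far more weight per byte of training data than a diffusion model, which is one reason verbatim regurgitation is easier to demonstrate for LLMs.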
It does. The data is just obfuscated.
I agree with you, but I think the argument is flawed. If you think about it like this, h.265 also just "steals" 10% (or whatever the compression ratio is) of an artifact.
Copyright doesn't require a single bit of input to be shared. You can't avoid copyright by using a paintbrush for example, you're simply creating a derived work. You might still be in violation even if you create an entirely new context around the copied elements or substitute for the original in the market, as was the case with Warhol vs. Goldsmith.
Obviously not every generative output is a copyright violation, but it seems equally clear that there are outputs that would be if they were produced by humans.
Doubt it. If you upload child porn to Instagram and they distribute it - it's still an Instagram problem, AFAIK.
Child porn is not a copyright issue, so the DMCA safe harbor for UGC doesn't apply, and it's criminal, so the Section 230 safe harbor doesn't apply either - so it's very much not an applicable example as to whether use of UGC in other contexts is a way of leveraging safe harbor protections for content, whether for copyright or more generally.
It's still an Instagram problem if someone uploads copyrighted info and Instagram distributes it...
It literally/legally isn't and is one of the reasons US is king for hosting services like IG. Read Section 230.
As long as Instagram follows the DMCA and takes it down, they're covered by Section 230, so I don't know if it's a problem per se.
"Is training with user-generated content a way to launder copyrighted images?" Pretty much.
You are very, very unlikely to stumble upon something resembling a training image closely enough for copyright to take effect, and in any event this is not the purpose of these systems. You may be running into trademarked content, but in that case you cannot speak of laundering, because you cannot use a trademark even if the image is AI generated.
"You are very very unlikely to stumble upon something resembling a training image closely enough for copyright to take effect" That is definitely not the case, and is completely contingent on the prompt matching closely what the training set has in it.
I think you misunderstand copyright and perhaps conflate it with a trademark. A given prompt may yield a result closely resembling some copyrighted work, but that in and of itself does not violate copyright. Getting a nearly identical result is very unlikely, perhaps with enough tries on a very famous painting. Even then, that is not the purpose.
At this point all big players assume it's okay to train on copyrighted materials.
If you can[0] crawl materials from other sites, why can't you crawl from your own site?
[0]: "can" in quotes
Because your users have agreed to terms of service that don't mention analyzing the images to train an AI model.
If their legal assumption is it's not a copyright violation to train a model on some image, then it's logical that their ToS doesn't mention it, as they need the user's permission only for the scenarios where the law says that they do.
Within Polish (European?) legislation, an agreement on use of copyright needs to explicitly state in what areas you are allowed to copy/use the copyrighted work. So, e.g. if an agreement didn't explicitly state that a company can use the work in TV (or Radio, or sth), then they don't have the right to do so.
When new mediums are invented (like internet), you need to sign an annex to the agreement extending it to this medium.
Having said that, I would still consider it a fair use to train model on given images, but using the trained model to replicate a specific style etc, would most likely be considered a new medium. (IANAL though)
What about all the photos of people at Disney taking pictures of themselves standing next to Mickey Mouse etc.?
I don’t think there’s a question that people are allowed to upload photos like that.
Technically, that's a copyright violation. Disney just opts not to enforce their rights for that sort of use.
Similarly, you technically can't take and post pictures of statues, paintings, some buildings, etc., and some rightsholders do enforce their copyright when people do those things.
Things outside the scope of fair use would be within Disney’s rights to restrict, but given the actual public policies and guidance on photography at Disney parks, I think there is a very strong case that noncommercial photography (for people present as paid guests) is permitted by implied license.
Well, not buildings if they are in or visible from a public place in the US, at least under copyright law. (Photography of some, particularly government, buildings may run afoul of other law.) This may be different in other countries.
Ahhh, you're correct. This was apparently changed in 1990. I just hadn't updated my mental model in accordance with that change.
https://www.nolo.com/legal-encyclopedia/copyright-architectu...
It's not a legal way to "launder" copyrighted images, because for things where copyright law grants exclusive rights to the authors, they need the author's permission, and having permission from someone and plausible deniability is not a defense against copyright violation - the only thing that it can change is when damages are assessed, then successfully arguing that it's not intentional can ensure that they have to pay ordinary damages, not punitive triple amount.
However, as others note, all the actions of the major IT companies indicate that their legal departments feel safe in assuming that training a ML model is not a derivative work of the training data, they are willing to defend that stance in court, and expect to win.
Like, if their lawyers wouldn't be sure, they'd definitely advise the management not to do it (explicitly, in writing, to cover their arses), and if executives want to take on large risks despite such legal warning, they'd do that only after getting confirmation from board and shareholders (explicitly, in writing, to avoid major personal liability), and for publicly traded companies the shareholders equals the public, so they'd all be writing about these legal risks in all caps in every public company report to shareholders.
I think the move will be to argue fair use, declaring the derivative work to be transformative, and possibly to point out that only a small amount (1%-3%) of the original data is retained.
It seems like this is still very much a legal gray area. If it's concretely decided in court that generative AI cannot produce copyrighted work then I assume it makes no difference what the source of the copyrighted training material was.
Training on copyrighted content isn't a copyright violation. Sarah Silverman is currently learning that the hard way.
It is no different from an actual live artist learning from the works of others.