Isn't it a bit hypocritical of them to use other people's copyrighted work or "output" they weren't given license/permission to use, but deny people the same opportunity with their work/output?
I'm genuinely wondering how this is different from them using others' work without consent, but IANAL, so maybe I'm just confusing morality and legality.
The fun thing is that, IIRC, AI outputs aren’t copyrighted, so this is actually more ethical than what OpenAI did. The only thing is that it was probably against the terms of service.
Why is that? All works, as long as they are "original", should be granted copyright. It should belong to the person who made the work (in this case, the user, not OpenAI).
What if the AI is fully autonomous, pays its own hosting bill etc? Who gets the copyright?
No one, because according to current legislation, copyright only applies to works by humans.
Obvious new business model: provide a human shill service where you look at AI output and say "yep, I made that". Now it's copyrighted, with the copyright assigned to the customer of your startup.
So you're suggesting that people commit fraud, and base their business around committing fraud? I don't see that being super scalable.
Of course, people can always lie, but they also can be caught lying.
It's not that cut-and-dry.
If I say to one of my human friends "write me a nursery rhyme", the copyright of the resulting rhyme would obviously belong to my friend - despite me prompting them. Clearly the prompt itself does not universally count as "making" it.
Let's say I made a "NovelSnippetAI", which contains a corpus of prewritten material. You can prompt it, and it will return a page which matches the sentiment of your prompt best. I think we can agree that the copyright of the page will still belong to the original writer - the user only did a query.
What if I did "NovelMixAI", which did exactly the same but alternated lines from the two best matches? What about "NovelTransformAI", which applied a mathematical formula to the best match and fed the output to a fixed Markov Chain? Now we're suddenly at "LLMAI", which does the same using a neural network - what makes it different from the rest?
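To make that spectrum concrete, here is a toy sketch in Python. Everything in it is made up for illustration: the two-page corpus, the crude word-overlap "similarity", and the line-alternating mix; a real system would use embeddings and a far larger corpus.

    import itertools

    # Hypothetical two-page corpus standing in for "NovelSnippetAI"'s material.
    CORPUS = [
        "the cow jumped over the moon\nthe dish ran away with the spoon",
        "twinkle twinkle little star\nhow i wonder what you are",
    ]

    def similarity(prompt: str, page: str) -> int:
        # Crude word-overlap score; a real system would embed both texts.
        return len(set(prompt.split()) & set(page.split()))

    def novel_snippet(prompt: str) -> str:
        # "NovelSnippetAI": pure retrieval, returning the single best match.
        return max(CORPUS, key=lambda page: similarity(prompt, page))

    def novel_mix(prompt: str) -> str:
        # "NovelMixAI": alternate lines from the two best matches.
        a, b = sorted(CORPUS, key=lambda page: similarity(prompt, page))[-2:]
        pairs = itertools.zip_longest(a.splitlines(), b.splitlines(), fillvalue="")
        return "\n".join(line for pair in pairs for line in pair if line)

    print(novel_snippet("star light star bright"))
    print(novel_mix("the moon and the star"))

At the retrieval end the user is clearly just querying; each step toward "LLMAI" adds transformation, and it's not obvious at which step the copyright situation changes.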
Long-standing precedent is that any automated work does not qualify for copyright. You can only copyright human work.
> If I say to one of my human friends "write me a nursery rhyme", the copyright of the resulting rhyme would obviously belong to my friend - despite me prompting them. Clearly the prompt itself does not universally count as "making" it.
This is the wrong analogy. The LLM is more like Photoshop and the prompt little more than a filter configuration. A machine cannot copyright its own output but a human guiding that machine can.
It is more like a child's speak-and-say toy. Push a button and a cow noise comes out. Someone else pushes it and the same noise comes out. Even if a billion buttons existed, the child wouldn't own the cow noise.
No, it's the right analogy for the point I was making. That analogy is there to point out that it is about more than just the prompt, and prompting a machine is covered in the rest of my comment.
Where the line is, i.e. how much human creative input is needed for a work to be covered by copyright, seems legally unclear. There are some interesting parallels to this dispute, I think: https://en.m.wikipedia.org/wiki/Monkey_selfie_copyright_disp...
Machine-generated works are not eligible for copyright, and courts have been ruling this way about AI content. https://www.reuters.com/legal/legalindustry/who-owns-ai-crea... Early days and there are still appeals as well as new legislation in progress, but that's where it stands now.
In the monkey selfie copyright dispute, Slater could not claim copyright because he did not operate the camera. Corporate personhood, or juridical personality, is the legal notion that a juridical person such as a corporation, separately from its associated human beings (like owners, managers, or employees), has at least some of the legal rights and responsibilities enjoyed by natural persons. If Mr. Slater were incorporated, would his copyright ownership be clear? If an AI were a corporation, not an asset of a corporation, and could prompt itself, would the output then be copyrightable? There are a lot of ifs in there, but it's still interesting.
IANAL, but I would guess it depends on whether the animal is an "employee" of the company. Obviously there are animals that are owned by companies and, if they create something, the company I assume would own it whether or not it was considered a work for hire in the usual sense. But that would presumably not have been the case here.
Generative AI has basically made what were once fairly irrelevant edge cases (what if I tie together a bunch of random number generators to create artwork?) a lot more interesting. And laws will probably have to be adapted.
That's the contention: it's not a person who made it.
no AI output is "original"
Should have read the small print.
The irony. AI would help with that too! :D
I asked ChatGPT and it said it was fine.
Yes, but we're doing the same thing with industrialization. The US is past the stage of factories, so we'll deny everyone else factories because we now deem them too polluting.
Not really and I’m suspicious you know these are not directly analogous.
We’re not trying to curb emissions for the purpose of kneecapping other economies. Short of China, we don’t really have any incentive to do that (bigger markets = better under US industrial policy [contradictory opinions from left-wing undergrad students on Twitter don’t count as “US industrial policy”]). What’s actually happening is that we caused a problem and now it’s getting worse, and in order to fix it we need to not allow everyone else to continue it.
This is a novel and advanced philosophical argument called, “two wrongs don’t make a right.”
It could be seen as a non-trade tariff barrier if you squint quite hard.
There's another example - intellectual property. The US was fine playing fast and loose with IP (Most famous example is Dickens' attempts to point out he was being pirated left and right in the States and not seeing a penny: https://www.charlesdickenspage.com/copyright1842.html)
Now that America is on top of the IP pile, it sees other nations as playing fast and loose: https://www.forbes.com/sites/johnlamattina/2013/04/08/indias...
Huh I wonder if there have been any substantial changes to international cooperation, investments made thereupon, and agreements made thereupon between the 1850s and 2013.
Countries don’t have to join international trade regimes. They also don’t have to join climate/emission commitments. They do both of them because they come with benefits.
Cursory searching suggests the first real work on international copyright, by contrast, came about in 1886. Even early versions came after Dickens’ story here.
You must understand that we have to hold China to a 19th century standard while holding western nations to a 21st century standard because... uh.. reasons. Don't question it!
What? Absolutely question it. Let me know if you find an answer that’s significantly more believable than, “trying to balance local quality of life, long term environmental and economic viability, and short term economic prosperity and political stability.”
If you have a different balance to strike that you think is significantly and obviously better, I’m sure the whole world is interested in hearing it.
He was being sarcastic.
I’m aware. The implication of the sarcasm is there’s no good reason China and western countries have different standards, so that’s what I was addressing.
We share the same planet, so we need to share the same environmental standards. The self-touted "world's oldest civilization" no less than the rest.
It’s a great ideal, but there are interests other than purely environmental (or purely “fairness”) that must be taken into account as a matter of sheer necessity.
You appealed to fairness, not me.
It’d be easier to have productive conversations if you just plainly stated your position on the topic at hand, if you have one.
Did I not?
All of the excuses to hold China to a different standard are horse shit. Is that stated plainly enough for you?
Huh, so now I’m confused as to how you claim not to be invoking fairness
They're aiming to hit net zero before 2060. It's... ambitious (which I use here as a synonym for unlikely). The USA and EU are aiming for 2050.
Will anybody meet their goals? The planet is the ultimate Commons. Personally, I think we're boned.
Well obviously, change is constant.
History doesn't repeat, but it does rhyme. Look at the shape of the story: the people on top support rules that, completely coincidentally, help keep them on top. It's a universal impulse.
"It is difficult to get a man to understand something when his salary depends on his not understanding it." (Upton Sinclair) is another example with a similar shape, but at the scale of individuals.
John Rawls had the right idea.
One wrong gets punished, and for the other: "we are exceptional", so we do as we please, and don't fuck with us or else.
Morality and legality aside, there's a substantive difference between use of content and use of a model. Pretraining a GPT-4-class model from raw data requires trillions of tokens and millions of dollars in compute, whereas distilling a model using GPT-4's output requires orders of magnitude less data (a sketch of what distillation means follows below). Add to that the fact that OpenAI is probably subsidizing compute at their current per-token cost, and it's clearly unsustainable.
The morality of training on internet-scale text data is another discussion, but I would point out that this has been standard practice since the advent of the internet, both for training smaller models and for fueling large tech companies such as Google. Broadly speaking, there is nothing wrong with mere consumption. What gets both morally and legally more complex is production - how much are you allowed to synthesize from the training data? And that is a fair question.
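For readers who haven't seen the technique, here is a minimal sketch of classic logit-matching distillation, assuming PyTorch. "student", "teacher", "batch", and "optimizer" are placeholders; note that an API-only distiller never sees the teacher's logits and would instead just fine-tune on sampled text, but the underlying idea is the same.

    import torch
    import torch.nn.functional as F

    def distillation_step(student, teacher, batch, optimizer, temperature=2.0):
        # Soft targets from the teacher; no gradient needed on its side.
        with torch.no_grad():
            teacher_logits = teacher(batch)
        student_logits = student(batch)
        # KL divergence between temperature-softened distributions; the T^2
        # factor keeps gradient magnitudes comparable across temperatures.
        loss = F.kl_div(
            F.log_softmax(student_logits / temperature, dim=-1),
            F.softmax(teacher_logits / temperature, dim=-1),
            reduction="batchmean",
        ) * temperature ** 2
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

Either way, the point stands: the student piggybacks on the teacher's compressed knowledge instead of paying the pretraining bill.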
“Content” requires as much effort and expense as pretraining GPT-4, if not more.
All you’re doing is redefining content (i.e. thoughts, ideas, movies, videos, literature, sounds, writing, etc.) as “raw data”. But that isn’t raw data. A ton of effort went into creating the “content”. For example, a single Wikipedia page may have had many hundreds of contributors, some of whom have done years of college-level study and original research, to produce a few thousand words of content. Others have done research using primary sources. All of them have had to use effort and ingenuity to craft those into actual high-quality statements, which in itself was only possible in many cases due to years of training and education. Finally, they had to set up a validation process, including loads of arguments etc., to produce useful output from this collaborative process and generate what you are calling “raw data”.
I’m not sure what makes GPT’s output any less raw than all the effort that went into producing a single Wikipedia page. Further, Wikipedia actually goes out of its way to cite its sources. GPT is designed to go out of its way to obscure its sources.
The only thing GPT does differently, IOW, that apparently makes the data it uses “raw”, is not citing its sources: something that would at the very least lead to professional disgrace for the people who created the “raw data” GPT uses without a thought, and would even lead to lawsuits and prosecution in many cases.
So besides going out of its way to obscure the source of its data, what makes GPT’s output less raw than the output people have spent billions of man hours creating?
Except that the content already exists and there is no cost to maintain it.
If GPT incurred a non-negligible cost on the content owners by accessing their resources, it might have been different, but that's not the case.
The only thing that content owners may be able to complain about is that ChatGPT/DALL-E may potentially reduce their income, and this would have to be proven. I have not stopped buying books or art of any kind since I started using ChatGPT/DALL-E. And low-quality automated content producers existed before OpenAI and were already diluting the attention given to more carefully produced content (as can be seen with videos on YouTube).
That's great to hear that there's no cost to maintaining content! I'll tell AWS they've been overcharging me :)
Not what I am saying. I am saying it is much, much smaller than the inference/model-running cost. Easy exercise:
How many books can you store in 1 GB? How much does it cost per year to store them and have OpenAI gather them once? How much does it cost to run a GPT-4-level model that will output 1 GB? (Rough numbers are sketched below.)
That's my point here, that's all. It is a huge cost for OpenAI to run a system that produces dynamic content, and it is not comparable to the cost of storing static content.
I didn't talk about the cost of producing the original data.
And I do not talk about training costs.
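Taking the exercise at face value, a rough back-of-envelope comparison in Python. Every figure below is an assumption (storage and API prices vary), but the gap is several orders of magnitude regardless:

    BYTES_PER_BOOK = 1_000_000           # assume ~1 MB of plain text per book
    STORAGE_USD_PER_GB_MONTH = 0.023     # typical object-storage pricing
    CHARS_PER_TOKEN = 4                  # common rule of thumb for English
    USD_PER_1K_OUTPUT_TOKENS = 0.06      # GPT-4-era API output pricing

    GB = 1_000_000_000
    books_per_gb = GB / BYTES_PER_BOOK                    # ~1,000 books
    storage_per_year = STORAGE_USD_PER_GB_MONTH * 12      # ~$0.28 per GB-year
    tokens_per_gb = GB / CHARS_PER_TOKEN                  # ~250M tokens
    generation_cost = tokens_per_gb / 1_000 * USD_PER_1K_OUTPUT_TOKENS

    print(f"storing 1 GB (~{books_per_gb:,.0f} books) for a year: ${storage_per_year:.2f}")
    print(f"generating 1 GB at GPT-4-class prices: ${generation_cost:,.0f}")  # ~$15,000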
If cost is your primary concern, shouldn't you support ByteDance's efforts to reduce inference costs by distilling the model?
(while at the same time reducing future costs for everyone by distributing the capability more widely to prevent monopolization)
Sure, but your comment said "maintain", not "store". Even if storage were free, and even if you discount the value of the initial creation to zero, there are still nontrivial serving costs associated with many sites. What I share with people on the Web may look like a static byte sequence to the robots consuming it, but it takes a lot of work to compute those bytes (in the moment, I mean). Aggregated over the whole web, no, that is not smaller than OpenAI's expenditures.
It seems like you have no idea how much effort it takes to write a book.
Quite often it contains a person's life experience condensed into a few hundred pages.
ChatGPT gives easier access to the knowledge contained in tens of thousands of these books. As for me, I have been reading fewer and fewer books as more wisdom becomes accessible on the internet in better forms (now GPT).
I'm not against what OpenAI is doing as it moves humanity forward, but like you said I won't stop using ChatGPT just because ByteDance scrapes it.
The effort and resources required to train from raw data are nothing compared to the effort and resources that went into producing the "training" input. How much does it cost to produce all the things they scraped from the internet? So morally they are in the wrong; I don't care whether it's standard practice since "the beginning of the internet" or not.
It’s also not standard practice since the beginning of the internet. Referencing original input through links is almost foundational to the internet (at least the original internet).
In fact, the power of linking to data sources is what Google is almost entirely built upon.
And millions of documents authored by people that weren't compensated.
The difference is consolidating all of that value into a single company.
Others have already pointed out that you’re just shrugging off billions of hours and money that went into the content that is used to (pre-)train a model, so I’ll leave that for what it is.
I’m just curious how you start off with:

> Morality and legality aside
Only to then follow it up immediately with an argument for why one is more moral. Just because you didn’t end it with “and that’s why I think OpenAI is more moral” doesn’t make it any less obvious, or any less of an irony.
Morality and legality are the only relevant questions in the discussion. The two methods are virtually the same... in fact, I'd argue that ByteDance's usage is more fair and moral. It really doesn't matter that it's cheaper and more efficient.
The cost of hiring humans to write the trillions of tokens they trained on from scratch would surely be much larger than the training cost (rough numbers are sketched below). Except they avoided that cost by using what's available on the Internet. [1]
Similarly, people are avoiding the cost of pre-training GPT-4 class model by scraping its output.
So I think it's fair to question the moral consistency of their ToS.
[1] Please note that I am not passing a judgement on this, just stating a fact in order to make an argument.
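To put loudly assumed, illustrative numbers on that claim (none of these figures are known; they are rules of thumb and order-of-magnitude guesses):

    TOKENS_IN_CORPUS = 10e12     # ~10T training tokens, a plausible frontier scale
    WORDS_PER_TOKEN = 0.75       # common rule of thumb
    USD_PER_WORD = 0.10          # a modest freelance writing rate
    PRETRAINING_COST = 100e6     # USD; an often-cited order of magnitude

    authoring_cost = TOKENS_IN_CORPUS * WORDS_PER_TOKEN * USD_PER_WORD
    print(f"paying humans to write the corpus: ~${authoring_cost / 1e9:,.0f}B")  # ~$750B
    print(f"compute for pretraining:           ~${PRETRAINING_COST / 1e6:,.0f}M")

Even if the per-word rate is off by 10x, authoring dwarfs compute.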
Do we know what data was used? And what the constraints were around it?
Do we know it was used without permission, or are we just jumping on the "AI bad" bandwagon?
Yes, we've known for a long time that they don't shy away from taking any old code on GitHub and regurgitating it without explicit permission. They don't have benefit of the doubt anymore.
It’s not as cut and dried as you’re making it out to be.
For years we’ve accepted that search engines, for example, can grab all the code on GitHub and use it to build a search index.
Google image search, in particular, ‘regurgitates’ all the images it has indexed when it thinks they match a search term.
It has a little disclaimer it shows next to the results saying that ‘images may be copyrighted’ - figuring out if they are copyrighted and if so by whom is left as an exercise for you the searcher. Depending on what you are using the image search for, the copyright of the images may, after all, not be relevant. Like, if you’re using a Google image search to get inspiration for home decor designs, do you care who owns the copyright of each image? Should Google?
GPT poses similar risks to that. It has the explicit disclaimer that things it produces might be subject to copyright. Depending on what you’re using the output for, the copyright may or may not be relevant.
There seems to be a fairly clear distinction between importing copyrighted material to make an index for the narrow purpose of directing people towards the copyrighted material at its original location in its original form, and importing copyrighted material to make an index which improves their own service's ability to generate unattributed derivative work. It's a bit muddier when it comes to things like image searches, but they're not exactly difficult to opt out of.
Google actually pays license fees to News Corp to excerpt their content in Google News following a legal challenge, so it's not exactly conclusively established that search engines have global rights to do what they do anyway. But search engines are mostly beneficial to content creators rather than mostly competitive with them.
Google does its best, but there are limits to “directing people towards the copyrighted material at its original location in its original form” - not everything in the intellectual property world is ‘internet native’. The original form of a song lyric or a movie screenplay doesn’t have an ‘original location’ you can be directed to. You can be directed to various online sources that may or may not accurately reproduce the original, and may or may not be scrupulous about attribution, and may or may not have legitimate copyright license to distribute it in the first place.
Yes, Google will often unwittingly point to other people's copyright violations (it's useful like that!) and will usually only take down the cache/link when requested to do so by the copyright holder.
This is irrelevant to the original point: the purpose of a search engine is to highlight rather than replace existing information sources, and OpenAI's purpose for indexing content, along with its policy of not engaging with copyright holders, is completely different.
No, but their products regurgitate copyrighted material so I guess they either come clean about the data they used or we have to assume they stole it.
I can quote the entire script of Monty Python and the Holy Grail - would you assume I stole it?
No, but then, as you're (presumably) a person rather than an information retrieval system, you would be legally responsible for ensuring you had performance rights and paid royalties if you were quoting that script as part of a commercial service. That responsibility rests with you, not with whoever gave you access to the film.
Conversely, photocopiers and text-to-speech engines and LLMs don't exercise choice over whether they reproduce copyrighted material and so can't be held responsible, so responsibility for clearing rights to redistribute/transform in that format clearly lies with the people inputting the copyrighted material. Obviously, OpenAI has tended to avoid making any attempts to secure those rights whatsoever
Most libraries have photocopiers for their patrons to use. It’s their patrons’ responsibility to determine if any copying they do is permissible under fair or personal use rules. The library doesn’t know what you’re planning on doing with the information they shared with you.
Few libraries use the scanner themselves to make a digital copy of every single book to import into their proprietary information retrieval service.
The ones that do secure permission where the works involved are subject to copyright.
> Do we know what data was used? And what the constraints were around it?
The fact that the question/accusation has been raised a great many times and they have not stated "we know we haven't used information without licence, because we had procedures to check licensing for all the data used to train our models" certainly implies that they scraped data for training without reference to licensing, which makes it very likely that the models are based significantly on copyrighted and copyleft-covered information.
> Do we know it was used without permission
No, we don't know for sure. But the balance of probabilities is massively skewed in that direction.
There are enough examples of image-producing AIs regurgitating obvious parts of unlicensed inputs to indicate the common practice of just scraping everything without a care for permission. So asking those with other models to state how they checked for permission for the input data is reasonable.
I have not yet seen a definite conclusion that training a model equals breaching intellectual property rights. There is always the right of citation, and as long as the model is not producing copies of copyrighted material, where then is the violation?
IP rights do not per se give the author an absolute right to determine how their work is used. It does give rights to prevent a reproduction, but that is not what an AI model does.
But did they purchase every non public domain book they used? Highly doubt it.
I agree, that was also my thought, but that is a principally different case than requiring consent from the author. It suggests that OpenAI would have downloaded pirated material or hacked paywalls. How did they get this material?
BTW, I noticed that GPT-4 is good at writing legal letters of the sort that are widely available online. But a subpoena ('dagvaarding', the Dutch version, is what I have researched) it completely fails to create. There are also not many subpoenas available online, and the court (in the Netherlands) only publishes the verdicts, not the other documents. Lawyers, OTOH, have a lot more of this available in their libraries.
So, my impression is that there is still a lot of material out there that is not in the corpus.
"Hello, we are a non-profit that wants to make AI models to benefit humanity. Can you give us access to your data to help us in our work?"
This is not what they did. They fed off enormous databases of pirated material.
Yeah, but they won't get non-public data that way. I'd bet they did get access to a lot of non-public data just by asking and stating they do it for a non-profit mission.
Did Google Books?
What if they just bought dirt cheap used copies meaning the creators saw nothing thanks to the first sale doctrine?
They could without question set up a very nice physical library in Mountain View and even invite the public in. They can probably in general scan those books for their own internal use. What got shut down was scanning the books and making them available to everyone in their entirety.
That shouldn’t matter, regarding copyright violation. Purchasing a book only makes the physical object your own, it doesn’t change anything with regard to the contained work.
This is currently untested (though, trials are in progress) and really could go either way in the courts.
Why would you conclude that? While an AI model does not ONLY reproduce, it most certainly can make verbatim reproductions. The only things preventing the user from getting copyrighted material out of ChatGPT are probably rules/guardrails. The most prominent example of this is perhaps the Bible, which you could get from it quote by quote, within token limits.
Yes, but the original intent was to help other companies create ethical AI models. If they've already turned their back on those core values, a bit more hypocrisy won't stop them.
Stated intent and actual intent are usually different; I expect they intended the opposite all along and were just riding the open-source ethical-AI wave to profit.
I strongly doubt it. Sam was pretty damn rich before OpenAI. He's written a lot of stuff about AI and how he worried someone would do it. The original plan was roughly: build the basis for ethical AI and have someone else build on top of it, which somehow pivoted into everyone using their API.
If they just wanted money, they could have cut out a lot of the ethical filters and choked out competitors. Google didn't stop NSFW. Twitter, Reddit, Tumblr, etc. didn't. AI is bound to be used for NSFW among other things, but they've set the standards to make it ethical.
I think eventually they did let loose to try to keep ahead of competitors. This probably pissed off the board and led to the drama recently? Just speculation. Because the new models are nowhere near as anal as the initial release.
I think you are wrong here. The safety filters are an important part of the PR and regulatory lobbying effort, which is key to OpenAI’s long-term moneymaking plan, given the absence of a durable moat for commercial AI. (“Safety” and “ethics” in AI are labels for the boundaries and concerns of different ideological factions, and Altman and OpenAI are deep in the “safety” side, which has the most money and power behind it, so “safety” is also becoming the generic term for AI boundaries.)
Neither, in practice, has OpenAI: there are whole communities built around using OpenAI's models for very, very NSFW purposes. They've prevented casual NSFW to present the image they want to the government and interest groups whose support they want as they lobby to shape regulation of AI, avoiding being a target for things like the 404 Media campaign against CivitAI, where the NSFW is more readily visible.
I've talked to a few people who were at OpenAI a few years ago; the deviation from the original mission was a "boiling frog" process, best demarcated by when the Anthropic founders left. The core team doing the actual science and engineering work did legitimately believe in the open research mission. But when funding was hard to find, things kind of broke.
Well, nothing is "hypocritical" in the Ayn Randian world of "I have the marbles, so now you pay me".
Yes
Nope. OpenAI is bearing all the associated legal risk, not ByteDance.
There's a saying, "thief who steals from a thief has a hundred years of forgiveness".
Since when have corporations ever been ashamed of hypocrisy? Corporations can't feel shame about anything.
Bytedance violated the contractual arrangement.
That’s not the same thing as general copyright.
My guess is that using the OpenAI models to generate training data for a bespoke transformer model is extremely taxing on OpenAI's compute capacity. If so, that is why the behavior is proscribed by the TOS and why ByteDance was banned. It probably has nothing to do with the ethics of how training data is gathered.
When it comes to cutting-edge business-vs-business disputes, the legality is often defined post factum, e.g. in the courts.
For an outsider to know whether something in a case like this is legal or not is near impossible, considering how opaque such businesses are.