Even if they win against OpenAI, how would this prevent something like a Chinese or Russian LLM from "stealing" their content and making their own superior LLM that isn't weakened by regulation like the ones in the United States?
And I say this as someone who is extremely bothered by how easily mass amounts of open content can be vacuumed up into a training set with reckless abandon. There isn't much you can do other than put everything you create behind some kind of authentication wall, and even then it's only a matter of time until it leaks anyway.
Pandora’s box is really open; we need to figure out how to live in a world with these systems, because it’s an unwinnable arms race where only bad actors will benefit from everyone else being neutered by regulation. Especially with the massive pace of open-source innovation in this space.
We’re in a “mutually assured destruction” situation now, but instead of bombs the weapon is information.
I don't think they're looking to prevent the inevitable, but rather see a target with a fat wallet from which a lot of money can be extracted. I'm not saying this in a negative way, but much of the "this is outrageous!" reaction to AI hasn't been about the building of models, but rather the realization that a few players are arguably getting very rich on those models so other people want their piece of the action.
If NYT wins this, then there is going to be a massive push for payouts from basically everyone ever…I don’t see that wallet being fat for long.
If LLMs actually create added value and don't just burn VC money then they should be able to pay a fair price for the work of people they're relying upon.
If your business is profitable only when you get your raw materials for free it's not a very good business.
By that logic you should have to pay the copyright holder of every library book you ever read, because you could later produce some content you memorised verbatim.
Copyright holders do get paid for library copies, in the US.
You make it seem as if the copyright holder is making more money on a library book than on one sold at retail, which does not appear to be the case in the US.
The library pays for the books and the copyright holder gets paid. This is no different from buying a book retail, which you can read and share with family and friends after reading, or sell it, where it can be read again and sold again. The book is the product, not a license for one person to access the book.
The rules we have now were made in the context of human brains doing the learning from copyrighted material, not machine learning models. The limitations on what most humans can memorize and reproduce verbatim are extraordinarily different from an LLM. I think it only makes sense to re-explore these topics from a legal point of view given we’ve introduced something totally new.
Human brains are still the main legal agents in play. LLMs are just computer programs used by humans.
Suppose I research for a book that I'm writing - it doesn't matter whether I type it on a Mac, PC, or typewriter. It doesn't matter if I use the internet or the library. It doesn't matter if I use an AI powered voice-to-text keyboard or an AI assistant.
If I release a book that has a chapter which was blatantly copied from another book, I might be sued under copyright law. That doesn't mean that we should lock me out of the library, or prevent my tools from working there.
gets paid
That is the case. It's just that the fair price is fairly low and is often covered by the government in the name of the greater good.
When for-profit companies seek access to library material they pay a much much higher price.
The difference here is scale. For someone to reproduce a book verbatim from memory it would take years of studying that book. For an LLM this would take seconds.
The LLM could reproduce the whole library quicker than a person could reproduce a single book.
What do you actually believe, with that statement? Do you believe Libraries are operating illegally? That they aren't paying rightsholders?
Also: GPT is not a legal entity in the United States. Humans have different rights than computer software. You are legally allowed to borrow books from the library. You are legally allowed to recite the content you read. You're not allowed to sell verbatim recitations of what you read. This is obvious, I think? But it's exactly what LLMs are doing right now.
Imagine if tomorrow it was decided that every programmer had to pay out money for every single thing they went on the internet to learn about beyond official documentation, every Stack Overflow question they looked at, every question they went to a search engine to find. The amount of money was decided by a non-tech official who was in charge of figuring out how much of the money they earned was owed to the places they learned from. And people responded, "Well, if you can't pay up for your raw materials, then this just isn't a good business for you."
Except that every stackoverflow post is explicitly creative commons: https://stackoverflow.com/help/licensing
So I suppose it would be like saying that if you used Stack Overflow to find answers, all of the work you created using information from it would have to be explicitly under the Creative Commons license. You wouldn't even be able to work for companies who aren't using that license if some of your knowledge comes from what you learned on Stack Overflow. Used Stack Overflow to learn anything about programming? You're going to have to turn down that FAANG offer.
And if you learned anything from videos/books/newsletters with commercial licenses, you would have to pay some sort of fee for using that information.
If your code contains verbatim copy-paste of entire blocks of non-trivial code lifted from those videos/books/newsletters with commercial licenses, then yes you would be liable for some licensing fees, at minimum.
What is a fair price? The entire NYT library would be a fraction of a fraction of the training set (presumably).
What if even though it's a small portion of the training data, their content has an outsized influence on the output being generated? A random NYT article about Donald Trump and a random Wikipedia article about some obscure nematode might be around the same share of training data but if 10,000x more users are asking about DJT than the nematode, what is fair? Obviously they'll need to pay royalties on the usage! /s
Yup, and I think that'll quickly uncover the reality that LLMs do not generate enough value relative to their true cost. GPT+ already costs $20/month. M365 Copilot costs $30/user/month. They're already the most expensive B2B-ish software subscriptions out there, there's very little market room to add in more cost to cover payments to rightsholders.
The data will have to become more curated. Exclusivity deals will probably become a thing too. Good data will be worth the money and hassle; garbage (or meh) data won't.
If they are determined to have broken the law, then they should absolutely be made to pay damages to aggrieved parties (now, determining whether they did and who those parties are is an entirely unknown can of worms).
If this is inevitable (and I'm not saying it's not), who will produce high quality news content?
AI. And, I fear, it will be good.
Curious how AI gets the raw information if there are no reporters nor newspapers. Does AI go to meetings or interview politicians?
I can certainly imagine email correspondence. Even audio interviews. You're right that it seems at least presently AI is less likely to earn confidences. But I don't know how far off the movie "Her" actually is.
This suggests to me that copyright laws are becoming out of date.
The original intent was to provide an incentive for human authors to publish work, but has become more out of touch since the internet allowed virtually free publishing and copying. I think with the dawn of LLMs, copyright law is now mainly incentivising lawyers.
What incentive do people have to publish work if their work is going to primarily be consumed by an LLM and spat out without attribution at people who are using the LLM?
To have a positive impact on the world? Also, presumably NYT still has a business model unrelated to whatever OpenAI is doing with their data and everyone working there is still getting paid for their work...
Oh thank goodness we can rely on charity for our information economy
That’s exactly the question. They are claiming it is destroying their business, which is pretty much self-evident given all the people in here defending the convenience of OpenAI’s product: they’re getting the fruits of NYTimes’ labor without paying for it in eyeballs or dollars. That’s the entire value prop of putting this particular data into the LLMs.
You seem to be assuming an "information economy" should exist at all. Can you justify that?
Yep! I like having access to high-quality information and producing, collecting, editing, and publishing that is not free.
Much of it is only cost-effective to produce if you can share it with a massive audience, I.e. sure if I want to read a great investigative piece on the corruption of a Supreme Court Justice I can hypothetically commission one, but in practice it seems much much better to allow people to have businesses that undertake such matters and publish their findings to a large audience at a low unit price.
Now what’s your argument for removing such an incentive?
Why did you specify that this stuff you like, you only like if it's "not free"?
The hidden assumption is that the information you like wouldn't be made available unless someone was paying for it. But that's not in evidence; a lot of information and content is provided to the public due to other incentives: self-promotion, marketing, or just plain interest.
Would you prefer not to have access to Wikipedia?
I’ll restate it for clarity: I like high-quality information. Producing and publishing high-quality information is not free.
There are ways to make it free to the consumer, yes. One way is charity (Wikipedia) and another way is advertising. Neither is free to produce; the advertising incentive is also nuked by LLMs; and I’m not comfortable depending on charity for all of my information.
It is a lot cheaper to produce low-quality than high-quality information. This is doubly so in a world of LLMs.
There is ONE Wikipedia, and it is surely one of mankind’s crowning achievements. You’re pointing to that to say, “see look, it’s possible!”?
We absolutely need an information economy where people can research things and publish what they find without needing some deep pocketed sponsors. Some may do it for money, some may do it for recognition. Once AI absorbs all that information and uses it without attribution these incentives go away. I am sure OpenAI, Microsoft and others will love a world where they have a quasi monopoly on what information goes to the public but I don't think we want that.
Without 200 years of copyright protection, how will any author be able to afford food?
The fact that copyright protection is far too long is entirely separate from the need for some kind of copyright protection to exist at all. All evidence suggests that it's completely impossible to live off your work unless you copyright it for some reasonable period, with the possible exception of performance art (music, theater, ballet).
A writer or journalist just can't make money if any huge company can package their writing and market it without paying them a cent. This is not comparable to piracy, by the way, since huge companies don't move into piracy. But you try to compete with both Disney and Fox for selling your new script/movie, as an individual.
This experiment has also been tried to some extent in software: no company has been able to live off selling open source software. Red Hat is the one that came closest, and they actually live by selling support for the free software they distribute. Others like MySQL or Mongo lived by selling the non-GPL version of their software. And the GPL itself depends critically on copyright existing. Not to mention, software is still a best-case scenario, since just having a binary version is often not enough: you need the original sources, which are easy to guard even without copyright. No one cares so much for the "sources" of a movie or book.
Which evidence?
The fact that it has never been done successfully outside performance arts.
That's a large category that includes everything from YouTubers to furry artists to live concerts.
I think you're making a profound point here.
I believe you equate incentive to monetary rewards. And while that it probably true for the majority of news outlets, money isn't always necessarily what motivates journalists.
So consider the hypothetical situation where journalists (or, more generally, people who publish things) were somehow compensated, but not attributed (or only to a very limited extent), because LLMs are just bad at attribution.
Shouldn't the fact that the LLM distributes the information "better" then be enough to satisfy the deeper goal of wanting to publish? I.e., reaching as many people looking for that information as possible, without blasting it out or targeting and tracking audiences?
I would guess the monetisation is going to be limited to either subscriptions or advertising, if your reputation allows people to especially value your curation of facts/reporting etc. The big issue with LLMs is the lack of reliability: any given output might be accurate or it might be a hallucination.
Personally, I think it would be a lot simpler if the internet were declared a non-copyright zone for sites that aren't paywalled, since there's already a legal grey area: viewing a site invariably involves copying it.
Maybe we'll end up with publishers introducing traps/paper towns like mapmakers are prone to do. That way, if an LLM reproduces the false "fact", it'll be obvious where they got it from.
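The trap idea can be sketched in a few lines. Everything here (the function names, the hiding strategy, the "archival reference" framing) is hypothetical, just to illustrate the mechanism:

```python
import secrets


def make_canary(prefix: str = "trap") -> str:
    """Generate a unique nonsense token to embed in an article.

    Like a mapmaker's paper town: if a model later reproduces this
    exact token, the trapped article was almost certainly in its
    training data.
    """
    return f"{prefix}-{secrets.token_hex(8)}"


def embed_canary(article: str, canary: str) -> str:
    # Hide the token somewhere a scraper will pick up but a human
    # reader will skim past, e.g. a fictitious archival reference.
    return article + f"\n(Archival reference: {canary})"


def check_output(model_output: str, canary: str) -> bool:
    # Any verbatim reproduction of the canary is strong evidence
    # that the model was trained on the trapped article.
    return canary in model_output
```

The strength of the scheme is that the token is statistically impossible to produce by chance, so a single match is conclusive in a way that reproducing a real sentence never quite is.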
I notice this in myself, even though I've never particularly made money from published prose on the internet.
But (under different accounts) I used to be very active on both HN and reddit. I just don't want to be anymore now for LLM reasons. I still comment on HN, but more like every couple of weeks than every day. And I have made exactly one (1) comment on reddit in all of 2023.
I'm not the only one, and a lot of smaller reddit communities I used to be active on have basically been destroyed by either LLMs, or by API pricing meant to reflect the value of LLM training data.
Maybe a specific example will help here. An author spends a year writing a technical book, researching subtle technical issues, creating original code, and finding novel ways of explaining difficult abstractions.
A few weeks after the release, they find books on Amazon that plagiarized theirs, copies of the book available for free from Russian sites, and ChatGPT spitting out verbatim parts of the book's source code.
Which parts of copyright law would you say are out of date for the example above?
The expectation that the author will get life+70 years of protection and income, when technical publications are very rarely still relevant after 5 years. Also, the modern ease of copying/distribution makes it almost impossible for the author to even locate which people to try to prosecute.
The expectation to make money from artificially restricting an abundant resource. While copyright is a way to create funding, it also massively harms society by restricting future creators from being able to freely reuse previous works. Modern ways to deal with this are patronage, government funding, foundations (e.g. NLNet) and crowdfunding.
Also, plagiarism has nothing to do with copyright. It has to do with attribution. This is easily proven: you can plagiarise Beethoven's music even though it's public domain.
https://questioncopyright.org/minute-memes-credit-is-due
The cost of copying and publishing has been almost irrelevant to the need for copyright at least since the times of the printing press. In fact, when copying books was extremely expensive work, copyright was not even that needed - the physical book was about as valuable as the contents, so no money was there to be made from copying someone else's work vs coming up with your own.
And yet the content industry still creates massive profits every year from people buying content.
I think internet-native people can forget that internet piracy doesn’t immediately make copyright obsolete simply because someone can copy an article or a movie if sufficiently motivated. These businesses still exist because copyright allows them to monetize their work.
Eliminating copyright and letting anyone resell or copy anything would end production of the content many people enjoy. You can’t remove content protections and also maintain the existence of the same content we have now.
So Chinese LLMs are bad actors, but USA LLMs are the good guys?
I don't see it that way, but I'm sure from an American perspective that how it seems.
What? This is about whether one country wants to cede a massive economic advantage to another country.
So the US should stop enforcing copyright or child labor laws because some other countries may not, giving them an economic advantage?
In contrast to child labor laws, which are intended and written to protect vulnerable people from exploitation, current copyright laws are tailored to the interests of Disney et al.
If they were watered down, I wouldn't see any moral or ethical loss in that.
Copyright law is far from perfect, but the concept is not morally bankrupt. It is certainly abused by large entities but it also, in principle, protects small content creators from exploitation as well. In addition to journalists, writers, musicians, and proprietary software vendors, this also includes things like copyleft software being used in unintended ways. When I write copyleft software, it is my intention that it is not used in proprietary software, even if laundered through some linear algebra.
I'm also far more amenable to dismissing copyright laws when there is no profit involved on the part of the violator. Copying a song from a friend's computer is whatever, but selling that song to others certainly feels a lot more wrong. It's not just that OpenAI is violating copyright, they are also making money off of it.
With the exception of source code availability, copyleft is mostly about using copyright to destroy itself. Without copyright (which I feel is unethical), and with additional laws to enforce open sourcing all binaries, copyleft need not exist.
So it is not good when people use copyleft as a justification for copyright, given that its whole purpose was to destroy it.
Source code availability (and the ability to modify the code on a device) is the most important part, IMO, regardless of RMS's original intention. Do you feel that it's ethical that OpenAI is keeping their model closed?
Well yeah... If they want to keep the lead on AI (which everything indicates they want).
yes
On the other hand, you could also argue that if AI takes all financial incentives from professionals to produce original works, then the AI will lose out on quality material to train on and become worse. Unless your argument is there’s no need for anything else created by humanity, everything worth reading has already been written, and humanity has peaked and everyone should stop?
Like all things, it’s about finding a balance. American, or any other, AI isn’t free from the global system which exists around us— capitalism.
People produce countless volumes of unpaid works of art and fiction purely for the joy of doing so; that's not going to change in future.
Anecdotal but I know lots of creatives (and by creatives I also include some devs) who've stopped publishing anything publicly because of various AI companies just stealing everything they can get their hands on.
They don't mind sharing their work for free to individuals or hell, to a large group of individuals and even companies, but AIs really take it to a whole different level in their eyes.
Whether this is a trend that will accelerate or even make a dent in the grand scheme of things, who knows, but at least in my circle of friends a lot of people are against AI companies (which is basically == M$) being able to get away with their shenanigans.
Why should OpenAI be the one making money off their hard work, even if they do it for free?
You've missed the point he was making -- that Chinese and Russian companies don't care about American copyright and will do whatever is in their interest.
And although you were being flippant, yes, Chinese LLMs are bad actors.
I don’t really see it as good guys or bad guys - just that China (and Russia) don’t really care too much about American copyright.
And there seems to be an obvious advantage, from my perspective, to having an information vacuum that is not bound by any kind of copyright law.
If that’s good or bad is more of a matter of opinion.
This argument is moot. Just because some countries (see China) steal intellectual property, it doesn't mean we should. There are rules to the games we play specifically so we don't end up like them.
The word ‘moot’ does not mean what you think it means.
It can do though. While the proper definition is "worthy of discussion / debatable", it can also refer to a pointless debate.
"Moot derives from gemōt, an Old English name for a judicial court. Originally, moot referred to either the court itself or an argument that might be debated by one. By the 16th century, the legal role of judicial moots had diminished, and the only remnant of them were moot courts, academic mock courts in which law students could try hypothetical cases for practice. Back then, moot was used as a synonym of debatable, but because the cases students tried in moot courts were simply academic exercises, the word gained the additional sense "deprived of practical significance." Some commentators still frown on using moot to mean "purely academic," but most editors now accept both senses as standard."
- Merriam-Webster.com
Do you really think the commenter meant to use moot to mean “purely academic?”
"Moot" means "arguable". That's what GP was saying.
It's impossible to "steal" intellectual property without some kind of mind wiping device.
You must have used that device if you're making that argument in good faith.
Okay, so how is it possible to take and deprive the author of their original? The correct term would be "unauthorised copying".
Ok, let’s address this from the standpoint of a node in the network of the thoughtscape. A denizen of the “inter”net, and also a victim of the exploitive nature of artists.
Media amalgamated power by farming the lives of "common" people for content, and attempt to use that content to manage the lives of both the commons and the unique, under the auspices of entertainment. Which in and of itself is obviously a narrative convention which infers implied consent (I'd ask "to what," facetiously).
Keepsake of the gods if you will…
We are discussing these systems as though they are new (ai and the like, not the apple of iOS), they are not…
this is an obfuscation of the actual theft that’s been taking place (against us by us, not others).
There is something about reaping what you sow written down somewhere, just gotta find it.
-mic
Countless Americans are happily 'stealing' intellectual property everyday from other Americans by accessing two websites — SciHub and LibGen — who owe their very existence to them being hosted in foreign countries with weak intellectual property protection and not being subject to US long-arm jurisdiction. Even on this website, using sites like archive.is (which would be illegal if they operated in the US) to bypass paywalls to access copyrighted material is common and rarely frowned upon. I doubt a culture of respecting copyright is as characteristic of "us" as you seem to think.
An LLM in Russia can commit the same crime in Russia, and get sued in Russia. No idea about China, but I know Russia has a working legal system.
For some definitions of “working”.
Working enough that people and companies there exist, live, and are to some degree successful, yes. I've visited multiple times in the past few years and I found it to be pretty normal
My understanding is they have one of the most corrupt and unjust legal systems of the developed countries.
“Works on my machine!”
Navalny probably has a different opinion.
There isn’t a country on the planet that doesn’t have people and companies. That doesn’t mean they all have functional legal systems.
They probably didn’t start with a lawsuit. They started asking for royalties. They probably didn’t get an offer they thought was fair and reasonable so they sued.
These media businesses have shareholders and employees to protect. They need to try and survive this technological shift. The internet destroyed their profitability but AI threatens to remove their value proposition.
Sorry, how exactly do LLMs threaten the NYT? Are people supposed to generate news themselves? Or wait a year or so before NYT articles are consumed by LLMs?
NYT doesn't just publish "news" as in what happened yesterday; they also publish analysis, reviews of books and films, history, biography and so on. That's why people cite NYT articles from decades ago.
I’m ambivalent.
On the one hand, they should realize they are one of today's horse-carriage manufacturers. They'll only survive in very narrow realms (someone still has to build the Central Park horse carriages), but they will be minuscule in size and importance.
On the other hand, LLMs should observe copyright and not be immune to copyright.
Foreign companies can be barred from selling infringing products in the United States.
Russian and Chinese consumers are less interested in English-language articles.
I can’t really get behind the argument that we need to let LLM companies use any material they want because other countries (with other languages, no less) might not have the same restrictions.
If you want some examples of LLMs held back by regulations, look into some of the examinations of how Chinese LLMs are clearly trained to avoid answering certain topics that their government deems sensitive.
But they're not; you can download open-source Chinese base models like Yi and Deepseek and ask them about Tiananmen Square yourself and see: they don't have any special filtering.
I suspect they will crack down on that within the next few years.
Isn't it just one additional step to automatically translate them?
The NYT's strongest argument for infringement is that OpenAI is reproducing their content verbatim (and to make matters worse, without attribution). IANAL but it seems super likely to me that this will be found to be infringing sooner or later.
Do I really want to use a Chinese word processor that spits unattributed passages from the NYT into the articles I write? Once I publish that to my blog now I'm infringing and I can get sued too. Point is I don't see how output which complies with copyright law makes an LLM inferior.
The argument applies equally to code, if your use of ChatGPT, OpenAI etc. today is extensive enough, who knows what copyrighted material you may have incorporated illegally into your codebase? Ignorance is not a legal defense for infringement.
If anything it's a competitive advantage if someone develops a model which I can use without fear of infringement.
Edit: To me this all parallels Uber and AirBnB in a big way. OpenAI is just another big tech company that knew they were going to break the law on a massive scale, and said look this is disruptive and we want to be first to market, so we'll just do it and litigate the consequences. I don't think the situation is that exotic. Being giant lawbreakers has not put Uber or AirBnB out of business yet.
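For what it's worth, the verbatim-reproduction concern raised above is mechanically checkable: before publishing LLM output, you can screen it for long word-for-word overlaps with a known source. A minimal sketch, where the helper names and the 8-word threshold are my own assumptions rather than any real tool:

```python
def ngrams(text: str, n: int = 8) -> set[str]:
    """Set of word n-grams; matches of 8+ consecutive words are
    rarely coincidental in ordinary prose."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def verbatim_overlap(generated: str, source: str, n: int = 8) -> float:
    """Fraction of the generated text's n-grams that also appear in
    the source; 0.0 means no long verbatim runs, 1.0 means the
    output is entirely lifted."""
    gen = ngrams(generated, n)
    if not gen:
        return 0.0
    return len(gen & ngrams(source, n)) / len(gen)
```

This is of course only practical when you have the candidate source in hand (as a publisher checking its own archive would), and it says nothing about paraphrase, which is the legally murkier case.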
It better. Copyright has essentially fucking ceased to exist in the eyes of AI people. Just because you have a shiny new toy doesn't mean the law suddenly stops applying to you. The internet does its best to route around laws and government but the more technologically up to date bureaucracy becomes, the faster it will catch up.
Yeah I mean I'm not even really a fan of how copyright law works, but I don't see how you can just insert an "AI exemption." So OpenAI can infringe because they host an AI tool, but we humans can't? That would be ridiculous. Or is "I used AI when I created this" a defense against infringement? Also seems ridiculous. Why would we legally privilege machine creation of creative works over human creation in the first place? So I don't see what the credible AI-related copyright law reform is going to be yet.
Which means that either OpenAI is allowed to be the only lawbreaker in the country (because rich and lawyers), or nobody is. I say prosecute 'em and tell them to make tools that follow the law.
Trying to prevent AI from learning from copyrighted content would look completely stupid in a decade or two when we have AIs that are just as capable as humans, but solely due to being made of silicon rather than carbon are banned from reading any copyrighted material.
Banning a synthetic brain from studying copyrighted content just because it could later recite some of that content is as stupid as banning a biological person from studying copyrighted content because it could later quote from it verbatim.
It's not exactly a synthetic brain though, is it? LLMs are more like lookup tables for the texts they're trained on.
We will not have "AIs as capable as humans" in a couple of decades. AIs will keep being tools used by humans. If you use copyrighted texts as input to a digital transformation, that's copyright infringement. It's essentially the same situation as sampling in music, and IMO the same solutions can be applied here: e.g. licenses with royalties.
We have this now with humans. I've been in a lifelong struggle for knowledge and tools that I can afford.
I see a complete economic collapse unless creators start getting paid both for their data upfront and royalties when their data is used in an LLM response.
Copyright doesn’t protect data, it only protects expression.
While I didn't say anything about copyright (obviously our current copyright laws are completely ill-equipped to handle how LLMs work), feel free to replace data with whatever you like. writing, art, music, etc. It's all the same.
All of this can be true (I don’t think it necessarily is, but for the sake of argument), but it’s legally irrelevant: the court is not going to decide copyright infringement cases based on geopolitical doctrines.
Courts don’t decide cases based on whether infringement can occur again, they decide them based on the individual facts of the case. Or equivalently: the fact that someone will be murdered in the future does not imply that your local DA should not try their current murder cases.
The issue here is that the case law is not settled at all and there is no clear consensus on whether OpenAI is violating any copyright laws. In novel cases like this where the courts essentially have to invent new legal doctrines, I think the implications of the decision carries a tremendous amount of weight with the judges and justices who have to make that decision.
Any piece of pie deemed too big for one person to eat will be split accordingly.
I don’t think NYT, or any other industry, for that matter knows AI isn’t going away: in fact, they likely prefer it doesn’t, so long as they can get a slice of that pie.
That’s what the WGA and SAG struck over, and won protections ensuring AI enhanced scripts or shows will not interfere with their royalties, for example.
The war on drugs has also been unwinnable from the start and yet they built an economy on top of it, with entire agencies and a prison industry. When it comes to the fabrication and exploitation of illegality, unwinnability may be a feature, not a bug.
What's the actionable advice here? US regulation should be the lowest common denominator of all countries one considers in competition? Certainly Chinese and Russian LLMs could vacuum up all the information. China already cares little about copyright and trademark, should they stop being enforced in the US?
My opinion is that the US should do things that are consistent with their laws. I don't think a Chinese or Russian LLM is much of a concern in terms of this specific aspect, because if they want to operate in the US they still need to operate legally in the US.
SciHub was an early warning, IMHO, that there's a strong risk of the first world fumbling the ball so badly with IP that tech ecosystems start growing in the third world instead. The dominant platform for distributing scientific journal papers is no longer Western. Maybe SciHub is economically inconsequential, but LLM's certainly are not!
Imagine if California had banned Google spidering websites without consent, in the late 90's. On some backwards-looking, moralizing "intellectual property" theory, like the current one targeting LLM's. 2/3rd of modern Silicon Valley wouldn't exist today, and equivalent ecosystems would have instead grown up in, who knows where. Not-California.
We're all stupidly rich and we have forgotten why we're rich in the first place.
Another way to look at it is to consider being stolen from as part of the business model.
There is a massive amount of pirated content in China, but Hollywood is also making billions at the same time; in fact, China surpassed NA as the #1 market for Hollywood years ago [1].
The NYT is obviously different from Disney, and may not be able to bend its knees far enough, but maybe there can be similar ways out of this.
[1] https://www.theatlantic.com/culture/archive/2021/09/how-holl...
Access to resources is hardly a new problem: when I was an NLP graduate student about a decade ago, one of our teachers had scraped (and continued to scrape) a major newspaper for years to make a corpus. The legality of that was questionable at best, yet it was used in academic papers, and a subset was used for training.
The same is equally applicable to images: Google got rich in part by making illegal copies of whatever images it could find. Existing regulations could be updated to include ML models, but that won't stop bad or big-enough actors from doing what they want.
No, we aren't. Very good spam generators aren't comparable to mass destruction weapons.
I have faith in your ability to make it through these difficult times.