There are two different questions at play here, and we need to be careful what we wish for.
The first concern is the most legitimate one: can I stop an LLM from being trained on my data? This should be possible, and Perplexity should absolutely make it easy to block them from training.
The second concern, though, is can Perplexity do a live web query to my website and present data from my website in a format that the user asks for? Arguing that we should ban this moves into very dangerous territory.
Everything from ad blockers to reader mode to screen readers does exactly the same thing that Perplexity is doing here, with the only difference being that those tools tend to be exclusively local. The very nature of a "user agent" is to be an automated tool that manipulates content hosted on the internet according to the specifications given to the tool by the user. I have a hard time seeing an argument against Perplexity using this data in this way that wouldn't apply equally to countless tools that we all already use and which companies try, with varying degrees of success, to block.
I don't want to live in a world where website owners can use DRM to force me to display their website in exactly the way that their designers envisioned it. I want to be able to write scripts to manipulate the page and present it in a way that's useful for me. I don't currently use LLMs this way, but I'm uncomfortable with arguing that it's unethical for them to do so as long as they're citing the source.
It's funny I posted the inverse of this. As a web publisher, I am fine with folks using my content to train their models because this training does not directly steal any traffic. It's the "train an AI by reading all the books in the world" analogy.
But what Perplexity is doing when they crawl my content in response to a user question is decreasing the probability that this user would come to my content (via Google, for example). This is unacceptable. A tool that runs on-device (like Reader Mode) is different: Perplexity is an aggregator service that will keep solidifying its position as a demand aggregator, and I will never be able to get people directly onto my content.
There are many benefits to having people visit your content on a property that you own. E.g., say you are a SaaS company and you have a bunch of Help docs. You can analyze traffic in this section of your website to get insights to improve your business: what are the top search queries from my users? This might indicate where they are struggling or what new features I could build. In a world where users ask Perplexity these Help questions about my SaaS, Perplexity may answer them, and I would lose all those insights because I never get any traffic.
Google has been providing summaries of stuff and hijacking traffic for ages.
I kid you not, in the tourism sector this has been a HUGE issue, we have seen 50%+ decrease in views when they started doing it.
We paid gazillions to write quality content for tourists about all sorts of places, just so Google could put it on their homepage.
It's just depressing. I'm more and more convinced that the age of regulation and competition is gone; the US does want unkillable monopolies in the tech sector, and we are all peons.
It's a legitimate complaint, and it sucks for your business. But I think this demonstrates that the sort of quality content you were producing doesn't actually have much value.
That line of thinking makes no sense. If the "content" had no value, why would Google go through the effort of scraping it and presenting it to the user?
They don't present it all, they summarize it.
And let's be serious here, I was being polite because I don't know the OPs business. But 99% of this sort of content is SEO trash and contributes to the wasteland that the internet is becoming. Feel free to point me to the good stuff.
I would also think that the intrinsic value is different. If there is a hotel on a mountain writing "quality content" about the place, to them it really doesn't matter who "steals" their content, the value is in people going to the hotel on the mountain not in people reading about the hotel on the mountain.
Like to society the value is in the hotel, everything else is just fluff around it that never had any real value to begin with.
Travel bloggers and vloggers, but that is an entirely different unaffected industry (entertainment/infotainment).
I've no doubt some good ones exist, but my instinct is to ignore every word this industry says because it's paid placement and our world is run by advertisers.
Pedantry aside, let's restate it as "present the core thoughts" to the user, which still implies value. I agree that most of Google's front-page results are SEO garbage these days, but that's a separate issue from claiming that a summary of a piece of information strips the original of its value. I'd even argue that it transfers the value from one entity to the other in this case.
It's not that it has no value, it's that there is no established way (other than ad revenue) to charge users for that content. The fact that google is able to monetize ad revenue at least as well as, and probably better than, almost any other entity on the internet, means that big-G is perfectly positioned to cut out the creator -- until the content goes stale, anyway.
This will be quite interesting in the future. One can usually tell if a blog post is stale, or whether it’s still relevant to the subject it’s presenting. But with LLMs they’ll just aggregate and regurgitate as if it was a timeless fact.
This is already a problem. Content farms have realised that adding "in $current_year" to their headlines helps traffic. It's frustrating when you start reading and realise the content is two years out of date.
I'd argue it only demonstrates that it doesn't produce much value for the creator.
The Google summaries (before whatever LLM stuff they're doing now) are 2-3 sentences tops. The content on most of these websites is much, much longer than that for SEO reasons.
It sucks that Google created the problem on both ends, but the content OP is referring to costs way more to produce than it adds value to the world because it has to be padded out to show up in search. Then Google comes along and extracts the actual answer that the page is built around and the user skips both the padding and the site as a whole.
Google is terrible, the attention economy that Google created is terrible. This was all true before LLMs and tools like Perplexity are a reaction to the terrible content world that Google created.
It would be a lot better if Google just prioritised concise websites.
If Google preferred websites that cut the fluff, then website operators would have an incentive to make useful websites, and Google wouldn't have as much of an incentive to provide the answer in a snippet, and everyone wins.
I guess it's hard to rank website quality, so Google just prefers verbose websites.
Google has at least two incentives to provide that answer, both of which wouldn't change. The bad one: they want to keep you on their page too, for usual bullshit attention economy reasons. The good one: users prefer the snippets too.
The user searching for information usually isn't there to marvel at the beauty of random websites hiding that information in piles of noise surrounded by ads. They don't care about websites in the first place. They want an answer to the question so they can get on with whatever it is they're doing. When Google can give them an answer, and this stops them from going from the SERP to any website, then that's just a few seconds or minutes of life the user doesn't have to waste. Lifespans are finite.
I strongly disagree with you.
The only reason that users prefer snippets is because websites hide the info you are looking for. The problem is that the top ranked search results are ad-infested SEO crap.
If the top ranked website were actually designed with the user in mind, they would not hide the important info. They would present the most important info at the top, and contain additional details below. They would offer the user exactly what they want immediately, and provide further details that the user can read if they want to.
Think of a well-written Wikipedia article. The summary is probably all that you need, but it's good that the rest of the article, with all the detail, is there as well. I'm pretty sure that most people prefer a well-designed, user-centric article to the stupid Google snippet that may or may not answer the question you asked.
Most people looking for info don't look for just a single answer. Often, the answer leads to the next question, or if the answer is surprising, you might want to check whether the source looks credible, etc. Even ads would be helpful if they were actually relevant (e.g., if I'm looking for low-profile graphics cards, I'd appreciate an ad for a local retailer that has them in stock).
But the problem is that website operators (and Google) just want to distract you, capture your attention, and get you to click on completely irrelevant bullshit, because that is more profitable than actually helping you.
I think optimising for that just leads to another kind of SEO slop. I mostly use the summaries for answers to questions like "what's the atomic number of aluminium". The sensible way of laying this out on a website is as a table or something like that, which requires another click, load, and manual lookup in the table. The summaries are useful for that, and if the websites want to answer that question directly, it means they want to make a bunch of tiny pages with a question like that and the answer, which is not something I want to browse through normally. (And indeed, I have seen SEO slop in this vein)
I'm curious about the tourism sector problem. In tourism, I would think the goal would be to promote a location. You want people to be able to easily discover the location, get information about it, and presumably arrange to travel there. If Google gets the information to the users but doesn't send the tourist to the website, is that harmful? Is it a problem of ads on the tourism website? Or is it more a problem of the site creator demonstrating to the site purchaser that the purchase was worthwhile?
We would employ local guides all around the world to craft itinerary plans to visit places, give tips, tricks, recommend experiences and places (we made money by selling some of those through our website) and it was a success.
Customers liked the in depth value of that content and it converted to buys (we sold experiences and other stuff, sort of like getyourguide).
One day all of our content ended up on Google "what time is best to visit the Sagrada Familia" and you would have a copy pasted answer by Google.
This killed a lot of traffic.
Anyway, I just wanted to point out that the previous user was a bit naive taking his fight to LLMs when search engines and OSs have been leeching and hijacking content for ages.
I totally get that it killed your traffic. If a thousand people a day typing in "what time is best to visit the Sagrada Familia" stopped clicking on the link to your page because Google just told them "4 PM on Thursdays" at the top of the page, you lost a bunch of traffic.
But why did you want the traffic? Was your revenue from ad impressions, or were you perhaps being paid by the city of Barcelona to provide useful information to tourists? If the former, I get that this hurt you. If the latter, was this a failure or a success?
Moreover, if it's the former, then good riddance. An ad-backed site is harming users a little on the margin for the marginal piece of information. Getting the same from a search engine is saving users from that harm.
Parent has the right question here: why did you want the traffic? Did you intend for anything good to happen to those people? I'm going to guess not; there's hardly a scenario where people who complain about lost traffic ever meant that traffic any good.
Now think of the 2nd order effects: they paid money to collect that useful information. If it’s no longer feasible to create such high quality content, it won’t magic itself into existence on its own. It’ll all be just crap and slop in a few years.
In my experience, the highest-quality content on the internet was created without a profit motive.
Except it kind of does. Almost all high-quality free content on the Internet has been made by hobbyists just for the sake of doing it, or as some kind of expense (marketing budget, government spending). The free content is not supposed to make money. An honest way of making money with content is putting up a paywall. Monetizing free content creates a conflict of interest, as optimizing for value to the publisher pulls it in the opposite direction from optimizing for value to the consumer. Can't serve two masters, and all that. That's why it's an effectively bullet-proof heuristic that the more monetization you see on some free content, the more wrong and more shit it is.
Put another way, monetizing the audience is the hallmark of slop.
Google Search is an ad-backed site.
They just prefer internet users to consume their ads, rather than the ads of the content creators.
They literally explained the business model in the post you replied to.
I think I have answered this already in the post, haven't I?
We sold experiences, thus we created a lot of free content from local experts and hoped that they would buy some of the tickets through our website.
If your content has a yes/no or otherwise simple, factual answer that can be conveyed in a 1-2 sentence summary, then I don't see this as a problem. You need to adapt your content strategy, as we all do from time to time.
There was never a guarantee -- for anyone in any industry at all -- that what worked in the past will always continue to work. That is a regressive attitude.
However I do have concerns about Google and other monopolies replacing large swaths of people who make their livings doing things that can now be automated. I am not against automation but I don't think the disruption of our entire societal structure and economy should be in the hands of the sociopaths that run these companies. I expect regulation to come into play once the shit hits the fan for more people.
Google snippets are hilariously wrong, absurdly often; I was recently searching for things while traveling and I can easily imagine relying on snippets getting people into actual trouble.
Presumably the issue is more the travel guides/Time Out/Tripadvisor type websites.
They make money by you reading their stuff, not by you actually spending money in the place.
Yes, Google hijacked images for some time. But in general there has "always" been the option to tell Google not to display summaries etc. via robots meta tags.
Google has been in trouble for doing so several times in the past and removed key features because of it. Examples: Viewing cached pages, linking directly to images, summarized news articles.
While I personally believe it should be opt-in, Google does have multiple ways to opt out of snippets while still being indexed. [1]
[1] https://developers.google.com/search/docs/appearance/snippet...
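For anyone wanting the concrete mechanism: per the linked docs, Google honors snippet controls expressed as robots meta tags and an inline attribute. A minimal sketch (the surrounding markup is just illustrative):

```html
<!-- Suppress the search snippet for this page entirely -->
<meta name="robots" content="nosnippet">

<!-- Or cap how many characters Google may show -->
<meta name="robots" content="max-snippet:50">

<!-- Or exclude just one section from snippets -->
<p data-nosnippet>This paragraph won't appear in search snippets.</p>
```

The page still gets indexed and ranked; only the snippet display changes.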
If I visit your site from Google with my browser configured to go straight to Reader Mode whenever possible, is my visit more useful to you than a summary and a link to your site provided by Perplexity? Why does it matter so much that visitors be directly on your content?
Traffic numbers, regardless of whether visitors use reader mode or not, are used as a basic valuation of a website or page. This is why Alexa rankings have historically been so important.
If Perplexity visits the site once and caches some info to give to multiple users, that steals traffic numbers with ad value, but it also takes away the site owner's ability to get a realistic idea of how many people are using the information on their site.
Additionally, this is AI we are talking about. Who's to say that the generated summary of information is actually correct? The only way to confirm that, or to get the correct information in the first place, is to read the original site yourself.
As someone who uses Perplexity, I often do do this. And I don't think I'm particularly in the minority with this. I think their UI encourages it.
Yeah that's one of the best things about them for me. And then I go to the website and often it's some janky UI with content buried super deep. Or it's like Reddit and I immediately get slammed with login walls and a million annoying pop ups. So I'm quite grateful to have an ability to cut through the noise and non-consistency of the wild west web. I agree the idea that we're somewhat killing traffic to the organic web is kind of sad. But at the same time I still go to the source material a lot, and it enables me to bounce more easily when the website is a bit hostile.
I wonder if it would be slightly less sad if we all had our own decentralized crawlers that simply functioned as extensions of ourselves.
This is something I'm (slowly) working on myself. I have a local language model server and 30 tb usable storage ready to go, just working on the software :)
I have another comment that says something similar, but: is valuing a website based on basic traffic still a thing? Feels very 2002. It's not my wheelhouse, but if I happened to be involved in a transaction, raw traffic numbers wouldn't hold much sway.
If you were considering acquiring a business that had a billion pageviews a month versus 10 pageviews a month, you don't think that would affect the sale price?
The inaccuracy point is particularly problematic: either they cite you as the source despite possibly warping your content into something incorrect, or they don't cite you and steal the content more directly. I'm not sure which is worse.
Well for one thing you visiting his site and displaying it via reader mode doesn't remove his ability to sell paid licenses for his content to companies that would like to redistribute his content. Meanwhile having those companies do so for free without a license obviously does.
Should OP be allowed to demand a license for redistribution from Orion Browser [0]? They make money selling a browser with a built-in ad blocker. Is that substantially different than what Perplexity is doing here?
[0] https://kagi.com/orion/
Orion browser, presuming it does what its name says it does, doesn't redistribute anything... so presumably not.
I asked you this in the other subthread, but what exactly is the moral distinction (I'm not especially interested in the legal one here because our copyright law is horribly broken) between these two scenarios?
* User asks proprietary web browser to fetch content and render it a specific way, which it does
* User asks proprietary web service to fetch content and render it a specific way, which it does
The technical distinction is that there's a network involved in the second scenario. What is the moral distinction?
Just put a long comment on the other thread addressing this.
I'm not sure what you mean exactly. If Perplexity is actually doing something with your article in-band (e.g. downloading it, processing it, and presenting that processed article to the user) then they're just breaking the law.
I've never used that tool (and don't plan to) so I don't know. If they just embed the content in an iframe or something then there's no issue (but then there's no need or point in scraping). If they're just scraping to train then I think you also imply there's no issue. If they're just copying your content (even if the prompt is "Hey Perplexity, summarise this article <ARTICLE_TEXT>") then that's vanilla infringement, whether they lie about their UA or not.
Sure it is, but which of the many small websites are going to be able to fight them legally? Most companies would go broke before getting a ruling.
Reality is, the law doesn't matter if you're big enough. As long as they're not stealing content from the big ones, they're going to be fine.
Well, I guess what I mean is that if the situation is as I describe in my previous comment, then anyone who did have the money to fight it would be a shoo-in. It's a much stronger case than, for example, the ongoing lawsuits by Matthew Butterick and others (https://llmlitigation.com/).
Thanks for the link, that's fantastic to hear!
I'm seriously sick of that whole "laundering copyright via AI" grift - and the destruction of the creative industry is already pretty noticeable. All the creatives who brought us those wonderful masterworks with lots of thought and talent behind them, they're all going bankrupt and getting fired right now.
It's truly a tragedy - the loss of art is so much more serious than people seem to think it is, considering how integral all kinds of creative works are to modern human life. Just imagine all of that being without any thought, just statistically optimized for enjoyment... ugh.
Can you explain what you mean by this? I’d be interested to know what jobs have been lost to AI (or if you are talking about something else)
Sorry for the late reply, was way too tired yesterday.
The most extreme situation is concept artists right now. Essentially, the entire profession has lost its jobs in the last year. Or casual artists making drawings on commission - they can't compete with AI and have mostly had to stop selling their art. Something similar is happening to professional translators - with AI, the translations are close enough to native that nobody needs them anymore.
The book market is getting flooded with AI-crap, so is of course the web. Authors are losing their jobs.
Currently, it seems to be creeping into the music market - not sure if people are going to notice/accept AI-made music. All the fantastic artists creating dubs are starting to go away as well, after all you can just synthesize their voices now.
It's quite sad, all considered.
Except, it can't possibly be like that - that would kill the Internet as you know it. It makes sense to consider scraping for the purposes of training as infringement - I personally disagree, I'm totally on the side of the AI companies on this one, but there's a reasonable argument there. But in terms of me requesting a summary, and the AI tool doing it server-side before sending it to me, without also adding it to the pile of its own training data? Banning that would mean banning all user-generated content websites, all web viewing or editing tools, web preview tools, optimizing proxies, malware scanners, corporate proxies, hell, maybe even desktop viewers and editing tools.
There are always multiple programs between your website and your user's eyeballs. Most of them do some transformations. Most of them are third-party, usually commercial software. That's how everything works. Software made by "AI company" isn't special here. Trying to make it otherwise is some really weird form of prejudice-driven discrimination.
This is why media publishers went behind paywalls to get away from Google News
Ironically, I’ve just started asking LLMs to summarize paywalled content, and if it doesn’t answer my question I’ll check web archives or ask it for the full articles text.
Perplexity has source references. I find myself visiting the source references. Especially to validate the LLM output. And to learn more about the subject. Perplexity uses a Google search API to generate the reference links. I think a better strategy is to treat this as a new channel to receive visitors.
The browsing experience should be improved. Mozilla had a pilot called Context Graph. Perhaps Context Graph should be revisited?
This seems like a missing feature for analytics products & the LLMs/RAGs. I don't think searching via an LLM/RAG is going away. It's too effective for the end user. We have to learn to work with it the best we can.
Alternative take: Perplexity is protecting users' privacy by not exposing them to be turned into "insights" by the SaaS.
My general impression is that the subset of complaints discussed in this thread and in the article boils down to a simple conflict of interest: the information supplier wants to exploit the visitor through advertising, upsells, and other time/sanity-wasting things; for that, they need to have the visitor on their site. Meanwhile, the visitors want just the information, without the surveillance, advertising, and other attention economy dark/abuse patterns.
The content is the bait, and ad-blockers, Google's instant results, and Perplexity, are pulling that bait off the hook for the fish to eat. No surprise fishermen are unhappy. But, as a fish, I find it hard to sympathize.
I’m not sure if this is relevant, but I go to a lot of sites because Perplexity has them noted in its answer.
I don't know what the typical usage pattern is, but when I've used Perplexity, I generally do click the relevant links instead of just trusting Perplexity's summary. I've seen plenty of cases where Perplexity's summary says exactly the opposite of the source.
This hits the point exactly, it’s an extension of stuff like Google’s zero click results, they are regurgitating a website’s content with no benefit to the website.
I would say, though, that the training argument may ultimately lead to a similar outcome, though it’s a bit more ideological and less tangible than regurgitating the results of a query. Services like ChatGPT are already being used as a Google replacement by many people, so long term it may reduce clicks from search as well.
This appears to be self-contradictory. If you let an LLM be trained* on “all the books” (posts, articles, etc.) in the world, the implication is that your potential readers will now simply ask that LLM. Not only will they pay Microsoft for that privilege while you get zilch, but you won’t even know they ever read the fruits of your research.
* Incidentally, thinking of information acquisition by an ML model as if it was similar to human reading is a problematic fallacy.
You're missing the part where Perplexity still makes a request each time it's asked about the URL. You still get the traffic!
What will happen if:
Website owners decide to stop publishing because it’s not rewarded by a real human visit anymore?
Then perplexity and the like won’t have new information to train their models on and no sites to answer the questions.
I think there is a real content dilemma here at work. The incentives of Google and website owners were more or less aligned.
This is not the case with perplexity.
What is a "visit"? TFA demonstrates that they got a hit on their site, that's how they got the logs.
Is it necessary to load the JavaScript for it to count as a visit? What if I access the site with noscript?
Or is it only a visit if I see all your recommended content? I usually block those recommendations so that I don't get distracted from the article I actually came to read—is my visit a less legitimate visit than other people's?
What exactly is Perplexity doing here that isn't okay that people don't already do with their local user agents?
It's in the title of TFA: they're being dishonest about who they are. PerplexityBot seems to understand that robots.txt is addressed to it.
It's understood that site operators have a right to use the User-Agent to discriminate among visitors; that's why robots.txt is a standard. Crawlers that disrespect the standard have for many years been considered beyond the pale; thieves and snoopers. TFA's complaint is entirely justified.
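To make the mechanics of the standard concrete: a polite crawler fetches /robots.txt, matches its own User-Agent against the rules, and skips anything disallowed before requesting it. A minimal sketch with Python's standard library, using a hypothetical robots.txt that blocks PerplexityBot while allowing everyone else:

```python
from urllib import robotparser

# Hypothetical robots.txt that singles out one crawler by name
robots_txt = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# A crawler honoring the standard would skip this site entirely...
print(rp.can_fetch("PerplexityBot", "https://example.com/some-post"))  # False
# ...while any other bot is free to crawl it
print(rp.can_fetch("SomeOtherBot", "https://example.com/some-post"))   # True
```

The whole scheme only works because crawlers identify themselves honestly; nothing technically stops a disrespectful one from matching the `*` rules instead.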
First, I'm ignoring the output of Perplexity. I have no reason to believe that they gave the LLM any knowledge about its internal operations, it's just riffing off of what OP is saying.
Second, PerplexityBot is the user agent that they use when crawling and indexing. They never claimed to use that user agent for ad hoc HTTP requests (which are notably not the same as crawling).
Third, I disagree that anyone has an obligation to be honest in their User-Agent. Have you ever looked at Chrome's user agent? They're spoofing just about everyone, as is every browser. Crawlers should respect robots.txt, but I'd be totally content if we just got rid of the User-Agent string entirely.
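For context on how little honest signal that header carries, a current desktop Chrome User-Agent string looks roughly like this (version numbers will vary):

```
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36
```

One string claims Mozilla, AppleWebKit, Gecko compatibility, Chrome, and Safari all at once, purely for legacy-compatibility reasons.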
Is that a distinction without a difference?
I think the robots.txt RFC was addressed specifically to crawlers; so technically "ad hoc" requests generated automatically (i.e. by robots) aren't included. But the distinction operators would like to make is between humans and automata. Whether some automaton is a crawler or not isn't relevant.
If explicitly telling it to access a URL is an access by automaton, then isn't every web browser load an access by automaton?
The flaw with that example is your web browser isn't between other users and the website, turning 500 views into one.
And if we took the analogy to the other end, one could argue that all crawlers have to be kicked off manually at some point...
The problem is that here, in reality, the differentiation is somewhat better understood.
The honor system web is going away, that's for sure.
There are a lot of people making this assumption about the way Perplexity is working, but there is no evidence in TFA that Perplexity is caching its ad hoc requests.
And even if they were, what's left unsaid is why it even would matter if 500 views turned into one. It matters either because of lost ad revenue or lost ability to track the users' behavior. Personally, I'm okay with moving past that phase of the internet's life and look forward to new business models that aren't built around getting large numbers of "views".
So, a caching proxy? That has its own issues, but it's the opposite of access by automaton. One button press causes less than one access to the server. Though one button press still results in one user view, so it's only reducing loads in some ways.
But also is that happening here?
One button press causing a million page loads is access by automaton. The distinction seems pretty simple to me.
Actually, no, the fact that it's a crawler is the most important fact. The reason why website operators care at all about robots accessing their site (as distinct from humans controlling a browser) is historically one of two reasons:
* The pattern of requests can be very problematic. Impolite crawlers are totally capable of taking down a website by hitting it over and over and over again for hours in a way that humans won't.
* Crawlers are generally used to build search indexes, so instructing them about URLs that would be inappropriate to have show up in a search is relevant.
The behavior that OP is complaining about is that when the user pastes a URL into Perplexity, Perplexity fetches that URL. Neither the traffic pattern nor the persistence profile are remotely similar to typical crawler behavior. As far as I can see there's almost nothing to distinguish it from someone using Edge and then using Edge's built-in summarizer.
A visit is a human reader.
At the very least they get exposed to your website name.
Notice your product/service if you get lucky.
Become a customer at a later visit.
We are talking about cutting the first step off so that everything which may come afterwards is cut off as well.
In other words, content is bait, reward is a captured user whose attention - whose sanity, the finite amount of life - can be wasted or plain used against them.
I'm more than happy to see all the websites with attention economy business models to shut down. Yes, that might be 90% of the Internet. That would be the 90% that is poisonous shit.
The attention economy will never die. Attention will only shift: from websites to aggregators like Perplexity.
Perplexity isn't playing in the attention economy unless they upsell you, advertise to you, or put any other kind of bullshit between you and your goal. Attention economy is (as the name suggests) about monetizing attention; it does so through friction.
I didn’t write that they would. I said “like”. The next Perplexity will show ads.
The attention economy will not die. Because it’s hasn’t for the last 100 years. The profits just shift to where the attention is now.
Fair enough, I agree with that. Hell, we may not need a next Perplexity; this one may very well enshittify a couple of years down the line, as happens to almost any service offered commercially on the Internet. I was just saying it isn't happening now. For the moment, Perplexity has arguably much better moral standing than most of the websites they scrape or allow users to one-off browse.
The behavior that TFA is complaining about is that when the user drops a link to a site into Perplexity it is able to summarize the content of that link. This isn't about the discoverability aspect of Perplexity, they're specifically complaining that the ad hoc "summarize this post" requests don't respect robots.txt [0]. That's what I'm arguing in favor of and that's the behavior that TFA is attacking.
[0] Which, incidentally, is entirely normal. robots.txt is for the web crawler that indexes, not for ad hoc requests.
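That distinction is visible in how robots.txt is actually written. A site can disallow the indexing crawler by name while the ad hoc, user-initiated fetches are out of robots.txt's scope by design. A minimal sketch, using the PerplexityBot name from Perplexity's own crawler docs:

```
# robots.txt — addresses the indexing crawler only; a browser (headless
# or not) fetching a single page at a user's request is not a "robot"
# in this sense and is not expected to consult this file.
User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
```

Whether a one-off summarization fetch *should* also honor this file is exactly the disagreement in this thread; the file format itself only ever contemplated crawlers.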
There was a human reader on the other side of the summarization feature. And they did get exposed to the website name. Is that not enough? Would it be different if equivalent summarization was being done by a browser extension?
What's stopping Perplexity from caching this info for, say, 24 hours, and then redisplaying it to the next few hundred people who request it?
Then they don't get the extra hits. So is that it—is a "visit" important because of the data that you're able to collect from the visit?
Does this place HN's rampant use of archive.md on the same moral footing as Perplexity?
How would an LLM training on your writing reduce your reward?
I guess if you're doing it for a living sure, but most content I consume online is created without incentive (social media, blogs, stack overflow).
I write a fair amount and have been for a few years. I like to play with ideas. If an llm learned from my writing and it helped me propagate my ideas, I'd be happy. I lose on social status imaginary internet points but I honestly don't care much for them.
The craziest one is the stack overflow contributors. They write answers for free to help people become better programmers but they're mad an llm will read their suggestions and answer questions that help people become better programmers. I guess they do it for the glory of having their handle next to the answer?
I think a concern for people who contribute on Stack Overflow is that an LLM will pollute the water with so many subtly wrong answers that the collective work of answering questions accurately will be overwhelmed by a tsunami of inaccurate LLM-generated answers, more than an army of humans can keep up with checking and debugging (or debunking).
It's nice that people are willing to create content on Stack Overflow so that Prosus NV can make advertising revenue from their free labor. But ultimately only a fool would trust answers from secondary sources like Stack Overflow, Quora, Wikipedia, Hacker News, etc. They can be useful sources to start an investigation but ultimately for anything important you still have to drill down to reliable primary sources. This has always been true, and the rise of LLMs doesn't change anything.
For what it's worth, the Stack Exchange terms of service do prohibit AI generated content. I'm not sure how they actually enforce that, and in practice as the LLMs improve it's going to be almost impossible to reliably detect.
https://meta.stackexchange.com/help/gen-ai-policy
What is often even more helpful than the answers on S.O. are the comments. Of course it is only a place to begin an investigation. But who will want to clarify properly if most of the answers are LLM garbage, too many to keep up with?
It is not simply "nice", or for internet points, to take time to answer other people's questions.
Being able to pass on knowledge is the glue of society and civilization. Cynicism about the value or reason of doing so is not a replacement for a functioning structure to educate people who want to learn or to point them in the right direction.
We managed to pass on knowledge and keep civilization functioning before Stack Overflow existed. We'll be fine without it.
Yes, it's hardly surprising that people find upvotes and direct social rewards more exciting than being slurped somewhere into GPT-4's weights.
But they get to enjoy both the social proof on SO and GPT-4 existing.
It's not like they're getting validation from most readers anyway. People who vote and comment on answers are playing the SO social/karma game and will continue to do so whether GPT-4 exists or not. Conversely, people who'll find answers via an LLM instead of viewing it on SO are people who wouldn't bother logging in to SO, even if they had accounts on it in the first place.
People are complaining about losing the audience they never had.
Speaking as an SO contributor, I'm perfectly fine with having an LLM read my answers and produce output based on them. What I'm not okay with is said LLM being closed-weight so that its creator can profit off it. When I posted my answers on SO, I did so under CC-BY-SA, and I don't think it's unreasonable for me to expect any derivatives to abide by both the letter and the spirit of this arrangement.
This hits the nail completely on the head.
If the issue here was "just" training LLMs, like some AI bros want to deflect it to be, the conversation around this topic would be very different, and I would be enthusiastically defending the model trainers.
But that's not this conversation. These are companies that are trying to fold our permissively-licensed content into weights, close-source it, and make themselves the only access point, all while pre-emptively performing regulatory capture with all the right DEI buzzwords so that the open-source variants are sufficiently demonized as "alt-right" and "dangerous".
The thing that truly frightens me is that (even here on Hacker News) there is an increasing number of people that have fallen for the DEI FUD and are honestly cheering on the Sam Altmans of the world to control the flow of information.
In my experience they do it for points and kudos. Having people get your answers from LLMs instead of your answer on SO stops people from engaging with the gamification tools and so users get less points on the site.
Yeah. I don’t think people do much of anything for truly no reward. Most people want to directly impact and be recognized by others.
Because you're not getting the ad impressions anymore. The harsh reality is that people do not click on to sources, so when sites like Perplexity copy your content, you lose the revenue on that content.
This, in turn, drives all real journalism out of business. And then everyone's screwed, including these AI reposting sites.
It's a literal tragedy of the commons
It's not really a dilemma.
This is exactly what copyright serves to protect authors from. Perplexity copied the content, and in doing so directly competes with the original work, destroying its market value and driving the original author out of business. This is literally what copyright was invented to prevent.
It's the exact same situation as journalists going after Google & social media embeds of articles, which these sites propagandized as "prohibiting hyperlinking", but the issue has been the embedded (summary of the) content. Which people don't click through, and this is the entire point of those features for platforms like Facebook; Keeping users on facebook and not leaving.
This is why quite a few jurisdictions agreed with the journalists and moved to institute restrictions on such embedding.
By all practical considerations, perplexity is doing the exact same thing and trying to deflect with "we used an AI to paraphrase".
The key difference here is that linking is and always has been fine. Google's Book search feature is fair use because the purpose is to send you to the book you searched for, not substitute the book.
Google's current AI summary feature is effectively the same as Perplexity. People don't click through to the original site, the original site doesn't get ad impressions or other revenue, and is driven out of business.
What will happen is what already is happening: Journalists are driven out of business, replaced by AI slop.
And then what? AI needs humans creating original content, especially for things like journalism and fact-finding. It'd be an eternal AI winter, all LLMs doomed to be stuck in 2025.
It's in every AI developer's best interest to halt the likes of Perplexity immediately before they irreparably damage the field of AI.
I see no competition. I use Perplexity regularly to give me summaries of articles or to do preliminary research. If I like what I'm seeing, then I go to the source. If a source chooses to block their content because they don't want it to be accessed by AI bots then they reduce even further the chance of me - and increasingly more persons - touching their site at all.
"Let us steal your content or you won't get any traffic" sounds extortionate
It is what it is. AI is increasingly being used to make lives easier. Those who choose to isolate from AI choose to isolate from the many using it.
We're burning long term value and the open web for shitty chat bots.
You can say that, it doesn't matter. The statistics show that these tools reduce views.
And really, "I'm going to replace my entire news intake with the AI slop even if it's entirely hallucinated lies or propaganda" is perhaps not something you ought to say out loud.
>And then what? AI needs humans creating original content, especially for things like journalism and fact-finding. It'd be an eternal AI winter, all LLMs doomed to be stuck in 2025.
>It's in every AI developer's best interest to halt the likes of Perplexity immediately before they irreparably damage the field of AI.
That’s exactly the problem and we all know that it will happen.
A lot of the public website content targeted towards consumers is already SEO slop trying to sell you something or maximize ad revenue. If those website owners decide to stop publishing due to lack of real human visits then little of value will be lost. Much of the content with real value for consumers has already moved to sites that require registration (and sometimes payment) for access.
For technical content of value to professionals, much of that is hosted by vendors or industry organizations. Those tend to get their revenue in other ways and don't care about companies scraping their content for AI model training. Like the IETF isn't going to stop publishing new RFCs just because Perplexity uses them.
This feels like the fundamental core component of what copyright allows you to forbid.
Which is a huge difference. The latter is someone asking for a copy of my content (from someone with a valid license, myself), and manipulating it to display it (not creating new copies, broadly speaking allowed by copyright). The former adds in the criminal step of "and redistributing (modified, but that doesn't matter) versions of it to users without permission".
I mean, I'm all for getting rid of copyright, but I also know that's an incredibly unpopular position to take, and I don't see how this isn't just copyright infringement if you aren't advocating for repealing copyright law all together.
I'm curious to know where you draw the line for what constitutes legitimate manipulation by a person and when it becomes distribution.
I'm assuming that if I write code by hand for every part of the TCP/IP and HTTP stack I'm safe.
What if I use libraries written by other people for the TCP/IP and HTTP part?
What if I use a whole FOSS web browser?
What about a paid local web browser?
What if I run a script that I wrote on a cloud server?
What if I then allow other people to download and use that script on their own cloud servers?
What if I decide to offer that script as a service for free to friends and family, who can use my cloud server?
What if I offer it for free to the general public?
What if I start accepting money for that service, but I guarantee that only the one person who asked for the site sees the output?
Can you help me to understand where exactly I crossed the line?
Obviously not legal advice and I doubt it's entirely settled law, but probably this step
You're allowed to make copies and adaptations in order to utilize the program (website), which probably covers a cloud server you yourself are controlling. You aren't allowed to do other things with those copies though, like distribute them to other people.
Payment only matters if we're getting into "fair use" arguments, and I don't think any really apply here.
I think you're probably already in trouble with just offering it to family and friends, but if you take the next step offering it to the public that adds more issues because the copyright act includes definitions like "To perform or display a work “publicly” means (1) to perform or display it at a place open to the public or at any place where a substantial number of persons outside of a normal circle of a family and its social acquaintances is gathered; or (2) to transmit or otherwise communicate a performance or display of the work to a place specified by clause (1) or to the public, by means of any device or process, whether the members of the public capable of receiving the performance or display receive it in the same place or in separate places and at the same time or at different times."
Why is that the line and not a paid web browser? What about a paid web browser whose primary feature is a really powerful ad blocker?
Why would a paid web browser be the line?
No one is distributing copies of anything to anyone then apart from the website that owns the content lawfully distributing a copy to the user.
Also why is a paid web browser any different than a free one?
Paid is arguably different than free because the code that is actually asking for the data is owned by a company and licensed to the user, in much the same way as a cloud server licenses usage of their servers to the user. That said, I'll note that my argument is explicitly that the line doesn't exist, so I'm not saying a paid browser is the line.
I'm unfamiliar with the legal questions, but in 2024 I have a very hard time seeing an ethical distinction between running some proprietary code on my machine to complete a task and running some proprietary code on a cloud server to complete a task. In both cases it's just me asking someone else's code to fetch data for my use.
Great, so we agree that your previous comment asking I address "paid browsers" in particular was an unnecessary distraction.
It's important to recognize that copyright is entirely artificial. Congress went "let's grant creators some monopolies on their work so that they can make money off of it", and then made up some arbitrary lines for what they did and did not have a monopoly over. There's no principled ethical distinction between what is on one side of the line and the other, it's just where congress drew the arbitrary line in the sand. It then (arguably) becomes unethical to do things on the illegal side of the line precisely because we as a society agreed to respect the laws that put them on the illegal side of the line so that creators can make money in a fair and level playing field.
Sometimes the lines in the sand were in fact quite problematic. Like the fact that the original phrasing meant that merely running a computer program would almost certainly violate the law. So whenever that comes up Congress amends the exact details of the line... in the US, in the case of computers, by carving out an exception in section 117 of the copyright act. It provides (in part) that it is not an infringement for the owner of a copy of a computer program to make a copy or adaptation "created as an essential step in the utilization of the computer program in conjunction with a machine"
and provides the restriction that "[a]daptations so prepared may be transferred only with the authorization of the copyright owner."
By my very-much-not-a-lawyer reading, those are the relevant parts of the law: they allow things like local ad blockers, but they disallow a third-party website which downloads content (acquiring ownership of a lawfully made copy), modifies it (valid under the first exception if that was a step in using the website), and distributes the adapted website to its users (illegal without permission).
How is using perplexity any more so making a copy than your browser is making a copy? Unless you are distributing your website on thumb drives or floppy disks all distribution is achieved by making a copy. That's how networks work.
Your logic would also imply that viewing a website through a VPN not operated by yourself would require the VPN operator to have a redistribution license for all the content on the website which is not the case.
How do you think google is able to scrape whatever they like and redistribute summaries of the pages they have visited without consulting everyone who has ever made a website for a redistribution license.
That being said, Copyright is not enforced or interpreted consistently. It seems that individual cases can be decided based on what people ate for lunch on the day of the case, who the litigants are, and maybe the alignment of the planets.
Both are, the difference is that your browser doesn't transfer the copy to a new legal entity after modifying it. Rather the browser is under the control of the end user and the end user owns the data (not the copyright, but the actual instance of the data) the whole time.
It doesn't, because the VPN doesn't modify it, and the law explicitly distinguishes between the two cases and allows for transferring in the case of exact copies (provided you transfer all rights). I left this part of section 117 out because it wasn't relevant, but I'll quote it here: "Any exact copies prepared in accordance with the provisions of this section may be leased, sold, or otherwise transferred, along with the copy from which such copies were prepared, only as part of the lease, sale, or other transfer of all rights in the program."
A fair use argument, which I think is less likely (and I'd go so far as to say unlikely) to apply to a service like perplexity.ai but is ultimately a judgement call that will be made by the legal system and like all fair use arguments has no clear boundaries.
TECHNICAL ANALYSIS
The key, as many here have missed, is authentication and authorization. You may have authorization to log in and view movies on Netflix. Not to rebroadcast them. Even the question of a VCR for personal use was debated in the past.
Distributing your own scripts and software to process data is not the same as distributing arbitrary data those scripts encountered on the internet for which you don’t have a license.
If someone wrote an article, your reader could transform it based on your authenticated request, so long as your user has an authorized subscription.
But if that reader then sent the article down to a remote server to be processed for distribution to unlimited numbers of people, it would be “pirating” that information.
The problem is that much of the Web is not properly guarded against this. Xanadu had ideas about micropayments 30 years ago. Take a look at what I am building using the current web: https://qbix.com/ecosystem
LEGAL ANALYSIS
Much of the content published on the Web isn’t secured with subscriptions and micropayments, which is why the whole thing becomes a legal battle as silly as “exceeding authorized access” which landed someone like Aaron Swartz in jail.
In other words, it is the question of “piracy”, which has acquired a new character only in that the AI is trained on your data and transforms it before it republishes it.
There was also a lawsuit about scraping LinkedIn, which was settled as follows: https://natlawreview.com/article/hiq-and-linkedin-reach-prop...
Legally, you can grant access to people subject to a certain license (e.g. Creative Commons Share-Alike), and then any derived model must have its weights opened. Similar to, say, the Affero GPL license for derivative software.
Why are you ignoring his main argument?
I'm not. I'm asking why this flow is "distribution":
* User types an address into Perplexity
* Perplexity fetches the page, transforms it, and renders some part of it for the user
But this flow is not:
* User types an address into Orion Browser
* Orion Browser fetches the page, transforms it, and renders some part of it for the user
Regardless of the legal question (which I'm also skeptical of), I'm especially unconvinced that there's a moral distinction between a web service that transforms copyrighted works in an ad hoc manner upon a user's specific request and renders them for that specific user vs an installed application that does exactly the same thing.
The moral case is pretty obviously that Perplexity is preventing traffic from reaching the people who made the content.
How so? TFA pretty clearly shows that traffic does reach the server, how else would it show up in the logs?
Also, the author of TFA has already gotten themselves deindexed, the behavior they're complaining about now is that if someone copies and pastes a link into Perplexity it will go fetch the page for the user and summarize it.
This scenario presupposes that the user has a link to a specific page. I suspect that in nearly all cases that link will be copied from the address bar of an open tab. This means that most of the time the site will actually get double the traffic: one hit when the user opens it in the browser and a second when Perplexity asks for the page to summarize it.
Where exactly you crossed the line is a question for the courts. I am not a lawyer and will therefore not help with the specifics.
However, please see the Aereo case [0] for a possibly analogous case. I am allowed to have a DVR. There is no law preventing me from accessing my DVR over a network. Or possibly even colocating it in a local data center. But Aereo definitely crossed a line. Also see Vidangel [1]. The fact that something is legal to do at home, does not mean that I can offer it as a cloud service.
[0] https://www.vox.com/2018/11/7/18073200/aereo
[1] https://en.m.wikipedia.org/wiki/Disney_v._VidAngel
I expect you're right. Although Perplexity thinks they're well within the law[0]. Are they correct? I guess we'll see....
[0] https://www.perplexity.ai/search/Why-are-you-2wJteqZ4SUCqPjk...
Which is offensive, and the legal structure underlying it should be changed. It makes zero sense to count renting out machines as "distribution" when a person could legally install and use the exact same machine themselves.
I actually don't see the legal distinction here. A browser with an ad blocker is also:
1. Asking for a copy of your content
2. Manipulating the content
3. Redistributing the content to the end-user who requested it
Ditto for the LLM that has been asked by the end user to fetch your content and show it to them (possibly with a manipulation step e.g. summarization).
I don't think there's a legal, copyright distinction between doing that on a server vs doing that on a local machine. And, for example, if there were a difference: using a browser on a remote desktop would be illegal, or using curl on a machine you were SSHed into would be illegal. Also, an LLM running locally on your machine (doing the exact same thing) would be legal!
I understand that it's inconvenient and difficult to monetize content when an LLM is summarizing it, and hard to upsell other pages on a website to users when they aren't coming to your website and are instead accessing it through an LLM. But legally I think there's not an obvious distinction on copyright grounds, and if there were (other than a very fine-grained ban on specifically LLMs accessing websites, without any general principle behind it), it would catch up a lot of legitimate behavior in the dragnet.
I'd also point out that in the U.S., search engines have passed the "Fair Use" test of exemption from copyright — I think it would be very hard to make a distinction between what a search engine is doing (which is on a server!) and what an LLM is doing based on trying to say copyright distinguishes between server vs client architectures.
The difference isn't so much the server, but the third party. You're allowed to modify computer programs (websites) as part of using them. You aren't allowed to then transfer the modified version (see section 117 of the US copyright code).
If you're in control of the server there's a plausible argument that you aren't transferring it. When perplexity is in control of the server... I don't see it. A traditional ad-blocker isn't "redistributing the content to the end-user who requested it" because it's the end user who has ownership over the data the whole time (note: not the copyright, the actual individual instance of the data). Unlike with a server run by a third party there is no third party legal entity who ever has the data.
You could conceivably make "ublock origin except it's a proxy run by a third party and we modify the website on the proxy", I'd agree that that has the same problem as a service like perplexity (though a different fair use analysis and I'm not sure what way that would go).
Well, sure. It's easy to distinguish between an LLM summarizing content and a traditional search engine though (and in ways relevant to the fair use analysis), just not based on the server client architecture.
Disclaimer: Not a lawyer, not legal advice, and so on.
Section 117 is irrelevant — it grants archival rights to end-users for computer programs. It doesn't make claims about servers or legal third parties.
(Although it is relevant in disproving your point: I can pay an archival service to back up data I legally have the right to view, even if the backup is then on their server, and despite the service being a different legal entity than me. And they can give me a copy of it later, too.)
So, running a local LLM version of Perplexity that does exactly the same thing is legal, but Perplexity is illegal, because "a third party legal entity has the data"?
Why should it be possible to stop an LLM from training itself on your data? If you want to restrict access to data then don't post it on a public website. It's easy enough to require registration and agreement to licensing terms for access.
It seems like some website owners want to have their cake and eat it too. They want their content indexed by Google and other crawlers in order to drive search traffic but they don't want their content used to train AI models that benefit other companies. At some point they're going to have to make a choice.
Because if I run a server - at my own expense - I get to use information provided by the client to determine what, if any, response to provide? This isn’t a very difficult concept to grasp.
I'm having difficulty grasping the concept. Only a fool would trust any HTTP headers such as User-Agent sent by a random unauthenticated client. Your expenses are your problem.
… and I have absolutely no obligation to provide any particular response to any particular client.
Parsing, rendering, and trusting that the payload is consistent from request to request is your problem. You can connect to my server, or not. I really don’t care. What you cannot do is dictate how my server responds to your request.
Sure. So just return an HTTP 4XX response to requests you don't like. What's the problem?
Or, I return whatever content I want, within the bounds of the law, based on whatever parameters I decide. What's your problem with that? Again, connect to my server or don't. But don't tell me what type of response I'm obligated to provide you.
If I think a given request is from an LLM training module, I don't have any legal obligation whatsoever to return my original content. Or a 400-series response. If I want to intersperse a paragraph from Don Quixote between every second sentence, that's my call.
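The "respond however I want" position above is straightforward to implement server-side. A hedged sketch in nginx configuration (illustrative only: the bot names are assumptions, and a fetcher presenting a spoofed desktop User-Agent would sail right past this):

```
# Classify requests by the User-Agent string the client *claims*.
map $http_user_agent $is_llm_bot {
    default          0;
    ~*PerplexityBot  1;   # Perplexity's documented crawler UA
    ~*GPTBot         1;   # example of another declared AI crawler
}

server {
    listen 80;
    server_name example.com;

    location / {
        # Deny (or substitute decoy content for) self-identified bots.
        if ($is_llm_bot) {
            return 403;
        }
        try_files $uri $uri/ =404;
    }
}
```

Note the limitation this thread keeps circling: the scheme only works against clients that identify themselves honestly, which is precisely what the server has no way to compel.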
This argument of freedom seems applicable on both sides. A site owner/admin is free to return whatever response they wish based on the assumed origin of a request. An LLM user/service is free to send whatever info in the request that elicits a useful response.
I don’t have any problem with that.
But nobody is arguing for that. Instead, what the server owners want is to mandate that clients connecting to them provide enough information to reliably identify and reject such connections.
There are literally people in this thread arguing that it is "unethical" to discriminate based on user agent.
The client is under no obligation to be truthful in its communications with a server. Spoofing a User-Agent doesn't "dictate" anything. Your server dictates how it responds all on its own when it discriminates against some User-Agents.
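To make that concrete: the User-Agent is just a string the client chooses to send, and nothing on the client side validates the claim. A minimal Python sketch (the URL and the desktop-Chrome-style UA value are made up for illustration; no request is actually sent):

```python
import urllib.request

# The User-Agent header is simply a claim made by the client; the server
# receives whatever string the client decides to present.
SPOOFED_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)

req = urllib.request.Request(
    "https://example.com/article",  # hypothetical URL, never fetched here
    headers={"User-Agent": SPOOFED_UA},
)

# Nothing verifies the claim before it goes on the wire.
print(req.get_header("User-agent") == SPOOFED_UA)  # prints True
```

This is why User-Agent discrimination is a statement of server policy, not an enforceable access control.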
With enough sophistication and bad intent, at some point being untruthful to a server falls under computer intrusion laws, eg using a password that is not yours. I don't believe spoofing user agent would be determinant for any such case though.
Even redistributing secret material you found on an accidentally open S3 bucket, without spoofing UA, could be considered intrusion if it was obvious the material was intended to be secret and you acted with bad intent.
This is a technical fact.
It is also a technical fact that a client can send any header it wants.
I think that is implied in my comment. You can send me whatever request you want, within the bounds of the law. I get to decide, within the bounds of the law, how I respond. Demanding I provide a particular response to every client (and what the parent commenter and others seem to be arguing for) is where I take exception.
The companies will scrape and internalise the "customer asked for this" requests... and slowly turn the latter into the former, or just use their own tool as the scraper.
No, easier to just ask a simple question: Does the company respect the access rules communicated via a web standard? No? In that case hard deny access to that company.
These companies don't need to be given an inch.
This is exactly the concern and there’s a lot of comments just completely ignoring it or willfully conflating.
Ad block isn’t the same problem because it doesn’t and can’t steal the creator’s data.
Arguably it does. That topic has been debated endlessly and there are plenty of people on HN who are willing to fiercely argue that adblock is theft.
I happen to agree with you that adblock doesn't steal data, but I'm also completely unsure why interacting with a tool over a network suddenly turns what would be acceptable on my local computer into theft.
If that's the concern, then ask for a line in the terms and conditions that explicitly says a user-initiated request will not be saved or used for training. Don't act like the access itself is an affront.
So should Firefox not allow changing the user agent in order to bypass websites that erroneously claim to not work on Firefox?
Similarly, for sites which configure robots.txt to disallow all bots except Googlebot, I don't lose sleep about new search engines taking that with a grain of salt.
Citing the source doesn't bring you, the owner of the site, any valuable data: when your data was accessed, who accessed it, from where, at what time, on what device, etc. It brings data to the LLM's owner, and you get
N O T H I N G.
Could you change the way printed news magazines showed their content? No. Then, why is that a problem?
Btw nobody clicks on sources. NOBODY.
I always click on sources to verify what, in this case, an LLM says. I also hear the claim about people not reading sources a lot (before LLMs it was video content with references), but I always visit the sources. Are there statistics or studies that actually support this claim? Or is it just personal experience that people (including me) project as the generic behavior of everyone?
That's you, because you are a researcher or coder or someone who uses their brain much more than average, hence not an average joe. I ran a news site for 15 years, and the stats showed that out of 10,000 views on an article, only a minuscule number of clicks were made on the source links. Average people do not care where the info is coming from.
Also, Perplexity shows the videos on their own site; you cannot go to YouTube. You have to start playback on their site, and then click the YouTube logo in the player's lower right to get to the actual site.
Perplexity is getting greedy.
You said "NOBODY" (pretty sure the all caps means it's extra true).
Well, that's the reality. The tech savvy people here are the exception, and represent only a very minor percentage of users.
It seems self-evident to me that if a user tells a bot to go get a web page, robots.txt doesn't apply, and the bot shouldn't respect it. I understand others' concern that, as with Apple's Reader mode and other similar tools, it's ethically debatable whether a site should be required to comply with the request, and spoofing a user agent seems like dubious territory. I don't think a good answer has been proposed for this challenge, unfortunately.
Just to clarify, Perplexity is not spoofing a user agent, they're legitimately using a headless Chrome to fetch the page.
The author just misunderstood their docs [0]: when they say that "you can identify our web crawler by its user agent", they're talking about the crawler, not the browser they use for ad hoc requests. As you note, crawling is different.
[0] https://docs.perplexity.ai/docs/perplexitybot
This is completely false: the user agent being used by Perplexity is _not_ the headless Chrome user agent, which looks like a normal Chrome string except that the product token is "HeadlessChrome" instead of "Chrome".
They are spoofing it to pretend to be a desktop Chrome one.

Ah, you're correct, my bad.
I don't personally have a problem with spoofing user agents, but yeah, they're either spoofing or for some reason they're truly using a non-headless Chrome.
There's a difference here between "headless chrome" as a concept and "headless-chrome" the software. It's still pretty common to run browser automation with a full "headful" browser, in which case you would just get the normal user agent. headless-chrome is sort of an optimized option that comes with some downsides.
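For reference, the distinction this subthread turns on comes down to one product token in the user-agent string. A minimal sketch, with illustrative (made-up) version numbers:

```python
# Illustrative UA strings; the version numbers are made up for the example.
HEADLESS_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) HeadlessChrome/118.0.0.0 Safari/537.36"
)
DESKTOP_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36"
)

def is_headless(ua: str) -> bool:
    # Headless Chrome advertises "HeadlessChrome" where desktop
    # Chrome advertises "Chrome"; a headful automated browser
    # sends the normal desktop string.
    return "HeadlessChrome" in ua

print(is_headless(HEADLESS_UA))  # True
print(is_headless(DESKTOP_UA))   # False
```

So a server seeing the plain desktop string can't tell, from the UA alone, whether it's a human's browser, a headful automation setup, or a deliberately spoofed client.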
If the user specifically asks for a file and asks a computer program to process it in a specific way, that should be permitted, regardless of user-agent spoofing (although user-agent spoofing should ideally only happen when the user specifically requests it; the program should not do it automatically). However, this works better with FOSS and/or local programs (or when the user accesses them through a proxy, VPN, Tor, etc.). Furthermore, any company that provides such services should not use unethical business practices, false advertising, etc., to do so.
If the company wants a copy of the files for its own use, that is a bit different. When a large number of files is accessed at once, robots.txt is useful for blocking it. If they can get a copy of the files in a different way (assuming the files are intended to be public anyway), then they might do so. Even in this case, though, they should not use unethical business practices, false advertising, etc.; and they should also avoid user-agent spoofing.
(In this case, the user-agent spoofing does not seem to be deliberate, since it uses a headless browser. They should still change it, though; probably by keeping the user-agent string but appending an extra token such as "Perplexity", to indicate what it is, in addition to the headless browser.)
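A minimal sketch of that suggestion. The "Perplexity/1.0" product token and the base UA string are both hypothetical, made up for illustration:

```python
def identify(base_ua: str, product: str = "Perplexity/1.0") -> str:
    # Keep the browser's real user-agent string and append an
    # identifying product token, rather than replacing it outright.
    return f"{base_ua} {product}"

ua = identify(
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) HeadlessChrome/118.0.0.0 Safari/537.36"
)
print(ua)
```

That way sites keep the information the headless browser already discloses, plus an honest label they can filter on if they choose.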
A user-agent requests the file using your credentials, eg a cookie or public key signature.
It is transforming the content for you, an authorized party.
That is not the same as then making derivative copies and distributing the information to others without paying. For example, if I bought a ticket to a show, taped it and then distributed it to everyone, disregarding that the show prohibited this.
If I shared my Netflix password with up to 5 others, at least I can argue that they are part of my “family” or something. But to unlimited numbers of people? Why would they pay for netflix, and how would the shows get made?
I am not necessarily endorsing government force enforcing copyright, which is why I have been building a solution to enforce it at the tech level: https://Qbix.com/ecosystem
Well, I am opposed to copyright. If it is publicly available, then you can make a copy, and even a modified version (as long as you do not claim that it is the same as the original).
However, what you say about credentials is still valid in the case of private data; this is why you should run the program locally rather than use some other company's remote service for this. (Well, that is one reason. Another reason is all the other bad stuff they do with the service.)
The point about credentials is also valid for published data that requires a password to access through that service; but even then, if you are willing to ignore copyright, you can just use a different copy of the same file (which you might make yourself).
None of this means that you cannot pay for it, if they accept payment. Nor does it mean that whoever made it is required to give it away for free. What it does mean is that if you have a copy, you do not have to worry about copyright and other legal mess; you can just use it; a license is not required.
However, how much power big companies waste processing your data, whether or not they are authorized to access it, is another issue. That is potentially a reason to disallow some uses, but it is independent of copyright (which is bad anyway).
I’m not saying you’re wrong, but why? And what do you mean by “your data” here?
The website that they created.
By "my data" he means the data a site spent time and money to create and publish.
The problem that Perplexity has that ad blockers don't is that they're an independent site publishing content based on work they didn't produce. That runs afoul of both copyright law and Section 230, which lets sites like Google and Facebook operate. That's pretty different from an ad blocker running on your local machine. The ad blocker isn't publishing the page it edited for you.
What distinguishes these two situations?
* User asks proprietary web browser to fetch content and render it a specific way, which it does
* User asks proprietary web service to fetch content and render it a specific way, which it does
The technical distinction is that there's a network involved in the second scenario. What is the moral distinction?
Why is it that a proprietary web service manipulating content on behalf of a user is "publishing" content illegally, while a proprietary web browser doing the exact same kind of transformations is not? Assume that in both cases the proprietary software fetches the data upon request, does not cache it, and does not make the transformed content available to other users.
I don't have a horse in this race, but:
That sounds like Google Translate to me, when pasting a URL.
Bonus points if, instead of pasting a URL directly, it is submitted to one of the Internet Archive-like sites, and that archive URL is then submitted to Google Translate. That would be a download and adaptation (by Google Translate) of a download and adaptation[1] (by the Internet Archive) of the original content.
[1]: These archive sites usually present the content in a slightly different way. Granted, it's usually just adding stuff around the page, e.g. to let you move around different snapshots, but that's still showing stuff that was originally not there.
I'm okay with this world, as a tradeoff. I'm not sure users should have _the right_ to reformat others' content.
Users should have the right to reformat their own copy of others' content (automatically as well as manually). However, if they then redistribute the reformatted copy, they should not be allowed to claim that it has the same formatting as the original, because it does not.
Personally, I think AI is a major win for accessibility, and we should not be preventing people from accessing information in the way that is best suited for them.
Accessibility can mean everything from a blind person wanting to interact with a website by voice, to someone recovering from surgery wanting to reduce unnecessary popups and clicks on a website to get to the information they need. Accessibility is in the eye of the accessor, and AI is what enables them to achieve it.
The way I see it, AI is not a robot and doesn't need to look at robots.txt. Rather, AI is my low-cost secretary.
I don't think you are seeing it very clearly then. Your secretary can also be a robot. What do you think an AI is if not a robot??
It doesn't "need" to look at robots.txt because nothing does.
Yeah, if people get too extensive about blocking, then we're going to end up with a scenario where the web-request functionality is implemented by telling the chatbot user's browser to make the fetch and submit the result back to the server for processing, making it largely indistinguishable from the user making the query themselves. If CORS gets in the way, they can just prompt users to install a browser extension to use the web-request functionality.
To follow onto this:
If what Perplexity is doing is illegal, is it illegal to run an open-source LLM on your own machine, and have it do the same thing? If so, how are ad blockers or Reader Modes or screen readers legal?
And if it's legal to run an open-source LLM on your own machine, is it legal to run an open-source LLM on a rented server (e.g. because you need more GPUs)? And if that's legal, why is it illegal to run a closed-source LLM on servers? Could Perplexity simply release the model weights and keep doing what they're doing?
You can poison all your images with Glaze and Nightshade. Then you don't have to stop them from using them - they have to stop themselves from using them or their image generator will be useless. I don't know if there's a comparable system for text. If there was, it would probably be noticeable to humans.
The other question is: once the user directs the AI to read the website (rather than it being crawled), is the site then fair game for training?
Let’s differentiate between:
1) a user-agent which makes an authenticated and authorized request for data, and delivers to the user
2) a user who then turns around and distributes the data or its derivatives to users in an unauthorized manner
A “dumber” example would be whether I can indefinitely cache and index most of the information from the Google Places API, as long as my users request each item at least once. Can I duplicate all that map or Street View photo information that Google paid cars to go around and photograph? Or how about the info that Google users entered as user-generated content?
THE REQUIREMENT TO OPEN SOURCE WEIGHTS
Legally, if I had a Creative Commons Share-Alike license on my data, and the LLM was trained on it and then served unlimited requests to others, without making the weights available…
…that would be almost exactly like if I had made my code available with Affero GPL license, someone would take my code but then incorporated it into a backend software hosting a social network or something, without making their own entire social network source code available. Technically this should be enforceable via a court order compelling the open sourcing to the public. (Alternatively, they’d have to pay damages in a class action lawsuit and stop using the tainted backend software or weights when serving all those people.)
TECHNICAL ANALYSIS
The key, as many here have missed, is authentication and authorization. You may have authorization to log in and view movies on Netflix. Not to rebroadcast them. Even the question of a VCR for personal use was debated in the past.
Distributing your scripts and software to process data is not the same as distributing arbitrary data the user agent found on the internet for which you don’t have a license.
If someone wrote an article, your reader may transform it based on your authenticated request, provided your user has an authorized subscription.
LEGAL ANALYSIS
Much of the content published on the Web isn’t secured with subscriptions and micropayments, which is why the whole thing becomes a legal battle as silly as “exceeding authorized access”, the charge used to prosecute someone like Aaron Swartz.
In other words, it is the question of “piracy”, which has acquired a new character only in that the AI is trained on your data and transforms it before it republishes it.
There was also a lawsuit about scraping LinkedIn, which was settled as follows: https://natlawreview.com/article/hiq-and-linkedin-reach-prop...
Legally, you can grant people access subject to a certain license (e.g. Creative Commons Share-Alike), and then any derived model must have its weights opened. Similar to, say, the Affero GPL for derivative software.