Those of us with websites used to have a clear reason to allow bots to crawl and index our sites: Google and everyone else sent us traffic. There was something of a trade-off. Google has been slowly changing that trade by displaying more and more of our sites on google.com rather than sending people our way.
As far as I can see, SearchGPT doesn't send people anywhere; it just gives answers. I can't see any reason to allow AI crawlers on my sites; all they do is crawl my site and slow things down. I'm glad that most of them seem to respect robots.txt.
https://github.com/ai-robots-txt/ai.robots.txt/blob/main/tab...
Some of them identify themselves by user agent but don't respect robots.txt, so you have to set up your server to 403 their requests to keep them out. If they start obfuscating their user agents, there won't be an easy solution beyond deferring to a platform like Cloudflare, which offers to play that cat-and-mouse game on your behalf.
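Roughly what that 403-by-user-agent blocking looks like, sketched here as a Python WSGI middleware rather than any particular server's config (the bot names are just an illustrative subset of that list, not exhaustive):

    # block_ai_bots.py - return 403 for requests whose User-Agent matches known AI crawlers
    BLOCKED_AGENTS = ("gptbot", "ccbot", "claudebot", "bytespider")  # illustrative subset

    def block_ai_bots(app):
        """Wrap a WSGI app so matching user agents get a 403 instead of the page."""
        def middleware(environ, start_response):
            ua = environ.get("HTTP_USER_AGENT", "").lower()
            if any(bot in ua for bot in BLOCKED_AGENTS):
                start_response("403 Forbidden", [("Content-Type", "text/plain")])
                return [b"Forbidden"]
            return app(environ, start_response)
        return middleware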
If I were making a search engine or AI crawler, I would simply pose as Googlebot
Google actually provides means of validating whether a request really came from them, so masquerading as Googlebot would probably backfire on you. I would expect the big CDNs to flag your IP address as malicious if you fail that check.
https://developers.google.com/search/docs/crawling-indexing/...
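The documented check is a reverse DNS lookup on the requesting IP followed by a forward lookup to confirm it. A minimal Python sketch of the idea (the accepted domains and error handling are simplified):

    # verify_googlebot.py - reverse/forward DNS check along the lines of Google's docs
    import socket

    def looks_like_googlebot(ip: str) -> bool:
        """True if the IP reverse-resolves to a Google hostname that resolves back to it."""
        try:
            hostname, _, _ = socket.gethostbyaddr(ip)           # reverse DNS lookup
            if not hostname.endswith((".googlebot.com", ".google.com")):
                return False
            return ip in socket.gethostbyname_ex(hostname)[2]   # forward-confirm the hostname
        except OSError:
            return False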
You could maybe still only follow robots.txt rules for Googlebot.
The entry here for Perplexity is the one that got a lot of attention but it's also unfair: PerplexityBot is their crawler, which uses that user agent and as far as anyone can tell it respects robots.txt.
They also have a feature that will, if a user pastes a URL into their chat, go fetch the data and do something with it in response to the user's query. This is the feature that made a big kerfuffle on HN a while back when someone noticed it [0].
That second feature is not a web crawler in any meaningful sense of the word "crawler". It looks up exactly one URL that the user asked for and does something with it. It's Perplexity acting as a User Agent in the original sense of the word: a user's agent for accessing and manipulating data on the open web.
If an AI agent manipulating a web page that I ask it to manipulate, in the way I ask it to manipulate it, is considered abusive, then so are ad blockers, reader mode, screen readers, Dark Reader, and anything else that gives me access to open web content in a form that the author didn't originally intend.
[0] https://news.ycombinator.com/item?id=40690898
No, that's illogical.
The action is indeed prompted by a human, but so is any crawl, in some way. At some point someone configured an interval or some other trigger that sends the script off to the web host to fetch anything it can find.
It's inherently different from extensions such as ad blockers that just remove elements according to configuration.
After all, the user's device will never even see the final DOM now. Instead it gets fetched, parsed, and processed on a third device, which is objectively a robot. You'd be able to make that argument only if it were implemented via an extension (the user's device fetches the page and posts the final document to the LLM for processing).
And that's ignoring the fact that ad blockers are seen as illegitimate by a lot of websites too, which often try to block access for people using those extensions.
I wrote a reply but you edited out the chunk of text that I quoted, so here's a new reply.
Sure, but why does it matter if the machine that I ask to fetch, parse, and process the DOM lives on my computer or on someone else's? I, the human being, will never see the DOM either way.
This distinction between my computer and a third-party computer quickly falls apart when you push at it.
If I issue a curl request from a server that I'm renting, is that a robot request? What about if I'm using Firefox on a remote desktop? What about if I self-host a client like Perplexity on a local server?
We live in an era where many developers run their IDE backend in the cloud. The line between "my device" and "cloud device" has blurred almost completely, so making it the line between "robot" and "not robot" is irrational in 2024.
The only definition of "robot" or "crawler" that makes any kind of sense is the one provided by robotstxt.org [0], and it unequivocally puts Perplexity on the "not robot" side:

> A robot is a program that automatically traverses the Web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced.
Or the MDN definition [1]:

> A web crawler is a program, often called a bot or robot, which systematically browses the Web to collect data from webpages.
Perplexity issues one web request per human interaction and does not fetch referenced pages. It cannot be considered a "crawler" by either of these definitions, and the definition you've come up with just doesn't work in the era of cloud software.
[0] https://www.robotstxt.org/faq/what.html
[1] https://developer.mozilla.org/en-US/docs/Glossary/Crawler
I clearly would not. "Slow down your site", as if the information is super useful when nobody can search it properly.
Sites might exist for reasons other than to "be useful". At a bare minimum, they may be trying to sell eyeballs to advertisers, but they also might be trying to deliver an experience, induce some deeper engagement, make a sale, build a community, whatever.
All of that disappears when a bot devours whatever it assesses to be your "content" and then serves it up as a QA response, stripped of any of the surrounding context.
Because reading nonsense inside an endless, debatable "context" is such fun. I know what you're talking about, and frankly I'm not impressed.
You know why people like these chat systems? Because it straight up saves time. When a system is made hard to index, "context dependent", and "a certain experience", it just begs to be summarized and turned into something you can actually use. Doing that interpretive work yourself is pointlessly difficult.
A good example: Discord. A vast number of communities are designed to be "experiences" where you have to pour hours of your time into adapting to their little fiefdoms if you want to obtain any useful information on a topic. Try doing this in any serious fashion and you will quickly be wasting more of your time than you want.
Yeah, so maybe chatgpt gives you the occasional incorrect fact. I haven't had that happen in any way, shape, or form. Furthermore: just be critical of your information. Not hard, and they are already working on fixing that.
Especially for people who are bona fide adults, time is worth more than "the pride of human work".
I'm confused how this can be your opinion while you're also spending time on this website responding to people.
Why are you not just asking chatgpt "what's the latest tech news"?
Could it be that there's something else you get from this site other than just its content being easily searchable in someone else's database?
I imagine that something else is conversation.
Note however that HN is not gatekeeping any useful information that may be produced during conversations here; in fact, it's all indexed and searchable.
Sure, and if a chatbot can helpfully summarize factual content being gatekept in a Discord chat, then that's fantastic, but I don't think that's quite what I'm getting at. The internet has room for more than just an infinite queue of fact-seekers interacting with a bank of fact-repositories. Some writing (e.g., poetry) is clearly art, and the people who create it are entitled to a bit of say over how that art is consumed and under what regimes it is summarized or remixed. Or at least we, as consumers, should have the discernment required to say "this isn't authentic, let me seek out the original instead."
I'm not normally a purist on these things, but I'm recalling musical artists who bemoaned the destruction of the album format in favour of $0.99/track sales in the early days of the iTunes store. Concept albums in the vein of Sgt. Pepper's still exist, of course, but almost every modern mainstream song is now prepared first and foremost to be listened to in isolation. I didn't care for those arguments at the time they were being made, but years later I can appreciate how something was lost there, and that it might have been appropriate to let artists specify that album X was to be sold only as an album.
If that's the case, I think it's fair to say that we can skip websites and just host a service that chatgpt can talk to. If you're a restaurant, users can actually order right from the chat, or by voice.
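Roughly what I have in mind, as a hypothetical sketch: a made-up place_order tool a chat agent could be handed, plus a stub handler behind it. None of the names or fields here come from any real integration:

    # Hypothetical ordering tool a restaurant might expose to a chat agent.
    # The schema follows the common JSON-Schema style used for tool/function calling.
    PLACE_ORDER_TOOL = {
        "name": "place_order",
        "description": "Place a pickup or delivery order with the restaurant.",
        "parameters": {
            "type": "object",
            "properties": {
                "items": {"type": "array", "items": {"type": "string"}},
                "phone": {"type": "string"},
                "delivery_address": {"type": "string"},
            },
            "required": ["items", "phone"],
        },
    }

    def place_order(items, phone, delivery_address=None):
        """Stub handler; a real restaurant would wire this to payments and the kitchen."""
        order_id = f"order-{abs(hash((tuple(items), phone))) % 10000}"
        return {"order_id": order_id, "status": "accepted"}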
So, instead of paying a 30% fee to DoorDash or Wolt for visibility, we start paying that fee to "some-AI-search-tool", and they won't allow you to sell food cheaper than what customers get by ordering through the AI search flow. I don't like this era.
That's assuming you don't have any customers. Existing customers will just ask for your restaurant by name. I guess the only way to attract new customers would be word of mouth, or some other form of advertising.
I am assuming that most customers come from delivery apps these days. People optimize for time.
Only some big, well-known restaurants are an exception, because people already know them and know to look for them. They are big enough to have an online shop for food delivery and some form of delivery system.
For other places, it is not like that. They need to be visible on the platforms that people use. If they are not there, people don't order take-away food from them, because people mainly use applications which provide a decent-enough catalog of available restaurants with a delivery option and an easy payment process.
The typical platform-monopoly problem exists because people tend to be lazy and it is hard to get visibility on traditional search engines: for normal users they are flooded with ads, and if you filter those out, with SEO spam that might not even be related to restaurants, and then finally there is the competition against other restaurants.
If you want ChatGPT to say nice things about you (or bad things about your competitors), then you'll need to give it your version of information - at least that will be the line peddled to us.
I've already received emails from SEO snake oil sellers now advertising themselves as being able to influence ChatGPT output.
Pedalled or peddled? :)
Fixed
Maybe this is an excellent time for prompt injection :)
So you get the information out to people without even having to serve traffic? Sounds like a win? At the end of the day, if they want to book/buy anything they’ll have to go to your site
I guess collaborative sites like forums are likely to suffer greatly from this shift.
Do you think HN will suffer greatly from AI bots digesting the site?
For a long time now, that reason has been to show ads, and the content quality was very, very low. It destroyed journalism; it destroyed pretty much everything. Some of the blame is on Google, but a lot of it is on the people with websites, IMHO.
People with websites can go back to the good old days when they made websites to show off their talents, persuade people into activism, spread ideas and seek interaction with likeminded humans.
LLMs have the problem of hallucinations, but once that's solved I won't be looking back. Hopefully, Google itself will be disrupted.
Maybe we can finally have a business model for high quality data, like journalists selling their work on current events without the need to present it in the most bombastic way possible?
I think the world currently is in strong need of a way to process large amounts of data streaming in and make sense of it. AI can be very useful for that. I would love to have access to an LLM that can keep up with current events so I don't have to endure all the nazi stuff on Twitter.
I don't want censorship; it's just that I would be perfectly happy to know that there are a bunch of people thinking that Biden was replaced, and some other people thinking that Michelle Obama is actually a man, without having to read it like 100 times a day when I'm trying to look at people's opinions on something. It's cool to know that there are such people (or bots?) out there, but I don't want to read their stuff. I want the computer to stay on top of all that and give me a brief, and then I can drill down if I want to know more or see the exact content.
LLM hallucinations can't be solved, because that is the whole LLM mechanism: all LLM results are hallucinations, it just happens that some of them are more true/useful than others.
If it doesn't send you to the sites, what's the difference from just using ChatGPT?
My impression from the demo is that it gives a Perplexity-like result, with the answer and references to where each part comes from.
Search is a familiar interface and chat is a bit janky sometimes.
Fundamentally the problem here is that in many (maybe about half of) search cases people are not ultimately interested in visiting a website, they are interested in a piece of information on the website. The website is essentially a paywall for that information. The website is a middleman, and middlemen can be easily disrupted by a better one.
So what we really need here is a new approach to funding the availability of information. Unfortunately, ads are fairly lucrative because advertisers are willing to pay a lot more than users are. You could, I guess, do something where SearchGPT pays a couple of cents out of a monthly fee to each information source it used. That's much harder with LLMs, since the source of information is potentially very diffuse and difficult to track. And even if you tracked it, each publisher would get such a tiny fraction of what they are making now.
But the difficult part for web publishers is that AI powered information retrieval is a significantly better user experience, which means it's very likely to win no matter what.
Slowly? Google gave up on quality search years ago.
To an extent, I actually like the trend as Joe Average User.
Most websites are just plain filthy and even dangerous today. I know I am not opening any link to a website I don't already know and trust unless it's in a Private window (fuck their cookies) with JavaShit more than likely blocked. If it's really shady I'll fire up an entire disposable VM for it first.
Google, Bing, et al. just putting the content right then and there saves me time and hassle from dealing with the ancillary garbage myself.
It's honestly a tragedy of the commons. Big Tech wants more traffic and wants to keep it; websites want more traffic and just throw out whatever literal shit they can muster (aka SEO).