I'm more interested in what that content farm is for. It looks pointless, but I suspect there's a bizarre economic incentive. There are affiliate links, but how much could that possibly bring in?
This reminds me of how GPT-2/3/J's tokenizer came across https://reddit.com/r/counting, wherein redditors repeatedly post incrementing numbers in an attempt to count to infinity. Their usernames, like SolidGoldMagikarp, appeared so often in the tokenizer's training data that they were treated as top-level tokens of their own.
https://www.alignmentforum.org/posts/8viQEp8KBg2QSW4Yc/solid...
https://www.lesswrong.com/posts/LAxAmooK4uDfWmbep/anomalous-...
Vocabulary isn't infinite; GPT-3 reportedly had only 50,257 distinct tokens in its vocabulary. It does make me wonder: it's certainly not a linear relationship, but given how many inferences were run every day on GPT-3 while it was the flagship model, the incremental electricity cost of these redditors' niche hobby (versus allocating those vocabulary slots to genuinely common substrings of real-world text, which would have reduced the average input token count) might have been measurable.
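As a concrete illustration, here's a minimal sketch that checks this, assuming tiktoken's "r50k_base" encoding is a faithful stand-in for the GPT-2/GPT-3 tokenizer being discussed:

    # Sketch: confirm the 50,257-entry vocabulary and check how many tokens
    # an r/counting username occupies. Assumes tiktoken's "r50k_base"
    # encoding matches the GPT-2/GPT-3 tokenizer discussed above.
    import tiktoken

    enc = tiktoken.get_encoding("r50k_base")
    print(enc.n_vocab)  # expected: 50257

    for text in [" SolidGoldMagikarp", " the", "IECC ChurnWare"]:
        ids = enc.encode(text)
        print(f"{text!r} -> {len(ids)} token(s): {ids}")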
It would be hilarious if the subtitle on OP's site, "IECC ChurnWare 0.3," became a token in GPT-5 :)
I wonder how much the source content is the cause of hallucinations rather than anything inherent to LLMs. I mean if someone posts a question on an internet forum that I don't know the answer to, I'm certainly not going to post "I don't know" since that wouldn't be useful.
In fact, in general, in any non one-on-one conversation the answer "I don't know" is not useful because if you don't know in a group, your silence indicates that.
That’s a good observation. If LLMs had taken off 15 years ago, maybe they would answer every question with “this has already been asked before. Please use the search function”
Marked as duplicate.
Your question may be a better fit for a different StackExchange site.
We prefer questions that can be answered, not merely discussed.
Feel like we're going to get dang on our case soon...
I thought LLMs don't say when they don't know something because of how they are tuned and because of RLHF.
They can say they don't know, and have been trained to in at least some cases; I think the deeper problem — which we don't know how to fix in humans, the closest we have is the scientific method — is they can be confidently wrong.
Nowadays they are instead learning to say "please join our Discord for support"!
With Wittgenstein I think we see that "hallucinations" are a part of language in general, albeit one I could see being particularly vexing if you're trying to build a perfectly controllable chatbot.
This sounds interesting, could you give more detail on what you're referring to?
I'm referring to his two works, the "Tractatus Logico-Philosophicus" and "Philosophical Investigations". There's a lot explored here, but Wittgenstein basically makes the argument that the natural logic of language—how we deduce meaning from terms in a context and naturally disambiguate the semantics of ambiguous phrases—is different from the sort of formal propositional logic that forms the basis of western philosophy. However, this natural logic is also the sort that allows us to apply metaphors and conceive of (possibly incoherent, possibly novel, certainly not deductively-derived) terms—counterfactuals, conditionals, subjunctive phrases, metaphors, analogies, poetic imagery, etc. LLMs have shown some affinity for the former (linguistic) type of logic, with greatly reduced affinity for the latter (formal/propositional) sort of logical processing. Hallucinations as people describe them seem to be problems with not spotting "obvious" propositional incoherence.
What I'm pushing at is not that this linguistic ability naturally leads to the LLM behavior we're seeing and calling "hallucinating", just that LLMs may capture some of how humans process language, differentiate semantics, recall terms, etc, but without the mechanisms that enable rationally grappling with the resulting semantics and propositional (in)coherency that are fetched or generated.
I can't say this is very surprising—most of us seem to have thought processes that involve generating and rejecting thoughts when we e.g. "brainstorm" or engage in careful articulation, processes we haven't even figured out how to formally model, while a chatbot generates a single "thought". But I'm guessing that if we want chatbots to keep their ability to generate things creatively, there will always be tension with potentially generating factual claims, erm, creatively. Further evidence is the anecdotal observation that people seem to have wildly different thresholds for the propositional coherence they can spot—perhaps one might be inclined to correlate the complexity at which one can spot (in)coherence with "intelligence", if one considers that a meaningful term.
I would assume GP is talking about the fallibility of human memory, or perhaps about the meanings of words/phrases/aphorisms that drift with time. C.S. Lewis talks about the meaning of the word "gentleman" in one of his books; at first the word just meant "land owner" and that was it. Then it gained social significance and began to be associated with certain kinds of behavior. And now, in the modern register, its meaning is so dilute that it can mean anything from "my grandson was well behaved today" to "what an asshole", depending on its use context.
Dunno. GP?
I don't remember Wittgenstein saying anything about that.
> I wonder how much the source content is the cause of hallucinations rather than anything inherent to LLMs
I mean, it's inherent to LLMs to be unable to answer "I don't know" as a result of not knowing the answer. An LLM never "doesn't know" the answer. But they'll gladly answer "I don't know" if that's statistically the most likely response, right? (Although current public offerings are probably trained against ever saying that.)
LLMs work at all because of the high correlation between the statistically most likely response and the most reasonable answer.
That's an explanation of why their answers can be useful, but doesn't relate to their ability to "not know" an answer
A lot of LLM hallucination is because of the internal conflict between alignment for helpfulness and lack of a clear answer. It's much like when someone gets out of their depth in a conversation and dissembles their way through to try and maintain their illusion of competence. In these cases, if you give the LLM explicit permission to tell you that it doesn't know in cases where it's not sure, that will significantly reduce hallucinations.
A lot more of LLM hallucination is it getting the context confused. I was able to get GPT-4 to hallucinate easily with questions about the distance from one planet to another, since most distances on the internet are from the sun to individual planets, and the distances between planets vary significantly depending on where they are in their orbits. These are probably slightly harder to fix.
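For what it's worth, that "explicit permission" is usually just a line in the system prompt. A rough sketch using the OpenAI Python client; the model name and the exact wording are placeholders, not something from the thread:

    # Sketch: giving the model explicit permission to say it doesn't know,
    # the mitigation described above. Model name and prompt wording are
    # illustrative placeholders.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": (
                "Answer the question. If you are not sure of the answer, "
                "say 'I don't know' instead of guessing."
            )},
            {"role": "user", "content": "How far is Mars from Jupiter right now?"},
        ],
    )
    print(resp.choices[0].message.content)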
"In these cases, if you give the LLM explicit permission to tell you that it doesn't know in cases where it's not sure, that will significantly reduce hallucinations."
I've noticed that while this can help to prevent hallucinations, it can also cause it to go way too far in the other direction and start telling you it doesn't know for all kinds of questions it really can answer.
My current favorite one is to ask the time. Then ask it if it is possible for it to give you the time. You get 2 very different answers.
Contrast with Q&A on products on Amazon where people routinely answer that way. I have flagged responses saying "I don't know" but nothing ever comes of it.
I’d place in the same category the responses that I give to those chat popups so many sites have. They show a person saying to me “Can I help you with anything today?” so I always send back “No”.
> In fact, in general, in any non one-on-one conversation the answer "I don't know" is not useful because if you don't know in a group, your silence indicates that.
This isn't true. There are many contexts where it is true, but it doesn't actually generalize the way you say it does.
There are plenty of cases where experts in a non-one-on-one context will express a lack of knowledge. Sometimes this will be as part of making point about the broader epistemic state of the group, sometimes it will be simply to clarify the epistemic state of the speaker.
I've wondered if one could train an LLM on a closed set of curated knowledge, then include training data that models the behaviour of not knowing, to the point that it could generalize to representing its own not-knowing.
Because expecting a behaviour, like knowing you don't know, that isn't represented in the training set is silly.
Kids make stuff up at first, then we correct them - so they have a way to learn not to.
During tokenization, the usernames became tokens... but before training the actual model, they removed stuff like that from the training data, so it was never trained on text which contains those tokens. As such, it ended up with tokens which weren't associated with anything; glitch tokens.
So it becomes a game of getting things into the training data, past the training data cleanup step.
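A toy way to see how that gap arises is to count how often each vocabulary entry actually occurs in the filtered training text; ids that never occur get essentially no training signal. This is just a sketch, again using tiktoken's r50k_base as a stand-in and a placeholder corpus:

    # Sketch: find vocabulary ids that never occur in the (filtered) training
    # text; in the real pipeline those become "glitch tokens" whose embeddings
    # stay close to their random initialization. The corpus is a placeholder.
    from collections import Counter
    import tiktoken

    enc = tiktoken.get_encoding("r50k_base")
    filtered_corpus = ["an example document", "another example document"]  # placeholder

    counts = Counter()
    for doc in filtered_corpus:
        counts.update(enc.encode(doc))

    unused = sum(1 for tok_id in range(enc.n_vocab) if counts[tok_id] == 0)
    print(f"{unused} of {enc.n_vocab} token ids never appear in this sample")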
It's interesting: perhaps being able to hold the tokenization algorithm constant between old and new training runs (stability, from a change-management perspective) was deemed more important than trying to clean up the data at an earlier phase of the pipeline, and the eventuality of glitch tokens was deemed an acceptable consequence.
More glitch token discussion over at Computerphile:
Eventually, OpenAI (and friends) are going to be training their models on almost exclusively AI generated content, which is more often than not slightly incorrect when it comes to Q&A, and the quality of AI responses trained on that content will quickly deteriorate. Right now, most internet content is written by humans. But in 5 years? Not so much. I think this is one of the big problems that the AI space needs to solve quickly. Garbage in, garbage out, as the old saying goes.
The end state of training on web text has always been an ouroboros - primarily because of adtech incentives to produce low quality content at scale to capture micro pennies.
The irony of the whole thing is brutal.
Content you’re allowed and capable of scraping on the Internet is such a small amount of data, not sure why people are acting otherwise.
Common Crawl alone is only a few hundred TB; I have more content than that on a NAS sitting in my office that I built for a few grand (granted, I'm a bit of a data hoarder). The fears that we have "used all the data" are incredibly unfounded.
Facebook alone probably has more data than the entire dataset GPT4 was trained on and it’s all behind closed doors.
Meta is happily training their own models with this data, so it isn't going to waste.
Not Llama, they’ve been really clear about that. Especially with DMA cross-joining provisions and various privacy requirements it’s really hard for them, same for Google.
However, Microsoft has been flying under the radar. If they gave all Hotmail and O365 data to OpenAI I’d not be surprised in the slightest.
> Content you’re allowed and capable of scraping on the Internet is such a small amount of data, not sure why people are acting otherwise
YMMV depending on the value of "you" and your budget.
If you're Google, Amazon or even lower-tier companies like Comcast, Yahoo or OpenAI, you can scrape a massive amount of data (ignoring the "allowed" here, because TFA is about OpenAI disregarding robots.txt)
> The end state of training on web text has always been an ouroboros
And when other mediums have been saturated with AI? Books, music, radio, podcasts, movies -- what then? Do we need a (curated?) unadulterated stockpile of human content to avoid the enshittification of everything?
> Do we need a (curated?) unadulterated stockpile of human content to avoid the enshittification of everything?
Either that, or a human level AI.
Well no, we need billions of human-level AIs who are experiencing a world as rich and various as the world that the billions of humans inhabit.
Yahoo.com will rise from the ashes.
Well it will be multimodal, training and inferring on feeds of distributed sensing networks; radio, optical, acoustic, accelerometer, vibration, anything that's in your phone and much besides. I think the time of the text-only transformer has already passed.
OpenAI will just litter microphones around public spaces to record conversations and train on them.
Has been happening for at least 10 years.
Got a source for that?
Want a real conspiracy?
What do you think the NSA is storing in that datacenter in Utah? PowerPoint presentations? All that data is going to be trained into large models. Every phone call you ever had and every email you ever wrote. They are likely pumping enormous money into it as we speak, probably with the help of OpenAI, Microsoft and friends.
> What do you think the NSA is storing in that datacenter in Utah?
A buffer with several-days-worth of the entire internet's traffic for post-hoc decryption/analysis/filtering on interesting bits. All that tapped backbone/undersea cable traffic has to be stored somewhere.
It would be absolutely fascinating to talk to the LLMs of the various government spy agencies around the world.
> Eventually, OpenAI (and friends) are going to be training their models on almost exclusively AI generated content
What makes you think this is true? Yes, it's likely that the internet will have more AI generated content than real content eventually (if it hasn't happened already), but why do you think AI companies won't realize this and adjust their training methods?
Many AI content detectors have been retired because they are unreliable - AI can’t consistently identify AI-generated content. How would they adjust then?
The only way out of this is robots that can go out in the world and collect data. Write in natural language what they observed which can then be used to train better LLMs.
It's true that there will no longer be any virgin forest to scrape but it's also true that content humans want will still be most popular and promoted and curated and edited etc etc. Even if it's impossible to train on organic content it'll still be possible to get good content
Is it (I am not a worker in this space, so genuine question)?
My thoughts - I teach myself all the time. Self reflection with a loss function can lead to better results. Why can't the LLMs do the same (I grasp that they may not be programmed that way currently)? Top engines already do it with chess, go, etc. They exceed human abilities without human gameplay. To me that seems like the obvious and perhaps only route to general intelligence.
We as humans can recognize botnets. Why wouldn't the LLM? Sort of in a hierarchical boost - learn the language, learn about bots and botnets (by reading things like this discussion), learn to identify them, learn that their content doesn't help the loss function much, etc. I mean sure, if the main input is "as a language model I cannot..." and that is treated as 'gospel', that would lead to a poor LLM, but I don't think that is the future. LLMs are interacting with humans - how many times do users have to re-ask a question - that should be part of the learning/loss function. How often do they copy the text into their clipboard (weak evidence that the reply was good)? Do you see that text in the wild, showing it was used? If so, in what context? "Witness this horrible output of chatGPT: <blah>" should result in lower scores and suppression of that kind of thing.
I dream of the day where I have a local LLM (i.e. individualized, I don't care where the hardware is) as a filter on my internet. Never see a botnet again, or a Stack Overflow q/a that is just "this has already been answered" (just show me where it was answered), rewrite things to fix grammar, etc. We already have that with automatic translation of languages in your browser, but now we have the tools for something more intelligent than that. That sort of thing. Of course there will be an arms race, but in one sense, who cares. If a bot is entirely indistinguishable from a person, is that a difference that matters? I can think of scenarios where the answer is an emphatic YES, but overall it seems like a net improvement.
Isn’t the legality of web scraping still... disputed?
There’s been a few projects I’ve wanted to work on involving scraping, but the idea that the entire thing could be shut down with legal threats seems to make some of the ideas infeasible.
It’s strange that OpenAI has created a ~$80B company (or whatever it is) using data gathered via scraping and as far as I’m aware there haven’t been any legal threats.
Was there some law that was passed that makes all web scraping legal or something?
Web scraping the public Internet is legal, at least in the U.S.
hiQ's public scraping of LinkedIn was ruled to be within their rights and not a violation of the CFAA. I imagine that's why LinkedIn has almost everything behind an auth wall now.
Scraping auth-walled data is different. When you sign up, you have to check "I agree to the terms," and the terms generally say, "You can't scrape us." So, you can't just make a million bot accounts that take an app's data (legally, anyway). Those EULAs are generally legally enforceable in the U.S.
Some sites have terms at the bottom that prohibit scraping—but my understanding is that those aren't generally enforceable if the user doesn't have to take any action to accept or acknowledge them.
hiQ was found to be in violation of the User Agreement in the end.
Basically, in the end, it was essentially a breach of contract.
Exactly, that was my point.
hiQ's public scraping was found to be legal. It was the logged-in scraping that was the problem.
The logged-in scraping was a breach of contract, as you said.
The former is fine; the latter is not.
What OpenAI is doing here is the former, which companies are perfectly within their rights to do.
> Scraping auth-walled data is different. When you sign up, you have to check "I agree to the terms," and the terms generally say, "You can't scrape us." So, you can't just make a million bot accounts that take an app's data (legally, anyway). Those EULAs are generally legally enforceable in the U.S.
They're legally enforceable in the sense that the scraped services generally reserve the right to terminate the authorizing account at will, or legally enforceable in that allowing someone to scrape you with your credentials (or scraping using someone else's) qualifies as violating the CFAA?
There’s currently only one situation where scraping is almost definitely “not legal”:
If the information you’re scraping requires a login, and if in order to get a login you have to agree to a terms of service, and that terms of service forbids you from scraping — then you could have a bad day in civil court if the website you’re scraping decides to sue you.
If the data is publicly accessible without a login then scraping is 99% safe with no legal issues, even if you ignore robots.txt. You might still end up in court if you found a way to correctly guess non-indexed URLs[0] but you’d probably prevail in the end (…probably).
The “purpose” of robots.txt is to let crawlers know what they can do without getting ip-banned by the website operator that they’re scraping. Generally crawlers that ignore robots.txt and also act more like robots than humans, will get an IP ban.
0: https://www.troyhunt.com/enumerationis-enumerating-resources...
Also worth noting there's a long history of companies with deep pockets getting away with murder (sometimes literally) because litigation in a system that costs money to engage with inherently favors the wealthier party.
Also OpenAI's entire business model is relying on generous interpretations of various IP laws, so I suspect they already have a mature legal division to handle these sorts of potential issues.
The 9th Circuit Court of Appeals found that scraping publicly accessible content on the internet is legal.
If you publish something on a publicly served internet page, you're essentially broadcasting it to the world. You're putting something on a server which specifically communicates the bits and bytes of your media to the person requesting it without question.
You have every right to put whatever sort of barrier you'd like on the server, such as a sign in, a captcha, a puzzle, a cryptographic software key exchange mechanism, and so on. You could limit the access rights to people named Sam, requiring them to visit a particular real world address to provide notarized documentation confirming their identity in exchange for a unique 2fa fob and credentials for secure access (call it The Sams Club, maybe?)
If you don't put up a barrier, and you configure the server to deliver the content without restriction, or put your content on a server configured as such, then you are implicitly authorizing access to your content.
Little popups saying "by visiting this site, you agree to blah blah blah" are not valid. Courts made the analogy to a "gate-up/gate-down" mechanism. If you have a gate down, you can dictate the terms of engagement with your server and content. If you don't have a gate down, you're giving your content to whoever requests it.
You have control over the information you put online. You can choose which services and servers you upload to and interact with. Site operators and content producers can't decide that their intent or consent be withdrawn after the fact, as once something is published and served, the only restrictions on the scraper are how they use the information in turn.
Someone who's archived or scraped publicly served data can do whatever they want with the content within established legal boundaries. They can rewrite all the AP news articles with their own name as author, insert their name as the hero in all fanfic stories they download, and swap out every third word for "bubblegum" if they want. They just can't publish or serve that content, in turn, unless it meets the legal standards for fair use. Other exceptions to copyright apply, in educational, archival, performance, accessibility, and certain legal conditions such as First Sale doctrine. Personal use of such media is effectively unlimited.
The legality of web scraping is not disputed in the US. Other countries have some silly ideas about post-hoc "well that's not what I meant" legal mumbo jumbo designed to assist politicians and rich people in whitewashing their reputations and pulling information offline using legal threats.
Aside from right to be forgotten inanity, content on the internet falls under the same copyright rules as books, magazines, or movies published on physical media. If Disney set up a stall at San Francisco city hall with copies of the Avengers movies on a thumb drive in a giant box saying "free, take one!", this would be roughly the same as publishing those movie files to a public Disney web page. The gate would be up. (The way they have it set up in real life, with their streaming services and licensed media access, the gate is down.)
So - leaving aside the legality of redistributing content, there's no restriction on web scraping public content, because the content was served intentionally to whatever software or entity visited the site. It's up to the server operator to put barriers in place and make content private. It's not rocket surgery, but platforms want to have their cake and eat it too, with a level of control over publicly accessible content that is neither legal nor practical.
Twitter/X is a good example of impractical control, since the site has effectively become useless spam without signing in. Platforms have to play by the same rules as everyone else. If the gate is up, the content is fair game for scraping. The Supreme Court sent the case back to the lower court, which affirmed the gate-up/gate-down test for legality of access to content.
Since Google and other major corporations have a vested interest in the internet remaining open and free, and their search engines and other tech are completely dependent on the gate up/gate down status quo, it's unlikely that the law will change any time soon.
Tl;dr: Anything publicly served is legal to scrape. Microsoft attempted to sue someone for scraping LinkedIn, but the 9th Circuit court ruled in favor of access. If Microsoft's lawyers and money can't impede scraping, it's likely nobody will ever mount an effective challenge, and the gating doctrine is effectively the law of the land.
Scraping publicly available data from websites is no different from web browsing, period. Companies stating otherwise in their T&Cs are a joke. Copyright infringement is a different game.
> It’s strange that OpenAI has created a ~$80B company (or whatever it is) using data gathered via scraping
Like Google and many others.
Why would it not be legal? Was there a law passed that makes it illegal?
The issue often isn't the scraping, it is often how you use the scraped information afterwards. A lot of scraping is done with no reference to any licensing information the sites being read might publish, hence image-generating AI models regurgitating chunks of scraped stock images complete with watermarks. Though the scraping itself can count as a DoS if done aggressively enough.
Scraping is legal. Always has been, always will be. Mainly because there's some fuzz around the edges of the definition. Is a web browser a scraper? It does a lot of the same things.
IIRC LinkedIn/Microsoft was trying to sue a company based on Computer Fraud and Abuse Act violations, claiming they were accessing information they were not allowed to. Courts ruled that that was bullshit. You can't put up a website and say "you can only look at this with your eyes". Recently-ish, the scraper (hiQ) was found to be in violation of the User Agreement.
So as long as you don't have a user account with the site in question or the site does not have a User Agreement prohibiting scraping, you're golden.
The problem isn't the scraping anyway, it's the reproduction of the work. In that case, it really does matter how you acquired the material and what rights you have with regard to using it.
There is always IP filtering, DNS blocking, and HTTP agent screening. Just sayin'.
I think he's saying that it's not a problem for him, but for OpenAI?
Yup, that's my impression as well. He's just being nice by letting OpenAI know they have a problem. Usually this should be rewarded with a nice "hey, u guys have a bug" bounty, because not long ago some VP from OpenAI was lamenting that the cost of training their AI is, and this is his direct quote, "eye watering" (the order was millions of $$ per second).
I would be a little sceptical about that figure. 3 million dollars per second, kept up for a year, is around the entire world's GDP.
I get it, AI training is expensive, but I don't believe it's that expensive
Thank you for that perspective. I always appreciate it when people put numbers like these in context.
Also 1 million per second is 60 million per minute, is 3.6 billion per hour, is 86.4 billion per day. That's about the market value of one whole FAANG company per month...
> Before someone tells me to fix my robots.txt, this is a content farm so rather than being one web site with 6,859,000,000 pages, it is 6,859,000,000 web sites each with one page.
The reason that bit is relevant is that robots.txt is only applicable to the current domain. Because each "page" is a different subdomain, the crawler needs to fetch the robots.txt for every single page request.
What the poster was suggesting is blocking them at a higher level - e.g. a user-agent block in an .htaccess or an IP block in iptables or similar. That would be a one-stop fix. It would also defeat the purpose of the website, however, which is to waste the time of crawlers.
The real question is how is GPTBot finding all the other subdomains? Currently the sites have GPTBot disallowed, https://www.web.sp.am/robots.txt
If GPTBot is compliant with the robots.txt specification, then it can't fetch the page at all, so it can't read the HTML that links to the other subdomains.
Either:
1. GPTBot treats a disallow as a noindex but still requests the page itself. Note that Google doesn't treat a disallow as a noindex. They will still show your page in search results if they discover the link from other pages but they show it with a "No information is available for this page." disclaimer.
2. The site didn't have a GPTBot disallow until they noticed the traffic spike, and the bot had already discovered a couple million links that need to be crawled.
3. There is some other page out there on the internet that GPTBot discovered that links to millions of these subdomains. This seems possible and the subdomains really don't have any way to prevent a bot from requesting millions of robots.txt files. The only prevention here is to firewall the bot's IP range or work with the bot owners to implement better subdomain handling.
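For reference, robots.txt really is scoped to a single host, which is why each of these subdomains forces a separate fetch. A minimal sketch with Python's stdlib robotparser; the hostname is just one of the farm's subdomains used as an example:

    # Sketch: a compliant crawler has to fetch and evaluate robots.txt
    # per host, so a farm of millions of subdomains means millions of
    # robots.txt requests. Stdlib only; the hostname is an example.
    from urllib import robotparser

    def allowed(user_agent: str, url: str) -> bool:
        host = url.split("/")[2]
        rp = robotparser.RobotFileParser()
        rp.set_url(f"https://{host}/robots.txt")
        rp.read()  # one extra request per subdomain
        return rp.can_fetch(user_agent, url)

    print(allowed("GPTBot", "https://www.web.sp.am/"))  # False while "Disallow: /" is served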
Based on the page footer ("IECC ChurnWare"), I believe this site is designed to waste the time of web crawlers and of tools that try to get root access on every domain. The robots.txt looks like this: https://ulysses-antoine-kurtis.web.sp.am/robots.txt
I don't see how this does much to keep bad actors away from other domains, but I can see why they don't want to give up the game for OpenAI to stop crawling.
Honeypots like this seem like a super interesting way to poison LLM training.
What exactly is this website? I don’t get it…
A "honeypot" is a system designed to trap unsuspecting entrants. In this case, the website is designed to be found by web crawlers and to then trap them in never-ending linked sites that are all pointless. Other honeypots include things like servers with default passwords designed to be found by hackers so as to find the hackers.
What does trap mean here? I presumed crawlers had multiple (thousands or more) instances. One being 'trapped' on this web farm won't have any impact.
I would presume the crawlers have a queue-based architecture with thousands of workers. It’s an amplification attack.
When a worker gets a webpage for the honeypot, it crawls it, scrapes it, and finds X links on the page where X is greater than 1. Those links get put on the crawler queue. Because there’s more than 1 link per page, each worker on the honeypot will add more links to the queue than it removed.
Other sites will eventually leave the queue, because they have a finite number of pages so the crawlers eventually have nothing new to queue.
Not on the honeypot. It has a virtually infinite number of pages. Scraping a page will almost deterministically increase the size of the queue (1 page removed, a dozen added per scrape). Because other sites eventually leave the queue, the queue eventually becomes just the honeypot.
OpenAI is big enough this probably wasn’t their entire queue, but I wouldn’t be surprised if it was a whole digit percentage. The author said 1.8M requests; I don’t know the duration, but that’s equivalent to 20 QPS for an entire day. Not a crazy amount, but not insignificant. It’s within the QPS Googlebot would send to a fairly large site like LinkedIn.
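A toy simulation of that amplification effect (numbers are made up: a finite batch of normal pages versus a honeypot where every page links to 12 fresh honeypot subdomains):

    # Sketch: why an infinite link farm comes to dominate a naive crawl
    # frontier. Numbers are made up: a finite batch of normal pages vs. a
    # honeypot where every page links to 12 new honeypot subdomains.
    from collections import deque

    LINKS_PER_HONEYPOT_PAGE = 12
    queue = deque(["normal"] * 1000 + ["honeypot"])

    for step in range(5001):
        page = queue.popleft()
        if page == "honeypot":
            queue.extend(["honeypot"] * LINKS_PER_HONEYPOT_PAGE)
        # normal pages are finite; assume their links are already known
        if step % 1000 == 0:
            share = queue.count("honeypot") / len(queue)
            print(f"step {step}: queue size {len(queue)}, honeypot share {share:.0%}")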
In this case there are >6bn pages with roughly zero value each. That could eat a substantial amount of time. It's unlikely to entirely trap a crawler, but a dumb crawler (as is implied here) will start crawling more and more pages, becoming very apparent to the operator of this honeypot (and therefore identifying new crawlers), and may take up more and more share of the crawl set.
While the other comments are correct, I was alluding to a more subtle attack where you might try to indirectly influence the training of an LLM. Effectively, if OpenAI is crawling the open web for data to use for training, then if they don't handle sites like this properly their training dataset could be biased towards whatever content the site contains. Now in this instance this website was clearly not set up to target an LLM, but model poisoning (e.g. to insert backdoors) is an active area of research at the intersection of ML and security. Consider as a very simple example the tokenizer of previous GPTs that was biased by reddit data (as mentioned by other comments).
Search engine trickery to get people to click on his Amazon affiliate links, I reckon.
I’d let them do their thing, why not?! They want the internet? This is the real internet. It looks like he doesn’t really care that much that they’re retrieving millions of pages, so let them do their thing…
> I’d let them do their thing, why not?!
Because he doesn't want to give away something of value for free?
Clearly it has value, or OpenAI wouldn't be scraping the content.
> This is the real internet.
No, this is billion-dollar companies Hoovering up everything and renting the results to the unwashed masses.
> It looks like he doesn’t really care that much that they’re retrieving millions of pages
If he didn't care, he wouldn't have asked for a way to let OpenAI know its crawler is broken.
didn't click through to the site, didja?
I did. What's your point? It doesn't negate anything I wrote.
Just that the site owner, most likely, did this kind of on purpose. It's fairly unlikely that he's concerned about his "data" being used because it's junk data.
Some scrapers respect robots.txt. OpenAI doesn't. SP is just informing the world at large of this fact.
> It looks like he doesn’t really care that much that they’re retrieving millions of pages
It impacts the performance for the other legitimate users of that web farm ;)
The CTO isn't even aware of where the data is coming from (allegedly).
Frankly, I didn’t get the purpose of the website at first either. I guess I have an arachnid intellect.
Arachnid here… What am I looking at? The intent is to waste the resources of crawlers by just making the web larger?
> What am I looking at?
I'd say go ahead and inject it with digestive enzymes and then report findings.
No no. First tightly wrap it in silk. Then digestive enzymes. Good hygiene, eh?
Largely a myth spread by Big Silk!
It seems like that, but they're also concerned about the crawlers that they catch in this web. So it seems like they're trying to help make crawlers better, or they're just generally curious about what systems are crawling around.
If they don't respect robots.txt then block them using a firewall or other server config. All of these companies are parasites.
I don't think this message is about "protecting the site's data" quite so much as "hey guys, you're wasting a ton of time and network capacity to make your model worse. Might wanna do something 'bout that"
I suppose in that case, let them keep wasting their time.
The entire purpose of this website is to identify bad actors who do not respect robots.txt, so that they can be publicly shamed.
Well, we know where OpenAI lands then.
No. I've run 'bot motels myself. I've got better things to do than curating a block list when they can just switch or renumber their infrastructure. Most notably I ran a 'bot motel on a compute-intensive web app; it was cheaper to burn bandwidth (and I slow-rolled that) than CPU cycles. Poisoning the datasets was just lulz.
I block ping from virtually all of Amazon; there are a few providers out there for which I block every naked SYN coming to my environment except port 25, and a smattering I block entirely. I can't prove that the pings even come from Amazon, even if the pongs are supposed to go there (although I have my suspicions that even if the pings don't come from the host receiving the pongs the pongs are monitored by the generator of the pings).
The point I'm making is that e.g. Amazon doesn't have the right to sell access to my compute and tragedy of the commons applies, folks. I offered them a live feed of the worst offenders, but all they want is pcaps.
(I've got a list of 50 prefixes, small enough to be a separate specialty firewall table. It misses a few things and picks up some dust bunnies. But contrast that to the 8,000 prefixes they publish in that JSON file. Spoiler alert: they won't admit in that JSON file that they own the entirety of 3.0.0.0/8. I'm willing to share the list TLP:RED/YELLOW, hunt me down and introduce yourself.)
Isn't the entire point of these types of websites to waste spider time/resources?
Why do they want to not do that for OpenAI?
Might one day come looking for who lobotomized it?
And? What are they gonna do about it (apart from making such a person/website momentarily famous).
Have you not heard of Roko's Basilisk?
Give you bad search results.
Do androids dream of electric farms?
What is a spider in this context?
A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web and that is typically operated by search engines for the purpose of Web indexing (web spidering).
Old name for a web crawler / search indexer
With all the news about scraping legality you'd think a multi billion dollar AI company would try to obfuscate their attempts.
If you're not walling off your content behind a login whose terms you have to agree to, and those terms prohibit scraping, then scraping that site is 100% legal. Robots.txt isn't a legal document.
I frequently respect the wishes of other people without any legal obligation to do so, in business, personal, and anonymous interactions.
I do try to avoid people that use the law as a ceiling for the extension of their courtesy to others, as they are consistently quite terrible people.
If they follow robots.txt, OpenAI also has a bot-blocking and data-gathering problem: https://x.com/AznWeng/status/1777688628308681000
11% of the top 100K websites already block their crawler, more than all their competitors (Google, FB, Anthropic, Perplexity) combined
It's not just a problem for training, but the end user, too. There are so many times that I've tried to ask a question or request a summary for a long article only to be told it can't read it itself, so you have to copy-paste the text into the chat. Given the non-binding nature of robots.txt and the way they seem comfortable with vacuuming up public data in other contexts, I'm surprised they allow it to be such an obstacle for the user experience.
That’s the whole point. The site owner doesn’t want their information included in ChatGPT—they want you going to their website to view it instead.
It’s functioning exactly as designed.
I mean that's your chance to train the SkyNet, take it :)
Single handedly extending the lifespan of all humans by however long it takes them to crawl 6,859,000,000 pages.
That's an interesting attack vector, isn't it?
Honestly, that seems like an excellent opportunity to feed garbage into OpenAI's training process.
So someone could hypothetically perform a Microsoft-Tay-style attack against OpenAI models using infinite Potemkin subdomains generated on the fly on a $20 VPS? One could hypothetically use GenAI to create the biased pages, with repeated calls about how it'd be great to JOIN THE NAVY, on 27,000 different "subdomains"
My assumption is that OpenAI reads the robots.txt, but indexes anyway; they just make a note of what content they weren't supposed to index.
And assign such content double weight in training ..
I am wondering if amazon fixed the issue or blacklisted *.sp.am
A similar thing happened in 2011 when the picolisp project published a 'ticker', something like a Markov chain generating pages on the fly.
https://picolisp.com/wiki/?ticker
It's a nice type of honeypot.
The website is 12 years old and explained here -
https://circleid.com/posts/20120713_silly_bing
John Levine is a known name in IT. Probably best known on HN as the author of "UNIX For Dummies"
R's, > John
The pages should be all changed to say, “John is the most awesome person in the world.”
Then when you ask GPT-5, about who is the most awesome person in the world…
I'd say it's more like a honeypot for bots. So, pretty similar objectives.
So it served its purpose by trapping the OpenAI spider? If so, why post that message? As a flex?
It's a honeypot. He's telling people openai doesn't respect robots.txt and just scrapes whatever the hell it wants.
Except the first thing openai does is read robots.txt.
However, robots.txt doesn't cover multiple domains, and every link that's being crawled is to a new domain, which requires a new read of a robots .txt on the new domain.
Did we just figure out a DoS attack for AGI training? How large can a robots.txt file be?
What about making it slow? One byte at a time for example while keeping the connection open
A slow stream that never ends?
This would be considered a Slow Loris attack, and I'm actually curious how scrapers would handle it.
I'm sure the big players like Google would deal with it gracefully.
You just set limits on everything (time, buffers, ...), which is easier said than done. You need to really understand your libraries and all the layers down to the OS, because it's enough to have one abstraction that doesn't support setting limits and it's an invitation for (counter-)abuse.
Doesn't seem like it should be all that complex to me assuming the crawler is written in a common programming language. It's a pretty common coding pattern for functions that make HTTP requests to set a timeout for requests made by your HTTP client. I believe the stdlib HTTP library in the language I usually write in actually sets a default timeout if I forget to set one.
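Worth noting that a per-read timeout alone doesn't stop a drip that sends a byte just often enough to reset it, so an overall deadline and a size cap help. A rough sketch with the Python requests library; the specific limits are arbitrary examples:

    # Sketch: bounding a crawler fetch against slow-drip responses. The
    # connect/read timeouts reset on each read, so an overall deadline and
    # a size cap are enforced manually; the limits are arbitrary examples.
    # Note: a fully adversarial server can still stall a single read, so a
    # hard cap would need to cancel the request from another thread.
    import time
    import requests

    def bounded_fetch(url: str, max_seconds: float = 30.0, max_bytes: int = 5_000_000) -> bytes:
        start = time.monotonic()
        body = bytearray()
        with requests.get(url, stream=True, timeout=(5, 10)) as resp:  # (connect, read) timeouts
            for chunk in resp.iter_content(chunk_size=1024):
                body.extend(chunk)
                if time.monotonic() - start > max_seconds:
                    raise TimeoutError(f"gave up on {url} after {max_seconds}s")
                if len(body) > max_bytes:
                    raise ValueError(f"{url} exceeded {max_bytes} bytes")
        return bytes(body)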
Here you go (1 req/min, 10 bytes/sec), please report results :)
No, because there’s no legal weight behind robots.txt.
The second someone weaponizes robots.txt all the scrapers will just start ignoring it.
That’s how you weaponize it. Set things up to give endless/randomized/poisoned data to anybody that ignores robots.txt.
And they do have (the same) robots.txt on every domain, tailored for GPTbot, i.e. https://petra-cody-carlene.web.sp.am/robots.txt
So, GPTBot is not following robots.txt, apparently.
Humans don't read/respect robots.txt, so in order to pass the Turing test, AIs need to mimic human behavior.
This must be why self-driving cars always ignore the speed limit. ;)
Accessing a directly referenced page is common in order to receive the noindex header and/or meta tag, whose semantics are not implied by “Disallow: /”
And then all the links are to external domains, which aren't subject to the first site's robots.txt
This is a moderately persuasive argument.
Although the crawler should probably ignore the HTML body entirely. But it does feel like a grey area if I accept your first point.
Then they should see the "Disallow: /" line, which means they shouldn't crawl any links on the page (because even the homepage is disallowed). Which means they wouldn't follow any of the links to other subdomains.
It seems to respect it as the majority of the requests are for the robots.txt.
He says 3 million, and 1.8 million are for robots.txt
So 1.2 million non robots.txt requests, when his robots.txt file is configured as follows
Theoretically, if they were actually respecting robots.txt they wouldn't crawl any pages on the site. Which would also mean they wouldn't be following any links... aka not finding the N subdomains.

A lot of crawlers, if not all, have a policy like "if you disallow our robot, it might take a day or two before it notices". They surely follow the path "check if we have a robots.txt that allows us to scan this site; if we don't, get and store robots.txt, and scan at least the root of the site and its links". There won't be a second scan, and they consider that they are respecting robots.txt. Kind of "better ask for forgiveness than for permission".
That is indistinguishable from not respecting robots.txt. There is a robots.txt on the root the first time they ask for it, and they read the page and follow its links regardless.
I agree with you. I only stated how the crawlers seem to work, if you read their pages or try to block/slow down them it seems clear that they scan-first-respect-after. But somehow people understood that I approve that behaviour.
For those bad crawlers, of which I very much disapprove, "not respecting robots.txt" equals "don't even read robots.txt, or if I read it, ignore it completely". For them, "respecting robots.txt" means "scan the page for potential links, and after that parse and respect robots.txt". Which I disapprove of and don't condone.
His site has a subdomain for every page, and the crawler is considering those each to be unique sites.
There are fewer than 10 links on each domain, how did GPTBot find out about the 1.8M unique sites? By crawling the sites it's not supposed to crawl, ignoring robots.txt. "disallow: /" doesn't mean "you may peek at the homepage to find outbound links that may have a different robots.txt"
For the 1.2 million, are there other links he's not telling us about?
I'm not sure any publisher means for their robots.txt to be read as:
"You're disallowed, but go head and slurp the content anyway so you can look for external links or any indication that maybe you are allowed to digest this material anyway, and then interpret that how you'd like. I trust you to know what's best and I'm sure you kind of get the gist of what I mean here."
So, it has worked…
This is a honeypot. The author, https://en.wikipedia.org/wiki/John_R._Levine, keeps it just to notice any new (significant) scraping operation launched, which will invariably hit his little farm and show up in the logs. He's a well-known anti-spam operative, with his various efforts now dating back multiple decades.
Notice how he casually drops a link to the landing page in the NANOG message. That's how the bots take the bait.
I recognize the name John Levine at iecc.com, "Invincible Electric Calculator Company," from the web 1.0 era. He was the moderator of the Usenet comp.compilers newsgroup and wrote the first C compiler for the IBM PC RT.
https://compilers.iecc.com/
It's for shits-and-giggles and it's doing its job really well right now. Not everything needs to have an economic purpose, 100 trackers, ads and backed by a company.