Affected companies are becoming increasingly frustrated with the army of AI crawlers out there, since they won't stick to any scraping best practices (respecting robots.txt, using public APIs, avoiding peak-load times). It's not necessarily about copyright: the heavy scraping traffic also drives up infrastructure costs.
What's the endgame here? AI can already solve captchas, so the arms race for bot protection is pretty much lost.
Require login, then verify the user account is associated with an email address at least 10 yrs old. Pretty much eliminates bots. Eliminates a few real users too, but not many.
this is not a solution if you want a public internet (and sites that don't care about the public internet already don't have a problem)
for read-only content, I just stick it behind a cache and let the bots go wild.
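A minimal in-process sketch of the idea; in practice this is usually a CDN or reverse-proxy cache, but the effect is the same: repeated bot hits never reach the backend. Flask-style, with names and TTL purely illustrative.

```python
# Sketch: serve read-only pages from a simple in-memory cache keyed by path,
# so repeated requests (human or bot) skip the expensive backend work.
import time
from flask import Flask

app = Flask(__name__)
CACHE = {}          # path -> (expires_at, body)
TTL_SECONDS = 300   # how long a rendered page stays cached (illustrative)

def render_page(path: str) -> str:
    # Placeholder for the expensive render / database query.
    return f"<html><body>Content for {path}</body></html>"

@app.route("/<path:path>")
def cached_page(path):
    now = time.time()
    hit = CACHE.get(path)
    if hit and hit[0] > now:
        return hit[1]                       # cache hit: no backend work
    body = render_page(path)                # cache miss: render once
    CACHE[path] = (now + TTL_SECONDS, body)
    return body
```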
I presume OSM has already considered this and ruled it out (probably because the map should be dynamic)
You can't cache this stuff for bot consumption. Humans only want to see the popular stuff. Bots download everything. The size of your cache then equals the size of your content database.
That’s just passing the buck.
Someone still needs to pay for that traffic. If it gets too much for Cloudflare or whoever, you're gonna get the bill.
This is about OpenStreetMap, so you are proposing that my minor daughter not be allowed to read a map?
I must be an outlier here, but I don't keep email addresses that long. After a couple years they're on too many spam lists. I'll wind those addresses down, use them for a couple years only for short interactions that I expect spam from, and ultimately close them down completely the next cycle.
At best any email I have is 4 or 5 years old.
Seems to me eventually we might hit a point where stuff like API access is whitelisted. You will have to build a real relationship with a real human at the company to validate you aren't a bot. This might include an in-person meeting, as anything else could be spoofed. Back to the 1960s business world we go. Thanks, technologists, for pulling the rug out from under us all.
Scraping often uses the same APIs that the website itself does, so to make that work a lot of sites will have to put their content around authentication of some sort.
For example, I have a project that crawls the SCP Wiki (following best practices, rate limiting, etc.). If they were to restrict the API that I use, it would break the website for people. So if they want to limit access, they have no choice but to put it behind some set of credentials that they could trace back to a user, eliminating the public site itself. For a lot of sites that's just not reasonable.
You can't whitelist and also have a consumer-facing service. There is no reliable way to differentiate between a legitimate user and the AI company's scraper.
Scraping implies they're not using an API - they're accessing the site as a user agent. And whitelisting access to the actual web pages isn't a tenable option for many websites. Humans generally hate being forced to sign up for an account before they can see a page they found in a Google search.
I could definitely see this. I worked for a company that had a few popular free inspector tools on their website. The constant traffic load of bots was nuts.
Invite-only authenticated islands based on trust. Which seems like the end result of the rampant centralization of the internet.
The open web is on a crash course. I don't necessarily believe in copyright claims, but I think it makes sense to aggressively prosecute scrapers for DDOSing.
This would already be happening if we could track them.
Bars and coffee shops?
Many would oppose the idea, but if a service (e.g. eBay, LinkedIn, Facebook) were to dump a snapshot of its data to S3 every month, that could be a solution. You can't prevent scraping anyway.
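A rough sketch of such a monthly dump job, assuming boto3 (the real AWS SDK), valid credentials, and a stubbed-out export step; bucket and key names are made up.

```python
# Sketch: export a snapshot of the public dataset and publish it to S3
# once a month, so consumers can download it instead of scraping.
import datetime
import tarfile
import boto3

def export_snapshot(path: str = "/tmp/dump.tar.gz") -> str:
    # Placeholder: in reality this would serialize the site's public data.
    with tarfile.open(path, "w:gz"):
        pass
    return path

def publish_monthly_dump(bucket: str = "example-public-dumps") -> None:
    stamp = datetime.date.today().strftime("%Y-%m")
    key = f"snapshots/{stamp}/dump.tar.gz"
    boto3.client("s3").upload_file(export_snapshot(), bucket, key)
    print(f"published s3://{bucket}/{key}")
```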
Data from S3 isn't free though; it still costs money and has a limit based on the tier you purchase.
Yeah, you can get dumps of Wikipedia and stackoverflow/stackexchange that way.
(Not sure if created by the admins or a 3rd party, but done once for many is better than overlapping individual efforts).
You can rather easily set up semi-hard rate limiting with a proof-of-work scheme. It barely affects human users, while bot spammers have to eat the cost of a million hash computations per hour or whatever.
e.g. HashCash
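A hashcash-style sketch of what that looks like: the server hands out a random challenge and only answers requests that include a nonce whose hash has enough leading zero bits. The difficulty value and names here are illustrative, not tuned.

```python
# Proof-of-work sketch: sha256(challenge + nonce) must start with
# DIFFICULTY zero bits before the server will serve the request.
import hashlib
import os
from itertools import count

DIFFICULTY = 18  # leading zero bits; cheap for one page view, costly at scale

def make_challenge() -> str:
    return os.urandom(16).hex()

def _leading_zero_bits(digest: bytes) -> int:
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits

def verify(challenge: str, nonce: int) -> bool:
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return _leading_zero_bits(digest) >= DIFFICULTY

def solve(challenge: str) -> int:
    # The work a client (or bot) has to burn CPU on before each request.
    for nonce in count():
        if verify(challenge, nonce):
            return nonce

challenge = make_challenge()
nonce = solve(challenge)
assert verify(challenge, nonce)
```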
Yep. That works well enough for password hashing algorithms to deter brute force attackers.
This is a similar situation.
The idea is not to make scraping impossible, but to make it expensive. A human doesn't make requests as fast as a bot, so the pretend human is still rate limited. Eventually you need an account, that account gets tracked, accounts matching specific patterns get purged, and so on. This will not stop scraping, but the point is not to stop it; it's to make it expensive and slow. Eventually it becomes expensive enough that it's better to stop pretending to be human, pay for a license, and then the arms race goes away.
Can defenses be good enough that it's better not to even try to fight them? That's a far harder question than whether a random bot can make a dozen requests while pretending to be human.
I liked the analogy to Gabe Newell's "piracy is a service problem" adage, embodied in Virgin API consumer vs Chad third-party scraper https://x.com/gf_256/status/1514131084702797827
Make it easier to get the data and put fewer roadblocks in the way of legitimate access, and you'll find fewer scrapers. Even if you make scraping _very_ hard, people will still prefer scraping if legitimate use is even more cumbersome than scraping, or if you refuse to offer a legitimate option at all.
Admittedly, we are talking here because some people are scraping OSM when they could get the entire dataset for free... but I'm hoping these people are outliers, and most consume the non-profit org's data in the way they ask.
I think this very example proves that the adage is wrong, or at least doesn't capture many things for the full picture.
I don't know if the AIs have an endgame in mind. As for the humans, I think it's an internet built for a dark forest. We'll stop assuming that everything is benign except for the malicious parts, which we track and block. Instead we'll assume that everything is malicious except for the parts which our explicitly trusted circle of peers have endorsed. When we get burned, we'll prune the trust relationship that misled us, and we'll find ways to incentivize the kind of trust hygiene necessary to make that work.
When I compare that to our current internet the first thought is "but that won't scale to the whole planet". But the thing is, it doesn't need to. All of the problems I need computers to solve are local problems anyway.
Arguably, trying to scale everything to the whole planet is the root cause of most of these problems. So "that won't scale to the whole planet" might, in the long view, be a feature and not a bug.
Right. If your use case for the internet is exerting influence over people who don't trust you, then it's past time that we shut you down anyhow.
For everyone else, this transition will not be a big deal (although your friends may ask you to occasionally spend a few cycles maintaining your part of a web of trust, because your bad decisions might affect them more than they currently do).
Web Attestation, cryptography to the rescue.
How? Watermark everything with a hash?
Feed bad data to heavy users. Instead of blocking, use poison.
Presumes you can distinguish the heavy users. If you knew who the heavy users were, you could just block them.
isn't the answer just rate limiting unauthenticated requests to a level that's reasonable/expected for a human?
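Concretely, something like a per-IP token bucket tuned to roughly human browsing speed; the numbers below are made up, not a recommendation.

```python
# Sketch: token-bucket rate limiting for unauthenticated traffic, keyed by IP.
import time
from collections import defaultdict

RATE = 1.0     # tokens refilled per second (~1 page/second sustained)
BURST = 10.0   # short burst a human might plausibly produce

_buckets = defaultdict(lambda: {"tokens": BURST, "last": time.monotonic()})

def allow_request(client_ip: str) -> bool:
    bucket = _buckets[client_ip]
    now = time.monotonic()
    bucket["tokens"] = min(BURST, bucket["tokens"] + (now - bucket["last"]) * RATE)
    bucket["last"] = now
    if bucket["tokens"] >= 1.0:
        bucket["tokens"] -= 1.0
        return True
    return False   # reject or challenge: this caller is faster than a human
```

The obvious caveat, raised elsewhere in the thread: scrapers that rotate through large IP pools sidestep per-IP limits.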
An optimistic outcome would be that public content becomes fully peer-to-peer. If you want to download an article, you must seed at least the same amount of bandwidth to serve another copy. You still have to deal with leechers, I guess.
We've had good success with:
- Cloudflare Turnstile
- Rate limiting (be careful here, as some of these scrapers use large numbers of IP addresses and User Agents)
- Lowering the max upload speed for certain IPs to 5kb/s (rough sketch below)
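Not their actual setup, but one way to implement that last item: stream the response in small chunks and sleep between them so flagged IPs get a capped transfer rate. The constants are a loose interpretation of the "5kb/s" above.

```python
# Sketch: bandwidth throttling for flagged IPs via a chunked, slowed response body.
import time
from typing import Iterator

THROTTLE_BYTES_PER_SEC = 5 * 1024   # rough reading of "5kb/s"; tune to taste
CHUNK = 512                         # bytes sent per iteration

def throttled_body(payload: bytes) -> Iterator[bytes]:
    for i in range(0, len(payload), CHUNK):
        yield payload[i:i + CHUNK]
        time.sleep(CHUNK / THROTTLE_BYTES_PER_SEC)
```

The generator can be handed to a WSGI/Flask streaming response for requests coming from the slow list.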
How long before companies start putting AI restrictions on new account creation simply because of the sheer amount of noise and storage issues associated with bot spam?
API-based interactions w/ Authentication.
Websites used to have their own in-house APIs that freely delivered content to anyone who requested it.
Now, a website should be a simple interface for the user: it communicates with an external API and displays the result. It's the user's responsibility to have access to the API.
Any information worth taking should be locked behind authentication - which has become stupid simple using OAuth with major providers.
So these people trying to extract content by paying someone or using a paid service should rather use the API which packages it for them and is fairly priced.
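A rough sketch of what that looks like, assuming a Flask-style endpoint; the token check is stubbed with a set, whereas a real setup would validate an OAuth access token against the provider.

```python
# Sketch: the public site is just a client of this API, and every request
# must carry a bearer token, so every hit is attributable to an account.
from flask import Flask, jsonify, request

app = Flask(__name__)
KNOWN_TOKENS = {"example-token"}   # placeholder for real OAuth token validation

@app.route("/api/content/<item_id>")
def get_content(item_id):
    auth = request.headers.get("Authorization", "")
    token = auth.removeprefix("Bearer ").strip()
    if token not in KNOWN_TOKENS:
        return jsonify({"error": "authentication required"}), 401
    # Abuse can now be traced back to (and billed to) the account.
    return jsonify({"id": item_id, "content": "public content would go here"})
```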
Lastly, robots.txt should be enforced by law. There is no difference between stealing something from a store and stealing content from a website.
AI (and greed) has killed the open freedoms of the Internet.