The spam site was checking for Googlebot IP addresses. If the visitor's IP address matched as belonging to Google, the spam page displayed content to Googlebot. All other visitors got a redirect to other domains that displayed sketchy content.
Years ago, Google had an explicit policy that sites that showed different content to Googlebot than they showed to regular unauthenticated users were not allowed, and they got heavily penalized. This policy is long gone, but it would help here (assuming the automated tooling to enforce it was any good, and I assume it was).
More recently, Google seems totally okay with sites that show content to Googlebot but go out of their way not to show that content to regular users.
About 10 years ago, I was working on a site that served several hundred million non-crawler hits a month. Many of our millions of pages had their content change multiple times a day. Because of the popularity and frequent changes, the crawlers hit us constantly... crawlers accounted for ~90% of our traffic - billions of hits per month. Bing was ~70% of the crawler traffic and Google was ~25% of it. We noticed it because Bing quickly became very aggressive about crawling, exposing some of our scaling limits as they doubled our already significant traffic in a few short months.
I was working on the system that picked ads to show on our pages (we had our own internal ad system, doing targeting based on our own data). This was the most computationally intensive part of serving our pages and the ads were embedded directly in the HTML of the page. When we realized that 90% of our ad pick infrastructure was dedicated to feeding the crawlers, we immediately thought of turning ads off for them (we never billed advertisers for them anyway). But hiding the ads seemed to go directly against the spirit of Google's policy of showing their crawlers the same content.
Among other things, we ended up disabling almost all targeting and showing crawlers random ads that roughly fit the page. This dropped our ad pick infra costs by nearly 80%, saving 6-figures a month. It also let us take a step back to decide where we could make long term investments in our infra rather than being overwhelmed with quick fixes to keep the crawlers fed.
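The shape of the change was simple. Here's a rough sketch (the UA check, names, and ad picking below are made up for illustration, not our actual code):

```python
import random

# Hypothetical sketch of the crawler/human split described above.
# The UA tokens, ad data, and "targeting" are invented for illustration.
KNOWN_CRAWLER_TOKENS = ("googlebot", "bingbot")

def is_crawler(user_agent: str) -> bool:
    ua = user_agent.lower()
    return any(tok in ua for tok in KNOWN_CRAWLER_TOKENS)

def pick_ads(page_ads: list[str], user_interests: set[str], user_agent: str) -> list[str]:
    if is_crawler(user_agent):
        # Crawlers still get ads embedded in the HTML, same as users do,
        # but we skip the expensive targeting work and just pick ads that
        # roughly fit the page.
        return random.sample(page_ads, k=min(3, len(page_ads)))
    # Real users get the full targeting path (stubbed here as a simple filter).
    targeted = [ad for ad in page_ads if ad in user_interests] or page_ads
    return targeted[:3]
```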
This kind of thing is what people are missing when they wonder why a company needs more than a few engineers - after all, someone could duplicate the core functionality of the product in 100 lines of code. At sufficient scale, it takes real engineering just to handle the traffic from the crawlers so they can send you more users. There are an untold number of other things like this that have to be handled at scale, but that are hard to imagine if you haven't worked at similar scale.
90% of traffic being crawlers seems (on the face) just absolutely batshit insane.
Seems like a natural consequence of having "millions of pages", if you think about it? You might have a lot of users, but they're only looking at what they want to look at. The crawlers are hitting every single link, revisiting all the links they've seen before, their traffic scales differently.
Maybe Google needs to implement an API where you can notify it when a page on your site has changed. That should cut down on redundant crawls a lot, eh?
We very much wanted this! We had people that were ex-Google and ex-Bing who reached out to former colleagues, but nothing came of it. You'd think it would be in their interest, too.
The best explanation I can come up with is that relying on notifications would make them look bad whenever a site fails to send one and their search results go stale. Especially if the failures are malicious, fitting in with the general theme of the article.
In 2021 Bing, Yandex, Seznam.cz, and (later, in 2023) Naver ended up implementing a standard where you can notify one search engine of a page update and the other participating search engines are also notified [1, 2, 3].
[1]: https://www.indexnow.org/
[2]: https://www.bing.com/indexnow
[3]: https://blogs.bing.com/webmaster/october-2021/IndexNow-Insta...
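The protocol itself is tiny: you host a key file on your domain and ping an endpoint with the URLs that changed. A rough sketch in Python (the key, host, and URLs are placeholders):

```python
import requests  # third-party: pip install requests

# Rough sketch of an IndexNow ping; the key and URLs are placeholders.
# The key must also be served at https://example.com/<key>.txt so the
# search engine can verify you control the host.
payload = {
    "host": "example.com",
    "key": "0123456789abcdef0123456789abcdef",
    "urlList": [
        "https://example.com/some-page-that-just-changed",
        "https://example.com/another-updated-page",
    ],
}

resp = requests.post(
    "https://api.indexnow.org/indexnow",
    json=payload,
    headers={"Content-Type": "application/json; charset=utf-8"},
    timeout=10,
)
print(resp.status_code)  # 200/202 means the submission was accepted
```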
Should be easy to crosscheck the reliability of update notifications by doing a little bit of polling too.
Or you could just delete it, if your content isn't valuable enough that you'll pay to have it served once a week without ad-dollars to subsidize it.
https://developers.google.com/search/apis/indexing-api/v3/us...
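That's Google's Indexing API, though it's officially limited to pages with job posting or livestream structured data rather than being a general "my page changed" ping. A rough sketch of a call (the service account file and URL are placeholders):

```python
# Sketch of a call to Google's Indexing API. The credentials path and URL
# are placeholders; authorization uses a service account with the indexing scope.
from google.oauth2 import service_account
from google.auth.transport.requests import AuthorizedSession

SCOPES = ["https://www.googleapis.com/auth/indexing"]
ENDPOINT = "https://indexing.googleapis.com/v3/urlNotifications:publish"

credentials = service_account.Credentials.from_service_account_file(
    "service-account.json", scopes=SCOPES  # placeholder key file
)
session = AuthorizedSession(credentials)

response = session.post(
    ENDPOINT,
    json={"url": "https://example.com/updated-page", "type": "URL_UPDATED"},
)
print(response.status_code, response.json())
```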
You can have millions of static pages and serve them very inexpensively. Showing dynamic ads is fundamentally exposing an expensive computational resource without any rate limiting. If that were any other API or service it would be gated, but the assumption here is that this particular service will make more money than it loses, and it obviously breaks down in this instance. I really don't think you can say it's about scale when what you're scaling (serving ads to bots) doesn't make any business sense.
Leaving the ads in was a business necessity because it eliminated the documented risk of being delisted by Google for customizing content for their crawlers. The company would have gone out of business if that happened permanently. Even if it only happened for a few days, it would have meant millions in lost revenue.
I think you’re right. At first I thought “crawlers are actually creating large amounts of spam requests” but this is just the way a searchable web functions. The crawlers are just building the index of the internet.
Adds a lot of weight to the dead internet theory.
can you elaborate?
I googled this: https://en.wikipedia.org/wiki/Dead_Internet_theory
Actually everything we discussed here is the result of genuine human activities.
I still think that humans are very good at identifying other humans, particularly through long-form speech and writing. Sentient and non-sentient beings alike are very good at identifying members of their own species.
I wonder if there's some sort of "time" threshold for how long an AI can speak/write before it is identifiable as an AI to a human. Some sort of Moore's law, but for AI recognizability
tbh a lot of this "theory" is common sense for internet natives.
"most of everything is shit" comes to mind, but "most of email being spam" and "most of web-traffic being porn" are well known.
The number I came up with last time I looked into this was about 60% of page requests are by bots on any normal website.
We have some very long tail content and experienced this in 2023 after all the VC funded LLM start-ups tried to scrape every page ever.
Or you could also NOT serve targeted ads?
They're serving first party targeted ads based on only their own data. If you're going to complain about that, it's close to saying that websites shouldn't be able to make money from advertising at all.
This is the case. Advertising is a scourge, psychological warfare waged by corporations against our minds and wallets. Advertisers have no moral qualms, they will exploit any psychological weakness to shill products, no matter how harmful. Find a "market" of teenagers with social issues? Show them ads of happy young people frolicking with friends to make them buy your carbonated sugar water; never mind that your product will rot their teeth and make them fat. Advertisers don't care about whether products are actually good for people; all they care about is successful shilling.
Advertising is warfare waged by corporations against people and pretending otherwise makes you vulnerable to it. To fight back effectively we must use adblockers and advocate for advertising bans. If your website cannot exist without targeted advertising, then it is better for it to not exist.
Think about what it would mean to not have any advertising whatsoever. Most current large brands would essentially be entrenched forever. No matter how good a new product or service is, it's going to be almost impossible to reach a sustainable scale through purely organic growth starting from zero. Advertising in some form is necessary for an economy to function.
Nonsense cope. I practice total ad avoidance and I still try new products without being prompted to by ads. But at least half the people on this website work in advertising directly or indirectly so I know I'm talking to a wall here. Advertisers and all adtech are parasites trying to convince the host they're necessary. The whole of advertising is worse than worthless and humanity would be much better off if you all dropped dead.
What's a viable business model for web search other than ads (Google, Bing, DuckDuckGo, Naver, etc.) or paid search (Kagi)? If paid search is the only option left, is it okay that poor people can't use the web? Is it okay if poor people don't get access to news?
Oh, and they don't get to vote because voting day and locations can't be advertised by the government, especially in targeted mailings that are personalized with your party affiliation and location. The US Postal Service will also collapse, so those mailings can't go out, even if allowed. At least the rich can still search for their polling location on the web [<- sarcasm].
None of that is okay with me. More/better regulation? Yes! But our world doesn't know how to function without ads. Being absolute about banning ads is unrealistic and takes focus away from achieving better regulation, thereby playing into the hands of the worst advertisers.
Not my problem. Those companies, and any other with business models reliant on advertising, don't have a right to exist. If your business can't be profitable without child labor, your business has no right to exist. This is no different.
I don't mind ads. They're not great, but they're not the end of the world either, and they've paid for a lot of useful services over the years.
Very much this. It's a site/app that has probably been used by 80-90% of adults living in America over the last decade. It would not exist if these ads weren't targeted. I know because we knew (past tense because I'm no longer there) exactly how much targeting increased click-through-rate and how that affected revenue.
On top of that, they were ads for doing more of what the user was doing right then, tailored to tastes we'd seen them exhibit over time. Our goal was that the ads should be relevant enough that they served as an exploration mechanism within the site/app. We didn't always do as well as we hoped there, but it was a lot better than what you see on most of the internet. And far less intrusive because they weren't random (i.e., un-targeted). I have run ad blockers plus used whole house DNS ad blocking as long as I've been aware of them, but I was fine working on these ads because it felt to me like ads done right.
If we can't even allow for ads done right, then vast swaths of the internet have to be pay-walled or disappear. One consequence of that... only the rich get to use most of the internet. That's already too true as it is, I don't want to see it go further.
I have no problem with this (first party, targeted), as far as I read and understand it.
In fact, one of my bigger problems has been that Google has served me generic ads so misplaced they go far into attempted-insult territory (shady dating sites, pay-to-win "strategy games", etc.).
Seems like that's exactly what they did...
For everyone, not just crawlers.
Why not just block bing and save 70% straight away? Nobody uses bing anyway.
Nobody goes there anymore, it's too crowded.
Love finding a Yogi Berra quote in the wild
Nobody quotes Yogi Berra anymore, he's too popular.
I have never used Bing. I use duckduckgo though and they buy their results from Bing. At least they did in the past, I don't follow them closely enough to necessarily notice every possible change.
Specific Google searches are often useless, so I switch to Bing at work and home as needed.
Crawler Hints and IndexNow could reduce that traffic these days:
- https://developers.cloudflare.com/cache/advanced-configurati...
- https://www.indexnow.org/
That 'policy' is still actually in effect, I believe, in Google's webmaster guidelines. They just don't enforce it.
Years ago (early 2000s) Google used to mostly crawl using Google-owned IPs, but they'd occasionally use Comcast or some other ISPs (partners) to crawl. If you were IP cloaking, you'd have to look out for those pesky non-Google IPs. I know, as I used to play that IP cloaking game back in the early 2000s, mostly using scripts from a service called "IP Delivery".
Not sure about now, but I worked in the T&S Webspam team (in Dublin, Ireland) until 2021, and we were very much enforcing the cloaking policy.
It was, however, one of the most difficult types of spam to detect and penalise, at scale.
Is it even well defined? On the one hand, there’s “cloaking,” which is forbidden. On the other hand, there’s “gating,” which is allowed, and seems to frequently consist of showing all manner of spammy stuff and requests for personal information in lieu of the indexed content. Are these really clearly different?
And then there’s whatever Pinterest does, which seems awfully like cloaking or bait-and-switch or something: you get a high ranked image search result, you click it, and the page you see is in no way relevant to the search or related to the image thumbnail you clicked.
I think they must be penalized, because I see this a lot less in the results than I used to.
And btw (unless we are talking about different things), it was possible to get to the image on the target page, but it was walled off behind a login.
Do you have any example searches for the Pinterest results you're describing? I feel like I know what you're talking about but wondering what searches return this.
Whatever Pinterest does should result in them being yeeted from all search engines, tbh.
Apologies for not responding quicker.
For context, my team wrote scripts to automate catching spam at scale.
Long story short, there are non spam-related reasons why one would want to have their website show different content to their users and to a bot. Say, adult content in countries where adult content is illegal. Or political views, in a similar context.
For this reason, most automated actions aren't built upon a single potential spam signal. I don't want to give too much detail, but here's a totally fictitious example for you:
* Having a website associated with keywords like "cheap" or "flash sale" isn't bad per se. But that might be seen as a first red flag
* Now having those aforementioned keywords, plus "Cartier" or "Vuitton" would be another red flag
* Add to this the fact that we see that this website changed owners recently, and used to SERP for different keywords, and that's another flag
=> 3 red flags, that's enough for some automation rule to me.
Again, this is a totally fictitious example, and in reality things are much more complex than this (plus I don't even think I understood or was exposed to all the ins and outs of spam detection while working there).
But cloaking on its own is kind of a risky space, as you'd get way too many false positives.
Curious. How is it detected in the first place if not reported like in this case?
Sampling from non-bot IPs and non-bot UAs.
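A toy illustration of that sampling idea (the URL is a placeholder; real detection would also vary the source IP, render the page, and combine the result with other signals rather than acting on a raw diff):

```python
import difflib
import requests  # third-party: pip install requests

# Fetch the same URL with a crawler-like UA and a browser-like UA and see
# how different the responses are. A very low similarity between the two
# fetches is one cloaking signal among many, not proof by itself.
URL = "https://example.com/some-page"  # placeholder

BOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
HUMAN_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"

as_bot = requests.get(URL, headers={"User-Agent": BOT_UA}, timeout=10).text
as_human = requests.get(URL, headers={"User-Agent": HUMAN_UA}, timeout=10).text

similarity = difflib.SequenceMatcher(None, as_bot, as_human).ratio()
print(f"similarity: {similarity:.2f}")
```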
You can actually get a manual action (penalty) from Google if you do IP cloaking/redirects. It's still mentioned prominently in Google's Webmaster Guidelines: https://support.google.com/webmasters/answer/9044175?hl=en#z...
And then there is Dynamic Rendering, which OKed cloaking:
https://developers.google.com/search/docs/crawling-indexing/...
and then there are AMP pages, which are Google-enforced cloaking...
I think by now all search engines run JavaScript and index the rendered page...
As the founder of SEO4Ajax, I can assure you that this is far from the case. Googlebot, for example, still has great difficulty indexing dynamically generated JavaScript content on the client side.
This isn’t about JavaScript vs no JavaScript.
It’s about serving different pages based on User Agent.
I think they did this because lots of publishers show paywalls to people but still want their content indexed by Google. In other words, they want to have their cake and eat it too!
And of course many of these publishers are politically powerful, and are the trusted sources that google wants to promote over random blogs.
Well, they all show Google ads.
You'd think they could make fine money as neutral brokers since everyone served their ads, and for a long period they did make money as semi-neutral brokers. But since, IDK, 2019 they have become more and more garbage. This is broadly part of the concentration of wealth and power you see everywhere else; I don't know the specifics, but you can see the result.
Not true for the NYT, which has its own ad system.
There is a special spec from Google for that:
https://developers.google.com/search/docs/appearance/structu...
basically cloaking + json-ld markup
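Roughly, the paywalled-content markup looks like the sketch below (expressed as a Python dict with placeholder values; in practice it's emitted in a <script type="application/ld+json"> tag while the full article text is still served to the crawler):

```python
import json

# Rough sketch of the paywalled-content structured data Google documents:
# mark the article as not free and point a cssSelector at the paywalled part.
# All values here are placeholders.
structured_data = {
    "@context": "https://schema.org",
    "@type": "NewsArticle",
    "headline": "Example paywalled article",
    "isAccessibleForFree": "False",
    "hasPart": {
        "@type": "WebPageElement",
        "isAccessibleForFree": "False",
        "cssSelector": ".paywalled-content",
    },
}

print(json.dumps(structured_data, indent=2))
```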
I wonder if Google trains its AI on paywalled data that other scrapers don't have access to, but which those paywalled sites give Googlebot full access to.
Why do you think that the rule is not in effect and that this is not an example of the constant cat and mouse game between Google and spammers?
They still have that rule. Just not always easy to spot spammers getting around it.
See also, pages behind Red Hat and Oracle tech support paywalls.