The spam site was checking for Googlebot IP addresses. If the visitor's IP address matched as belonging to Google, the spam page displayed content to Googlebot. All other visitors got a redirect to other domains that displayed sketchy content.
Years ago, Google had an explicit policy that sites that showed different content to Googlebot than they showed to regular unauthenticated users were not allowed, and they got heavily penalized. This policy is long gone, but it would help here (assuming the automated tooling to enforce it was any good, and I assume it was).
More recently, Google seems totally okay with sites that show content to Googlebot but go out of their way not to show that content to regular users.
About 10 years ago, I was working on a site that served several hundred million non-crawler hits a month. Many of our millions of pages had their content change multiple times a day. Because of the popularity and frequent changes, the crawlers hit us constantly... crawlers accounted for ~90% of our traffic - billions of hits per month. Bing was ~70% of the crawler traffic and Google was ~25% of it. We noticed it because Bing quickly became very aggressive about crawling, exposing some of our scaling limits as they doubled our already significant traffic in a few short months.
I was working on the system that picked ads to show on our pages (we had our own internal ad system, doing targeting based on our own data). This was the most computationally intensive part of serving our pages and the ads were embedded directly in the HTML of the page. When we realized that 90% of our ad pick infrastructure was dedicated to feeding the crawlers, we immediately thought of turning ads off for them (we never billed advertisers for them anyway). But hiding the ads seemed to go directly against the spirit of Google's policy of showing their crawlers the same content.
Among other things, we ended up disabling almost all targeting and showing crawlers random ads that roughly fit the page. This dropped our ad pick infra costs by nearly 80%, saving 6-figures a month. It also let us take a step back to decide where we could make long term investments in our infra rather than being overwhelmed with quick fixes to keep the crawlers fed.
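The shape of the change was simple. Here's a rough sketch (the UA check, names, and ad picking below are made up for illustration, not our actual code):

```python
import random

# Hypothetical sketch of the crawler/human split described above.
# The UA tokens, ad data, and "targeting" are invented for illustration.
KNOWN_CRAWLER_TOKENS = ("googlebot", "bingbot")

def is_crawler(user_agent: str) -> bool:
    ua = user_agent.lower()
    return any(tok in ua for tok in KNOWN_CRAWLER_TOKENS)

def pick_ads(page_ads: list[str], user_interests: set[str], user_agent: str) -> list[str]:
    if is_crawler(user_agent):
        # Crawlers still get ads embedded in the HTML, same as users do,
        # but we skip the expensive targeting work and just pick ads that
        # roughly fit the page.
        return random.sample(page_ads, k=min(3, len(page_ads)))
    # Real users get the full targeting path (stubbed here as a simple filter).
    targeted = [ad for ad in page_ads if ad in user_interests] or page_ads
    return targeted[:3]
```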
This kind of thing is what people are missing when they wonder why a company needs more than a few engineers - after all, someone could duplicate the core functionality of the product in 100 lines of code. At sufficient scale, it takes real engineering just to handle the traffic from the crawlers so they can send you more users. There are an untold number of other things like this that have to be handled at scale, but that are hard to imagine if you haven't worked at similar scale.
90% of traffic being crawlers seems (on the face) just absolutely batshit insane.
Seems like a natural consequence of having "millions of pages", if you think about it? You might have a lot of users, but they're only looking at what they want to look at. The crawlers are hitting every single link, revisiting all the links they've seen before, their traffic scales differently.
Maybe Google needs to implement an API where you can notify it when a page on your site has changed. That should cut down on redundant crawls a lot, eh?
We very much wanted this! We had people that were ex-Google and ex-Bing who reached out to former colleagues, but nothing came of it. You'd think it would be in their interest, too.
The best explanation I can come up with is that relying on notifications would make them look bad whenever a site fails to send one and their search results go stale. Especially if the failures are malicious, fitting in with the general theme of the article.
In 2021 Bing, Yandex, Seznam.cz, and (later, in 2023) Naver ended up implementing a standard where you can notify one search engine of a page update and the other participating search engines are also notified [1, 2, 3].
[1]: https://www.indexnow.org/
[2]: https://www.bing.com/indexnow
[3]: https://blogs.bing.com/webmaster/october-2021/IndexNow-Insta...
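The protocol itself is tiny: you host a key file on your domain and ping an endpoint with the URLs that changed. A rough sketch in Python (the key, host, and URLs are placeholders):

```python
import requests  # third-party: pip install requests

# Rough sketch of an IndexNow ping; the key and URLs are placeholders.
# The key must also be served at https://example.com/<key>.txt so the
# search engine can verify you control the host.
payload = {
    "host": "example.com",
    "key": "0123456789abcdef0123456789abcdef",
    "urlList": [
        "https://example.com/some-page-that-just-changed",
        "https://example.com/another-updated-page",
    ],
}

resp = requests.post(
    "https://api.indexnow.org/indexnow",
    json=payload,
    headers={"Content-Type": "application/json; charset=utf-8"},
    timeout=10,
)
print(resp.status_code)  # 200/202 means the submission was accepted
```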
Should be easy to crosscheck the reliability of update notifications by doing a little bit of polling too.
Or you could just delete it, if your content isn't valuable enough that you'll pay to have it served once a week without ad-dollars to subsidize it.
https://developers.google.com/search/apis/indexing-api/v3/us...
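That's Google's Indexing API, though it's officially limited to pages with job posting or livestream structured data rather than being a general "my page changed" ping. A rough sketch of a call (the service account file and URL are placeholders):

```python
# Sketch of a call to Google's Indexing API. The credentials path and URL
# are placeholders; authorization uses a service account with the indexing scope.
from google.oauth2 import service_account
from google.auth.transport.requests import AuthorizedSession

SCOPES = ["https://www.googleapis.com/auth/indexing"]
ENDPOINT = "https://indexing.googleapis.com/v3/urlNotifications:publish"

credentials = service_account.Credentials.from_service_account_file(
    "service-account.json", scopes=SCOPES  # placeholder key file
)
session = AuthorizedSession(credentials)

response = session.post(
    ENDPOINT,
    json={"url": "https://example.com/updated-page", "type": "URL_UPDATED"},
)
print(response.status_code, response.json())
```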
You can have millions of static pages and serve them very inexpensively. Showing dynamic ads is fundamentally exposing an expensive computational resource without any rate limiting. If that were any other API or service it would be gated, but the assumption here is that this particular service will make more money than it loses, and it obviously breaks down in this instance. I really don't think you can say it's about scale when what you're scaling (serving ads to bots) doesn't make any business sense.
Leaving the ads in was a business necessity because it eliminated the documented risk of being delisted by Google for customizing content for their crawlers. The company would have gone out of business if that happened permanently. Even if it only happened for a few days, it would have meant millions in lost revenue.
I think you’re right. At first I thought “crawlers are actually creating large amounts of spam requests” but this is just the way a searchable web functions. The crawlers are just building the index of the internet.
Adds a lot of weight to the dead internet theory.
can you elaborate?
I googled this: https://en.wikipedia.org/wiki/Dead_Internet_theory
Actually everything we discussed here is the result of genuine human activities.
I still think that humans are very good at identifying other humans, particularly through long-form speech and writing. Sentient and non-sentient beings alike are very good at identifying members of their own species.
I wonder if there's some sort of "time" threshold for how long an AI can speak/write before it is identifiable as an AI to a human. Some sort of Moore's law, but for AI recognizability
tbh a lot of this "theory" is common sense for internet natives.
"most of everything is shit" comes to mind, but "most of email being spam" and "most of web-traffic being porn" are well known.
The number I came up with last time I looked into this was about 60% of page requests are by bots on any normal website.
We have some very long tail content and experienced this in 2023 after all the VC funded LLM start-ups tried to scrape every page ever.
Or you could also NOT serve targeted ads?
They're serving first party targeted ads based on only their own data. If you're going to complain about that, it's close to saying that websites shouldn't be able to make money from advertising at all.
This is the case. Advertising is a scourge, psychological warfare waged by corporations against our minds and wallets. Advertisers have no moral qualms, they will exploit any psychological weakness to shill products, no matter how harmful. Find a "market" of teenagers with social issues? Show them ads of happy young people frolicking with friends to make them buy your carbonated sugar water; never mind that your product will rot their teeth and make them fat. Advertisers don't care about whether products are actually good for people; all they care about is successful shilling.
Advertising is warfare waged by corporations against people and pretending otherwise makes you vulnerable to it. To fight back effectively we must use adblockers and advocate for advertising bans. If your website cannot exist without targeted advertising, then it is better for it to not exist.
Think about what it would mean to not have any advertising whatsoever. Most current large brands would essentially be entrenched forever. No matter how good a new product or service is, it's going to be almost impossible to reach a sustainable scale through purely organic growth starting from zero. Advertising in some form is necessary for an economy to function.
Nonsense cope. I practice total ad avoidance and I still try new products without being prompted to by ads. But at least half the people on this website work in advertising directly or indirectly so I know I'm talking to a wall here. Advertisers and all adtech are parasites trying to convince the host they're necessary. The whole of advertising is worse than worthless and humanity would be much better off if you all dropped dead.
What's a viable business model for web search other than ads (Google, Bing, DuckDuckGo, Naver, etc.) or paid search (Kagi)? If paid search is the only option left, is it okay that poor people can't use the web? Is it okay if poor people don't get access to news?
Oh, and they don't get to vote because voting day and locations can't be advertised by the government, especially in targeted mailings that are personalized with your party affiliation and location. The US Postal Service will also collapse, so those mailings can't go out, even if allowed. At least the rich can still search for their polling location on the web [<- sarcasm].
None of that is okay with me. More/better regulation? Yes! But our world doesn't know how to function without ads. Being absolute about banning ads is unrealistic and takes focus away from achieving better regulation, thereby playing into the hands of the worst advertisers.
Not my problem. Those companies, and any other with business models reliant on advertising, don't have a right to exist. If your business can't be profitable without child labor, your business has no right to exist. This is no different.
I don't mind ads. They're not great, but they're not the end of the world either, and they've paid for a lot of useful services over the years.
Very much this. It's a site/app that has probably been used by 80-90% of adults living in America over the last decade. It would not exist if these ads weren't targeted. I know because we knew (past tense because I'm no longer there) exactly how much targeting increased click-through-rate and how that affected revenue.
On top of that, they were ads for doing more of what the user was doing right then, tailored to tastes we'd seen them exhibit over time. Our goal was that the ads should be relevant enough that they served as an exploration mechanism within the site/app. We didn't always do as well as we hoped there, but it was a lot better than what you see on most of the internet. And far less intrusive because they weren't random (i.e., un-targeted). I have run ad blockers plus used whole house DNS ad blocking as long as I've been aware of them, but I was fine working on these ads because it felt to me like ads done right.
If we can't even allow for ads done right, then vast swaths of the internet have to be pay-walled or disappear. One consequence of that... only the rich get to use most of the internet. That's already too true as it is, I don't want to see it go further.
I have no problem with this (first party, targeted), as far as I read and understand it.
In fact, one of my bigger problems has been that Google has served me generic ads so misplaced they go far into attempted-insult territory (shady dating sites, pay-to-win "strategy games", etc.).
Seems like that's exactly what they did...
For everyone, not just crawlers.
Why not just block bing and save 70% straight away? Nobody uses bing anyway.
Nobody goes there anymore, it's too crowded.
Love finding a Yogi Berra quote in the wild
Nobody quotes Yogi Berra anymore, he's too popular.
I have never used Bing. I use duckduckgo though and they buy their results from Bing. At least they did in the past, I don't follow them closely enough to necessarily notice every possible change.
Specific Google searches are often useless, so I switch to Bing at work and home as needed.
Crawler Hints and IndexNow could reduce that traffic these days:
- https://developers.cloudflare.com/cache/advanced-configurati...
- https://www.indexnow.org/
That 'policy' is still actually in effect, I believe, in Google's webmaster guidelines. They just don't enforce it.
Years ago (early 2000s) Google used to mostly crawl using Google-owned IPs, but they'd occasionally use Comcast or some other ISPs (partners) to crawl. If you were IP cloaking, you'd have to look out for those pesky non-Google IPs. I know, as I used to play that IP cloaking game back in the early 2000s, mostly using scripts from a service called "IP Delivery".
Not sure about now, but I worked in the T&S Webspam team (in Dublin, Ireland) until 2021, and we were very much enforcing the cloaking policy.
It was, however, one of the most difficult types of spam to detect and penalise, at scale.
Is it even well defined? On the one hand, there’s “cloaking,” which is forbidden. On the other hand, there’s “gating,” which is allowed, and seems to frequently consist of showing all manner of spammy stuff and requests for personal information in lieu of the indexed content. Are these really clearly different?
And then there’s whatever Pinterest does, which seems awfully like cloaking or bait-and-switch or something: you get a high ranked image search result, you click it, and the page you see is in no way relevant to the search or related to the image thumbnail you clicked.
I think they must be penalized, because I see this a lot less in the results than I used to.
And btw (unless we are talking about different things), it was possible to get to the image on the target page, but it was walled off behind a login.
Do you have any example searches for the Pinterest results you're describing? I feel like I know what you're talking about but wondering what searches return this.
Whatever Pinterest does should result in them being yeeted from all search engines, tbh.
Apologies for not responding quicker.
For context, my team wrote scripts to automate catching spam at scale.
Long story short, there are non spam-related reasons why one would want to have their website show different content to their users and to a bot. Say, adult content in countries where adult content is illegal. Or political views, in a similar context.
For this reason, most automated actions aren't built upon a single potential spam signal. I don't want to give too much detail, but here's a totally fictitious example for you:
* Having a website associated with keywords like "cheap" or "flash sale" isn't bad per se. But that might be seen as a first red flag
* Now having those aforementioned keywords, plus "Cartier" or "Vuitton" would be another red flag
* Add to this the fact that we see that this website changed owners recently, and used to SERP for different keywords, and that's another flag
=> 3 red flags, that's enough for some automation rule to me.
Again, this is a totally fictitious example, and in reality things are much more complex than this (plus I don't even think I understood or was exposed to all the ins and outs of spam detection while working there).
But cloaking on its own is kind of a risky space, as you'd get way too many false positives.
Curious. How is it detected in the first place if not reported like in this case?
Sampling from non-bot IPs and non-bot UAs.
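A toy illustration of that sampling idea (the URL is a placeholder; real detection would also vary the source IP, render the page, and combine the result with other signals rather than acting on a raw diff):

```python
import difflib
import requests  # third-party: pip install requests

# Fetch the same URL with a crawler-like UA and a browser-like UA and see
# how different the responses are. A very low similarity between the two
# fetches is one cloaking signal among many, not proof by itself.
URL = "https://example.com/some-page"  # placeholder

BOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
HUMAN_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"

as_bot = requests.get(URL, headers={"User-Agent": BOT_UA}, timeout=10).text
as_human = requests.get(URL, headers={"User-Agent": HUMAN_UA}, timeout=10).text

similarity = difflib.SequenceMatcher(None, as_bot, as_human).ratio()
print(f"similarity: {similarity:.2f}")
```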
You can actually get a manual action (penalty) from Google if you do IP cloaking/redirects. It's still mentioned prominently in Google's Webmaster Guidelines: https://support.google.com/webmasters/answer/9044175?hl=en#z...
And then there is Dynamic Rendering, which OKed cloaking:
https://developers.google.com/search/docs/crawling-indexing/...
and then there are AMP pages, which are Google-enforced cloaking...
I think by now all search engines run JavaScript and index the rendered page...
As the founder of SEO4Ajax, I can assure you that this is far from the case. Googlebot, for example, still has great difficulty indexing dynamically generated JavaScript content on the client side.
This isn’t about JavaScript vs no JavaScript.
It’s about serving different pages based on User Agent.
I think they did this because lots of publishers show paywalls to people but still want their content indexed by Google. In other words, they want to have their cake and eat it too!
And of course many of these publishers are politically powerful, and are the trusted sources that google wants to promote over random blogs.
Well, they all show Google ads.
You'd think they could make fine money as neutral brokers since everyone served their ads, and for a long period they did make money as semi-neutral brokers. But since, IDK, 2019 they have become more and more garbage. This is broadly part of the concentration of wealth and power you see everywhere else; I don't know the specifics, but you can see the result.
Not true for the NYT, which has its own ad system.
There is a special spec from Google for that:
https://developers.google.com/search/docs/appearance/structu...
basically cloaking + json-ld markup
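Roughly, the paywalled-content markup looks like the sketch below (expressed as a Python dict with placeholder values; in practice it's emitted in a <script type="application/ld+json"> tag while the full article text is still served to the crawler):

```python
import json

# Rough sketch of the paywalled-content structured data Google documents:
# mark the article as not free and point a cssSelector at the paywalled part.
# All values here are placeholders.
structured_data = {
    "@context": "https://schema.org",
    "@type": "NewsArticle",
    "headline": "Example paywalled article",
    "isAccessibleForFree": "False",
    "hasPart": {
        "@type": "WebPageElement",
        "isAccessibleForFree": "False",
        "cssSelector": ".paywalled-content",
    },
}

print(json.dumps(structured_data, indent=2))
```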
I wonder if Google trains its AI on paywalled data that other scrapers don't have access to, but which those paywalled sites give Googlebot full access to.
Why do you think that the rule is not in effect and that this is not an example of the constant cat and mouse game between Google and spammers?
They still have that rule. Just not always easy to spot spammers getting around it.
See also, pages behind Red Hat and Oracle tech support paywalls.