Last I checked, bleepingcomputer, ifixit, vibilagare.se, psyche.co, and libreboot.org aren't personal sites.
None of those actually have an /about page, yet your site says they do...
Another funny thing: just search for "404" or "not found" and you'll get a lot of 404 pages.
Yes :(
Do you have an idea of how to remove company websites in an automated way? I didn't want to manually review all 7k indexed websites.
This is the GPT prompt I used for filtering domains to add, but it gives false positives:
For the 404s (assuming the status code isn't actually a 4xx), request a URL on the same domain that you strongly suspect won't exist, then compare that response (Levenshtein distance, bag of words, etc.) against the about/ideas/etc. pages to see whether they're very similar.
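A minimal sketch of that probe-and-compare idea, using Python's stdlib `difflib.SequenceMatcher` in place of Levenshtein distance (the function names, the probe URL convention, and the 0.9 threshold are all assumptions, not anything from the actual crawler):

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1]; 1.0 means the two texts are identical."""
    return SequenceMatcher(None, a, b).ratio()


def looks_like_soft_404(page_html: str, probe_html: str,
                        threshold: float = 0.9) -> bool:
    """Heuristic soft-404 check.

    probe_html is the body returned for a URL that should not exist
    (e.g. /this-page-should-never-exist-x7q2). If the indexed /about page
    is nearly identical to it, the /about page is probably a disguised
    "not found" page served with a 200 status.
    """
    return similarity(page_html, probe_html) >= threshold
```

The threshold would need tuning per corpus: sites with heavy shared chrome (nav bars, footers) inflate the similarity of every pair of pages, so stripping boilerplate first, or comparing only the main content block, would make the signal much cleaner.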
Most return a 4xx code (I checked myself); some may be 301/302 redirects to a 4xx that aren't being handled properly by the crawler.
Good point. We're using https://crawlee.dev, I think there's a way to handle more status codes as errors...
Right now it only excludes pages based on the text content: https://github.com/lindylearn/aboutideasnow/blob/main/apps/a...
I think the OpenAI embeddings API could be useful here. Perhaps one of the neurons responds to corporate speak.
Maybe change the API so that GPT can express uncertainty (make it a ternary value or even a confidence percentage), and then check the “uncertain” cases manually.
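One way to wire that up: ask the model for one of three labels and treat anything unparseable as uncertain too, so malformed output can never silently trigger an action. A sketch with hypothetical names (`Verdict`, `route`, the label strings); the actual API call is left out:

```python
from enum import Enum


class Verdict(Enum):
    PERSONAL = "personal"
    COMPANY = "company"
    UNCERTAIN = "uncertain"


def route(raw_label: str, review_queue: list[str], domain: str) -> Verdict:
    """Parse the model's label defensively.

    Anything that isn't exactly one of the known labels is treated as
    UNCERTAIN, and uncertain domains go to a manual review queue instead
    of being indexed or excluded automatically.
    """
    try:
        verdict = Verdict(raw_label.strip().lower())
    except ValueError:
        verdict = Verdict.UNCERTAIN
    if verdict is Verdict.UNCERTAIN:
        review_queue.append(domain)
    return verdict
```

With ~7k domains, even a 10% "uncertain" rate leaves only a few hundred to review by hand, which is a lot more tractable than all of them.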
Yep, most of our systems end up exposing a parameter like that to the customer. Some customers only want the system to act when it's very sure; they hate incorrect actions and would rather leave unprocessed items in a queue. Others hate unprocessed items and prefer to clean up incorrect actions afterwards. It takes some tinkering to find the right setting.
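That customer-facing knob can be as small as a single comparison; a minimal sketch, where `dispatch`, the confidence score, and the two outcome labels are all hypothetical names:

```python
def dispatch(confidence: float, threshold: float) -> str:
    """Act automatically only above the customer-tuned threshold;
    otherwise park the item in a queue for human review."""
    return "auto" if confidence >= threshold else "queue"
```

A cautious customer sets the threshold high and accepts a bigger review queue; a throughput-focused one sets it low and accepts more cleanup.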
Great idea, I will try this. Thank you!!