Kagi founder here. This is a very early (alpha) concept built by a single Kagi Labs developer in a few weeks. The proper infrastructure and product are not built yet. We are launching a prototype to get feedback and gauge demand.
Why does this exist?
It would be an efficient way for us to build and expand our own index. Assuming the users of this would be Kagi users, we would expand our index by tens of thousands of high-quality personal websites, hobby projects, startups, documentation websites, etc., also helping to surface them in our results (where relevant, as we already do with the Kagi Small Web initiative [1]). It is a win-win for both our users and Kagi.
It would also be a way for Kagi to get some exposure outside of kagi.com (provided the search widget has some branding on it).
This is why it makes sense to offer it for free for smaller sites/projects.
And a crowdsourced index heads in the completely opposite direction from the one causing the deterioration of web search results in ad-supported search, where a few entities control the majority of the space [2], so we like it.
That is the plan - and since this is a "Labs" project, we are open to it crashing and burning. Know we do not, until we try. Try and try again, we must.
Hi Vlad,
How will you prevent someone from connecting Sidekick to a website that appears at first to be a small website but eventually fills with LLM-generated SEO keywords and ads? You might be able to manually review sites when they first sign up for Sidekick, but once the channel is open for them to inject content into Kagi's index, what's to stop them from abusing their privileged position?
That is a great question. Well, we first have to ask what the purpose would be of someone going through the trouble of creating such spammy (LLM-generated) SEO content. The answer (for the majority of the web, at least for now) is to monetize it with ads/affiliate links. If that is the case, then the answer is easy, as we already penalize sites with ads/trackers on them in our general web search, and boot them out of our own index completely.
In parallel, we are developing LLM-content detector technology to be more efficient at detecting such content regardless of how it is monetised (and we will offer this as an API once developed).
Is there any reliable AI-generated content detector today? I've tried many free and paid ones online, but they aren't reliable.
It's impossible to make. You cannot prove any sentence was created by an LLM and you can't prove it wasn't.
Unless you design the LLM yourself and purposefully watermark the output. https://arxiv.org/pdf/2306.04634.pdf
That's a digital signature, same as sending an email with GPG to prove you sent it. You wouldn't say that because some people use GPG you can somehow detect who wrote every email on earth, it's a push model vs pull. This is why I wrote "any sentence" vs "some sentences".
Watermarking is not at all like a digital signature and a lot like steganography. I only have a surface level understanding of the process, but it works by biasing token selection to encode information into the resulting text in a way that's resistant to later modifications and rephrasing.
I have my doubts about the effectiveness of this method and realistically, it won't make any difference because the bad actors will just use an LLM that doesn't snitch on them, so you're technically correct.
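For what it's worth, the token-selection bias described above can be sketched as a toy "green list" scheme. This is purely illustrative (the key, names, and toy vocabulary are all made up; real watermarks bias the model's logits during sampling rather than working on a word list like this):

```python
import hashlib
import random

# Hypothetical shared secret; in a real scheme only the model
# operator holds this, so only they can run detection.
SECRET_KEY = b"demo-key"

def green_list(prev_token: str, vocab: list[str], fraction: float = 0.5) -> set[str]:
    # Seed a PRNG from the key plus the previous token, then mark a
    # fixed fraction of the vocabulary as "green" (preferred) tokens.
    digest = hashlib.sha256(SECRET_KEY + prev_token.encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return set(rng.sample(vocab, int(len(vocab) * fraction)))

def green_fraction(tokens: list[str], vocab: list[str]) -> float:
    # Detection: count how often each token lands in the green list
    # derived from its predecessor. Unwatermarked text hovers near
    # `fraction`; watermarked text scores much higher.
    hits = sum(tok in green_list(prev, vocab)
               for prev, tok in zip(tokens, tokens[1:]))
    return hits / max(1, len(tokens) - 1)
```

Note that rephrasing or re-tokenizing the text changes the predecessor of each token, which is exactly why such marks are fragile to the kind of trivial post-processing mentioned elsewhere in the thread.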
The only way to make that steganography robust is to have the encoded message generated with some secret key that can be verified. Otherwise anyone could manually fake the steganography in human-typed messages with the help of some encoder, and you'd have no way of telling whether it was really typed by an LLM. That line of thinking is why it has to work like a signature to cover "any sentence", as you said. I also think these methods only work above a certain character count; short messages are impossible to tell apart.
If you look here: GitHub.com/HNx1/IdentityLM you can see that it's relatively easy to sign LLM output with a private key using an adaptation of the watermarking method.
This application is exactly what I was describing. I'll look it over to see how it scales the encryption strength based on token length or how it deals with short messages, which is the only thing I'd think it'd be very hard to do. If you print 2 paragraphs it's easy to change some tokens with a secret key mask but if you print "Yes", it's not so easy. Thanks for the great share.
you can always ask it to include a '!' after every word and then sed it away. Poof, there goes your watermark
Maybe you can't determine that with certainty, but there may be statistical tools you can use to estimate the probability that some content came from one of the LLMs we know about, based on their known writing styles?
Someone did something like that to identify HN authors (as in correlating similar writing styles between pseudonyms) a few years back, for example: https://news.ycombinator.com/item?id=33755016
Or a study applying similar analysis to LLMs: https://arxiv.org/pdf/2308.07305
Of course, LLM output can be tweaked to evade these, just like humans can alter their writing style or handwriting to better evade detection. But it's one approach.
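As a rough illustration of the stylometric idea, comparing function-word frequency profiles is one classic approach. The word list, threshold-free comparison, and all names here are illustrative sketches, not taken from the linked studies:

```python
import math
from collections import Counter

# Common "function" words tend to have stable frequencies per author
# (or per model) and survive changes of topic, which is what makes
# them useful stylometric features. This short list is illustrative.
FUNCTION_WORDS = ["the", "of", "and", "to", "in", "that", "is", "it", "as", "for"]

def profile(text: str) -> list[float]:
    # Relative frequency of each function word in the text.
    words = text.lower().split()
    counts = Counter(words)
    total = max(1, len(words))
    return [counts[w] / total for w in FUNCTION_WORDS]

def similarity(a: list[float], b: list[float]) -> float:
    # Cosine similarity between two frequency profiles; closer to 1.0
    # suggests a more similar writing style.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0
```

Real attribution systems use far richer features (character n-grams, punctuation habits, sentence-length distributions), but the shape of the approach is the same: build a profile, then compare.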
SEO pages pushing some product are SEO pages pushing some product. You should ignore them no matter what the source is, so what does it matter if they're LLM generated or hand written?
The problem is that people keep consuming the samey low quality content instead of skipping it (think superhero movies and Netflix series that are all indistinguishable from each other). As long as they're satisfied with that, they'll fall for fake product reviews too.
This is a naive take. SEO schemes are attractive for companies that sell products themselves (e.g. try searching anything related to ETL tools). The content itself is the ad and you won’t find any ad serving scripts or affiliate links in there.
(Source: have created such schemes, although would generally not recommend them to my customers nowadays)
Underestimate the average Kagi user at your own peril. I do not think many would fall prey to an LLM generated content marketing page and end up buying a product from such site. Much likelier scenario is the page gets instantly blocked/reported.
not really underestimating Kagi users, just the inexorable push to colonize every last useful network.
This could also be an attack vector for adversaries looking to pollute Kagi's search results and/or force you to divert resources to policing it.
That is, until eternal september.
They want to index companies that sell products. I don't see a big problem here if a company that sells a product I'm searching for, who happens to also have low-quality SEO content, shows up in that search.
In fact, I would rather they not get penalized for it, since low-quality SEO content is a good way to show up in certain other search engines (Google), and every business wants to show up in Google, making that content quite common even from reputable businesses making a quality product.
As someone who in a past life spent loads of time doing SEO, I cannot help but find this argument flawed.
So, we shouldn’t penalize low-quality SEO spam because of people’s wants? I do want them to penalize those sites, because they are a disservice and more often than not crappy, unsecured WordPress sites that drown out those that are not spam.
Thank you Kagi team! A shame how far Google’s results have fallen.
Edit: also, SEO is one of the seedier parts of the software industry. Tons of unaware small businesses are conned into these awful, low-quality sites. I literally quit because it was so morally bankrupt.
You can block the site in Kagi if you don't like it; that's 50% of the search engine's entire moat.
Problem is that many websites used to hire writers who wrote tangentially related posts to get their main product ranked higher, like LogRocket and Partition Minitool do.
Combine that with the guy who boasted about his 'SEO heist', and I think it's a very valid concern.
Isn't this exactly what 37signals and even joelonsoftware were? Isn't HN essentially a free conduit to YCombinator awareness?
I don't see the problem with what you're describing. It seems like one of the most contributory ways to market well.
You're right, but those are the good examples.
Another decent one would be linux sysadmin info from Digital Ocean and the likes.
But for every joelonsoftware there are 99999 sites that have all copy/pasted the same tutorial about something basic and try to push some random product or just ads.
I have solved many problems thanks to a blog post created by a company that wanted to get its product name out there, and I don't think they should be looked at negatively for doing that. Are you upset whenever a company's tech blog lands on HN? Because it is virtually the same thing. If you use Kagi and come across a site that you find low quality and spammy, then just block it. That's the cool thing about using Kagi.
Agreed.
I've also found that type of developer marketing valuable many times in the past. It's sometimes obvious it's going to end in a pitch for the product, but often it does a good job summarizing the key problems in the space, mentioning or showing other solutions/offerings, and explaining which tradeoffs they made for their own product and how they solved issues.
Even if you don't go with the ad, you can quickly pivot to other named players or get a better understanding of the terminology or jargon to start searching more.
My general impression of the LogRocket site is that they have decent articles on how to do frontend development. At least that's what I remember from the times I've been directed there by a search engine.
And we…want to discourage writing useful web pages, even though articles on understanding TypeScript's type system aren't all that closely related to their main product…? What am I missing?
Kagi is Japanese for key, right? We search with a search key. If I'm getting it right.
Do you know that "sidekick" is aibó in Japanese (相棒)?
Notice the "AI" in it?
Didn't know that and thanks for giving us the idea. Aibó sounds much better than Sidekick, we may need to rename :)
Please don't write it Aibó, though. The proper romanization is Aibō or Aibou. See https://en.m.wikipedia.org/wiki/Hepburn_romanization#Long_vo...
For real, what sort of romanisation system uses an acute accent for that? It's much more akin to the Latin macron. But in reality, just leaving off the accent or using the u is more convenient for typing.
A number of European languages use ó for indicating the long vowel o, rather than stress: Czech, Slovak, Polish, Icelandic, Irish, ...
According to Wiktionary, ō is used as a long vowel in two Latvian languages, and in Swedish as a hand-written form of ö (not always a long vowel). Also in Silesian, a language in a region of what is now Poland.
Unfortunately, neither English nor Japanese IMEs on desktops provide any way to type Hepburn that isn't extremely awkward. E.g. on Windows, if you want to type Tōkyō, you have to be in Hiragana mode, go letter by letter, and pick from a list.
Since I'm originally from Slovakia, I used my Slovak IME, which is convenient for me.
Incidentally, Slovak has its own romanization of Japanese, which uses almost nothing but Slovak letters and diacritics (plus "w").
Wikipedia pages about 九州 and 四国:
https://sk.wikipedia.org/wiki/Kjúšu
https://sk.wikipedia.org/wiki/Šikoku
The Hepburn system is an Americanism. Americans don't own the Roman alphabet or the way it should be applied to Japanese. Though Hepburn has official status in that it is taught in Japan, and used for the benefit of visitors. E.g. signs giving names of train stations or government buildings or what have you.
Isn't that already the name of Sony's robot dog?
It is.
Even though Sony has taken that word for "sidekick", there are lots of alternatives. The characters 愛 (love) and 相 (together) have "ai" readings and are productive. The verb あう goes to an あい noun form that is productive for forming words like aizuchi. Plus various others.
aite
愛犬 aiken: beloved dog
相手 aite: companion, other party, opponent.
藍色 aiiro: indigo blue; 濃藍 koai: deep indigo.
...
https://jisho.org/search/あい%3F
and "jisho" means "dictionary", haha! Loved the explanation. Although I can only read the hiragana, I love jisho too! Have to get on with those wanikani exercises...
Sony AIBO. https://en.wikipedia.org/wiki/AIBO
Yeah, what are the odds that the Japanese would use up the obvious Japanese words for tech stuff.
Thank you for the explanation!
While I agree with others that AI can take away from a product's core vision, I've been very happy with Kagi's path and roadmap. I feel like the AI products that you guys have released have served well as complements to search, and I hope the trend continues.
Hopefully this helps with indexing while offering a cool service to small creators!
Edit: I forgot to say, the change where appending a `?` to a search triggers the quick answers was amazing. I would love to see more features that can be invoked by appending or prepending to the search query.
RE your edit, I’m assuming you’re already aware that Kagi lets you create custom bangs, but just in case you’re not, you can create your own shortcuts that when preceded by an exclamation mark like !so can redirect to or search other websites. I use this to append ‘site:reddit.com’ when I add !r to a query, for example.
Is it possible to use the search functionality without the "AI smarts"? I can see a good site search service being a great addition to some websites I run, but I would absolutely not want to push an AI chatbot on my users.
Yes, glad to see the skepticism towards AI, this is why it is turned off by default even in our demo.
Kagi user and scientist here.
I think kagi sidekick would be very well received in the bioinformatics space. Lots of complex docs that require end users to digest large complex data.
Can it be tuned to only point users to the docs and not answer questions?
Yes, summary mode is completely optional (and turned off by default as you can see in our demo).
It sounds good, except I don't want AI on my site. Any way to not have the AI part?
AI is optional and is turned off by default.
What would your approach be for pricing for wiki-type sites that are nonprofit but may have hundreds or thousands of pages with assorted media? I know that decent search beyond just name matching is a recurring issue for independent fandom wikis, which rarely have the funding or coordination to do anything fancier than just a Mediawiki site.
For a random example, there's the Baldur's Gate 3 wiki (https://bg3.wiki), which has upwards of 8,000 pages often with pretty dense text (see https://bg3.wiki/wiki/D%26D_5e_rule_changes for an example) and is funded entirely off donations.
It would be great if there were a free version for charity/non-profit/open-source projects. I don't know if this is feasible for Kagi, but I do know that many of these types of wiki/forum/blog are run on a shoestring.
Because "AI" increases the shareholder's chances of winning the lottery.
You should launch a crypto project for that
I love your transparency. Saying how it benefits Kagi, not just how it is a cool feature for users, is refreshing. It makes me trust more of what you say, and builds some sense of what the product’s direction could be. Thanks.
I think it would be interesting if, like the website ranking that is done on Kagi, there were a way to rate the search results to raise or lower their ranking. It would be a little different, though, since the website ranking on Kagi is per user, but rating the search results might improve the intended search result that many people are looking for.
I guess this assumes that you aren't already doing that when they click one option over another for a certain search term.
Just thinking about searching through some documentation sites and you get a dumb result you weren't looking for at the top, and would want to deprioritize that result.
Great work! How would it be different from Algolia DocSearch?
https://docsearch.algolia.com
Satisfied Kagi early adopter here. Can you make a MediaWiki extension also? MW search leaves something to be desired, and I'd love to have Kagi on my wiki site.
Keep up the great work, you have an incredible product.
I'd use this tomorrow...
If it worked in a shell script or similar old-school unix architectural style on my bespoke static site generator, which is a slow-motion train wreck of a weekend hack-fest being ported from python/staticjinja to rust/minijinja.
Is kagi competing with whoogle?
Whoogle gives me the old-school, seemingly linear algebra of pagerank, hauntology that I expect.
this is a great idea and should have happened long ago..
https://news.ycombinator.com/item?id=19713604#19714732
Ah yes, "The 16 Companies" - except it's actually five hundred...
Seems like the problem in [2] is a few entities controlling the majority of spaces other than search, to me. Shame we don't have any real laws against anti-competitive behavior (just the way YC likes it).
inb4 flagged