
Show HN: I've built a locally running Perplexity clone

nilsherzig
27 replies
19h34m

Happy to answer any questions and open for suggestions :)

It's basically an LLM with access to a search engine and the ability to query a vector db.

The top n results from each search query (issued by the LLM) are scraped, split into small chunks and saved to the vector db. The LLM can then query this vector db to get the relevant chunks. This obviously isn't as comprehensive as having a 128k context LLM just summarize everything, but at least on local hardware it's a lot faster and way more resource friendly. The demo on GitHub runs on a normal consumer GPU (AMD RX 6700 XT) with 12GB VRAM.
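Roughly, the flow looks like this in Go (a minimal sketch of the idea only; the names VectorDB, embed, splitIntoChunks and indexResults are hypothetical stand-ins, not the actual LLocalSearch code):

    package indexer

    import "strings"

    // Chunk is one scraped piece of a page plus where it came from.
    type Chunk struct {
        Source string
        Text   string
    }

    // VectorDB is a stand-in for whatever embedding store is used.
    type VectorDB interface {
        Add(embedding []float32, c Chunk) error
        Query(embedding []float32, topK int) ([]Chunk, error)
    }

    // splitIntoChunks naively cuts text into fixed-size pieces; real splitters
    // usually respect sentence or token boundaries.
    func splitIntoChunks(text string, size int) []string {
        var chunks []string
        for len(text) > size {
            chunks = append(chunks, text[:size])
            text = text[size:]
        }
        if strings.TrimSpace(text) != "" {
            chunks = append(chunks, text)
        }
        return chunks
    }

    // indexResults embeds and stores the scraped top-n pages so the LLM can
    // later query the vector db for just the relevant chunks.
    func indexResults(db VectorDB, embed func(string) []float32, pages map[string]string) error {
        for url, text := range pages {
            for _, c := range splitIntoChunks(text, 500) {
                if err := db.Add(embed(c), Chunk{Source: url, Text: c}); err != nil {
                    return err
                }
            }
        }
        return nil
    }

At question time the LLM's search string is embedded the same way and run through Query, so the model only has to read the top matching chunks instead of whole pages, which is what keeps it fast on local hardware.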

keefle
7 replies
11h0m

Wonderful work!

Is it possible to make it use only a subset of the web (only sites that I trust and think are relevant to producing an accurate answer)? Are there ways to make it work offline on pre-installed websites (Wikipedia, some other wikis, and possibly news sites that are archived locally)? And how about other forms of documents (books and research papers as PDFs)?

kidintech
3 replies
10h53m

Seconded. I tried to do this many years ago for my dissertation and failed, but this would be a dream of mine.

robertlagrant
2 replies
10h46m

Would it not be possible to create a search engine that only crawls certain sites?

kidintech
1 replies
10h42m

I was most interested in the offline aspect of it, which I wouldn't know where to even start with if I were to fork.

How do you parse and efficiently store large, unstructured information for arbitrary, unstructured queries?

stavros
0 replies
9h13m

You put it in a search server, like ElasticSearch or Meili.

nemoniac
1 replies
10h26m

LLocalSearch uses searxng, which has a feature to blacklist/whitelist sites for various purposes.

nilsherzig
0 replies
9h7m

also a great idea to expose this to the frontend. thanks :)

nilsherzig
0 replies
9h8m

uhhhh both ideas are great, would you like to turn them into github issues? i will definitely look into both of them :)

koeng
6 replies
18h12m

What is the search engine that it uses?

nilsherzig
5 replies
18h7m

searxng, a locally running meta search engine that combines a lot of different sources (including Google and co.)

mmahemoff
4 replies
12h59m

This might be more of a searxng question, but doesn't it quickly run up against anti-bot measures? CAPTCHA challenges and Forbidden responses? I can see the manual has some support for dealing with CAPTCHA [1], but in practical terms, I would guess a tool like this can't be used extensively all day long.

I'm wondering if there's a search API that would make the backend seamless for something like this.

1. https://docs.searxng.org/admin/answer-captcha.html

visarga
2 replies
12h2m

As a last resort we could have AI work on top of a real web browser, solving CAPTCHAs as well. It should look like normal usage. I think these kinds of systems (LLM + RAG + web agent) will become widespread and the preferred method to interact with the web.

We can escape all ads and dark UI patterns by delegating this task to AI agents. We could have it collect our feeds, filter, rank and summarize them to our preferences, not theirs. I think every web browser, operating system and mobile device will come equipped with its own LLM agent.

The development of AI screen agents will probably get a big boost from training on millions of screen capture videos with commentary on YouTube. They will become a major point of competition on features. Not just browser, but also OS, device and even the chips inside are going to be tailored for AI agents running locally.

manojlds
1 replies
10h54m

If everyone consumes like that, what's even the incentive for content creators?

visarga
0 replies
6h43m

If content creators can't find anything that is uniquely human and cannot be made by AI, then maybe they are not creative enough for the job. The thing about generative AI is that it can take context: you can put a lot or very little guidance into it. The more you specify, the more you can mix your own unique sauce into the final result.

I personally use AI for text style changes, as a summarizer of ideas and as a rubber duck, something to bounce ideas off of. It's good to get ideas flowing and it can sometimes help you realize things you missed, or frame something better than you could.

nilsherzig
0 replies
9h49m

I didn't run into a lot of timeouts while using it myself, but you would probably need another search source if you plan to host this service for multiple users at the same time.

There are projects like FlareSolverr which might be interesting.

FezzikTheGiant
2 replies
12h58m

If you're open to it, it would be great if you could make a post explaining how you built this. Even if it's brief. Trying to learn more about this space and this looks pretty cool. And ofc, nice work!

nilsherzig
0 replies
9h3m

Guys, I didn't think there would be this much interest in my project haha. I feel kinda bad for just posting it in this state haha. I would love to make a more detailed post on how it works in the future (keep an eye on the repo?)

ziziman
1 replies
4h23m

To scrape the websites, do you just blindly cut all of the HTML into defined-size chunks, or is there some more sophisticated logic to extract the text of interest?

I'm wondering because most news websites now have a lot of polluting elements like popups, would those also go into the database?

totolouis
0 replies
3h53m

If you look at the vector handler in his code, he is using the bluemonday sanitizer and doing some "replaceAll" calls.

So I think there may be some useless data in the vectors, but that may not be an issue since it is coming from multiple sources (for simple questions at least).
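For anyone curious, roughly what that combination looks like in Go (a hedged sketch, not the exact code from the repo; bluemonday's StrictPolicy strips all tags, and the specific ReplaceAll calls here are only illustrative):

    package scrape

    import (
        "strings"

        "github.com/microcosm-cc/bluemonday"
    )

    // cleanHTML strips all tags with bluemonday's strict policy and then removes
    // some leftover noise with plain ReplaceAll calls. The exact replacements in
    // the repo may differ; these are illustrative.
    func cleanHTML(raw string) string {
        text := bluemonday.StrictPolicy().Sanitize(raw)
        text = strings.ReplaceAll(text, "\t", " ")
        text = strings.ReplaceAll(text, "\n\n", "\n")
        return strings.TrimSpace(text)
    }

Popup and cookie-banner text would survive this kind of cleanup, which is probably why some noise ends up in the vectors; as noted above, retrieval over multiple sources tends to dilute it.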

mark_l_watson
1 replies
17h22m

Your project looks very cool. I had it on my 'list' to re-learn TypeScript (I took a TS course about 5 years ago, but didn't do anything with it), so I just cloned your repo so I can experiment with it.

EDIT: I just noticed that most of the code is Go. Still going to play with it!

nilsherzig
0 replies
9h50m

Thanks :). Yea, only the web part is TypeScript and I really wouldn't recommend learning from my TypeScript haha

ivolimmen
1 replies
13h47m

"normal consumer GPU"... well mine is a 4GB 6600.. so I guess that varies.

nilsherzig
0 replies
9h51m

Sorry, it wasn't my intention to gatekeep, but my 300€ card really is on the low end for LLM things

d-z-m
1 replies
8h49m

any plans to support other backends besides ollama?

nilsherzig
0 replies
6h24m

Sure (if they are OpenAI API compatible I can add them within minutes), otherwise I'm open to pull requests :)

Also, I don't own an Nvidia card or a Windows / macOS machine

hanniabu
0 replies
3h7m

This is awesome. I would love it if there were executable files so these dependencies aren't needed. That would make it wayyyy more accessible, rather than just to those who know how to use the command line and resolve dependencies (yes, even Docker runs into that when fighting the local system).

keyle
17 replies
19h36m

Impressive, I don't think I've seen a local model call upon specialised modules yet (although I can't keep up with everything going on).

I too use the local 7B OpenHermes and it's really good.

nilsherzig
10 replies
19h31m

Thanks :). It's just a lot of prompting and string parsing. There are models like "Hermes-2-Pro-Mistral" (the one from the video) which are trained to work with function signatures and output structured text. But in the end it's just strings in > strings out, haha. It's fun (and sometimes frustrating) to use LLMs for flow control (conditions, loops...) inside your programs.
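As a hedged illustration of what "flow control with an LLM" can mean (askLLM is a hypothetical stand-in for whatever local model call you use, e.g. via ollama's HTTP API; this is not code from the project):

    package flow

    import "strings"

    // askLLM is a hypothetical stand-in for a call to a local model
    // (for example via ollama's HTTP API): strings in, strings out.
    var askLLM func(prompt string) string

    // needsWebSearch uses the model's answer as a branch condition:
    // the "if" in your program is decided by parsing an LLM string.
    func needsWebSearch(question string) bool {
        answer := askLLM("Answer only yes or no: does the following question need a web search?\n" + question)
        return strings.Contains(strings.ToLower(answer), "yes")
    }

The branch the program takes then depends entirely on how reliably the model sticks to "yes" or "no", which is where the frustration part comes in.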

keyle
8 replies
19h3m

Wow, I didn't know about "Hermes 2 Pro - Mistral 7B", cheers!

nilsherzig
7 replies
18h50m

It's my go-to "structured text model" atm. Try "starling-lm-beta" (7B) for some very impressive chat capabilities. I honestly think that it outperforms GPT-3 half the time.

peter_l_downs
6 replies
18h13m

Sorry to repeat the same question I just asked the other commenter in this thread, but could you link the model page and recommend a specific level of quantization for the models you've referenced? I'd love to play with these models and see what you're talking about.

peter_l_downs
4 replies
17h55m

Thank you — from that page, at the bottom, I was able to find this link to what I think are the quantized versions

https://huggingface.co/NousResearch/Hermes-2-Pro-Mistral-7B-...

If you have the time, could you explain what you mean by "Q5 is minimum"? Did you determine that by trying the different models and finding this one is best, or did someone else do that evaluation, or is that just generally accepted knowledge? Sorry, I find this whole ecosystem quite confusing still, but I'm very new and that's not your problem.

d-z-m
1 replies
9h13m

Talking GGUF, usually the higher you can afford to go wrt quantization (e.g. Q5 is better than Q4, etc.), the better. A Q6_K has minimal performance loss compared to Q8, so in most cases if you can fit a Q6_K it's recommended to just use that. TheBloke's READMEs [0] usually have a good table summarizing each quantization level.

If you're RAM constrained, you'll also have to make trade-offs about the context length, e.g. you could have 8 GB RAM and a Q5 quant with a shorter context, vs a Q3 with a longer one, etc.

[0]:https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF
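A back-of-envelope way to see that trade-off (the bits-per-weight figures below are rough approximations of common GGUF quant levels, real files add some metadata overhead, and the KV cache for the context needs memory on top of this):

    package quant

    // approxSizeGB is a very rough GGUF size estimate: parameters * bits-per-weight / 8.
    // Real files differ a bit, and the KV cache for the context needs memory on top.
    func approxSizeGB(params, bitsPerWeight float64) float64 {
        return params * bitsPerWeight / 8 / 1e9
    }

    // For a 7B model (bits-per-weight values are approximations):
    //   approxSizeGB(7e9, 4.8) ≈ 4.2 GB  (roughly Q4_K_M)
    //   approxSizeGB(7e9, 5.7) ≈ 5.0 GB  (roughly Q5_K_M)
    //   approxSizeGB(7e9, 6.6) ≈ 5.8 GB  (roughly Q6_K)
    //   approxSizeGB(7e9, 8.5) ≈ 7.4 GB  (roughly Q8_0)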

peter_l_downs
0 replies
1h34m

Thank you!

BOOSTERHIDROGEN
1 replies
12h27m

It's the best balance if you have limited compute performance.

peter_l_downs
0 replies
1h35m

Thank you

davidcollantes
0 replies
18h18m

Got a link for that one? I have found a few with Hermes-2-Mistral in the name.

viksit
3 replies
19h17m

Curious what hardware you use? And is any of this runnable on an M1 laptop?

keyle
1 replies
19h15m

Absolutely, a 7B will run comfortably on 16GB of RAM and most consumer-level hardware. Some of the 40B models run on 32GB, but I found it depends on the model (GGUF, crossing fingers helps).

I ran this originally on an M1 with 32GB; I run this on an Air M2 with 16GB (and a Mac mini M2 32GB), no problem.

I use llama.cpp with a SwiftUI interface (my own), all native, no scripts python/js/web.

7b is obviously less capable but the instant response makes it worth exploring. It's very useful as a Google search replacement that is instantly more valuable, for general questions, than dealing with the hellscape of blog spam ruling Google atm.

Note, for my complex code queries at $dayjob where time is of the essence, I still use GPT4 plus, which is still unmatched imho, without running special hardware at least.

regularfry
0 replies
8h19m

I've been occasionally using a 7b Q4 quant on llama.cpp on an 8GB M1. It's usable, if not amazing.

nilsherzig
0 replies
18h53m

Depends on your m1 specs, but should definitely be able to run a 7b model (at least with some quantization).

windexh8er
0 replies
17h27m

Have you looked into tools like CrewAI [0]?

[0] https://www.crewai.io/

peter_l_downs
0 replies
18h15m

I'm just starting to get into downloading and testing models using llama.cpp and I'm curious which model you're actually using, since they seem to come in varying levels of quantization. Is this [0] the model page for the one you're using, or should I be looking somewhere else? What is the actual file name of the model you're using?

[0] https://huggingface.co/TheBloke/OpenHermes-2.5-Mistral-7B-GG...

ldjkfkdsjnv
9 replies
17h46m

The big secret about Perplexity is they haven't done much beyond using off-the-shelf models.

KuriousCat
5 replies
16h13m

How did they secure funds in that case?

basbuller
3 replies
14h11m

That is probably exactly why they got funding. You can sell it as focusing on adding new features and leveraging the best available tools before reinventing the wheel.

They do train their own models now, but for about a year they just forwarded calls to models like GPT-3.5 Turbo. You still have the option to use models not trained by Perplexity.

hackernewds
1 replies
13h56m

Which is why their engagement and model responses suck. The other competitors are far better.

C.ai and Pi come to mind.

basbuller
0 replies
13h45m

Wait, are you directly comparing Perplexity with C.ai or Pi? Perplexity is a search engine, Pi is a chatbot, and C.ai is roleplay. Their value propositions are very different.

KuriousCat
0 replies
13h30m

I still don't get it. What was the USP here? What is the allure in it for the investors?

code51
0 replies
10h16m

Simple: looking in the mirror and saying "Google-killer" firmly 3 times every day.

nilsherzig
0 replies
6h21m

I assume the same; it feels like their product is just summarizing the top n results? I wouldn't need the whole vector db thing if local models (or hardware) were able to run with a context of this size.

msp26
0 replies
8h46m

Pretraining and even finetuning (to a good extent) are overrated, and you can create plenty of value without them.

ggnore7452
0 replies
13h53m

I've been working on a small personal project similar to this and agree that replicating the overall experience provided by Perplexity.ai, or even improving it for personal use, isn't that challenging. (The concerns of scale or cost are less significant in personal projects. Perplexity doesn't do too much planning or query expansion, nor does it dig super deep into the sources afaik)

I must say, though, that they are doing a commendable job integrating sources like YouTube and Reddit. These platforms benefit from special preprocessing and indeed add value.

wg0
7 replies
8h47m

In five years' time (by 2030), I foresee that a lot of inference will be happening on local machines, with models being downloaded on demand. Think a Docker registry of AI models, which Hugging Face pretty much already is.

This would all be due to optimisations in model inference code and techniques, in hardware, and in the packaging of software like the above.

I don't see the billion-dollar valuations of lots of AI startups out there materialising into anything.

openquery
4 replies
8h39m

> I foresee that lots of inference would be happening on local machines with models being downloaded on demand

Why? It's much more efficient to have centralized special purpose hardware to run enormous models and then ship the comparatively small result over the internet.

By analogy, you don't have a search engine running on your phone right?

dns_snek
1 replies
6h37m

> Why?

Privacy, security, latency, offline availability, access to local data and services running on the device, just to name a few.

ilc
0 replies
3h27m

Big Tech + Countries: Those all sound like great reasons to centralize all access to AIs!

vachina
0 replies
8h34m

A more appropriate analogy would be driving your own car vs. taking the bus.

Sammi
0 replies
8h27m

You currently can't have a search engine running locally on your phone. Google search is possibly the single largest C++ program ever built. And never mind the storage needs...

But in a few years we might be able to have LLMs running on our phones that work just as well, if not better. Of course, as you mention, the LLMs running on large servers might still be much more powerful, but the local ones might be powerful enough.

ThinkBeat
1 replies
8h2m

"640kb will be enough for everyone." (Gates)

I think that the models will evolve and grow as more powerful compute/hardware comes out.

You may be able to run scaled-down versions of what is state of the art now, but by then the giant models will have grown in size and in required compute.

The 6 year old models will be retro computingish.

Somewhat like how you can play 6-year-old games on a new powerful PC, but by then the new huge games will no longer play well on your old machine.

lobocinza
0 replies
3h26m

There will be demand and supply for both cases.

andrewfromx
4 replies
7h48m

searXNGDomain := os.Getenv("SEARXNG_DOMAIN")

I see this, but what search engine lets you get results in JSON for free?
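For context, searxng itself can return JSON when the json output format is enabled in its settings.yml. A hedged sketch of what querying a self-hosted instance could look like (illustrative only, not necessarily how LLocalSearch does it):

    package search

    import (
        "encoding/json"
        "net/http"
        "net/url"
        "os"
    )

    // Result keeps only the fields we care about from searxng's JSON output.
    type Result struct {
        Title   string `json:"title"`
        URL     string `json:"url"`
        Content string `json:"content"`
    }

    type response struct {
        Results []Result `json:"results"`
    }

    // searchSearxng queries a self-hosted searxng instance with format=json.
    // The json output format has to be enabled in searxng's settings.yml.
    func searchSearxng(query string) ([]Result, error) {
        base := os.Getenv("SEARXNG_DOMAIN")
        resp, err := http.Get(base + "/search?format=json&q=" + url.QueryEscape(query))
        if err != nil {
            return nil, err
        }
        defer resp.Body.Close()

        var r response
        if err := json.NewDecoder(resp.Body).Decode(&r); err != nil {
            return nil, err
        }
        return r.Results, nil
    }

Since searxng only aggregates other engines, this stays "free" as long as the upstream engines keep answering, which is what the anti-bot discussion above is about.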

nilsherzig
2 replies
7h45m

Those aren't search results tho, that's just DuckDuckGo internal stuff like "similar queries".

andrewfromx
1 replies
7h41m

Oh, they must be hitting their own internal API with format=json, but what is the data source?

sebzim4500
3 replies
7h55m

Whenever I see these projects I always find reading the prompts fascinating.

> Useful for searching through added files and websites. Search for keywords in the text not whole questions, avoid relative words like "yesterday" think about what could be in the text. The input to this tool will be run against a vector db. The top results will be returned as json.

Presumably each clarification is an attempt to fix a bug experienced by the developer, except the fix is in English not in Go.

nilsherzig
2 replies
7h46m

haha yea pretty much, it's amazing (and frustrating) how much of the program's "performance" depends on these prompts

nilsherzig
0 replies
4h54m

Yea, I'm kinda stressed out trying to get it working for everyone haha. I would have caught that under different conditions. I'm a big e2e tests guy haha

pants2
3 replies
18h24m

It says it's a "locally running search engine" - but I'm not sure how it finds the sites and pages to index in the first place?

nilsherzig
2 replies
18h14m

Yea I guess that's misleading, I should probably change that. I was referring to the LLM part as locally running. Indexing is still done by the big guys and queried using searxng

nilsherzig
0 replies
18h8m

Just to clarify, it wasn't my intention to be misleading

lavela
0 replies
9h30m

What would be your current recommendation on how to create a vector db from local files that would work with LLocalSearch?

fnetisma
3 replies
18h38m

This is really neat! I have questions:

“Needs tool usage” and “found the answer” blocks in your infra, how are these decisions made?

Looking at the demo, it takes a little time to return results. Of the search, vector storage and vector db retrieval steps, which takes the most time?

nilsherzig
2 replies
18h29m

Thanks :)

Die LLM makes these decisions on its own. If it writes a message which contains a tool call (Action: Web search Action Input: weight of a llama) the matching function will be executed and the response returned to the LLM. It's basically chatting with the tool.

You can toggle the log viewer on the top right to get more detail on what it's doing and what is taking time. Timing depends on multiple things:

- the size of the top n articles (generating embeddings for them takes some time)
- the amount of matching vector DB responses (reading them takes some time)
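A hedged sketch of how a message in that "Action: ... / Action Input: ..." format can be parsed (the real agent loop in the langchain library is more involved; this only shows the string-parsing idea described above):

    package agent

    import "strings"

    // parseAction extracts the tool name and input from a ReAct-style message such as:
    //   Action: Web search
    //   Action Input: weight of a llama
    // ok is false if the message contains no complete tool call.
    func parseAction(msg string) (tool, input string, ok bool) {
        for _, line := range strings.Split(msg, "\n") {
            // check "Action Input:" first, since "Action:" is a prefix of it
            if v, found := strings.CutPrefix(line, "Action Input:"); found {
                input = strings.TrimSpace(v)
            } else if v, found := strings.CutPrefix(line, "Action:"); found {
                tool = strings.TrimSpace(v)
            }
        }
        return tool, input, tool != "" && input != ""
    }

The matching tool's output is then fed back to the model as the next message (an "Observation" in the usual ReAct convention), which is the "chatting with the tool" part.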

dcreater
1 replies
12h32m

> Die LLM

You mean the? The German is bleeding through haha

rzzzt
0 replies
12h13m

Wolfenstein 3D did it first! And then The Simpsons as well.

arflikedog
3 replies
11h7m

A while back you commented on my personal project Airdraw which I really appreciated. This looks awesome and you're well on your way to another banger project - looking forward to toying around with this :)

gardenhedge
1 replies
8h25m

Did you just happen to see this post today and notice the username?

arflikedog
0 replies
3h5m

unironically yes, I used comments to hot fix a bunch of stuff when I first launched. It's a small world and I thought this was a cool moment

nilsherzig
0 replies
9h48m

Uhh yes I was really impressed by your project :)

nikolayasdf123
2 replies
9h46m

cool to see Go here

nilsherzig
1 replies
7h42m

langchain go (missed opportunity to call it golang-chain) is nice, but has literally no docs haha

nilsherzig
0 replies
6h17m

btw ollama (the webserver around llama.cpp) is also written in golang :)

ml-anon
2 replies
11h12m

Did you really make a perplexity clone if you didn’t spend more time promoting yourself on Twitter and LinkedIn than on the engineering?

nilsherzig
1 replies
9h43m

Ah damn I forgot about getting some VC money

ProllyInfamous
0 replies
1h26m

This demo will land you more important things than "just" VC money.

Extremely impressive, cannot wait to actually implement this on my M2Pro (mac).

madeofpalk
2 replies
6h57m

Q: is chrome on ios powered by safari

According to the sources provided, Chrome on iOS is not powered by Safari. Google's Chrome uses the Blink engine, while Safari uses the WebKit engine.

I find it amusing how when people show off their LLM projects their examples are always of it failing, and providing a bad answer.

nilsherzig
1 replies
6h23m

Well, I don't intend to get money from people, so I guess showing real results isn't a "problem".

Besides, I think the following sentences aren't wrong? It's just a 7B model, give it some slack haha

madeofpalk
0 replies
5h20m

No, sure. I just think it's funny how it constantly happens, with open source, free, or commercial projects. No one seems to be immune to 'telling on themselves'.

gorbypark
2 replies
9h18m

I did a quick poke through the source and it seems like there's not much reason this couldn't run on macOS? It seems that ollama is doing the inference and then there's a Go binary doing everything else? I might give it a go and see what happens!

nilsherzig
1 replies
7h44m

Sure, there are people in the issues who got it working on macOS. Docker networking was the only problem :)

mritchie712
0 replies
7h4m

I have it running on my Mac right now, took < 2 minutes (had to manually download one of the ollama models)

BrutalCoding
2 replies
8h40m

That's a great project you pulled off. From the time I starred it (10-12h ago I think) to re-checking this post now, you've gained 500+ stars lol.

Visualized in a chart with star-history: https://star-history.com/#nilsherzig/LLocalSearch

sroussey
0 replies
2h30m

Ah, that's a nice chart generator. Will have to use it if I ever get any, lol.

nilsherzig
0 replies
8h24m

haha thanks for the chart link. I woke up to 1k more stars than it had yesterday, I'm kinda stressed out

xydac
1 replies
17h34m

This is cool, haven't run this yet but it seems really promising. Am thinking how super useful it could be to hook this into internal corporate search engines and then get answers from those.

Good to see more of these non-API-key products being built (connected to local LLMs).

nilsherzig
0 replies
7h43m

I might try to hook this into our internal Confluence, shouldn't be a problem

siborg
1 replies
9h5m

Exciting project. Trying to install it but running into some issues with searxng. Anyone else?

nilsherzig
0 replies
9h1m

please tell me about the problem or open an issue :)

sgt
1 replies
7h7m

Speaking of LLMs... here's my "dear lazyweb" to HN:

What would be the best self hosted option to build sort of a textual AI assistant into your app? Preferably something that I can train myself over time with domain knowledge.

traverseda
0 replies
6h53m

Fine-tuning on your own knowledge probably isn't what you want to do; you probably want retrieval-augmented generation instead. Basically a search engine on some local documents, and you put the results of the search into your prompt. The search index is built from embeddings, so the results should be highly relevant to whatever the prompt is.
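The core of it is small enough to sketch (retrieve is a hypothetical stand-in for a vector-store lookup; this is a hedged illustration, not any particular library's API):

    package rag

    import (
        "fmt"
        "strings"
    )

    // buildPrompt stuffs retrieved chunks into the prompt ahead of the user's
    // question; retrieve is a hypothetical stand-in for a vector-store lookup.
    func buildPrompt(question string, retrieve func(q string, topK int) []string) string {
        context := strings.Join(retrieve(question, 4), "\n---\n")
        return fmt.Sprintf(
            "Answer using only the context below.\n\nContext:\n%s\n\nQuestion: %s",
            context, question)
    }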

I'd start with "librechat" and mistral, so far that's one of the best chat interfaces and has good support for self hosting. For the actual model runner, ollama seems to be the way to go.

I believe librechat is built on "langchain", so once you've tested all your queries and your setup with librechat, you can switch down to langchain directly when that makes sense.

I'd start by testing the workflow in librechat, and if librechat's API doesn't do what you want, well I've always found fastAPI pleasant to work with.

---

Less for your use case, and more in general: I've been assessing a lot of LLM interfaces lately, and the weird porn community has some really powerful and flexible interfaces. With SillyTavern you can set up multiple agents: have one agent program, another critique, and a third assess it for security concerns. This kind of feedback can help catch a lot of LLM mistakes. You can also go back and edit the LLM's response, which can really help. If you go back and edit an LLM message to fix code or change variable names, it will tend to stick with those decisions. But those interfaces are still very much optimized for "role playing".

Recommend keeping an eye on https://www.reddit.com/r/LocalLLaMA/

oysterpingu
1 replies
10h3m

Awesome project! As a newbie myself in everything LLM, where should I start looking to create a similar project to yours? Which resources/projects are good to know about? Thank you for sharing!

nilsherzig
0 replies
9h52m

I think the easiest entry point would be the python langchain project? It has a lot more documentation and working examples than the golang one I've used :)

If you could tell me more about your goals, I can probably provide a more narrow answer :)

noisy_boy
1 replies
10h12m

Would be good if the readme mentioned minimum hardware specs needed to get reasonably decent performance. E.g. I have a ThinkPad X1 Extreme i7 with Max-Q graphics, any hope of running this on it without completely ruining the performance?

nilsherzig
0 replies
9h45m

You could run the LLM using your CPU and normal (non-video) RAM, but that's a lot slower. There are people working on making it a lot faster tho. The bottleneck is the transfer speed between the RAM sticks and the CPU.

Just taking a guess, but I wouldn't expect more than a couple of tokens (more or less like syllables) per second. Which is probably too slow, since it has to read a couple thousand per search result.

It's hard to provide minimum requirements, since there are so many edge cases.
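A rough way to see why RAM-to-CPU bandwidth is the limit (token generation needs roughly one pass over all the weights per token; the numbers here are illustrative guesses, not benchmarks):

    package perf

    // maxTokensPerSecond is a back-of-envelope ceiling for CPU generation speed:
    // each new token needs roughly one pass over all the weights, so throughput
    // is capped by memory bandwidth divided by model size.
    func maxTokensPerSecond(bandwidthGBs, modelSizeGB float64) float64 {
        return bandwidthGBs / modelSizeGB
    }

    // Example (illustrative numbers): ~40 GB/s dual-channel laptop RAM and a
    // ~5 GB 7B quant give maxTokensPerSecond(40, 5) = 8 tokens/s as an upper
    // bound; real-world throughput is usually lower.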

hubraumhugo
1 replies
11h15m

Excellent work! Cool side projects like that will eventually help you get hired by a top startup or may even lead to building your own.

I can only encourage other makers to post their projects on HN and put them out into the world.

nilsherzig
0 replies
9h44m

Yea, it's also quite fulfilling to see people liking something you've put some work into :)

hackernewds
1 replies
13h54m

What does this have to do with Perplexity? It should reference the underlying models used instead.

vishnumohandas
0 replies
13h48m

The UX is comparable.

adr1an
1 replies
9h35m

This is so cool! And the fact that you can use Ollama as the 'llm backend' makes it sustainable. I didn't see how to switch models in the demo; that might be worth highlighting in the readme.

adr1an
0 replies
8h54m

I have a 'feature request': can we manage which sites are used, by category, in the frontend? For example, if I build a list of websites and put them under "coding", then I'd like those to be used to answer my programming questions. Meanwhile, I'd like to add an "art" category for museum homepages so that I can ask what year painting XYZ is from. And so on. The current implementation looks like the interoperability with searxng is more static... IDK if searxng has an API to switch those filters or if they can already be managed through 'profiles'.. that kind of thing.

pcthrowaway
0 replies
17h55m

Now if someone can hook this into Plandex (also shared today - https://news.ycombinator.com/item?id=39918500) to make a tool that enables you to collaborate with AI without any of your code leaving your computer, that would be amazing!

nilsherzig
0 replies
9h54m

Uhh sorry guys, I was asleep and now the project has like 1k stars haha

I will try my best to catch up with everyone <3

gardenhedge
0 replies
8h27m

Completely locally running search engine.. that queries the Internet

firtoz
0 replies
12h45m

Excellent work! I plan to use it with existing LLMs tbh, but great to see it working locally also! Thank you so much for sharing. I love the architecture.

darby_eight
0 replies
16h57m

Perplexity seems to be a chatbot competitor.

aagha
0 replies
10m

According to Crunchbase [0], Perplexity has raised over $100M.

You built this in your spare time?

The following things jump out to me:

- How much a hype cycle invites insane amounts of money
- How trash the entire VC world is during a hype cycle
- What an amazing thing ingenuity and passion are

Great job!

0 - https://www.crunchbase.com/organization/perplexity-ai