HN comments for: YaCy, a distributed Web Search Engine, based on a peer-to-peer network

ssijak

41 replies

5d10h

2024-03-06 07:56:03 UTC

A long time ago I worked for a startup called Wowd, which built a distributed search engine. It was acquihired by Facebook.

One of the biggest issues was how to entice people to download and run the client/node.

I half wondered afterwards whether slapping some crypto on top of it, mined by running the node and providing resources, would help. My gut says an easy yes, but my mind grimaces at the abomination.

lifty

23 replies

5d9h

2024-03-06 09:10:54 UTC

Not sure why it would be an abomination. This is the exact use case which is a fit for cryptocurrency networks.

rakoo

22 replies

5d5h

2024-03-06 12:52:01 UTC

You have to look beyond the surface. Cryptocurrencies work specifically to address a system where no node can trust any other node. If I cannot trust any other node, why would I fetch anyone else's index, or ask them for the results of a query, or even talk to them?

Unless there is a way to trivially verify what others tell you, cryptocurrencies are a dead end.

mhluongo

9 replies

5d5h

2024-03-06 13:09:57 UTC

You have that issue without cryptocurrencies as well, you'd just be relying on the kindness of users rather than crypto incentives.

You always need a way to hold nodes accountable in a system like this, or it'll be rife with manipulation — because there's already a strong, innate incentive to manipulate results. Today, we call that industry "SEO".

rakoo

8 replies

5d5h

2024-03-06 13:17:26 UTC

What you don't understand is that "I don't trust others" is not a terminal state. I'd rather build trust again and create human connections, or rather put them in front, because there are always connections; nothing works if you trust no one.

Building a societal system where you know you can rely on your peers, you build together, is a more joyful, more resilient, more ecological and also more realistic way of building a thriving society than distrust-by-default that cryptocurrencies live for.

idiotsecant

5 replies

5d4h

2024-03-06 14:22:02 UTC

Your current fiat currency is not based on love and trust. It's Proof-of-World-Hegemony, which puts crypto-based consensus mechanisms to shame in terms of how not based on love and trust it is.

rakoo

3 replies

5d2h

2024-03-06 15:43:45 UTC

My current fiat currency is absolutely based on trust that the State will uphold any disagreement, even though I know it is not benevolent.

I also don't understand your point. The current world is not what I want, so let's make it worse according to my values?

idiotsecant

2 replies

5d1h

2024-03-06 17:04:19 UTC

The value of, for example, the dollar is not based on your trust, at least not at the first order. It's based on the economic and military power backing it up.

rakoo

1 replies

4d21h

2024-03-06 21:23:48 UTC

Absolutely it is: it is based on the trust we all have that the government will do whatever it takes to guarantee the value of a dollar. Me being able to trade with you in dollars and not, say, in old Zimbabwean dollars rests on the shared assumption that the US State can and will be there.

idiotsecant

0 replies

1d19h

2024-03-09 22:55:10 UTC

We're playing with words now, but the power of the dollar does not arise from your faith in it; it arises from the nations around the world that view it as a stable medium of trade. This stability is based on the employment, and the perceived ability to employ, hegemonic hard and soft power. The US isn't going anywhere because it has the guns, bullets, bombs, allies, trade, and diplomacy to stay in the top spot. Your faith is a product of that system, not a cause of it. Your faith won't build aircraft carriers.

Brian_K_White

0 replies

4d23h

2024-03-06 18:43:48 UTC

Sure it is. When someone gives me a dollar, I have no idea if it's fake or stolen.

That sort of thing only gets handled very indirectly and much later and after a bad actor does their bad thing enough times for the surrounding greater population of good actors to notice a pattern.

Brian_K_White

1 replies

4d23h

2024-03-06 18:39:04 UTC

And I think this is not even stupid either.

Bad actors exist and there must be some process for identifying and dealing with them, but they are not the majority of people and so probably don't have to be the first, last, primary, and only consideration at all times.

I.e., living in a bomb shelter is not a life worth living, even though yes, you will be safe from bombs and thieves.

rakoo

0 replies

4d20h

2024-03-06 21:29:25 UTC

Exactly. If I'll have to depend on someone else anyway (and I will), I might as well build trust, because a life spent being cautious about everything and everyone is not worth living. Only those with already vast amounts of money can afford it, because they trust (heh) other people working for them to take care of that, but to non-jokingly propose it as a standard for everyone is a dystopia.

zubairq

3 replies

5d2h

2024-03-06 15:49:20 UTC

Interesting comment about how cryptocurrencies can enable a system where no node can trust any other node. Something for me to think about as I am building a peer to peer system (not a search engine though)

rakoo

2 replies

5d2h

2024-03-06 15:57:54 UTC

Cryptocurrencies only help where no one can trust anyone. But if that's the case, I claim that such a system is not viable in the long term.

zubairq

1 replies

4d23h

2024-03-06 19:17:34 UTC

Good point. Does this mean that Bitcoin is not viable in the long term?

rakoo

0 replies

4d20h

2024-03-06 21:49:29 UTC

Bitcoin as a speculating tool lives as long as speculation can live. Bitcoin, or any cryptocurrency as an actual currency exchanged at large scale will not work, or at least not in a democracy.

miohtama

2 replies

5d2h

2024-03-06 15:57:44 UTC

Cryptocurrencies solve the spam problem, not the trust problem. No one can spam the network with new write data (transactions) because spam would become expensive. Although people still do, and Ethereum is full of spam tokens, meaning the transaction cost is still too low. This was also the use case of Hashcash, a predecessor in proof-of-work, which was designed to solve email spam.

You are paying either

- Block space: your transaction to be included in a block

- State: modifying the world state (EVM in Ethereum)

The trust problem is solved by various other means, usually at the libp2p level, by banning nodes (IP addresses) that send you bad data, which you can verify by comparing it to data from other peers.
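
To make the "spam becomes expensive" point concrete, here is a toy proof-of-work stamp in the spirit of Hashcash. It is only a sketch: real Hashcash uses SHA-1 and a specific stamp format, while this version assumes SHA-256 and a made-up `message:nonce` encoding, and the names `mint` and `verify` are illustrative.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Count how many zero bits the digest starts with."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def verify(message: str, nonce: int, difficulty: int) -> bool:
    """Cheap check: one hash, regardless of how hard minting was."""
    digest = hashlib.sha256(f"{message}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty


def mint(message: str, difficulty: int) -> int:
    """Expensive search: try nonces until the hash has enough leading zero bits."""
    for nonce in count():
        if verify(message, nonce, difficulty):
            return nonce
```

Minting costs about 2**difficulty hashes on average, while verifying costs one. That asymmetry is what rate-limits writers; it says nothing about whether the content is honest.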

rakoo

0 replies

4d20h

2024-03-06 21:47:28 UTC

Cryptocurrencies slow down the rate of data not because of spam but because a slower rate means a higher consistency across the network: cryptocurrencies' goal is to agree on a consistent state with peers who do not want to negotiate. If the consistent state is pure garbage then that is not a problem for blockchains, because from blockchains' point of view, everything is fine.

Spam is not a function of rate but of content. Spam can absolutely be sent in a blockchain, as you say, and making the price higher only makes both spam and non-spam more difficult. Spam for me might be actual legit information for you.

Hashcash is another beast, it only has the proof-of-work part, not the money part (contrary to its name) so it's not comparable.

dumbfounder

0 replies

4d23h

2024-03-06 19:22:11 UTC

They also solve the trust problem through consensus using proof of stake. If there is enough financial skin in the game to behave correctly, then that should be enough to make sure that results are not tainted.

mattdesl

2 replies

5d2h

2024-03-06 16:08:26 UTC

This seems like something that could be verified through ZK proofs. The data to search could be represented by a public merkle root, and the searching/indexing given the user query could be programmed in a ZKVM like RISC0[1].

[1] https://www.risczero.com/zkvm

notfed

1 replies

5d1h

2024-03-06 17:22:34 UTC

Most information is not a math equation.

mattdesl

0 replies

4d10h

2024-03-07 07:44:04 UTC

As it turns out, a lot can be.

The concept of a ZK VM is that it is able to prove arbitrary code. Risc0 and the more recent SP1[1] both compile arbitrary Rust programs into ZK circuits for generating and verifying execution proofs.

[1] https://github.com/succinctlabs/sp1

lifty

1 replies

5d3h

2024-03-06 14:44:56 UTC

You should be able to add incentives to the system so that people store the correct index. You can check the incentive design of Filecoin for an example of how you can do that. Obviously, how the incentive mechanism should be built depends on the application.

rakoo

0 replies

5d2h

2024-03-06 15:49:05 UTC

Filecoin is "easy": it is trivial to verify that the blob you stored is the one I wanted you to store. There is no trivial way to verify that you indexed what I wanted you to index, or that you reply what I wanted you to reply.

I highly dislike monetary incentives because they perpetuate inequalities by design, so here's another incentive: if you store a correct index, I will keep working with you and we can build an awesome system together. We can coordinate by talking to each other rather than trying to get money from each other.
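
The asymmetry about storage versus indexing can be illustrated with a plain content hash. This is a simplification and not Filecoin's actual proof system; `content_id` and `is_expected_blob` are hypothetical names. Checking a returned blob is one hash comparison, whereas there is no comparably cheap check that a returned list of search results is the "right" one.

```python
import hashlib
import hmac


def content_id(blob: bytes) -> str:
    """Identify a blob by the SHA-256 of its bytes."""
    return hashlib.sha256(blob).hexdigest()


def is_expected_blob(returned: bytes, expected_id: str) -> bool:
    """Verify that a peer returned exactly the bytes we asked it to store."""
    return hmac.compare_digest(content_id(returned), expected_id)
```

The requester only needs to remember `expected_id`, not the blob itself.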

worksonmine

10 replies

5d9h

2024-03-06 08:52:00 UTC

but my mind grimaces at the abomination

Why would that be an abomination? It's a perfect use-case. Like you noticed people need incentives to volunteer their hardware. If you hate crypto because it's crypto you can just use fiat instead.

komali2

4 replies

5d8h

2024-03-06 10:14:21 UTC

Like you noticed people need incentives to volunteer their hardware.

I wonder if this is because "volunteer your hardware" projects sometimes involve someone else making money, and if someone else is making money but not you, why should you donate your hardware?

For the truly libre "hardware donation" projects, they seem to be doing ok without financial incentivization. What immediately comes to mind is the petabytes of data flying around on peer to peer systems through torrenting. I know people that spend thousands of dollars a year on upkeep and upgrades for what are essentially super seedbox homelabs (I'm one of them too :P )

There's also communities like soulseek where people keep TBs of music up, often seeking out rare tracks to make available to the community for free.

There's folding@home and seti@home, and I'm sure other similar projects I haven't heard of, where people donate cycles just for the common good.

folding@home is a great example because we can directly compare the people that are "incentivized" to participate with bananocoin, a cryptocurrency rewarded based on work cycles in folding@home. You can see all bananocoin miners here under the banano.cc team: https://stats.foldingathome.org/ That team is in first place for work completed, however are only just surpassing the linus tech tips team, and not to mention compared to a bunch of other teams (and private "donors") they're a very small % of work completed for folding@home

So therefore I disagree that people "need" incentives, there just needs to be no, erm, disincentives, if that's a word.

shinryuu

3 replies

5d7h

2024-03-06 10:47:26 UTC

I know people that spend thousands of dollars a year on upkeep and upgrades for what are essentially super seedbox homelabs

And then you end with "there just needs to be no disincentives". If anything spending thousands of dollars a year on upkeep should be a disincentive for most people. You are not most people though, since you do it voluntarily.

komali2

2 replies

5d7h

2024-03-06 11:06:40 UTC

I'm a maniac though. I used to run my stack just fine off a Raspberry Pi with a USB hard drive plugged in.

Actually, before that, I used to run it off an old macbook.

Do we need it to be where everyone hosts a node? I just had this conversation with a friend yesterday actually. We were in disagreement about the accessibility of self hosting and federation. He was of the opinion that we should push LLMs to where anyone can type "I want to host a video hosting platform" and chatgpt.exe will find and install jellyfin on their computer and set up a cloudflare tunnel, or whatever.

I'm more of the opinion that we should increase the quality of documentation until the one person just weird and nerdy enough out of a group of 20 will be able to deploy things on leftover hardware, and share with their friends.

What do you think?

shinryuu

1 replies

4d20h

2024-03-06 21:27:18 UTC

In terms of accessibility I don't think it would be bad per se if chatgpt.exe were able to help you with that. Though both of us know that there is maintenance involved, and once something catches fire (which will happen at some point), you are kind of helpless.

Something like pikapods.com certainly helps with accessibility, even if it isn't self-hosting per se.

But all of that has little to do with incentives or disincentives. Even with very high accessibility there are disincentives to self-host. It will cost time and money in some way. For some people the intrinsic motivation will override those disincentives. But I think for the majority of people there will still not be enough motivation to do it.

There are more important things to do for them.

komali2

0 replies

4d15h

2024-03-07 03:06:38 UTC

There are more important things to do for them.

Well yes, because right now society disincentivizes people from ever spending their time on anything that doesn't earn them at least a little bit of money. Kind of to my earlier point that "FOSS" projects with a monetization angle disincentivize people from contributing their time to make someone else money. Well, except for the fact that it's almost a requirement for people in certain geographies to have FOSS commits on their portfolio, due to economic disparity. Yay free labor pool.

Should we actually leverage our technology to share the bounty of post-scarcity we could have today, don't you think people would spend more time on passion projects?

bawolff

4 replies

5d4h

2024-03-06 13:43:12 UTC

I mean, how do you verify nodes are being honest and not just sending fake data for the free crypto (like what happened with seti@home and there wasn't even money involved)

Not to mention, where is the value of this coin going to come from? Will people pay to use this search engine? That seems unlikely.

It doesn't sound like the perfect use case to me.

numpad0

1 replies

5d2h

2024-03-06 16:23:05 UTC

Agreed; feels to me that people here are underestimating malice on the Internet. A simple crypto-based search credit system will be overtaken with fake queries and fake data.

I'm not entirely confident that crypto-like reward mechanisms for distributed search are fundamentally flawed and unusable, but both the problem and the solution need to be refined a bit more.

worksonmine

0 replies

2024-03-06 17:28:26 UTC

Agreed; feels to me that people here are underestimating malice on the Internet.

I don't think we do. We just prefer to put our trust in algorithms and verifiable data sources. It's not like Google et al. are the pinnacle of altruism; there have been cases where the promoted results were faked copies of the actual site you want to visit, fooling less computer-savvy users into installing malware.

The trust is put into the code, same principle as reproducible builds. It doesn't matter where you get the source, as long as the checksum matches. This way the censor side of the problem is solved.

That leaves the spam, which isn't really solved by the big corporations either. Last time I used google I got 2-3 pages of the same auto-generated bullshit on every technical search term I tried. This could be fixed by having the main index limited to trusted sites at the expense of discovering new content. The latter can be handled by opt-in indexes. If the goal is to index everything users could have their own filters for sites they don't want.

If you really want to spice it up allow me to maintain my own query function (dangerous and potential exploit yes) that I send to the nodes and I can handle my own ranking.

There's nothing that makes a distributed index more unsafe than one run by Google. If every query picks 2 random nodes and compares the results I would trust that query more than current Google execs opinions of what I'm allowed to see.
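
A minimal sketch of that "pick 2 random nodes and compare" idea, under loud assumptions: each node is modelled as a plain function from query to result list (a hypothetical interface, not YaCy's), results are assumed deterministic for a shared index, and colluding nodes would still defeat the check.

```python
import random
from typing import Callable, Optional, Sequence

# Hypothetical node interface: takes a query, returns an ordered list of URLs.
Node = Callable[[str], list]


def cross_checked_search(
    query: str, nodes: Sequence[Node], k: int = 2, rng=random
) -> Optional[list]:
    """Ask k randomly chosen nodes and accept the answer only if all agree."""
    chosen = rng.sample(list(nodes), k)
    answers = [tuple(node(query)) for node in chosen]
    if all(answer == answers[0] for answer in answers):
        return list(answers[0])
    return None  # disagreement: retry with other nodes or flag the peers
```

A caller that gets `None` knows at least one of the sampled nodes returned something different, but not which one; telling them apart needs a third opinion or some reputation record.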

worksonmine

0 replies

5d3h

2024-03-06 14:45:39 UTC

That's exactly why blockchain is a good choice. You verify that whatever X sends matches what Y and Z would send before any reward is received. Based on the shared index every query should return the same results, kid's stuff really.

The monetization is a nut to crack yes, but Kagi works as a paywalled search engine. Otherwise just serve ads like all the rest already do? Tried and proven model, and in this solution they could be very transparent as there's no corporation behind trying to dupe the users for clicks to maximize profits. I even see the possibility for a hybrid model, don't like ads? Pay for the compute with your own coins.

The value comes from the network, trust and use-case. It doesn't have to be a new coin.

px43

0 replies

5d3h

2024-03-06 14:30:51 UTC

By ignoring cryptocurrencies, you've missed out on over 10 years of progress in this space. We have things like zero knowledge notaries and data availability sampling proofs. Actively Validated Services are also a thing. Service providers stake some asset, and interested parties can challenge them at certain intervals to ensure that they are properly performing their duties. Through the magic of Merkle trees, and soon Verkle trees (basically Merkle trees, but using vector commitments for super fast proofs), challengers can demand that service providers generate a proof that some data they hold matches some criteria. The nice thing about it is that because it's a zero knowledge proof, the challenger doesn't even need to know what that data is, and what they get back is a succinct proof that they can check very quickly, basically like checking an md5 sum for execution correctness.

It's cool shit, you should really look into it.
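
For readers who haven't met Merkle trees, here is a toy inclusion proof. It is a sketch only: it is not zero knowledge, it is not a Verkle tree, and the padding rule (duplicate the last hash on odd levels) and the function names are assumptions made for the example. It does show the property that matters here: the challenger holds just the root, and the proof is a handful of hashes.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_levels(leaves: list) -> list:
    """Return every level of the tree, leaf hashes first, root level last."""
    if not leaves:
        raise ValueError("need at least one leaf")
    level = [h(leaf) for leaf in leaves]
    levels = []
    while True:
        if len(level) > 1 and len(level) % 2:
            level = level + [level[-1]]  # pad odd levels by repeating the last hash
        levels.append(level)
        if len(level) == 1:
            return levels
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]


def prove(levels: list, index: int) -> list:
    """Collect (sibling_hash, sibling_is_left) pairs from leaf up to the root."""
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))
        index //= 2
    return proof


def verify(leaf: bytes, proof: list, root: bytes) -> bool:
    """Recompute the path to the root using only the leaf and the proof."""
    node = h(leaf)
    for sibling, sibling_is_left in proof:
        node = h(sibling + node) if sibling_is_left else h(node + sibling)
    return node == root
```

The proof length grows with the logarithm of the number of leaves, which is why a challenger can check it quickly without holding the data set.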

zoklet-enjoyer

2 replies

5d9h

2024-03-06 08:36:07 UTC

We have proof of stake now. The nodes could be run by the chain validators and they get a cut of the staking rewards. Look up how proof of stake works on the Cosmos chain. You could totally do this and I bet it would take off, at least in that section of the Internet that's into Cosmos/Tendermint chains. I'd use it

zoklet-enjoyer

0 replies

2024-03-06 18:16:12 UTC

Hahaha one downvote. I love to see it

ssijak

0 replies

5d9h

2024-03-06 08:38:50 UTC

I was definitely thinking of some kind of proof of stake, not proof of work.

mdaniel

0 replies

5d1h

2024-03-06 17:11:46 UTC

Nothing new under the sun, as they say: https://www.presearch.io/engine and just as you said I was unwilling to run a closed-source node binary

colinsane

0 replies

2024-03-06 18:19:58 UTC

if the situation is really "nobody will run this software unless i pay them to", then you're doomed regardless. there's nothing wrong with the classic route: package your software for the stores/distros you're familiar with, make your software as easy to package as humanly possible for anyone else who'll come around, document the hell out of it, submit it to the handful of top-level news feeds from which it'll percolate, and then wait. maybe you don't like waiting?

6510

0 replies

5d9h

2024-03-06 09:20:02 UTC

After that the issue becomes ranking. I should say "became", since LLMs could both rank pages and generate them on "demand" to fit the query.

YaCy has so many buttons I'm not even sure if it lacks it, but playing around with it, it is very cool to crawl large amounts of pages and serve requests until you want to do other things with the computer and the background process is too bloated. Something like the turtle mode torrent clients have would be useful.

Long ago there was a Chinese p2p client with a rootkit that would seed at 1 kb. I haven't used it but was told it worked remarkably well.

vGPU

6 replies

5d9h

2024-03-06 08:30:27 UTC

Has it gotten any better recently?

I run a node but I haven’t actually used it as a search engine in a while, as I found the result quality to be exceedingly poor.

rahen

3 replies

5d9h

2024-03-06 08:52:56 UTC

I remember trying it for a while in 2012, but the results were essentially worthless, probably because there were so few nodes/crawlers back then. I guess the more users there are, the better the results.

WarOnPrivacy

1 replies

5d4h

2024-03-06 14:13:49 UTC

I remember trying it for a while in 2012, but the results were essentially worthless,

I had mine crawling gov, mil, etc. sites for pages that Google was starting to delist back then. Inbound requests were heavy with porn until I tweaked - IDK, something.

Brian_K_White

0 replies

4d12h

2024-03-07 05:56:32 UTC

"until I tweaked - IDK, something."

omg so much this.

I got an instance going in a truenas core jail, freebsd and using freebsd java not a linux vm or linux abi compatibility. had to make my own rc script.

Then had to mess with the disk & ram settings to get it to run for more than a day. But the settings are not actually explained at all and whatever they do, they definitely don't do what their names and worthless tooltips say they do.

It seems to be running now indefinitely without killing either itself or the host, in full p2p mode, but I really have no idea why it's working, or really for sure if it actually is, fully. I changed "idk, something".

And I don't use it for search myself so far. Maybe some day but for now I'm paying for kagi.

I just like the idea and want it to be a thing, and it seemed a little less "invite a world of shit and attention onto my ip" than running say a tor exit or something. Maybe only a bit less but I'll see how it goes and react if I need to.

viraptor

0 replies

5d6h

2024-03-06 11:26:05 UTC

Alternatively, ignore the public network (it's still useless) and run it as your own crawler. Seed it with your browsing history, some aggregators like HN, your favourite RSS feeds, etc. and you'll be good.

Avamander

1 replies

5d4h

2024-03-06 14:22:30 UTC

No.

Either it picks up too much garbage if you allow any P2P data exchange (can't allow only outgoing AFAIK) or it kinda only knows about the sites you know about. Which kinda defeats the purpose.

Even assuming you just want a specific index for yourself of your own content, it struggles to display useful snippets about the results, which makes it really tedious to sift through the already poor results.

If you try to proactively blacklist garbage, which is incredibly tedious because there's no quick "delete from index and blocklist" button under index explorer, then you'll soon run into an unmanageable blocklist, the admin interface doesn't handle long lists well. At some point (around 160k blocked domains) Yacy just runs out of heap during startup trying to load it which makes the instance unusable.

It also can't really handle being reverse proxied (accessed securely by both the users and peers).

It also likes to completely deplete disk space or memory, so both have to be forcefully constrained. But that ends up with a nonfunctional instance you can't really manage. It also doesn't separate functionality enough that you could manually delete a corrupt index for example.

Running (z)grep on locally stored web archives works significantly better.

bobajeff

0 replies

4d23h

2024-03-06 18:36:59 UTC

Those are pretty bad issues. I remember using it a long time ago and only remember the results being bad. I've heard that Yacy could be good for searching sites you've already visited, but it sounds like even that might not be a good use case for it.

I do understand the taking up of disk space thing. It's hard to store the text of all your sites without it taking up a lot of space unless you can intelligently determine which text is unique and desired. Unless you are just crawling static pages it becomes hard to know what needs to be saved or updated.

b2bsaas00

6 replies

5d9h

2024-03-06 08:26:45 UTC

Could this be used for a Torrent search engine?

worksonmine

2 replies

5d9h

2024-03-06 08:57:33 UTC

Recently there was a distributed tracker on the front page. Probably more what you're looking for.

rakoo

0 replies

5d4h

2024-03-06 14:16:26 UTC

Note that it's not a distributed tracker, it's an indexer/tracker/search engine that uses distributed resources (the nodes in the dht)

BLKNSLVR

0 replies

5d5h

2024-03-06 12:49:38 UTC

Bit Magnet: https://bitmagnet.io

feverzsj

1 replies

5d8h

2024-03-06 10:03:53 UTC

btdig is still alive.

qingcharles

0 replies

5d1h

2024-03-06 16:37:32 UTC

btdig has the data, but its search is subpar :(

fddrdplktrew

0 replies

5d9h

2024-03-06 08:54:43 UTC

if it is not censored, probably?

renegat0x0

4 replies

5d4h

2024-03-06 14:23:07 UTC

There are already many projects about search:

- https://www.marginalia.nu/

- https://searchmysite.net/

- https://lucene.apache.org/

- elastic search

- https://presearch.com/

- https://stract.com/

- https://wiby.me/

I think that all these projects are fun. I would like to see one succeed at reaching a mainstream level of attention.

I have also been gathering link metadata for some time. Maybe I will use it to feed any eventual self-hosted search engine, or language model, if I decide to experiment with that.

- domains for seed https://github.com/rumca-js/Internet-Places-Database

- bookmarks seed https://github.com/rumca-js/RSS-Link-Database

- links for year https://github.com/rumca-js/RSS-Link-Database-2024

wongarsu

0 replies

2024-03-06 17:26:12 UTC

To be fair, of those only Apache Lucene predates YaCy. YaCy is very mature, but in terms of relative popularity for general web search probably peaked around 15 years ago.

legrande

0 replies

5d3h

2024-03-06 14:56:47 UTC

Also these:

https://swisscows.com/en

https://search.disconnect.me/

https://www.ecosia.org/

https://metager.org/

https://searx.space/

fsflover

0 replies

5d3h

2024-03-06 14:35:05 UTC

But which of those projects are distributed and FLOSS?

ColinHayhurst

0 replies

5d2h

2024-03-06 15:41:44 UTC

https://www.mojeek.com/ self-disclosure, mojeek team member

DrDroop

4 replies

5d10h

2024-03-06 07:56:03 UTC

I once went to a workshop on a Sunday morning at the local makerspace to listen to someone talk about some kind of distributed search engine or something like that. One of the developers came from (I think) Germany to explain this to us, the centralized sheeple. He just gave a demonstration of the thing, like here is the box you type stuff and here are the results. When I started to ask questions about how it worked and all, he sort of acted annoyed, saying it was all too difficult to explain. This was more than ten years ago, and yes, I am still angry about it.

ssijak

1 replies

5d10h

2024-03-06 07:58:51 UTC

At the core it was probably based on peer to peer distributed hash tables, so here you go, read the source https://pdos.csail.mit.edu/~petar/papers/maymounkov-kademlia...
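
For anyone who doesn't want to read the whole paper: the core trick in Kademlia is that node IDs and keys share one 160-bit space, and "closeness" is just XOR. A rough sketch follows; the helper names are made up, and nothing here claims to be how YaCy itself does it.

```python
import hashlib

ID_BITS = 160  # Kademlia IDs are 160 bits wide


def node_id(name: str) -> int:
    """Derive a 160-bit ID by hashing an arbitrary name with SHA-1."""
    return int.from_bytes(hashlib.sha1(name.encode()).digest(), "big")


def xor_distance(a: int, b: int) -> int:
    """Kademlia's metric: the XOR of two IDs, read as an integer."""
    return a ^ b


def closest_nodes(key: int, ids: list, k: int = 3) -> list:
    """Return the k known IDs closest to the key under the XOR metric."""
    return sorted(ids, key=lambda n: xor_distance(n, key))[:k]


def bucket_index(own: int, other: int) -> int:
    """Routing bucket for a peer: position of the highest differing bit (-1 if equal)."""
    return xor_distance(own, other).bit_length() - 1
```

A lookup repeatedly asks the closest nodes it knows for even closer ones, so it converges in a logarithmic number of hops.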

belter

0 replies

5d9h

2024-03-06 09:18:58 UTC

160 bits ought to be enough for anybody :-)

albert180

1 replies

5d9h

2024-03-06 08:31:13 UTC

It's probably him. YaCy is made by a German dude.

synctext

0 replies

5d9h

2024-03-06 09:05:34 UTC

Impressive 20 year project by one key developer.

See 20 year post in German by YaCy founder: https://community.searchlab.eu/t/yacy-vor-20-jahren/1543

buffalobuffalo

3 replies

5d2h

2024-03-06 16:09:09 UTC

I ran YaCy for a while, but not as a node on their distributed search index. I just ran it as a search engine for all my own bookmarks. Unfortunately I never found a particularly good way of getting bookmarks into the system. So eventually I shut it down. Cool idea in theory though.

justusthane

2 replies

4d15h

2024-03-07 02:55:30 UTC

I have a plan that I haven’t implemented yet, but I want to route all my outbound internet traffic through a Squid proxy, which will in turn add every visited URL to YaCy (except for domains I choose to exempt).

That way I’ll have a fully searchable index of every website I ever visit, which will hopefully solve the “Oh shit, what was that one website I found about X two months ago?”

A potentially easier thing to do would be create a bookmarklet that adds the current page to YaCy.

mdaniel

0 replies

3d22h

2024-03-07 20:15:36 UTC

relevant: https://github.com/ArchiveBox/ArchiveBox#readme and https://github.com/Rhizome-Conifer/conifer#readme (nee "webrecorder/webrecorder")

buffalobuffalo

0 replies

4d3h

2024-03-07 14:48:26 UTC

Yeah. Bookmark indexing was my original goal. But yacy doesn't have a great interface for that. Doable with some work, but not something i wanted to sink too much time into.

arboles

2 replies

5d6h

2024-03-06 11:46:05 UTC

Sort of hijacking the thread to ask, can YaCy or similar, be an alternative to Google's Programmable Search Engine? All I use it for is limit a search to a medium-sized list of domains. The aspect that makes running a search engine difficult on your own is lack of resources for crawling, I expect. But since I only care about a small list of domains, could I ditch Google's and run my own crawler like YaCy?

gtirloni

1 replies

5d6h

2024-03-06 12:13:24 UTC

Is that the deceased code search tool?

You could run Sourcegraph and import/sync those repositories.

Or you could run your own Elasticsearch/Meilisearch and crawl the websites yourself (if you're interested in things other than git repositories).

arboles

0 replies

5d5h

2024-03-06 12:31:50 UTC

Is that the deceased code search tool?

No, it's Programmable. Though it's not actually programmable. I should've written Custom Search Engine instead, that's also a name for it.

cse.google.com - It's quaint that past the modern landing page, when using the search portal today, you still get some outdated iteration of Google UI design.

It's used, for example, for making OSINT searches.[0] Or at some point by at least one Wikipedia editor for a custom list of Reliable Sources for Anime & Manga.[1]

[0] https://www.osintme.com/index.php/2020/09/28/

[1] https://gwern.net/me#wikis

anthk

2 replies

5d4h

2024-03-06 13:55:17 UTC

Ugh, Java. I'll wait for something like what i2pd does for I2P: something called yacyd, written in C, C++, or Go.

ravenstine

1 replies

5d1h

2024-03-06 16:29:47 UTC

What's your objection to Java?

anthk

0 replies

2024-03-06 18:11:54 UTC

High CPU and RAM usage.

WarOnPrivacy

2 replies

5d4h

2024-03-06 14:02:10 UTC

Yacy's still around. Nice.

After a year or two of hosting a Yacy instance (2014?) I started winding up on some general (probes, etc) blacklists.

I also host a small mail server and I was getting mail returned. I'd force an IP swap and a few weeks later it'd be the same. I had to let Yacy go.

1oooqooq

1 replies

5d2h

2024-03-06 15:43:33 UTC

So that is how they block a people's search/crawler. I didn't think they would use the most complicated method.

They also use block lists to add every single TOR node (even if not an exit) and every VPN under the sun (except for streaming, because why would they, that's why they exist).

WarOnPrivacy

0 replies

4d12h

2024-03-07 05:38:33 UTC

So that is how they block a people's search/crawler.

It seems less that was the intent and more they blacklist IPs with behavior they find annoying. They're super general lists.

They also use block lists to add every single TOR node

This annoyed the crap out of me. The stupid Dan list guy made it as easy as he could to lump low risk bridges in with high risk exit nodes.

jrussbowman

1 replies

2024-03-06 17:28:30 UTC

Nice to see search projects are still popping up. After a move, family life taking over and me getting more interested in Unreal Engine, my poor search engine is now more of an experiment in seeing how well it runs while basically on life-support maintenance updates I do. Starting to think I honestly should just take it down and save my $50 a month I spend maintaining it.

But I'll post it in a hacker news comment and maybe you all will give it enough traffic I can get excited about it again, lol

https://www.unscatter.com

jrussbowman

0 replies

2024-03-06 17:31:33 UTC

And for my immature moment of the day, the above comment was comment #69

gonesilent

1 replies

5d9h

2024-03-06 08:49:55 UTC

Infrasearch / Gonesilent was sold to Sun, turned into Project JXTA, and died.

mdaniel

0 replies

2024-03-06 17:56:23 UTC

While trying to read more about it, turns out there's an O'Reilly book, too: https://www.oreilly.com/library/view/jxta-in-a/059600236X/ch... and there's also this https://wiki.wireshark.org/JXTA (I'm guessing those specification links are in wayback but I didn't chase them)

fortran77

1 replies

5d1h

2024-03-06 16:55:00 UTC

Related to this, I’d love to see individuals making web pages again, and federated search engines indexing them. People don’t make their own hobby or fan or art websites anymore, and I think that’s partly because nobody will ever find them with the big search engines.

emrah

0 replies

5d1h

2024-03-06 17:00:14 UTC

I think it would be nice if the search results were "distributed" rather than deterministic.

So when I enter the same keywords, and let's say there are 50 pages that would each be an equally good result for the search, rather than one page "winning", the search engine would alternate the winner among the many possibilities.
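A minimal sketch of that idea, under assumptions that are not from the comment: results arrive as `(url, score)` pairs, "equivalently good" means within a hypothetical `tolerance` of the best score, and the function name `rotate_top_results` is made up for illustration. Only the near-tied group is shuffled; everything below it keeps its score order.

```python
import random


def rotate_top_results(scored_pages, tolerance=0.05, rng=None):
    """Shuffle the pages whose score is within `tolerance` of the best score.

    scored_pages: list of (url, score) pairs, higher score is better.
    tolerance:    hypothetical cutoff for "equivalently good" (an assumption).
    rng:          optional random.Random instance, useful for repeatable tests.
    """
    rng = rng or random.Random()
    # Deterministic ranking first: best score at the front.
    ranked = sorted(scored_pages, key=lambda page: page[1], reverse=True)
    if not ranked:
        return []
    best_score = ranked[0][1]
    # Because the list is sorted, the near-tied pages form a prefix.
    tied = [page for page in ranked if best_score - page[1] <= tolerance]
    rest = ranked[len(tied):]
    # Randomize only within the tie band so the "winner" alternates.
    rng.shuffle(tied)
    return tied + rest


pages = [("a", 1.0), ("b", 0.99), ("c", 0.98), ("d", 0.5)]
print(rotate_top_results(pages))
```

Repeated calls put a different one of the three near-tied pages first, while the clearly weaker page stays last.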

charcircuit

1 replies

5d9h

2024-03-06 09:11:27 UTC

Are the results still being gamed by sites using content keyword stuffing? The last time I used it the searching and ranking technology felt like they were 40 years behind state of the art.

liotier

0 replies

5d4h

2024-03-06 13:37:39 UTC

In distributed indexing, spam management seems a much bigger problem than the indexing itself.

boyter

1 replies

5d9h

2024-03-06 09:18:41 UTC

I actually half wrote an RFC-style spec and 2 implementations of a federated search last year, rather than doing the distributed hash table that YaCy does.

I wanted results to be re-rankable by the peers by sharing the scores that went into them. The idea being with a common protocol based on the ideas of ActivityPub you could get peers of searches working together to hopefully surface interesting things.
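Roughly what re-ranking from shared scores could look like, as a sketch only and not the actual spec: the result shape, the signal names (`text_match`, `freshness`) and the function `rerank` are all hypothetical. Each peer returns the component scores behind a result, and the receiving peer combines them with its own weights.

```python
def rerank(results, weights):
    """Re-rank results from remote peers using locally chosen weights.

    results: list of dicts such as
             {"url": "...", "scores": {"text_match": 0.8, "freshness": 0.2}}
             (a hypothetical wire format, not taken from any real protocol).
    weights: local weight per score name; unknown signals count as 0.
    """
    def local_score(result):
        # Weighted sum of whichever shared signals this peer cares about.
        return sum(
            weights.get(name, 0.0) * value
            for name, value in result["scores"].items()
        )

    return sorted(results, key=local_score, reverse=True)


shared = [
    {"url": "x", "scores": {"text_match": 0.9, "freshness": 0.1}},
    {"url": "y", "scores": {"text_match": 0.5, "freshness": 0.9}},
]
# Two peers with different priorities order the same results differently.
print([r["url"] for r in rerank(shared, {"text_match": 1.0})])
print([r["url"] for r in rerank(shared, {"freshness": 1.0})])
```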

Something I should probably finish and publish at some point. It worked up to the hundreds of peers I tested with.

The reason I mention this is because I also wanted to add a front into yacy, which turned out to be harder than I expected. It’s a wonderful project and you can find great stuff through it, but with the way the peers return results it’s sometimes hard to find it again. It’s also not quite as hackable as I would have hoped at the time, probably due to the project's age.

I still think there is value in it though, and I’d love to see yacy have its protocol explained as a spec so people could build implementations in other languages more easily.

detourdog

0 replies

5d1h

2024-03-06 16:40:26 UTC

I remember the first days of gopher browsing were like that. Gopher browsing to me was like swinging from vine to vine. The trick was remembering/documenting where each vine went.

treprinum

0 replies

4d23h

2024-03-06 18:58:57 UTC

Is it worth dedicating 1-2 low power NUCs (4-8 core) to this on a 250MBit/s connection? Or does it need beefier CPUs/network?

rasulkireev

0 replies

5d10h

2024-03-06 08:21:09 UTC

Love it. Super easy to self host and use. Now I have a personal Google!

nairboon

0 replies

4d22h

2024-03-06 20:18:48 UTC

If you run YaCy with docker and it is still a junior peer, does the search return results from the global index or just the one that appears to be 'preinstalled'?

maxloh

0 replies

5d9h

2024-03-06 08:26:21 UTC

See also: Presearch, another decentralized search engine, which claims that it will be open source. No source code available at the moment though.

https://presearch.com/

fho

0 replies

2024-03-06 17:59:01 UTC

I've used it several times over the last decades and never got good results. I think one instance is still running on my old computer at uni :-)

dredmorbius

0 replies

2024-03-06 18:10:19 UTC

Previously:

YaCy – your own search engine | https://news.ycombinator.com/item?id=32597309 | 2 years ago | 93 comments

YaCy: Decentralized Web Search | https://news.ycombinator.com/item?id=22246732 | 4 years ago | 41 comments

YaCy – The Peer to Peer Search Engine | https://news.ycombinator.com/item?id=17089240 | 6 years ago | 3 comments

YaCy: a free distributed search engine | https://news.ycombinator.com/item?id=12433010 | 8 years ago | 24 comments

YaCy: Decentralized Web Search | https://news.ycombinator.com/item?id=8746883 | 9 years ago | 29 comments

YaCy takes on Google with open source search engine | https://news.ycombinator.com/item?id=3288586 | 12 years ago | 17 comments

RGBCube

0 replies

5d9h

2024-03-06 08:49:22 UTC

    curl failed to verify the legitimacy of the server and therefore could not
    establish a secure connection to it. To learn more about this situation and
    how to fix it, please visit the web page mentioned above.

Can't seem to access the page.