
Vector indexing all of Wikipedia on a laptop

gfourfour
46 replies
1d3h

Maybe I’m missing something but I’ve created vector embeddings for all of English Wikipedia about a dozen times and it costs maybe $10 of compute on Colab, not $5000

bunderbunder
19 replies
1d3h

This is covering 300+ languages, not just English, and it's specifically using Cohere's Embed v3 embeddings, which are provided as a service and currently priced at US$0.10 per million tokens [1]. I assume if you're running on Colab you're using an open model, and possibly a relatively lighter weight one as well?

[1]: https://cohere.com/pricing

traverseda
11 replies
1d3h

This is pretty early in the game to be relying on proprietary embeddings, don't you think? Even if they are 20% better, blink and there will be a new normal.

It's insane to me that someone, this early in the gold rush, would be mining in someone else's mine, so to speak

janalsncm
8 replies
1d1h

It’s not just that. Embeddings aren’t magic. If you’re going to be creating embeddings for similarity search, the first thing you need to ask yourself is what makes two vectors similar such that two embeddings should even be close together?

There are a lot of related notions of similarity, but they're slightly different. And I have no idea what Cohere is doing. Additionally, it's not clear to me how queries can and should be embedded. Queries are typically much shorter than their associated documents, so they typically need to be trained jointly.

Selling “embeddings as a service” is a bit like selling hashing as a service. There are a lot of different hash functions. Cryptographic hashes, locality sensitive hashes, hashes for checksum, etc.

pradn
3 replies
2h37m

I'm with you on this. The vector embedding craze seems to be confusing mechanism and problem. The problem is semantic similarity search. One mechanism is vector embedding. I think all this comes from taking LLMs as a given, seeing that they work reasonably well with phrase-input semantic retrieval, and then hyper-optimizing vector embedding / search to achieve it.

Are there other semantic search systems? What happened to the entire field of Information Retrieval - is vector search the only method? Are all the stemming, linguistic analysis, all that - all obsoleted by vectors?

Or is it purely because vector search is quick? That's just an engineering problem. I'm not convinced it's the only method here. Happy to be corrected!

bunderbunder
1 replies
1h31m

The entire field of information retrieval is still here. This was touched on by the O'Reilly article on lessons learned working with LLMs that hit the HN front page yesterday [1], in their section on RAG.

My sense is that you can currently break the whole thing down into two groups: the proverbial grownups in the room are typically building pipelines that are still doing it basically how the top-performing systems did in the '90s, with a souped up keyword and metadata search engine for the initial pass and an embedding model for catching some stuff it misses and/or result ranking. This isn't how most general-purpose search engines work, but it's likely how the ones you don't particularly mind using work. Web search, for example.
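In code terms, that two-stage shape is roughly the following. This is just a minimal sketch, not anyone's production pipeline: the corpus, model name, and cutoffs are placeholders, and a real system would use an actual search engine (Lucene and friends) rather than in-memory BM25.

  import numpy as np
  from rank_bm25 import BM25Okapi
  from sentence_transformers import SentenceTransformer

  docs = ["first document text", "second document text"]  # placeholder corpus
  bm25 = BM25Okapi([d.lower().split() for d in docs])
  model = SentenceTransformer("all-MiniLM-L6-v2")          # any embedding model
  doc_vecs = model.encode(docs, normalize_embeddings=True)

  def search(query, k_candidates=100, k_final=10):
      # First pass: cheap keyword scoring over the whole corpus.
      keyword_scores = bm25.get_scores(query.lower().split())
      candidates = np.argsort(keyword_scores)[::-1][:k_candidates]
      # Second pass: re-rank the candidates by embedding similarity.
      q = model.encode([query], normalize_embeddings=True)[0]
      sims = doc_vecs[candidates] @ q
      return candidates[np.argsort(sims)[::-1]][:k_final].tolist()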

And then there's the proverbial internet comments section, which wants to skip past all the boring labor-intensive oldschool stuff, and instead just begin and end with approximate nearest neighbors search using an off-the-shelf embedding model. The primary advantage to this approach - and I should admit here that I've tried it myself - is that you can bodge it together over a weekend and have the blog post up by Monday.

I guess what I'm getting at is, the people producing content on the Internet and the people producing effective software aren't necessarily the same people. I mean, heck, look at me, I'm only here to type this comment because I'm slacking off at work today.

1: https://www.oreilly.com/radar/what-we-learned-from-a-year-of...

pradn
0 replies
1h16m

Your comment makes a lot of sense to me.

What I wonder, though, is this: we're a year and a half into the LLM craze and we still don't see a really good information processing system for them. Yes, there are chatbots, some that let you throw in images and PDFs.

But what we need is more like a ground-up rethink of these UIs. We need to invent the "desktop" of LLMs.

But the keys here, I think, are that

a) the LLMs are only part of the solution. A chat interface is immature and not enough.

b) external information is brought in by the user, and augmented by a universe of knowledge given by the provider

c) being overly general is probably a trap. Yes, LLMs can talk about everything - but why not solve a concrete vertical?

Semantic search helps with a part of this, but is just one component.

janalsncm
0 replies
39m

Are there other semantic search systems?

Not a semantic search but stemming + BM25 often works surprisingly well and is a fast and cheap baseline.
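A rough sketch of that baseline, assuming rank_bm25 and NLTK's Porter stemmer (the corpus here is a placeholder; stemming the tokens is the only extra step over plain BM25):

  from nltk.stem import PorterStemmer
  from rank_bm25 import BM25Okapi

  stemmer = PorterStemmer()

  def tokenize(text):
      # Lowercase, split on whitespace, stem; deliberately dumb, but fine for a baseline.
      return [stemmer.stem(tok) for tok in text.lower().split()]

  corpus = ["the dog chased the cat", "vector databases index embeddings"]  # placeholder docs
  bm25 = BM25Okapi([tokenize(doc) for doc in corpus])
  scores = bm25.get_scores(tokenize("dogs chasing cats"))  # higher score = better match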

vineyardmike
1 replies
18h14m

If you’re going to be creating embeddings for similarity search, the first thing you need to ask yourself is what makes two vectors similar such that two embeddings should even be close together?

I have no association with Cohere, but their docs clearly say that their embeddings were trained so that two similar vectors have similar "semantic meaning". Which is still pretty vague, but it's at least clear what their goals were.

Selling “embeddings as a service” is a bit like selling hashing as a service.

Coincidentally, Cohere also aggressively advertises that they want you to fine-tune and co-develop custom models (with their proprietary services).

nostrebored
0 replies
17h37m

But this is the GP's point — that doesn't mean they're optimized for retrieval.

riku_iki
0 replies
5h18m

There are a lot of different hash functions. Cryptographic hashes, locality sensitive hashes, hashes for checksum, etc.

And there are some standard hash functions in the standard library which cover 98% of use cases. I think the same is true of embeddings: you can train a foundational multitask model, and its embeddings will work for a variety of tasks too.

bunderbunder
0 replies
1d

You probably won't find out exactly what they're doing any more than anyone's going to find out a whole lot of details on what OpenAI is doing with GPT. But, as the popularity of GPT demonstrates, it seems that many business customers are now comfortable embracing closed models.

There is more information here, though: https://cohere.com/blog/introducing-embed-v3

bunderbunder
1 replies
1d2h

I have no idea. But that wasn't the question I was answering. It was, "how does the article's author estimate that would cost $5000?" And I think that's how. Or at least, that gets to a number that's in the same ballpark as what the author was suggesting.

That said, first guess, if you do want to evaluate Cohere embeddings for a commercial application, using this dataset could be a decent basis for a lower-cost spike.

jbellis
0 replies
1d2h

Yes, that is how I came up with that number.

janalsncm
2 replies
22h56m

I also don’t quite understand the value of embedding all languages into the same database. If I search for “dog” do I really need to see the same article 300 times?

As a first step they are using PQ anyways. It seems natural to just assume all English docs have the same centroid and search that subspace with hnswlib.
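Something in this direction for the single-language subspace, as a sketch with made-up shapes and untuned hnswlib parameters:

  import numpy as np
  import hnswlib

  dim = 1024
  en_vectors = np.float32(np.random.rand(100_000, dim))  # stand-in for the English-only embeddings
  en_ids = np.arange(len(en_vectors))

  index = hnswlib.Index(space="cosine", dim=dim)
  index.init_index(max_elements=len(en_vectors), ef_construction=200, M=16)
  index.add_items(en_vectors, en_ids)
  index.set_ef(64)  # query-time recall/speed knob

  query = np.float32(np.random.rand(1, dim))
  labels, distances = index.knn_query(query, k=10)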

jbellis
1 replies
22h54m

It's split by language. TFA builds an index on the English language subset.

janalsncm
0 replies
22h52m

Ah, missed that.

gfourfour
2 replies
1d2h

Ah, didn't realize it was every language. Yes, I'm using a lightweight open model - but also my use case doesn't require anything super heavyweight. Wikipedia articles are very feature-dense and differentiable from one another. It doesn't require a massive feature vector to create meaningful embeddings.

tjake
1 replies
1d2h

It's 35M 1024-dimensional vectors, plus the text.

gfourfour
0 replies
1d

Still, $5000 is kind of insane. That makes no sense to me.

lyu07282
0 replies
9h58m

hey, 3 cents cheaper than text-embedding-3-large (without batching)!

Are there some benchmarks available that compare it with the openai model?

emmelaich
13 replies
1d3h

Got any details?

gfourfour
12 replies
1d3h

Nothing too crazy: just downloading a dump, splitting it into manageable batches, and using a lightweight embedding model to vectorize each article. Using the best GPU available on Colab it takes maybe 8 hours if I remember correctly? Vectors can be saved as NPY files and loaded into something like FAISS for fast querying.
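In outline it's something like this; a sketch only, with the model name, batch size, and file names as placeholders rather than what I actually used:

  import numpy as np
  import faiss
  from sentence_transformers import SentenceTransformer

  model = SentenceTransformer("all-MiniLM-L6-v2")   # some lightweight embedding model

  articles = ["plain text of article one", "plain text of article two"]  # from the parsed dump
  vectors = model.encode(articles, batch_size=256, normalize_embeddings=True,
                         convert_to_numpy=True, show_progress_bar=True)
  np.save("wiki_vectors.npy", vectors)

  # Later: load the vectors and build a FAISS index for querying.
  vectors = np.load("wiki_vectors.npy")
  index = faiss.IndexFlatIP(vectors.shape[1])       # inner product == cosine on normalized vectors
  index.add(vectors)
  query_vec = model.encode(["some query"], normalize_embeddings=True)
  scores, ids = index.search(query_vec, 10)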

thomasfromcdnjs
3 replies
1d

Did you chunk the articles? If so, in what way?

gfourfour
2 replies
1d

Yes. I split the text into sentences and append sentences to a chunk until the max context window is reached. The context window size is dynamic for each article so that each chunk is roughly the same size. Then I just do a mean pool of the chunks for each article.
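Roughly like this, as a simplified sketch: NLTK for sentence splitting and a whitespace word count as the token budget, whereas in practice I size the chunks dynamically per article:

  from nltk.tokenize import sent_tokenize   # needs nltk.download("punkt") once

  def chunk_article(text, max_tokens=256):
      chunks, current, count = [], [], 0
      for sent in sent_tokenize(text):
          n = len(sent.split())              # crude token estimate
          if current and count + n > max_tokens:
              chunks.append(" ".join(current))
              current, count = [], 0
          current.append(sent)
          count += n
      if current:
          chunks.append(" ".join(current))
      return chunks

  def embed_article(text, model):
      # model is any sentence-transformers-style encoder
      chunk_vecs = model.encode(chunk_article(text), normalize_embeddings=True)
      return chunk_vecs.mean(axis=0)         # mean pool the chunk vectors into one article vector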

thomasfromcdnjs
1 replies
23h56m

Thanks for the answer.

Wikipedia has a lot of tables so I was wondering if content-aware sentence chunking would be good enough for Wikipedia.

https://www.pinecone.io/learn/chunking-strategies/

gfourfour
0 replies
22h10m

mwparserfromhell can parse the text content without including tables
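For reference, the relevant bit is just this (assuming raw wikitext from the dump as input; worth double-checking how strip_code handles tables for your own use case):

  import mwparserfromhell

  raw_wikitext = open("some_article.wikitext").read()  # hypothetical input from the dump
  wikicode = mwparserfromhell.parse(raw_wikitext)
  plain_text = wikicode.strip_code()                   # drops templates and markup, keeps the prose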

abetusk
3 replies
1d2h

This probably deserves its own article and might be of interest to the HN community.

gfourfour
2 replies
1d

I will probably make a post when I launch my app! For now I’m trying to figure out how I can host the whole system for cheap because I don’t anticipate generating much revenue

solarengineer
1 replies
16h14m

I'm interested in hearing about what you will be hosting.

Would Digital Ocean or Hetzner meet your needs?

gfourfour
0 replies
3h1m

I was going to use EC2 and S3; should I look at Digital Ocean or Hetzner instead?

riku_iki
1 replies
1d1h

What is the end task (e.g. RAG, or just vector search for question answering), and are you satisfied with the results in terms of quality?

gfourfour
0 replies
1d

The end result is a recommendation algorithm, so basically just vector similarity search (with a bunch of other logic too ofc). The quality is great, and if anything a little bit of underfitting is desirable to avoid the “we see you bought a toilet seat, here’s 50 other toilet seats you might like” effect.

j0hnyl
1 replies
1d3h

How big is the resulting vector data?

gfourfour
0 replies
1d2h

Like 8 GB roughly

hivacruz
6 replies
1d3h

Did you use the same method, i.e., split each article into chunks and vectorize each chunk?

dudus
4 replies
1d3h

That's the only way to do it. You can't index the whole thing. The challenge is chunking. There are several different algorithms to chunk content for vectorization with different pros and cons.

minimaxir
3 replies
1d2h

You can do much bigger chunks with models that support RoPE embeddings, such as nomic-embed-text-1.5, which has an 8192-token context length: https://huggingface.co/nomic-ai/nomic-embed-text-v1.5

In theory this would be an efficiency boost but the performance math can be tricky.
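If anyone wants to try it, usage is roughly this, as I understand the model card (the search_document/search_query prefixes are that model's convention; double-check the details there):

  from sentence_transformers import SentenceTransformer

  model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
  model.max_seq_length = 8192  # make sure the full context window is actually used

  long_doc = open("long_article.txt").read()  # placeholder document
  doc_vec = model.encode(["search_document: " + long_doc], normalize_embeddings=True)
  query_vec = model.encode(["search_query: history of vector databases"],
                           normalize_embeddings=True)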

rahimnathwani
1 replies
1d1h

Why would you want chunks that big for vector search? Wouldn't there be too much information in each chunk, making it harder to match a query to a concept within the chunk?

nostrebored
0 replies
17h35m

The problem is that often semantic meaning depends on state multiple paragraphs or sections away.

This is a coarse way to tackle that

qudat
0 replies
1d1h

As far as I understand it, long context degrades LLM performance, so just because an LLM "supports" a large context length doesn't mean it uses it well; it basically clips a top and bottom chunk and skips over the middle bits.

gfourfour
0 replies
1d3h

Yes

janalsncm
1 replies
1d1h

Also, if you’re spending $5000 to compute embeddings, why are you indexing them on a laptop?

bytearray
1 replies
23h25m

Do you have a link to the notebook?

gfourfour
0 replies
20h57m

No haha, just a rat's nest of a bunch of notebooks

DataDaemon
0 replies
1d3h

How?

traverseda
28 replies
1d3h

Disable swap before building the index. Linux will aggressively try to cache the index being constructed to the point of swapping out parts of the JVM heap, which is obviously counterproductive. In my test, building with swap enabled was almost twice as slow as with it off.

This is an indication to me that something has gone very wrong in your code base.

nisa
6 replies
1d2h

This is an indication to me that something has gone very wrong in your code base.

I'm not sure on what planet all of these people here live that they have success with Linux swap. It's been broken for me forever and the first thing I do is disable it everywhere.

funcDropShadow
3 replies
1d1h

On a planet where the heap size of the JVM is properly set.

hot_gril
2 replies
1d1h

I don't use Java often, but when I do, I have to look up again how the heap settings work.

taspeotis
1 replies
1d

It’s just -xXxXxX360NoScopexXxXxHeapSize=69G

Izkata
0 replies
23h32m

There should be at least one : in there.

(IIRC these options are used like -XX:foo=bar)

(Edit - no, but some do use : instead of = I guess)

zero_k
0 replies
1d2h

100% agree

nolist_policy
0 replies
23h44m

Linux swap has been fixed on Chromebooks for years thanks to MGLRU. It's upstream now and you can try it with a recent enough kernel with

  echo y >/sys/kernel/mm/lru_gen/enabled

dekhn
4 replies
1d3h

As a workaround, you can mlock your process, which should prevent the application's pages from being swapped out.
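Illustrative sketch of the idea, in Python via ctypes on Linux (the article's index builder is a JVM process, so there it would go through JNA or Unsafe instead):

  import ctypes

  libc = ctypes.CDLL("libc.so.6", use_errno=True)
  MCL_CURRENT, MCL_FUTURE = 1, 2  # values from <sys/mman.h> on Linux

  # Pin all current and future pages of this process in RAM.
  if libc.mlockall(MCL_CURRENT | MCL_FUTURE) != 0:
      err = ctypes.get_errno()
      raise OSError(err, "mlockall failed (may need CAP_IPC_LOCK or a higher ulimit -l)")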

jbellis
3 replies
1d2h

FWIW this is what Cassandra does on startup, but it didn't seem worth it to go to the trouble of dealing with Unsafe for a demo project.

nextaccountic
2 replies
1d1h

Can't mlock be wrapped in a safe API?

jbellis
1 replies
1d1h

Probably, but I'm not aware of any Java library that provides that functionality.

AnotherGoodName
4 replies
1d1h

Memory managed languages won’t do long term garbage collection till memory is nearly full.

I suspect they just need to pass in -Xmx options to the JVM to avoid this.

menacingly
1 replies
1d

I actually wish this were true, because of more predictable GC pauses

threeseed
0 replies
23h6m

With G1GC there is a setting called MaxGCPauseMillis which can give you predictability.

threeseed
0 replies
23h7m

Not sure what long term garbage collection means. Because the JVM will GC all objects at any point if they are unused.

If you're referring to full GC you can configure how often that happens and by default it doesn't just wait until memory is nearly full.

marginalia_nu
0 replies
1d1h

Even if this weren't factually incorrect, the default max heap size on the JVM is 25% of system RAM.

nsingh2
3 replies
1d3h

Yeah, this was strange; I've only ever seen swapping when my main memory was nearly full. Maybe they're storing all embeddings in memory?

jbellis
2 replies
1d2h

I routinely see Linux page memory out to swap while having 10+GB free. I can only guess that it really really really likes to cache recently used data from disk.

hot_gril
1 replies
1d1h

I've seen the same. It will sometimes prefer to swap rather than evict the disk cache.

I don't know how Linux does this in particular, but intuitively swapping can make sense if part of your allocated RAM isn't being accessed often and the disk is. The kernel isn't going to know for sure of course, and seems in my case it guessed wrong.

kevincox
0 replies
22h23m

At a very abstract level there isn't much difference between dropping a disk cache and writing anonymous memory to swap. Obviously the actual step of writing to swap is more expensive but once the data is out of memory they will both require a read to get the data back. So if you imagine that you are occasionally reading a file and an anonymous page you can thrash on either if your system doesn't have enough memory. After the initial write to swap it is effectively identical to keep reading either back in.

It is much better to swap out anonymous memory that isn't being used than to flush file data that is being used. Or, another way of looking at it: if you run out of memory, you will thrash with or without swap enabled. The difference is that with swap the kernel can evict the most rarely accessed memory, whether file-backed or anonymous. Without swap the kernel is forced to only consider file-backed memory, even if it is being used more frequently than the anonymous memory.

spullara
1 replies
1d1h

All the major web companies disable swap so no work has been done to optimize it.

hot_gril
0 replies
1d1h

Yeah, in most situations I'd rather kill processes with excessive mem usage (possibly due to memleak) than have the machine grind to a halt by swapping. Sometimes my headless lab machines would become practically inaccessible over SSH if I didn't swapoff before running something that accidentally chews up mem.

I'll let my personal laptop swap, though. Especially if my wife is also logged in and has tons of idle stuff open.

jbellis
1 replies
1d2h

[article author]

TBH this was sloppy on my part. I tested multiple runs of the index build and early on kswapd was super busy. I assumed Linux was just caching recently read parts of the source dataset, but it's also possible it was something external to the index build since it's my daily driver machine. After I turned off swap I had no issues and didn't look into it harder.

hot_gril
1 replies
1d2h

Linux swap has some fuzzy logic that I never fully understood. There have been times I've disabled it because it wasn't doing what I wanted.

lynx23
0 replies
1d2h

I almost always reduce swappiness on a new install, the default of (60?) never served me well.

lionkor
0 replies
15h26m

Could just turn down swappiness?!

Atotalnoob
6 replies
20h9m

Why is the author listing himself as DataStax CTO?

He isn't, according to Wikipedia, my friend who works there, and their company website. https://www.datastax.com/our-people

That’s kind of weird

metadat
2 replies
20h3m

What are you talking about? The datastax site lists it:

> SANTA CLARA, Calif. – September 28, 2020 – DataStax today announced that DataStax Co-Founder and CTO Jonathan Ellis will deliver a keynote address at ApacheCon @Home 2020

https://www.datastax.com/press-release/datastax-co-founder-a....

As an aside, I'm an ApacheCon presenter but there was no press release about the hot excitement of my involvement. Maybe next time :)

Atotalnoob
1 replies
19h58m

That’s from 2024. They aren’t the cto of datastax currently

metadat
0 replies
19h55m

Maybe it's the highest rank they achieved and still trying to capitalize on it. So what? Should they now say "Unemployed"?

There has to be more interesting things to discuss than this.

P.s. I think you meant 2020.

Atotalnoob
0 replies
19h59m

They were definitely a founder, but they are not the current cto

jbellis
0 replies
19h36m

I guess I'm kind of a CTO emeritus now -- I mostly write code, by choice. https://github.com/jbellis

HammadB
5 replies
21h48m

"The obstacle is that until now, off-the-shelf vector databases could not index a dataset larger than memory, because both the full-resolution vectors and the index (edge list) needed to be kept in memory during index construction. Larger datasets could be split into segments, but this means that at query time they need to search each segment separately, then combine the results, turning an O(log N) search per segment into O(N) overall."

How is a log N search over S segments O(N)?

jbellis
3 replies
19h57m

I was trying to make the point that the dominant factor becomes linear instead of logarithmic, but more accurately it's O(S log N) = O(N log N) because S is proportional to N.

HammadB
2 replies
15h57m

Sure, I see. I think this is an area where complexity analysis doesn’t lead to useful information.

To be more correct, it's O(N/C log C), where C is the capacity of a segment. In this case you can ignore 1/C and log C as constants. So now, sure, you actually just have O(N). But this is not super useful, as it says that a segmented HNSW approach and a brute-force approach are the same, when this is really not the case in practice.

Also O(N log N) > O(N) so I’m not sure why we would ever do anything with segmentation according to that analysis if it were correct.

jbellis
1 replies
12h58m

I’m not sure why we would ever do anything with segmentation according to that analysis if it were correct.

What's your alternative when you can't build an index larger than C?

HammadB
0 replies
1h49m

If segmented HNSW indices were O(N log N), it would make no sense to build the index at all; brute force would be better, as O(N log N) > O(N)

thfuran
0 replies
21h39m

Doesn't doubling N double S?

xandrius
2 replies
23h58m

Internal server error :(

jbellis
0 replies
23h5m

Oops, HN maxed out the free Cohere API key it was using. Fixed.

bytearray
0 replies
23h22m

Same.

noufalibrahim
4 replies
14h3m

What are the good solutions in this space? Vector databases I mean. Mostly for semantic search across various texts.

I have a few projects I'd like to work on. For typical web projects, I have a "go to" stack and I'd like to add something sensible for vector based search to that.

riku_iki
0 replies
3h41m

Could you say how scalable JVector is? How many vectors can it handle: millions, billions, hundreds of billions?

noufalibrahim
0 replies
10h32m

Nice. I wanted to try something out on a machine before moving to hosted solutions.

StrauXX
0 replies
14h0m

In my experience it's usually easiest to use a vector store extension for an off-the-shelf database like Postgres (pgvector is nice). That way you don't have to manage another, rapidly changing, service, and you can easily combine queries on the vectors with regular columns, join them, and so on.
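As a sketch of what that looks like, assuming psycopg2 and the pgvector extension are already installed (table and column names are made up; the HNSW index type needs pgvector 0.5+):

  import numpy as np
  import psycopg2

  conn = psycopg2.connect("dbname=wiki")
  cur = conn.cursor()
  cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
  cur.execute("""
      CREATE TABLE IF NOT EXISTS articles (
          id bigserial PRIMARY KEY,
          title text,
          lang text,
          embedding vector(1024)
      )
  """)
  cur.execute("CREATE INDEX IF NOT EXISTS articles_embedding_idx "
              "ON articles USING hnsw (embedding vector_cosine_ops)")
  conn.commit()

  # Vector search combined with an ordinary relational filter.
  query = np.random.rand(1024).astype("float32")             # placeholder query embedding
  vec_literal = "[" + ",".join(f"{x:.6f}" for x in query) + "]"
  cur.execute("SELECT id, title FROM articles WHERE lang = %s "
              "ORDER BY embedding <=> %s::vector LIMIT 10",
              ("en", vec_literal))
  rows = cur.fetchall()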

m3kw9
4 replies
20h50m

He should have asked HN about the cheapest way to embed Wikipedia before starting

jbellis
3 replies
19h55m

I'm baffled that so many people fixate on the estimated cost and miss the fact that it's a public dataset. As in, free.

m3kw9
1 replies
18h23m

Getting the embeddings ain’t free

jbellis
0 replies
12h57m

Unless you're counting your network access cost, the dataset of embeddings is in fact free and TFA includes instructions on how to download them for free.

syllogistic
0 replies
19h27m

It's in the first sentence of the article too :)

hot_gril
4 replies
1d2h

How many dimensions are in the original vectors? Something in the millions?

jbellis
3 replies
1d2h

1024 per vector x 41M vectors

hot_gril
2 replies
1d1h

1024-dim vectors would fit into pgvector in Postgres, which can do cosine similarity indexing and doesn't require everything to fit into memory. Wonder how the performance of that would compare to this.

jbellis
1 replies
1d1h

It's been a while since I read the source to pgvector but at the time it was a straightforward HNSW implementation that implicitly assumes your index fits in page cache. Once that's not true, your search performance will fall off a cliff. Which in turn means your insert performance also hits a wall since each new vector requires a search.

I haven't seen any news that indicates this has changed, but by all means give it a try!

hot_gril
0 replies
1d1h

Thanks, I didn't know that. Last time I was dealing with these kinds of problems, pgvector didn't exist yet.

lilatree
3 replies
22h53m

“… turning an O(log N) search per segment into O(N) overall.”

Can someone explain why?

StrangeDoctor
2 replies
21h41m

When it’s all in memory you get to amortize the cost of the initial load. Or just pay it when it’s not part of the hot path. When it’s segmented, you’re doing that because memory is full and you need to read in all the segments you don’t have. That’ll completely overwhelm the log n of the search you still get

jbellis
1 replies
19h49m

I was trying to make the point that the dominant factor becomes linear instead of logarithmic, but more accurately it's O(S log N) = O(N log N) because S (number of segments) is proportional to N (number of vectors).

StrangeDoctor
0 replies
8h19m

Ah yeah that’s what I wanted to write but I guess I didn’t want to put words in your mouth, and stuck to what I could be certain about happening. We do all this work to throw away the unneeded bits in one situation and when comparing it to a slightly different situation go “huh some of that garbage would be kinda nice here”

isoprophlex
3 replies
1d2h

$5000?! I indexed all of HN for ... $50 I think. And that's tens of millions of posts.

criddell
1 replies
2h56m

How are you using that index?

isoprophlex
0 replies
2h36m

https://www.searchhacker.news/

A tool that (hopefully) surfaces interesting HN discussion threads; I wanted an excuse to investigate (hybrid) full text and vector search at a substantial scale beyond toy datasets.

Sadly (well, not really) I changed jobs soon after building the first version. Life caught up and I never got around to adding more features and polishing up the frontend (e.g. the broken back button).

Ideas for new features are very welcome :)

xandrius
0 replies
23h59m

To be fair, Wikipedia has over 60 million pages and this is for 300+ languages. But yeah, that figure suggests they might not be using the cheapest service out there.

peter_l_downs
2 replies
1d1h

JVector, the library that powers DataStax Astra vector search, now supports indexing larger-than-memory datasets by performing construction-related searches with compressed vectors. This means that the edge lists need to fit in memory, but the uncompressed vectors do not, which gives us enough headroom to index Wikipedia-en on a laptop.

It's interesting to note that JVector accomplishes this differently than how DiskANN described doing it. My understanding (based on the links below, but I didn't read the full diff in #244) is that JVector will incrementally compress the vectors it is using to construct the index; whereas DiskANN described partitioning the vectors into subsets small enough that indexes can be built in-memory using uncompressed vectors, building those indexes independently, and then merging the results into one larger index.

OP, have you done any quality comparisons between an index built with JVector using the PQ approach (small RAM machine) vs. an index built with JVector using the raw vectors during construction (big RAM machine)? I'd be curious to understand what this technique's impact is on the final search results.

I'd also be interested to know if any other vector stores support building indexes in limited memory using the partition-then-merge approach described by DiskANN.

Finally, it's been a while since I looked at this stuff, so if I mis-wrote or mis-understood please correct me!

- DiskANN: https://dl.acm.org/doi/10.5555/3454287.3455520

- Anisotropic Vector Quantization (PQ Compression): https://arxiv.org/abs/1908.10396

- JVector/#168: How to support building larger-than-memory indexes https://github.com/jbellis/jvector/issues/168

- JVector/#244: Build indexes using compressed vectors https://github.com/jbellis/jvector/pull/244

pbadams
0 replies
1d

It's not mentioned in the original paper, but DiskANN also supports PQ at build-time via `--build_PQ_bytes`, though it's a tradeoff with the graph quality as you mention.

One interesting property in benchmarking is that the distance comparison implementations for full-dim vectors can often be more efficient than those for PQ-compressed vectors (straight-line SIMD execution vs table lookups), so on some systems cluster-and-merge is relatively competitive in terms of build performance.

jbellis
0 replies
1d1h

That's correct!

I've tested the build-with-compression approach used here with all the datasets in JVector's Bench [1] and there's near zero loss in accuracy.

I suspect that the reason the DiskANN authors used the approach they did is that in 2019 Deep1B was about the only very large public dataset around, and since the vectors themselves are small your edge lists end up dominating your memory usage. So they came up with a clever solution, at the cost of making construction 2.5x as expensive. (Educated guess: 2x is from adding each vector to multiple partitions and the extra 50% to merge the results.)

So JVector is just keeping edge lists in memory today. When that becomes a bottleneck we may need to do something similar to DiskANN but I'm hoping we can do better because it's frankly a little inelegant.

[1] https://github.com/jbellis/jvector/blob/main/jvector-example...

jl6
2 replies
22h57m

The source files appear to include pages from all namespaces, which is good, because a lot of the value of Wikipedia articles is held in the talk page discussions, and these sometimes get stripped from projects that use Wikipedia dumps.

worldsayshi
1 replies
22h54m

I'm curious: what's the main value you see in the talk pages? I almost never look at them myself.

jl6
0 replies
22h50m

They’re not so interesting for mundane topics, but for anything remotely controversial, they are essential for understanding what perspectives aren’t included in the article.

issafram
2 replies
17h19m

Would a Docker container help with running it on Windows?

jbellis
0 replies
6h48m

Technically it does run on Windows; you just can't build the entire dataset without adding the sharding code mentioned. Set divisor=100 in config.properties and it will happily build an index over 1% of the dataset.

cosmojg
0 replies
17h9m

Just use WSL. Or dual boot.

Mathnerd314
2 replies
5h11m

Enough RAM to run a JVM with 36GB of heap space

Are there laptops like that? Maybe an upgraded MacBook, but I have been looking for Windows/Linux laptops and they generally top out at 32GB. I checked Lenovo's website and everything with 64GB and up is not called a laptop but a "mobile workstation".

cbolton
0 replies
5h0m

You can configure a Lenovo Z13 Gen 2 with 64GB for little extra money (and choose between Windows, Ubuntu, Fedora, or no OS preinstalled).

_zoltan_
0 replies
3h18m

You can buy an M3 Max with 128GB memory.

arnaudsm
1 replies
1d

In expert topics, is vector search finally competitive with BM25-like algorithms? Or do we still need to mix the two together?

opdahl
0 replies
17h41m

Would be interesting if you could try implementing the Cohere Reranker into this. Should be fairly easy, and could lead to quite a bit of performance gain.

localhost
0 replies
1d

This is a giant dataset of 536GB of embeddings. I wonder how much compression is possible by training or fine-tuning a transformer model directly using these embeddings, i.e., no tokenization/decoding steps? Could a 7B or 14B model "memorize" Wikipedia?

anonymousDan
0 replies
12h34m

How do embeddings created by state of the art open source models compare to the free embeddings mentioned in the article? Would they actually cost 5k to create given a reasonable local GPU setup?

Khelavaster
0 replies
18h59m

This is how Microsoft powered its academic paper search in 2016, before rolling it into Bing in 2020!