The BM25 thing is actually a really big deal.
BM25 is similar to TF/IDF. In both cases, the key idea is to factor statistics of the overall corpus into relevance calculations. If the user searches for "charities in new orleans" in a corpus where "new orleans" appears in only a few documents, those documents should clearly rank highly. If the corpus has "new orleans" in almost every document, then the term "charity" is more important.
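For a rough sense of how corpus statistics enter the score, this is the textbook BM25 shape (a sketch; real engines vary in the constants and IDF smoothing):

    score(D, Q) = \sum_{q \in Q} IDF(q) \cdot \frac{f(q, D)\,(k_1 + 1)}{f(q, D) + k_1 (1 - b + b\,|D|/avgdl)}

    IDF(q) = \ln\!\left( \frac{N - n(q) + 0.5}{n(q) + 0.5} + 1 \right)

Here N is the number of documents in the corpus, n(q) is how many of them contain term q, f(q, D) is the term frequency in document D, |D| is its length, avgdl is the average document length, and k_1 ≈ 1.2, b ≈ 0.75 are tuning constants. A rare term like "new orleans" gets a large IDF, while a term that appears in almost every document contributes nearly nothing.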
PostgreSQL FTS cannot do this, because it doesn't maintain statistics for word frequencies across the entire corpus. This severely limits what it can implement in terms of relevance scoring - each result is scored based purely on whether the search terms are present or not.
For comparison, SQLite FTS (which a lot of people are unaware of) actually does implement full index statistics, and SQLite FTS5 implements BM25 out of the box.
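To illustrate, a minimal sketch (table and column names are made up, and FTS5 has to be enabled in your SQLite build):

    -- Hypothetical schema for illustration.
    CREATE VIRTUAL TABLE docs USING fts5(title, body);

    INSERT INTO docs (title, body) VALUES
      ('Giving', 'charities in new orleans'),
      ('Travel', 'visiting new orleans in the spring');

    -- FTS5's bm25() returns more-negative values for better matches, so
    -- ordering ascending (or using the built-in "rank" column) puts the
    -- most relevant documents first.
    SELECT title, bm25(docs) AS score
    FROM docs
    WHERE docs MATCH 'charities AND "new orleans"'
    ORDER BY score;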
The problem with BM25 in a database is that it can have unexpected outcomes for some common use cases.
Take multi-tenancy.
Say user 1 has many more documents than user 2 and uses "new orleans" a lot, while user 2 does not. Then user 2 runs a search.
The DB will run FTS first and then filter, so user 1's documents will bias user 2's results - perhaps enough for user 2 to infer which words appear in user 1's corpus.
AWS offers a solution in OpenSearch Serverless collections. A dedicated collection can be spun up for each user/tenant, e.g. «thing1-user345», «thing2-user123», instead of co-mingling «user123» and «user345» in the same collection index. It increases the overall overhead, but with repeatable infrastructure, courtesy of IaC, it is easy to roll out and to process/consume «thing1-user123» and «thing2-user123» as discrete datasets.
Tangentially related: I have been finding multi-tenancy to have become a nuisance at best and a liability at worst, due to increasingly frequent customer demands around data sovereignty, data confidentiality (each tenant wants their own, non-shared cryptography keys etc.), data privacy, compliance, the right to be forgotten/GDPR, and similar requirements. For anything more complex than an anonymous online flower shop, it is simpler to partition the whole thing off per customer/tenant.
Any idea how “OpenSearch serverless collections” are implemented? I’m guessing that a “collection” is basically an ElasticSearch index, and “serverless” refers to some method of serializing/loading it on-demand with some cold start tradeoffs?
Basically OpenSearch and Elasticsearch both offer serverless modes now. They work differently because all of this was developed post-fork, but they do similar things. The Elastic implementation is in early access right now, so it is not released yet; I saw a demo of it at one of their meetups in June. I think OpenSearch actually moved a bit faster than Elastic on this after Elastic announced they were working on it a few years ago.
The aim with both is that users no longer have to worry about cluster sizes or cluster management, which from an operational point of view is a huge gain: most companies using this stuff aren't very good at doing that properly, and the consequences are poor performance, outages, and even data loss when data volume grows.
Serverless essentially decouples indexing and querying traffic. All the nodes are transient and use their local disk only as a cache. Data at rest lives in S3, which becomes the single source of truth. So, if a new node comes up, it simply loads its state from there and doesn't have to coordinate with other nodes. There's no more possibility for a cluster to go red either: if a node goes down, it just gets replaced. Basically this makes use of the fact that Lucene index segments are immutable after they are written. There's a cleanup process running in the background that merges segments, but that just means you get a new file that then needs to be loaded by query nodes. I'm not 100% sure how write nodes coordinate segment creation and management, but I assume it involves some kind of queueing.
So you get horizontal scalability for both reads and writes and you no longer have to worry about managing cluster state.
The tradeoff is somewhat higher latency before query nodes can see your incoming data, because it has to hit S3 as part of the indexing process before read nodes can pick up new segments. Think multiple seconds before new data becomes visible. Both solutions are best suited for time-series type use cases, but they can also support regular search use cases. For any use case where the same documents get updated regularly, or where reading your own writes matters, things are not going to be great.
Amazon's implementation of course probably leverages a lot of their cloud stuff. Elastic's implementation will eventually be available in their cloud solution on all supported cloud providers. Self-hosting this is going to be challenging with either solution, so that's another tradeoff.
Thanks for the detailed response and insight. This is a great example of when meetups and in-person networking/collaboration can help you stay ahead of the curve.
It does sound like the solution glosses over some cold start problems that will surface with increasing regularity for more fragmented indexes. For example if you have one index per tenant (imagine GitHub search has one public index and then additionally one index per authenticated user containing only repos they are authorized to read), then each user will experience a cold start on their first authenticated search.
I bet these tradeoffs are not so bad, and in practice, are worth the savings. But I will be curious to follow the developments here and to see the limitations more clearly quantified.
(Also this doesn’t address the writes but I’m sure that’s solvable.)
Did I miss something? This is a comment on a publicly accessible website. How did you infer these benefits?
The context offered in their comment was based on a demo they saw at a meetup:
Is this really a viable approach at the scale of B2B SaaS like Salesforce (or, contextually, Algolia)? They would end up with literally 10s of 1000s of DBs. That is surely cost-prohibitive.
Doesn't that affect BM25 with a solution like Elasticsearch as well? Or is that smart enough to apply filters to the corpus statistics before calculating relevance?
You could solve that in SQLite by giving each user their own separate FTS table - not impossibly complex, but would grow increasingly messy if you have 10s of thousands of users.
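A minimal sketch of that approach (table names are hypothetical):

    -- One FTS5 table per tenant keeps each tenant's BM25 term statistics isolated.
    CREATE VIRTUAL TABLE docs_user123 USING fts5(body);
    CREATE VIRTUAL TABLE docs_user345 USING fts5(body);

    -- A given user's query only ever sees that user's corpus statistics.
    SELECT rowid, bm25(docs_user123) AS score
    FROM docs_user123
    WHERE docs_user123 MATCH '"new orleans"'
    ORDER BY score;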
One way to address this in Elasticsearch would be to put each customer's documents in their own index. Other than that, as far as I can tell it's not smart enough to apply filters first.
Also, shards can affect BM25 scoring: https://www.elastic.co/blog/practical-bm25-part-1-how-shards...
It’s why you need “enough” data. Generally the distribution of terms is a power-law distribution: there's a set of common terms and a very long tail of other terms. This will be the case across every shard if you do in fact have enough data per shard. This becomes a factor in certain distributed aggregations as well. YMMV of course.
In general you trade accuracy for speed. You tend to get a lot of speed for just a small amount of accuracy sacrifice.
Oh sure, but it is a side channel attack to be aware of.
I wrote this blog :). Good to see it still getting use.
FYI for folks just skimming this: shards can affect scoring, but they don't have to. Two mitigations:
1. The default in Elasticsearch has been 1 shard per index for a while, and many people (if not most) probably don't need more than 1 shard.
2. You can do a dfs_query_then_fetch query, which adds a small amount of latency but solves this problem.
The fundamental tenet is accurate here: any time you want to break up term statistics (e.g. if you want each user to experience different relevance results based on their own term stats), then yes, you need a separate index for that. I'd say that's largely not all that common in practice, though.
A more common problem that warrants splitting indices is when you have mixed-language content: the term "LA" in Spanish content adds very little information, while it adds a reasonable amount of information in most English-language documents (where it can mean Los Angeles). If you mix both kinds of content together, it can pollute term statistics for both. Considering how your segments of users and data will affect scoring is absolutely important as a general rule though, and part of why I'm super excited to be working on neural retrieval now.
Thanks for the clarifications! I've been spending the last 3 weeks deep in the weeds of TF/IDF scoring and was about to give up on Elastic Search when this got posted. The article has been eye opening!!!
If you're a Postgres pg_search user, partial indexes (https://docs.paradedb.com/search/full-text/index#partial-bm2...) can solve this.
If you're creating one Postgres index per user, you're going to have a bad time really fast.
Would eat up inodes? Not sure how these work, does it create separate files for each one?
Index per tenant, but yes: if you have hundreds or thousands of tenants, that becomes a more difficult problem to manage, though not at all unmanageable.
Using a SQLite DB per tenant is a good alternative to handle that: https://turso.tech/
SQLite has full text search with BM25?
Yes! https://www.sqlite.org/fts5.html
If you care about not losing your data, you should reconsider using Turso.
https://news.ycombinator.com/item?id=38539595
Really, use some proven and mature tech for your data. Don't jump on every hype train. In the case of SQLite, self-hosting is especially easy using https://litestream.io/.
It could be solved with an index partitioning feature, no idea if it already exists somewhere...
Yup, partial indexes have been a part of PG for decades, and partitioned tables (on tenant) will have different indexes per client automatically.
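In vanilla Postgres the mechanics look something like this (hypothetical schema; note this isolates the index itself, while per-tenant BM25 statistics on top of it are what pg_search's partial BM25 indexes add):

    -- Hypothetical table for illustration.
    CREATE TABLE documents (
        tenant_id int  NOT NULL,
        body      text NOT NULL
    );

    -- A partial GIN index scoped to a single tenant; repeat per tenant,
    -- or partition the table by tenant_id and index each partition.
    CREATE INDEX docs_fts_tenant_42
        ON documents USING gin (to_tsvector('english', body))
        WHERE tenant_id = 42;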
This would be fine.
ParadeDB supports partial indexes as well: https://docs.paradedb.com/search/full-text/index#partial-bm2...
The multi-tenancy problem actually already applies to almost every multitenant database in Postgres, Oracle, and MySQL, whether or not they use FTS. You just might not notice its impact in your case if query performance is "good enough".
War story time.
A while ago, I worked on a big SQL (Oracle) database: millions of rows per tenant, 10ks of tenants. Tenant data was highly dissimilar between most tenants in all sorts of ways, and the distribution of data "shape" between tenants wasn't that spiky. Tenants were all over the place, and the common case wasn't that common. Tenants (multitenant tables and/or whole separate schemas) routinely got migrated between physical database hosts by hand (GoldenGate didn't exist at the beginning of this company, and was too expensive by the end).
A whole host of problems in this environment cropped up because, inside the DB, indexes and cache-like structures were shared between tenants. An index histogram for a DATETIME column of a multitenant table might indicate that 75% of the dates were in 2016. But it turns out that most of that 75% were the rows owned by a few big tenants, and the other 1000+ tenants on that database were overwhelmingly unlikely to have 2016 dates. As a result, query plans often sucked for queries that filtered on that date.
Breaking tables up by tenant didn't help: query plans, too, were cached by the DB. The issue was that they were cached by query text. So on a database with lots of separate (identical) logical schemas, a query plan would get built for some query when it first ran against a schema with one set of index histograms, and then another schema with totally different histograms would run the same query, pull the now-inappropriate plan out of the cache, and do something dumb. This was a crappy family of bugs, in that it was happening all the time without major impact (a schema that was tiny overall is not going to cause customer/stability problems by running dumb query plans on tiny data), but it cropped up unpredictably with much larger impact when customers loaded large amounts of data and/or rare queries happened to run from cache against a huge customer's schema.
The solve for the query plan issue? Prefix each query with the customer's ID in a comment, because the plan cacher was too dumb (or intended for this use case, who knows?) to strip comments. The SQL keys in the plan cache would end up looking like a zillion variants of "/* CUSTOMER-1A2BFC10 */ SELECT ....". I imagine this trick is commonplace, but folks at that gig felt clever for finding it out back in the bad old days.
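So the queries ended up looking roughly like this (hypothetical query; the comment is a no-op for execution but makes each tenant's query text, and therefore its plan-cache key, unique):

    /* CUSTOMER-1A2BFC10 */
    SELECT order_id, created_at
    FROM   orders
    WHERE  created_at > :since;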
All of which is to say: database multitenancy poses problems for far more than search. That's not an indictment of multitenancy in general, but does teach the valuable lesson that abandoning multitenancy, even as wacky and inefficient as that seems, should be considered as a first-class tool in the toolbox of solutions here. Database-on-demand solutions (e.g. things like neon.tech, or the ability to swiftly provision and detach an AWS Aurora replica to run queries, potentially removing everything but one tenant's data) are becoming increasingly popular, and those might reduce the pain of tenant-per-instance-ifying database layout.
It's been a while since I used tools in this space. Some questions:
Where does AI/semantic search/vector DBs fit into the state of the art today? Is it generally a total replacement for traditional text search tools like elastic, or does it sit side by side? Most, but not all, use-cases I can think of would prefer semantic search, but what are the considerations here?
As others mentioned, you want to combine both semantic and TF-IDF-like searches. The thing is that searches that carry no semantic weight (e.g. a part number), or that have special meaning compared to the corpus used to train your embedding model (e.g. the average Joe thinks about the building when seeing the word "bank", but if you work in a city planning firm you might only consider the bank of a river), fail spectacularly when using only semantic search.
Alternatively you can fine-tune your embedding model so that it has been exposed to these words/meanings. However, the best approach (from personal experience) is doing both and using some sort of query rewriting on the full-text search to keep only the "keywords".
BM25 is often used alongside vector search with a reciprocal rank fusion (RRF) algorithm.
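A sketch of the usual RRF formula (k is a smoothing constant, commonly around 60):

    RRF(d) = \sum_{r \in \{bm25, vector\}} \frac{1}{k + rank_r(d)}

Each ranker contributes based only on the position it gave the document, so anything near the top of either the BM25 list or the vector list floats to the top of the fused list.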
As others in this thread have mentioned, semantic/vector-based solutions tend to be much better when any of the following is true:
1. There are natural language questions being asked.
2. There's ambiguity in any of the query terms, but where there's more clarity if you understand the term in context of nearby terms (e.g. "I studied IT", "I saw the movie It", "What is it?").
3. There are multiple languages in either the query or the documents (many embedding models can embed to a very similar vector across languages).
4. You don't want to maintain giant synonym lists (e.g. a user might search for, or a document might contain, "movie" or "film" or "motion picture").
Whether you need any of those depends on your use case. If you are just doing e.g. part number search, you probably don't need any of this, and certainly not semantic/vector stuff.
But semantic/vector systems don't work well with terms that weren't trained in, e.g. "what color is part NY739DCP?" Fine-tuning is bad at handling this (fine-tuning is generally good for changing the format of the response, and generally not all that good for introducing or curtailing knowledge). Whether you need a keyword search system depends on whether your information is more of the "general knowledge" type or something more specific to your business.
Most companies have some need for both because they're building search on "their" data/business, and you want to combine the results to make sure you're not duplicating. But I'll say I've seen a lot of companies get this sort of combination wrong. Keeping 2 datastores completely in sync is fraught with failure, the models all have different limitations than the keyword system (which is good in some ways but can cause unexpected results in others), and the actual blending logic (be it RRF or whatever) can be difficult to implement "right."
I usually recommend folks look to use a single system for combining these results as opposed to trying to invent it yourself. Full disclosure: I work for Vectara (vectara.com), which has a complete pipeline that combines both a neural/vector system and traditional keyword search, but I would recommend that folks look to combine these into a single system even if they didn't use our solution, because it takes so much operational complexity off the table.
I've learned information retrieval in university before ever using any full text search engine and I was honestly kinda surprised that most engines are literal textbook implementations. IR was one of those courses where I definitely expected the SotA to be further out than the third slide deck.
It kind of is now if you count deep-learning based solutions to an information need haha
That is true
That’s probably because SotA in IR is proprietary implementations like Google Search, the most bread-and-butter trade secrets of their business. So they’re not publishing that (at least in any holistic manner, and even if they did, it’s not like one product could implement all its functionality including the complexities around the periphery). And what you’re left with is some pretty decent parity between industry and academic SotA, same as a lot of other areas of CS (databases, decentralized systems, etc). If anything, in most cases, implementation lags behind academic theory.
I wish SQLite FTS had a nicer to use API. Last time I looked (a few months ago) it was very confusing and not very user friendly.
I've built a bunch of Python (and CLI) code to try and help make that work better: https://sqlite-utils.datasette.io/en/stable/python-api.html#...
Or with the CLI tool: https://sqlite-utils.datasette.io/en/stable/cli.html#configu...

That's cool, maybe I can use it as a reference. Thank you!
BM25 is definitely a big deal when you're doing FTS. It's one of the reasons I've switched to Arango as my DB for the project I'm working on that needs FTS. The fact it also comes with graphs means I don't need to bang my head against CTEs either.
Not saying Arango can replace Postgres but for my needs, it's a much better fit AND it offers the FTS that I need out of the box.
Arango is sweet! We've actually talked to many people who switched from Postgres to Arango because of its better FTS. This was one of the reasons for creating ParadeDB in the first place. Many of the users who made this switch wished they could have stayed in Postgres without compromising on FTS.
There was some work toward this, not sure about its current status: https://github.com/postgrespro/rum
Badly needed.
RUM indices are a nice improvement, but they're not in Postgres core and likely won't ever be :( If you're looking for Elastic-level FTS in Postgres, your best bet is likely to overengineer pgvector+tsvector or simply to use ParadeDB pg_search.
I didn't realise it was such a big deal. I wrote a custom BM25 and (faceted? boosted?) search implementation using roaring bitmaps, memory mapped files and pebble to replace sqlite FTS when it couldn't cope anymore with my requirements (100+ GB search index).