HN comments for: Turbopuffer: Fast search on object storage

eknkc

15 replies

21h3m

2024-07-09 21:29:17 UTC

Is there a good general purpose solution where I can store a large read only database in s3 or something and do lookups directly on it?

Duckdb can open parquet files over http and query them but I found it to trigger a lot of small requests reading bunch of places from the files. I mean a lot.

I mostly need key / value lookups and could potentially store each key in a seperate object in s3 but for a couple hundred million objects.. It would be a lot more managable to have a single file and maybe a cacheable index.

tionis

5 replies

17h30m

2024-07-10 01:02:10 UTC

You could use a sqlite database and use range queries using something like this: https://github.com/psanford/sqlite3vfshttp https://github.com/phiresky/sql.js-httpvfs

Simon Willison wrote about it: https://simonwillison.net/2022/Aug/10/sqlite-http/

eknkc

3 replies

10h38m

2024-07-10 07:54:17 UTC

Yep this thing is the reason I thought about doing it in the first place. Tried duckdb which has built in support for range requests over http.

Whole idea makes sense but I feel like the file format should be specifically tuned for this use case. Otherwise you end up with a lot of range requests because it was designed for disk access. I wondered if anything was actually designed for that.

hobofan

2 replies

7h15m

2024-07-10 11:16:28 UTC

Parquet and other columnar storage formats are essentially already tuned for that.

A lot of requests in themselves shouldn't be that horrible with Cloudfront nowadays, as you both have low latency and with HTTP2 a low-overhead RPC channel.

There are some potential remedies, but each come with significant architetural impact:

- Bigger range queries; For smallish tables, instead of trying to do point-based access for individual rows, instead retrieve bigger chunks at once and scan through them locally -> Less requests, but likely also more wasted bandwidth

- Compute the specific view live with a remote DuckDB -> Has the downside of having to introduce a DuckDB instance that you have to manage between the browser and S3

- Precompute the data you are interested into new parquest files -> Only works if you can anticipate the query patterns enough

I read in the sibling comment that your main issue seems to be re-reading of metadata. DuckDB is AFAIK able to cache the metadata, but won't across instances. I've seen someone have the same issue, and the problem was that they only created short-lived DuckDB in-memory instances (every time the wanted to run a query), so every time the fresh DB had to retrieve the metadata again.

eknkc

1 replies

7h2m

2024-07-10 11:30:19 UTC

Thanks for the insights. Precomputing is not really suitable for this and the thing is, I'm mostly using it as a lookup table on key / value queries. I know Duckdb is mostly suitable for aggregation but the http range query support was too attractive to pass on.

I did some tests, querying "where col = 'x'". If the database was a remote duckdb native db, it would issue a bunch of http range requests and the second exact call would not trigger any new requests. Also, querying for col = foo and then col = foob would yield less and less requests as I assume it has the necesary data on hand.

Doing it on parquet, with a single long running duckdb cli instance, I get the same requests over and over again. The difference though, I'd need to "attach" the duckdb database under a schema name but would query the parquet file using "select from 'http://.../x.parquet'" syntax. Maybe this causes it to be ephemeral for each query. Will see if the attach syntax also works for parquet.

hobofan

0 replies

6h32m

2024-07-10 11:59:52 UTC

I think both should work, but you have to set the object cache pragma IIRC: https://duckdb.org/docs/configuration/pragmas.html#object-ca...

arcanemachiner

0 replies

12h41m

2024-07-10 05:50:25 UTC

That whole thing still blows my mind.

jiggawatts

2 replies

19h39m

2024-07-09 22:53:09 UTC

trigger a lot of small requests reading bunch of places from the files. I mean a lot.

That’s… the whole point. That’s how Parquet files are supposed to be used. They’re an improvement over CSV or JSON because clients can read small subsets of them efficiently!

For comparison, I’ve tried a few other client products that don’t use Parquet files properly and just read the whole file every time, no matter how trivial the query is.

eknkc

1 replies

10h29m

2024-07-10 08:02:51 UTC

This makes sense but the problem I had with duckdb + parquet is it looks like there is no metadata caching so each and every query triggers a lot of requests.

Duckdb can query a remote duckdb database too, in that case it looks like there is caching. Which might be better.

I wonder if anyone actually worked on a specific file format for this use case (relatively high latency random access) to minimize reads to as little blocks as possible.

jiggawatts

0 replies

10h26m

2024-07-10 08:05:55 UTC

Sounds like a bug or missing feature in DuckDB more than an issue with the format

imiric

2 replies

19h33m

2024-07-09 22:58:55 UTC

ClickHouse can also read from S3. I'm not sure how it compares to DuckDB re efficiency, but it worked fine for my simple use case.

masterj

1 replies

18h8m

2024-07-10 00:23:23 UTC

Neither of these support indexes afaik. They are designed to do fast scans / computation.

hodgesrm

0 replies

14h16m

2024-07-10 04:15:44 UTC

It depends on what you mean by "support." ClickHouse as I recall can read min/max indexes from Parquet row groups. One of my colleagues is working on a PR to add support for bloom filter indexes. So that will be covered as well.

Right now one of the main performance problems is that Clickhouse does not cache index metadata yet, so you still have to scan files rather than keeping the metadata in memory. ClickHouse does this for native MergeTree tables. There are a couple of steps to get there but I have no doubt that metadata caching will be properly handled soon.

Disclaimer: I work for Altinity, an enterprise provider for ClickHouse software.

cdchn

1 replies

14h2m

2024-07-10 04:29:22 UTC

Is there a good general purpose solution where I can store a large read only database in s3 or something and do lookups directly on it?

I think this is pretty much what AWS Athena is.

tiew9Vii

0 replies

6h15m

2024-07-10 12:16:23 UTC

Cloud backed SQLLite looks like it might be good for this. Doesn’t support S3 though

https://sqlite.org/cloudsqlite/doc/trunk/www/index.wiki

canadiantim

0 replies

4h20m

2024-07-10 14:11:31 UTC

LanceDB

cmcollier

8 replies

22h13m

2024-07-09 20:18:32 UTC

Unrelated to the core topic, I really enjoy the aesthetic of their website. Another similar one is from Fixie.ai (also, interestingly, one of their customers).

swyx

2 replies

16h21m

2024-07-10 02:10:48 UTC

what does fixie do these days?

sitkack

1 replies

12h15m

2024-07-10 06:16:34 UTC

They pivoted, but will probably pivot back to their original quest.

zkoch

0 replies

3h39m

2024-07-10 14:52:37 UTC

Nah, we're pretty happy with the new trajectory. :)

xarope

0 replies

15h59m

2024-07-10 02:33:05 UTC

Yes, I like the turboxyz123 animation and contrast to the minimalist website (reminds me of the zen garden with a single rock). I think people forget nowadays in their haste to add the latest and greatest react animation, that too much noise is a thing.

nsguy

0 replies

20h14m

2024-07-09 22:17:33 UTC

Yeah! fast, clean, cool, unique.

k2so

0 replies

11h9m

2024-07-10 07:23:06 UTC

This was my first thought too, after reading through their blog. This feels like a no-frills website made by an engineer, who makes things that just work.

The documentation is great, I really appreciate them putting the roadmap front and centre.

itunpredictable

0 replies

20h52m

2024-07-09 21:39:37 UTC

This website rocks

0 replies

2h56m

2024-07-10 15:35:45 UTC

indeed! what a nice, minimal page... that comes with ~1.6mb of javascript.

nh2

5 replies

14h33m

2024-07-10 03:58:34 UTC

$3600.00/TB/month

It doesn't have to be that way.

At Hetzner I pay $200/TB/month for RAM. That's 18x cheaper.

Sometimes you can reach the goal faster with less complexity by removing the part with the 20x markup.

AYBABTME

2 replies

9h8m

2024-07-10 09:23:42 UTC

200$/TB/month for raw RAM, not RAM that's presented to you behind a usable API that's distributed and operated by someone else, freeing you of time.

It's not particularly useful to compare the cost of raw unorganized information medium on a single node, to highly organized information platform. It's like saying "this CPU chip is expensive, just look at the price of this sand".

kirmerzlikin

0 replies

8h26m

2024-07-10 10:05:32 UTC

AFAIU, 3600$ is also a price for "raw RAM" that will be used by your common database via sys calls and not via a "usable API operated by someone else"

hodgesrm

0 replies

3h59m

2024-07-10 14:32:40 UTC

It's not particularly useful to compare the cost of raw unorganized information medium on a single node, to highly organized information platform.

Except that it does prompt you to ask what you could do to use that cheap compute and RAM. In the case of Hetzner that might be large caches that allow you to apply those resources on remote data whilst minimizing transfer and API costs.

formerly_proven

0 replies

8h55m

2024-07-10 09:36:46 UTC

You seem to be quoting the highest figure from the article out of context as-if that is their pricing, but the opposite is the case.

$3600.00/TB/month (incumbents)

$70.00/TB/month (turbopuffer)

That's still 3x cheaper than your number and it's a SaaS API, not just a piece of rented hardware.

TechDebtDevin

0 replies

10h58m

2024-07-10 07:33:31 UTC

I will likely never leave Hetzner.

solatic

4 replies

2h12m

2024-07-10 16:19:33 UTC

Is it feasible to try to build this kind of approach (hot SSD cache nodes sitting in front of object storage) with prior open-source art (Lucene)? Or are the search indexes themselves also proprietary in this solution?

Having witnessed some very large Elasticsearch production deployments, being able to throw everything into S3 would be incredible. The applicability here isn't only for vector search.

rohitnair

2 replies

42m

2024-07-10 17:50:08 UTC

Elasticsearch and OpenSearch already support S3 backed indices. See features like https://opensearch.org/docs/latest/tuning-your-cluster/avail... The files in S3 are plain old Lucene segment files (just wrapped in OpenSearch snapshots which provide a way to track metadata around those files).

francoismassot

1 replies

35m

2024-07-10 17:56:59 UTC

But you don’t have fast search on those files stored on object storage.

rohitnair

0 replies

30m

2024-07-10 18:01:58 UTC

Yes, there is a cold start penalty but once the data is cached, it is equivalent to disk backed indices. There is also active work being done to improve the performance, example https://github.com/opensearch-project/OpenSearch/issues/1380...

francoismassot

0 replies

1h36m

2024-07-10 16:55:43 UTC

If you don't need vector search and have very large Elasticsearch deployment, you can have a look at Quickwit, it's a search engine on object storage, it's OSS and works for append-only datasets (like logs, traces, ...)

Repo: https://github.com/quickwit-oss/quickwit

omneity

4 replies

19h52m

2024-07-09 22:40:02 UTC

In 2022, production-grade vector databases were relying on in-memory storage

This is irking me. pg_vector has existed from before that, doesn't require in-memory storage and can definitely handle vector search for 100m+ documents in a decently performant manner. Did they have a particular requirement somewhere?

jbellis

3 replies

19h49m

2024-07-09 22:42:59 UTC

Have you tried it? pgvector performance falls off a cliff once you can't cache in ram. Vector search isn't like "normal" workloads that follow a nice pareto distribution.

omneity

2 replies

18h53m

2024-07-09 23:38:58 UTC

Tried and deployed in production with similar sized collections.

You only need enough memory to load the index, definitely not the whole collection. A typical index would most likely fit within a few GBs. And even if you need dozens of GBs of RAM it won’t cost nearly as much as $20k/month as the article surmises.

lyu07282

1 replies

5h36m

2024-07-10 12:55:43 UTC

How do you get to "a few GBs"? A hundred million embeddings, if you have 4 byte floats 1024 dimensions would be >400 GB alone.

omneity

0 replies

2h33m

2024-07-10 15:58:22 UTC

I did say the index, not the embeddings themselves. The index is a more compact representation of your embeddings collection, and that's what you need in memory. One approach for indexing is to calculate centroids of your embeddings.

You have multiple parameters to tweak, that affect retrieval performance as well as the memory footprint of your indexes. Here's a rundown on that: https://tembo.io/blog/vector-indexes-in-pgvector

bigbones

4 replies

21h57m

2024-07-09 20:35:05 UTC

Sounds like a source-unavailable version of Quickwit? https://quickwit.io/

pushrax

3 replies

21h43m

2024-07-09 20:48:50 UTC

LSM tree storage engine vs time series storage engine, similar philosophy but different use cases

singhrac

2 replies

15h0m

2024-07-10 03:31:27 UTC

Maybe I misunderstood both products but I think neither Quickwit or Turbopuffer is either of those things intrinsically (though log structured messages are a good fit for Quickfit). I think Quickwit is essentially Lucene/Elasticsearch (i.e. sparse queries or BM25) and Turbopuffer does vector search (or dense queries) like say Faiss/Pinecone/Qdrant/Vectorize, both over object storage.

pushrax

1 replies

5h7m

2024-07-10 13:24:43 UTC

It's true that turbopuffer does vector search, though it also does BM25.

The biggest difference at a low level is that turbopuffer records have unique primary keys, and can be updated, like in a normal database. Old records that were overwritten won't be returned in searches. The LSM tree storage engine is used to achieve this. The LSM tree also enables maintenance of global indexes that can be used for efficient retrieval without any time-based filter.

Quickwit records are immutable. You can't overwrite a record (well, you can, but overwritten records will also be returned in searches). The data files it produces are organized into a time series, and if you don't pass a time-based filter it has to look at every file.

singhrac

0 replies

3h18m

2024-07-10 15:13:24 UTC

Ah I didn’t catch that Quickwit had immutable records. That explains the focus on log usage. Thanks!

zX41ZdbW

1 replies

11h46m

2024-07-10 06:45:54 UTC

A correction to the article. It mentions

    Warehouse BigQuery, Snowflake, Clickhouse ≥1s Minutes

For ClickHouse, it should be: read latency <= 100ms, write latency <= 1s.

Logging, real-time analytics, and RAG are also suitable for ClickHouse.

Sirupsen

0 replies

4h56m

2024-07-10 13:36:15 UTC

Yeah, thinking about this more I now understand Clickhouse to be more of an operational warehouse similar to Materialize, Pinot, Druid, etc. if I understand correctly? So bunching with BigQuery/Snowflake/Trino/Databricks... wasn't the right category (although operational warehouses certainly can have a ton of overlap)

I left that category out for simplicity (plenty of others that didn't make it into the taxonomy, e.g. queues, nosql, time-series, graph, embedded, ..)

softwaredoug

1 replies

22h31m

2024-07-09 20:01:07 UTC

Having worked with Simon he knows his sh*t. We talked a lot about what the ideal search stack would look when we worked together at Shopify on search (him more infra, me more ML+relevance). I discussed how I just want a thing in the cloud to provide my retrieval arms, let me express ranking in a fluent "py-data" first way, and get out of my way

My ideal is that turbopuffer ultimately is like a Polars dataframe where all my ranking is expressed in my search API. I could just lazily express some lexical or embedding similarity, boost with various attributes like, maybe by recency, popularity, etc to get a first pass (again all just with dataframe math). Then compute features for a reranking model I run on my side - dataframe math - and it "just works" - runs all this as some kind of query execution DAG - and stays out of my way.

bkitano19

0 replies

18h28m

2024-07-10 00:04:01 UTC

+1, had the fortune to work with him at a previous startup and meetup in person. Our convo very much broadened my perspective on engineering as a career and a craft, always excited to see what he's working on. Good luck Simon!

yawnxyz

0 replies

14h11m

2024-07-10 04:21:04 UTC

can't wait for the day the get into GA!

yamumsahoe

0 replies

19h2m

2024-07-09 23:29:35 UTC

unsure if they are comparable, but is this and quickwit comparable?

vidar

0 replies

21h46m

2024-07-09 20:45:58 UTC

Can you compare to S3 Athena (ELI5)?

hipadev23

0 replies

17h13m

2024-07-10 01:19:19 UTC

That’s some woefully disappointing and incorrect metrics (read and write latency are both sub-second, storage medium would be “ Memory + Replicated SSDs”) you’ve got for Clickhouse there, but I understand what you’re going for and why you categorized it where you did.

endisneigh

0 replies

16h58m

2024-07-10 01:33:38 UTC

Slightly relevant - do people really want article recommendations? I don’t think I’ve ever read an article and wanted a recommendation. Even with this one - I sort of read it and that’s it; no feeling of wanting recommendations.

Am I alone in this?

In any case this seems like a pretty interesting approach. Reminds me of Warpstream which does something similar with S3 to replace Kafka.

drodgers

0 replies

21h10m

2024-07-09 21:22:19 UTC

I love the object-storage-first approach; it seems like such a natural fit for the could.

cdchn

0 replies

14h0m

2024-07-10 04:32:14 UTC

The very long introductory page has a ton of very juicy data in it, even if you don't care about the product itself.

arnorhs

0 replies

6h1m

2024-07-10 12:31:08 UTC

This looks super interesting. I'm not that familiar with vector databases. I thought they were mostly something used for RAG and other AI-related stuff.

Seems like a topic I need to delive into a bit more.

CyberDildonics

0 replies

21h24m

2024-07-09 21:08:11 UTC

Sounds like a filesystem with attributes in a database.