What I'm really missing in this space is something like this for content addressed blob storage.
I feel like a lot of complexity and performance overhead could be reduced if you only store immutable blobs under their hash (e.g. BLAKE3). Combined with soft deletes, this would make all operations idempotent, blobs trivially cacheable, and all state a CRDT: monotonically mergeable and coordination-free.
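To make that concrete, here is a minimal sketch of the kind of store I mean, assuming a made-up local directory layout and using SHA-256 from the stdlib (BLAKE3 would need a third-party library but works the same way):

    import hashlib
    from pathlib import Path

    STORE = Path("/var/blobs")        # hypothetical local blob root
    TOMBSTONE = ".deleted"            # soft-delete marker; the blob itself is never mutated

    def put(data: bytes) -> str:
        """Store an immutable blob under its own hash; repeating the call is a no-op."""
        digest = hashlib.sha256(data).hexdigest()
        path = STORE / digest[:2] / digest
        if not path.exists():         # idempotent: same bytes -> same key -> same outcome
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_bytes(data)
        return digest

    def get(digest: str) -> bytes:
        data = (STORE / digest[:2] / digest).read_bytes()
        assert hashlib.sha256(data).hexdigest() == digest   # every read is self-verifying
        return data

    def soft_delete(digest: str) -> None:
        """Record deletion as a separate tombstone instead of removing anything."""
        (STORE / digest[:2] / (digest + TOMBSTONE)).touch()

Since keys are derived purely from content, merging two replicas is just a union of their blob sets (tombstones included), which is what makes the whole thing coordination-free.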
There is stuff like IPFS in the large, but I want this for local deployments as an S3 replacement, with the metadata stored elsewhere, like in git or a database.
you might be interested in https://github.com/perkeep/perkeep
Perkeep has (at least as of when I last checked) the very interesting property of being completely impossible for me to make heads or tails of, while also looking extremely interesting and useful.
So in the hope of triggering someone to give me the missing link (maybe even a hyperlink) to understand it, here is the situation:
I'm a SW dev who has also done a lot of sysadmin work. Yes, I have managed to install it. And that is about it. There seem to be so many features, but I really, really don't understand how I am supposed to use the product, or the documentation for that matter.
I could start an import of Twitter or something else and it kind of shows up. Same with anything else: photos etc.
It clearly does something, but it was impossible to understand what I am supposed to do next, from both the UI and the docs.
Besides a personal photo store, I use the storage part as a file store at work (basically, with indexing off), with a simplifying wrapper for upload/download: github.com/tgulacsi/camproxy
With the adaptive block hashing (varying block sizes), it beats gzip for compression.
I was curious to see if I could help, and I wondered if you've seen their mailing list? It has some folks complaining about things they wish it did, which, strangely enough, is often a good indication of what it currently does.
There's also "Show Parkeep"-ish posts like this one <https://groups.google.com/g/perkeep/c/mHoUUcBz2Yw> where the user made their own Pocket implementation complete with original page snapshotting
The thing that stood out to me most was the number of folks who wanted Perkeep to manage its own content AND serve as the metadata system of record for external content (think: an existing MP3 library owned by an inflexible media player such as iTunes). So between that and your "import Twitter" comment, it seems one of its current hurdles is that the use case one might have for a system like this needs to be "all in", otherwise it becomes the same problem as a removable USB drive for storing stuff: "oh, damn, is that on my computer or on the external drive?"
I agree 100%
Perkeep is such a cool, interesting concept, but it seems like it's on life-support.
If I'm not mistaken, it used to be funded by creator Brad Fitz, who could afford to hire a full-time developer on his Google salary, but that time has sadly passed.
It suffers from having so many cool use-cases that it struggles to find a balance in presentation.
Or some even older prior art (which I recall a Perkeep dev citing as an influence in a conference talk):
http://doc.cat-v.org/plan_9/4th_edition/papers/venti/
https://en.wikipedia.org/wiki/Venti_(software)
Yeah, there are plenty of dead and abandoned projects in this space. Maybe the concept is worthless without a tool for metadata management? Also, I should probably have specified that by "missing" I mean "there is nothing well maintained and production grade" ^^'
Yeah, I've been following it on and off since it was Camlistore. Maybe it tried to do too much at once and didn't focus enough on just the blob part, but I feel like it never really reached a coherent state and story.
You might also be interested in Tahoe-LAFS https://www.tahoe-lafs.org/
I get an error when trying to open that site.
So it looks like it is pretty dead like most projects in this space?
Because the website seems to have a temporary issue, the project must be dead?
Tahoe-LAFS seems alive and continues to be developed, although it seems not to have seen as many updates in 2024 as in previous years: https://github.com/tahoe-lafs/tahoe-lafs/graphs/contributors
More like based on the prior that all projects in that space aren't in the best of health. Thanks for the GitHub link, that didn't pop up in my quick Google search.
Garage splits the data into chunks for deduplication, so it basically already does content-addressed storage under the hood.
They probably don't expose it publicly though.
Yeah, and as far as I understand they use the hash of the key to address the overall object descriptor. So in theory, using the hash of the file instead of the hash of the key should be a simple-ish change.
Tbh I'm not sure content-aware chunking isn't a siren's call.
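For anyone unfamiliar: content-defined chunking cuts blobs at positions the data itself determines via a rolling hash, so an insert near the start of a file doesn't shift every later chunk boundary. A toy sketch of the idea (not Garage's or Perkeep's actual algorithm, parameters made up):

    def chunk(data: bytes, mask: int = 0x1FFF, min_size: int = 2048, max_size: int = 65536) -> list[bytes]:
        """Split data at content-dependent boundaries instead of fixed offsets."""
        chunks, start, h = [], 0, 0
        for i, b in enumerate(data):
            h = ((h << 1) + b) & 0xFFFFFFFF              # toy rolling hash, illustrative only
            size = i - start + 1
            if ((h & mask) == 0 and size >= min_size) or size >= max_size:
                chunks.append(data[start:i + 1])         # boundary found: emit a chunk
                start, h = i + 1, 0
        if start < len(data):
            chunks.append(data[start:])                  # trailing partial chunk
        return chunks

Each chunk then gets stored under its own hash, so two versions of a file share most of their chunks; the siren's-call part is that you now also need a manifest format, parameter tuning, and garbage collection on top of the plain blob store.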
Sounds a little like Kademlia, the DHT implementation that BitTorrent uses.
It's a distributed hash table where the value mapped to a hash is immutable after it is STOREd (at least in the implementations that I know of).
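For context, Kademlia's lookups are built on an XOR metric over node and key IDs: a key lives on the nodes whose IDs are closest to it in that metric. A minimal illustration (160-bit SHA-1 IDs as in BitTorrent's mainline DHT; not a full implementation):

    import hashlib

    def node_id(seed: bytes) -> int:
        """160-bit IDs, as in the Kademlia paper and BitTorrent's DHT (which uses SHA-1)."""
        return int.from_bytes(hashlib.sha1(seed).digest(), "big")

    def distance(a: int, b: int) -> int:
        """Kademlia's distance: bitwise XOR, compared as an unsigned integer."""
        return a ^ b

    # A STORE for key k goes to the nodes minimising distance(node, k); since the
    # value under k is immutable, republishing it anywhere is harmless.
    key = node_id(b"some-blob-hash")
    nodes = [node_id(bytes([i])) for i in range(20)]
    closest = sorted(nodes, key=lambda n: distance(n, key))[:3]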
Kademlia could certainly be part of a solution to this, but it's a long road from the algorithm to a binary you can start on a bunch of machines to get the service, e.g. something like SeaweedFS. BitTorrent might actually be the closest thing we have to this, but it sits at the opposite end of the latency/distribution spectrum.
Check out SeaweedFS too; it makes some interesting tradeoffs, but I hear you on wanting the properties you're looking for.
I am using SeaweedFS for a project right now. Some things to consider with it:
- It works pretty well, at least up to the 15B objects I am using it for. Running on 2 machines with about 300 TB (500 TB raw) of storage each.
- The documentation can be sparse, specifically with regard to operations: how to back things up, or the different failure modes of the components.
- One example of the above: I spun up a second filer instance (which is supposed to sync automatically), which caused the master server to emit an error while it was syncing. The only way to know whether it was working was watching the new filer's storage slowly grow.
- Seaweed has a pretty high bus factor, though the dev is pretty responsive and seems to accept PRs at a steady rate.
Take a look at https://github.com/n0-computer/iroh
Open source project written in Rust that uses BLAKE3 (and QUIC, which you mentioned in another comment)
It certainly has a lot of overlap and is a very interesting project, but like most projects in this space, I feel like it's already doing too much. I think that might be because many of these systems also try to be user-facing?
E.g. it tries to solve the "mutability problem" (having human-readable identifiers point to changing blobs); there are blobs and collections and documents; there is a whole resolver system with their ticket stuff.
All of these things are interesting problems that I'd definitely like to see solved some day, but I'd be more than happy with an "S3 for blobs" :D.
The RADOS K/V store is pretty close. Ceph is built on top of it but you can also use it as a standalone database.
Nothing is content-addressed in RADOS. It's just a key-value store with more powerful operations than get/put, and it's more in the strong-consensus camp than the parent's request for coordination-free things.
(Disclaimer: ex-Ceph employee.)
Something related that I've been thinking about is that there aren't many popular data storage systems out there that use HTTP/3 and/or gRPC for the lower latency. I don't just mean object storage, but database servers too.
Recently I benchmarked the latency to some popular RPC, cache, and DB platforms and was shocked at how high it was. Everyone still talks about 1 ms as the latency floor, when it should be the ceiling.
Yeah, QUIC would probably be a good protocol for such a system. Round trips are also expensive; ideally your client library would cache as much data as the local disk can hold.
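One nice consequence of content addressing for the round-trip problem: a client-side cache never needs invalidation, because a hash can only ever name one value. A rough sketch, with a made-up fetch_from_server callback and SHA-256 standing in for whatever hash the store uses:

    import hashlib
    from pathlib import Path

    CACHE = Path.home() / ".blob-cache"      # hypothetical on-disk cache location

    def fetch(digest: str, fetch_from_server) -> bytes:
        """Serve from the local cache if possible; otherwise fetch once and keep forever."""
        cached = CACHE / digest
        if cached.exists():
            return cached.read_bytes()       # immutable blobs: a cache hit can never be stale
        data = fetch_from_server(digest)     # one round trip, over QUIC or whatever transport
        if hashlib.sha256(data).hexdigest() != digest:
            raise ValueError("server returned bytes that don't match their address")
        CACHE.mkdir(parents=True, exist_ok=True)
        cached.write_bytes(data)
        return data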
Have you seen https://github.com/willbryant/verm?
Yeah, the subdirectories and MIME types seemed like an unnecessary complication. Also, it looks pretty dead.
IPFS like "coordination free" local S3 replacement! Yes. That is badly needed.
That's how we use S3 in Peergos (built on IPFS). You can get S3 to verify the SHA-256 of a block on write and reject the write if it doesn't match. This means many mutually untrusting users can all write to the same bucket at the same time with no possibility of conflict. We talk about this more here:
https://peergos.org/posts/direct-s3
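For reference, here is roughly what a checksum-verified write looks like with S3's additional-checksums support (a boto3 sketch with a made-up bucket name, and not necessarily the exact mechanism Peergos uses): the client sends the expected SHA-256, base64-encoded, and S3 rejects the upload if the body doesn't hash to it.

    import base64
    import hashlib
    import boto3

    s3 = boto3.client("s3")
    data = b"some immutable block"
    digest = hashlib.sha256(data).digest()

    # S3 recomputes the SHA-256 server-side and fails the request on a mismatch,
    # so clients can only ever create objects whose bytes match the key we derive.
    s3.put_object(
        Bucket="example-bucket",                           # hypothetical bucket
        Key=digest.hex(),                                  # content-addressed key
        Body=data,
        ChecksumSHA256=base64.b64encode(digest).decode(),  # expected hash, verified on write
    )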
I would settle for first-class support for object hashes. Let an object have metadata, available in the inventory, that gives zero or more hashes of the data. SHA-256, some BLAKE-family hash, and at least one decent tree hash should be supported. There should be a way to ask the store to add a hash to an existing object, and it should work on multipart objects.
IOW I would settle for content verification even without content addressing.
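To illustrate why a tree hash matters for multipart objects: hash each part on its own, then hash the ordered list of part hashes, so any single part can be verified or re-uploaded without streaming the whole object. A naive two-level sketch, not any particular standard:

    import hashlib

    def part_hashes(parts: list[bytes]) -> list[bytes]:
        return [hashlib.sha256(p).digest() for p in parts]

    def tree_hash(parts: list[bytes]) -> str:
        """The root commits to every part hash, in order."""
        return hashlib.sha256(b"".join(part_hashes(parts))).hexdigest()

    # Any one part can be checked against its own hash without reading the others,
    # which a plain whole-object SHA-256 can't give you.
    parts = [b"part-1" * 1000, b"part-2" * 1000]
    print(tree_hash(parts))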
S3 has an extremely half-hearted implementation of this for “integrity”.