SeaweedFS does the thing: I've used it to store billions of medium-sized XML documents, image thumbnails, PDF files, etc. It fills the gap between "databases" (broadly defined; maybe you can do few-tens-KByte docs but stretching things) and "filesystems" (hard/inefficient in reality to push beyond tens/hundreds of millions of objects; yes I know it is possible with tuning, etc, but SeaweedFS is better-suited).
The docs and operational tooling feel a bit janky at first, but they get the job done, and the whole project is surprisingly feature-rich. I've dealt with basic power-outages, hardware-caused data corruption (cheap old SSDs), etc, and it was possible to recover.
In some ways I feel like the surprising thing is that there is such a gap in open source S3 API blob stores. Minio is very simple and great, but is one-file-per-object on disk (great for maybe 90% of use-cases, but not billions of thumbnails). Ceph et al are quite complex. There are a bunch of almost-sort-kinda solutions like base64-encoded bytes in HBase/postgresql/etc, or chunking (like MongoDB), but really you just want to concatenate the bytes like a .tar file, and index in with range requests.
The Wayback Machine's WARC files plus CDX (index files with offset/range) is pretty close.
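To make the "concatenate and index in with range requests" idea concrete, here is a minimal Go sketch of the general pattern (not SeaweedFS's actual on-disk format; names and layout are made up): blobs get appended to one big volume file, and a small index maps each key to an (offset, size) pair so every read is just a range read.

```go
// Minimal sketch: one append-only "volume" file plus an in-memory index of
// (offset, size) per key. Writes append; reads are range reads via ReadAt.
package main

import (
	"fmt"
	"os"
)

type needle struct{ offset, size int64 }

type volume struct {
	f     *os.File
	index map[string]needle // key -> location within the volume file
	end   int64             // current end of the volume file
}

func openVolume(path string) (*volume, error) {
	// Note: a fresh index every time; a real store would rebuild or persist it.
	f, err := os.OpenFile(path, os.O_CREATE|os.O_RDWR, 0o644)
	if err != nil {
		return nil, err
	}
	return &volume{f: f, index: map[string]needle{}}, nil
}

func (v *volume) put(key string, data []byte) error {
	if _, err := v.f.WriteAt(data, v.end); err != nil {
		return err
	}
	v.index[key] = needle{offset: v.end, size: int64(len(data))}
	v.end += int64(len(data))
	return nil
}

func (v *volume) get(key string) ([]byte, error) {
	n, ok := v.index[key]
	if !ok {
		return nil, fmt.Errorf("not found: %s", key)
	}
	buf := make([]byte, n.size)
	_, err := v.f.ReadAt(buf, n.offset) // the equivalent of an HTTP Range request
	return buf, err
}

func main() {
	v, err := openVolume("volume.dat")
	if err != nil {
		panic(err)
	}
	defer v.f.Close()

	if err := v.put("thumb-1", []byte("...jpeg bytes...")); err != nil {
		panic(err)
	}
	b, err := v.get("thumb-1")
	if err != nil {
		panic(err)
	}
	fmt.Printf("read back %d bytes\n", len(b))
}
```

A real store would also persist the index, handle reopening existing volumes, checksums, deletes/compaction, and replication; the point is only that lookups become single seeks into a handful of large files instead of millions of tiny ones.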
Written in Go no less, a GC language!
I was expecting C/C++ or Rust, pleasantly surprised to see Go.
Why pleasantly surprised compared to Rust? What’s the significance of GCing?
A lot of people regard GCs as something one should not use for low-level components like file systems and databases. So the fact that this performs so well might be the surprise for GP.
Which is annoying, as there are various GC systems that are near, or even equal to, the performance of comparable non-GC systems. (I personally blame Java for most of this.)
Yes and no. While for most applications the GC is hardly an issue and is fast enough, the problem is applications where you need to be able to control exactly when and how memory/objects will be freed. These will never do well with any form of GC. But a looot of software can perform perfectly fine with a GC. If anything, it is mostly Go's error handling that is the bigger issue...
Why is Go error handling the bigger issue?
You can often tell a system is written in Go when it locks up with no feedback. Go gives the illusion that concurrency is easy, but it simply makes it easy to write fragile concurrent systems.
A common pattern is that one component crashes because of a bug or a misconfiguration, then the controlling component locks up because it can't control the crashed component, and then all the other components lock up because they can't communicate with the locked up controller.
Anyway, that's my experience with several Go systems. Of course it's more a programming issue than a deficiency in Go itself, though I think the way errors are return values that are easily ignored and frustrating to deal with encourages this sort of lax behavior.
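For illustration, a contrived Go sketch of that pattern (not taken from any real system): the worker dies without reporting back, the controller blocks forever on a channel receive with no timeout or context, and the process hangs with no feedback.

```go
// Contrived sketch of the lockup pattern: a worker bails out without ever
// sending a result, and the controller blocks forever waiting for it.
package main

import (
	"fmt"
	"time"
)

func worker(results chan<- string) {
	// Imagine a bug or misconfiguration here: the worker returns without
	// ever sending a result (and without closing the channel).
}

func controller() string {
	results := make(chan string)
	go worker(results)
	// Blocks forever; a select with a context.Done() or time.After case
	// would at least turn this into a visible error.
	return <-results
}

func main() {
	// An unrelated goroutine keeps "working", so the runtime never reports
	// "all goroutines are asleep" and the lockup gives no feedback at all.
	go func() {
		for {
			time.Sleep(time.Second)
		}
	}()
	fmt.Println(controller()) // never prints
}
```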
For not-so-disciplined devs (…) it can easily eat errors. Linters catch some of that, and of course you can also do the same in exception-based languages, but there you have to explicitly write a catch {}, which is a code smell, while in Go it is easier to just 'forget' an error check. I actually like the Go way, just not that it's easy to forget the handling; that's why I prefer the Haskell/Idris style of returned errors (a monad): like Go, but making it impossible to use the result without explicitly testing for the error.
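A small illustration of the "easy to forget" point (file names are made up; linters such as errcheck will flag the unchecked calls, though a blank-identifier assignment like `f, _ :=` can still slip through default settings):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
)

func main() {
	// The "forgetful" version: every error below is silently dropped and it
	// still compiles without complaint. If the disk is full, the data just
	// vanishes; if Create failed, f is nil and Flush will panic later.
	f, _ := os.Create("/tmp/example.txt")
	w := bufio.NewWriter(f)
	w.WriteString("hello\n") // (int, error) both ignored
	w.Flush()                // error ignored: a failed flush goes unnoticed
	f.Close()                // error ignored

	// The explicit version Go nudges you toward, but never forces:
	g, err := os.Create("/tmp/example2.txt")
	if err != nil {
		fmt.Fprintln(os.Stderr, "create failed:", err)
		os.Exit(1)
	}
	w2 := bufio.NewWriter(g)
	if _, err := w2.WriteString("hello\n"); err != nil {
		fmt.Fprintln(os.Stderr, "write failed:", err)
	}
	if err := w2.Flush(); err != nil {
		fmt.Fprintln(os.Stderr, "flush failed:", err)
	}
	if err := g.Close(); err != nil {
		fmt.Fprintln(os.Stderr, "close failed:", err)
	}
}
```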
I was quite surprised to discover that minio is one file per object. Having read some papers about object stores, this is definitely not what I expected.
What are the pros/cons of storing one file per object? As a noob in this domain, this made sense to me.
It would be great if you could share the names of, or references to, some papers around this. Thank you in advance.
The other commenter already outlined the main trade-offs, which boil down to increased latency and storage overhead for the file-per-object model. As for papers, I like the design of Haystack.
https://www.usenix.org/legacy/event/osdi10/tech/full_papers/...
For many small objects a generic filesystem can be less efficient than a more specialised store. Things are being managed that aren't needed for your blob store, block alignment can waste a lot of space (see the back-of-the-envelope sketch after this comment), there are often inefficiencies in directories with many files, leading to hierarchical splitting that adds more inefficiency through indirection, etc. Some filesystems mitigate the space waste by supporting partial blocks, or by including small files directly in the directory entry or another structure (the MFT in NTFS), but this adds extra complexity.
The significance of these inefficiencies will vary depending on your base filesystem. The advantage of using your own storage format rather than naively using a filesystem is you can design around these issues taking different choices around the trade-offs than a general filesystem might, to produce something that is both more space efficient and more efficient to query and update for typical blob access patterns.
The middle ground, using a database rather than a filesystem, is usually a compromise: still less efficient than a specially designed storage structure, but perhaps more so than a filesystem. Databases tend to have issues (even if just inefficiencies) with large objects though, so your blob storage mechanism needs to work around those or just put up with them. A file-per-object store may have a database anyway, for indexing purposes.
A huge advantage of one file per object is simplicity of implementation. Also for some end users the result (a bunch of files rather than one large object) might better fit into their existing backup strategies¹. For many data and load patterns, the disadvantages listed above may hardly matter so the file-per-object approach can be an appropriate choice.
--
[1] Assuming they are not relying on the distributed nature of the blob store², which is naive³: distributed storage doesn't protect you against some things a backup does, unless the blob store implements features to help out there (a minimum distributed duplication guarantee for any given piece of data, keeping past versions, etc).
[2] Also note that not all blob stores are distributed, and many are but support single node operation.
[3] Perhaps we need a new variant of the "RAID is not a backup" mantra. "Distributed storage properties are not, by themselves, a backup" or some such.
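To put rough numbers on the block-alignment point above, a back-of-the-envelope sketch with made-up but representative figures (1 billion ~2 KB thumbnails, 4 KB filesystem blocks); it ignores inodes, directory overhead, and the packed store's own index:

```go
// Back-of-the-envelope comparison of file-per-object vs packed volume files.
package main

import "fmt"

func main() {
	const (
		objects    = 1_000_000_000
		objectSize = 2 * 1024 // ~2 KB average thumbnail (made-up figure)
		blockSize  = 4 * 1024 // typical filesystem block size
	)

	// File-per-object: each object occupies at least one full block,
	// plus an inode/metadata entry (ignored here).
	filePerObject := int64(objects) * blockSize

	// Packed volumes: objects are concatenated, so the payload dominates
	// (per-record headers and the index are ignored here).
	packed := int64(objects) * objectSize

	fmt.Printf("file-per-object: ~%d GB\n", filePerObject>>30)
	fmt.Printf("packed volumes:  ~%d GB\n", packed>>30)
	fmt.Printf("wasted padding:  ~%d GB\n", (filePerObject-packed)>>30)
}
```

That is roughly 2 TB of pure padding in the file-per-object layout before counting any metadata overhead, which is part of why packed volume files tend to win for billions of tiny objects.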
When using HDDs, you want to chunk files at about 1MB-10MB. This helps with read/write scaling/throughput etc.
I imagine very large objects you'd like to be able to shard across multiple servers.
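A rough Go sketch of that chunking idea (sizes, names, and the manifest shape are illustrative, not SeaweedFS's actual scheme): split a large object into fixed-size pieces and keep a manifest of chunk references, so each chunk can live on a different volume/server and large reads can fan out in parallel.

```go
// Split a large object into fixed-size chunks and build a manifest of them.
package main

import (
	"bytes"
	"fmt"
	"io"
)

const chunkSize = 4 << 20 // 4 MB, somewhere in the suggested 1-10 MB range

type chunkRef struct {
	index int
	size  int
	// in a real store: volume/server ID, offset, checksum, ...
}

func splitIntoChunks(r io.Reader) ([]chunkRef, error) {
	var refs []chunkRef
	buf := make([]byte, chunkSize)
	for i := 0; ; i++ {
		n, err := io.ReadFull(r, buf)
		if n > 0 {
			// here buf[:n] would be written out to some volume/server
			refs = append(refs, chunkRef{index: i, size: n})
		}
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			return refs, nil
		}
		if err != nil {
			return nil, err
		}
	}
}

func main() {
	blob := bytes.Repeat([]byte("x"), 10<<20) // a 10 MB object
	refs, err := splitIntoChunks(bytes.NewReader(blob))
	if err != nil {
		panic(err)
	}
	fmt.Println("chunks:", len(refs)) // 3 chunks: 4 MB + 4 MB + 2 MB
}
```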
GarageS3 is a nice middle ground: it is not one file on disk per object, but it's also simpler than SeaweedFS.
https://garagehq.deuxfleurs.fr/
One will want to be cognizant that Garage, like recent MinIO releases, is AGPL https://git.deuxfleurs.fr/Deuxfleurs/garage/src/tag/v0.9.1/L...
I'm not trying to start trouble, only raising awareness because in some environments such a thing matters
Garage has no intention to support erasure coding though.
Yes, the Garage source code is very easy to read and understand. Didn't read SeaweedFS yet.
When you had corruption and failures, what was the general procedure to deal with that? I love SeaweedFS and want to try it (Neocities is a nearly perfect use case), but part of my concern is not having a manual/documentation for the edge cases so I can figure things out on the fringes. I didn't see any documentation around that when I last looked but maybe I missed something.
(If any SeaweedFS devs are seeing this, having a section of the wiki that describes failure situations and how to manage them would be a huge add-on.)
The dev is surprisingly helpful, but yeah, I agree the wiki is in need of some beefing up w.r.t. operations.
Why would you base64 encode them? They all store binary formats.