SeaweedFS does the thing: I've used it to store billions of medium-sized XML documents, image thumbnails, PDF files, etc. It fills the gap between "databases" (broadly defined; maybe you can do few-tens-KByte docs but stretching things) and "filesystems" (hard/inefficient in reality to push beyond tens/hundreds of millions of objects; yes I know it is possible with tuning, etc, but SeaweedFS is better-suited).
The docs and operational tooling feel a bit janky at first, but they get the job done, and the whole project is surprisingly feature-rich. I've dealt with basic power-outages, hardware-caused data corruption (cheap old SSDs), etc, and it was possible to recover.
In some ways I feel like the surprising thing is that there is such a gap in open source S3 API blob stores. Minio is very simple and great, but is one-file-per-object on disk (great for maybe 90% of use-cases, but not billions of thumbnails). Ceph et al are quite complex. There are a bunch of almost-sort-kinda solutions like base64-encoded bytes in HBase/postgresql/etc, or chunking (like MongoDB), but really you just want to concatenate the bytes like a .tar file, and index in with range requests.
The Wayback Machine's WARC files plus CDX (index files with offset/range) is pretty close.
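To make the "concatenate and index in with range requests" idea concrete, here is a minimal Go sketch of the general pattern (not SeaweedFS's actual on-disk format; names and layout are made up): blobs get appended to one big volume file, and a small index maps each key to an (offset, size) pair so every read is just a range read.

```go
// Minimal sketch: one append-only "volume" file plus an in-memory index of
// (offset, size) per key. Writes append; reads are range reads via ReadAt.
package main

import (
	"fmt"
	"os"
)

type needle struct{ offset, size int64 }

type volume struct {
	f     *os.File
	index map[string]needle // key -> location within the volume file
	end   int64             // current end of the volume file
}

func openVolume(path string) (*volume, error) {
	// Note: a fresh index every time; a real store would rebuild or persist it.
	f, err := os.OpenFile(path, os.O_CREATE|os.O_RDWR, 0o644)
	if err != nil {
		return nil, err
	}
	return &volume{f: f, index: map[string]needle{}}, nil
}

func (v *volume) put(key string, data []byte) error {
	if _, err := v.f.WriteAt(data, v.end); err != nil {
		return err
	}
	v.index[key] = needle{offset: v.end, size: int64(len(data))}
	v.end += int64(len(data))
	return nil
}

func (v *volume) get(key string) ([]byte, error) {
	n, ok := v.index[key]
	if !ok {
		return nil, fmt.Errorf("not found: %s", key)
	}
	buf := make([]byte, n.size)
	_, err := v.f.ReadAt(buf, n.offset) // the equivalent of an HTTP Range request
	return buf, err
}

func main() {
	v, err := openVolume("volume.dat")
	if err != nil {
		panic(err)
	}
	defer v.f.Close()

	if err := v.put("thumb-1", []byte("...jpeg bytes...")); err != nil {
		panic(err)
	}
	b, err := v.get("thumb-1")
	if err != nil {
		panic(err)
	}
	fmt.Printf("read back %d bytes\n", len(b))
}
```

A real store would also persist the index, handle reopening existing volumes, checksums, deletes/compaction, and replication; the point is only that lookups become single seeks into a handful of large files instead of millions of tiny ones.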
Written in Go no less, a GC language!
I was expecting C/C++ or Rust, pleasantly surprised to see Go.
Why pleasantly surprised compared to Rust? What’s the significance of GCing?
A lot of people regard GCs as something one should not use for low-level components like file systems and databases. So the fact that this performs so well might be the surprise for GP.
Which is annoying, as there are various GC systems that are near, or even equal to, the performance of comparable non-GC systems. (I personally blame Java for most of this.)
Yes and no. While for most applications the GC is hardly an issue and is fast enough, the problem is applications where you need to be able to control exactly when and how memory/objects will be freed. These will never do well with any form of GC. But a looot of software can perform perfectly fine with a GC. If anything, it is mostly Go's error handling that is the bigger issue...
Why is Go error handling the bigger issue?
You can often tell a system is written in Go when it locks up with no feedback. Go gives the illusion that concurrency is easy, but it simply makes it easy to write fragile concurrent systems.
A common pattern is that one component crashes because of a bug or a misconfiguration, then the controlling component locks up because it can't control the crashed component, and then all the other components lock up because they can't communicate with the locked up controller.
Anyway, that's my experience with several Go systems. Of course it's more a programming issue than a deficiency in Go itself, though I think the way errors are return values that are easily ignored and frustrating to deal with encourages this sort of lax behavior.
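For illustration, a contrived Go sketch of that pattern (not taken from any real system): the worker dies without reporting back, the controller blocks forever on a channel receive with no timeout or context, and the process hangs with no feedback.

```go
// Contrived sketch of the lockup pattern: a worker bails out without ever
// sending a result, and the controller blocks forever waiting for it.
package main

import (
	"fmt"
	"time"
)

func worker(results chan<- string) {
	// Imagine a bug or misconfiguration here: the worker returns without
	// ever sending a result (and without closing the channel).
}

func controller() string {
	results := make(chan string)
	go worker(results)
	// Blocks forever; a select with a context.Done() or time.After case
	// would at least turn this into a visible error.
	return <-results
}

func main() {
	// An unrelated goroutine keeps "working", so the runtime never reports
	// "all goroutines are asleep" and the lockup gives no feedback at all.
	go func() {
		for {
			time.Sleep(time.Second)
		}
	}()
	fmt.Println(controller()) // never prints
}
```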
For not-so-disciplined devs (…) it can easily eat errors. Linters catch some of that, and of course you can also do the same in exception-based languages, but there you have to explicitly write a catch {}, which is a code smell, while in Go it is easier to just 'forget' an error check. I actually like the Go way, just not that it's easy to forget the handling; that's why I prefer the Haskell/Idris style of returned errors (a monad): like Go, but making it impossible to use the result without explicitly testing for the error.
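A small illustration of the "easy to forget" point (file names are made up; linters such as errcheck will flag the unchecked calls, though a blank-identifier assignment like `f, _ :=` can still slip through default settings):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
)

func main() {
	// The "forgetful" version: every error below is silently dropped and it
	// still compiles without complaint. If the disk is full, the data just
	// vanishes; if Create failed, f is nil and Flush will panic later.
	f, _ := os.Create("/tmp/example.txt")
	w := bufio.NewWriter(f)
	w.WriteString("hello\n") // (int, error) both ignored
	w.Flush()                // error ignored: a failed flush goes unnoticed
	f.Close()                // error ignored

	// The explicit version Go nudges you toward, but never forces:
	g, err := os.Create("/tmp/example2.txt")
	if err != nil {
		fmt.Fprintln(os.Stderr, "create failed:", err)
		os.Exit(1)
	}
	w2 := bufio.NewWriter(g)
	if _, err := w2.WriteString("hello\n"); err != nil {
		fmt.Fprintln(os.Stderr, "write failed:", err)
	}
	if err := w2.Flush(); err != nil {
		fmt.Fprintln(os.Stderr, "flush failed:", err)
	}
	if err := g.Close(); err != nil {
		fmt.Fprintln(os.Stderr, "close failed:", err)
	}
}
```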
I was quite surprised to discover that minio is one file per object. Having read some papers about object stores, this is definitely not what I expected.
What are the pros/cons of storing one file per object? As a noob in this domain, this made sense to me.
It would be great if you could share the names of, or references to, some papers around this. Thank you in advance.
The other commenter already outlined the main trade-offs, which boil down to increased latency and storage overhead for the file-per-object model. As for papers, I like the design of Haystack.
https://www.usenix.org/legacy/event/osdi10/tech/full_papers/...
For many small objects a generic filesystem can be less efficient than a more specialised store. Things are being managed that aren't needed for your blob store, block alignment can waste a lot of space (see the back-of-the-envelope sketch after this comment), there are often inefficiencies in directories with many files, leading to hierarchical splitting that adds more inefficiency through indirection, etc. Some filesystems mitigate the space waste by supporting partial blocks, or by including small files directly in the directory entry or another structure (the MFT in NTFS), but this adds extra complexity.
The significance of these inefficiencies will vary depending on your base filesystem. The advantage of using your own storage format rather than naively using a filesystem is you can design around these issues taking different choices around the trade-offs than a general filesystem might, to produce something that is both more space efficient and more efficient to query and update for typical blob access patterns.
The middle ground, using a database rather than a filesystem, is usually a compromise: still less efficient than a specially designed storage structure, but perhaps more so than a filesystem. Databases tend to have issues (even if just inefficiencies) with large objects though, so your blob storage mechanism needs to work around those or just put up with them. A file-per-object store may have a database anyway, for indexing purposes.
A huge advantage of one file per object is simplicity of implementation. Also for some end users the result (a bunch of files rather than one large object) might better fit into their existing backup strategies¹. For many data and load patterns, the disadvantages listed above may hardly matter so the file-per-object approach can be an appropriate choice.
--
[1] Assuming they are not relying on the distributed nature of the blob store², which is naive³: distributed storage doesn't protect you against some things a backup does, unless the blob store implements features to help out there (a minimum distributed duplication guarantee for any given piece of data, keeping past versions, etc).
[2] Also note that not all blob stores are distributed, and many are but support single node operation.
[3] Perhaps we need a new variant of the "RAID is not a backup" mantra. "Distributed storage properties are not, by themselves, a backup" or some such.
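To put rough numbers on the block-alignment point above, a back-of-the-envelope sketch with made-up but representative figures (1 billion ~2 KB thumbnails, 4 KB filesystem blocks); it ignores inodes, directory overhead, and the packed store's own index:

```go
// Back-of-the-envelope comparison of file-per-object vs packed volume files.
package main

import "fmt"

func main() {
	const (
		objects    = 1_000_000_000
		objectSize = 2 * 1024 // ~2 KB average thumbnail (made-up figure)
		blockSize  = 4 * 1024 // typical filesystem block size
	)

	// File-per-object: each object occupies at least one full block,
	// plus an inode/metadata entry (ignored here).
	filePerObject := int64(objects) * blockSize

	// Packed volumes: objects are concatenated, so the payload dominates
	// (per-record headers and the index are ignored here).
	packed := int64(objects) * objectSize

	fmt.Printf("file-per-object: ~%d GB\n", filePerObject>>30)
	fmt.Printf("packed volumes:  ~%d GB\n", packed>>30)
	fmt.Printf("wasted padding:  ~%d GB\n", (filePerObject-packed)>>30)
}
```

That is roughly 2 TB of pure padding in the file-per-object layout before counting any metadata overhead, which is part of why packed volume files tend to win for billions of tiny objects.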
When using HDDs, you want to chunk files at about 1MB-10MB. This helps with read/write scaling/throughput etc.
I imagine very large objects you'd like to be able to shard across multiple servers.
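A rough Go sketch of that chunking idea (sizes, names, and the manifest shape are illustrative, not SeaweedFS's actual scheme): split a large object into fixed-size pieces and keep a manifest of chunk references, so each chunk can live on a different volume/server and large reads can fan out in parallel.

```go
// Split a large object into fixed-size chunks and build a manifest of them.
package main

import (
	"bytes"
	"fmt"
	"io"
)

const chunkSize = 4 << 20 // 4 MB, somewhere in the suggested 1-10 MB range

type chunkRef struct {
	index int
	size  int
	// in a real store: volume/server ID, offset, checksum, ...
}

func splitIntoChunks(r io.Reader) ([]chunkRef, error) {
	var refs []chunkRef
	buf := make([]byte, chunkSize)
	for i := 0; ; i++ {
		n, err := io.ReadFull(r, buf)
		if n > 0 {
			// here buf[:n] would be written out to some volume/server
			refs = append(refs, chunkRef{index: i, size: n})
		}
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			return refs, nil
		}
		if err != nil {
			return nil, err
		}
	}
}

func main() {
	blob := bytes.Repeat([]byte("x"), 10<<20) // a 10 MB object
	refs, err := splitIntoChunks(bytes.NewReader(blob))
	if err != nil {
		panic(err)
	}
	fmt.Println("chunks:", len(refs)) // 3 chunks: 4 MB + 4 MB + 2 MB
}
```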
GarageS3 is a nice middle ground: it is not one file on disk per object, but it's also simpler than SeaweedFS.
https://garagehq.deuxfleurs.fr/
One will want to be cognizant that Garage, like recent MinIO releases, is AGPL https://git.deuxfleurs.fr/Deuxfleurs/garage/src/tag/v0.9.1/L...
I'm not trying to start trouble, only raising awareness because in some environments such a thing matters
Garage has no intention to support erasure coding though.
Yes, the Garage source code is very easy to read and understand. Didn't read SeaweedFS yet.
When you had corruption and failures, what was the general procedure to deal with that? I love SeaweedFS and want to try it (Neocities is a nearly perfect use case), but part of my concern is not having a manual/documentation for the edge cases so I can figure things out on the fringes. I didn't see any documentation around that when I last looked but maybe I missed something.
(If any SeaweedFS devs are seeing this, having a section of the wiki that describes failure situations and how to manage them would be a huge add-on.)
The dev is surprisingly helpful, but yeah, I agree the wiki is in need of some beefing up w.r.t. operations.
Why would you base64 encode them? They all store binary formats.