ZFS will never be integrated into the Linux kernel due to its licence. btrfs is complicated to use and has many pitfalls that can lead to it eating your data.
> btrfs is complicated to use and has many pitfalls that can lead to it eating your data
I use btrfs in preference to ext4 for Linux filesystems and turn on zstd compression for performance and a bit of space saving. It seems simple enough for my use case, though I'm not doing any snapshots, etc.
What are some of the potential pitfalls?
I was very excited about btrfs' advanced features, but that meant that btrfs would bite me multiple times when I expected it to 'just work™':
- RAID5/6 are still not stable
- it will not mount a RAID in degraded mode automatically, failing the high availability promise that might tempt you towards RAID.
- swapfile support exists, but it breaks snapshots (and I don't want to snapshot the swapfile)
- Just an Ubuntu/Debian thing, but snapshots are not integrated into the update process unless you install `apt-btrfs-snapshot` (and know that package exists)
- RAID5/6 are still not stable
Why does everyone insist on RAID5? I am completely fine with btrfs RAID1; it survived a few disk crashes as advertised.
- it will not mount a RAID in degraded mode automatically
It will if you use -o degraded; it's the tooling that generally sucks and won't support this. I even had problems booting at all with / on a multi-device btrfs using the recommended tools (dracut, grub-mkconfig). The bugs are known and unfixed, so I ended up rolling my own initramfs.
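For reference, a minimal sketch of the manual workaround (device and mount point are placeholders):

    # mount a btrfs RAID volume read-write despite a missing device; use with care
    mount -o degraded /dev/sdb1 /mnt

    # for a root filesystem, the equivalent is adding rootflags=degraded
    # to the kernel command line so the initramfs passes it through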
- swapfile support exists, but it breaks snapshots
You are supposed to keep all your stuff in a subvolume and snapshot that, not the whole top-level root filesystem... and yes, it should be documented better that a swapfile interferes with snapshots.
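Roughly what that layout looks like in practice (paths and subvolume names here are just illustrative):

    # keep the OS in a subvolume rather than the top-level filesystem
    btrfs subvolume create /mnt/@
    # ...install into it and mount it as / with -o subvol=@ ...

    # snapshots then cover only that subvolume; nested subvolumes
    # (e.g. one holding a swapfile) are not included
    btrfs subvolume snapshot /mnt/@ /mnt/@-2023-11-01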
- but snapshots are not integrated..
Yep, the tooling sucks.
RAID1 is a lot more expensive and/or stores a lot less data than RAID5/6.
How much is the "a lot"?
You can try yourself: https://carfax.org.uk/btrfs-usage/
But generally, RAID1 gets you 50% of the total storage space, whereas RAID5 gets you about 66% with three disks (and it works with any odd combination of disk sizes).
RAID5 should actually get you n-1 disks of space (if they're equally large), while raid1 only gets you n/2
Raid1 actually gets you less than that. If you mirror 5 drives that only gives you 1 drive worth of data (with 4 redundancies).
But… you have to weigh it against losing all your data. With raid5 if any 2 of your disks die at the same time then you've lost your volume and most likely 100% of your data (no matter how many disks you have).
With raid1 you have to lose _all_ your disks before that happens. Typically that's also just 2 but you can mirror 3 or more drives if you need some data to be _really_ resistant to disk failures and you don't care about "wasting" n-1 times the space.
So you end up trading off efficiency of storage space and resiliency to data loss.
Myself, disks got cheap enough that I always just buy 2 disks and mirror them. I find it easier to reason about overall, especially in the face of a degraded array.
We're talking about btrfs RAID1 here, which always gives you half of the total capacity of all disks (except for extreme cases). With 5 drives, that's 2.5 drives' worth of data. If 2 drives die, roughly 1 drive's worth of data is lost (more or less, depending on balance).
> it will not mount a RAID in degraded mode automatically,
I remember when btrfs was very young and they announced the ability to create mirrors. I tested this out and was pleased, and then I tested the failure scenario: pull a drive, try to boot.
It wouldn't boot!
I jumped into IRC and asked whether it was expected that you can't boot from a degraded mirror, and the answer was "not supported", which makes the mirror pointless.
Obviously it has improved since then as there's a way to force it to work, but I returned to ZFS on FreeBSD and never looked back
Yeah, this is probably the worst default for a filesystem that supports RAID.
It's the safest default. If you want a filesystem that automatically downgrades the safety of your data when things go wrong, you should have to opt-in to such behavior (and btrfs does make that possible).
There's less risk of damage from running off a single disk in a faulted mirror than from rebuilding the mirror. The stress the drive is under during a rebuild can and will kill drives at the end of their life.
If one half of a mirrored pair dies, you're left with one copy of the data on a drive that's at risk of also dying, plus whatever portion of the data you have backed up. If you then write new data to that drive, it isn't mirrored anywhere and is also guaranteed to not be backed up yet. So the newly written data is at equal or higher risk than the data that was already on the surviving drive. It's not hard to see how silently accepting new writes that cannot be mirrored could be an unacceptable risk to some end users.
With btrfs, you can easily add a new drive so that new writes can be accepted and mirrored immediately, then start rebuilding the old data with lost redundancy (possibly after running another incremental backup, which will be less stressful to the drive than the full rebuild). But the "add a new drive" step happens outside the kernel, in userspace and possibly in meatspace, so it can't be a default action for the kernel to take.
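As a rough sketch of that flow on a btrfs RAID1 (device paths and the failed device id are placeholders):

    # either replace the dead device (here devid 2) with the new disk directly...
    btrfs replace start 2 /dev/sdc /mnt

    # ...or add the new disk first, drop the missing one, and re-mirror
    btrfs device add /dev/sdc /mnt
    btrfs device remove missing /mnt
    # optional belt-and-braces: convert anything left single back to raid1
    btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt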
If that is a concern, then it sounds like running mirrored disks isn't sufficient for your use case, and maybe you need three disks so that a disk failure and a rebuild don't cause an issue.
Also, if you're happy running off a single disk in a faulted mirror, then I'd question why you've got the mirror set up at all.
Can you not just place the swapfile inside of another subvolume? Since subvolumes are not included in snapshots. There's generally a bunch of stuff in /var that you don't want to include in snapshots, too, so it's not like putting the swapfile inside of a subvolume is an exotic task.
I guess so, I just didn't know when I created the swapfile and was surprised later when I tried to create a snapshot.
Be careful, just like disk images, you probably don't want to place swapfiles on a CoW filesystem.
You can mark individual files as No_COW in btrfs, and No_COW + preallocation is a requirement for swapfiles anyway due to how the swap subsystem works.
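For reference, the usual dance looks roughly like this (size and paths are placeholders; newer btrfs-progs also ship a mkswapfile helper, if I remember correctly):

    # a dedicated subvolume keeps the swapfile out of snapshots
    btrfs subvolume create /swap
    touch /swap/swapfile
    chattr +C /swap/swapfile                              # mark No_COW before writing any data
    dd if=/dev/zero of=/swap/swapfile bs=1M count=8192    # preallocate 8 GiB
    chmod 600 /swap/swapfile
    mkswap /swap/swapfile
    swapon /swap/swapfile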
> Just an Ubuntu/Debian thing, but snapshots are not integrated into the update process unless you install `apt-btrfs-snapshot` (and know that package exists)
Thanks - I did not know about that.
I agree with the stance of not mounting a degraded RAID automatically, as the danger is that someone might not notice it and be subjected to total data loss later on. The best option would be to allow overriding that choice if the RAID is otherwise monitored.
Aside from the issues described further downthread:
• On Btrfs the `df` command lies. You can't get an accurate count of free space.
• There is no working `fsck` and the existing repair tools come with dire warnings. Take these very very seriously. I have tested them. They do not work and will destroy data.
• The main point of Btrfs is snapshots. [open]SUSE, Spiral Linux, Garuda Linux and siduction all use these heavily for transactional updates.
But the snapshot tool cannot test that there's enough free space for the snapshot, because `df` lies. So, it will fill up your disk.
Writing to a full Btrfs volume will corrupt it. In my testing it destroyed my root partition roughly once per year. It was the most unstable fs I have tried since the era of ext2 in the mid-1990s. (Yes I am that old.)
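(The btrfs-specific tools at least show where the space went, even if `df` itself can't be trusted:)

    # per-profile breakdown of allocated vs. used space
    btrfs filesystem df /mnt
    # overall view, including unallocated space and a free-space estimate
    btrfs filesystem usage /mnt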
> There is no working `fsck` and the existing repair tools come with dire warnings
This stems from a misunderstanding. Fsck and fsck-adjacent tools have three purposes:
1. Replay journal entries in a journalled filesystem, so that the filesystem is repaired to a good state for mounting
2. Scrub through checksums and recover any data/metadata that has a redundant copy
3. In rare cases, a fsck tool encountering invalid data can make guesses as to how the filesystem should be structured -- basically, shot-in-the-dark attempts at recovery.
Btrfs does not need #1 because it is not journalled. Assuming write barriers are working, any partially written copy of the filesystem is valid and will simply appear as if the pending writes had been rolled back. This aspect alone greatly diminishes the need for a fsck tool that filesystems like ext4 have.
As for #2, Btrfs already has scrub support. No issues there.
As for #3, it's questionable whether you should ever rely on such functionality, and fsck tools that do implement such functionality tend to have little maneuverability in the first place.
See my reply below.
I disagree.
You seem to be attempting to justify a profound failing by quibbling about the meaning of words or commands.
The real problem is: Btrfs corrupts readily, and it lacks tools to fix the corruption.
What the tools are called, what their functional role is theoretically meant to be, and whether this is justified in a tool of a given name is tangential and relatively speaking unimportant.
>The real problem is: Btrfs corrupts readily, and it lacks tools to fix the corruption.
Here's a recent example of corruption unearthed by users after Fedora started defaulting to btrfs: https://bugzilla.redhat.com/show_bug.cgi?id=2169947.
I thought that the main purpose of fsck was to rebuild the inode table and to put the filesystem in a correct state (all inodes linked in inode table).
When did you run into any of these issues recently? These issues were a fact in the past but have since been sorted out.
df is not lying because the fs layer reports corrupt data; the discrepancy comes from dedup and fs layering.
I left SUSE close to the end of 2021, and I had had to reinstall my work laptop twice that year alone. I consider that recent enough to call it current.
> df is not lying
To me, that reads as "df isn't lying because $EXCUSES."
I disagree. I don't care about excuses. I want a 100% accurate accounting of free space at all times via the standard xNix free-disk-space reporting command, and the same from the APIs that command uses so that applications can also get an accurate report of free space.
If a filesystem cannot report free space reliably and accurately, then that filesystem is IMHO broken. Excuses do not exonerate the FS, and having other FS-specific commands that can report free space do not exonerate it. The `df` command must work, or the FS is broken.
The primary point of Btrfs is that it is the only GPL snapshot-capable FS. The other stuff is gravy: it's a bonus. There are distros that use Btrfs that don't use snapshots, such as Fedora.
Some Btrfs advocates use this to claim that the problems are not problematic. If the filesystem is of interest on the basis of feature $FOO, then "product $BAR does not exhibit this problem" is not an endorsement or a refutation if $BAR does not use feature $FOO.
Btrfs RAID is broken in important ways, but that is not a deal-breaker because there are other perfectly good ways of obtaining that functionality using other parts of the Linux stack. If no feature or functionality is lost considering the OS and stack as a whole, then that isn't a problem. However, it remains a serious issue.
Additional problems include:
• Poor integration into the overall industry-wide OS stack.
Examples:
- Existing commands do not work or give inconsistent results.
- Duplication of functionality (e.g. overlap with `mdraid`)
• Poor integration into specific vendors' OS stacks.
Examples:
- SUSE uses Btrfs heavily.
But SUSE's `zypper` package manager is not integrated with its `snapper` tool. Zypper doesn't include snapshot space used by Snapper in its space estimation.
Snapper is integrated with Btrfs; licence restrictions notwithstanding, I would be much reassured if Snapper supported other COW filesystems.
(This has been attempted but I don't think anything shipped -- https://github.com/openSUSE/snapper/issues/145 . I welcome correction on this!)
The transactional features of SUSE's MicroOS family of distros rely heavily on Btrfs. As such, this lack of awareness of snapshot space utilization deeply worries me. I have raised this with SUSE management, but my concerns were dismissed. That worries me.
What I want to see, for clarity, is for Zypper to look at what packages will be replaced, then ask the FS how big the consequent snapshot will be, and include that snapshot in its space estimation checks before beginning the operation so that at least the packaging operation can be safely aborted before starting.
A better implementation would be to integrate package management with snapshot management so that older snapshots could be automatically pruned to ensure necessary space is made available, while also ensuring that a pre-operation snapshot is retained for rollback. That's harder but would work better.
As it is, currently neither is attempted, and Zypper will start actions that result in filling the disk and thus trashing the FS, and there are no working repair tools to recover.
- Red Hat removed Btrfs support from RHEL. As a result it has had to bodge transactional package management together by grafting Git-like functionality into OStree, then building two entirely new packaging systems around OStree, one for the OS itself and a different one for GUI-level packages. The latter is Flatpak, of course.
This strikes me as prime evidence that:
1. Btrfs isn't ready.
2. Linux needs an in-kernel COW filesystem -- because much of the complexity of OStree, Flatpak, Nix/NixOS, Guix, SUSE's `transactional-update` commands and so on would be rendered unnecessary if it were in there.
> To me, that reads as "df isn't lying because $EXCUSES."
In that case, df also lies on ZFS, bcachefs, and LVM+snapshots. It's in the nature of thin allocation and CoW; if you ask two things sharing storage how much space they have available, that doesn't mean that space is available to both of them at the same time.
The df issue to me seems like an artifact of new data's RAID level being undecided, because one of the planned features is per-subvolume RAID levels; at the very least, metadata and data can have different RAID levels.
Nope. It's still an issue on a single volume on a single disk.
You can use the DUP profile on a single disk. IIRC metadata defaults to that.
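e.g., on a hypothetical single-disk setup:

    # duplicate metadata (the default on a single spinning disk), and
    # optionally data as well, at the cost of halving usable space for it
    mkfs.btrfs -m dup -d dup /dev/sdX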
I've experienced corruption and data loss with Btrfs each of the times I've tried using it, too, after only about a week of use at most.
Thankfully, all of those incidents were with some non-critical, throw-away VMs where the data loss wasn't really an issue.
I've also used ext4 under the same circumstances for years, and I can't think of a single time that I've lost data, nor have I experienced corruption that fsck couldn't easily deal with.
I, too, would have to go back to the 1990s to think of a filesystem I used that was that unreliable.
After what I experienced, I don't trust Btrfs at all, and I have no plans to ever use it again.
I am glad to hear it's not just me!
I worked at SUSE for 4 years and used it every day. The company is in deep denial about its problems, or that there are any problems at all, and when I pointed at ZFS as a more mature tool, this was actually mocked.
One that seems to catch out many people is that you need to explicitly add the "degraded" option to allow mounting raid volumes which contain a broken drive. This is opposite to almost all other filesystems. I've seen people confused and thinking they lost the whole volume, even though they just need to replace the bad drive and rebuild as usual.
Personally, I agree with that. It's good practice for systems to fail quickly if they find themselves in an uncertain state, and with RAID it's important for the operator to know that they've suddenly lost redundancy so that they can resolve the issue.
If you can't even boot because of a mystery error how are you supposed to resolve the issue?
It should give the user error messages on their terminal, in logs, etc., instead of breaking the entire system until the user finds the manual.
If your NAS has a hot spare drive installed then it should probably include the degraded mount option by default and automatically add the hot spare drive to the filesystem in the event of a failure. Alternatively, if there's enough free space on the surviving drives to rebalance the array and restore redundancy without replacing the failed drive, that operation could be kicked off automatically. Or the filesystem can be (hopefully temporarily) set to not store new data redundantly, if that is an acceptable risk for the user. But the filesystem cannot know which method the user would prefer; automatically rebuilding the array involves policy decisions that are outside the scope of the filesystem and requires userspace tooling.
If the system doesn't have spare capacity ready, the only sane response is to not boot/mount normally. "giving the user error messages written to their terminal, in logs, etc" isn't a real solution for something like a NAS with no terminal connected and nobody looking at the logs as long as they can still establish a SMB connection; it's too likely to be a silent failure in practice. Mounting the filesystem degraded but read-only makes sense if it's necessary to boot the system so that the user (or their pre-configured userspace tooling) can decide how to deal with the problem, but a lot of Linux distros aren't happy with the root filesystem being read-only.
In summary: there's no single right answer to the problem of a failed drive, and btrfs defaults to what is the safest behavior based on the information available to the filesystem itself. Userspace tooling with more information can make other, less universal choices. A distro that tries to simply adopt btrfs as a drop-in replacement for ext4 probably doesn't have all the tooling necessary to make good use of the unique features of btrfs.
> If the system doesn't have spare capacity ready, the only sane response is to not boot/mount normally.
It doesn't need the spare to "boot normally" and the system can turn on a scary LED, ring bells, call you, text you, hit you up on WhatsApp, DM you on Instagram, or whatever method you want your NAS to use to notify you there's a degradation. (You're monitoring it right??)
This explanation of "it's dangerous to boot off a degraded array" is lunacy. I will not take this terrible advice from armchair experts when I've been doing this for over 25 years
I didn't say it's dangerous to boot off a degraded array. I said it's dangerous to boot off a degraded array normally. Mounting it degraded but read-only is reasonable, because that prevents silently writing new data without the level of redundancy the user previously requested.
There's nothing terrible about advice against responding to a drive failure by putting the system into an even more precarious state without user interaction.
Just wondering: have you worked on any large DCs or large NAS or SAN systems? Drive failures are a daily occurrence in places with a lot of spinning metal, and having things fail to boot by default would be a nightmare.
> having things fail to boot by default would be a nightmare.
Having things fail to boot would just mean you haven't configured your system appropriately for your environment. If you are using a btrfs RAID filesystem for your root filesystem, and you need that fs to be writeable in order to boot, and you want it to boot even if it's missing a drive, then you need to add an extra mount option and a few lines to your init scripts to persist new downgraded RAID settings in the event a degraded mount was necessary.
But that's hardly the only valid use case for btrfs; plenty of users want strong guarantees about the redundancy of their data rather than silent downgrading.
Also, do you really expect me to believe that any of the large shops still running enough spinning rust to have daily drive failures are still booting off those arrays instead of having separate SSDs as their boot drives? Separate storage of the OS from storage of the important data is such a common and long-ingrained practice that it is embodied in the physical layout of typical server systems, and the primary reason for it is the need for different tradeoffs between performance, redundancy, capacity and cost.
It really depends on your situation. At home I probably want the disks to stay idle until replacement. In a bigger production system I want to keep the availability and ping the on-call person to replace the drive in the background.
Exactly - the default should be to get someone to pay attention to it when something breaks, and if you're planning on high availability, then you can choose that, assuming you keep an eye on the health of the RAID.
> What are some of the potential pitfalls?
RAID-5/6:
* https://btrfs.readthedocs.io/en/latest/btrfs-man5.html#raid5...
Wouldn't describe that as a pitfall. As far as I can tell, every place in the docs/help which talks about those RAID modes tells you they're experimental and shouldn't be used. At that point it's "you've done a stupid thing and discovered the consequences you were told about".
It's a feature that other file systems have: why should I choose Btrfs when (e.g.) ZFS can do everything it can and more on top of that?
I remember when Btrfs was announced in 2007 as I was already running Solaris 10 with ZFS in production (and ZFS had non-"experimental" RAID-5-like RAID-Z from day one). Here we are 15+ years later and Btrfs still doesn't have it?
I'm just saying it's not a pitfall, because it's clearly documented up front and warned about. That's orthogonal to whether you should choose this filesystem and why a feature is implemented or not.
The nice thing about btrfs is that you can add/remove drives at will. That allows for easy expansion of an existing pool.
The one thing that ZFS cannot do is defragment a pool.
There are abuse patterns that are toxic for ZFS pools (and all other filesystems). Btrfs appears to be able to repair this damage.
https://www.usenix.org/system/files/login/articles/login_sum...
Eh, 10 years ago RAID56 was already declared "pretty much ready"; that hint was added later, when major data-loss bugs were discovered.
It performs poorly for certain workloads -- notably DBs -- unless you disable copy-on-write, compression, and checksumming. But then why use it over ext4 in the first place?
That's not surprising though, as DBs usually work best when given raw storage - the features of the filesystem are being duplicated by the DB and thus doing almost twice the work.
What seems particularly interesting about Bcachefs is how much it seems to be using database concepts to implement a general filesystem. Ultimately, it seems inevitable that filesystems and databases will converge as they're both supposed to manage data.
The "fix" to databases and VMs on btrfs by disabling CoW unfortunately disables nearly all the useful features.
Just an anecdote, but when I used gocryptfs on a btrfs partition, I'd always end up with a few corrupted files on power failure. After switching to gocryptfs on ext4, I never have any corruption.
I have used the combination btrfs+gocryptfs heavily[1] for many years and had no problems. The only quirk is that I have to pass the -noprealloc flag to gocryptfs, otherwise the performance is really bad.
[1] by heavily, I mean that I use it for my home directory
I used it for my home directory too for several years up until a few months ago, but with default settings. The corrupted files were almost always Firefox cookies db and cache files. Maybe there's something specific about how Firefox writes them that makes them prone to corruption.
With ZFS, if a redundant member of a pool goes offline for a time but is then returned, only the updated blocks are written to bring it up to date, and this happens automatically.
Unfortunately, btrfs is not that smart, and you must trigger a rebalance event to rewrite every block in the filesystem to return to full redundancy.
This rebalance behavior is a deal-killer for many uses.
IMHO bcachefs has important advantages over ZFS. It is far more flexible. ZFS is really similar to traditional block-based RAID. You can get pretty flexible configurations but 1. They are largely fixed after creation and 2. They only operate at "dataset" level granularity.
bcachefs has a really flexible design here where you basically add all of your disks to the storage pool and then pick redundancy and performance settings per folder (arbitrary subtrees, not just datasets decided at setup time) or even per file. For example, you can configure a default of 2 replicas for all data, but set your cache directory to 1 replica. If you have an important documents folder you can set that to 3 replicas, or 4+2 erasure coding.
Similarly you can tell it to put your cache folder on devices labeled "ssd" but your documents folder should write to "ssd" but then be migrated to "hdd" when they are cold.
And again, all of this can be set at any time on any subtree. Not just when you initially set up your disks or create the directories.
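If I'm reading the principles-of-operation manual right, that looks roughly like this (directory names are made up, and I haven't tested these exact commands myself):

    # per the bcachefs principles-of-operation doc; syntax may differ by version
    # filesystem-wide default set at format time, e.g. --replicas=2; then override per subtree:
    bcachefs setattr --data_replicas=1 ~/.cache
    bcachefs setattr --data_replicas=3 ~/Documents
    # keep hot data on the ssd group, migrate cold data to hdd in the background
    bcachefs setattr --foreground_target=ssd --background_target=hdd ~/Documents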
The #1 point of ZFS is protection against bitrot, i.e. checksums on all data. Does bcachefs do this?
It's literally the second item listed on the main web site
> Full data and metadata checksumming
Thanks, that's good news. Unfortunately it seems bcachefs isn't using pools like ZFS but instead, like btrfs, creates file systems directly on a collection of disks. That's a bummer if true.
What are the use cases where you find micromanaging vdevs worthwhile compared to automatic storage allocation that can be changed and rebalanced later? bcachefs does offer several per-device controls over data placement and redundancy that btrfs lacks; are those still not enough?
Exactly, I _don't_ want to micromanage devices so I throw them all into a pool (with defined redundancy) and then I allocate file systems and vdevs (for iSCSI) from that. With btrfs (and bcachefs) I have to manually assign devices to file systems and I can't have multiple file systems (well, you can have sub-volumes but you can't avoid a big file system in that collection of devices).
Maybe I'm missing something, but the pool abstraction makes this very clean and clear.
> so I throw them all into a pool (with defined redundancy) and then I allocate file systems and vdevs (for iSCSI) from that.
That sounds backwards. Don't you have to manually define the layout of the vdevs first in order to establish the redundancy, and then allocate the volumes you use for filesystems or iSCSI? If you just do a `zpool create` and give it a dozen disks and ask for raidz2, you're just creating a single vdev that's a RAID6 over all the drives. There's an extra step compared to the btrfs workflow, but if you're not using that opportunity to micromanage your array layout I don't see why you'd prefer that extra step to exist.
> and I can't have multiple file systems (well, you can have sub-volumes but you can't avoid a big file system in that collection of devices).
Isn't this a purely cosmetic complaint? With at least btrfs, you don't even have to mount the root volume, you can simply mount the subvolumes directly wherever you want them and pretty much ignore the existence of the root volume except when provisioning more subvolumes from the root. You can pretend that you do have a ZFS-style pool abstraction, but one that's navigable like a filesystem in the Unix tradition instead of requiring non-standard tooling to inspect.
If you want volume manager features, you can put btrfs/bcachefs on top of LVM.
bcachefs currently has full metadata and data checksumming, but there is no scrub implementation. This will likely be in the works soon now that it has been merged.
If you mix a small SSD and a large HDD into a single bcachefs pool, and set it to 2 replicas, do writes have to succeed on both devices before returning success? I.e. is performance constrained by the slowest device? And what happens when the small SSD fills up? Does it carry on writing to the HDD but with only one replica, or does it report no space left?
Are you trying to ask if bcachefs lets you set up stupid configurations? If you only have two devices and you configure it use replication to store two copies of all your data, then you are unavoidably constrained by the capacity of the smaller device. Whether the smaller device is a hard drive or SSD is irrelevant, because neither copy can be regarded as a discardable cache when they're both necessary to maintain the requested level of redundancy.
I am trying to understand bcachefs by asking about an edge case that might be illustrative.
> you are unavoidably constrained by the capacity of the smaller device
Sure, so what does bcachefs actually do about it? ENOSPC?
Answers to the question on synchronous write behavior also welcome.
> I am trying to understand bcachefs by asking about an edge case that might be illustrative.
It looks more like you're questioning whether the filesystem that Linus just merged does obviously wrong things for the simplest test cases of its headline features. Has something given you cause to suspect that this filesystem is so deeply and thoroughly flawed? Because this doesn't quite look like trying to learn, it looks like trying to find an excuse to dismiss bcachefs before even reading any of the docs. Asking if everything about the filesystem is a lie seems really odd.
> Asking if everything about the filesystem is a lie seems really odd.
This person was asking pertinent questions; no need to bash them.
I lost some data with XFS after it was declared "stable" in the Linux kernel. Every day at around 00:00 the power would fail for around 2 seconds. Other filesystems (ext2, jfs, reiser) would do a fsck, but xfs was smarter. After 2 or 3 crashes the xfs volume would not be usable anymore (no fsck possible).
So yes, some of us do need more than a "trust me, it's ok" when we are talking about our data.
You explicitly configure the cache, as laid out here: https://bcachefs.org/Caching/
In the common case that you mentioned, data present on the full SSD would be overwritten "in standard LRU fashion"; meaning the "Least Recently Used" data would no longer be cached. New data would be written to the SSD while a background "rebalance thread" would copy that data to the HDD. I assume that the "sync" command would wait for the "rebalance thread" to finish, though I will admit my own ignorance on that front.
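For the curious, the setup described on that page looks roughly like the following (device names and labels are placeholders):

    # two devices labelled as the "ssd" group, two as "hdd"; foreground writes
    # and the promote cache go to ssd, a rebalance thread moves cold data to hdd
    bcachefs format \
        --label=ssd.ssd1 /dev/nvme0n1 \
        --label=ssd.ssd2 /dev/nvme1n1 \
        --label=hdd.hdd1 /dev/sda \
        --label=hdd.hdd2 /dev/sdb \
        --replicas=2 \
        --foreground_target=ssd \
        --promote_target=ssd \
        --background_target=hdd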
that actually sounds quite neat
It is (soon) officially in the kernel, as opposed to ZFS.
It has a more limited feature set and is said to have a simpler codebase than ZFS/btrfs. It has a single outstanding non-stable feature.
It seems to be in active development, while btrfs seems to have become stagnant/abandonware before it was finished/stabilised completely. I have read several horror stories about data loss, so I have avoided it so far.
On the other hand it is not widely deployed yet, there is less accumulated knowledge than in case of zfs.
I'm looking forward to trying it in my NAS when buying new disks next year. The COW snapshots would fit my needs (automatic daily snapshots, weekly backups).
(Now using LUKS+LVM+ext4, this would give a better, more integrated, deduplicated solution, I have lots of duplicated data right now)
> while btrfs seems to have become stagnant/abandonware before it was finished/stabilised completely
Why would you think so? I can't remember the last time a kernel was released without something at least a bit exciting about btrfs
https://kernelnewbies.org/LinuxChanges#Linux_6.5.File_system...
Because they still haven't fixed the write hole issues in certain RAID configurations, which have been known for a decade or more.
A project not finishing some feature is not the same as being abandoned. It seems lots of people are happy to use btrfs in production without that raid mode. In other words, for all the complaining about raid5 that happens every time btrfs is mentioned, you'd think there would be at least one person who cares enough to implement it. Yet people use it in production and keep improving the other parts of that project instead.
It's fine for a single disk or something that presents as a single disk, like a SAN. RAID1 seems fine also. I really wanted to love it, but after it ate my data a couple of times, I gave up trying to use it for mass storage. At that time they didn't have warnings in the documentation and had stated that RAID5/6 were basically complete.
I found using mirrored vdevs in ZFS much easier to manage and much more stable.
RAID1 in Btrfs is not entirely fine. It won't nuke your data and there is no write-hole issue, but if a disk fails you'll have to go into a read-only mode during the rebuild and deal with various hurdles in getting it rebuilt.
> but if a disk fails you'll have to go into a read-only mode during the rebuild and deal with various hurdles in getting it rebuilt.
I can't say for sure that this never happens, but that's certainly not been the failure mode for any of the drive failures my btrfs RAID1 has experienced. I don't think I've ever needed to reboot or even remount my filesystem, just replace the failed drive (physically, then in software). But I always have more than two drives in the filesystem, so a single drive failure only puts a fraction of my data at risk, not everything.
> if a disk fails you'll have to go into a read-only mode during the rebuild
That's not correct, you can rebuild online. The readonly mode is only relevant when you reboot during the failure and don't have the right options set on the volume.
But you totally can replace a live drive without affecting the availability.
> I found using mirrored vdevs in ZFS much easier to manage and much more stable.
That's not exactly a fair comparison. If you restricted your usage of btrfs to a similarly narrow range of features, you would probably have had a much better experience.
> I can't remember the last time a kernel was released without something at least a bit exciting about btrfs
They are fixing the fixable issues, but the on-disk format still makes some gotchas inevitable. It sounds like there's never going to be a great solution to live rebuilding of redundancy.
The main advantage, if I understood it correctly, is supposed to be performance. The promise is to have similar speeds to ext4/xfs with the feature set of btrfs/ZFS. While that sounds nice it took a lot of time to get it stable and upstreamed. Like any FS you might not want to go with the latest shiny thing but there are some that are willing to risk it, similar to debates around Btrfs vs ZFS.
The last benchmarks from Phoronix are a few years old but look promising: https://www.phoronix.com/review/bcachefs-linux-2019
From that benchmark article
> The design features of this file-system are similar to ZFS/Btrfs and include native encryption, snapshots, compression, caching, multi-device/RAID support, and more. But even with all of its features, it aims to offer XFS/EXT4-like performance, which is something that can't generally be said for Btrfs.
I was surprised at that, as I believed that btrfs is generally faster than ext4. Looking ahead to the last page, the geometric mean of the benchmarks supports that view too.
I worked on Ceph, a distributed storage system, for a while. Here's what we learned from benchmarks (over a decade ago):
Btrfs goes very fast at first but slows down when it has to start pruning/compacting its on-disk btree structures, and then the performance suffers bad. Thus, btrfs works best when you have a spiky workload that lets it "catch up", and you never fill the disk. Concrete example: historically, removing a snapshot while under load was a disaster, with IO waits over 2 minutes. So, both sides are correct: btrfs is very fast and btrfs is very slow.
XFS was never crazy fast, and has the smallest feature set of bunch, but it just kept chugging at the same pace with almost no change, regardless of what the workload did. In more complex use, you had to avoid triggering bad behavior; e.g. there was a fixed number of write streams open, something like 8, and if you had more than that many concurrent writes going your blocks got fragmented. It was very much a freight train; not particularly fast but very predictable performance, and no serious degradation ever.
ext4 was sort of in between those; mostly very fast, with some hiccups in performance. Great as long as your storage is 100% reliable -- we had scrubbing in-product on top of the filesystem.
We ended up recommending xfs to most customers, at the time. Predictability trumped minor gain in performance, for most uses.
I had a test case with ext4 where it would take 80 seconds to write out an 8MB file on a 16TB filesystem. ext4 does not handle free space fragmentation very well as it still relies on block group bitmaps. Oh the legacy baggage it carries...
The geometric mean on the last page shows ext4 to be faster than btrfs.
Oops - I was thinking that smaller was better.
Working deduplication would be amazing! ZFS has deduplication, but every time I've tried it has ended in a world of pain. Maybe they've fixed it in a more recent release, but the amount of RAM required for deduplication always outstripped the amount of RAM I had available to give it (the deduplication tables have to reside in RAM).
ZFS now has reflink support, which doesn't require lots of RAM, but isn't done automatically while writing. You need to run something like https://github.com/markfasheh/duperemove
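Rough usage, for reference (the -d flag actually submits the dedupe ioctls rather than just reporting what it found):

    # find duplicate extents under /data and deduplicate them in place
    duperemove -dhr /data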
The deduplication tables can be put in a special vdev now, I think.
I tried to explain in this piece:
> It feels like its feature set is equivalent to ZFS.
It does something that ZFS can't: be merged into the kernel.
Am I correct to assume this is GPL and not BSD or MIT? If this is GPL, I guess there is no chance this will ship with BSD. And ZFS is sadly (but not deal-breakingly) CDDL.
BSD and GPL are compatible though; you just have to ship the resulting work under GPL ;)
Thus making it just as bad as CDDL+GPL; you can do it out of tree, but it'll never get mainlined.
You mean like ZFS and Dtrace never got mainlined in FreeBSD?
CDDL is copyleft but not viral, which makes it much easier (read: politically possible) to include in a permissively licensed project than anything under a viral copyleft license like GPL.
FreeBSD has tons of GPL code, including in the kernel. There's been some effort to move away from that, but it's far from a foregone conclusion that it's impossible to merge new GPL code.
A bigger question would be "why bcachefs when we have a stable ZFS in base?"
... which in the context of a BSD OS ever supporting a bcachefs drive, means bcachefs is not compatible with BSD OSes given it SPDX identifies as GPL-2.0: https://evilpiepirate.org/git/bcachefs.git/tree/fs/bcachefs/...
Yes, you can bundle GPL software with BSD software, much the same as I can bundle EULA'd software with BSD software: under the stricter terms. Obviously that's not the intent of the question, but sure, it's theoretically possible in some future fictional world where the BSDs are GPL'd.
You don't need to re-license the BSD code to GPL, as that code is not derived from the GPL'd bcache. Distributing binaries is the only tricky bit: a kernel with GPL code would need to be distributed as GPL. This can be solved with loadable kernel modules. There isn't much stopping anyone from using the code from a license perspective.
The problem is this sort of code is strongly tied to the operating system, and porting it would require significant effort, if even feasible at all.
Which BSD? ;) DragonFly has HAMMER2, and the CDDL doesn't seem to be negatively impacting the development and integration of ZFS in FreeBSD.
I too am not a big fan of the GPL, especially if the text of the license is longer than the program itself. But any filesystem (let alone a modern one) is very much a non-trivial feat of engineering; the author should have the full right to protect their (and their users') interests.
I am just thinking if we could have one decent FS across both BSD and Linux. And it would still be better than CDDL.
If you just want to exchange data on removable media, use ExFAT. What problem are you trying to solve?
Different OS's prefer different filesystems, because filesystems tend to be both complicated and heavily opinionated in design and implementation - just like different OS's. Linux is the odd one by supporting several dozen, all the other OS's stick to 1 or 2 (usually "old" and "new" like HFS+/APFS, FAT/NTFS, etc), plus UDF&FAT as the lowest common denominator for data interchange. There is very little precedent / use cases for sharing volumes like you suggest: non-removable disks tend to stay in one machine for their lifetime; dual-booting is extremely niche (where Linux/BSD themselves are all niche) and mostly a domain of enthusiasts.
Sure, but something as mundane as licensing being the reason we can't even try is kinda depressing. One ZFS-grade filesystem that could have Windows, Linux, and BSD implementations, allowing robust external storage with all the promises of modern snapshotting filesystems? Ah. What a dream.
Maybe in 10-20 years if someone white-rooms a bcachefs or ZFS implementation, I guess.
Again, CDDL is effectively "free enough" (it's basically a fork of MPL/weak copyleft) that both FreeBSD and Linux (although not mainline) incorporate ZFS. Whether it can be mixed with GPL is a very good question, the facts are that 1. it wasn't the intention for it to be incompatible, 2. it was never tested in court, 3. several popular distros nowadays ship the spl.ko & zfs.ko binaries.
I recommend this talk by Bryan Cantrill, which provides more context on the whole licensing story for Solaris: <https://www.youtube.com/watch?v=-zRN7XLCRhc&t=1375s> TL;DW: lots of effort and good will was put into making this code as free as possible, with the intention of making it broadly usable.
There are other reasons why e.g. OpenBSD won't adopt ZFS, the main one being the sheer complexity: <https://flak.tedunangst.com/post/ZFS-on-OpenBSD>. Again, different projects, different goals.
Apple heavily considered ZFS (even advertised it as an upcoming feature), but then gave in to NIH. Probably because it didn't fit their plan for mobile, and seeing the runaway success in that dept you can't really blame them.
But we digress! Is ZFS even a good system for removable media? Heck no. What do you want to use it with? Digital cameras? Portable music players? 2007 called and wants its toys back. Backups? Yeah, that can work, but there's little value in being able to recover backups on a foreign OS. Giving someone a random file on a thumb drive? Use ExFAT (or even plain old FAT32), just keep it simple!
From the FAQ https://bcachefs.org/FAQ/ :
> Bcachefs is safer to use than btrfs and is also shown to outperform zfs in terms of speed and reliability
So what makes it more reliable? I can't find a simple overview of the design / reasoning behind the whole thing and what makes it 'better' than the rest.
> Bcachefs is safer to use than btrfs
That is a pretty bold claim given that Facebook runs btrfs in prod across the majority of their fleet and almost nobody uses bcachefs.
Facebook et al generally structure their systems to tolerate node failures through redundancy at higher levels. In short: they can afford not to care about, or can work around, design problems most others can't, or they use it for a specific purpose, such as logging, etc.
Btrfs was terrible in the early days, took ages to "git gud", and given a filesystem is supposed to be among the most stable code in the OS, that burned a lot of bridges. It wasn't until fairly recently that btrfs could tolerate being completely filled.
I have no idea how valid the claims are, but bcache's developer claims that btrfs suffers from a lot of terrible early design decisions that can't be undone.
The show-stopper for me is that bcachefs lacks a scrub:
> We're still missing scrub support. Scrub's job will be to walk all data in the filesystem and verify checksums, recovery bad data from a good copy if it exists or notifying the user if data is unrecoverable.
The only argument I see for btrfs is that it supports throwing random drives into a pool and btrfs magically handles redundancy across them.
...but it doesn't support tiered storage like bcachefs does. We're well past "I want to have a redundant filesystem I can randomly add a drive to and magically my shit is mirrored." These days people want to have a large pile of spinning rust with some SSD in front of it, and doing that in all but ZFS is kind of a pain.
Can't you do most of what scrub would do with something like:
    find . -xdev -type f -print0 | xargs -0 cat > /dev/null

(ideally replacing cat with something that continues after errors)

Citing the origin of "move fast and break things" as the exemplar of safety is itself pretty bold.
I only read that they use it for their build servers and use the snapshotting feature for quickly resetting the build environments. It's not a common use of a filesystem imo
Could just mean that RAID5/6 isn't broken...
> Bcachefs is safer to use than btrfs
Citation needed. With a sample size of 1 it ate my data and BTRFS has been running perfectly fine on that system (after I bailed off of bcachefs) and other systems. I think it is great that they consider data safety very important but it will take lots of testing and real-world experience to validate that claim.
Well, keep in mind that language has remained unchanged for a few years now. Certainly BcacheFS has some theoretical advantages (lack of write hole etc.) but BTRFS has nonetheless improved since then and BcacheFS has gotten more complex.
It's still a very exciting filesystem so I'm sure we'll be seeing third parties test it rigorously very soon.
The bcachefs architecture overview is here: https://bcachefs.org/Architecture/
The claim for reliability comes from the idea that bcache has been in heavy production use for a decade, and considered rock solid with plenty of testing of corner cases, and that bcachefs builds a filesystem over the bcache block store so that most of the hard things (like locking and such) are managed by the underlying block store, not the bcachefs layer. This way, the filesystem code itself is simple and easy to understand.
One pet peeve is with some of the nomenclature/UI that they are using (§2.5):
> Snapshots are writeable and may be snapshotted again, creating a tree of snapshots.
* https://bcachefs.org/bcachefs-principles-of-operation.pdf
> bcachefs provides btrfs style writeable snapshots, at subvolume granularity.
* https://bcachefs.org/Snapshots/
Every other implementation of the concept has the implicit idea that snapshots are read-only:
* https://en.wikipedia.org/wiki/Snapshot_(computer_storage)
The word "clone" seems to have been settled on for a read-write copy of things.
I'm not sure that's completely standard terminology. For example LVM has "snapshots" that are read-write.
From lvcreate(1) on an Ubuntu 22.04 LTS system I have CLI on:
    -s|--snapshot
           Create a snapshot. Snapshots provide a "frozen image" of an
           origin LV. The snapshot LV can be used, e.g. for backups,
           while the origin LV continues to be used. This option can
           create a […]
* Also: https://manpages.ubuntu.com/manpages/lunar/en/man8/lvcreate....

The word 'frozen' to me means unmoving / fixed.
As mentioned in the Wikipedia article, the analogy comes from photography where a picture / snap(shot) is a moment frozen in time.
I've admined NetApps in the past, used Veritas VxFS back in the day, and currently run a lot of ZFS (first using it on Solaris 10), and "snapshot" has meant read-only for the past few decades whenever I've run across it.
The LVM howto https://tldp.org/HOWTO/LVM-HOWTO/snapshotintro.html says "In LVM2, snapshots are read/write by default". And RedHat's docs https://access.redhat.com/documentation/en-us/red_hat_enterp... include text "Since the snapshot is read/write".
So I think LVM2 snapshots are indeed read/write. Perhaps that manpage sentence was not updated since LVM1 read-only snapshots ?
(I agree with you that 'snapshot' to me strongly suggests read/write; I'm just saying that you can't actually rely on that assumption because it's not just bcachefs that doesn't use that meaning.)
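e.g., a classic thick LVM snapshot, which is indeed writable by default unless you ask otherwise (names and sizes are placeholders):

    # reserve 1 GiB of COW space for changes to either the origin or the snapshot
    lvcreate --snapshot --size 1G --name home_snap vg0/home
    mount /dev/vg0/home_snap /mnt/snap    # read-write; pass --permission r for read-only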
The man page continues:
> This option can create a COW (copy on write) snapshot, or a thin snapshot (in a thin pool.) [...] COW snapshots are created when a size is specified. The size is allocated from space in the VG, and is the amount of space that can be used for saving COW blocks as writes occur to the origin or snapshot.
Likely snapshots _were_ originally read-only, and the description of creating thin and COW snapshots was added later, but the man page text was not re-written completely; rather the description of thin and COW snapshots were added to the end of the existing text.
I've just noticed that I wrote "I agree with you that 'snapshot' to me strongly suggests read/write" and of course I meant "read only"! Hope that wasn't too confusing...
It is basically universal in the enterprise storage world.
I'd call a RW snapshot a "fork".
How does this compare to file versioning?
Exciting!
I've played around with it a few times, since it's been easily available in NixOS for a while. Didn't run into any issues with a few disks and a few hundred GB of data.
Some very interesting properties, including (actually efficient) snapshots, spreading data over multiple disks, using a fast SSD as a cache layer for HDDs, built-in encryption (not audited yet though!), automatic deduplication, compression, ...
A lot of that is already available through other file systems (btrfs, zfs) and/or by layering different solutions (LVM, dm-crypt, ...), but getting all of it out of the box with a single FS that's in the mainline kernel is quite appealing.
I don't find this lack of separation of concerns appealing, especially for crypto since it dilutes auditing resources. And blockdev-level crypto is simpler and harder to fuck up.
Snapshots and compression OTOH are better done on the FS level.
> blockdev-level crypto is simpler and harder to fuck up.
Simpler, yes, but since generally no extra metadata is allocated, it is vulnerable to some cryptanalysis, notably comparing different snapshots of the drive. Doing this properly requires storing unique keys for different versions of data. Doing that with typical blockdev-level encryption is very expensive (you either need to reduce the effective block size, which disrupts lots of software that assumes things about block size, or store the data out-of-line, typically at the end of the disk, which requires up to 2x writes). Doing it in the filesystem allows strong encryption with minimal performance impact (as the IV write is co-located with data that is changing anyway).
> or store the data out-of-line [...] which requires up to 2x writes
and because writes to separate sectors aren't atomic, you probably want to add journaling or some kind of CoW for crash safety, and oh look now you're actually just writing a filesystem and it's not simpler anymore.
No, you're writing something that happens to resemble a small subset of the functionality of a filesystem.
Most importantly, you're not duplicating the effort for every filesystem that you want to support encryption for, and the code can largely remain fixed once mature.
Kernel developers are free to factor out common functionality, calling into it library-style, without making it into an externally visible layer. IIRC fs encryption and case insensitivity are often done that way, and I think I'd count the page cache and bios as larger library-style components as well (as opposed to the VFS layer, which is more in a framework style).
It should be highlighted that all these features are supported by btrfs too, which has certainly seen much more testing in recent years.
Yes, there have been issues in the past, but it has been stable and feature-rich for quite a while now. Easily the most advanced free filesystem.
btrfs does not have the fast-disk cache feature I think, and definitely not the built-in encryption.
It also doesn't do block volumes or active dedupe. Btrfs is great, but I'm very excited about having all these features in one place!
As I said in the last thread, the author could use support: https://www.patreon.com/join/bcachefs
This has been a solitary and largely self-financed effort by Kent over many years. He must feel pretty great to finally see this happen!
When I became a supporter yesterday, the average was under $10 per supporter per month. There's no shame in being a small supporter.
Edit: I currently see 279 supporters for a total of $2,328 per month, so $8.34 average per month per supporter.
It's currently 428 @ $2,382, about $5.50 per supporter.
The pre-populated options are $20 and $100 a month, which is a lot. I think he'd get a lot more supporters if he dropped those asks to more like $1 and $5.
It's not like more people will pay at $1; most probably the same number will pay, but he'll just get 10% of the funding.
Huh 10%???
I can't even see the pay what you like feature on the app.
Not sure about the app, but on the website, it was below the tiers. On my laptop screen, the pay-what-you-like option was below the fold.
I used to support him for $1/month previously, I think it was a specific tier.
From memory I stopped because I was adding other creators and I got the impression he was doing okay.
I should have said "paid members" instead of "supporters" to be more precise. He's currently at 432 members, but of those only 283 are paid members.
Either way, $5 is probably both the median and the mode for the payment distribution, whether you include non-paying members or not.
It's surprising no one has mentioned the fact that ZFS doesn't support suspend/resume. So it's a fat no-go for laptops, whereas btrfs and hopefully bcachefs can shine bright supporting all cool features on my laptop so I can learn by playing with them.
(LVM + LUKS + BTRFS does it for me right now)
Why wouldn't ZFS support suspend/resume? Me and several other people I know happily suspend/resume a laptop with root-on-ZFS with Debian or FreeBSD every day without problems.
https://github.com/openzfs/zfs/issues/260
I stand corrected; last time I checked they didn't support freeze/thaw on ZoL. It's always worked on FreeBSD, though.
So not Bca chefs!
I can't unsee this. Thanks!
Previous discussion: https://news.ycombinator.com/item?id=38071842
Nice and congrats!
How stable is it to use for day-to-day desktop tasks?
Finally! I've been waiting to give this a try. I thought about adding it manually, but didn't realize it required patching and recompiling my kernel, which wasn't a worthwhile endeavour to me.
I love storage and filesystems, and I am really looking forward to playing with bcachefs. Now bcachefs can be tested easily by millions and can build a solid reputation as a true next-generation Linux filesystem.
Brilliant, I played with this for Ceph OSD's waaaaay back and it worked quite well, albeit a little fragile to deploy.
Having just heard of bcachefs when reading this article, I tried to understand what makes it better than other existing filesystems but couldn't quite find a clear answer. It feels like its feature set is equivalent to ZFS.
Do you guys know why someone should get excited by bcachefs?