
Ceph: A Journey to 1 TiB/s

66 replies

Does anyone have experience running ceph in a home lab? Last time I looked into it, there were quite significant hardware requirements.

29 replies

There still are. As someone who has done both production and homelab deployments: unless you are specifically just looking for experience with it and just setting up a demo - don't bother.

When it works, it works great - when it goes wrong it's a huge headache.

Edit: If distributed storage is just something you are interested in, there are much better options for a homelab setup:

- seaweedfs has been rock solid for me for years in both small and huge scales. we actually moved our production ceph setup to this.

- longhorn was solid for me when i was in the k8s world

- glusterfs is still fine as long as you know what you are going into.

14 replies

I just want to hoard data. I hate having to delete stuff to make space. Things disappear from the web every day. I should hold onto them.

My requirements for a storage solution are:

Single root file system

Storage device failure tolerance

Gradual expansion capability

The problem with every storage solution I've ever seen is the lack of gradual expandability. I'm not a corporation, I'm just a guy. I don't have the money to buy 200 hard disks all at once. I need to gradually expand capacity as needed.

I was attracted to Ceph because it apparently allows you to throw a bunch of drives of any make and model at it and it just pools them all up without complaining. The complexity is nightmarish though.

ZFS is nearly perfect but when it comes to expanding capacity it's just as bad as RAID. Expansion features seem to be just about to land for quite a few years now. I remember getting excited about it after seeing news here only for people to deflate my expectations. Btrfs has a flexible block allocator which is just what I need but... It's btrfs.

3 replies

On a single host, you could do this with LVM. Add a pair of disks, make them a RAID 1, create a physical volume on them, then a volume group, then a logical volume with XFS on top. To expand, you add a pair of disks, RAID 1 them, and add them to the LVM. It's a little stupid, but it would work.
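A rough sketch of the command sequence this describes, written as a Python dry run that only prints the commands. The device names, the volume group name `hoard`, the logical volume name `data`, and the mount point are all hypothetical; the `mdadm`, LVM, and XFS commands are the standard ones, but check them against your distro before running anything.

```python
# Dry run only: builds and prints the commands, never executes them.
# All device, volume group, and mount point names are made-up examples.

def initial_setup(disk_a, disk_b, md="/dev/md0", vg="hoard", lv="data"):
    """Commands for the first mirrored pair: RAID 1 -> PV -> VG -> LV -> XFS."""
    return [
        ["mdadm", "--create", md, "--level=1", "--raid-devices=2", disk_a, disk_b],
        ["pvcreate", md],
        ["vgcreate", vg, md],
        ["lvcreate", "-l", "100%FREE", "-n", lv, vg],
        ["mkfs.xfs", f"/dev/{vg}/{lv}"],
    ]

def expand(disk_a, disk_b, md, vg="hoard", lv="data", mountpoint="/mnt/hoard"):
    """Commands for each later pair: new RAID 1, add it to the VG, grow LV and XFS."""
    return [
        ["mdadm", "--create", md, "--level=1", "--raid-devices=2", disk_a, disk_b],
        ["pvcreate", md],
        ["vgextend", vg, md],
        ["lvextend", "-l", "+100%FREE", f"/dev/{vg}/{lv}"],
        ["xfs_growfs", mountpoint],  # XFS grows online, via the mount point
    ]

if __name__ == "__main__":
    plan = initial_setup("/dev/sdb", "/dev/sdc")
    plan += expand("/dev/sdd", "/dev/sde", md="/dev/md1")
    for cmd in plan:
        print(" ".join(cmd))
```

Each pair is its own RAID 1, so pairs can differ in size from one another; only the two disks inside a pair need to match.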

If multiple nodes are not off the table, also look into seaweedfs.

Also consider how (or if) you are going to back up your hoard of data.

2 replies

Also consider how (or if) you are going to back up your hoard of data.

I actually emailed backblaze years ago about their supposedly unlimited consumer backup plan. Asked them if they would really allow me to dump into their systems dozens of terabytes of encrypted undeduplicable data. They responded that yes, they would. Still didn't believe them, these corporations never really mean it when they say unlimited. Plus they had no Linux software.

1 replies

these corporations never really mean it when they say unlimited. Plus they had no Linux software

Afaik they rely on the latter to mitigate the risk of the former.

0 replies

Considering the fact that most data heavy servers are Linux, that would be a pretty clever way of staying true to their word.

2 replies

ZFS is nearly perfect but when it comes to expanding capacity it's just as bad as RAID.

if you don't mind the overhead of a "pool of mirrors" approach [1], then it is easy to expand storage by adding pairs of disks! This is how my home NAS is configured.


0 replies

This is also exactly how mine is done. Started off with a bunch of 2TB disks. I've now got a mixture of 16TB down to 4TB, all in the original pool.

0 replies

50% storage efficiency is a tough pill to swallow, but drives are pretty big and the ability to expand as you go means it can be cheaper in the long run to just buy the larger, new drives coming out than pay upfront for a bunch of drives in a raidz config.
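To put rough numbers on that trade-off, here is a small sketch comparing usable capacity for a pool of mirrors against a single raidz group. It ignores filesystem overhead and padding, and the drive sizes are invented for illustration.

```python
# Simplified capacity model: ignores metadata, padding, and reserved space.

def mirror_pool_usable(mirrors):
    """mirrors: list of tuples of drive sizes in TB, one tuple per mirror vdev.
    A mirror holds as much as its smallest member."""
    return sum(min(m) for m in mirrors)

def raidz_usable(drive_tb, n_drives, parity):
    """Equal-sized drives in one raidz group with `parity` parity drives."""
    return (n_drives - parity) * drive_tb

if __name__ == "__main__":
    pool = [(2, 2), (2, 2)]           # start with two small mirrors
    print(mirror_pool_usable(pool))   # 4 TB usable of 8 TB raw
    pool.append((16, 16))             # later: add one pair of big drives
    print(mirror_pool_usable(pool))   # 20 TB usable of 40 TB raw
    print(raidz_usable(4, 6, 2))      # raidz2 of 6x4 TB: 16 TB usable of 24 TB raw
```

The mirror pool stays at 50% efficiency no matter what, but growing it takes two drives at a time, whereas the raidz figure assumes all six drives were bought up front.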

1 replies

Not sure what the multidisk consensus is for btrfs nowadays, but adding/removing devices is trivial, you can do "offline" dedupe, and you can rebalance data if you change the disk config.

As an added bonus it's also in-tree so you don't have to worry about kernel updates breaking things

I think you can also potentially do btrfs+LVM and let LVM manage multi device. Not sure what performance looks like there, though

0 replies

That's all great but btrfs parity striping is still unusable. How many more decades will it take?

1 replies

ZFS using mirrors is extremely easy to expand. Need more space and you have small drives? Replace the drives in a mirror one by one with bigger ones. Need more space and already have huge drives? Just add another vdev mirror. And the added benefit of not living in fear of drive failure while resilvering as it is much faster with mirrors than raidX.

Sure the density isn't great as you're essentially running at 50% of raw storage but - touches wood - my home zpool has been running strong for about a decade doing the above from 6x 6tb drives (3x 6tb mirrors) to 16x 10-20tb drives (8x mirrors, differing sized drives but matched per mirror like a 10tb x2 mirror, a 16tb x2 mirror etc).

Edit: Just realised someone else has already mentioned a pool of mirrors. Consider this another +1.

0 replies

Replace the drives in a mirror one by one with bigger ones.

That's exactly what I meant by "just as bad as RAID". Expanding an existing array is analogous to every single drive in the array failing and getting replaced with higher capacity drives.

When a drive fails, the array is in a degraded state. Additional drive failures put the entire system in danger of data loss. The rebuilding process generates enormous I/O loads on all the disks. Not only does it take an insane amount of time, according to my calculations the probability of read errors happening during the error recovery process is about 3%. Such expansion operations have a real chance of destroying the entire array.
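The comment doesn't show the calculation behind the 3% figure. One common back-of-the-envelope version treats bit errors as independent and uses the unrecoverable read error (URE) rate from a drive datasheet. The sketch below does that; the rate of 1 error per 10^15 bits and the 4 TB read size are illustrative assumptions, not numbers taken from the comment.

```python
import math

def p_read_error(bytes_read, ure_per_bit=1e-15):
    """Probability of at least one unrecoverable read error while reading
    `bytes_read` bytes, assuming independent errors at `ure_per_bit` per bit.
    Computes 1 - (1 - r)^bits in a numerically stable way."""
    bits = bytes_read * 8
    return -math.expm1(bits * math.log1p(-ure_per_bit))

if __name__ == "__main__":
    four_tb = 4e12
    print(f"{p_read_error(four_tb):.2%}")         # assumed rate of 1e-15 per bit
    print(f"{p_read_error(four_tb, 1e-14):.2%}")  # assumed rate of 1e-14 per bit
```

Under these assumptions the result depends heavily on how much data has to be read and on which datasheet rate you plug in, so treat it as an order-of-magnitude estimate. Real drives don't fail independently bit by bit.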

0 replies

I've run Ceph at home since the jewel release. I migrated to it after running FreeNAS.

I use it for RBD volumes for my OpenStack cluster and for CephFS, with a total raw capacity of around 350TiB. Around 14 of that is NVMe storage for RBD and CephFS metadata. The rest is rust. This is spread across 5 nodes.

I currently am only buying 20TB Exos drives for rust. SMR and I think HSMR are both no goes for Ceph as are non enterprise SSDs, so storage is expensive. I do have a mix of disks though as the cluster has grown organically. So I have a few 6TB WD Reds in there, before their SMR shift.

My networks for OpenStack, Ceph and Ceph backend are all 10Gbps. With the flash storage when repairing I get about 8GiB/s. With rust it is around 270MiB/s. The bottleneck I think is due to 3 of the nodes running on first gen Xeon-D boards, and the few Reds do slow things down too. The 4th node runs an AMD Rome CPU, and the newest an AMD Genoa CPU. So I am looking at about 5k CAD a node before disks. I colocate the MDS, OSDs and MONs, with 64GiB of RAM each. Each node gets 6 rust, and 2 NVMe drives.

Complexity is pretty simple. I deployed the initial iteration by hand, and then when cephadm was released I converted it daemon by daemon smoothly. I find on the mailing lists and Reddit most of the people encountering problems deploy it via Proxmox and don't really understand Ceph because of it.

0 replies

EOS is probably a bit more complicated than other solutions to set up and manage, but it does allow adding and removing disks and nodes serving data on the fly. This is essential to let us upgrade the hardware of the clusters serving experimental data with minimal to no downtime.

0 replies

If you're willing to use mirror vdevs, expansions can be done two drives at a time. Also, depending on how often your data changes, you should check out snapraid. It doesn't have all the features of ZFS but it's perfect for stuff that rarely changes (media or, in your case, archiving).

Also unionfs or similar can let you merge zfs and snapraid into one unified filesystem so you can place important data in zfs and unchanging archive data in snapraid.

3 replies

I'd throw minio [1] in the list there as well for homelab k8s object storage.


1 replies

Minio doesn't make any sense to me in a homelab. Unless I'm reading it wrong it sounds like a giant pain to add more capacity while it is already in use. There's basically no situation where I'm more likely to add capacity over time than a homelab.

0 replies

You get a new nas (minio server pool) and you plug it into your home lab (site replication) and now it's part of the distributed minio storage layer (k8s are happy). How is that hard? It's the same basic thing for Ceph or any distributed JBOD mass storage engine. Minio has some funkiness with how you add more storage but it's totally capable of doing it while in use. Everything is atomic.

0 replies
2 replies

I really wish there was a benchmark comparing all of these + MinIO and S3. I'm in the market for a key value store, using S3 for now but eyeing moving to my own hardware in the future and having to do all the work to compare these is one of the major things making me procrastinate.

0 replies

minio is good but you really need fast disks. They also really don't like it when you want to change the size of your cluster setup. There's no plan to add cache disks; they just say use faster disks. I have it running and it goes smoothly, but it's not really user friendly to optimize.

0 replies

Minio gives you "only" S3 object storage. I've set up a 3-node Minio cluster for object storage on Hetzner, each server having 4x10TB, for ~50€/month each. This means 80TB usable data for ~150€/month. It can be worth it if you are trying to avoid egress fees, but if I were building a data lake or anything where the data was used mostly for internal services, I'd just stick with S3.
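The arithmetic behind those figures, as a quick sketch. The inputs are the numbers from the comment; the function names are made up for illustration, and the erasure-coding setting that produces 80TB usable is not stated, so it is not modelled here.

```python
# Inputs below come from the comment: 3 nodes, 4x10TB each,
# ~50 EUR/month per node, 80TB usable.

def raw_tb(nodes, drives_per_node, tb_per_drive):
    return nodes * drives_per_node * tb_per_drive

def usable_fraction(usable_tb, raw):
    return usable_tb / raw

def monthly_cost_per_usable_tb(monthly_eur, usable_tb):
    return monthly_eur / usable_tb

if __name__ == "__main__":
    raw = raw_tb(3, 4, 10)
    print(raw)                                        # raw capacity in TB
    print(round(usable_fraction(80, raw), 3))         # share of raw that is usable
    print(monthly_cost_per_usable_tb(3 * 50, 80))     # EUR per usable TB per month
```

So about a third of the raw capacity goes to redundancy, and the cost works out to a little under 2€ per usable TB per month before traffic.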

1 replies

glusterfs is still fine as long as you know what you are going into.

Does that include storage volumes for databases? I was using glusterFS as a way to scale my swarm cluster horizontally and I am reasonably sure that it corrupted one database to the point I lost more than a few hours of data. I was quite satisfied with the setup until I hit that.

I know that I am considered crazy for sticking with Docker Swarm until now, but aside from this lingering issue with how to manage stateful services, I honestly don't feel the need to move to k8s yet. My cluster is ~10 nodes running < 30 stacks and it's not like I have tens of people working with me on it.

0 replies

Docker Swarm seems to be underrated, from a simplicity and reliability perspective, IMHO.

1 replies

I thought it was popular for people running Proxmox clusters

0 replies

It is, and if you have a few nodes with at least 10 GbE networking, it's certainly the best clustered storage option I can think of.

0 replies

Curious, what do you mean by "know what you go into" re glusterfs?

I recently tried ceph in a homelab setup, gave up because of complexity, and settled on glusterfs. I'm not a pro though, so I'm not sure if there's any shortcomings that are clear to everybody but me, hence why your comment caught my attention.

0 replies

GlusterFS support looks to be permanently ending later this year.

Note that the Red Hat Gluster Storage product has a defined support lifecycle through to 31-Dec-24, after which the Red Hat Gluster Storage product will have reached its EOL. Specifically, RHGS 3.5 represents the final supported RHGS series of releases.

For folks using GlusterFS currently, what's your plan after this year?

0 replies

Ceph is sort of a storage all-in-one: it provides object storage, block storage, and network file storage. May I ask, which of these are you using seaweedfs for? Is it as performant as Ceph claims to be?

10 replies

Why would you bother with a distributed filesystem when you don't have to?

2 replies

For the same reason you would use one in enterprise deployments: if set up properly, it's easier to scale. You don't need to invest in a huge storage server upfront, but could build it out as needed with cheap nodes. That assumes it works as painlessly as a single-node filesystem, and I'm not yet convinced the existing solutions do.

1 replies

if setup properly, it's easier to scale

For home use/needs, I think vertical scaling is much easier.

0 replies

Not really. Most consumer motherboards have a limited number of SATA ports, and server hardware is more expensive, noisy and requires a lot of space. Consumers usually go with branded NAS appliances, which are also expensive and limited at scaling.

Setting up a cluster of small heterogeneous nodes is cheaper, more flexible, and can easily be scaled as needed, _assuming_ that the distributed storage software is easy to work with and trouble-free. This last part is what makes it difficult to setup and maintain, but if the software is stable, I would prefer this approach for home use.

2 replies

So that when you do have to, you know how to do it.

1 replies

I think most of us will go our whole lives never having to deploy Ceph, especially at home.

0 replies

You’re absolutely not wrong - but asking a devops engineer why they over engineered their home cluster is sort of like asking a mechanic “why is your car so fast? Couldn’t you just take the bus?”

1 replies

I'm indifferent towards the distributed nature thing. What I want is ceph's ability to pool any combination of drives of any make, model and capacity into organized redundant fault tolerant storage, and its ability to add arbitrary drives to that pool at any point in the system's lifetime. RAID-like solutions require identical drives and can't be easily expanded.

0 replies

ZFS and BtrFS have some capability for this.

0 replies

lol, wrong place to ask questions of such practicality.

that said, I played with virtualization and I didn't need to.

but then I retired a machine or two and it has been very helpful.

And I used to just use physical disks and partitions. But with the VMs I started using volume manager. It became easier to grow and shrink storage.


well, now a lot of this is second nature. I can spin up a new "machine" for a project and it doesn't affect anything else. I have better backups. I can move a virtual machine.

yeah, there are extra layers of abstraction but hey.

0 replies

It's cool to cluster everything for some people (myself included). I see it more like a design constraint than a pure benefit.

3 replies

Related question, how does someone get into working with Ceph? Other than working somewhere that already uses it.

0 replies

The recommended way to set up Ceph is cephadm, a single-file Python script that is a multi-tool for both creating and administering clusters.

To learn about Ceph, I recommend you create at least 3 KVM virtual machines (using virt-manager) on a development box, network them together, and use cephadm to set up a cluster between the VMs. The RAM and storage requirements aren't huge (Ceph can run on Raspberry Pis, after all) and I find it a lot easier to figure things out when I have a desktop window for every node.
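For a three-VM lab like this, the cephadm flow is roughly: bootstrap on the first node, add the other hosts, then let the orchestrator create OSDs on the spare disks. The sketch below only prints that plan. The hostnames and IP addresses are hypothetical, and you should check the exact commands against the cephadm docs for your Ceph release.

```python
# Dry run: prints a plausible cephadm command plan, executes nothing.
# Hostnames and IPs are invented; commands should be checked against
# the cephadm documentation for the release you install.

def cephadm_plan(nodes):
    """nodes: list of (hostname, ip); the first entry is the bootstrap node."""
    (_, first_ip), rest = nodes[0], nodes[1:]
    plan = [f"cephadm bootstrap --mon-ip {first_ip}"]
    for host, ip in rest:
        # Let the cluster's SSH key log in to the new host, then register it.
        plan.append(f"ssh-copy-id -f -i /etc/ceph/ceph.pub root@{host}")
        plan.append(f"ceph orch host add {host} {ip}")
    # Turn every unused, empty disk on every host into an OSD.
    plan.append("ceph orch apply osd --all-available-devices")
    plan.append("ceph status")
    return plan

if __name__ == "__main__":
    vms = [("ceph1", "192.168.122.11"),
           ("ceph2", "192.168.122.12"),
           ("ceph3", "192.168.122.13")]
    for line in cephadm_plan(vms):
        print(line)
```

Giving each VM one or two extra empty virtual disks is enough for the OSD step to have something to pick up.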

I recently set up Ceph twice. Now that Ceph (specifically RBD) is providing the storage for virtual machines, I can live-migrate VMs between hosts and reboot hosts (with zero guest downtime) anytime I need. I'm impressed with how well it works.

0 replies

Look into the Rook project

0 replies

You could start by installing Proxmox on old machines you have, it uses Ceph for its distributed storage, if you choose to use it.

2 replies

I have some experience with Ceph, both for work, and with homelab-y stuff.

First, bear in mind that Ceph is a distributed storage system - so the idea is that you will have multiple nodes.

For learning, you can definitely virtualise it all on a single box - but you'll have a better time with discrete physical machines.

Also, Ceph does prefer physical access to disks (similar to ZFS).

And you do need decent networking connectivity - I think that's the main thing people think of, when they think of high hardware requirements for Ceph. Ideally 10Gbe at the minimum - although more if you want higher performance - there can be a lot of network traffic, particularly with things like backfill. (25Gbps if you can find that gear cheap for homelab - 50Gbps is a technological dead-end. 100Gbps works well).

But honestly, for a homelab, a cheap mini PC or NUC with 10Gbe will work fine, and you should get acceptable performance, and it'll be good for learning.

You can install Ceph directly on bare-metal, or if you want to do the homelab k8s route, you can use Rook.

Hope this helps, and good luck! Let me know if you have any other questions.

1 replies

NUC with 10gbit eth - can you recommend any?

0 replies

If you want something cheap, you could go with Lenovo M720q's:

They have a PCIe slot and can take 8th/9th gen intel cpus (6 core, etc). That PCIe slot should let you throw in a decent network card (eg 10GbE, 25GbE, etc).

2 replies

Yes. I first tried it with Rook, and that was a disaster, so I shifted to Longhorn. That has had its own share of problems, and is quite slow. Finally, I let Proxmox manage Ceph for me, and it’s been a dream. So far I haven’t migrated my K8s workloads to it, but I’ve used it for RDBMS storage (DBs in VMs), and it works flawlessly.

I don’t have an incredibly great setup, either: 3x Dell R620s (Ivy Bridge-era Xeons), and 1GbE. Proxmox’s corosync has a dedicated switch, but that’s about it. The disks are nice to be fair - Samsung PM863 3.84 TB NVMe. They are absolutely bottlenecked by the LAN at the moment.

I plan on upgrading to 10GbE as soon as I can convince myself to pay for an L3 10G switch.

1 replies

Just get a 25G switch and MM fiber. 25G switches are cheaper, use less power and can work with 10 and 25G SFPs.

0 replies

The main blocker (other than needing to buy new NICs, since everything I have already came with quad 1/1/10/10) is I'm heavily invested into the Ubiquiti ecosystem, and since they killed off the USW-Leaf (and the even more brief UDC-Leaf), they don't have anything that fits the bill.

I'm not entirely opposed to getting a Mikrotik or something and it just being the oddball out, but it's nice to have everything centrally managed.

EDIT: They do have the PRO-Aggregation, but there are only 4x 25G ports. Technically it _would_ meet my needs for Ceph, and Ceph only.

2 replies

I run Ceph on some Raspberry Pi 4s. It's super reliable, and with cephadm it's very easy[1] to install and maintain.

My household is already 100% on Linux, so having a native network filesystem that I can just mount from any laptop is very handy.

Works great over Tailscale too, so I don't even have to be at home.

[1] I run a large install of Ceph at work, so "easy" might be a bit relative.

1 replies

What are your speeds? Do you rub ceph FS too?

I'm trying to do similar.

0 replies

It's been a while since I've done some benchmarks, but it can definitely do 40MB/s sustained writes, which is very good given the single 1GbE links on each node, and 5TB SMR drives.

Latency is hilariously terrible though. It's funny to open a text file over the network in vi, paste a long blob of text and watch it sync that line by line over the network.

If by "rub" you mean scrub, then yes, although I increased the scrub intervals. There's no need to scrub everything every week.

1 replies

The hardware minimums are real, and the complexity floor is significant. Do not deploy Ceph unless you mean it.

I started considering alternatives when my NAS crossed 100 TB of HDDs, and when a scary scrub prompted me to replace all the HDDs, I finally pulled the trigger. (ZFS resilvered everything fine, but replacing every disk sequentially gave me a lot of time to think.) Today I have far more HDD capacity and a few hundred terabytes of NVMe, and despite its challenges, I wouldn't dare run anything like it without Ceph.

0 replies

Can I ask what you use all that storage for on your NAS?

1 replies

There's a blog post they did where they setup Ceph on some rPI 4's. I'd say that's not significant hardware at all. [1]


0 replies

I think "significant" turns out to mean the number of nodes required.

0 replies

Proxmox makes Ceph easy, even with just one single server if you are homelabbing...

I had 4 NUCs running Proxmox+Ceph for a few years, and apart from slightly annoying slowness syncing after spinning the machines up from cold start, it all ran very smoothly.

0 replies

Works great, depending on what you want to do. Running on SBCs or computers with cheap sata cards will greatly reduce the performance. It's been running well for years after I found out the issues regarding SMR drives and the SATA card bottlenecks.

45Drives has a homelab setup if you're looking for a canned solution.

0 replies

I run Ceph in my lab. It's pretty heavy on CPU, but it works well as long as you're willing to spring for fast networking (at least 10Gb, ideally 40+) and at least a few nodes with 6+ disks each if you're using spinners. You can probably get away with far fewer disks per node if you're going all-SSD.

0 replies

I think you need 3 or was it 5 machines?

proxmox will use it - just click to install

0 replies

If you want decent performance, you need a lot of OSDs especially if you use HDD. But a lot of consumer SSDs will suffer terrible performance degradation with writes depending on the circumstances and workloads.

0 replies

I played around with it and it has a very cool web UI, object storage & file storage, but it was very hard to get decent performance and it was possible to get the metadata daemons stuck pretty easily with a small cluster. Ultimately when the fun wore off I just put zfs on a single box instead.

0 replies

I’ve run Ceph in my home lab since Jewel (~8 years ago). Currently up to 70TB storage on a single node. Have been pretty successful vertically scaling, but will have to add a 2nd node here in a bit.

Ceph isn’t the fastest, but it’s incredibly resilient and scalable. Haven’t needed any crazy hardware requirements, just ram and an i7.

0 replies

I just set up a three-node Proxmox+Ceph cluster a few weeks ago. Three Optiplex desktops 7040, 3060, and 7060 and 4x SSDs of 1TB and 2TB mix (was 5 until I noticed one of my scavenged SSDs was failed). Single 1gbps network on each so I am seeing 30-120MB/s disk performance depending on things. I think in a few months I will upgrade to 10gbps for about $400.

I'm about 1/2 through the process of moving my 15 virtual machines over. It is a little slow but tolerable. Not having to decide on RAIDs or a NAS ahead of time is amazing. I can throw disks and nodes at it whenever.

22 replies

What router/switch one would use for such speed?

12 replies

Linked article says they used 68 machines with 2 x 100GbE Mellanox ConnectX-6 cards. So any 100G pizza box switches should work.

Note that 36 port 56G switches are dirt cheap on eBay and 4tbps is good enough for most homelab use cases

11 replies

So any 100G pizza box switches should work.

but will it be able to handle combined TB/s traffic?

6 replies

Even the bargain Mikrotik can do 1.2Tbps

4 replies

For those curious, a "bargain" on a 100gbps switch means about $1350

2 replies

On a cluster with more than $1M of NVMe disks, that does actually seem like a bargain.

(Note that the linked MikroTik switch only has 100gbe on a few ports, and wouldn't really classify as a full 100gbe switch to most people)

1 replies

Sure- I don't mean to imply that it isn't. I can absolutely see how that's inexpensive for 100gbe equipment.

That was more for the benefit of others like myself, who were wondering if "bargain" was comparative, or inexpensive enough that it might be worth buying one next time they upgraded switches. For me personally it's still an order of magnitude away from that.

0 replies
19h9m is the sweet spot right now for home users. Four 10g ports and a 1g, you can use the 1g for “uplink” to the internet and one of the 10g for your “big old Nortel gigabit switch with 10g uplink” and one for your Mac and two for your NAS and VM server. ;)

Direct cables are moderately cheap, and modules for 10g Ethernet aren’t insanely expensive.

0 replies

there's usually some used dx010 (32x100gbe) on ebay for less than $500

the cheapest new 100gbe switch I know of is the mikrotik CRS504-4XQ-IN (4x100gbe, around $650)

0 replies

TB != Tb..

2 replies

any switch which can't handle full load on all ports isn't worthy of the name 'switch', it's more like 'toy network appliance'

1 replies

I will forever be scarred by the "Gigabit" switches of old that were 2 gigabit ports and 22 100mb ports. Coworker bought it missing the nuance.

0 replies

Still happens, gotta see if the top speed mentioned is an uplink or normal ports.

0 replies

Yes. Most network switches can handle all ports at 100% utilization in both directions simultaneously.

Take for example the Mellanox SX6790 available for less than $100 on eBay. It has 36 56gbps ports. 36 * 2 * 56 = 4032gbps and it is stated to have a switching capacity of 4.032Tbps.

Edit: I guess you are asking how one would possibly sip 1TiB/s of data into a given client. You would need multiple clients spread across several switches to generate such load. Or maybe some freaky link aggregation. 10x 800gbps links for your client, plus at least 10x 800gbps links out to the servers.
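The numbers in this comment can be checked with a few lines. The sketch below assumes "switching capacity" means ports times port speed times two (full duplex), as the comment does, and it reads 1 TiB/s as 2^40 bytes per second. The function names are made up for illustration.

```python
import math

def switching_capacity_gbps(ports, port_gbps):
    """Full-duplex aggregate: every port sending and receiving at line rate."""
    return ports * port_gbps * 2

def tib_per_s_to_gbps(tib_per_s):
    """Convert TiB/s (binary bytes) to Gb/s (decimal bits)."""
    return tib_per_s * 2**40 * 8 / 1e9

def links_needed(tib_per_s, link_gbps):
    """Minimum number of links at raw line rate, ignoring protocol overhead."""
    return math.ceil(tib_per_s_to_gbps(tib_per_s) / link_gbps)

if __name__ == "__main__":
    print(switching_capacity_gbps(36, 56))   # the 36-port 56 Gb/s switch
    print(round(tib_per_s_to_gbps(1)))       # 1 TiB/s expressed in Gb/s
    print(links_needed(1, 800))              # 800 Gb/s links for 1 TiB/s
    print(links_needed(1, 100))              # 100 Gb/s links for 1 TiB/s
```

With binary units, 1 TiB/s comes to roughly 8.8 Tb/s, so the "10x 800gbps" figure is the right ballpark but lands closer to eleven links once you count in TiB.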

7 replies

800Gbps via OSFP and QSFP-DD are already a thing. Multiple vendors have NICs and switches for that.

4 replies

16x PCIe 4.0 is 32 GB/s, and 16x PCIe 5.0 should be 64 GB/s. How is any computer using 100 GB/s?
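A quick way to sanity-check those PCIe figures, as a sketch. It uses the per-lane transfer rates for PCIe 3.0 to 5.0 and their 128b/130b line encoding, counts one direction only, and ignores packet overhead, so real throughput is somewhat lower.

```python
import math

# Per-lane raw rate in GT/s for PCIe generations that use 128b/130b encoding.
GT_PER_S = {3: 8, 4: 16, 5: 32}

def pcie_gbytes_per_s(gen, lanes):
    """Approximate one-direction bandwidth in GB/s, before packet overhead."""
    return GT_PER_S[gen] * (128 / 130) / 8 * lanes

def lanes_needed(link_gbps, gen):
    """Lanes required to carry a network link of `link_gbps` gigabits/s."""
    gbytes = link_gbps / 8
    return math.ceil(gbytes / pcie_gbytes_per_s(gen, 1))

if __name__ == "__main__":
    print(round(pcie_gbytes_per_s(4, 16), 1))  # PCIe 4.0 x16
    print(round(pcie_gbytes_per_s(5, 16), 1))  # PCIe 5.0 x16
    print(lanes_needed(800, 5))                # lanes for 800 Gb/s on PCIe 5.0
    print(lanes_needed(400, 5))                # lanes for 400 Gb/s on PCIe 5.0
```

By this estimate a single PCIe 5.0 x16 slot is enough for 400 Gb/s but not for 800 Gb/s, which would need more than 16 lanes.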

3 replies

I was talking about Gigabit/s, not Gigabyte/s.

The article however actually talks about Terabyte/s scale, albeit not over a single node.

2 replies

800 gigabits is 100 gigabytes which is still more than PCIe 5.0 16x 64 gigabyte per second bandwidth.

You said there were 800 gigabit network cards, I'm wondering how that much bandwidth makes it to the card in the first place.

The article however actually talks about Terabyte/s scale, albeit not over a single node.

This does not have anything to do with what you originally said, you were talking about 800gb single ports.

0 replies

I'm not aware of any 800G cards, but FYI a single Mellanox card can use two PCIe x16 slots to avoid NUMA issues on dual-socket servers:

So the software infra for using multiple slots already exists and doesn't require any special config. Oh and some cards can use PCIe slots across multiple hosts. No idea why you'd want to do that, but you can.

0 replies

Yes, apparently I was mistaken about the NICs. They don't seem to be available yet.

But it's not a PCIe limitation. There are PCIe devices out there which use 32 lanes, so you could achieve the bandwidth even on PCIe5.

1 replies

can you show me a 800G NIC?

the switch is fine, I'm buying 64x800G switches, but NIC wise I'm limited to 400Gbit.

0 replies

fair enough, it seems I was mistaken about the NIC. I guess that has to wait for PCIe 6 and should arrive soon-ish.

0 replies

Given their configuration of just 4U spread across 17 racks, there's likely a bunch of compute in the rest of the rack, and 1-2 top of rack switches like this:

And then you connect the TOR switches to higher level switches in something like a Clos distribution to get the desired bandwidth between any two nodes:

22 replies

I wish someone would try to scale the nodes down. The system described here is ~300W/node for 10 disks/node, so 30W or so per disk. That’s a fair amount of overhead, and it also requires quite a lot of storage to get any redundancy at all.

I bet some engineering effort could divide the whole thing by 10. Build a tiny SBC with 4 PCIe lanes for NVMe, 2x10GbE (as two SFP+ sockets), and a just-fast-enough ARM or RISC-V CPU. Perhaps an eMMC chip or SD slot for boot.

This could scale down to just a few nodes, and it reduces the exposure to a single failure taking out 10 disks at a time.

I bet a lot of copies of this system could fit in a 4U enclosure. Optionally the same enclosure could contain two entirely independent switches to aggregate the internal nodes.

7 replies

here's a weird calculation:

this cluster does something vaguely like 0.8 gigabits per second per watt (1 terabyte/s * 8 bits per byte * 1024 gb per tb / 34 nodes / 300 watts)

a new mac mini (super efficient arm system) runs around 10 watts in interactive usage and can do 10 gigabits per second network, so maybe 1 gigabit per second per watt of data

so OP's cluster, back of the envelope, is basically the same bits per second per watt that a very efficient arm system can do
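The same back-of-the-envelope numbers as a sketch, so the inputs are easy to change. The 34 nodes, 300 watts per node, and the Mac mini figures are the comment's own estimates; another comment in this thread cites 68 machines for the cluster, which would halve the result.

```python
def cluster_gbps_per_watt(terabytes_per_s, nodes, watts_per_node):
    """Gigabits per second per watt, using the comment's 1024 Gb per Tb factor."""
    gigabits_per_s = terabytes_per_s * 8 * 1024
    return gigabits_per_s / nodes / watts_per_node

def box_gbps_per_watt(network_gbps, watts):
    """Same metric for a single machine limited by its network port."""
    return network_gbps / watts

if __name__ == "__main__":
    print(round(cluster_gbps_per_watt(1, 34, 300), 2))  # comment's inputs
    print(round(cluster_gbps_per_watt(1, 68, 300), 2))  # with 68 nodes instead
    print(box_gbps_per_watt(10, 10))                    # Mac mini estimate
```

Either way the two systems land within a small factor of each other, which is the point the comment is making.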

I don't think running tiny nodes would actually get you any more efficiency, and would probably cost more! performance per watt is quite good on powerful servers now

anyway, this is all open source software running on off-the-shelf hardware, you can do it yourself for a few hundred bucks

2 replies

I think the Mac Mini has massively more compute than needed for this kind of work. It also has a power supply, and computer power supplies are generally not amazing at low output.

I’m imagining something quite specialized. Use a low frequency CPU with either vector units or even DMA engines optimized for the specific workloads needed, or go all out and arrange for data to be DMAed directly between the disk and the NIC.

1 replies

sounds like a DPU (mellanox bluefield for example), they're entire ARM systems with a high speed NIC all on a PCIe card, I think the bluefield ones can even directly interface over the bus to nvme drives without the host system involved

0 replies

That Bluefield hardware looks neat, although it also sounds like a real project to program it :).

I can imagine two credible configurations for high efficiency:

1. A motherboard with a truly minimal CPU for bootstrapping but a bit beefy PCIe root complex. 32 lanes to the DPU and a bunch of lanes for NVMe. The CPU doesn’t touch the data at all. I wonder if anyone makes a motherboard optimized like this — a 64-lane mobo with a Xeon in it would be quite wasteful but fine for prototyping I suppose.

2. Wire up the NVMe ports directly to the Bluefield DPU, letting the DPU be the root complex. At least 28 of the lanes are presumably usable for this or maybe even all 32. It’s not entirely clear to me that the Bluefield DPU can operate without a host computer, though.

1 replies

I checked selling prices of those racks + top end SSDs; this 1 TiB/s achievement runs on a hardware cluster worth about $4 million. Or more, since I didn't check the networking interface costs.

But yeah, it could run on commodity hardware. I'm not sure those highly efficient ARM machines, sold at a premium by Apple, would beat the Dell racks, though, on throughput relative to hardware investment costs.

0 replies

Dell’s list prices have essentially nothing to do with the prices that any competent buyer would actually pay, especially when storage is involved. Look at the prices of Dell disks, which are nothing special compared to name brand disks of equal or better spec and much lower list price.

I don’t know what discount large buyers get, but I wouldn’t be surprised if it’s around 75%.

1 replies

Trusting your maths, damn Apple did a great job on their M design.

0 replies

Didn't ARM (the company that originally designed ARM processors) do most of that job, and Apple pushed perf to consumption even further?

6 replies

I have always wanted to set up a Ceph system with one drive per node. The ideal form factor would be a drive with a couple network interfaces built in. Western Digital had a press release about an experiment they did that was exactly this, but it never ended up as a drive you could buy.

The Hardkernel HC2 SOC was a nearly ideal form factor for this, and I still have a stack of them lying around that I bought to make a Ceph cluster, but I ran out of steam when I figured out they were 32-bit. Not to say it would be impossible, I just never did it.

3 replies

I used to use Ceph Luminous (v12) on these, and they worked fine. Unfortunately, a bug in Nautilus (v14) prevented 32-bit and 64-bit archs from talking to each other. Pacific (v16) allegedly solves this, but I didn't try it.

If you want to try it with a more modern (and 64-bit) device, the hardkernel HC4 might do it for you. It's conceptually similar to the HC2 but has two drives. Unfortunately it only has double the RAM (4GB), which is probably not enough anymore.

2 replies

Looks so good. I wish there were a >1 Gbit version, since HDDs alone can saturate that.

1 replies

Did you look at their H3? It's pricier but it has two 2.5 Gbit ports (along with an NVMe slot and an Intel CPU).

0 replies

I have one and love it! It bravely holds together my intranet dev services :)

For a ceph node I would still consider a version with 10 Gbit Ethernet.

1 replies
0 replies

That would be perfect. Unfortunately, going by the data sheet it would not run ceph; you would have to work with Seagate's proprietary object store. I will note that, as far as I can tell, it is unobtainium. None of the usual vendors stock them, and you probably have to prove to Seagate that you are a "serious enterprise customer" and commit to a thousand units before they will let you buy some.

1 replies

I used to run a 5 node Ceph cluster on a bunch of ODROID-HC2's [0]. Was a royal pain to get installed (armhf processor). But once it was running it worked great. Just slow with the single 1Gb NIC.

Was just a learning experience at the time.


0 replies

Same here, but on Pi 4Bs: a 6-node cluster with a 2 TB HDD and a 512 GB SSD per node. Ceph made a huge impression on me, as in I hadn't realized how extensive the package was. I got up to 122 MB/s and thought it was too little for my hack-NAS replacement :)

The functionality was very impressive: mixing various pool types on the same set of SSDs, with different redundancy types (erasure coded, replicated). Now I can't help but look down on a RAID NAS in comparison. Still, some extra packages like the NFS exporter were not ready for the ARM architecture.

0 replies

IIRC, WD has experimented with placing Ethernet and some compute directly onto hard drives some time back.

sigh I used to do some small-scale Ceph back in 2017 or so...

0 replies

10 Gbps is increasingly obsolete with very low cost 100 Gbps switches and 100Gbps interfaces. Something would have to be really tiny and low cost to justify doing a ceph setup with 10Gbps interfaces now... If you're at that scale of very small stuff you are probably better off doing local NVME storage on each server instead.

0 replies

There probably is a sweet spot for power to speed, but I think it's possibly a bit larger than you suggest. There's overhead from the other components as well. For example, the Mellanox NIC seems to use about 20W itself, and while the reduced number of drives might allow for a single-port NIC, which seems to use about half the power, if we're going to increase the number of cables (3 per 12 disks instead of 2 per 5), we're not just increasing the power usage of the nodes themselves but also possibly increasing the power usage, or changing the type, of the switch required to combine the nodes.

If looked at as a whole, it appears to be more about whether you're combining resources at a low level (on the PCI bus on nodes) or a high level (in the switching infrastructure), and we should be careful not to push power (or complexity, as is often a similar goal) to a separate part of the system that is out of our immediate thoughts but still very much part of the system. Then again, sometimes parts of the system are much better at handling the complexity for certain cases, so in those cases that can be a definite win.

0 replies

I think the chief source of inefficiency in this architecture would be the NVMe controller. When the operating system and the NVMe device are at arm's length, there is natural inefficiency, as the controller needs to infer the intent of the request and do its best in terms of placement and wear leveling. The new FDP (flexible data placement) features try to address this by giving the operating system more control. The best thing would be to just hoist it all up into the host operating system and present the flash, as nearly as possible, as a giant field of dumb transistors that happens to be a PCIe device. With layers of abstraction removed, the hardware unit could be something like an Atom with integrated 100gbps NICs and a proportional amount of flash to achieve the desired system parallelism.

0 replies

Is that a lot of overhead? The disk itself uses about 10W and high-speed controllers use about 75W, which leaves pretty much 100W for the rest of the system, including overhead of about 10%. Scale up the system to 16 disks and there's not a lot of room for improvement.

10 replies

There was a point in history when the total amount of digital data stored worldwide reached 1TiB for the first time. It is extremely likely this day was within the last sixty years.

And here we are moving that amount of data every second on the servers of a fairly random entity. We are not talking about a nation state or a supranational research effort.

6 replies

It’s at least 20ish years ago: I remember an old sysadmin talking about managing petabytes before 2003

3 replies

Those numbers seem reasonable in that context. I first started using BitTorrent around that time as well, and it wasn't uncommon to see many users long-term seeding multiple hundreds of gigabytes of Linux ISOs alone.

Here’s another usage scenario with data usage numbers I found a while back.

A 2004 paper published in ACM Transactions on Programming Languages and Systems shows how Hancock code can sift calling card records, long distance calls, IP addresses and internet traffic dumps, and even track the physical movements of mobile phone customers as their signal moves from cell site to cell site.

With Hancock, "analysts could store sufficiently precise information to enable new applications previously thought to be infeasible," the program authors wrote. AT&T uses Hancock code to sift 9 GB of telephone traffic data a night, according to the paper.

1 replies

I archived Hancock here over a decade ago, stumbled upon it via HN at the time if I’m not mistaken:

0 replies

That's pretty cool. I remember someone on that repo from a while back and was surprised to see their name pop up again. Thanks for archiving this!

Corinna Cortes et al wrote the paper(s) on Hancock and also the Communities of Interest paper referenced in the Wired article I linked to. She’s apparently a pretty big deal and went on to work at Google after her prestigious work at AT&T.

Hancock: A Language for Extracting Signatures from Data

Hancock: A Language for Analyzing Transactional Data Streams

Communities of Interest

0 replies

Yeah, at the other end of the scale, it sounds like Apple is now managing exabytes:

This is pretty mind-boggling to me.

0 replies

It must be much more than 20ish years. Some 2400 ft reels in the 60s stored a few megabytes, so you only need hundreds of thousands of those to reach a terabyte.

a single 2400-foot tape could store the equivalent of some 50,000 punched cards (about 4,000,000 six-bit bytes).

In 1964, with the introduction of System/360, you are going an order of magnitude higher.

It could store a maximum of 45MB on 2,400 feet

At this point you only need a few ten thousand reels in existence to reach a terabyte. So I strongly suspect the "terabyte point" was some time in the 1960s.
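The reel counts above can be checked with a few lines of arithmetic. This is only a sketch: it assumes a decimal terabyte (10^12 bytes) and takes the two per-reel capacities quoted in the comment at face value; the function and variable names are made up for illustration.

```python
# Reels of tape needed to hold one terabyte, using the per-reel
# capacities quoted in the comment above.

TERABYTE = 10**12  # assumption: decimal terabyte, in bytes


def reels_needed(bytes_per_reel: int, target_bytes: int = TERABYTE) -> int:
    """Return the number of whole reels needed to store target_bytes."""
    return -(-target_bytes // bytes_per_reel)  # ceiling division


# Early 2400-foot reel: about 4,000,000 six-bit characters = 3,000,000 bytes.
early_reel_bytes = 4_000_000 * 6 // 8

# System/360-era reel: up to 45 MB on 2400 feet.
s360_reel_bytes = 45 * 10**6

early_reels = reels_needed(early_reel_bytes)
s360_reels = reels_needed(s360_reel_bytes)

print(f"early reels needed:      {early_reels:,}")
print(f"System/360 reels needed: {s360_reels:,}")
```

That gives roughly a third of a million early reels, or a bit over twenty thousand System/360-era reels, which matches the orders of magnitude in the comment.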

0 replies

I raised this on Retrocomputing SE, and it was noted there that a TiB of digital data was likely reached in the 1930s with punch cards.

2 replies

That reminds me of a calculation I did which showed that my desktop PC would be more powerful than all of the computers on the planet combined in like 1978 :D

1 replies

My phone has more computation than anything I would have imagined owning, and I sometimes turn on the screen just to use as a quick flashlight.

0 replies

Haha.. imagine taking it back to 1978 and showing how it has more computing power than the entire planet and then telling them that you mostly just use it to find that thing you lost under the couch :D

7 replies

I wanted to see how 1 TiB/s compares to the actual theoretical limits of the hardware. So here is what I found:

The cluster has 68 nodes, each a Dell PowerEdge R6615. The R6615 configuration they run is the one with 10 U.2 drive bays. The U.2 link carries data over 4 PCIe gen4 lanes. Each PCIe lane is capable of 16 Gbit/s. The lanes have negligible (under 2%) overhead thanks to 128b/130b encoding.

This means each U.2 link has a maximum link bandwidth of 16 * 4 = 64 Gbit/s or 8 Gbyte/s. However the U.2 NVMe drives they use are Dell 15.36TB Enterprise NVMe Read Intensive AG, which appear to be capable of 7 Gbyte/s read throughput. So they are not bottlenecked by the U.2 link (8 Gbyte/s).

Each node has 10 U.2 drives, so each node can do local read I/O at a maximum of 10 * 7 = 70 Gbyte/s.

However, each node has a network bandwidth of only 200 Gbit/s (2 x 100GbE Mellanox ConnectX-6), which is only 25 Gbyte/s. This implies that remote reads are under-utilizing the drives (capable of 70 Gbyte/s). The network is the bottleneck.

Assuming no additional network bottlenecks (they don't describe the network architecture), this implies the 68 nodes can provide 68 * 25 = 1700 Gbyte/s of network reads. The author benchmarked 1 TiB/s, or more exactly 1025 GiB/s = 1101 Gbyte/s, which is 65% of the maximum theoretical 1700 Gbyte/s. That's pretty decent, but in theory it's still possible to do a bit better, assuming all nodes can truly saturate their 200 Gbit/s network links concurrently.
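The chain of estimates above can be reproduced in a few lines. This is a back-of-the-envelope sketch using only the figures quoted in this comment; it ignores PCIe encoding and protocol overhead and assumes the network has no other bottleneck. The constant names are illustrative.

```python
# Back-of-the-envelope read bandwidth ceiling for the cluster,
# using the figures quoted in the comment above.

NODES = 68
DRIVES_PER_NODE = 10
DRIVE_READ_GBYTE_S = 7.0        # quoted sequential read rate per drive
PCIE_GEN4_LANE_GBIT_S = 16.0    # per lane, encoding overhead ignored
LANES_PER_U2 = 4
NIC_GBIT_S_PER_NODE = 2 * 100   # 2 x 100GbE
MEASURED_GIB_S = 1025           # benchmark result from the blog post

# Per-drive link ceiling: 4 lanes * 16 Gbit/s = 64 Gbit/s = 8 Gbyte/s.
u2_link_gbyte_s = PCIE_GEN4_LANE_GBIT_S * LANES_PER_U2 / 8

# Per-node ceilings: local drives versus network.
node_drive_gbyte_s = DRIVES_PER_NODE * min(DRIVE_READ_GBYTE_S, u2_link_gbyte_s)
node_net_gbyte_s = NIC_GBIT_S_PER_NODE / 8

# Remote reads are limited by whichever per-node ceiling is lower.
cluster_net_gbyte_s = NODES * min(node_drive_gbyte_s, node_net_gbyte_s)

# Convert the measured GiB/s to decimal Gbyte/s before comparing.
measured_gbyte_s = MEASURED_GIB_S * 2**30 / 10**9
utilization = measured_gbyte_s / cluster_net_gbyte_s

print(f"per-node drives:  {node_drive_gbyte_s:.0f} Gbyte/s")
print(f"per-node network: {node_net_gbyte_s:.0f} Gbyte/s")
print(f"cluster ceiling:  {cluster_net_gbyte_s:.0f} Gbyte/s")
print(f"measured:         {measured_gbyte_s:.0f} Gbyte/s ({utilization:.0%})")
```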

Reading this whole blog post, I got the impression ceph's complexity hits the CPU pretty hard. That not compiling a module with -O2 ("Fix Three", linked by the author) can reduce performance "up to 5x slower with some workloads" is pretty unexpected for a pure I/O workload. Also, what's up with the OSD threads causing excessive CPU waste grabbing the IOMMU spinlock? I agree with the conclusion that the OSD threading model is suboptimal. A relatively simple synthetic 100% read benchmark should not expose threading contention if that part of ceph's software architecture was well designed (which is fixable, so I hope the ceph devs prioritize this).

2 replies

I think PCIe TLP overhead and NVMe commands account for the difference between 7 and 8 GB/s.

1 replies

You are probably right. Reading some old notes of mine from when I was fine-tuning PCIe bandwidth on my ZFS server, I had discovered back then that a PCIe Max_Payload_Size of 256 bytes limited usable bandwidth to about 74% of the link's theoretical max. I had calculated that 512 and 1024 bytes (the maximum) would raise it to about 86% and 93% respectively (but my SATA controllers didn't support a value greater than 256).
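One way to sanity-check those percentages is to ask what fixed per-packet overhead they imply. The sketch below assumes a deliberately simple model, efficiency = payload / (payload + overhead), which is not taken from the PCIe specification or from the commenter's notes; it only solves that formula for the overhead using the three quoted figures.

```python
# Implied per-packet overhead under the simple model
#   efficiency = payload / (payload + overhead)
# The model is an assumption made for illustration only.


def implied_overhead_bytes(payload_bytes: int, efficiency: float) -> float:
    """Solve efficiency = payload / (payload + overhead) for overhead."""
    return payload_bytes * (1.0 / efficiency - 1.0)


# Max_Payload_Size (bytes) -> usable fraction quoted in the comment above.
quoted = {256: 0.74, 512: 0.86, 1024: 0.93}

overheads = {
    payload: implied_overhead_bytes(payload, eff)
    for payload, eff in quoted.items()
}

for payload, overhead in overheads.items():
    print(f"MPS {payload:4d} B -> implied overhead of about {overhead:.0f} B per packet")
```

Under this model all three figures correspond to an effective overhead somewhere between 75 and 95 bytes per packet, so they are at least mutually consistent.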

0 replies

Mellanox recommends setting this from the default 512 to 4096 on their NICs.

2 replies

They're benchmarking random IO though, and the disks can "only" do a bit over 1000k random 4k read IOPS, which translates to about 5 GiB/s. With 320 OSDs that's around 1.6 TiB/s.

At least that's the number I could find. Not exactly tons of reviews on these enterprise NVMe disks...

Still, that seems like a good match to the NICs. At this scale most workloads will likely appear as random IO at the storage layer anyway.

1 replies

The benchmark where they achieve 1025 GiB/s is for sequential reads. For random reads they do 25.5M IOPS or ~100 GiB/s. See the last table, column "630 OSDs (3x)".
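The IOPS-to-throughput conversion is easy to check. This sketch assumes 4 KiB (4096-byte) requests, as in the random 4k reads mentioned earlier in the thread; the function name is illustrative.

```python
# Convert an IOPS figure into throughput, assuming fixed-size requests.


def iops_to_gib_s(iops: float, block_bytes: int = 4096) -> float:
    """Throughput in GiB/s for a given IOPS rate and request size."""
    return iops * block_bytes / 2**30


# 25.5M random read IOPS, assumed to be 4 KiB each.
cluster_random_read_gib_s = iops_to_gib_s(25.5e6)

print(f"25.5M IOPS at 4 KiB is about {cluster_random_read_gib_s:.1f} GiB/s")
```

The result lands just under 100 GiB/s, in line with the "~100 GiB/s" figure above.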

0 replies

Oh wow how did I miss that table, cheers.

0 replies

I wanted to chime in and mention that we've never seen any issues with IOMMU before in Ceph. We have a previous generation of the same 1U chassis from Dell with AMD Rome processors in the upstream ceph lab and they don't suffer from the same issue despite performing similarly at the same scale (~30 OSDs). The customer did say they've seen this in the past in their data center. I'm hoping we can work with AMD to figure out what's going on.

I did some work last summer kind of duct taping the OSD's existing threading model (double buffering the hand-off between async msgr and worker threads, adaptive thread wakeup, etc). I could achieve significant performance / efficiency gains under load, but at the expense of increased low-load latency (Ceph by default is very aggressive about waking up threads when new IO arrives for a given shard).

One of the other core developers and I discussed it and we both came to the conclusion that it probably makes sense to do a more thorough rewrite of the threading code.

6 replies

Ceph has an interesting history.

It was created at Dreamhost (DH), for their internal needs by the founders.

DH was doing effectively IaaS & PaaS before those were industry coined words (VPS, managed OS/database/app-servers).

They spun Ceph off and Red Hat bought it.

3 replies

A bit more to the story is that it was created also at UC Santa Cruz, by Sage Weil, a Dreamhost founder, while he was doing graduate work there. UCSC has had a lot of good storage research.

0 replies

the fighting banana slugs

0 replies

Sage is one of the nicest, down to earth, super smart individuals I've met.

I've talked to him at a few OpenStack and Ceph conferences, and he's always very patient answering questions.

0 replies

I remember the first time I deployed ceph; it would have been around 2010 or 2011. I had some really major issues which nearly resulted in data loss, and because someone else had not realized what "this cluster is experimental, do not store any important data here" meant, the data on ceph was the only copy of irreplaceable data in the world. Losing that data would have been fairly catastrophic for us.

I ended up on the ceph IRC channel and eventually had Sage helping me fix the issues directly, helping me find bugs and writing patches to fix them in realtime.

Super amazingly nice guy that he was willing to help, never once chastised me for being so stupid (even though I was), also wicked smart.

1 replies

Yeah, as a customer (still one) I remember their "Hey, we're going to build this Ceph thing, maybe it ends up being cool" blog entry (or newsletter?) kinda just sharing what they were toying with. It was a time of no marketing copy and not crafting every sentence to sell you things.

I think it was the university project of one of the founders, and the others jumped in supporting it. Docker has a similar origin story as far as I know.

5 replies

I used to love doing experiments like this. I was afforded that luxury as a tech lead back when I was at Cisco setting up Kubernetes on bare metal and getting to play with setting up GlusterFS and Ceph just to learn and see which was better. This was back in 2017/2018 if I recall. Good ole days. Loved this writeup!

2 replies

I had to run a bunch of benchmarks to compare speeds of not just AWS instance types, but actual individual instances in each type, as some NVME SSDs have been more used than others in order to lube up some Aerospike response times. Crazy.

1 replies

Ad-tech, or?

0 replies

Yeah. Serving profiles for customized ad selection.

1 replies

A Heketi man! I had the same experience around the same years, what a blast. Everything was so new..and broken!

0 replies

Same here, still remember that time our Heketi DB partially corrupted and we had to fix it up by exporting it to a massive json file, fix it up by looking at the Gluster state and importing it again. I can't quite remember the details but I think it had to do with Gluster snapshots being out of sync with the state in the DB.

5 replies

Is modern Ceph appropriate for transactional database storage? How is the IO latency? I'd like to move to a cheaper cfs that can compete with systems like Oracle's clustered file system or DBs backed by something like Veritas. Veritas supports multi-petabyte DBs and I haven't seen much outside of it or ocfs that similarly scales with acceptable latency.

2 replies

Not sure about putting DBs on CephFS directly, but Ceph RBD can definitely run RDBMS workloads.

You need to pay attention to the kind of hardware you use, but you can definitely get Ceph down to 0.5-0.6 ms latency on block workloads doing single thread, single queue, sync 4K writes.
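For a sense of scale: with a single thread and a queue depth of 1, each write has to complete before the next one is issued, so the achievable IOPS is simply the inverse of the per-operation latency. A minimal sketch of that relationship, using the latency range quoted above (the function name is illustrative):

```python
# At queue depth 1, IOPS is bounded by 1 / latency.


def qd1_iops(latency_ms: float) -> float:
    """Maximum operations per second when requests are issued one at a time."""
    return 1000.0 / latency_ms


for latency_ms in (0.5, 0.6):
    iops = qd1_iops(latency_ms)
    mb_s = iops * 4096 / 10**6  # 4K writes, decimal MB/s
    print(f"{latency_ms} ms -> {iops:.0f} IOPS, about {mb_s:.1f} MB/s")
```

So 0.5 to 0.6 ms corresponds to roughly 1,700 to 2,000 synchronous 4K writes per second per thread; higher aggregate numbers come from adding parallelism, not from lower latency.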

Source, I run Ceph at work doing pretty much this.

1 replies

It is important to specify which kind of latency percentile this is. Checking on a customer's cluster (made from 336 SATA SSDs in 15 servers, so not the best one in the world):

  50th percentile = 1.75 ms
  90th percentile = 3.15 ms
  99th percentile = 9.54 ms

That's with 700 MB/s of reads and 200 MB/s of writes, or approximately 7000 read IOPS and 9000 write IOPS.
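Those throughput and IOPS figures also imply an average request size, which matters when comparing latencies across setups. A small sketch, assuming decimal megabytes and kilobytes and using only the numbers quoted above:

```python
# Average request size implied by a throughput and an IOPS figure.


def avg_io_kb(throughput_mb_s: float, iops: float) -> float:
    """Average I/O size in kB, given throughput in MB/s and operations per second."""
    return throughput_mb_s * 1000 / iops


read_kb = avg_io_kb(700, 7000)
write_kb = avg_io_kb(200, 9000)

print(f"average read:  {read_kb:.0f} kB")
print(f"average write: {write_kb:.1f} kB")
```

On these numbers the reads average about 100 kB and the writes about 22 kB, so this is a mixed workload with requests much larger than the 4K sync writes discussed in the parent comment.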

0 replies

These numbers may be good enough for your use case but from what’s possible with SSDs these numbers aren’t great. Please note, I mean well. Still a cool setup.

I’d like to see much more latency consistency and 99th even sub ms. Might want to set a latency target with fio and see what kind of load is possible until 99 hits 1ms.

However, I can say all of this but it’s all about context and depending on workload your figures may be totally fine.

1 replies

Latency is quite poor, I wouldn't recommend running high performance database loads there.

0 replies

From my dated experience, Ceph is absolutely amazing but latency is indeed a relatively weak spot.

Everything has a trade-off and for Ceph you get a ton of capability but latency is such a trade-off. Databases - depending on requirements - may be better off on regular NVMe and not on Ceph.

3 replies

Ceph is interesting... open source software whose only purpose is to implement a distributed file system...

Functionally, Linux implements a file system (well, several!) as well (in addition to many other OS features) -- but (usually!) only on top of local hardware.

There seems to be some missing software here -- if we examine these two paradigms side-by-side.

For example, what if I want a Linux (or more broadly, a general OS) -- but one that doesn't manage a local file system or local storage at all?

One that operates solely using the network, solely using a distributed file system that Ceph, or software like Ceph, would provide?

Conversely, what if I don't want to run a full OS on a network machine, a network node that manages its own local storage?

The only thing I can think of to solve those types of problems -- is:

What if the Linux filesystem was written such that it was a completely separate piece of software, and a distributed file system like Ceph, and not dependent on the other kernel source code (although, still compilable into the kernel as most linux components normally are)...

A lot of work? Probably!

But there seems to be some software need for something between a solely distributed file system as Ceph is, and a completely monolithic "everything baked in" (but not distributed!) OS/kernel as Linux is...

Note that I am just thinking aloud here -- I probably am wrong and/or misinformed on one or more fronts!

So, kindly take this random "thinking aloud" post -- with the proverbial "grain of salt!" :-)

2 replies

what if I want a Linux ... that doesn't manage a local file system or local storage at all [but] operates solely using the network, solely using a distributed file system

Linux can boot from NFS although that's kind of lost knowledge. Booting from CephFS might even be possible if you put the right parts in the initrd.

1 replies
0 replies

NFS is an excellent point!

NFS (now that I think about it!) -- brings up two additional software engineering considerations:

1) Distributed file system protocol.

2) Software that implements that distributed (or at least remote/network) file system -- via that file system protocol.

NFS is both.

That's not a bad thing(!) -- but ideally from a software engineering "separation of concerns" perspective, this future software layer/level would ideally be decoupled from the underlying protocol -- that is, it might have a "plug-in" protocol architecture, where various 3rd party file system protocols (somewhat analogous to drivers) could be "plugged-in"...

But NFS could definitely be used to boot/run Linux over the network, and is definitely a step in the right direction, and something worth evaluating for these purposes... its source code is definitely worth looking at...

So, an excellent point!

2 replies

I wrote an intro to Ceph[0] for those who are new to Ceph.

It featured in a Jeff Geerling video briefly recently :-)

[0]: Understanding Ceph: open-source scalable storage

1 replies

Has anything important changed since 2018, when you wrote that? :)

0 replies

Conceptually not as far as I know.

2 replies

Where can I read about the rationale for ceph as a project? I'm not familiar with it.

0 replies
18h50m is a pretty good introduction. Basically you can take off-the-shelf hardware and keep expanding your storage cluster and ceph will scale fairly linearly up through hundreds of nodes. It is seeing quite a bit of use in things like Kubernetes and OpenShift as a cheap and cheerful alternative to SANs. It is not without complexity, so if you don't know you need it, it's probably not worth the hassle.

0 replies

Not sure how common the use-case is, but we're using Ceph to effectively roll our own EBS inside AWS on top of i3en EC2 instances. For us it's about 30% cheaper than the base EBS cost, but provides access to 10x the IOPS of base gp3 volumes.

The downside is durability and operations - we have to keep Ceph alive and are responsible for making sure the data is persistent. That said, we're storing cache from container builds, so in the worst-case where we lose the storage cluster, we can run builds without cache while we restore.

2 replies

Nice article! We've also recently reached the mark of 1TB/s at CERN, but with EOS, not ceph.

Our EOS clusters have a lot more nodes, however, and use mostly HDDs. CERN also uses ceph extensively.

1 replies

Great! What's your take on ceph? Is the idea to migrate to EOS long term?

0 replies

EOS and ceph have different use cases at CERN. EOS holds physics data and user data in CERNBox, while ceph is used for a lot of the rest (e.g. storage for VMs, and other applications). So both will continue to be used as they are now. CERN has over 100PB on ceph.

1 replies

This is an insanely expensive cluster built to show a benchmark. 68 node cluster serving only 15TB storage in total.

0 replies

The purpose of the benchmarking was to validate the design of the cluster and to identify any issues before going into production, so it achieved exactly that objective. Without doing this work a lot of performance would have been left on the table before the cluster could even get out the door.

As per the blog, the cluster is now in a 6+2 EC configuration for production which gives ~7PiB usable. Expensive yes, but well worth it if this is the scale and performance required.
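The ~7 PiB figure is consistent with the hardware described elsewhere in the thread. The sketch below assumes all 68 nodes with 10 drives of 15.36 TB each sit behind the 6+2 erasure-coded pool, and it ignores filesystem overhead and the headroom a real cluster keeps free; the names are illustrative.

```python
# Rough usable capacity of the cluster under 6+2 erasure coding.

NODES = 68
DRIVES_PER_NODE = 10
DRIVE_TB = 15.36          # decimal terabytes per drive
EC_DATA_CHUNKS = 6        # k
EC_PARITY_CHUNKS = 2      # m

raw_tb = NODES * DRIVES_PER_NODE * DRIVE_TB

# With k data chunks and m parity chunks, k / (k + m) of raw space holds data.
usable_fraction = EC_DATA_CHUNKS / (EC_DATA_CHUNKS + EC_PARITY_CHUNKS)
usable_tb = raw_tb * usable_fraction

# Convert decimal TB to binary PiB.
usable_pib = usable_tb * 10**12 / 2**50

print(f"raw:    {raw_tb:,.1f} TB")
print(f"usable: {usable_tb:,.1f} TB, about {usable_pib:.2f} PiB")
```

With 3x replication instead, the same raw capacity would yield only a third as usable space, which is part of why erasure coding is attractive at this scale.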

1 replies

I'm playing a lot with MicroCeph. It's an opinionated, low TOS, friendly setup of Ceph. Looking forward to additional comments. Planning to use it in production and replace lots of NAS servers.

0 replies

I think Ceph can be fine for NAS use cases, but be wary of latency and do some benchmarking. You may need more nodes/osds than you think to reach latency and throughput targets.

1 replies

Cool benchmark, and interesting; however, it would have read a lot better if abbreviations were explained at first use. Not everybody is familiar with all the terminology used in the post. Nonetheless, congrats on the results.

0 replies

Thanks (truly) for the feedback! I'll try to remember for future articles. It's easy to forget how much jargon we use after being in the field for so long.

1 replies

This is a fascinating read. We run a Ceph storage cluster for persisting Docker layer cache [0]. We went from using EBS to Ceph and saw a massive difference in throughput. Went from a write throughput of 146 MB/s and 3,000 IOPS to 900 MB/s and 30,000 IOPS.

The best part is that it pretty much just works. Very little babysitting with the exception of the occasional fs trim or something.

It’s been a massive improvement for our caching system.


0 replies

Did something very similar almost 10 years ago. EBS costs were 10x+ the cost of a Ceph cluster with the same performance on the node disks. Eventually we switched to our own racks and cut it almost in ten again. We developed the in-house expertise for how to do it and we were free.

1 replies

Does anyone know how Ceph compares to other object storage engines like MinIO/Garage/...?

I would love to see some benchmarks there.

0 replies

This would be great, to have a universal benchmark of all available open source solutions for self-hosting. Links appreciated!

1 replies

I'm curious what the performance difference would be on a modern kernel.

0 replies

For context, I’ve been leading the work on this cluster client-side (not the engineer that discovered the IOMMU fix) with Clyso.

There was no significant difference when testing between the latest HWE on Ubuntu 20.04 and kernel 6.2 on Ubuntu 22.04. In both cases we ran into the same IOMMU behaviour. Our tooling is all very much catered around Ubuntu so testing newer kernels with other distros just wasn’t feasible in the timescale we had to get this built. The plan was < 2 months from initial design to completion.

Awesome to see this on HN, we’re a pretty under-the-radar operation so there’s not much more I can say but proud to have worked on this!

0 replies

My old company ran public and private cloud with OpenStack and Ceph. We had 20 Supermicro storage nodes (24 disks per server) and the total capacity was 3PB. We learned some lessons, especially that a flapping disk degrades the performance of the whole system. The solution was to remove any disk with bad sectors as soon as possible.

0 replies

Remember, random IOPS without latency is a meaningless figure.

0 replies

Sure would be nice if you defined some acronyms.

0 replies

The worst problems I've had with in-cluster dynamic storage were never strictly IO related, and were more the storage controller software in kubernetes having problems with real-world problems like pods dying and the PVCs not attaching until after very long timeouts expired, with the pod sitting in ContainerCreating until the PVC lock was freed.

This has happened in multiple clusters, using rook/ceph as well as Longhorn.