
Erasure Coding for Distributed Systems

jeremycarter
8 replies
4d19h

When I was younger, I literally thought PARs were magic files. I had no idea how they worked, and from a distance it was magic.

dragontamer
4 replies
4d15h

PAR files use Reed-Solomon error correction, which IMO might as well be magic.

Galois Fields are really awesome (and are related to CRC codes). The level of effort to learn them is quite high. NASA's guide to Reed-Solomon was amazing, though.

-----------

XOR codes are easier and sloppier, but they're actually what's used today despite being imperfect.

Let's say you have A, B, C... Z as your data.

Parity#1 could be XOR(A, B, Z). If B were missing, Parity#1 XOR A XOR Z can back-calculate B.

Parity#2 can be the XOR of a random subset of all previous entries (including Parity#1). Etc. etc. etc.

Keep adding XOR parity codes until your probability of reconstruction is high enough to please you.

I believe this concept is called a Fountain Code.
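
A minimal sketch of that back-calculation in Python (block names and contents are made up; a real fountain code picks its parity subsets more carefully):

    # Three equal-length data blocks and one XOR parity over them.
    A, B, Z = b"alpha", b"bravo", b"zulu!"
    parity1 = bytes(a ^ b ^ z for a, b, z in zip(A, B, Z))

    # Lose B: XOR the parity with the surviving covered blocks to rebuild it.
    recovered_B = bytes(p ^ a ^ z for p, a, z in zip(parity1, A, Z))
    assert recovered_B == B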

halfcat
1 replies
4d14h

Also, I've found that while most people (non-tech) don't have a concept of XOR, they probably took basic algebra and understand 1+?=3.

Arithmetic wouldn't be a good implementation due to integer overflow (a problem XOR doesn't have), but it's helpful if you ever have to explain it to the less technical business person whose sign-off you need on the purchasing decision.
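
A tiny sketch of that arithmetic framing (illustrative only; real implementations use XOR, so overflow never comes up):

    # "1 + ? = 3": store the sum as parity, recover a missing value by subtraction.
    data = [1, 2]
    parity = sum(data)            # 3
    missing = parity - data[0]    # we lost data[1]; 3 - 1 = 2
    assert missing == data[1]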

genewitch
0 replies
4d13h

That's how I used to explain what a nonce was, to explain what all the computers were doing to "mine" bitcoin, and then explain: "they're instead trying to get a number that has a certain number of zeros in a specific place."

dragontamer
0 replies
4d3h

Yes, this is it.

It's a very shallow introduction to Galois Fields, but it's just barely enough to reach Reed-Solomon encoding and understand error correction codes.

As I said earlier: it's a lot of math, even in this simplified form. Abstract conceptual math. But I do find this kind of abstractness very fun.

cdumler
2 replies
4d17h

"Any sufficiently advanced technology is indistinguishable from magic." -- Arthur C. Clarke

I thought the same thing when using PAR files. They're still useful today if you save things on media that can be damaged (CD, DVD, Blu-ray) or across multiple media.

Eventually, I decided to dig into the math behind it. It is a surprisingly simple principle:

Given a polynomial of degree X-1 and an array of X data points, there is one and only one solution for the polynomial's coefficients such that it passes through those data points.

So, stripe the data into bands of arrays, compute the polynomial, compute additional data points on the curve, and save them with the original data. If you have at least the array's size worth of data points (original array and/or parity values) and know the place in the list for each data point (thus which data is missing), there is one and only one solution to the polynomial equation. Once you solve the polynomial again, you can compute any point, including the missing ones, again because there is one and only one solution for the curve.

The devil is in the math necessary to solve the polynomials, which is why it is so computationally intensive.
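
A toy version of the idea in Python, using exact rational arithmetic instead of the Galois-field arithmetic real PAR/Reed-Solomon implementations use (stripe values and loss pattern are made up):

    from fractions import Fraction

    def lagrange_eval(points, x):
        """Evaluate the unique polynomial through `points` at position x."""
        total = Fraction(0)
        for i, (xi, yi) in enumerate(points):
            term = Fraction(yi)
            for j, (xj, _) in enumerate(points):
                if i != j:
                    term *= Fraction(x - xj, xi - xj)
            total += term
        return total

    data = [7, 1, 4, 9]                         # one stripe of data values
    pts = [(i, lagrange_eval(list(enumerate(data)), i))
           for i in range(6)]                   # positions 4 and 5 are parity

    survivors = [pts[0], pts[2], pts[4], pts[5]]   # lose positions 1 and 3
    recovered = [lagrange_eval(survivors, i) for i in range(4)]
    assert recovered == data                    # the curve is unique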

pwg
0 replies
4d1h

They're still useful today if you save things on media that can be damaged (CD, DVD, Blu-ray)

For writable disk media, there is also dvdisaster:

https://dvdisaster.jcea.es/

kqr
0 replies
4d13h

"Any sufficiently advanced technology is indistinguishable from magic." -- Arthur C. Clarke

"If anything seems like magic you're not asking enough questions." -- Mario Romero Vega

masspro
4 replies
4d12h

I was unpleasantly surprised by but thankful to have found eclecticlight.co’s findings about PAR2. When I learned about PAR2 I immediately wanted to make par files for everything because bit rot scares me. But, from https://eclecticlight.co/2020/04/20/file-integrity-5-how-wel... :

This has serious implications for the use of Par2 with files much larger than 20 MB, and probably rules it out as a method of ECC for those larger than 1 GB.

I assumed 10% PAR file size == resistance to 10% of the input file being corrupted, but that’s not how it works. The article shows some nonlinear and non-obvious relationships between input file size, par file size, and maximum number of recoverable errors.

loeg
0 replies
4d

I think something about the test methodology in that article is severely flawed.

justsomehnguy
0 replies
4d

because bit rot scares me.

Use a WinRAR/RAR recovery record for the important things.

There is one site that still mandates a 5% RR for archives, because before ubiquitous HTTPS, archives trashed in transit were the norm.

hcs
0 replies
4d10h

It's my understanding that par2 is designed for missing files (parts of a multi-part archive), not the uniform random bit-rot corruption used in that article. I think it can recover a much larger corrupted or missing block, approaching the size of the parity files.

But yeah, if that's your data loss model then par2 isn't the right approach. (Not sure what is.)

dragontamer
0 replies
4d3h

Bitrot is reasonably handled by erasure codes simply by having CRC32 checksums (or similar) verify the parts.

If a piece has bitrotted away, then you throw away the whole segment.

CRC32 is closely related to Reed-Solomon / Galois Fields. It's basically repeated division + remainders in a Galois Field. And as we all know, division is very good at mixing up bits (true in normal math as well as Galois Fields).

The real benefit of cyclic codes is the guarantee to catch any burst error of size 32 or less (for a CRC32). You only get a chance of false negatives if the error region is larger than the CRC size.
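
A minimal bitwise sketch of that "repeated division with remainders" view, checked against zlib's CRC-32 (table-driven and PCLMUL/GFNI kernels compute the same thing, just much faster):

    import zlib

    def crc32_bitwise(data: bytes) -> int:
        # Reflected CRC-32 (polynomial 0xEDB88320): divide by the generator in
        # GF(2), one bit at a time, keeping only the remainder.
        crc = 0xFFFFFFFF
        for byte in data:
            crc ^= byte
            for _ in range(8):
                crc = (crc >> 1) ^ (0xEDB88320 if crc & 1 else 0)
        return crc ^ 0xFFFFFFFF

    msg = b"erasure coding"
    assert crc32_bitwise(msg) == zlib.crc32(msg)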

------

Indeed: the whole erasure code / correction code thing involves complex math constructs so that these tight guarantees can be made (be it CRC32 or Reed-Solomon, or any old-school bit-error algorithm).

halfcat
0 replies
4d14h

Yes, also RAID5 has been in use since at least the 1980s.

hinkley
9 replies
4d19h

Years back someone proposed a cute algorithm for erasure codes that depended not on spinning rust but on multipath networking.

I believe they called it network coding, and the idea was that in a network with multiple routes, I might get a file faster by pulling an erasure code that combined two parts of the file, or even two files, from one upstream, instead of waiting for the entire file from the primary server.

sva_
3 replies
4d17h

I thought about something like that to make video calls more stable. For example, I'll get completely different routes (sometimes with noticeably lower/more predictable latency) when I use a VPN to connect to some peer in the US (from Europe). It would be cool to combine different routes.

supertrope
1 replies
4d16h

It’s called SD-WAN

kayg04
0 replies
4d12h

Can you explain? I tried looking it up but I didn't quite understand how this is SD-WAN.

toast0
0 replies
4d3h

One difficulty with using multiple routes is that you'll probably need to spend a lot of bytes on active probing: the quality of a connection may change during the call, and when the active connection loses quality it's not apparent what the other connections will do, so you need recent probing from them as well to make a smart change.

But in theory, you should have several routing options in a well-supported calling service. I'll illustrate one direction, but the same options apply in the other direction, and there's no need for both peers to use the same connection to send (although WebRTC will):

Peer A -> Peer B

Peer A -> Relay near A -> Peer B

Peer A -> Relay near B -> Peer B

Peer A -> Relay near A -> Relay near B -> Peer B

If at least two of the four hosts mentioned have IPv4 and IPv6, you can also add those permutations. It's pretty typical to have different routing for v4 and v6.

nullc
1 replies
4d18h

Network coding is more than that: participants in the graph can synthesize new parts on the fly from parts they just got, without having the whole thing.

FWIW, Freenet at least uses FEC-coded files, so you have some flexibility in which parts you get, and durability against a file becoming broken just because a single part gets lost.

StillBored
0 replies
4d17h

And Usenet binaries with https://en.wikipedia.org/wiki/Parchive, and even earlier with RAR's recovery block, and probably even earlier with BBSes, but my memory is failing me as to what I was using before it in the 1990s.

Edit: I see that immediately below, while I was composing this, someone mentioned Parchive.

epistasis
1 replies
4d17h

This has been used in Ceph for a long time:

https://docs.ceph.com/en/latest/rados/operations/erasure-cod...

I would not be surprised if there was a lot of stuff like this behind S3 and other cloud storage systems too, at least in the lower-access tiers of storage, but I have no actual knowledge of AWS or GCP systems.

preisschild
0 replies
4d14h

Yeah, I use it in my homelab and it is really awesome to have "RAID(5/6)" basically work over the network.

nullc
6 replies
4d14h

Unfortunately, modern CPUs are still pretty sparse on tools to make erasure codes extremely fast, e.g. no vector clmuls.

klauspost
5 replies
4d9h

Almost all (x86) CPUs sold have GFNI. That can pretty much saturate memory bandwidth on a single core or two. You can use SSSE3 pshufb for the rest, which is about half the speed.

ARM has NEON and SVE/SVE2. They also operate very fast.

So I'm not sure what you are thinking of.

nullc
4 replies
4d

GFNI only does 8 bits and sticks you with a single modulus. But for some reason I'd forgotten it completely; you're totally right to give me a mystified response.

(FWIW, it's possible to project elements from one field to another isomorphic field, though it takes enough operations that for fast code like RS decoding the conversion is probably performance limiting).

For hybrid codes GFNI should be sufficient, though for things like using RS at 16/32 bit sizes it's not.

dragontamer
2 replies
3d14h

The matrix multiplication needed to correct grows as n^2 even though the code only grows as n. There are asymptotically faster matrix multiplications than O(n^2) in theory, but in practice every algorithm is O(n^2).

As such, large Reed-Solomon codes are impractical. If you need a larger code than what GF(2^8) can offer, you grow with two-dimensional codes, slicing, or other features.

In practice, this sacrifices the minimum-distance property, meaning you should use a turbo code (or another XOR code), which is O(n) but imperfect.

---------

CRC32 can be implemented in GFNI. And AES is also GF(2^8).

----------

I don't think there are many algorithms where GF(2^16) or bigger is needed.

And if it were, it's possible to build 16x16 or 32x32 out of 8x8 anyway.
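
For reference, a scalar sketch of the GF(2^8) multiply that GFNI or a pshufb table kernel vectorizes (a generic textbook routine, not any particular library's code; 0x11D is a reduction polynomial commonly used for Reed-Solomon, AES uses 0x11B):

    def gf256_mul(a: int, b: int, poly: int = 0x11D) -> int:
        """Carry-less multiply of two bytes, reduced modulo an irreducible polynomial."""
        result = 0
        while b:
            if b & 1:
                result ^= a
            a <<= 1
            if a & 0x100:        # degree 8 reached: XOR out the modulus
                a ^= poly
            b >>= 1
        return result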

nullc
1 replies
3d10h

RS codes can be done in N log N time for both encoding and erasure decoding, and sometimes achieving MDS is useful for attack resistance, e.g. you can always decode with N blocks, whereas with non-minimum-distance codes you can have adversarial subsets that fail to decode even when you have significantly more blocks than needed. Larger word sizes for RS codes are also useful for list decoding even when the number of data blocks isn't that great.

And sure, list decoding is slow. But I think there are two distinct groups of applications for good instruction sets: one is where you're trying to make a fast thing, like an 8-bit RS code, essentially "free" by having it run at near memory, wire, or disk speeds (or make it consume little power); the other is where you're doing something legitimately slow, including things that are O(N^2) (or worse). In those cases, sometimes a small constant factor makes a big difference between usable and very much not usable.

dragontamer
0 replies
2d4h

In those cases, I know that ARM has PMULL / PMULL2 and x86 has PCLMULQDQ, which go up to 64x64 == 128-bit multiplication.

There is even an AVX512 version of PCLMULQDQ.
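
A scalar sketch of what those instructions compute: long multiplication with XOR in place of addition, giving a 128-bit product that then only needs a modular reduction for CRC or wide-field GF work (not how you'd write the fast path, obviously):

    def clmul64(a: int, b: int) -> int:
        """Carry-less multiply of two 64-bit values into a 128-bit result
        (the scalar equivalent of PCLMULQDQ / PMULL)."""
        result = 0
        for i in range(64):
            if (b >> i) & 1:
                result ^= a << i
        return result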

klauspost
0 replies
3d9h

In my experience, the hybrid approach is practically better due to cache locality anyway, so I don't think even native GF16 or GF32 would be of much help.

FWIW, I've ported rs-leopard to Go and found it very effective, except in the "recover one" scenario, where it is only at 50% speed of the "recover all" scenario, since it has to do several reconstructions to get one output. But even so, I am not too sure it would be much better with plain GF16, since you would still need to touch most shards to get one shard out.

AYBABTME
2 replies
4d14h

Aren't there patent problems with fountain codes?

jcalvinowens
1 replies
4d14h

Luby's original paper was published in 2002. Not sure about RaptorQ though...

Sanzig
0 replies
4d2h

IIRC, Qualcomm still holds patents on RaptorQ. They do provide a blanket license exemption for implementing RFC6330.

jumperabg
1 replies
4d11h

Are rateless fountain codes the better solution, and if so, are there any systems that are using them?

toolslive
0 replies
4d6h

Amplidata did this https://en.wikipedia.org/wiki/Amplidata

It's a great solution (fast, storage overhead of about 1.2%) iff your data is immutable.

Luker88
0 replies
4d11h

I have implemented RaptorQ and RFC6330.

First, the RFC is pointlessly complex and optimized for files, not for streaming. If you want to play with it, manage blocks yourself and ignore the asinine interleaving and block-size management.

Second, the algorithm is actually split into two parts, and while the second (generation of repair blocks) is linear, the first is cubic in the number of messages that you put together in a block (roughly a matrix Gaussian elimination).

And while parts of both encoding and decoding can be cached, I think that "linear time" encoding for RaptorQ is actually just false marketing speak.

wahern
4 replies
4d19h

Has anybody used Wirehair in a project? https://github.com/catid/wirehair

I'm curious if it's well-defined enough to base a standard around (informally if not formally) for building a large file archiving/data recovery project I've been mulling over for nearly 10 years. It's the only large-block erasure code I've found that has both the ideal (or nearly ideal) algorithmic performance and API. That makes it a nice black box for my use case, unlike something like RaptorQ, which leaks little details all over the place, driving up the complexity and rigidity of the rest of the stack. But Wirehair isn't a spec; it's an (experimental?) implementation of an idea. It seems stable, but unless/until I try writing a second implementation, or it sees substantial use (exposing any sharp edges in the algorithm), I worry how easily it would translate to a reliable specification or second implementation.

nullc
3 replies
4d19h

We previously used it in Bitcoin Fibre (a fork of the Bitcoin node software with special enhancements for block relay). It's extremely nice.

Be aware that Qualcomm might claim that it's covered by RaptorQ patents (it is conceptually related), though the earliest of those are about to expire (or just expired, haven't checked the file wrapper lately) and QC has made some commitment to not apply the RaptorQ patents outside of wireless (but that might be only for conforming implementations, I don't recall).

I've looked at what it would take to specify it (which is something we'd want to do if using it in the bitcoin protocol proper) and wasn't super excited about doing it, even though I and several other bitcoin developers are quite comfortable with number theory and error-correcting codes. It's just that Wirehair's structure has a fair amount of ad-hoc-ish details, and knowing us, we might get sucked into a trap of improving it. :)

There might be some renewed interest in bitcoin land in getting a fountain code into wide use; if so, waiting a while might result in someone else writing a spec.

Depending on your exact application, you might find https://github.com/catid/fecal interesting too; if your expected erasure count is very low, it could be faster than Wirehair.

Leopard is mentioned in the article; it's not a fountain code, but it has a pretty big block size. It has a nice advantage for specification: it's merely a very fast implementation of a boring RS code (so a spec would arguably only need to document the field and generator choice).

latchkey
2 replies
3d23h

Bitcoin Fibre

Wow, been a long time since I've heard that project named. What ever happened with that?

nullc
1 replies
3d17h

The bitcoin satellite broadcast stuff uses a fork of the fibre codebase with changes to the FEC to better adapt it to recovering from signal disruptions. Otherwise it's not maintained anymore.

Fibre contained a number of complementary optimizations. A couple have been merged into popular node software; that's what compact blocks (BIP152) is. The rest are still valuable, particularly now with miners frequently including never-relayed transactions, CPUs having become ever faster relative to latency, and some of the underlying number-theory toys from proposed-but-never-implemented further Fibre enhancements having since been implemented for other purposes. Though the unmerged parts were the harder parts to upstream.

But with many of the contributors who worked on that sort of stuff driven out by vexatious litigation and other forms of harassment, I dunno if anyone will continue that work any time soon. Although, I think there has been some interest. The remaining contributors that I think are more likely to work on things like that have been busy developing an improved combinatoric optimizer for transaction selection, but that work is almost finished.

The two major things Fibre had that are missing now are using a fountain code to fill in missing data from blocks, and transmission over UDP. The two can be implemented independently, but they're much more powerful together. Both are independently kind of complicated to implement in a non-yolo way: the error-correcting code because it seems rocket-sciency and so narrows the pool of people willing to try working on it, and the UDP because it's arguably a whole separate P2P protocol and has a lot of connection and resource management questions that have to be competently answered (some of which were just punted on in Fibre).

latchkey
0 replies
3d17h

Great answer @nullc, thank you for the context details. I always felt like having a subnetwork that was super fast and optimized for sending transactions around the mempool would be really useful.

The optimizer is a fascinating knapsack problem to me. A lot of work in ETH land went on with this, especially with the work on Flashbots/MEV. But we hear less and less about it in the public these days.

Cheers

ggm
4 replies
4d18h

Am I right in thinking that products made during an M-of-N incident are coded differently than when all N are available? If so, you might want a bit flag to denote "needs to be re-encoded when the N is restored," or else you have some files with less-than-stellar recovery for a random loss in the N set.

jeffbee
3 replies
4d18h

Whenever you have a stripe with missing chunks, it needs to be re-encoded ASAP, because those stripes will be lost if they lose enough chunks. Every distributed storage system needs some kind of librarian to go around grooming the stripes to keep them out of danger.

ggm
1 replies
4d17h

My point was specifically about new things created during the period. Resilvering happens on addition of a replacement drive in ZFS. I am less sure that otherwise valid, complete checksum states of files made during the loss period get uplifted to the wider stripe count when that is done.

I say that because I have seen some stuff which suggests that when you grow the FS with a new vdev in ZFS, there are circumstances where the balance is not fixed and you can have a persistent unbalanced I/O state.

jeffbee
0 replies
4d17h

In distributed systems you don't need to be constrained to writing to a set of devices, one of which is not available. You can just write anywhere, and remember where.

markhahn
0 replies
4d16h

just a suggestion: let the curator's activity be called "preening" rather than "grooming"...

stevefan1999
1 replies
4d15h

Yep, this is the key tech behind Ceph's Erasure Code pool: https://docs.ceph.com/en/latest/rados/operations/erasure-cod...

This does not come without trade-offs, though: you cannot update the coding parameters (k, m) afterwards, so you either have to be very sure that those parameters are going to work for a long time, or you have to start from scratch. This inelasticity is also the reason why replicas are still the dominant choice for HA fault-tolerant data storage.

jsolson
0 replies
4d13h

Funny story: if you're using Rook Ceph and say, "I'm going to try to update these parameters to see if it will let me (and trigger a re-encoding)," it absolutely will let you change them, but it does not trigger anything to re-encode.

It just uses --force and leaves you with a corrupt filesystem.

I suppose that's only really funny in a "you had to be there, and not be me" sense.

donavanm
1 replies
4d15h

If you're interested in EC, you might want to consider larger multi-dimensional cases. Think of encoding not just across spindles, but across another failure domain like rack, room, DC, or region. The goal is to tolerate common component failures as well as larger system failures (or partitions). A nice intro: https://chameleoncloud.org/blog/2023/12/12/design-considerat...

londons_explore
0 replies
4d12h

You also need to take into account other constraints, like recovery time. If one of your company's datacenters gets destroyed, then it's great to be able to recover from the other 6 using clever erasure codes, but if that recovery requires reading every byte of data in every other DC and sending it over the network, it's gonna take 6+ months to transfer all that data over a cross-ocean fiber which might only be 1 Tbps.
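
Rough arithmetic behind that kind of figure (the ~2 EB total is purely an illustrative assumption):

    # Back-of-envelope: time to reship a destroyed DC's worth of data over a single link.
    data_bytes = 2e18                       # ~2 exabytes to read and resend (hypothetical)
    link_bits_per_s = 1e12                  # 1 Tbps cross-ocean fiber
    seconds = data_bytes * 8 / link_bits_per_s
    print(f"{seconds / 86400:.0f} days")    # ~185 days, i.e. 6+ months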

from-nibly
0 replies
4d19h

This is one of the replication strategies that Ceph uses for its distributed blob stores.

anonymousDan
0 replies
4d17h

Is it the case that it's really only practical for read-only or very read-intensive workloads?