The SIMD code does come with one asterisk, though: because floating-point addition is not associative, and it performs the summation in a different order, it may not get the same result as straight-line code. In retrospect, this is likely why the compiler doesn't generate SIMD instructions to compute the sum!
Does this matter for your use case? Only you can know!
Anything is possible of course, and the typical way of writing a loop to sum up an array is literally telling the computer to go element by element and accumulate the values.
But it seems pretty unlikely that doing, say, four accumulations in parallel with SIMD and then summing them up at the end is any more wrong than going element by element.
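For concreteness, here is a rough sketch of what "four accumulations in parallel" means, in plain C with no intrinsics (the name is mine; a vectorizing compiler is free to map the four independent sums onto SIMD lanes):

```c
#include <stddef.h>

/* Sum an array with four independent partial sums, the way a 4-lane
   SIMD loop would, then combine them at the end. The result can differ
   slightly from a single left-to-right sum because the additions are
   grouped differently. */
float sum4(const float *x, size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; i++)          /* leftover elements */
        s0 += x[i];
    return (s0 + s1) + (s2 + s3);
}
```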
Summing floats should by default be taken to have error bounds, and any answer within those bounds is valid. If you know something special about the floats you are inputting, the language should have some means to explicitly encode that, and that means shouldn’t be the most basic loop: the basic loop is the default case, so it should give the best performance.
It turns into a serious problem if the floats are of very different magnitudes. Consider [1e50, -1e50, 1e3, 1e3]. Evaluating it as (((1e50 + -1e50) + 1e3) + 1e3) gives 2e3, while evaluating it as ((1e50 + 1e3) + (-1e50 + 1e3)) gives 0.
You get similar issues if you're trying to add a large number of small-magnitude numbers to a single larger-magnitude number. With doubles, (((1 + 1) + 1) ... + 1e16) ends up substantially different from (((1e16 + 1) + 1) ... + 1): each individual 1 added to 1e16 falls below its rounding threshold and vanishes, while the 1s summed first accumulate into something large enough to survive the final addition.
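Both effects are easy to reproduce (a small demo, assuming IEEE-754 doubles):

```c
#include <stdio.h>

int main(void) {
    /* Grouping changes the answer when magnitudes differ wildly. */
    double a = ((1e50 + -1e50) + 1e3) + 1e3;   /* 2000 */
    double b = (1e50 + 1e3) + (-1e50 + 1e3);   /* 0: each 1e3 is absorbed */

    /* Many small terms added one at a time into a large accumulator
       are dropped; summed first, they survive. */
    double c = 1e16, d = 0.0;
    for (int i = 0; i < 1000000; i++) c += 1.0;  /* each 1.0 is lost */
    for (int i = 0; i < 1000000; i++) d += 1.0;
    d += 1e16;

    printf("%g %g\n", a, b);        /* 2000 0 */
    printf("%.1f\n%.1f\n", c, d);   /* 1e16 vs 1e16 + 1e6 */
    return 0;
}
```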
Adding and subtracting numbers of wildly different magnitudes is always cursed. If you're doing this I'd say the majority of the time you are making an error.
Your equation ends up being like "We calculated the diameter of the observable universe, but what if we added the Empire State Building to one side, how does that affect the result?"
It’s also what we do when incrementing a counter, if it’s a very big counter. Operations like that don’t have a physical meaning, but can happen in things like random number generators where it’s a way of mixing the bits. Or maybe allocating unique ids, if you started from a random number.
Of course it is just an example so we shouldn’t nitpick you too much, but that sounds more like a job for an integer.
Yes, absolutely. Though in JavaScript we often make do with the safe integer range. (I don’t know how fast BigInt is for 64 bit integers.)
I don’t know JavaScript so I’m sure to shoot my own foot here, but surely a BigInt whose value can fit in 64 bits is just a normal 64-bit int (from hardware) with some extra bookkeeping around it on most platforms, right?
I can't speak for any other engines, but in SpiderMonkey a BigInt is a GC-managed object with a one-word header containing the sign bit, the length, and some GC bookkeeping bits, along with a second word that is either a single inline 64-bit digit, or a pointer to an array of 64-bit digits. So the memory overhead of a BigInt that fits in 64 bits is "only" 2x, but doing math with BigInt requires more work than a native integer because it needs to handle larger numbers, manually manage sign bits, and so on.
(https://searchfox.org/mozilla-central/source/js/src/vm/BigIn...)
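In rough C terms, the layout described above is something like this (a simplified illustrative sketch, not SpiderMonkey's actual declarations; the type name is made up):

```c
#include <stdint.h>

/* Simplified sketch of the layout described above: a one-word header
   (sign bit, digit count, GC bookkeeping) plus one word that is either
   a single inline 64-bit digit or a pointer to a heap-allocated array
   of 64-bit digits. */
typedef struct {
    uint64_t header;                /* sign, length, GC bits */
    union {
        uint64_t inline_digit;      /* used when the value fits in one digit */
        uint64_t *digits;           /* otherwise, pointer to the digit array */
    } payload;
} BigIntSketch;
```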
If all you’ve got is JS then every problem looks like a double.
I don't remember the exact numbers, but I remember seeing a CAD file that was not working well a few years ago, and it turned out the coordinate system set up for the part was roughly equivalent to building a part the size of a man's fist but measuring it from the center of the Earth (in meters) while putting the part in Earth orbit.
But you haven't demonstrated that the SIMD approach produces a meaningfully different and more accurate result than the naive method.
Just like processor magic made the theoretically more performant code perform worse, it's possible that the processor has facilities for bucketing the data such that results from such a dataset are the same or more accurate than the naive approach.
This is absolutely a situation that requires empirical testing.
Yeah, no.
If I know what my data look like, I can choose an order of summation that reduces the error. I wouldn't want the compiler by default to assume associativity and introduce bugs. There's a reason this reordering is disabled by default.
Not least that you should get the same result from your code every time, even if you compile it with a different compiler. Doesn't matter if that result is somehow better or worse or “difference is so small that it doesn't matter”; it needs to be exactly the same, every time.
I think there’s often a disconnect in these discussions between people that work on libraries and people that work on the application code.
If you have written a library, the user will provide some inputs. While the rounding behavior of floating-point operations is well defined, for arbitrary user input you usually can’t guarantee which way any particular operation will round. Therefore, you need to do the numerical analysis given user inputs from some range, if you want to be at all rigorous. This will give results with error bounds, not exact bit patterns.
If you want exact matches for your tests, maybe identify the bits that are essentially meaningless and force them to some arbitrary fixed value.
Edit: that said I don’t think anybody particularly owns rigor on this front, given that most libraries don’t actually do the analysis, lol.
It absolutely is not. Floating-point addition is completely defined (save for some of the bits in NaN results), and no CPU will start reordering your addition instructions in a way that changes the result. Even if it's an ordering that gives you better accuracy. Having SIMD do four (or more) of them in parallel does not change this.
Not realistically. The processor very much needs to do what you tell it.
Whether you care about the order or not depends on the intent of the programmer. I think the intent is usually to sum up the array for this sort of code. So, at first glance your first example looks more like a result of something like 0 ± 1e35 in double to me (the rounding error near 1e50 in a double is about 1e50 × 2^-52 ≈ 2e34; I added one to the exponent to avoid having to do math).
The intent of the programmer always has to be interpreted by the language. I’m saying the default interpretation should be one that doesn’t impose an ordering unless specifically requested. This is much more sympathetic to modern deeply parallel machines.
Very often there's no particular intent, other than to be consistent. You haven't lived until you've traced discrepancies in an ML model deployed across a bunch of platforms back to which SIMD instructions each happens to support (usually squirreled away in some dependency).
Isn't this the type of thing -ffast-math is supposed to enable?
-ffast-math can be read as -fwrong-math if the accuracy of the answers matters. For some applications accuracy might not matter much. But the differences you get by pretending that floating-point math is associative can be enormous, and scientific and engineering code is often carefully designed to take the likely scale of values into account; rearranging it is not a good idea.
Rearranging, like sorting the inputs first, is actually a good idea to minimize errors.
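Something like this, for instance (a sketch; the names are made up, and it assumes finite values):

```c
#include <math.h>
#include <stdlib.h>

/* Compare by absolute value, ascending. */
static int cmp_abs(const void *pa, const void *pb) {
    double a = fabs(*(const double *)pa);
    double b = fabs(*(const double *)pb);
    return (a > b) - (a < b);
}

/* Sum after sorting by magnitude: small terms get a chance to
   accumulate before meeting the large ones, which tends to reduce
   rounding error, especially for same-signed data. */
double sorted_sum(double *x, size_t n) {
    qsort(x, n, sizeof *x, cmp_abs);
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}
```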
For code that's been explicitly manually optimized already, indeed you wouldn't want to turn on -ffast-math; and I'd hope anyone who has actually learned what benefits from manual optimization, and how to properly do it, would also at some point have read something about not coupling that with -ffast-math. But I really doubt such meaningfully-manually-optimized code is at all frequent, and where it isn't, -ffast-math is a rather simple improvement (potentially using a subset of its flags to keep NaN or infinity handling if desired).
Assuming associativity is not anti-accuracy in general, it just gets in the way of certain clever algorithms.
Maybe the “fun safe math” flag could be used?
Fast math does a lot of risky things beyond just applying real-number algebra rules: it also assumes there are no NaNs or infinities, ignores signed zeros, and can flush denormals to zero.
Sorting the floats reduces the error. I think using multiple accumulators then reduces the accuracy in that case. And sorted data is not uncommon.
I think there is always a correct answer, and the compiler shouldn’t, at least by default, make changes that are wrong.
But ways for the programmer to express his intent more clearly are always welcome.
Using multiple accumulators increases accuracy. See pairwise summation, for example.
SIMD sums are typically going to be much more accurate than a naive sum.
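For reference, a minimal pairwise-summation sketch (the name is mine); the worst-case rounding error grows roughly with log n rather than with n:

```c
#include <stddef.h>

/* Pairwise (cascade) summation: split the array in half, sum each half
   recursively, and add the two partial sums. Each term then passes
   through O(log n) additions instead of up to O(n) in the naive loop. */
double pairwise_sum(const double *x, size_t n) {
    if (n <= 8) {                       /* small base case: plain loop */
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += x[i];
        return s;
    }
    size_t half = n / 2;
    return pairwise_sum(x, half) + pairwise_sum(x + half, n - half);
}
```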
Not necessarily either. It's not particularly hard to create a vector where in-order addition is the most accurate way to sum its terms. All you need is a sequence where the next term is close to the sum of all prior ones, e.g. 1, 1, 2, 4, 8, 16, ...
There just isn't a one-size-fit-all solution to be had here.
But there is: https://news.ycombinator.com/item?id=40867842
That’s a one-size-fits-some solution. “Not more than 2x slower” than native floats is still pretty slow for the cases where you don’t need the extra precision.
It might be the case that it is the best solution if you do need the extra precision. But that’s just one case, not all.
Not necessarily. If the array is large and the values are similar, using multiple accumulators would yield a more accurate result than a single one.
A lot of code relies on FP operations being deterministic, often within the confines of a specific ISA. Applying SIMD to loops with FP could have been a default, but it breaks a lot of existing code, makes the output frequently straight-up non-deterministic, and as a result is something that a programmer has to explicitly choose to use.
Moreover, many programmers may not be even aware of these, so when a `float Sum(float[] values)` starts to return something different, they may not have any way to know that it is the vectorization that makes it do so. Which is why, for example, .NET's standard library will use SIMD for `integers.Sum()` but not for `floats.Sum()`.
How much of this is true in practice? I vaguely recall reading about a hard-to-diagnose bug where a seemingly unrelated code change meant that an intermediate value was no longer stored in the x87 extended-precision register but in memory instead, leading to a different result. Wouldn't you run into stuff like that all the time?
A lot, but still quite the minority. People do run into situations where the same code with the same data produces slightly different results all the time because of changes in compiler, compiler settings, cpu arch, etc… They just don’t notice.
But, then they run into a situation where it matters and it’s a huge PITA to nail down.
Yes, this is exactly it. The gold standard is bit-for-bit matching when shifting code between systems. If you move your code to a new system and get different results --> is it due to a different summing order, or something else wrong? By ruling out the former you're more likely to get the bit-for-bit match and thus avoid a lot of debugging effort -- or spend it solving a real bug.
This is one of many reasons why we generally don't use the x87 registers anymore.
Surprisingly there are different algorithms for doing something as simple as summing up a list of numbers.
The naïve way of adding numbers one-by-one in a loop is an obvious way, but there are more sophisticated methods that give better bounds of the total accumulated error; Kahan summation[1] being one of the better-known ones.
Like most things, it can get complicated depending on the specific context. For streaming data, adding numbers one at a time may be the only option. But what if one could use a fixed-size buffer of N numbers? When a new number arrives, what should be done? Take a partial sum of some subset of the N numbers in the buffer and add that to a cumulative total? If a subset is chosen, how? Are there provable (improved) error bounds for the subset selection method?
Fun stuff.
[1] https://en.wikipedia.org/wiki/Kahan_summation_algorithm
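To make [1] concrete, a minimal Kahan summation sketch (the function name is mine; note that -ffast-math-style reassociation would optimize the compensation term away):

```c
#include <stddef.h>

/* Kahan compensated summation: carry the low-order bits lost in each
   addition in a separate compensation term and feed them back in. */
double kahan_sum(const double *x, size_t n) {
    double sum = 0.0;
    double c = 0.0;                 /* running compensation for lost bits */
    for (size_t i = 0; i < n; i++) {
        double y = x[i] - c;        /* apply the correction first */
        double t = sum + y;         /* low bits of y are lost here... */
        c = (t - sum) - y;          /* ...and recovered algebraically */
        sum = t;
    }
    return sum;
}
```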
As far as I know, xsum [1] more or less solves the problem completely: order invariant and exact, at less than 2x the cost of the naive summation. Further speeding it up may be the only avenue left for improvement.
[1] https://gitlab.com/radfordneal/xsum
I was not aware of this (2015) work -- very nice!
A couple of pull-quotes from the paper to summarize:
Much work has been done on trying to improve the accuracy of summation. Some methods aim to somewhat improve accuracy at little computational cost, but do not guarantee that the result is the correctly rounded exact sum.
Many methods have been developed that instead compute the exact sum of a set of floating-point values, and then correctly round this exact sum to the closest floating-point value. This obviously would be preferable to any non-exact method, if the exact computation could be done sufficiently quickly
Exact summation methods fall into two classes — those implemented using standard floating point arithmetic operations available in hardware on most current processors, such as the methods of Zhu and Hayes (2010), and those that instead perform the summation with integer arithmetic, using a “superaccumulator”.
I present two new methods for exactly summing a set of floating-point numbers, and then correctly rounding to the nearest floating-point number. ... One method uses a “small” superaccumulator with sixty-seven 64-bit chunks, each with 32-bit overlap with the next chunk, allowing carry propagation to be done infrequently. The small superaccumulator is used alone when summing a small number of terms. For big summations, a “large” superaccumulator is used as well. It consists of 4096 64-bit chunks, one for every possible combination of exponent bits and sign bit, plus counts of when each chunk needs to be transferred to the small superaccumulator.
On modern 64-bit processors, exactly summing a large array using this combination of large and small superaccumulators takes less than twice the time of simple, inexact, ordered summation, with a serial implementation
Thanks for the summary. I kinda low-key love the idea of converting floats into a fixed point representation that covers the entire range represented by the float type. I mean the accumulator is only 32 KB, which is likely to be in L1 the entire time on modern hardware, and any given float is only going to need two 64 bit words, + 13 bits (12 bits for offset, and 1 for sign) to be represented in this scheme.
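Something like this captures the flavor of the large superaccumulator as I understand the summary (a rough conceptual sketch, not the actual xsum code; the names are made up, and it ignores chunk overflow, carry propagation into the small superaccumulator, subnormals, NaN/Inf, and the final correctly-rounded conversion back to a double):

```c
#include <stdint.h>
#include <string.h>

/* One 64-bit chunk per (sign, exponent) combination of a double:
   2^11 exponents x 2 signs = 4096 chunks = 32 KB. Each incoming value
   simply has its mantissa added into the chunk selected by its top
   bits, so the accumulation itself is exact integer arithmetic. */
typedef struct {
    uint64_t chunk[4096];
} LargeAcc;

static void large_acc_add(LargeAcc *acc, double v) {
    uint64_t bits;
    memcpy(&bits, &v, sizeof bits);            /* raw IEEE-754 bits */
    unsigned idx = (unsigned)(bits >> 52);     /* sign + 11 exponent bits */
    uint64_t mantissa = (bits & 0xFFFFFFFFFFFFFULL) | (1ULL << 52);
    acc->chunk[idx] += mantissa;               /* exact integer addition */
}
```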
This is a lot of "should" for something that basically doesn't happen. The only information you give is the order of arithmetic in the original expression.
It is a total nightmare if arithmetic is not stable between builds. Rebuilding the software and running it on the same input should not produce different results.
(Long ago I encountered the special Intel version of this: because the FPU used 80-bit registers internally but 64-bit in memory, changing when a register fill/spill happened would change when your answer got rounded, and thereby change the results. You can set a global FPU flag at the start of the program to force rounding on every operation.)
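If I remember right, on glibc/x86 that flag is the precision-control field of the x87 control word. A sketch (exact macros vary by platform; MSVC spells it _controlfp):

```c
/* Sketch: set the x87 precision-control bits so every operation rounds
   the significand to 53-bit (double) precision instead of 64-bit.
   glibc/x86 only. SSE2 scalar math, the default on x86-64, does not
   have this register-spill rounding quirk in the first place. */
#include <fpu_control.h>

void force_double_rounding(void) {
    fpu_control_t cw;
    _FPU_GETCW(cw);
    cw &= ~_FPU_EXTENDED;   /* clear the precision-control field */
    cw |= _FPU_DOUBLE;      /* select 53-bit precision */
    _FPU_SETCW(cw);
}
```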
This doesn’t do quite the same thing. It still uses the wider exponent range of the 80-bit type.
It does happen in languages and libraries with a higher level of abstraction. MKL, for example, will do whatever it wants for accumulations (which practically means that it’ll take advantage of SIMD, because that’s a big reason why people use the library) unless you specifically request otherwise via their “Conditional Numerical Reproducibility” flag.
I think that was the right way to do it. BLAS made the right decision by defining these things in terms of sums and dot products instead of step-by-step instructions.
It will always be possible to write programs that run differently on different hardware or with different optimization levels. If somebody is writing floating-point code and expects exact bit patterns, that is possible, of course: all the rounding behavior is clearly specified. But usually this is an error.