Did John not write the code? Or did he copy paste it from somewhere?
If your computer was built after 1999, it probably supports the SSE instruction set. It contains the _mm_rsqrt_ps instruction, which is faster and will give you four reciprocal square roots at once: https://www.intel.com/content/www/us/en/docs/intrinsics-guid...
That being said, the techniques discussed here are not totally irrelevant (yet). There still exists some hardware with fast instructions for float/int conversion, but lacking rsqrt, sqrt, pow, log instructions, which can all be approximated with this nice trick.
Amusingly, (to me at least) there's also an SSE instruction for non-reciprocal square roots but it's so much slower than reciprocal square root that calculating sqrt(x) as x * 1/sqrt(x) is faster assuming you can tolerate the somewhat reduced precision.
I wouldn't be surprised if _mm_rsqrt_ps is actually implemented using the same bit level trick.
Same as Carmack's, we did a single step of Newton's method and it was definitely good enough.
Intel provides bit-accurate code. In older chips it used a faithful bipartite ROM:
https://web.archive.org/web/20120124193536id_/http://www.acs...
I don't recall the coprocessor having either reciprocal or reciprocal square root? I didn't do much Intel until later in my career though, so I might be missing something though.
Both _mm_rcp_ps (rcpps) and _mm_rsqrt_ps (rsqrtps) are only good for about half the bits.
I dunno about Intel and AMD, but ARM and RISC-V use lookup tables for rsqrt. Unlike AMD and Intel, those tables are precisely defined in their respective specs.
the vast, vast majority of computers do not support the sse instruction set or indeed any of the i386 and amd64 instruction set at all, and the fraction of computers that do support them (other than through emulation) is continually shrinking
gpu instruction sets, arm, risc-v, avr, pic, 8051, fpga... of course, in many cases, these do have a built-in approximate reciprocal square root operation, but probably implemented with this algorithm
In terms of CPUs where you would use an inverse square root (mainly games or scientific), x86 still reigns supreme.
gpus reign supreme for games and scientific, and none of them are x86; larrabee failed. and you'd be surprised at the 3-d kinematics and control systems that people do with microcontrollers
The SSE float reciprocal instructions have slightly different results between Intel and AMD, which can be a source of headaches for those expecting deterministic results between PCs. (see https://robert.ocallahan.org/2021/09/rr-trace-portability-di...)
Something interesting is I've seen this mythologized as evidenced of the "cracked engineer" theory, wherein random engineers will accidentally stumble upon incredibly complex discoveries in the course of their day to day work and expect no fanfare for this, besting the scientists and researchers who take the traditional route but aren't spurred by necessity. People, like xkcd, purported this was either Carmack or some random employee who figured it out in a few hours when stuck on a problem then waltzed onto the next one. In truth, as noted in the document, it has a history that goes back decades in academia, this was simply the most notable time it was implemented.
I think it speaks to an issue I see in the software engineering world where people assume that collaboration is for low-IQ people and all great innovation comes from some super-genius working on their own for long enough, and isn't required to share how they arrived at their findings. I'm sure the mythology of this algorithm is propelled somewhat by the enigmatic character of writing "what the fuck" next to adding the constant, implying a mystical element to its utility that was arrived at without needing to clarify it in some long boring research paper.
where people assume that collaboration is for low-IQ people
I've also seen a phenomenon where developers believe their code is so valuable and amazing that they don't share it; it's really ordinary code but useful to novices nonetheless. Some communities I participated in when I was younger (e.g. PS3 jailbreak scene) were very protective of their code for no reason. Executables were obfuscated, nothing was open source, and developers were very hostile or intentionally trolled you when asking questions about their software.
I've met some people like this before, and I think this phenomenon has a lot to do with having one's identity closely bound to the code. Any criticism of the code is interpreted as a personal attack. Conversely, the attitude of "I'm so awesome therefore my code must be super important and really important to protect".
Yeah I've seen this a lot too.
I've also seen the reverse - perhaps more. People need advice but don't want to share their code because they are embarrassed by it. They wrote it thinking "no one will ever see this".
By contrast I sell code, so I know it'll be seen by lots of people, so I tend to spend time making sure style is consistent, things are well named, and so on. But equally, to me, it's just code. Feel free to comment on it - there's always room for improvement.
That was all the console modding scenes.
Toxic, necessarily so.
I think the xkcd comic was more about academia vs product engineering in for profit entities. If anything, it is the collaborative effort of many people working on a product with deadlines that ends up with some of them coming up with really interesting solutions just due to the number of people looking. The lack of sharing between companies does increase the incidence of the outsider effect, at the massive cost of reinventing basic knowledge over and over. The only reason breakthroughs still happen despite that is just the large number of people trying random stuff that probably won't work motivated by unrealistic deadlines. They don't know it probably won't work though...
https://xkcd.com/664/ for the curious.
The point seems more to me that in engineering, breakthroughs worthy of academic repute aren't given much fanfare, though probably a bit of an exaggeration. It's hard to tell to what extent his comics reflect actual sentiments or rather an artifice of some abstraction taken to an absurd level.
Relevant topic: https://news.ycombinator.com/item?id=40485313
But that is what happened [1]. The article, which is probably the most definitive investigation to date, has Greg Walsh claim they invented the fast inverse square root while working at Ardent, which Wikipedia claims was founded in 1985 [2], to meet product performance benchmarks.
The Green Hills Software C compiler had a fast square root algorithm of similar form:
(x >> 1) + {magic constant}
dating between 1983-1985 if I recall correctly. Also implemented to maximize floating point performance benchmarks, and again not drawn from academia. If my recollection is correct, that is one of the earliest known examples of the general technique and predates even the official IEEE 754 floating point specification which was not formally ratified until 1985 (but the standard was in development since 1977 and already de facto adopted by the time it was formally ratified).[1] https://www.beyond3d.com/content/articles/15/
[2] https://en.wikipedia.org/wiki/Stardent_Inc.#Ardent_Computer_...
He posts c code and says that it computes the inverse square root, but the code he posted is undefined c, so while it could compute the inverse square root of it's input it could also do a long list of other stuff.
One of the many reasons you should stay away from real world languages when talking about algorithms unless you are an expert in the language.
If we're talking real-world languages, the code is perfectly well-defined. It's undefined behaviour in Abstract C, the C of the Specification, which to my knowledge nobody has ever implemented, but Quake 3 was not written in Abstract C. It was written in Visual C++ 2003, gcc 2.95, and some other specific C implementation.
The code snippet that the article claims is C:
int32_t compute_magic(void) {
double sigma = 0.0450465;
double expression = 1.5 * pow(2.0, 23.0) * (127.0 - sigma);
int32_t i = expression;
return i;
}
which, as far as I can tell, is perfectly well-defined Abstract C.Specifically, I think the original code will always do what you expect as long as sizeof(long) == sizeof(float), and the alignment is the same. In reality no compiler is gonna do weird stuff unless your hardware target is weird.
compiler engineers have been disappointing our 'surely no compiler would ever be so perverse as to do x' expectations for 25 years now, and it doesn't seem that they're likely to stop soon
Examples of such a weird hardware target include Linux and MacOS, which use LP64: https://en.wikipedia.org/wiki/64-bit_computing#64-bit_data_m...
If you use a union, your code won't read or write out of bounds, since it will save space for the larger type. Ideally you would also specify _Float32_t and int32_t for some pseudoportability (at least before endianness gets involved), although I'm aware this was not available in 1999.
Minor correction: Q3 came out in 99, so it was probably written in either VC++ 97 or VC6.
Otherwise, carry on.
The code he posted is straight from the Quake 3 source release, a program which has certainly been installed and used on hundreds on millions of real computers. UB pedantry makes no difference to the actual behavior of the program as demonstrated.
It’s a historical artifact being presented in its original form. There’s no reason this code can’t be updated with a more modern bitwise int-float conversion. Do that if you want to use it.
Whether it is bad practice or not, people used to do this all the time, and so the compilers all tend to support type conversion via pointer and type punning via union. A lot of critical code in the real world would break if it suddenly didn’t work.
C++ bit-cast is brand new (C++20) and until now the suggestion in C++ was to use memcpy, and hope and pray it gets elided. That might be correct but boy is it ugly and gross.
I collected a few of these here:
https://github.com/ncruces/fastmath/blob/main/fast.go
Also see this StackOverflow:
https://stackoverflow.com/questions/32042673/optimized-low-a...
That's great, thank you. I was just thinking about starting a collection of these to begin rewriting my old 3D engine from the late 80s.
Please, post the progress if you end up rewriting it.
I'd love to see some benchmarks on your fastmath package
It's… probably not worth it.
I just wrote this years ago (go.mod said 1.12) for the fun of it, thought I had it in a Gist/GitHub, and uploaded it yesterday in response to this post.
One thing I remember trying was coding this up in ASM… which makes it worse, because it prevents inlining. But I learned the Go ASM syntax that way.
Very interesting, thank you for posting it.
Time for nitpicks, sorry.
The formulas for the floats have typos. They should read (-1)^S, not -1^S (which always equals -1).
Interpreting the raw bit patterns isn't a piecewise linear approximation of the logarithm. The lines between the data points on the blue graph don't actually exist, it's not possible for a bit to be half set to 1. It's rather a discrete version of the logarithm: the only data points that exist - where the red and blue lines meet - are literally equal to the (scaled, shifted) logarithm.
Other than that, nice post!
I'm not sure I understand. Imagine a very small 6-bit float, with 1 bit sign, 2 bits exponent, and 3 bits mantissa. The interval [010000, 010111] contains the numbers 2, 2.25, 2.5, 2.75, 3, 3.25, 3.5, 3.75.
But! These numbers' base two logarithms imply mantissae of
- .0000000 (equal to the float 000)
- .0010101 (not equal to the float 001)
- .0101001 (not equal to the float 010)
- .0111010 (not equal to the float 011)
- .1001010 (not equal to the float 100)
- .1011001 (not equal to the float 101)
- .1100111 (not equal to the float 110)
- .1110100 (not equal to the float 111)
because the floats in the [2,4) interval are linearly spaced, whereas the corresponding logarithms are not. In other words, the floats are a piecewise linear approximation of the logarithm – just as the article says.
You're right about the first part, of course. But don't you just need to truncate?
In any case, my point was more that it's not "piecewise linear". Piecewise linear means that the map is defined on some interval; it's not, it's defined on a discrete set of values. To take your example, the map isn't defined on e.g., 2.3. You can decide to interpolate linearly between the value at 2.25 and the value at 2.5, but that's a decision you make which isn't reflected in the code.
Or said differently: do you consider that any map defined on a discrete subset of R is really a piecewise linear map?
If your point is that it's more of an arithmetic sequence than a line segment, thanks to discretisation, then I agree but I don't have a good word for it other than "discrete piecewise linear".
I mean it's not continuous, but it's still on an interval scale.
It's not a continuous piecewise linear approximation of the logarithm, it's a discrete piecewise linear approximation of the logarithm. You are correct that the blue lines aren't continuous, but incorrect about their interpretation - there are 256 individual points evenly spaced along the that x axis that make up the blue graph, not just a handful at the intersections. (A full graph would have 2^32 options in a piecewise linear pattern, but that isn't what the OP has graphed.)
Given that the article is talking about operations on 32 bit integers and IEEE-754 32-bit floats, I personally think it's fine to leave out "discrete" in the description.
Here’s something new that isn’t mentioned and might be worth adding to the repo. (Edit I’m wrong, it is there, I just didn’t recognize it.) The magic number in the famous code snippet is not the optimal constant. You can do maybe 0.5% better relative error using a different constant. Maybe at the time it was infeasible to search for the absolutely optimal number, but now it’s relatively easy. I also went down this rabbit hole at some point, so I have a Jupyter notebook for finding the optimal magic number for this (1/x^2) and also for (1/x). Anyone wanna know what the optimal magic number is?
Anyone wanna know what the optimal magic number is?
Sure.
For no Newton iterations, I thought I found 0x5f37641f had the lowest relative error, and I measure it at 3.4211. Of course I’m not super certain, Lomont’s effort is way more complete than mine. Lomont’s paper mentions 0x5f37642f with a relative error of 3.42128 and Wikipedia and the paper both talk about 0x5f375a86 being best when using Newton iterations.
There’s a link at the bottom to a paper exploring that question.
Good point, I hadn’t read Lomont’s paper and should have. I read the section in the Wikipedia article talking about it, and did try the constant that it suggests, however it depends on doing extra Newton iterations and I looked at the relative error of the initial guess without Newton. I can see in the paper he found something within 1 bit of what I found. I’m not certain mine’s better, but my python script claims it is.
I thought the most interesting bit in this article was the link to "How Java's Floating-Point Hurts Everyone Everywhere":
https://people.eecs.berkeley.edu/~wkahan/JAVAhurt.pdf
Written by William Kahan, also known as the Old Man of Floating-Point:
https://news.ycombinator.com/item?id=29042853 - An Interview with the Old Man of Floating-Point (1998)
Totally not-related to the topic, but I started reading that JAVAhurt PDF and find the typesetting god-awful. Just me? Feels like they imported some TeX package to space out words on a line far, far too far... and inconsistently. Like it was OCR'd from something else and put in extra spaces. Even the monospace sections have weird extra spacing.
Anyway, enough ranting for the day, but I found this really hard to focus on to read - almost felt like a science-kook type manifesto, though I know it is not.
(post edit time edit...)
I looked at some of his other writings and they all suffer this. I'm not sure what TeX template he's using, but it's bizarre. (IMO, of course.)
Noob here. Please ELI 5 what this algo used in quake 3 ?.
Solve lighting equations
Is it the light or laser effect when the gun is fired ?
When you render a pixel on screen (of a diffuse object), to determine how brightly it's lit, you need to compute the angle between the normal at the point on the surface and the light source. Computing this angle requires normalizing one or more vectors, which is done by dividing by the magnitude, or the square root of the sum of the squares of the components.
So the lights in this case are any lights in the scene, typically represented as a point of origin and a brightness.
To nitpick the article a bit:
It's important to note that this algorithm is very much of its time. Back when Quake 3 was released in 1999, computing an inverse square root was a slow, expensive process. The game had to compute hundreds or thousands of them per second in order to solve lighting equations, and other 3D vector calculations that rely on normalization. These days, on modern hardware, not only would a calculation like this not take place on the CPU, even if it did, it would be fast due to much more advanced dedicated floating point hardware.
Calculations like this definitely take place on the CPU all the time. It's a common misconception that games and other FLOP-heavy apps want to offload all floating-point operations to the GPU. In fact, it really only makes sense to offload large uniform workloads to the GPU. If you're doing one-off vector normalization--say, as part of the rotation matrix construction needed to make one object face another--then you're going to want to stay on the CPU, because the CPU is faster at that. In fact, the CPU would remain faster at single floating point operations even if you didn't take the GPU transfer time into account--the GPU typically runs at a slower clock rate and relies on parallelism to achieve its high FLOP count.
I think he refers to FPU, not GPU.
In the old days, FPU do async computation.
FPU is now a considered an integrated part of CPU.
I tried this in a raytracter today while learning rust and it made the scene render all white. Lol
This actually made me chuckle, although it took an embarrassingly long second.
I wrote a implementation in MMIX:
% Constants
FISRCON GREG #5FE6EB50C7B537A9
THREHAF GREG #3FF8000000000000
% Save half of the original number
OR $2,$0,0
INCH $2,#FFF0
% Bit level hacking
SRU $1,$0,1
SUBU $0,FISRCON,$1
% First iteration
FMUL $1,$2,$0
FMUL $1,$1,$0
FSUB $1,THREHAF,$1
FMUL $0,$0,$1
% Second iteration
FMUL $1,$2,$0
FMUL $1,$1,$0
FSUB $1,THREHAF,$1
FMUL $0,$0,$1
This implementation makes an assumption that the original number is greater than 2^-1021.This is a good post explaining a lot of interesting concepts, with a section that has surprisingly bad algebra.
The exact steps to go from the first form to this one are numerous to say the least, but I've included it in full for completeness.
The algebra included after that line has many unnecessary steps, as well as multiple cancelling sign errors. Most notably the second line to the third line does not distribute the negative sign correctly. Picking up after the second line, I would simply have:
y_n+1 = y_n + (1 - x * y_n^2) / y_n^2 * (y_n^3 / 2)
y_n+1 = y_n + (1 - x * y_n^2) * (y_n / 2)
y_n+1 = y_n + y_n / 2 - x * y_n^3/2
y_n+1 = 3/2 * y_n - x * y_n^3/2
y_n+1 = y_n (3/2 * y_n - x * y_n^2 / 2)
y_n+1 = y_n (1.5 * y_n - 0.5 * x * y_n * y_n)
This is fewer than 1/3 of the included steps (between the second line and the end), and has the advantage of being correct during the intermediate steps. I don't think any of my steps are anything other than self-evident to someone who understands algebra, but I'd be happy to entertain competing perspectives.
I really liked this video about it when I saw it a while ago: https://www.youtube.com/watch?v=p8u_k2LIZyo
If you're interested, Wikipedia also has a decent discussion of this function and it's history. https://en.wikipedia.org/wiki/Fast_inverse_square_root
I was building 3D engines a few years before the days of Quake, and having recently watched videos about optimizing the trig code in Super Mario 64, and these other faster algorithms, it makes me wonder how much more can be squeezed out of old hardware.
I was optimizing my assembler to the nth degree, but optimizing the algorithm is always going to be the real winner.
Thanks for stealing hours of my productive time going down this rabbit hole.
What’s up with the diagonal lines in the graph backgrounds?
Note that you’re actually using 3 distinct magic numbers - not only the single 0x5f3759df, but also 0.5 and 1.5;
So for further accuracy we can instead have:
conv.i = 0x5F1FFFF9 - ( conv.i >> 1 ); conv.f = 0.703952253f ( 2.38924456f - x * conv.f * conv.f ); return conv.f;
[from Wikipedia]
wikipedia says:
And at 3dfx he probably learned it from some ex-SGI guy, and that guy learned it doing his PhD at Stanford from someone who worked for ed Ed Catmull and so on. The lore goes deep with stuff like this. There's rarely just one author but many who improve the formula over time.
there's an unrecognized genius out there that realized you can just apply a bit flip to get that out there. That John Carmack hasn't claimed it as his invention when he could have, speaks volumes about his character.
I’m not sure anyone should be impressed by someone not stealing even if they had a chance to. That’s just a basic human expectation.
You'd think so.
Except he couldn't have even if he wanted to. There's always the risk some former 3dfx or SGI graybeard comes out with some paper binder with the original implementation, and calls you out on your bullshit if you do. When you're such a famous public figure no way you ever risk lying in public. Even if he were to getaway with it there was nothing for him to gain: he already has all the fame and money from honest work, there's no point risking it all to claim some piece of code.
Then you call it a parallel discovery!
The origin is really unclear. It's not certain which id employee added it to Quake 3 and where they got it from. John claims it wasn't him. Gary Tarolli, one of the founders of 3dfx, claims to remember tweaking the value of the hex constant to the value known today. But he says he didn't come up with the algorithm either. Maybe somebody else at 3dfx did, maybe he got it from somewhere else.
I believe this particular line of code comes from Terje Mathisen, who is still active on the IEEE 754 standards committee today.