Bonus quirk: there's BSF/BSR, for which the Intel SDM states that on zero input, the destination has an undefined value. (AMD documents that the destination is not modified in that case.) And then there's glibc, which happily relies on the undocumented fact that the destination is also left unmodified on Intel [1]. It took me quite some time to track down the resulting issue in my binary translator. (There's also TZCNT/LZCNT, which are BSF/BSR encoded with an F3 prefix -- a prefix that is silently ignored on older processors that don't support the extension. So the same code behaves differently on different CPUs. At least that part is documented.)
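For anyone curious what relying on that looks like, here's a minimal sketch of the idiom (my own example, not glibc's actual code): preload the destination and let a zero input fall through to the preloaded value.

    /* Sketch of the idiom: relies on BSF leaving the destination unmodified
       for a zero input -- documented by AMD, merely observed on Intel. */
    static inline int ffs_like(int x)
    {
        int r;
        __asm__ ("bsfl %1, %0"
                 : "=r"(r)
                 : "rm"(x), "0"(-1));  /* "0"(-1) preloads the output register with the fallback */
        return r + 1;                  /* 0 for x == 0, otherwise index of lowest set bit + 1 */
    }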
Encoding: People often complain about prefixes, but IMHO that's far from the worst part. It is well known and somewhat well documented. There are worse quirks. For example, the REX/VEX/EVEX.RXB extension bits are ignored when they don't apply (e.g., to MMX registers); except for mask registers (k0-k7), where they trigger #UD -- also fine -- except if the register is encoded in ModRM.rm, in which case the extension bit is ignored again.
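To make that concrete, this is roughly what a decoder ends up having to express (a simplified sketch with made-up names, not taken from any real decoder):

    enum reg_class { REG_GP, REG_XMM, REG_MMX, REG_MASK };

    /* Returns the decoded register number, or -1 for #UD. */
    static int apply_ext_bit(enum reg_class cls, int in_modrm_rm, int ext_bit, int reg)
    {
        if (cls == REG_MMX)
            return reg;                   /* extension bit silently ignored */
        if (cls == REG_MASK) {
            if (ext_bit && !in_modrm_rm)
                return -1;                /* #UD ... */
            return reg;                   /* ... unless encoded in ModRM.rm: ignored again */
        }
        return (ext_bit << 3) | reg;      /* GP/XMM: the bit really extends the register number */
    }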
APX takes the number of quirks to a different level: the REX2 prefix can encode general-purpose registers r16-r31, but not xmm16-xmm31; the EVEX prefix has several opcode-dependent layouts; and which extension bits are used for a register depends on the register type (XMM registers use X3:B3:rm and V4:X3:idx; GP registers use B4:B3:rm and X4:X3:idx). I can't give a complete list yet; after a year, I still haven't finished my APX decoder...
Can you imagine having to make all this logic work faithfully, let alone fast, in silicon?
X86 used to be Intel's moat, but what a nightmarish burden to carry.
Did people just... do this by hand (in software), transistor by transistor, or was it laid out programmatically in some sense? As in, were segments created algorithmically, then repeated to obtain the desired outcome? CPU design baffles me, especially considering there are 134 BILLION transistors or so in the latest i7 CPU. How does the team even keep track of, work on, or even load the files to WORK on the CPUs?
They use EDA (Electronic Design Automation) software; there are only a handful of vendors, one of the largest being Mentor Graphics, now owned by Siemens. So yes, they use automation to algorithmically build and to track/resolve refactors as they design CPUs. CPUs are /generally/ block-type designs these days, so particular functions get repeated identically in different places and can be somewhat abstracted away in your EDA tool.
It's still enormously complex, and way more complex than the last time I touched this stuff more than 15 years ago.
I love that the EDA industry still uses Tcl heavily. Warms my heart.
It's written in an HDL; IIRC both Intel and AMD use Verilog. A modern core is on the order of a million lines of Verilog.
Some of that will be hand-placed; quite a bit will just be thrown at the synthesizer. Other parts, like SRAM blocks, will have their CAD generated directly from a macro and a description of the block in question.
To expand on this: ASIC design (as for AMD's CPUs) is a lot like software work. The engineers who create most of the digital logic aren't dealing with individual transistors; instead, they say "give me an accumulator for this section of code" and the HDL provides it. The definition of that module lives elsewhere and is shared throughout the system.
This is how the complexity can be wrangled.
Now, MOST of the work is automated for digital logic. However, we live in an analog world, so (as far as I'm aware) there is still quite a bit of work for analog engineers to bend analog reality into digital. In the real world, changing currents create magnetic fields, which means you need rules limiting voltages and defining how close signal lines can be to avoid crosstalk. Square waves are hard to come by, so there's effort in timing and voltage bands to make sure you aren't registering a "1" when it should have been a "0".
Several of my professors were Intel engineers. From what they told me, the employment ratio was something like 100 digital engineers to 10 analog engineers to 1 physicist/materials engineer.
You might enjoy Ken’s latest article on some of this stuff, posted just the other day: https://news.ycombinator.com/item?id=40899393
It was entirely laid out by hand until the 286. Using standard cells in the 386 enabled the switch from microcode to a mostly hardwired core.
A fun thing is that e.g. "cmp ax, 0x4231" differs from "cmp eax, 0x87654321" only in the presence of the data16 prefix and, consequently, the length of the immediate (66 3D 31 42 vs. 3D 21 43 65 87). It's the only significant case (I think?) of a prefix changing the total instruction size, and as a result, for some such instructions, the 16-bit version is sometimes (but not always!) significantly slower. "But not always" as in: if you microbenchmark a loop of these, sometimes you get entire microseconds of it consistently running at 0.25 cycles/instr on average, and sometimes the exact same code (in the same process!) measures at 3 cycles/instr (tested on Haswell, but uops.info indicates this happens on all non-Atom Intel since Ivy Bridge).
The behavior/penalty of 66/67 length-changing prefixes is documented in the Intel Optimization Reference Manual (3.4.2.3):
Some interesting quotes around there:
So that's the order of funkiness to be expected, fun.
The "true" LCP stall for the F7 opcodes would be "test r16, imm16", but because the opcode information is split across the initial byte and ModR/M, the other F7 forms suffer too.
Probably: if the uops come from the uop cache, you get the fast speed, since the prefix and any decoding stalls have no impact in that case (that mess is effectively erased in the uop cache); but if the loop needs to be decoded, you get a stall due to the length-changing prefix.
Whether a bit of code comes from the uop cache is highly dependent on alignment, surrounding instructions, the specific microarchitecture, the microcode version, and even more esoteric things like how many incoming jumps target the nearby region of code (and in which order they were observed by the cache).
Yep, a lot of potential contributors. Though my test was a single plain 8x-unrolled loop doing nothing else, running for tens of thousands of iterations to take a total of ~0.1ms, i.e. it should trivially stay in the cache -- and yet there's consistent inconsistency.
Did some 'perf stat'-ing, comparing the same test with "cmp eax,1000" vs. "cmp ax,1000": per instruction, idq.mite_uops goes 0.04% → 35%, and lsd.uops goes 90% → 54%. So presumably the loop sometimes somehow makes it into the LSD, at which point dropping out of it is hard, while other times it perpetually gets stuck on MITE? (The test is 10 instructions -- 8 copies of the cmp, plus dec+jne, which would get fused -- so 90% uops/instr makes sense.)
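Roughly the kind of loop in question (a simplified sketch, not the exact harness):

    /* 8 copies of the 66-prefixed compare plus dec+jne; swap "cmpw $1000, %ax"
       for "cmpl $1000, %eax" to get the prefix-free variant. */
    static void lcp_loop(unsigned iters)
    {
        __asm__ volatile (
            "1:\n\t"
            "cmpw $1000, %%ax\n\t"
            "cmpw $1000, %%ax\n\t"
            "cmpw $1000, %%ax\n\t"
            "cmpw $1000, %%ax\n\t"
            "cmpw $1000, %%ax\n\t"
            "cmpw $1000, %%ax\n\t"
            "cmpw $1000, %%ax\n\t"
            "cmpw $1000, %%ax\n\t"
            "decl %0\n\t"
            "jne 1b"
            : "+r"(iters)
            : "a"(0)        /* AX contents don't matter, just make them defined */
            : "cc");
    }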
A lot of this is done in software (microcode). But even in that case, your statement still holds: "Can you imagine having to make all this logic work faithfully, let alone fast, in the chip itself?" Writing that microcode must be fiendishly difficult given all the functional units, out-of-order execution, register renaming...
The crazy parts that were mentioned in the parent comment are all part of the hot path. Microcode handles slow paths related to paging and segmentation, and very rare instructions. Not necessarily unimportant (many common privileged instructions are microcoded) but still rare compared to the usual ALU instructions.
But it's not a huge deal to implement the quirky encoding in an HDL; it's just a waste of transistors. The really complicated part is the sequencing of micro-operations and how they enter the (out-of-order) execution unit.
No, that hasn't been the case for more than 30 years. Microcode is only used to implement some complex instructions (mostly system instructions). Most regular instructions (and the rest of the core) don't use microcode; their expansion into uOps is hardwired. The entire execution unit is hardwired as well.
There are typically some undocumented registers (MSRs on x86) that can control how the core behaves (e.g., kill switches for certain optimizations). These can then be changed by microcode updates.
It seems that the complexity of state management in a modern superscalar CPU is orders of magnitude more complex than even this.
It's a lot easier to more or less accidentally create something quirky than it is to create a second quirk-for-quirk compatible system.
On and off over the last year I have been rewriting QEMU's x86 decoder. It started as a necessary task to incorporate AVX support, but I am now at a point where only a handful of opcodes are left to rewrite, after which it should not be too hard to add APX support. For EVEX my plan is to keep the raw bits until the opcode has been read and the EVEX class identified (i.e., still before the immediates and possibly before the ModRM byte).
My decoder is mostly based on the tables in the manual, and the code is mostly okay—not too much indentation and phases mostly easy to separate/identify. Because the output is JITted code, it's ok to not be super efficient and keep the code readable; it's not where most of the time is spent. Nevertheless there are several cases in which the manual is wrong or doesn't say the whole story. And the tables haven't been updated for several years (no K register instructions, for example), so going forward there will be more manual work to do. :(
The top comment explains a bit what's going on: https://github.com/qemu/qemu/blob/59084feb256c617063e0dbe7e6...
(As I said above, there are still a few instructions handled by the old code predating the rewrite, notably BT/BTS/BTR/BTC. I have written the code but not merged it yet).
Thanks for the pointer to QEMU's decoder! I actually never looked at it before.
So you coded all the tables manually in C -- interesting, that's quite some effort. I opted to autogenerate the tables (and keep them as data only => smaller memory footprint) [1,2]. That's doable because x86 encodings are mostly fairly consistent. I can also generate an encoder from them (ok, you don't need that). Re the custom size "xh": AVX-512 also has fourth and eighth sizes. Also interesting that you have a separate row for "66+F2"; I special-case those two instructions (CRC32, MOVBE) with a flag.
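Just to illustrate the data-only idea (this is not the actual table layout of either decoder, only the general shape):

    #include <stdint.h>

    /* A few bytes per opcode, indexed by (map, opcode), with behaviour
       expressed as small enums instead of code. */
    struct instr_desc {
        uint16_t mnemonic;     /* index into a separate mnemonic table */
        uint8_t  operands[4];  /* encoded operand kind + size per operand */
        uint8_t  flags;        /* operand-size rules, prefix special cases, ... */
    };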
I think the prefix decoding is not quite right for x86-64: 26/2e/36/3e are ignored in 64-bit mode, except for 2e/3e as branch-not-taken/taken hints and 3e as notrack. (See SDM Vol. 1, 3.3.7.1: "Other segment override prefixes (CS, DS, ES, and SS) are ignored.") Also, REX prefixes that don't immediately precede the opcode (or the VEX/EVEX prefix) are ignored. Anyhow, I need to take a closer look at the decoder when I have more time. :-)
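Roughly the REX rule I mean, as a sketch (simplified, 64-bit mode only, not QEMU's code):

    #include <stddef.h>
    #include <stdint.h>
    #include <stdbool.h>

    static bool is_legacy_prefix(uint8_t b)
    {
        switch (b) {
        case 0x26: case 0x2e: case 0x36: case 0x3e:   /* ignored as segments in 64-bit mode */
        case 0x64: case 0x65:                         /* FS/GS overrides, these do apply */
        case 0x66: case 0x67:
        case 0xf0: case 0xf2: case 0xf3:
            return true;
        default:
            return false;
        }
    }

    /* Return the REX byte that actually applies, i.e. only one that
       immediately precedes the opcode. */
    static uint8_t effective_rex(const uint8_t *code, size_t *consumed)
    {
        uint8_t rex = 0;
        size_t i = 0;
        for (;;) {
            uint8_t b = code[i];
            if ((b & 0xf0) == 0x40) {
                rex = b;        /* candidate REX ... */
            } else if (is_legacy_prefix(b)) {
                rex = 0;        /* ... dropped if any other prefix follows it */
            } else {
                break;          /* opcode (or VEX/EVEX) byte */
            }
            i++;
        }
        *consumed = i;
        return rex;
    }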
I came to the same conclusion that this is necessary with APX. The map+prefix+opcode combination identifies how the other fields are to be interpreted. For AVX-512, storing the last byte was sufficient, but with APX, vvvv got a second meaning.
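As a sketch of what "keep the raw bits" means in practice (a hypothetical struct, neither QEMU's nor fadec's):

    #include <stdint.h>

    /* Stash the raw EVEX payload and only interpret it once
       map + prefix + opcode are known. */
    struct evex_raw {
        uint8_t p0, p1, p2;   /* the three payload bytes following 0x62 */
        uint8_t opcode;
    };

    /* vvvv lives in bits 6:3 of p1, stored inverted; whether it names a vector
       register, a GPR (APX), or must be all-ones depends on the opcode. */
    static inline unsigned evex_vvvv(const struct evex_raw *e)
    {
        return (~(unsigned)e->p1 >> 3) & 0xf;
    }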
Yes... especially for corner cases, getting real hardware is the only reliable way to find out how the CPU behaves.
[1]: https://github.com/aengelke/fadec/blob/master/instrs.txt [2]: https://github.com/aengelke/fadec/blob/master/decode.c
FWIW here's all 700 lines of Blink's x86 decoder. https://github.com/jart/blink/blob/master/blink/x86.c
I don't want to be the person that has to add an instruction to blink...
It looks like someone started with Intel's XED code (which relies on custom tables to specify instructions, and compiles that to C tables at compile time) and hand-minimized the code into a single file. I'm guessing it's designed to never have any more code added to it.
I minimized it by hand to make it more maintainable, to me at least.
Yeah that's only for 0F38F0 to 0F38FF.
Also AVX for VPMOVSX and VPMOVZX but those are handled differently. I probably should check if xh is actually redundant... EDIT: it's only needed for VCVTPS2PH, which is the only instruction with a half-sized destination.
Interesting, I need to check how they interact with the FS/GS prefixes (64/65).
Oh, didn't know that!
For memory operations, they are ignored: 64-2e-65-3e gives 65 as segment override. (From my memory and the resulting implementation, I did some tests with hardware a few years back.)
I do need to check myself how 2e/3e on branches interact with other segment overrides, though.
The semantics of LZCNT combined with its encoding feels like an own goal: it's encoded as a BSR instruction with a legacy-ignored prefix, but for nonzero inputs its result is the operand size minus one, minus the result of the legacy version (for a 32-bit operand, lzcnt(x) == 31 - bsr(x)). Yes, clz() is a function that exists, but the extra subtraction in its implementation feels like a small cost to pay for extra compatibility when LZCNT could've just been BSR with different zero-input semantics.
Yes, it's like someone looked at TZCNT and thought "let's encode LZCNT the same way", but it makes no sense.
LZCNT and TZCNT are corrections (originally introduced by AMD) for the serious mistake made by the designers of the Intel 80386 when they defined BSF and BSR.
Because on the very slow 80386 the wrong definition for a zero input did not matter much, they failed to foresee how bad it would become for future pipelined and superscalar CPUs, where having to insert a test for a zero input can slow a program down many times over.
Nevertheless, they should have paid more attention to earlier uses of such instructions. For instance, the Cray-1 had defined its leading-zero count the right way almost ten years earlier.
LZCNT was introduced by Intel in BMI1, while TZCNT was introduced by AMD.
I'm not following: as long as you're introducing a new, incompatible instruction for leading-zero counting, you'd definitely choose LZCNT semantics over BSR semantics -- in retrospect, LZCNT has clearly won over BSR as the primitive for this use case. BSR is just a historical anomaly with a zero-input problem and no benefit.
What would be the point of offering a new variation of BSR with different zero-input semantics?
When it comes to TZCNT vs. BSF, they are just compatible enough for a new compiler to use unconditionally (if we assume that BSF with a zero input leaves its output register unchanged, as it has for decades, and as documented by AMD, who defined LZCNT) -- consider an instruction sequence like the one sketched below.
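Something along these lines, sketched in GNU C inline asm (assuming a 32-bit operand; "rep; bsf" is the raw F3-prefixed encoding that TZCNT-capable CPUs decode as TZCNT):

    /* Preload ECX with 32, then issue the F3-prefixed BSF. Old CPUs execute
       plain BSF and leave ECX untouched for a zero input; TZCNT-capable CPUs
       write 32 themselves and need no such dependency. */
    static inline unsigned tzcnt_compat(unsigned x)
    {
        unsigned r;
        __asm__ ("rep; bsfl %1, %0"
                 : "=c"(r)             /* result in ECX, preloaded via the "0"(32) below */
                 : "rm"(x), "0"(32));
        return r;
    }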
Such a sequence behaves identically on everything from the original 80386 up, and is better on superscalars with TZCNT support because it avoids the false dependency on ECX. The reason is that BSF with a nonzero input and TZCNT with a nonzero input produce exactly the same output. That's emphatically not true of BSR and LZCNT, so we're stuck relegating LZCNT to compiler flags.
TZCNT and BSF are not completely identical even for non-zero input: BSF sets ZF when the input is zero, whereas TZCNT sets ZF when the output is zero (i.e., when the least significant bit of the input is set).
Another bonus quirk, from the 486 and Pentium era...
BSWAP EAX converts from little endian to big endian and vice versa. It was a 32-bit instruction to begin with.
However, there's the 0x66 prefix, which switches between 16- and 32-bit operand size. If you apply it to BSWAP EAX, undefined, funky things happen.
On some CPUs (Intel vs. AMD) the prefix was simply ignored. On others it did something I call an "inner swap": e.g., of the four bytes stored in EAX, bytes 1 and 2 get swapped.
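If you want to poke at this yourself, something like the following works (a sketch; .byte is needed because assemblers refuse to emit a 16-bit BSWAP directly):

    #include <stdint.h>

    /* Executes 66 0F C8, i.e. BSWAP of the A register with an operand-size
       prefix ("bswap ax"), and returns whatever the CPU leaves in EAX. */
    static inline uint32_t bswap_ax_probe(uint32_t x)
    {
        __asm__ (".byte 0x66, 0x0f, 0xc8" : "+a"(x));
        return x;
    }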
Also known as "bswap ax", and research shows that it does something surprising but consistent on almost all hardware: it zeros the register.
https://www.ragestorm.net/blogs/?p=141
https://gynvael.coldwind.pl/?id=268
However, this page, now gone, suggests that some CPUs (early 486s?) did something different: http://web.archive.org/web/20071231192014/http://www.df.lth....
Unfortunately I have not found any evidence, nor any reason to believe, that the "inner swap" behaviour you mention exists on any CPU --- except perhaps in some flawed emulators?