Bonus quirk: there's BSF/BSR, for which the Intel SDM states that on zero input, the destination has an undefined value. (AMD documents that the destination is not modified in that case.) And then there's glibc, which happily relies on the undocumented fact that the destination is also left unmodified on Intel [1]. It took me quite some time to track down the resulting issue in my binary translator. (There's also TZCNT/LZCNT, which are BSF/BSR encoded with an F3 prefix -- a prefix that is silently ignored on older processors that don't support the extension. So the same code behaves differently on different CPUs. At least that part is documented.)
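For anyone curious what relying on that looks like, here's a minimal sketch of the idiom (my own example, not glibc's actual code): preload the destination and let a zero input fall through to the preloaded value.

    /* Sketch of the idiom: relies on BSF leaving the destination unmodified
       for a zero input -- documented by AMD, merely observed on Intel. */
    static inline int ffs_like(int x)
    {
        int r;
        __asm__ ("bsfl %1, %0"
                 : "=r"(r)
                 : "rm"(x), "0"(-1));  /* "0"(-1) preloads the output register with the fallback */
        return r + 1;                  /* 0 for x == 0, otherwise index of lowest set bit + 1 */
    }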
Encoding: People often complain about prefixes, but IMHO that's far from the worst part. It is well known and somewhat well documented. There are worse quirks. For example, the REX/VEX/EVEX.RXB extension bits are ignored when they don't apply (e.g., to MMX registers); except for mask registers (k0-k7), where they trigger #UD -- also fine -- except if the register is encoded in ModRM.rm, in which case the extension bit is ignored again.
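To make that concrete, this is roughly what a decoder ends up having to express (a simplified sketch with made-up names, not taken from any real decoder):

    enum reg_class { REG_GP, REG_XMM, REG_MMX, REG_MASK };

    /* Returns the decoded register number, or -1 for #UD. */
    static int apply_ext_bit(enum reg_class cls, int in_modrm_rm, int ext_bit, int reg)
    {
        if (cls == REG_MMX)
            return reg;                   /* extension bit silently ignored */
        if (cls == REG_MASK) {
            if (ext_bit && !in_modrm_rm)
                return -1;                /* #UD ... */
            return reg;                   /* ... unless encoded in ModRM.rm: ignored again */
        }
        return (ext_bit << 3) | reg;      /* GP/XMM: the bit really extends the register number */
    }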
APX takes the number of quirks to a different level: the REX2 prefix can encode general-purpose registers r16-r31, but not xmm16-xmm31; the EVEX prefix has several opcode-dependent layouts; and which extension bits are used for a register depends on the register type (XMM registers use X3:B3:rm and V4:X3:idx; GP registers use B4:B3:rm and X4:X3:idx). I can't give a complete list yet; after a year, I still haven't finished my APX decoder...
Can you imagine having to make all this logic work faithfully, let alone fast, in silicon?
X86 used to be Intel's moat, but what a nightmarish burden to carry.
Did people just... do this by hand (in software), transistor by transistor, or was it laid out programmatically in some sense? As in, were segments created algorithmically, then repeated to obtain the desired outcome? CPU design baffles me, especially considering there are 134 BILLION transistors or so in the latest i7 CPU. How does the team even keep track of, work on, or even load the files to WORK on the CPUs?
They use EDA (Electronic Design Automation) software; there are only a handful of vendors, one of the largest being Mentor Graphics, now owned by Siemens. So yes, they use automation to algorithmically build and to track/resolve refactors as they design CPUs. CPUs are /generally/ block-type designs these days, so particular functions get repeated identically in different places and can be somewhat abstracted away in your EDA tool.
It's still enormously complex, and way more complex than the last time I touched this stuff more than 15 years ago.
I love that the EDA industry still uses Tcl heavily. Warms my heart.
It's written in an HDL; IIRC both Intel and AMD use Verilog. A modern core is on the order of a million lines of Verilog.
Some of that will be hand-placed; quite a bit will just be thrown at the synthesizer. Other parts, like SRAM blocks, will have their CAD generated directly from a macro and a description of the block in question.
To expand on this: ASIC design (as for AMD's CPUs) is a lot like software work. The engineers who create most of the digital logic aren't dealing with individual transistors; instead, they say "give me an accumulator for this section of code" and the HDL provides it. The definition of that module lives elsewhere and is shared throughout the system.
This is how the complexity can be wrangled.
Now, MOST of the work is automated for digital logic. However, we live in an analog world, so (as far as I'm aware) there is still quite a bit of work for analog engineers to bend analog reality into digital. In the real world, changing currents create magnetic fields, which means you need rules limiting voltages and defining how close signal lines can be to avoid crosstalk. Square waves are hard to come by, so there's effort in timing and voltage bands to make sure you aren't registering a "1" when it should have been a "0".
Several of my professors were Intel engineers. From what they told me, the employment ratio was something like 100 digital engineers to 10 analog engineers to 1 physicist/materials engineer.
You might enjoy Ken’s latest article on some of this stuff, posted just the other day: https://news.ycombinator.com/item?id=40899393
It was entirely laid out by hand until the 286. Using standard cells in the 386 enabled the switch from microcode to a mostly hardwired core.
A fun thing is that e.g. "cmp ax, 0x4231" differs from "cmp eax, 0x87654321" only in the presence of the data16 prefix and, consequently, the length of the immediate (66 3D 31 42 vs. 3D 21 43 65 87). It's the only significant case (I think?) of a prefix changing the total instruction size, and as a result, for some such instructions, the 16-bit version is sometimes (but not always!) significantly slower. "But not always" as in: if you microbenchmark a loop of these, sometimes you get entire microseconds of it consistently running at 0.25 cycles/instr on average, and sometimes the exact same code (in the same process!) measures at 3 cycles/instr (tested on Haswell, but uops.info indicates this happens on all non-Atom Intel since Ivy Bridge).
The behavior/penalty of 66/67 length-changing prefixes is documented in the Intel Optimization Reference Manual (3.4.2.3):
Some interesting quotes around there:
So that's the order of funkiness to be expected, fun.
The "true" LCP stall for the F7 opcodes would be "test r16, imm16", but because the opcode information is split across the initial byte and ModR/M, the other F7 forms suffer too.
Probably: if the uops come from the uop cache, you get the fast speed, since the prefix and any decoding stalls have no impact in that case (that mess is effectively erased in the uop cache); but if the loop needs to be decoded, you get a stall due to the length-changing prefix.
Whether a bit of code comes from the uop cache is highly dependent on alignment, surrounding instructions, the specific microarchitecture, the microcode version, and even more esoteric things like how many incoming jumps target the nearby region of code (and in which order they were observed by the cache).
Yep, a lot of potential contributors. Though my test was a single plain 8x-unrolled loop doing nothing else, running for tens of thousands of iterations to take a total of ~0.1ms, i.e. it should trivially stay in the cache -- and yet there's consistent inconsistency.
Did some 'perf stat'-ing, comparing the same test with "cmp eax,1000" vs. "cmp ax,1000": per instruction, idq.mite_uops goes 0.04% → 35%, and lsd.uops goes 90% → 54%. So presumably the loop sometimes somehow makes it into the LSD, at which point dropping out of it is hard, while other times it perpetually gets stuck on MITE? (The test is 10 instructions -- 8 copies of the cmp, plus dec+jne, which would get fused -- so 90% uops/instr makes sense.)
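Roughly the kind of loop in question (a simplified sketch, not the exact harness):

    /* 8 copies of the 66-prefixed compare plus dec+jne; swap "cmpw $1000, %ax"
       for "cmpl $1000, %eax" to get the prefix-free variant. */
    static void lcp_loop(unsigned iters)
    {
        __asm__ volatile (
            "1:\n\t"
            "cmpw $1000, %%ax\n\t"
            "cmpw $1000, %%ax\n\t"
            "cmpw $1000, %%ax\n\t"
            "cmpw $1000, %%ax\n\t"
            "cmpw $1000, %%ax\n\t"
            "cmpw $1000, %%ax\n\t"
            "cmpw $1000, %%ax\n\t"
            "cmpw $1000, %%ax\n\t"
            "decl %0\n\t"
            "jne 1b"
            : "+r"(iters)
            : "a"(0)        /* AX contents don't matter, just make them defined */
            : "cc");
    }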
A lot of this is done in software (microcode). But even in that case, your statement still holds: "Can you imagine having to make all this logic work faithfully, let alone fast, in the chip itself?" Writing that microcode must be fiendishly difficult given all the functional units, out-of-order execution, register renaming...
The crazy parts that were mentioned in the parent comment are all part of the hot path. Microcode handles slow paths related to paging and segmentation, and very rare instructions. Not necessarily unimportant (many common privileged instructions are microcoded) but still rare compared to the usual ALU instructions.
But it's not a huge deal to implement the quirky encoding in an HDL; it's just a waste of transistors. The really complicated part is the sequencing of micro-operations and how they enter the (out-of-order) execution unit.
No, that hasn't been the case for more than 30 years. Microcode is only used to implement some complex instructions (mostly system instructions). Most regular instructions (and the rest of the core) don't use microcode; their expansion into uOps is hardwired. The entire execution unit is hardwired as well.
There are typically some undocumented registers (MSRs on x86) that can control how the core behaves (e.g., kill switches for certain optimizations). These can then be changed by microcode updates.
It seems that the complexity of state management in a modern superscalar CPU is orders of magnitude more complex than even this.
It's a lot easier to more or less accidentally create something quirky than it is to create a second quirk-for-quirk compatible system.
On and off over the last year I have been rewriting QEMU's x86 decoder. It started as a necessary task to incorporate AVX support, but I am now at a point where only a handful of opcodes are left to rewrite, after which it should not be too hard to add APX support. For EVEX my plan is to keep the raw bits until the opcode has been read and the EVEX class identified (i.e., still before the immediates and possibly before the ModRM byte).
My decoder is mostly based on the tables in the manual, and the code is mostly okay—not too much indentation and phases mostly easy to separate/identify. Because the output is JITted code, it's ok to not be super efficient and keep the code readable; it's not where most of the time is spent. Nevertheless there are several cases in which the manual is wrong or doesn't say the whole story. And the tables haven't been updated for several years (no K register instructions, for example), so going forward there will be more manual work to do. :(
The top comment explains a bit what's going on: https://github.com/qemu/qemu/blob/59084feb256c617063e0dbe7e6...
(As I said above, there are still a few instructions handled by the old code predating the rewrite, notably BT/BTS/BTR/BTC. I have written the code but not merged it yet).
Thanks for the pointer to QEMU's decoder! I actually never looked at it before.
So you coded all the tables manually in C -- interesting, that's quite some effort. I opted to autogenerate the tables (and keep them as data only => smaller memory footprint) [1,2]. That's doable because x86 encodings are mostly fairly consistent. I can also generate an encoder from them (ok, you don't need that). Re the custom size "xh": AVX-512 also has fourth and eighth sizes. Also interesting that you have a separate row for "66+F2"; I special-case those two instructions (CRC32, MOVBE) with a flag.
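Just to illustrate the data-only idea (this is not the actual table layout of either decoder, only the general shape):

    #include <stdint.h>

    /* A few bytes per opcode, indexed by (map, opcode), with behaviour
       expressed as small enums instead of code. */
    struct instr_desc {
        uint16_t mnemonic;     /* index into a separate mnemonic table */
        uint8_t  operands[4];  /* encoded operand kind + size per operand */
        uint8_t  flags;        /* operand-size rules, prefix special cases, ... */
    };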
I think the prefix decoding is not quite right for x86-64: 26/2e/36/3e are ignored in 64-bit mode, except for 2e/3e as branch-not-taken/taken hints and 3e as notrack. (See SDM Vol. 1, 3.3.7.1: "Other segment override prefixes (CS, DS, ES, and SS) are ignored.") Also, REX prefixes that don't immediately precede the opcode (or the VEX/EVEX prefix) are ignored. Anyhow, I need to take a closer look at the decoder when I have more time. :-)
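Roughly the REX rule I mean, as a sketch (simplified, 64-bit mode only, not QEMU's code):

    #include <stddef.h>
    #include <stdint.h>
    #include <stdbool.h>

    static bool is_legacy_prefix(uint8_t b)
    {
        switch (b) {
        case 0x26: case 0x2e: case 0x36: case 0x3e:   /* ignored as segments in 64-bit mode */
        case 0x64: case 0x65:                         /* FS/GS overrides, these do apply */
        case 0x66: case 0x67:
        case 0xf0: case 0xf2: case 0xf3:
            return true;
        default:
            return false;
        }
    }

    /* Return the REX byte that actually applies, i.e. only one that
       immediately precedes the opcode. */
    static uint8_t effective_rex(const uint8_t *code, size_t *consumed)
    {
        uint8_t rex = 0;
        size_t i = 0;
        for (;;) {
            uint8_t b = code[i];
            if ((b & 0xf0) == 0x40) {
                rex = b;        /* candidate REX ... */
            } else if (is_legacy_prefix(b)) {
                rex = 0;        /* ... dropped if any other prefix follows it */
            } else {
                break;          /* opcode (or VEX/EVEX) byte */
            }
            i++;
        }
        *consumed = i;
        return rex;
    }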
I came to the same conclusion that this is necessary with APX. The map+prefix+opcode combination identifies how the other fields are to be interpreted. For AVX-512, storing the last byte was sufficient, but with APX, vvvv got a second meaning.
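As a sketch of what "keep the raw bits" means in practice (a hypothetical struct, neither QEMU's nor fadec's):

    #include <stdint.h>

    /* Stash the raw EVEX payload and only interpret it once
       map + prefix + opcode are known. */
    struct evex_raw {
        uint8_t p0, p1, p2;   /* the three payload bytes following 0x62 */
        uint8_t opcode;
    };

    /* vvvv lives in bits 6:3 of p1, stored inverted; whether it names a vector
       register, a GPR (APX), or must be all-ones depends on the opcode. */
    static inline unsigned evex_vvvv(const struct evex_raw *e)
    {
        return (~(unsigned)e->p1 >> 3) & 0xf;
    }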
Yes... especially for corner cases, getting real hardware is the only reliable way to find out how the CPU behaves.
[1]: https://github.com/aengelke/fadec/blob/master/instrs.txt [2]: https://github.com/aengelke/fadec/blob/master/decode.c
FWIW here's all 700 lines of Blink's x86 decoder. https://github.com/jart/blink/blob/master/blink/x86.c
I don't want to be the person that has to add an instruction to blink...
It looks like someone started with Intel's XED code (which relies on custom tables to specify instructions, and compiles that to C tables at compile time) and hand-minimized the code into a single file. I'm guessing it's designed to never have any more code added to it.
I minimized it by hand to make it more maintainable, to me at least.
Yeah that's only for 0F38F0 to 0F38FF.
Also AVX for VPMOVSX and VPMOVZX but those are handled differently. I probably should check if xh is actually redundant... EDIT: it's only needed for VCVTPS2PH, which is the only instruction with a half-sized destination.
Interesting, I need to check how they interact with the FS/GS prefixes (64/65).
Oh, didn't know that!
For memory operations, they are ignored: 64-2e-65-3e gives 65 as segment override. (From my memory and the resulting implementation, I did some tests with hardware a few years back.)
I do need to check myself how 2e/3e on branches interact with other segment overrides, though.
The semantics of LZCNT combined with its encoding feels like an own goal: it's encoded as a BSR instruction with a legacy-ignored prefix, but for nonzero inputs its result is the operand size minus one, minus the result of the legacy version (for a 32-bit operand, lzcnt(x) == 31 - bsr(x)). Yes, clz() is a function that exists, but the extra subtraction in its implementation feels like a small cost to pay for extra compatibility when LZCNT could've just been BSR with different zero-input semantics.
Yes, it's like someone looked at TZCNT and thought "let's encode LZCNT the same way", but it makes no sense.
LZCNT and TZCNT are corrections (originally introduced by AMD) for the serious mistake made by the designers of the Intel 80386 when they defined BSF and BSR.
Because on the very slow 80386 the wrong definition for a zero input did not matter much, they failed to foresee how bad it would become for future pipelined and superscalar CPUs, where having to insert a test for a zero input can slow a program down many times over.
Nevertheless, they should have paid more attention to earlier uses of such instructions. For instance, the Cray-1 had defined its leading-zero count the right way almost ten years earlier.
LZCNT was introduced by Intel in BMI1, while TZCNT was introduced by AMD.
I'm not following: as long as you're introducing a new, incompatible instruction for leading-zero counting, you'd definitely choose LZCNT semantics over BSR semantics -- in retrospect, LZCNT has clearly won over BSR as the primitive for this use case. BSR is just a historical anomaly with a zero-input problem and no benefit.
What would be the point of offering a new variation of BSR with different zero-input semantics?
When it comes to TZCNT vs. BSF, they are just compatible enough for a new compiler to use unconditionally (if we assume that BSF with a zero input leaves its output register unchanged, as it has for decades, and as documented by AMD, who defined LZCNT) -- consider an instruction sequence like the one sketched below.
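Something along these lines, sketched in GNU C inline asm (assuming a 32-bit operand; "rep; bsf" is the raw F3-prefixed encoding that TZCNT-capable CPUs decode as TZCNT):

    /* Preload ECX with 32, then issue the F3-prefixed BSF. Old CPUs execute
       plain BSF and leave ECX untouched for a zero input; TZCNT-capable CPUs
       write 32 themselves and need no such dependency. */
    static inline unsigned tzcnt_compat(unsigned x)
    {
        unsigned r;
        __asm__ ("rep; bsfl %1, %0"
                 : "=c"(r)             /* result in ECX, preloaded via the "0"(32) below */
                 : "rm"(x), "0"(32));
        return r;
    }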
Such a sequence behaves identically on everything from the original 80386 up, and is better on superscalars with TZCNT support because it avoids the false dependency on ECX. The reason is that BSF with a nonzero input and TZCNT with a nonzero input produce exactly the same output. That's emphatically not true of BSR and LZCNT, so we're stuck relegating LZCNT to compiler flags.
TZCNT and BSF are not completely identical even for non-zero input: BSF sets ZF when the input is zero, whereas TZCNT sets ZF when the output is zero (i.e., when the least significant bit of the input is set).
Another bonus quirk, from the 486 and Pentium era...
BSWAP EAX converts from little endian to big endian and vice versa. It was a 32-bit instruction to begin with.
However, there's the 0x66 prefix, which switches between 16- and 32-bit operand size. If you apply it to BSWAP EAX, undefined, funky things happen.
On some CPUs (Intel vs. AMD) the prefix was simply ignored. On others it did something I call an "inner swap": e.g., of the four bytes stored in EAX, bytes 1 and 2 get swapped.
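If you want to poke at this yourself, something like the following works (a sketch; .byte is needed because assemblers refuse to emit a 16-bit BSWAP directly):

    #include <stdint.h>

    /* Executes 66 0F C8, i.e. BSWAP of the A register with an operand-size
       prefix ("bswap ax"), and returns whatever the CPU leaves in EAX. */
    static inline uint32_t bswap_ax_probe(uint32_t x)
    {
        __asm__ (".byte 0x66, 0x0f, 0xc8" : "+a"(x));
        return x;
    }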
Also known as "bswap ax", and research shows that it does something surprising but consistent on almost all hardware: it zeros the register.
https://www.ragestorm.net/blogs/?p=141
https://gynvael.coldwind.pl/?id=268
However, this page, now gone, suggests that some CPUs (early 486s?) did something different: http://web.archive.org/web/20071231192014/http://www.df.lth....
Unfortunately I have not found any evidence, nor any reason to believe, that the "inner swap" behaviour you mention exists on any CPU --- except perhaps in some flawed emulators?