
Reptar

bobim
33 replies
21h14m

Is it even possible to design a CPU with out-of-order and speculative execution that has no security issues? Is the future a swarm of disconnected A55 cores, each running a single application?

Tuna-Fish
19 replies
21h1m

This vulnerability was not caused by OoO or speculative execution. It was caused by the fact that x86 was designed 45 years ago, and has had feature after feature piled on the same base, which has never been adequately rebuilt.

The more proximate cause is that some instructions with multiple redundant prefixes (which is legal, but pointless) have their length miscalculated by some Intel CPUs, which results in wrong outcomes.
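
Concretely (my reading of taviso's writeup linked elsewhere in the thread, so treat the exact bytes as a sketch): a REX prefix is legal but meaningless on movsb, and slipping one between rep and the opcode is the kind of redundancy involved:

    ; NASM-style byte listing of the trigger shape
    db 0xf3, 0x44, 0xa4   ; f3 = rep prefix
                          ; 44 = rex.r prefix, redundant: movsb has no
                          ;      register operands for rex.r to extend
                          ; a4 = movsb opcode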

epcoa
17 replies
20h29m

Not entirely pointless; redundant prefixes are occasionally a useful method for alignment.

TheCoreh
14 replies
20h22m

IMO a more sensible approach for that use case would be to have well-defined, specialized padding prefixes instead of relying on the case-by-case behavior of redundant prefixes. (However, I understand there's almost certainly a good historical reason it wasn't done that way.)
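
To illustrate what that padding looks like today (my own sketch, not something from the thread): you either insert a separate NOP or prepend a prefix that restates the default behavior:

    ; two ways to grow "mov eax, [rbx]" (8B 03) by one byte:
    90 8B 03   ; nop; mov eax, [rbx]   - an extra instruction to decode
    3E 8B 03   ; ds: mov eax, [rbx]    - redundant DS segment override,
               ;                         still a single instruction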

bobim
5 replies
20h12m

Are new ISAs solving this? Time to move to RISC-V?

dontlaugh
3 replies
18h24m

RISC-V is not great at this either: the compressed-instruction extension is common, making instructions variable length.

ARM 64 gets this right, with fixed-length 32-bit instructions.

snvzz
2 replies
12h17m

    ARM 64 gets this right, with fixed-length 32-bit instructions.

That comes at the expense of code density. Yet RISC-V is easy to decode, with implementations going up to 12-way decode (Veyron V2) despite the variable length.

ARM64 hardly "gets it right".

camel-cdr
1 replies
7h18m

I wouldn't say ARM64 gets it wrong either, I think both are viable approaches.

snvzz
0 replies
37m

Both approaches are viable, but RISC-V's approach is better, as it provides higher code density without imposing a significant increase in complexity.

Higher code density is valuable. E.g.:

- The decoders can see more instructions in a code window of the same size, or we can use a narrower window.

- We can have less cache, saving area and power. The smaller cache can also be clocked higher, lowering latency.

- Smaller binaries and ROM images.

Large, high-performance implementations, soon to be available (2024), will demonstrate RISC-V's advantages well.

epcoa
0 replies
18h55m

N/A and No.

epcoa
4 replies
19h24m

The prefixes are redundant, so it's not really case-by-case behavior. You're just repeating the prefix you would be using anyway in that location.

Using specialized prefixes wastes encoding space for no real gain. You realize that on most common processors NOP itself is a pseudo-instruction? Even on the apparently meme-worthy (see sibling comment) RISC-V, it's ADDI x0, x0, 0.

tedunangst
3 replies
19h2m

And then there are CPUs that retcon behavioral changes onto nops.

Moving a register to itself is functionally a nop, but the processor overloads it to signal information about priority.

https://devblogs.microsoft.com/oldnewthing/20180809-00/?p=99...

_a_a_a_
2 replies
17h59m

    A program can voluntarily set itself to low priority if it is waiting for a spin lock

What does this even mean? How can a program do this when thread priority is an OS thing? It seems just weird.

tedunangst
0 replies
16h59m

It's an SMT CPU that dynamically assigns decode, registers, etc. https://course.ece.cmu.edu/~ece740/f13/lib/exe/fetch.php?med...

epcoa
0 replies
17h1m

Hardware threads as in SMT means thread priority is also a hardware thing.

kccqzy
1 replies
19h25m

The easiest way of doing padding is to add a bunch of `nop` instructions, which are one byte each.

If you read the manual, Intel encourages minor variations of the `nop` instruction that can be lengthened to different numbers of bytes (like `nop dword ptr [eax]` or `nop dword ptr [eax + eax*1 + 00000000h]`).

To my knowledge, it is never recommended anywhere to rely on redundant prefixes on random non-nop instructions.
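
For reference, the multi-byte forms recommended in the optimization manual (listed from memory, so double-check against the SDM):

    90                           ; 1 byte:  nop
    66 90                        ; 2 bytes: 66 nop
    0F 1F 00                     ; 3 bytes: nop dword ptr [eax]
    0F 1F 40 00                  ; 4 bytes: nop dword ptr [eax + 00h]
    0F 1F 44 00 00               ; 5 bytes: nop dword ptr [eax + eax*1 + 00h]
    66 0F 1F 44 00 00            ; 6 bytes: 66 nop dword ptr [eax + eax*1 + 00h]
    0F 1F 80 00 00 00 00         ; 7 bytes: nop dword ptr [eax + 00000000h]
    0F 1F 84 00 00 00 00 00      ; 8 bytes: nop dword ptr [eax + eax*1 + 00000000h]
    66 0F 1F 84 00 00 00 00 00   ; 9 bytes: 66 nop dword ptr [eax + eax*1 + 00000000h]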

epcoa
0 replies
19h4m

NOPs are not generally free.

It's a pretty old and well known technique:

https://stackoverflow.com/questions/48046814/what-methods-ca...

Note that this technique is really only legitimate where the used prefix already has defined behavior with the given instruction ("Use of repeat prefixes and/or undefined opcodes with other Intel 64 or IA-32 instructions is reserved; such use may cause unpredictable behavior."), and of course the REX prefix has special limitations. The key is redundant, not spurious. It is not a good idea to be doing rep add for example. But otherwise, there is no issue.

shadowgovt
0 replies
16h34m

Usually, the historical reason is that adding the logic to do something well-defined when unexpected prefixes are used would cost ten more transistors per chip, adding cost to handle a corner case that almost nobody will hit anyway. Far better to let whatever the implementation does happen, as long as what happens doesn't break the system.

The issue here is their verification of possible internal CPU states didn't account for this one.

(There is, perhaps, an argument to be made that the x86 architecture has become so complex that the emulator between its embarrassingly stupid PDP-11-style single-thread codeflow and the embarrassingly parallel computation it does under the hood to give the user more performance than a really fast PDP-11 cannot be reliably tested to exhaustion, so perhaps something needs to give on the design or the cost of the chips).

iforgotpassword
1 replies
20h1m

Because they cost no (or fewer) cycles compared to NOPs?

tedunangst
0 replies
19h39m

gumby
0 replies
18h54m

    It was caused by the fact that x86 was designed 45 years ago, and has had feature after feature piled on the same base, which has never been adequately rebuilt.

Itanic would like to object! Unfortunately it can’t get through the door.

nextaccountic
7 replies
20h59m

I think formal methods could help design such a machine, if you can write a mathematical statement that amounts to "there is no side channel between A and B".

Or at least put a practical bound on how many bits per second you can extract from any such side channel (the reasoning being: if you can get at most one bit every million years, you probably don't have an attack).

Then you verify that a given design meets this constraint.

tsimionescu
3 replies
20h6m

Formal methods are widely used in processor design. But it is hard to write formal specs asserting that bugs we haven't thought about don't exist. At least, it's hard while also preserving the property of being a Turing machine.

nextaccountic
2 replies
19h26m

I know. I mean applying formal methods to this specific problem of proving side channels don't exist (which seems a very hard thing to do, and might even require modifying the whole design to be amenable to this analysis).

less_less
1 replies
19h6m

As a tidbit, this was part of how one of the teams involved in the original Spectre paper found some of the vulnerabilities. Basically the idea was to design a small CPU that could be formally shown to be free of certain timing attacks. In the process they found a bunch of things that would have to change for the analysis to work... maybe in a small system those wouldn't actually lead to vulnerabilities, but they couldn't prove it (or it would require lots of careful analysis). And in big systems, those features do lead to vulnerabilities.

nextaccountic
0 replies
15h32m

That's amazing!

Do you have a link about the CPU they designed?

bobim
1 replies
20h27m

What would be the typical size of such a constraint-based problem, and do we have the compute power to translate the rules into an implementation? And what if one forgot a rule somewhere… Deeply interesting subject.

less_less
0 replies
19h18m

I think you'd want it to be a theorem (in Lean, Coq, Isabelle/HOL or whatever) instead of a constraint problem. So it would be more limited by developer effort than by computational power.

Theoretically you can do this from software down to (idealized) gates, but in practice the effort is so great that it's only been done in extremely limited systems.

mgaunard
0 replies
20h30m

A program is itself a formal specification of what an algorithm does.

akoboldfrying
1 replies
20h48m

Well, the bug in this specific case (based on the article by Tavis O. linked elsewhere in comments) looks to be the regular kind -- probably an off-by-one in a microcode edge case. That is, here it's not the case that the CPU functions correctly but leaves behind traces of things that should be private in timing side channels, as was the case for Spectre.

trebligdivad
0 replies
19h31m

Yeah, just a fun bug rather than anything too fundamental. Still, it is a fun bug.

JohnBooty
1 replies
20h13m

    Is the future leads to a swarm of disconnected A55 
    cores each running a single application?
don't you dare tease me like that

bobim
0 replies
18h33m

And programmed in… Forth!

lmm
0 replies
11h43m

    Is it even possible to design a cpu with out-of-order and speculative execution that would have no security issue?

Yes, of course. But we'd have to put actual effort in, and realistically people wouldn't pay enough extra to make it worthwhile.

writeslowly
24 replies
20h57m

I noticed the Intel advisory [1] says the following

    Intel would like to thank Intel employees: [...] for finding this issue internally.

    Intel would like to thank Google employees: [...] for also reporting this issue.

[1] https://www.intel.com/content/www/us/en/security-center/advi...

narinxas
23 replies
20h43m

I wonder how much sooner than Google the Intel employees found this issue.

narinxas
22 replies
19h57m

But what I am really wondering is: how much money (if any) was the vulnerability worth, up to the moment when Google also discovered it?

ajross
21 replies
19h20m

As described it's just a CPU crash exploit that requires local binary execution. Getting to a vulnerability would require understanding exactly how the corrupted microcode state works, and that seems extremely difficult outside of Intel.

So as described, this isn't a "valuable" bug.

dgacmu
13 replies
18h30m

It's not super-valuable yet, but it would let you mount a really nasty DoS on cloud providers by triggering hard resets of the physical machines. Some people would probably pay for that, though it's obviously more interesting to push on privilege escalation or exfiltration.

Particularly since the MCEs triggered could prevent an automatic reboot. It would depend on what the hardware management system does - do machines presenting MCEs get pulled?

toast0
12 replies
18h25m

If I'm a cloud provider and somebody's workflow is hard resetting lots of my physical machines, I'm going to give them free access to single tenant machines at the very minimum. If they keep crashing the machines that only they run on, I guess that's ok.

dgacmu
11 replies
18h7m

You can exploit this from a single core shared instance.

So you go and find yourself a thousand cheap / free tier accounts, spin up an instance in a few regions each, and boom, you've taken out 10k physical hosts. And run it in a lambda at the same time, and see how well the security mechanisms identify and isolate you.

Causing a near simultaneous reboot of enough hosts is likely to take other parts of the infrastructure down.

ajross
10 replies
18h1m

I'm curious what part of this scheme involves "not ending up in jail"? Needless to say you can't do this without identifying yourself. To make this an exploitable DoS attack you need to be able to run arbitrary binaries on a few thousand cloud hosts that you didn't lease yourself.

mschuster91
6 replies
17h0m

    I'm curious what part of this scheme involves "not ending up in jail"? Needless to say you can't do this without identifying yourself.

Stolen credit cards are a dime a dozen, and nation state actors can just use their domestic banks or agents in the banks of other countries in a pinch to deflect blame or lay false trails.

If I were Russia or China, I'd invest a lot of money into researching all kinds of avenues for taking out the three large public cloud providers if need be: take out AWS, Google, Microsoft and, on the CDN side, Cloudflare and Akamai, and suddenly the entire Western economy grinds to a halt.

The only ones who will not be affected are the US government cloud services in AWS, as these run separate from other AWS regions - that is, unless the attacker gets access to credentials that allow them to execute on the GovCloud regions...

ajross
3 replies
15h21m

    If I were Russia or China, I'd invest a lot of money into researching all kinds of avenues for taking out the three large public cloud providers

This subthread started with "is this issue a valuable exploit". Needless to say, if you need to invoke superpower-scale cyber warfare to find an application, the answer is "no". Russia and China have plenty of options to "take out" western infrastructure if they're willing to blow things up[1] at that scale.

[1] Figuratively and literally

dgacmu
1 replies
9h24m

Countries have proven far more reluctant to use kinetic options than cyberattacks. Or, put differently, we're all hacking each other left and right, and the responses have thus far mostly remained in the digital realm.

See, e.g., https://madsciblog.tradoc.army.mil/156-what-is-the-threshold...

    responses are usually proportional to and in the same domain as the provocation

mschuster91
0 replies
7h59m

    Or, put differently, we're all hacking each other left and right and the responses have thus far mostly remained in the digital realm.

Which is both good and bad at the same time. Cyber warfare has been significantly impacting our economies and our citizens - everything from scam call centers to ransomware to industrial espionage - to the tune of many dozens of billions of dollars a year. And yet, no Western government has ever held the bad actors publicly accountable, which means they will continue to be a drain on our resources at best and a threat to national security at worst (e.g. the Chinese F-35 hack).

I mean, I'm not calling for nuking Beijing, that would be disproportionate - but even after all that's happened, Russia and China are still connected to the global Internet; no sanctions, nothing.

blibble
0 replies
6h38m

it's not superpower-scale

some bored kid with a couple of hundred stolen credit cards can bring down a significant chunk of AWS/GCP/...

vbezhenar
1 replies
15h32m

If clouds use shared servers to run their management workloads and if very important companies use shared servers to run their workloads, they would deserve it.

But I don't believe it. People are not that stupid.

mschuster91
0 replies
8h4m

    If clouds use shared servers to run their management workloads and if very important companies use shared servers to run their workloads, they would deserve it.

Why target the management plane? Fire off payloads to take down the physical VM hosts and suddenly any cloud provider has a serious issue because the entire compute capacity drops.

dgacmu
0 replies
9h35m

I mean, you kinda can. There's a depressingly thriving market for stolen cards and things like compromised accounts. A card is a couple of dollars. There are many jurisdictions that turn a blind eye to hacking US companies. Look at how hard it's been to rein in the ransomware gangs and even 'booter' (DDoS-for-rent) services.

DoS isn't as lucrative as other things; I assume that most state actors would far prefer to find a way to turn this into a privilege escalation. But being able to possibly take out a cloud provider for a while is still monetizable.

blibble
0 replies
17h15m

there exist people outside of your jurisdiction

e.g. the GRU

TeMPOraL
0 replies
17h2m

So Replit, Godbolt, and whatever other cloud-hosted compilers are there?

derefr
5 replies
18h45m

This assumes that either 1. partners and interested sponsor-state actors aren't kept abreast of Intel's microcode backend architecture, or 2. that there hasn't been at least one leak of this information from one of these partners into the hands of interested APT developers. I wouldn't put strong faith in either of these assumptions.

ajross
4 replies
17h59m

It does, but the same is true for virtually any such crash vulnerability. The question was whether this was a "valuable exploit", not whether it might theoretically be worse.

The space of theoretically-very-bad attacks is much larger than practical ones people will pay for, c.f. rowhammer.

ethbr1
3 replies
15h52m

>Getting to a vulnerability would require understanding exactly how the corrupted microcode state works, and that seems extremely difficult outside of Intel.

Intel knows exactly how their ROB works.

Therefore Intel knows the possible consequences of this bug and how to trigger them.

If there is a privilege escalation path from this, Intel knows. And anyone Intel chose to share it with knew.

Thankfully, since it's public now, the value of that decreases and customers can begin to mitigate.

ajross
2 replies
15h19m

    If there is a privilege escalation path from this, Intel knows. And anyone Intel chose to share it with knew.

No, or at least not yet. I mean, I've written plenty of bugs. More than I can count. How many of them were genuine security vulnerabilities if properly exploited? Probably not zero. But... I don't know. And I wrote the code!

saagarjha
1 replies
12h50m

Intel said it can be used for escalation if that answers your question.

lmm
0 replies
11h44m

Did they confirm that it can definitely be used for escalation? The description I saw was "may allow an authenticated user to potentially enable escalation of privilege and/or information disclosure and/or denial of service via local access" which sounds like they're covering all their bases and may not actually know what is and isn't possible.

sweetjuly
0 replies
9h2m

The blogpost describes that unrelated sibling SMT threads can become corrupted and branch erratically. If you can get a hypervisor thread executing as your SMT sibling and you can figure out how to control it (this is not an if so much as a when), that's a VM escape. The Intel advisory acknowledges this too when they say it can lead to privilege escalation. This is hardly a useless bug, in fact it's awfully powerful!

xyst
17 replies
21h24m

Reading this makes me realize how little I know of the hardware that runs my software

    Prefixes allow you to change how instructions behave by enabling or disabling features

Why do we need "prefixes" to disable or enable features? Is this for dynamically toggling features so you don't have to go into the BIOS?

db48x
5 replies
20h56m

Read https://wiki.osdev.org/X86-64_Instruction_Encoding#Legacy_Pr...

The REP prefixes are the most common; they just let you perform the same instruction a variable number of times. It looks in the CX register for the count. This makes many common loops really, really short, especially for moving objects around in memory. The memcpy function is often inlined as a single REP MOVS instruction, possibly with an instruction to copy the count into CX if it isn’t already there.
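
Roughly (a sketch using the System V x86-64 conventions, where rdi/rsi/rdx carry the arguments):

    ; memcpy(dst = rdi, src = rsi, len = rdx) inlined as a string move
    mov rcx, rdx    ; rep takes its count from rcx
    rep movsb       ; copy rcx bytes from [rsi] to [rdi], advancing both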

I suppose the REX (operand size) prefix is pretty common too, since 64-bit programs will want to operate on 64-bit values and addresses pretty frequently.

None of the prefixes toggle things that can be set globally, by the BIOS or otherwise. They all just specify things that the next instruction needs to do.

pclmulqdq
4 replies
20h46m

The ModR/M and SIB prefixes are probably the most common prefixes in instructions. They are so common that assemblers elide their existence when you read code. REX is in the same boat: so common that it's usually elided. The VEX prefix is also really common (all of the V* AVX instructions, like VMOVDQ), and then the LOCK prefix (all atomics).

After all of those, REP is not that uncommon of a prefix to run into, although many people prefer SIMD memcpy/memset to REP MOVSB/REP STOSB. It is slightly unusual.

bonzini
1 replies
20h18m

ModRM and SIB are not prefixes, they're part of the opcode (second and third bytes, after all the prefixes and the 0Fh/0F38h/0F3Ah opcode map selectors).

EarlKing
0 replies
19h31m

More specifically, they're affixed to certain opcodes that require them. There are a number of byte-sized opcodes that do not require a ModRM or SIB byte (although a number of those got gobbled up to make the REX prefix, but that's another story).

TL;DR Weeee! Intel machine language is crazy!

epcoa
0 replies
10h50m

This isn't correct. ModR/M and SIB are not prefixes. They are suffixes, and essentially part of the core instruction encoding for certain memory and register access instructions; they are the primary means of encoding the myriad addressing modes of the x86. And their existence is not elided in any meaningful way: their value is explicitly derived from the instruction operands (SIB is scale, index, base), so when you see an instruction like:

mov BYTE PTR [rdi+rbx*4],0x4

the SIB byte is determined by the register indices of rdi and rbx and the scale 4, all right there in the instruction. Likewise, ModR/M encodes the addressing mode, which is clear from the operands in the assembler listing. Though x86 is such a mess that there are cases where you can encode the same instruction in either a ModR/M form or a shorter form, e.g. PUSH/POP.
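
Byte by byte (my own decoding, worth verifying with an assembler):

    ; mov BYTE PTR [rdi+rbx*4], 0x4  =>  C6 04 9F 04
    C6   ; opcode: MOV r/m8, imm8 (the /0 lives in ModRM.reg)
    04   ; ModRM:  mod=00, reg=000 (/0), rm=100 -> SIB byte follows
    9F   ; SIB:    scale=10 (*4), index=011 (rbx), base=111 (rdi)
    04   ; imm8:   the value 0x4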

REX is a prefix, but it is a bit special: it must be the last one, and repeats are undefined. It is not elided because of commonality, but because its presence and value are usually implied by the operands; it is therefore redundant to list it.

For instance, PUSH R12 must use a REX prefix (REX.B with the one byte encoding).

EarlKing
0 replies
19h50m

There's a good reason for using vector instructions over REP: Until relatively recently that was how you got maximum performance in small, tight loops. REP is making a comeback precisely because of ERMS and FSRM, so unfortunately this will become a bigger problem going forward.

Tuna-Fish
5 replies
20h59m

x86 was designed in '78, basically for the purpose of running a primitive laser printer (or other similar workloads). The big problem with this is that the encoding space for instructions was "efficiently utilized". When new instructions, or worse, additional registers were later added, you had to fit the new instruction variants in somehow, and you did this by tacking on prefixes.

mschuster91
2 replies
20h39m

Nah, x86 goes back even earlier in its heritage - it was, effectively, a bolt-on to Intel's way older designs, as a huge part of the 8086 was being ASM source-compatible with the older 8xxx chips, even as the instruction set itself changed [1]. What utterly amazes me is that the original 8086 was mostly designed by hand by a team of not even two dozen people - and today, we have hundreds if not thousands of people working on designing ASICs...

[1] https://en.wikipedia.org/wiki/Intel_8086#The_first_x86_desig...

irdc
0 replies
8h38m

Acckkghtually, if you go back far enough you end up at the Datapoint 2200. If you want to understand where some of the crazier parts of the 8086 originate from, Ken Shirriff has a nice read: http://www.righto.com/2023/08/datapoint-to-8086.html

hulitu
0 replies
11h4m

It is because testing plays a bigger part today than back then. The complexity has also increased (people do not design at transistor level anymore).

thaumasiotes
1 replies
8h4m

    x86 was designed in '78, basically for the purpose of running a primitive laser printer

It's interesting that ASCII is transparently just a bunch of control codes for a physical printer/typewriter, combining things like "advance the paper one line", "advance the paper one inch", "reset the carriage position", and "strike an F at the carriage position", all of which are different mechanical actions that you might want a typewriter to do.

But now we have Unicode, which is dedicated to the purpose of assigning ID numbers to visual glyphs, and ASCII has been interpreted as a bunch of glyph references instead of a bunch of machine instructions, and there are the control codes with no visual representation, sitting in Unicode, being inappropriate in every possible way.

It's kind of like if Unicode were to incorporate "start microwave" as part of a set with "1", "2", "3", etc.

rswail
0 replies
5h36m

ASCII was used by teletypes, not typewriters. They were "cylinder" heads, as compared to IBM's golfball typewriters.

The endless CR/LF/CRLF line-ending problem would have been solved if the RS (Record Separator) ASCII code had been used instead of the physical CR = carriage return (i.e. move the print head back to the start of the line) and LF = line feed (i.e. rotate the paper up one line).

But Unix decided on LF, Apple used CR, Windows used CRLF, and even today, I had to get a guy to stop setting his system to "Windows" because he was screwing up a git repo with extraneous CRs.

shenberg
0 replies
21h13m

Prefixes are modifiers to specific instructions executed by the processor, e.g. to control the size of the operands or enable locking for concurrency.

jeffbee
0 replies
21h17m

It's just because x86 as an ISA has accreted over the course of 40+ years, and has variable-length instructions. Every time they extend the ISA they carve out part of the opcode space to squeeze in a new prefix. This will only continue, considering that Intel has proposed another new scheme this year.

jasonwatkinspdx
0 replies
14h36m

You got some great answers already, but to your first point, check out Hennessy and Patterson's books, namely Computer Architecture and Computer Organization and Design.

The latter is probably more suited to you unless you want to go on a dive into computer architecture itself. There are older editions available for free (authorized by the authors) on the web.

I first read the 3rd edition of Computer Architecture and, besides being one of the clearest textbooks I've ever read, it vastly improved my understanding of what's going on in there in relation to OoO, speculative execution, etc.

epcoa
0 replies
20h27m

That's a very poor summary of what prefixes are. My advice: skip the original article, which isn't very good or interesting, and read taviso's blog that is linked in the top comment (it gives a few concrete examples of these prefixes). They are modifiers that are part of the CPU instruction.

ajross
0 replies
19h15m

"Prefixes" in this case mostly expand the instruction encoding space.

So rarely-used addressing modes get a "segment prefix" that causes them to use a segment other than DS. Or x86_64 added a "REX" prefix that added more bits to the register fields allowing for 16 GPRs. Likewise the "LOCK" prefix (though poorly specified originally) causes (some!) memory operations to be atomic with respect to the rest of the system (c.f. "LOCK CMPXCHG" to effect a compare-and-set).
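
A compare-and-set built on it looks roughly like this (a sketch, with made-up register roles and a hypothetical label):

    ; atomically: if [rdi] == rax then [rdi] = rsi  (rax = expected value)
    lock cmpxchg [rdi], rsi   ; on match, stores rsi and sets ZF
    jz  success               ; ZF=0 means it failed; rax now holds the
                              ; current value, ready for another attempt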

All these things are operations other CPU architectures represent too, though they tend to pack them into the existing instruction space, requiring more bits to represent every instruction.

Notably the "REP" prefix in question turns out to be the one exception. This is a microcoded repeat prefix left over from the ancient days. But it represents operations (c.f. memset/memmove) that are performance-sensitive even today, so it's worthwhile for CPU vendors to continue to optimize them. Which is how the bug in question seems to have happened.

rvba
8 replies
21h24m

It looks like Intel was cutting corners to be faster than AMD, and now all those things are coming out. How much slower will all those processors be after multiple errata? 10%? 30%? 50%?

In a duopoly market there seems to be no real competition. And yes I know that some (not all) bugs also happen for AMD.

mschuster91
4 replies
20h34m

    And yes I know that some (not all) bugs also happen for AMD.

Some of these novel side-channel attacks actually even apply in completely unrelated architectures such as ARM [1] or RISC-V [2].

I think the problem is not (just) a lack of competition (although you're right that the x86 duopoly in desktop/laptop/non-cloud servers brings its own serious issues, about which I've written and ranted more often than I can count [3]); rather, it's that modern CPUs and SoCs have simply become so utterly complex and loaded with decades' worth of backwards-compatibility baggage that it is impossible for any single human, or even a small team of the best experts you can bring together, to fully grasp every tiny bit of them.

[1] https://www.zdnet.com/article/arm-cpus-impacted-by-rare-side...

[2] https://www.sciencedirect.com/science/article/pii/S004579062...

[3] https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...

bobim
2 replies
20h1m

So no saving grace from the ISA… humans have just lost ground on CPU design, and I suspect the situation will worsen when AI enters the picture.

mschuster91
1 replies
19h45m

    and I suspect the situation will worsen when AI enters the picture.

For now, AI lacks the contextual depth - but an AI that can actually design a CPU from scratch (and not just rehash prior-art VHDL it has ... learned? somehow)? If that happens we'll be at a Cambrian Explosion-style event anyway, and all we can do is stand on the sidelines, munch popcorn, and remember this tiny quote from Star Wars [1].

[1] https://www.youtube.com/watch?v=Xr9s6-tuppI

nwmcsween
0 replies
15h23m

Once AI can create itself, we will most likely be redundant.

snvzz
0 replies
12h11m

    Some of these novel side-channel attacks actually even apply in completely unrelated architectures such as ARM [1] or RISC-V [2].

Possible? Yes. But far less likely.

Complexity carries over and breeds bugs. RISC-V is an order of magnitude simpler than ARM64, which in turn is an order of magnitude simpler than x86.

And it is so without disadvantage [0], positioning itself as the better ISA.

[0] https://news.ycombinator.com/item?id=38272318

arp242
1 replies
20h29m

It's not clear to me this fix will have any performance impact. I strongly suspect it will be negligible or zero.

This seems like a "simple" bug of the type that people write every day, not deep architectural problems like Spectre and the like, which also affected AMD (in roughly equal measure if I recall correctly).

kmeisthax
0 replies
19h36m

Parent commenter might be thinking of Meltdown, a related architectural bug that only bit Intel and IBM PPC. Everything with speculative execution has Spectre [0], but you only have Meltdown if you speculate across security boundaries.

The reason why Meltdown has a more dramatic name than Spectre, despite being the same vulnerability, is that hardware privilege boundaries are the only defensible boundary against timing attacks. We already expect context switches to be expensive, so we're allowed to make them a little more expensive. It'd be prohibitively expensive to avoid leaking timing from, say, one executable library to a block of JIT-compiled JavaScript code within the same browser content process.

[0] https://randomascii.wordpress.com/2018/01/07/finding-a-cpu-d...

akoboldfrying
0 replies
20h18m

Not sure what other errata you're referring to, but this looks like an off-by-one in the microcode. I would expect the fix to have zero or minimal penalty.

malkia
7 replies
17h39m

Konrad Magnusson from the Paradox Interactive (Victoria 3) team found something related to that and mimalloc -> https://github.com/microsoft/mimalloc/issues/807

Not sure if fully related, but possibly.

saagarjha
6 replies
12h55m

Seems unlikely unless they somehow emitted redundant prefixes

lights0123
5 replies
12h51m

The article mentions

    This fact is sometimes useful; compilers can use redundant prefixes to pad a single instruction to a desirable alignment boundary.

so I imagine that could happen under the right optimization mode.

ithkuil
4 replies
9h21m

Why would a compiler prefer a redundant prefix over a nop for alignment?

Vecr
2 replies
8h23m

It can be faster (at runtime).

ithkuil
1 replies
6h44m

So basically you're saying that the CPU frontend missed the opportunity to ignore the 0x90 because it was an actual instruction, which would be converted into an actual NOP uop?

Is this still the case, or do modern Intel CPUs optimize out the NOP in the frontend decoder?

Vecr
0 replies
5h20m

Some compiler writers thought that was the case, if [0] is related to OP. I don't have a "modern" (after 6th gen) Intel CPU to test it on, but note that most programs are compiled for a relatively generic CPU.

[0]: https://github.com/microsoft/mimalloc/issues/807

rasz
0 replies
5h59m

tedunangst down in the comments linked https://repzret.org/p/repzret/:

"Looking in the old AMD optimisation guide for the then-current K8 processor microarchitecture (the first implementation of 64bit x86!), there is effectively mention of a “Two-Byte Near-Return ret Instruction”.

The text goes on to explain in advice 6.2 that “A two-byte ret has a rep instruction inserted before the ret, which produces the functional equivalent of the single-byte near-return ret instruction”.

It says that this form is preferred to the simple ret either when it is the target of any kind of branch, conditional (jne/je/...) or unconditional (jmp/call/...), or when it directly follows a conditional branch.

Basically, when the next instruction after a branch is a ret, whether the branch was taken or not, it should have a rep prefix.

Why? Because “The processor is unable to apply a branch prediction to the single-byte near-return form (opcode C3h) of the ret instruction.” Thus, “Use of a two-byte near-return can improve performance”, because it is not affected by this shortcoming."

...

" If a ret is at an odd offset and follows another branch, they will share a branch selector and will therefore be mispredicted (only when the branch was taken at least once, else it would not take up any branch indicator %2B selector). Otherwise, if it is the target of a branch, and if it is at an even offset but not 16-byte aligned, as all branch indicators are at odd offsets except at byte 0, it will have no branch indicator, thus no branch selector, and will be mispredicted.

Looking back at the gcc mailing list message introducing repz ret, we understand that previously, gcc generated: nop, ret

But decoding two instructions is more expensive than the equivalent repz ret.

The optimization guide for the following AMD CPU generation, the K10, has an interesting modification in the advice 6.2: instead of the two byte repz ret, the three-byte ret 0 is recommended

Continuing in the following generation of AMD CPUs, Bulldozer, we see that any advice regarding ret has disappeared from the optimization guide."

TLDR: Blame AMD K8! First x64 CPU. This GCC optimization is outdated and should only be used when specifically optimizing for K8.
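
For reference, the encodings being discussed:

    C3         ; ret      - single-byte near return
    F3 C3      ; repz ret - the two-byte form recommended for K8
    C2 00 00   ; ret 0    - the three-byte form recommended for K10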

blauditore
6 replies
20h26m

Can someone give a TL;DR for non-CPU experts? All technical articles seem pretty long and/or complex.

kmeisthax
4 replies
20h2m

x86 has a builtin memory-copy instruction, provided by the combination of the movsb instruction and a rep prefix byte that says you want the instruction to run in a loop until it runs out of data to copy. This is "rep movsb". This instruction is fairly old, meaning a lot of code still has it, even though there are faster ways to copy memory in x86.

Intel added two features to modern x86 chips that detect rep movsb and accelerate it to be as fast as those other ways. However, those features have a bug. You see, because rep is a prefix byte, you can just keep adding more prefix bytes to the instruction (up to the 15-byte instruction length limit, AFAIK). x86 has other prefix bytes too, such as rex (used to access registers r8-r15), vex, evex, etc. The part of the processor that recognizes a rep movsb does NOT account for these other prefix bytes, which makes the processor get confused in ways that are difficult to understand. The processor can start executing garbage, take the wrong branch in if statements, and so on.

Most disturbingly, when multiple physical cores are executing these "rep rep rep rep movsb" instructions at the same time, they will start generating machine check exceptions, which can at worst force a physical machine reboot. This is very bad for Google because they rent out compute time to different companies and they all need to be able to share the same machine. They don't want some prankster running these instructions and killing someone else's compute jobs. We call this a "Denial of Service" vulnerability because, while I can't read someone else's computations or change them, I can keep them from completing, which is just as bad.
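
In byte form, that stacked-prefix instruction is just the same prefix repeated (going by the description above; the writeup's actual trigger also mixes in other prefixes like rex):

    F3 F3 F3 F3 A4   ; rep rep rep rep movsb - redundant rep prefixes + movsb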

BlueTemplar
3 replies
19h39m

    they all need to be able to share the same machine

Do they? As these issues keep piling up, it just seems like it's not worth the hassle, and they should instead never do sharing like this...

jrockway
1 replies
19h30m

To some extent, anyone with a web browser is sharing their machine with other people. That's Javascript.

If you ever download untrustworthy code and run it in a VM to protect your main set of data, that's another case.

The success of cloud computing is from the idea that multiple people can share the same computer. You only need one core, but CPUs come with 128, but with the cloud you can buy just that one core and share 1/128th of the power supply, rack space, motherboard, ethernet cable, sysadmin time, etc. and that reduces your costs. That assumption is all based on virtualization working, though; nobody wants 1/128th of someone else's computer, they want their own computer that's 1/128th as fast. Bugs like these demonstrate that you're just sharing a computer with someone, which is bad for the business of cloud providers.

BlueTemplar
0 replies
5h18m

My point is that for a sufficiently large user, you can probably use enough of the 128 cores by yourself alone that it's more worthwhile to do that and turn off these mitigations: both because it removes a whole class of threats, and because the mitigations tend to have a non-negligible performance impact, especially when first discovered, on chips that haven't been designed to protect against them.

kevincox
0 replies
5h30m

If you don't want to share GCP and AWS both offer ways to rent machines that aren't shared with other users. But for most people the cost isn't worth it because shared machines work well enough and provide much better resource utilization.

Arnavion
0 replies
20h21m

Some x86 instructions can have prefixes that modify their behavior in a meaningful way. Such a prefix can be applied generally to any instruction, but it's expected to have no effect when applied to an instruction it doesn't make sense with. But it turns out the CPU actually misbehaves in some cases when this is done. Intel released a CPU firmware update to fix it.

quietpain
5 replies
20h44m

    ...our validation pipeline produced an interesting assertion...
What is a validation pipeline?

ForkMeOnTinder
3 replies
20h35m

It's described one paragraph earlier.

    I’ve written previously about a processor validation technique called Oracle Serialization that we’ve been using. The idea is to generate two forms of the same randomly generated program and verify their final state is identical.

1f60c
2 replies
20h20m

Sounds like the real story should be that Google solved the halting problem. :-P

kadoban
1 replies
20h7m

You're free to solve the halting problem for restricted sets of programs, that doesn't break any rules of the universe.

They could also just be discarding any program that runs for longer than X time, or a bunch of other possibilities.

tgv
0 replies
2h12m

They might be generating programs that they know will halt. Like: applications with finite loops and such. There are not enough details.

tonfa
0 replies
20h35m

The blog has a link to https://lock.cmpxchg8b.com/zenbleed.html#discovery which presents the concept.

ZoomerCretin
5 replies
20h4m

Intel is a known partner of the NSA. If Intel was intentionally creating backdoors at the behest of the NSA, how would they look different from this vulnerability and the many other discovered vulnerabilities before it?

thelittleone
1 replies
19h51m

But so is Google. It would be some very crafty theatrics if it's all coordinated.

ZoomerCretin
0 replies
13h51m

Only the people inserting the backdoor or using it would need to be bound by a National Security Letter's gag order. I doubt anyone at Google (including those subject to NSL gag orders) was made aware of this specific vulnerability.

    Google’s commitment to collaboration and hardware security

    As Reptar, Zenbleed, and Downfall suggest, computing hardware and processors remain susceptible to these types of vulnerabilities. This trend will only continue as hardware becomes increasingly complex. This is why Google continues to invest heavily in CPU and vulnerability research. Work like this, done in close collaboration with our industry partners, allows us to keep users safe and is critical to finding and mitigating vulnerabilities before they can be exploited.

There's a tension between the NSA wanting backdoors and service providers (CPU designers + Cloud hosting) wanting secure platforms. It's possible that by employing CPU and security researchers, Google can tip the scales a bit further in their favor.

tedunangst
0 replies
19h37m

How would you distinguish this backdoor from one inserted by an unknown partner of the NSA?

rep_lodsb
0 replies
19h54m

My guess is that it would be something that could be exploited via JavaScript. And no JIT would emit an instruction like the one that causes this bug.

gosub100
0 replies
19h43m

The backdoor would just be an encrypted stream of "random" data flowing right out of the RNG. There's some maxim of crypto that encrypted data is indistinguishable from random bytes.

Flow
5 replies
21h20m

Would it be possible to describe a modern CPU in something like TLA+ to find all non-electrical problems like these?

boxfire
1 replies
21h16m

There are still bit flipping tricks like rowhammer for RAM, I wouldn't be surprised if there are such vulnerabilities in some CPUs.

sterlind
0 replies
21h5m

Rowhammer is an electrical vulnerability though. PP specified non-electrical vulns.

sterlind
0 replies
21h6m

I've heard Intel does use TLA+ extensively for specifying their designs and verifying their specs. But TLA+ specs are extremely high-level, so they don't capture implementation details that can lead to bugs. And model checking isn't a formal proof; only (tractably small) finite state spaces can be checked with TLC. And even there, you're only checking the invariants you specified.

That said, I'm sure there's some verification framework like SPARK for VHDL, and this feels like exactly the kind of thing it should catch.

foobiekr
0 replies
16h32m

CPU designers are so professional about verification and specification that they _dwarf_ software. There's just no comparison.

dboreham
0 replies
20h39m

Formal methods have been used in CPU design for nearly 40 years [1] but not yet for everything, and the methods tend to not have "round-trip-engineering" properties (e.g. TLA+ is not actually proving validity of the code you will run in production, just your description of its behavior and your idea of exhaustive test cases).

[1] https://www.academia.edu/60937699/The_IMS_T_800_Transputer

tasty_freeze
3 replies
21h3m

Benchmarking is always problematic -- what is a good representative workload? All the same, I'd be curious whether the ucode update that plugs this bug has affected CPU performance, e.g., whether it diverts the "fast short rep move" path to just use the "bad for short moves but great for long moves" version.

kevincox
0 replies
5h34m

It's a shame that Google didn't publish numbers. They have very good profiling across all of their servers and probably have incredibly high confidence numbers for the real-world impact on this. (Assuming that your world is lots of copying protocol buffers in C++ and Java)

akoboldfrying
0 replies
20h43m

In the article by Tavis O. linked elsewhere in comments, he suggests disabling the FSRM CPU feature only as an expensive workaround, to be taken only if the microcode can't be updated for some reason. That suggests to me that he, at least, expects the update to do better.

ReactiveJelly
0 replies
19h36m

That would be the conservative thing to do. If there's no limit on microcode updates, if I was Intel, I'd consider doing that first and then speeding it up again later. Based on the 5-second guess that people who update everything regularly will care that we did the right thing for security, and people who hate updates won't be happy anyway, so at least the first update will be secure if they never get the next one.

(I think there is a limit on microcode, they seem conservative to release new ones - I don't remember the details)

doublerabbit
3 replies
21h27m

Any reason why it's named after the dinosaur from the cartoon Rugrats? Or was that just what was on TV at the time?

Maybe I should start hacking while watching Teenage Mutant Ninja Turtles.

Blackthorn
0 replies
21h17m

I think from the memey line "Halt! I am Reptar!" Plus the rep prefix

AdmiralAsshat
0 replies
20h20m

If you discover a major processor vulnerability and wanna name it Shredder/Krang/Bebop/Rocksteady, I feel like you will have earned that right!

2OEH8eoCRo0
0 replies
21h21m

rep is an assembly instruction prefix

atesti
3 replies
9h45m

I don't understand "ERMS" and "FSRM", and there seems to be nothing good on Google about them.

Are these just CPUID flags that tell you that you can use rep movsb for maximum performance instead of optimized SSE memcpy implementations? Or is it a special encoding/prefix for rep movsb to make it faster? In case of the latter, why would that be necessary? How does one make use of FSRM?

tommiegannert
0 replies
9h25m

Found this [1], which also links to the Intel Optimization Manual [2].

Seems like ERMS was a cheaper replacement for AVX and FSRM was a better version, for shorter blocks.

    Cheapest versions of later processors - Kaby Lake Celeron and Pentium, released in 2017, don't have AVX that could have been used for fast memory copy, but still have the Enhanced REP MOVSB. And some of Intel's mobile and low-power architectures released in 2018 and onwards, which were not based on SkyLake, copy about twice more bytes per CPU cycle with REP MOVSB than previous generations of microarchitectures.

    Enhanced REP MOVSB (ERMSB) before the Ice Lake microarchitecture with Fast Short REP MOV (FSRM) was only faster than AVX copy or general-use register copy if the block size is at least 256 bytes. For the blocks below 64 bytes, it was much slower, because there is a high internal startup in ERMSB - about 35 cycles. The FSRM feature intended blocks before 128 bytes also be quick.

[1] https://stackoverflow.com/a/43837564

[2] http://www.intel.com/content/dam/www/public/us/en/documents/...

rwmj
0 replies
2h18m

The flags just tell you that, on this CPU, rep movsb is fast so you don't need to use an SSE/AVX-optimized implementation.
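
Concretely, they're bits in CPUID leaf 7; a check looks something like this (bit positions as I remember them from the SDM, so verify):

    ; CPUID.(EAX=07H, ECX=0): ERMS is EBX bit 9, FSRM is EDX bit 4
    mov eax, 7
    xor ecx, ecx
    cpuid
    bt  ebx, 9    ; CF=1 -> ERMS (Enhanced REP MOVSB/STOSB) supported
    setc al       ; al = ERMS flag
    bt  edx, 4    ; CF=1 -> FSRM (Fast Short REP MOV) supported
    setc ah       ; ah = FSRM flag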

ithkuil
0 replies
9h21m

FSRM is just the name of a CPU optimization that affects existing code.

Choosing an optimal instruction selection and scheduling can be done statically at compile time or dynamically (via choosing one of several library functions at runtime, or JITting).

In order to be able to detect which is the optimal instruction scheduling at runtime you need to know the actual CPU. You could have a table of all cpu models or you could just ask your OS whether the CPU you run on has that optimization implemented.

Linux had to be patched so that it can _report_ that a CPU does implement that optimization.

https://www.phoronix.com/news/Intel-5.6-FSRM-Memmove

tazjin
2 replies
8h29m

Can we get a better title for this? "Reptar - new CPU vulnerability" or something. I thought it was some random startup ad until I picked up the name somewhere else.

weinzierl
1 replies
7h22m

If it is changed to what you suggested, a question mark would be warranted, because it is not yet clear what can be done with this "glitch" (as the article calls it).

Thorrez
0 replies
3h3m

Intel says

    A potential security vulnerability in some Intel® Processors may allow escalation of privilege and/or information disclosure and/or denial of service via local access.

https://www.intel.com/content/www/us/en/security-center/advi...

yodon
1 replies
20h40m

Dupe: https://news.ycombinator.com/item?id=38268043

(As of this writing, this post has more votes, the other has more comments)

dang
0 replies
19h54m

We'll merge that one hither. Please stand by!

varispeed
1 replies
21h22m

It's going to be a pain for cloud and shared hosting.

Most likely dedicated resources on demand will be the future. Some companies already offer it.

kevincox
0 replies
5h29m

GCP and AWS both offer non-shared hardware. If people want the extra isolation they just need to pay for it.

saagarjha
1 replies
22h16m

See also Intel’s advisory, which has a description of impact: https://www.intel.com/content/www/us/en/security-center/advi...

    Sequence of processor instructions leads to unexpected behavior for some Intel(R) Processors may allow an authenticated user to potentially enable escalation of privilege and/or information disclosure and/or denial of service via local access.

yborg
0 replies
10h53m

'Some' appears to be almost any Intel x86 CPU made in the last 6 years.

purpleidea
1 replies
7m

In this new Intel microcode bug, Tavis writes:

"We know something strange is happening, but how microcode works in modern systems is a closely guarded secret."

My question: How likely is it that this is an intentional bug door that was added into the microcode by Intel and its government partners?

I don't know enough about microcode and CPU's to be able to answer this myself, so backed-up opinions welcome!

jsnell
0 replies
1m

0%.

This isn't how anyone would backdoor a CPU. An actual backdoor would be done via some instruction sequence that is basically impossible to trigger by accident and hard to detect even when triggered.

frontalier
1 replies
20h51m

The date on the article is for tomorrow?

bitwize
0 replies
20h14m

Cereal Killer: Check this out, it's a memo about how they're gonna deal with those oil spills on the 14th.

Acid Burn: What oil spills?

Lord Nikon: Yo, brain dead, today's the 13th.

Cereal Killer: Whoa, this hasn't happened yet!

farhanhubble
1 replies
17h23m

This is such an interesting read, right in the league of "Smashing the stack" and "row hammer". As someone with very little knowledge of security I wonder if CPU designers do any kind of formal verification of the microcode architecture?

saagarjha
0 replies
12h51m

Yes.

Lammy
1 replies
21h51m

    the processor would begin to report machine check exceptions and halt.

I get it: https://www.youtube.com/watch?v=dXekDCcw2FE

shadowgovt
0 replies
15h43m

... it literally took me all Goddamn day. Well done.

Credit where credit is due: Google has some of the best codenames.

tommiegannert
0 replies
9h11m

Nice find. That indeed sounds terrible for anyone executing external code in what they believe to be sandboxes. Good thing it can be patched (and AFAICT, it seems to be a good fix, rather than a performance-affecting workaround.)

tedunangst
0 replies
22h4m

Their diagnosis reminds me of what happened when qemu ran into repz ret. https://repzret.org/p/repzret/

rep_lodsb
0 replies
20h4m

The REX prefix is redundant for 'movsb', but not 'movsd'/'movsq' (moving either 32- or 64-bit words, depending on the prefix). That may have something to do with the bug, if there is any shared microcode between those instructions?

quotemstr
0 replies
19h44m

If the problem really is that the processor is confused about instruction length, I'm impressed that this problem can be fixed in microcode without a huge performance hit: my intuition (which could be totally wrong) is that computing the length of an instruction would be something synthesized directly to logic gates.

Actually, come to think of it, my hunch is that the uOP decoder (presumably in hardware) is actually fine and that the microcoded optimized copy routine is trying to infer things about the uOP stream that just aren't true --- "Oh, this is a rep mov, so of course I need to go backward two uOPs to loop" or something.

I expect Intel's CPU team isn't going to divulge the details though. :-)

mike_d
0 replies
20h42m

The most awesome part:

    This bug was independently discovered by multiple research teams within Google, including the silifuzz team and Google Information Security Engineering.

krylon
0 replies
7h57m

This is very well written. I know little about assembly programming and Intel's ISA, let alone their microarchitectures, but I could follow the explanation and feel like I have a rough understanding of what is going on here.

Does anyone know if AMD CPUs are affected?

jefc1111
0 replies
20h54m

This was a lot more fun than the Google puff piece.

eigenform
0 replies
17h47m

I wonder which MCEs are being taken when this is triggered?

dang
0 replies
19h52m

asylteltine
0 replies
12h42m

Interesting write up. The submission needs a better and more accurate title though

ShadowBanThis01
0 replies
19h20m

Is what? Another useless title.

Borg3
0 replies
8h14m

Uhm... why not pad using NOPs? Looks much safer than slapping on random prefixes.