As long as we're getting efficiency cores and such, maybe we need some "crypto cores" added to modern architectures that make promises specifically relevant to constant-time algorithms like this, and promise not to prefetch, branch predict, etc. Sort of like the Itanium, but confined to a "crypto processor". Given how many features these things wouldn't have, the cores themselves wouldn't take much silicon, in principle.
This is the sort of thing that would metaphorically drive me to drink if I were implementing crypto code. It's an uphill battle at the best of times, and even if I finally get it all right, there are dozens of processor features, both current and future, ready to blow my code up at any time.
Speaking as a cryptography implementer, yes, these drive us up the wall.
However, crypto coprocessors would be a tremendously disruptive solution: we'd need to build mountains of scaffolding to allow switching to and off these cores, and to share memory with them, etc.
Even more critically, you can't just move the RSA multiplication to those cores and call it a day. The key is probably parsed from somewhere, right? Does the parser need to run on a crypto core? What if it comes over the network? And if you even manage to protect all the keys, what if a CPU side channel leaks the message you encrypted? Are you ok with it just because it's not a key? The only reason we don't see these attacks against non-crypto code is that finding targets is very application specific, while in crypto libraries everyone can agree leaking a key is bad.
No, processor designers "just" need to stop violating assumptions, or at least talk to us before doing it.
I don't think the security community is also going to become experts in chip design; these are two full skill sets that are already very difficult to obtain.
We must stop running untrustworthy code on modern full-performance chips.
The feedback loop that powers everything is: faster chips allow better engineering and science, creating faster chips. We’re not inserting the security community into that loop and slowing things down just so people can download random programs onto their computers and run them at random. That’s just a stupid thing to do, there’s no way to make it safe, and there never will be.
I mean, we're talking about prefetching. If there were a way to give RAM cache-like latencies, why wouldn't the hardware folks already have done it?
I almost gave you an upvote until your third paragraph, but now I have to give a hard disagree. We're running more untrusted code than ever, we absolutely should trust it less than ever, and we should have hardware and software designed with security in mind. Security should be priority #1 from here on out. We are absolutely awash in performance and memory capacity, but we keep getting surprised by bad security outcomes because security has played second fiddle for too long.
Software is now critical infrastructure in modern society, akin to the power grid and telephone lines. Neglecting security is a strategic vulnerability, and fixing it has to happen at every level of the software and hardware stack. By "strategic vulnerability" I mean an adversary crashing an entire society in milliseconds by bricking all of its computers and sending it back to the dark ages. I fundamentally don't understand the mindset of people who want to take that kind of risk for a 10% boost in their games' FPS[1].
Part of that is paying back the debt that decades of cutting corners have left us with.
In reality, the vast majority of the 1000x increase in performance and memory capacity over the past four decades has come from shrinking transistors, increasing clock speeds, and increasing memory density; the 1% or 5% or 10% gains from turning off bounds checks or prefetching aren't the lion's share. And for the record, turning off bounds checks is monumentally stupid, and people should be jailed for it.
[1] I'm exaggerating to make a point here. What we trade for a little desktop or server performance is an enormous, pervasive risk. Not just melting down in a cyberwar, but the constant barrage of intrusion and leaks that costs the economy billions upon billions of dollars per year. We're paying for security, just at the wrong end.
Me neither. But lots of engineers are out there just writing single-threaded MATLAB and Python code with lots of data dependencies, hoping the system manages to do a good job (for those operations that can't be offloaded to BLAS). So I'm glad gamer dollars subsidize the development of fast single-threaded chips that handle branchy code well.
I disagree: modern designs include deep pipelines, lots of speculation, and complex caches because that's the only way to spend the higher transistor budget, reach higher clocks, and compensate for the fact that memory latencies haven't kept up.
It will be tough, but yeah, server and mainframe users need to roll back the decision to repurpose consumer-focused chip families like x86 and ARM. RISC-V is looking good, though, and seems open enough that maybe they can pick and choose which features they take.
I'm not too worried about votes on this post; this site has lots of web devs and cloud users, and pointing out that the ecosystem they rely on is impossible to secure is destined to get lots of downvotes-to-disagree.
How is RISC-V going to solve anything here?
It is significantly less complex without compromising anything. That means a larger portion of a chip's design effort can be put elsewhere, such as into preventing side-channel attacks.
I don't really see how the design of RISC-V avoids the need to have a DMP.
Because it does not. I also do not see where, if at all, such a claim was made.
Perhaps you should explain how this design effort spent preventing side-channel attacks is spent, then?
Anything specific?
It isn't a sure thing. But since it is a more open ecosystem, maybe the designers of chips that need to run untrusted code safely can still borrow some features from the broader population of designs.
I think it is basically impossible to run untrusted code safely or to build sand-proof sandboxes, but I thought the rest of my post was too pessimistic.
Turning off bounds checks is like a 5% performance penalty. Turning off prefetching is like using a computer from twenty years ago.
Turning off prefetching while running crypto code would be a net performance gain, given that otherwise you can't implement the algorithms safely without even more expensive and fragile software mitigations. Just give me the option of configuring parts of the caches (at least data, instructions, and TLBs) as scratchpad, plus a "run without timing side channels, pretty please" bit with a clearly defined API contract, accessible by default to unprivileged userspace code. Lots of cryptographic algorithms have such small working sets that they would profit from a constant-time-accessible scratchpad in the L1d cache if they got to use data-dependent addresses into it again.
Happily, there are mechanisms to do just that, specifically for the purpose of implementing cryptography (I commented at the top level and don't want to keep spamming the URL).
I agree that hardware/software co-design is critical to solving things like this, but features like prefetching, speculation, and prediction are absolutely critical to modern pipelines and, broadly speaking, are what enable what we think of as "modern computer performance." This has been true for over 20 years now. In terms of "overhead" it's not in the same ballpark (or even the same sport, frankly) as something like bounds checking or even garbage collection. Hell, if the difference were within even one order of magnitude, they'd have done it already.
To be clear that includes what we're all doing by downloading and running Javascript to read HN.
Maybe I can say "don't run adversarial code on my same CPU" and only care about over-the-network CPU side-channels (of which there are still some), because I write Go crypto, but it doesn't sound like something my colleagues writing browser code can do.
Is this exploitable through JavaScript?
In general from what I've seen, most of these JS-based CPU exploits didn't strike me as all that practical in real world conditions. I mean, it is a problem, but not really all that worrying.
Why wouldn't it be?
Because JS/HTML provides APIs to perform cryptography (I can't recall whether the cryptography specs are part of ES or HTML/DOM). If you try to implement constant-time cryptography in JS you will run into a world of hurt, because the entire concept of "fast JS" depends on heavy speculation, with lots of exciting variation in the timing of even "primitive" operations.
No, the attack would be implemented in JS, not the victim code (though, that too, but that's not what's interesting here).
Ah, you're concerned about someone using JS to execute the side-channel portion of the attack, not the bit creating the side channel :)
How is JavaScript going to run a chosen-input attack against one of your cores for an hour?
If you leave a tab open that's running that JS..
Unfortunately somebody has tricked users into leaving JavaScript on for every site; it is a really bad situation.
Security and utility are often in opposition. The safest possible computer is one buried far underground, without any cables, in a Faraday cage. Not very useful.
Setting aside JavaScript, you can see this today with cloud computing, which has largely displaced private clouds. Cloud providers run untrusted code on shared computers; fundamentally that's what they're doing, because that's what you need for economies of scale, durability, availability, etc. So figuring out a way to run untrusted code on another machine safely is fundamentally a desirable goal. That's why people are trying to do homomorphic encryption: so that the "safely" part can go both ways, and neither the hardware owner nor the "untrusted" software needs to trust the other to execute said code.
Speak for yourself; I've got JavaScript disabled on news.ycombinator.com and it works just fine.
Note that in the vast majority of cases, crypto-related code isn't what we spend compute cycles on. If there was a straightforward, cross-architecture mechanism to say, "run this code on a single physical core with no branch prediction, no shared caches, and using in-order execution" then the real-world performance impact would be minimal, but the security benefits would be huge.
I'm in favor of adding to chips some horrible in-order, no-speculation, no-prefetching, five-stage-pipeline, Architecture 101 core that can be completely verified and made bulletproof.
But the presence of this bulletproof core would not solve the problem of running bad code on modern hardware, unless all untrusted code is run on it.
No, you don't get to say processor designers need to stop violating your assumptions. You need to stop making assumptions about behavior if that behavior is important (for cryptographic or other reasons). Your assumptions being faulty is not a valid justification, because that would mean no one could ever have added any caches or predictors at any point, since that would be "violating your assumptions". Also, let's be real here: even if "not violating your assumptions" were a reasonable position to take, it is not reasonable in any way to assume that modern processors (<30 years old) do not cache, predict, buffer, or speculate anything.
If you care about constant-time behavior you should either be writing your code such that it is timing agnostic, or you should read the platform documentation rather than making assumptions. The Apple documentation tells you how to actually get constant-time behavior, rather than leaving you to guess.
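If I remember right, the documented knob is the ARMv8.4 DIT (Data Independent Timing) bit. A minimal sketch of toggling it from C on AArch64 (not Apple's sample code, and note that whether it actually reins in the DMP on a given core is exactly what's disputed in the reply below):

```c
/* Sketch: wrap a sensitive region in PSTATE.DIT. Assumes an AArch64
 * toolchain and a CPU with FEAT_DIT; build with -march=armv8.4-a or
 * newer so the assembler accepts the DIT register name. */
#include <stdint.h>

static inline uint64_t dit_enable(void) {
    uint64_t old;
    __asm__ volatile("mrs %0, DIT" : "=r"(old));  /* save current PSTATE.DIT */
    __asm__ volatile("msr DIT, #1" ::: "memory"); /* request data-independent timing */
    return old;
}

static inline void dit_restore(uint64_t old) {
    /* PSTATE.DIT is reported in bit 24 of the MRS result. */
    if (old & (1ULL << 24))
        __asm__ volatile("msr DIT, #1" ::: "memory");
    else
        __asm__ volatile("msr DIT, #0" ::: "memory");
}

void sensitive_op(void) {
    uint64_t saved = dit_enable();
    /* ... constant-time code that touches key material ... */
    dit_restore(saved);
}
```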
Have you even read the paper? Especially the part where the attack applies to everyone’s previous idea of “timing agnostic” code, and the part where Apple does not respect the (new) DIT flag on M1/M2?
No, the paper targets "constant time" operations, not timing agnostic.
The paper even mentions that blinding works, and that to me is the canonical "separate the timing and power use of the operation from the key material" solution. The paper's complaint about this approach is that it would be specific to these prefetchers, but this type of prefetcher is increasingly prevalent across CPUs and architectures, so it seems unlikely to stay Apple-specific for long. The paper even mentions that new Intel processors have these prefetchers, and so necessarily provide functionality to disable them there too. This is all before we get to the numerous prior articles showing that key extraction via side channels is already possible against these constant-time algorithms (à la last month's (I think?) "get the secrets from the power LED" paper). The solution is to use either specialized hardware (as done for AES) or timing-agnostic code.
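To make "blinding works" concrete, here's a rough sketch of classic RSA base blinding, written against OpenSSL's BIGNUM API purely for illustration (untested, error handling trimmed, and real code would use the constant-time exponentiation paths and cache/refresh r across calls):

```c
/* Sketch: RSA decryption with base blinding. d/e/n are the private
 * exponent, public exponent and modulus; c is the ciphertext. */
#include <openssl/bn.h>

int rsa_decrypt_blinded(BIGNUM *m, const BIGNUM *c, const BIGNUM *d,
                        const BIGNUM *e, const BIGNUM *n, BN_CTX *ctx) {
    int ok = 0;
    BIGNUM *r = BN_new(), *re = BN_new(), *cb = BN_new(), *rinv = BN_new();
    if (!r || !re || !cb || !rinv) goto done;

    do {                                   /* fresh random blinding factor r */
        if (!BN_rand_range(r, n)) goto done;
    } while (BN_is_zero(r));

    if (!BN_mod_exp(re, r, e, n, ctx)) goto done;    /* re = r^e mod n        */
    if (!BN_mod_mul(cb, c, re, n, ctx)) goto done;   /* cb = c * r^e mod n    */
    if (!BN_mod_exp(m, cb, d, n, ctx)) goto done;    /* m  = cb^d = plain * r */
    if (!BN_mod_inverse(rinv, r, n, ctx)) goto done; /* rinv = r^-1 mod n     */
    if (!BN_mod_mul(m, m, rinv, n, ctx)) goto done;  /* unblind: divide out r */
    ok = 1;
done:
    BN_free(r); BN_free(re); BN_free(cb); BN_free(rinv);
    return ok;
}
```

The point is that the secret exponentiation only ever sees c * r^e for a fresh random r, so whatever a prefetcher or a power trace picks up is decorrelated from the attacker-chosen ciphertext.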
Trying to create side-channel-free code by clever construction, based on assumptions about the power and performance of all hardware derived from a simple model of how CPUs behave, is just going to change the side channels, not remove them. If it's a real attack vector you are really concerned about, you should probably just do best effort, monitor for repeated key reuse or the like, and then start blinding at some threshold.
Processor designers are very unlikely to do that for you, because everyone not working on constant time crypto gives them a whole lot of money to keep doing this. The best you might get is a mode where the set of assumptions they violate is reduced.
"Security" rarely (almost never) seems to be part of any commercially-significant spec.
Almost as if by design...
Wouldn't that "just" allow someone to see whether a key was present (and whatever information that reveals), while dramatically helping to prevent secret key extraction?
Isn't that the entire point of the secure enclave[1]?
https://support.apple.com/guide/security/secure-enclave-sec5...
The secure enclave is not a general-purpose/user-programmable processor. It only runs Apple-signed code, and access is only exposed via the Keychain APIs, which only support a very limited set of cryptographic operations.
Presumably latency for any operation is also many orders of magnitude higher than in-thread crypto, so that just doesn't work for many applications.
If you look at the CryptoKit API docs, the Secure Enclave essentially only supports P-256. Which is maybe why they didn't include ECC crypto in the examples.
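To make that concrete, here's a rough, untested sketch against the C-level Security framework rather than CryptoKit. The enclave only generates 256-bit EC keys (kSecAttrKeyTypeECSECPrimeRandom); other types or sizes are refused at creation time:

```c
/* Sketch: ask the Secure Enclave to generate a P-256 key via the
 * Security framework. Error handling is minimal for brevity. */
#include <Security/Security.h>

SecKeyRef make_enclave_p256_key(CFErrorRef *err) {
    SecAccessControlRef ac = SecAccessControlCreateWithFlags(
        kCFAllocatorDefault,
        kSecAttrAccessibleWhenUnlockedThisDeviceOnly,
        kSecAccessControlPrivateKeyUsage, err);
    if (ac == NULL)
        return NULL;

    int bits = 256;
    CFNumberRef key_size = CFNumberCreate(NULL, kCFNumberIntType, &bits);

    const void *priv_keys[] = { kSecAttrIsPermanent, kSecAttrAccessControl };
    const void *priv_vals[] = { kCFBooleanTrue, ac };
    CFDictionaryRef priv = CFDictionaryCreate(NULL, priv_keys, priv_vals, 2,
        &kCFTypeDictionaryKeyCallBacks, &kCFTypeDictionaryValueCallBacks);

    const void *keys[] = { kSecAttrKeyType, kSecAttrKeySizeInBits,
                           kSecAttrTokenID, kSecPrivateKeyAttrs };
    const void *vals[] = { kSecAttrKeyTypeECSECPrimeRandom, key_size,
                           kSecAttrTokenIDSecureEnclave, priv };
    CFDictionaryRef params = CFDictionaryCreate(NULL, keys, vals, 4,
        &kCFTypeDictionaryKeyCallBacks, &kCFTypeDictionaryValueCallBacks);

    /* Returns NULL (with *err set) if the enclave refuses the request. */
    SecKeyRef key = SecKeyCreateRandomKey(params, err);

    CFRelease(params); CFRelease(priv); CFRelease(key_size); CFRelease(ac);
    return key;
}
```

Everything the key can then do (sign, ECDH) goes through SecKey/Keychain calls, which is what the parent means by access only via the Keychain APIs.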
One option would be for people to stop downloading viruses and then running them.
Except when these vulnerabilities are exploitable from JavaScript in your web browser.
I think what's more likely is "mode switching" in which you can disable these components of the CPU for a certain section of executing code (the abstraction would probably be at the thread level).
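Something like that already exists in a very limited form: Linux lets a thread opt out of one speculation feature (speculative store bypass) via prctl(2). A sketch of that existing per-thread interface, which of course does not touch prefetchers such as a DMP:

```c
/* Sketch: per-thread speculation "mode switch" using Linux's existing
 * speculation-control interface (kernel 4.17+). This disables speculative
 * store bypass for the calling thread only. */
#include <stdio.h>
#include <sys/prctl.h>
#include <linux/prctl.h>   /* PR_SPEC_* constants on older libcs */

int disable_ssb_for_this_thread(void) {
    long cur = prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS, 0, 0, 0);
    if (cur < 0) {
        perror("PR_GET_SPECULATION_CTRL");   /* kernel or CPU lacks support */
        return -1;
    }
    /* PR_SPEC_DISABLE forces the mitigation on for this thread. */
    if (prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS,
              PR_SPEC_DISABLE, 0, 0) != 0) {
        perror("PR_SET_SPECULATION_CTRL");
        return -1;
    }
    return 0;
}
```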
Many modern architectures have crypto extensions, usually to accelerate a few common algorithms; maybe it would be good to add a few crypto-primitive instructions to enable new algorithms?
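For reference, this is what the existing extensions look like from C: a minimal sketch of one AES round via the x86 AES-NI intrinsics (key schedule and final round omitted), i.e. the kind of fixed-function primitive the comment suggests extending:

```c
/* Sketch: a single AES encryption round using the AES-NI extension.
 * Build with e.g. gcc -maes; real code also needs the key schedule
 * and a final _mm_aesenclast_si128 round. */
#include <wmmintrin.h>

__m128i aes_one_round(__m128i state, __m128i round_key) {
    /* ShiftRows + SubBytes + MixColumns + AddRoundKey, done in hardware
     * with data-independent timing. */
    return _mm_aesenc_si128(state, round_key);
}
```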
See the DIT and DOIT flags referenced in the paper and in the FAQ question about mitigations. Newer CPUs apparently provide mechanisms to do just that.
Encrypted-bus MMUs have existed since the 1990s.
However, the trend to consumer-grade hardware for cost-optimized cloud architecture ate the CPU market.
Thus, the only real choice now is consumer CPUs even in scaled applications.