It's always interesting to see decades-old papers, sometimes published with little to no fanfare, suddenly become "state of the art" because hardware has become powerful enough.
For example, Z-buffers[1], used in 3D video games. When they were first published, they weren't even the main topic of the paper, just a side note, because they required an expensive amount of memory to run.
Turns out megabytes are quite cheap a few decades later, and every realtime 3D renderer ended up using them.
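For the curious, a minimal sketch of the idea (hypothetical names, not the paper's code): keep one depth value per pixel and only accept a fragment if it's closer than what's already stored, so the memory cost is simply one extra value per pixel.

    // Minimal Z-buffer sketch: one depth value per pixel, closest fragment wins.
    struct ZBuffer {
        width: usize,
        depth: Vec<f32>, // one f32 per pixel -> width * height * 4 bytes
        color: Vec<u32>, // the ordinary frame buffer alongside it
    }

    impl ZBuffer {
        fn new(width: usize, height: usize) -> Self {
            Self {
                width,
                depth: vec![f32::INFINITY; width * height], // "infinitely far away"
                color: vec![0; width * height],
            }
        }

        // Write the fragment only if it is nearer than what is already stored.
        fn plot(&mut self, x: usize, y: usize, z: f32, rgba: u32) {
            let i = y * self.width + x;
            if z < self.depth[i] {
                self.depth[i] = z;
                self.color[i] = rgba;
            }
        }
    }

    fn main() {
        let mut zb = ZBuffer::new(640, 480); // ~1.2 MB of depth alone at 4 bytes/pixel
        zb.plot(10, 10, 0.5, 0xFF0000FF); // near red fragment
        zb.plot(10, 10, 0.9, 0x00FF00FF); // farther green fragment, rejected
        assert_eq!(zb.color[10 * 640 + 10], 0xFF0000FF);
    }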
I sometimes wonder if there's an academic career hidden in there for an engineer: go to the library and read what the CS folks were publishing on physical paper; maybe there are some ideas that can actually be implemented now that weren't practical back then.
In a series of books by David Brin [0] there is a galaxy-wide institution known as the library, and civilizations regularly mine its millions of years of data for suddenly-relevant-again techniques and technologies.
I remember one bit where a species had launched some tricky fleet-destroying weapon to surprise their enemies with esoteric physics, only to have it reversed against them, possibly because the Librarian that once helped their research-agent wasn't entirely unbiased.
[0] https://en.m.wikipedia.org/wiki/Uplift_Universe
Mind editing that to give a spoiler alert?
Hey, could someone clarify why all the downvotes?
I'd have thought asking for a spoiler alert would be pretty acceptable.
Is a spoiler alert.
Also:
https://news.ycombinator.com/newsguidelines.html
Thanks.
Don't worry, it's nowhere near the main plot or characters, just a small "meanwhile, elsewhere" vignette. Basically it emphasizes the "why bother, everything's already invented" mentality of most client-races, and how deep access and query-secrecy have big impacts.
Also, in Vinge's A Deepness in the Sky there aren't really "programmers" as we know them anymore, but "programmer-archeologists" who just search the archives for code components to reuse.
That's pretty much what any (decent) programmer does today as well. You first search your code base to see if the application already does something like it; if not, you check whether there is a published library. Where this starts to fail is the idea that connecting those already-written pieces together is easy.
...And then ask an LLM for code.
I think that's unfair to the programmer-archaeologists: young Pham wanted to write things from scratch (and took advantage when he got a chance to), and other characters said he was a talented hacker, but as they also said, it was just way more productive most of the time to cobble together ancient code.
Also: In the Destiny mythic sci-fi franchise, the human golden age ended with a mysterious apocalypse, leaving "cryptarchs" (crypto-archeologists) to try to rebuild from arcane fragments of encrypted data or unknown formats.
Image and video compression are like that. Ideas for mainframes in the '80s are realtime algorithms now.
I guess they are _soft_ realtime algorithms now?
'Realtime' isn't a specific thing; there's just 'fast enough'. Oldschool "render each frame during the scan line, staying ahead of the electron beam" graphics was pretty hardcore though.
'Realtime' actually has multiple meanings.
At least one of them is very, very specific and is the one that Wikipedia uses in https://en.wikipedia.org/wiki/Real-time_computing
Strictly speaking, this definition doesn't say anything about how tight those deadlines are, as long as you can guarantee some deadlines.
There's also 'soft' real time, where you try to hit your deadlines often enough, but there are no guarantees and a missed deadline is not the end of the world. Games are a good example of that, including the example of chasing the electron beam.
ABS brakes are an example of a 'hard' real time system: the deadlines aren't nearly as tight as for video game frames, but you really, really can't afford to miss them.
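To make the soft flavor concrete, here's a toy frame loop (hypothetical, not any particular engine's code): each iteration has a deadline, and a miss is merely counted and tolerated, whereas a hard real-time system would have to guarantee the deadline is never missed.

    use std::time::{Duration, Instant};

    // Soft real time: aim for a ~16.6 ms frame budget, tolerate occasional misses.
    // (A hard real-time system, like an ABS controller, must instead *guarantee*
    // the deadline, typically via worst-case analysis and an RTOS, not a best-effort loop.)
    fn main() {
        let budget = Duration::from_micros(16_667); // ~60 Hz
        let mut missed = 0u32;

        for frame in 0..120u32 {
            let start = Instant::now();

            simulate_and_render(frame); // best-effort work

            let elapsed = start.elapsed();
            if elapsed > budget {
                missed += 1; // soft real time: log it, drop a frame, carry on
            } else {
                std::thread::sleep(budget - elapsed); // idle out the rest of the slot
            }
        }
        println!("missed {missed} of 120 deadlines");
    }

    fn simulate_and_render(frame: u32) {
        // Placeholder workload with variable cost, like a real game frame.
        std::thread::sleep(Duration::from_millis(if frame % 30 == 0 { 20 } else { 5 }));
    }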
Yes, "read 10 year old papers as a source of ideas ripe for commercialization" IS common advice in universities.
A post-doc in my chemistry lab had the saying, “two weeks in the lab will save you a day in the library”
Weeks of work can save hours of planning.
Heck. Look at 10 year old product launch PRs from big tech. Anything that Google launched 10 years ago and killed, but seems like a good idea today is probably also easier to do. And if you look 5-10 years before that, you can find the Yahoo launch PR where they did the same thing ;p
I always like to point out Wordle for nailing the timing. Some version of it has been viable since 2000, and mass smartphone adoption helped with how it spread, but it could have been viable in 2010. What it did was land at the right moment.
The whole AI trend is in part due to things that are now possible on GPU supercomputers with gigabytes of RAM, backed by petabytes of data, running at the top speed of GPUs. Some of the algorithms date back to before the AI winter; it's just that we can now do the same thing with a ton more data and faster.
All of the main algorithms do (multi-layer perceptrons and stochastic gradient descent are from the '50s and '60s!). Basically the only thing that changed is we decided to multiply some of the outputs of the multi-layer perceptrons by each other and softmax them (attention) before passing them back into more layers. Almost all of the other stuff is just gravy to make it converge faster (and run faster on modern hardware).
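For a flavor of that "multiply and softmax" step, here's a toy scaled dot-product attention over made-up vectors (a sketch of the textbook formulation, not any particular model's code):

    // Toy scaled dot-product attention: weights = softmax(q.k / sqrt(d)), out = weights.v
    fn dot(a: &[f64], b: &[f64]) -> f64 {
        a.iter().zip(b).map(|(x, y)| x * y).sum()
    }

    fn softmax(xs: &[f64]) -> Vec<f64> {
        let max = xs.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
        let exps: Vec<f64> = xs.iter().map(|x| (x - max).exp()).collect();
        let sum: f64 = exps.iter().sum();
        exps.iter().map(|e| e / sum).collect()
    }

    // One query attending over a handful of key/value pairs.
    fn attention(query: &[f64], keys: &[Vec<f64>], values: &[Vec<f64>]) -> Vec<f64> {
        let scale = (query.len() as f64).sqrt();
        let scores: Vec<f64> = keys.iter().map(|k| dot(query, k) / scale).collect();
        let weights = softmax(&scores);

        // Weighted sum of the value vectors.
        let dim = values[0].len();
        let mut out = vec![0.0; dim];
        for (w, v) in weights.iter().zip(values) {
            for i in 0..dim {
                out[i] += w * v[i];
            }
        }
        out
    }

    fn main() {
        let q = vec![1.0, 0.0];
        let keys = vec![vec![1.0, 0.0], vec![0.0, 1.0]];
        let values = vec![vec![10.0, 0.0], vec![0.0, 10.0]];
        println!("{:?}", attention(&q, &keys, &values)); // leans toward the first value
    }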
this seems to be how PEG parsing became popular during the last couple of decades, for example; see https://bford.info/pub/lang/peg/ (peg.pdf p11: "This work is inspired by and heavily based on Birman’s TS/TDPL and gTS/GTDPL systems [from the 1970s ...] Unfortunately it appears TDPL and GTDPL have not seen much practical use, perhaps in large measure because they were originally developed and presented as formal models [...]")
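As a flavor of what makes PEGs attractive, here's a tiny hand-rolled sketch of ordered choice on a made-up grammar (illustrative only; real PEG tools generate this sort of code, often with memoization):

    // PEG-style matching: ordered choice tries alternatives in order and
    // commits to the first that succeeds, so there is no ambiguity.
    // Made-up grammar:  phrase <- ("hello" / "hi") " " "world"

    fn literal<'a>(input: &'a str, expected: &str) -> Option<&'a str> {
        input.strip_prefix(expected) // Some(rest) on success, None on failure
    }

    // Ordered choice: "hello" / "hi"
    fn greeting(input: &str) -> Option<&str> {
        literal(input, "hello").or_else(|| literal(input, "hi"))
    }

    // Sequence: greeting " " "world"
    fn phrase(input: &str) -> Option<&str> {
        let rest = greeting(input)?;
        let rest = literal(rest, " ")?;
        literal(rest, "world")
    }

    fn main() {
        assert_eq!(phrase("hello world"), Some(""));
        assert_eq!(phrase("hi world"), Some(""));
        assert_eq!(phrase("hey world"), None); // no alternative matched
        println!("ok");
    }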
Academic? Perhaps. Applied:
"When Soviets Won the Cold War: Wading Through Pyotr Ufimstev's Work on Stealth" (26.03.2024)
https://news.ycombinator.com/item?id=39830671
Not sure of the details of this story, but in general, having enough people that you can grant some of them time just to search for something interesting doesn't seem unrealistic.
I wonder if current AI training's effort to hoover up all the training data it can find will accidentally give us most of the benefits of that?
A human can only read so much, so has to discriminate. But our machines are omnivorous readers.
On the software side, garbage collection was well explored by academia for more than two decades before the JVM brought it to wide commercial adoption ca. 1995.
Not academia; GC (and lots of other PLT) was mature and productized then, just missed by the majority of the industry living in the dark ages. In the 90s, enterprisey architecture astronaut cultures went Java/C++ and hackerish cultures went Perl/Python seasoned with C.
given how much C++ underpins the video game industry, I don't think "enterprisey architecture astronaut" is entirely fair to video game programmers
Sure, it's a gross generalization; games and embedded have an excuse. Though I don't think C++ was that big on the games side in the mid-90s yet, it was more C.
Lisp and Smalltalk and BASIC (to name a few) were popular languages before the JVM that used GC.
When I was in high school, I wrote a Z80 disassembler in BASIC and printed out the entire TRS-80 Model 3 ROM on a thick stack of fan folded greenline paper. I spent a lot of free time figuring out what was what and commenting the code by pen in the right margin.
The (string) garbage collector was amazingly frugal and correspondingly slow. Because the garbage collector typically ran only when memory was critically low, it used almost no memory, just 4 bytes I think. (It was also possible to invoke it at any time with the FRE(0) function call, IIRC.)
The routine would sweep all of the symbol table and expression stack looking for string pointers, and would remember the lowest (highest? I forget which) string base address which hadn't already been compacted. After each pass it would move that one string and fix the symbol table pointer (or expression stack string pointer), then do another pass. As a result, it was quadratic in the number of live strings -- so if you had a large number of small strings, it was possible for your program to lock up solidly for 30 seconds or more.
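A rough simulation of that pattern (illustrative only, not the actual ROM code; I've assumed the "highest address first" variant): each pass re-scans the whole table, moves exactly one string, and repeats, which is where the quadratic behavior comes from.

    struct StrDesc { addr: usize, len: usize }

    fn compact(table: &mut [StrDesc], top: usize) -> usize {
        let mut boundary = top; // everything at or above `boundary` is already compacted
        let mut passes = 0;

        loop {
            passes += 1;
            // Full sweep: find the highest-addressed string not yet compacted.
            let next = table
                .iter()
                .enumerate()
                .filter(|(_, s)| s.addr + s.len <= boundary)
                .max_by_key(|(_, s)| s.addr)
                .map(|(i, s)| (i, s.len)); // copy out index and length

            match next {
                Some((i, len)) => {
                    // "Move" the string so it ends flush against the compacted
                    // region, then fix up its symbol-table pointer.
                    boundary -= len;
                    table[i].addr = boundary;
                }
                None => return passes,
            }
        }
    }

    fn main() {
        // Lots of tiny strings scattered through string space.
        let mut table: Vec<StrDesc> =
            (0..1000).map(|i| StrDesc { addr: 2 * i, len: 1 }).collect();
        let passes = compact(&mut table, 60_000);
        // One full sweep per live string (plus a final empty one): quadratic work.
        println!("{passes} sweeps for 1000 strings");
    }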
Your description closely matches what is described here: https://en.wikipedia.org/wiki/Garbage_collection_(computer_s...
Another example is Rust's borrow checker, which has roots in substructural type system papers from decades earlier. Many academics considered substructural type systems dead (killed by GC, more or less) until Rust resurrected the idea by combining it with some new ideas from C++ of the time.
"with some new ideas from C++ of the time"
Could you elaborate on that?
I assume it's the deterministic, implicit, synchronous, single-use destructor.
Affine logic only says what you're able to do with a value; it doesn't say anything about what happens when that value goes out of scope.
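A small Rust illustration of how those two pieces combine (made-up type): moves give the affine "use at most once" rule, and Drop supplies the deterministic end-of-scope behavior borrowed from C++ RAII.

    // The affine part: a value can be moved (used) at most once.
    // The C++-inspired part: a deterministic destructor runs when it goes out of scope.
    struct Connection {
        name: &'static str,
    }

    impl Drop for Connection {
        fn drop(&mut self) {
            // Runs implicitly, synchronously, exactly once, at a known point.
            println!("closing {}", self.name);
        }
    }

    fn consume(conn: Connection) {
        println!("using {}", conn.name);
    } // `conn` dropped here: its destructor runs at this deterministic point

    fn main() {
        let c = Connection { name: "db" };
        consume(c);      // ownership moves into `consume`
        // consume(c);   // error[E0382]: use of moved value -- the affine rule
        println!("after consume");
    }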
Hmm, that's possible
Z-buffers don't just require roughly another frame buffer's worth of memory, but also lots of read and write bandwidth per pixel. This extra memory bandwidth requirement made them expensive to implement well. High-end implementations used dedicated RAM channels for them, but on lower-end hardware they "stole" a lot of memory bandwidth from a shared memory interface; e.g. some N64 games optimised by drawing a software-managed background/foreground with the Z-buffer disabled to avoid the cost of reading and updating the depth information.
The power of the Z-buffer lies in letting you skip all of the expensive per-pixel computations. A secondary perk is the avoidance of overdraw.
Early 3D implementations, especially gaming-related ones, didn't have any expensive per-pixel calculations. The cost of looking up and updating depth in the Z-buffer was higher than just rendering and storing pixels. Clever programming to avoid overdraw (Doom sectors, Duke3D portals, Quake's sorted polygon spans) was still cheaper than the Z-buffer's read-compare-update.
Even the first real 3D accelerators (3Dfx) treated the Z-buffer as an afterthought at the back of the fixed pipeline: every pixel of every triangle was brute-force rendered, textured, lit and blended, just to be discarded by the Z-buffer at the very end. It's very likely the N64 acted in the same manner, and enabling the Z-buffer didn't cut the cost of texturing and shading.
The Z-buffer started to shine with the introduction of pixel shaders, where per-pixel computations became expensive enough (~2000, GF/Radeon).
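A sketch of the difference (hypothetical shade function, not real driver code): with a late depth test you pay for shading every covered pixel, with an early test you skip the expensive work for occluded ones.

    // Contrast of the two orderings; expensive_shade stands in for texturing,
    // lighting, or whatever the pixel shader does.
    fn expensive_shade(x: usize, y: usize) -> u32 {
        // Pretend this is many ALU ops and texture fetches per pixel.
        ((x * 31 + y * 17) & 0xFF) as u32
    }

    // Late Z (early fixed-function style): do all the work, then maybe throw it away.
    fn draw_late_z(depth: &mut [f32], color: &mut [u32], w: usize, x: usize, y: usize, z: f32) {
        let c = expensive_shade(x, y); // always paid
        let i = y * w + x;
        if z < depth[i] {
            depth[i] = z;
            color[i] = c;
        }
    }

    // Early Z: test first, skip shading for occluded pixels entirely.
    fn draw_early_z(depth: &mut [f32], color: &mut [u32], w: usize, x: usize, y: usize, z: f32) {
        let i = y * w + x;
        if z < depth[i] {
            depth[i] = z;
            color[i] = expensive_shade(x, y); // only paid for visible pixels
        }
    }

    fn main() {
        let (w, h) = (4usize, 4usize);
        let mut depth = vec![f32::INFINITY; w * h];
        let mut color = vec![0u32; w * h];
        draw_early_z(&mut depth, &mut color, w, 1, 1, 0.5);
        draw_early_z(&mut depth, &mut color, w, 1, 1, 0.9); // occluded: shading skipped
        draw_late_z(&mut depth, &mut color, w, 1, 1, 0.9);  // occluded: shading paid anyway
        println!("{}", color[w + 1]);
    }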
Some modern algorithmic smarts (like hierarchical Z) helped a lot to reduce the memory bandwidth demands, to be fair.
The EEVDF scheduling algorithm is also a good example. Designed back in 1995, and it's now the default process scheduler in Linux.
I never heard about this before your post.
Wiki tells me: <<Earliest eligible virtual deadline first (EEVDF) is a dynamic priority proportional share scheduling algorithm for soft real-time systems.>>
Ref: https://en.wikipedia.org/wiki/Earliest_eligible_virtual_dead...
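As a toy rendition of the selection rule from the 1995 paper (heavily simplified; the Linux implementation tracks lag, weighted virtual time, and so on): each task gets a virtual deadline of eligible-time + request/weight, and the scheduler picks the eligible task with the earliest one.

    // Toy EEVDF pick: among tasks whose virtual eligible time has arrived,
    // choose the one with the earliest virtual deadline.
    struct Task {
        name: &'static str,
        weight: f64,    // share of the CPU this task is entitled to
        veligible: f64, // virtual time at which it becomes eligible
        request: f64,   // how much service it asks for this round
    }

    impl Task {
        fn vdeadline(&self) -> f64 {
            // The deadline recedes more slowly for heavier (higher-weight) tasks.
            self.veligible + self.request / self.weight
        }
    }

    fn pick<'a>(tasks: &'a [Task], vnow: f64) -> Option<&'a Task> {
        tasks
            .iter()
            .filter(|t| t.veligible <= vnow)                        // eligible...
            .min_by(|a, b| a.vdeadline().total_cmp(&b.vdeadline())) // ...earliest virtual deadline
    }

    fn main() {
        let tasks = [
            Task { name: "batch",       weight: 1.0, veligible: 0.0, request: 10.0 },
            Task { name: "interactive", weight: 4.0, veligible: 0.0, request: 10.0 },
            Task { name: "not-yet",     weight: 9.0, veligible: 5.0, request: 10.0 },
        ];
        // At vnow = 0 only the first two are eligible; "interactive" has the
        // earlier virtual deadline (0 + 10/4) and wins.
        println!("{:?}", pick(&tasks, 0.0).map(|t| t.name));
    }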
Another example is Low-Density Parity-Check codes [1]. Discovered in 1962 by Robert Gallager but abandoned and forgotten for decades because they were computationally impractical. It looks like there was a 38-year gap in the literature until they were rediscovered by David MacKay [2].
The first mainstream use was in 2003. They are now used in WiFi, Ethernet and 5G.
[1] https://en.wikipedia.org/wiki/Low-density_parity-check_code
[2] https://scholar.google.com/scholar?q=%22low+density+parity+c...
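For a flavor of what the codes actually are (a made-up toy matrix, far smaller and denser than a real LDPC code): a received word is a valid codeword when every parity check, i.e. every row of a sparse matrix H, sums to zero mod 2. Real decoders run iterative belief propagation on top of this, which is the computationally heavy part.

    // Toy parity-check validation over GF(2): a received word is a codeword
    // iff H * c^T = 0 (mod 2). This only shows the check itself, not decoding.
    const H: [[u8; 6]; 3] = [
        [1, 1, 0, 1, 0, 0],
        [0, 1, 1, 0, 1, 0],
        [1, 0, 1, 0, 0, 1],
    ];

    fn syndrome(word: &[u8; 6]) -> [u8; 3] {
        let mut s = [0u8; 3];
        for (row, h) in H.iter().enumerate() {
            s[row] = h.iter().zip(word).map(|(a, b)| a & b).fold(0, |acc, x| acc ^ x);
        }
        s
    }

    fn main() {
        let ok = [1, 1, 0, 0, 1, 1];      // satisfies all three checks
        let bad = [1, 1, 0, 0, 1, 0];     // single bit flipped
        println!("{:?}", syndrome(&ok));  // [0, 0, 0] -> valid codeword
        println!("{:?}", syndrome(&bad)); // nonzero syndrome -> error detected
    }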
I'd rather say that we had been capable of such a design for several decades, but only now has the current set of trade-offs made it attractive. Single-core performance was stifled over the last two decades by prioritizing horizontal scaling (more cores), so the complexity / die area of each individual core became a concern. I imagine that if this trend hadn't taken place for some reason, and CPU designers had primarily pursued single-core performance, we'd have seen an implementation much sooner.
Regarding the Z-buffer, I kind of get why this would appear as a side note; it's a simple concept. Perhaps an even better example is ray tracing: the concept is quite obvious even to people without a 3D graphics background, but it was just impractical (for real-time rendering) in terms of performance until recently. What I find interesting is that we haven't found a simpler approach to approximating true-to-life rendering and need to fall back on this old, sort of naive (and expensive) solution.
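To make the "obvious concept" point concrete, this is the kind of primitive the whole method rests on (a standard ray-sphere intersection, toy code): the simplicity is right there, and so is the cost, since something like it runs per ray per object unless you add acceleration structures.

    // Classic ray-sphere intersection via the quadratic formula:
    // |o + t*d - c|^2 = r^2  =>  solve for the nearest positive t.
    #[derive(Clone, Copy)]
    struct Vec3 { x: f64, y: f64, z: f64 }

    fn sub(a: Vec3, b: Vec3) -> Vec3 { Vec3 { x: a.x - b.x, y: a.y - b.y, z: a.z - b.z } }
    fn dot(a: Vec3, b: Vec3) -> f64 { a.x * b.x + a.y * b.y + a.z * b.z }

    fn hit_sphere(origin: Vec3, dir: Vec3, center: Vec3, radius: f64) -> Option<f64> {
        let oc = sub(origin, center);
        let a = dot(dir, dir);
        let b = 2.0 * dot(oc, dir);
        let c = dot(oc, oc) - radius * radius;
        let disc = b * b - 4.0 * a * c;
        if disc < 0.0 {
            return None; // ray misses the sphere
        }
        let t = (-b - disc.sqrt()) / (2.0 * a); // nearer of the two roots
        if t > 0.0 { Some(t) } else { None }
    }

    fn main() {
        let origin = Vec3 { x: 0.0, y: 0.0, z: 0.0 };
        let dir = Vec3 { x: 0.0, y: 0.0, z: -1.0 };
        let center = Vec3 { x: 0.0, y: 0.0, z: -5.0 };
        // Per pixel you'd test against every object (or a BVH of them),
        // which is where the historical impracticality came from.
        println!("{:?}", hit_sphere(origin, dir, center, 1.0)); // Some(4.0)
    }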