A grossly over-simplified argument for SMT that resonated with me was that it could keep a precious ALU busy while a thread stalls on a cache miss.
I gather that in the early days the LPDDR used on laptops was slower too, and since cores were scarce this was more valuable there. Lately, though, we often have more cores than we can usefully scale across, and the value is harder to appreciate. We even avoid scheduling work on a core shared with an important thread, to avoid cache contention, because we know single-threaded performance will be the bottleneck.
A while back I was testing Efficient/Performance cores and SMT cores for MT rendering with DirectX 12; on my i7-12700K I found no benefit to either: just using P-cores took about the same time to render a complex scene as P+SMT and P+E+SMT. It's not always a wash, though: on the Xbox Series X we found the same test marginally faster when we scheduled work for SMT too.
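For reference, a minimal sketch of how one might pin one worker per physical core on Windows so the SMT siblings stay idle (illustration only, not the actual test code; error handling omitted):

    // Enumerate physical cores with GetLogicalProcessorInformationEx and
    // start one worker thread per core, bound to that core's lowest
    // logical processor so its SMT sibling stays unused.
    #include <windows.h>
    #include <cstdio>
    #include <thread>
    #include <vector>

    static std::vector<GROUP_AFFINITY> PhysicalCoreAffinities() {
        DWORD len = 0;
        GetLogicalProcessorInformationEx(RelationProcessorCore, nullptr, &len);
        std::vector<char> buf(len);
        GetLogicalProcessorInformationEx(
            RelationProcessorCore,
            reinterpret_cast<PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX>(buf.data()), &len);

        std::vector<GROUP_AFFINITY> cores;
        for (DWORD off = 0; off < len;) {
            auto* info = reinterpret_cast<PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX>(buf.data() + off);
            // On hybrid parts, info->Processor.EfficiencyClass can be used to keep only P-cores.
            GROUP_AFFINITY ga = info->Processor.GroupMask[0];
            ga.Mask &= ~ga.Mask + 1;   // keep only the lowest logical processor of this core
            cores.push_back(ga);
            off += info->Size;
        }
        return cores;
    }

    int main() {
        auto cores = PhysicalCoreAffinities();
        std::vector<std::thread> workers;
        for (auto& ga : cores) {
            workers.emplace_back([ga] {
                SetThreadGroupAffinity(GetCurrentThread(), &ga, nullptr);
                // ... rendering work loop would go here ...
            });
        }
        for (auto& t : workers) t.join();
        std::printf("spawned %zu workers, one per physical core\n", cores.size());
        return 0;
    }

Spawning one worker per set bit of each core's mask instead gives the +SMT variant for comparison.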
Rendering is one of the scenarios that has been either the same or slower with SMT since the beginning. This is because rendering is already math heavy and your FPU is always active, especially the dividers (division being among the most expensive operations for a processor).
SMT shines while waiting for I/O or doing some simple integer stuff. If both your threads can saturate the FPU, SMT is generally slower because of the extra tagging added to the data inside the CPU to note what belongs where.
If you're waiting for IO, you're likely getting booted off the processor by the OS anyway. SMT is most useful when your code doesn't have enough instruction-level parallelism but is still mostly compute bound.
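As a concrete illustration of "not enough instruction-level parallelism" (a sketch only; the sizes and CPU numbers are assumptions): a pointer chase where every load depends on the previous one, so the core spends most of its time waiting on DRAM:

    // A low-ILP, memory-latency-bound kernel: each load depends on the
    // previous one, so there is nothing for the out-of-order core to
    // overlap within a single thread. Build with: g++ -O2 chase.cc
    #include <cstddef>
    #include <cstdio>
    #include <numeric>
    #include <random>
    #include <utility>
    #include <vector>

    int main() {
        const std::size_t n = std::size_t{1} << 24;   // 16M entries (~128 MB), far larger than LLC
        std::vector<std::size_t> next(n);
        std::iota(next.begin(), next.end(), std::size_t{0});

        // Sattolo's algorithm: build a single random cycle so the chase
        // never falls into a short, cache-resident loop.
        std::mt19937_64 rng{42};
        for (std::size_t i = n - 1; i > 0; --i) {
            std::uniform_int_distribution<std::size_t> pick(0, i - 1);
            std::swap(next[i], next[pick(rng)]);
        }

        // The chase itself: a serial dependency chain with no ILP to extract.
        std::size_t idx = 0;
        for (long s = 0; s < 100'000'000; ++s)
            idx = next[idx];

        std::printf("%zu\n", idx);   // keep the result live
        return 0;
    }

Run one instance alone, then two instances pinned to sibling logical CPUs of one core (e.g. taskset -c 0 and taskset -c 8, if those share a core on your machine); if SMT is hiding the misses, the two together finish in not much more time than one alone.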
I believe "I/O" here is referring to data movement between DRAM and registers. Not drives or NICs.
Yes, exactly. One exception can be InfiniBand, since it can place received data directly into RAM, without CPU intervention.
DMA is a much older technology. It's just that at some point you do need the CPU to actually look at it.
InfiniBand uses RDMA, which is different from ordinary DMA. Your IB card sends the data to the client point-to-point, and the receiving IB card writes it directly to RAM. The IB driver notifies you that the data has arrived (generally via IB-accelerated MPI), and you directly LOAD your data from that memory location [0].
IOW, your data magically appears in your application's memory, at the correct place. This is what makes Mellanox special, and what led NVIDIA to acquire them.
From the linked document:
Instead of sending the packet for processing to the kernel and copying it into the memory of the user application, the host adapter directly places the packet contents in the application buffer.
[0]: https://docs.redhat.com/en/documentation/red_hat_enterprise_...
Linux has had zero copy network support for 15 years. No magic.
It's not "zero copy networking" only.
In an IB network, two cards connect point-to-point over the switch and "beam" one card's RAM contents to the other's. On top of that, with accelerated MPI, certain operations (broadcast, sum, etc.) are offloaded to the IB cards and switches, so the MPI library running on the host doesn't have to handle them, leaving time and processor cycles for the computation itself.
This is the magic I'm talking about.
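For what it's worth, a minimal sketch of the kind of collective this covers (illustration only; whether the sum runs on the host or is offloaded to the fabric, e.g. via Mellanox SHARP, is invisible at this level):

    // Sum a vector across all ranks. Build with mpicxx, run with mpirun -n 4 ./a.out
    #include <mpi.h>
    #include <cstdio>
    #include <vector>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        std::vector<double> local(1024, rank + 1.0);   // each rank's partial results
        std::vector<double> global(1024, 0.0);

        // Element-wise sum across ranks; with collective offload the
        // reduction tree can run on the NICs/switches instead of the hosts.
        MPI_Allreduce(local.data(), global.data(), static_cast<int>(local.size()),
                      MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0) std::printf("global[0] = %f\n", global[0]);
        MPI_Finalize();
        return 0;
    }

From the application's side it's the same MPI_Allreduce call either way; the offload just shows up as host cycles left free for compute.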
I’ve used other peripherals that did this. Under the hood you would have a virtual mapping to a physical address and extent, where the virtual mapping is in the address space of your process. This is how DMA works in QNX, because drivers are userspace processes. The special thing here is essentially doing the math in the same process as the driver.
I agree that sounds very nice for distributed computation.
No, you're doing MPI operations on the switch fabric and the IB ASIC itself. The CPU doesn't touch these operations; it only sees the result. NVIDIA's DPU is just a more general-purpose version of this.
IB didn't invent RDMA, and it's not even the only way to do it today.
it's also not amazingly great, since it only solves a small fraction of the cluster-communication problem. (that is, almost no program can rely on magic RDMA getting everything where it needs to be - there will always be at least some corresponding "heavyweight" messaging, since you still need locks and other synchronization.)
But the way you make rendering embarrassingly parallel is the way you make web servers parallel; treat the system as a large number of discrete tasks with deadlines you work toward and avoid letting them interact with each other as much as possible.
You don’t worry about how long it takes to render one frame of a digital movie, you worry about how many CPU hours it takes to render five minutes of the movie.
Yes, however in an SMT-enabled processor there is one physical FPU per two logical cores. The FPU is already busy with the other thread's work, so the two threads on that core take turns using the FPU, creating a bottleneck.
As a result, you get no speed boost in the best case and lose some time in the worst case.
Since SMT doesn't magically increase the number of FPUs available in a processor, if what you're doing is math heavy, SMT just doesn't help. The same is true for scientific simulation. I observed the same effect, and verified that saturating the FPU with a single thread indeed makes SMT moot.
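A minimal sketch of that kind of check (illustration only, assuming Linux and a made-up CPU numbering; build with g++ -O2 -pthread): two divide-heavy threads pinned either to two separate cores or to the two SMT siblings of one core. If the divider is the bottleneck, the sibling case should take roughly twice as long:

    // Compare two FP-divide threads on separate cores vs. SMT siblings.
    // Check /sys/devices/system/cpu/cpu*/topology/thread_siblings_list for
    // your machine's sibling pairs and adjust the CPU numbers in main().
    #include <pthread.h>
    #include <sched.h>
    #include <chrono>
    #include <cstdio>
    #include <thread>

    static void pin_to_cpu(int cpu) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    static void divide_loop(long iters) {
        volatile double x = 1e9;                 // volatile keeps the divides from being optimized out
        for (long i = 0; i < iters; ++i)
            x = x / 1.0000001;
    }

    static double run_pair(int cpu_a, int cpu_b, long iters) {
        auto t0 = std::chrono::steady_clock::now();
        std::thread a([=] { pin_to_cpu(cpu_a); divide_loop(iters); });
        std::thread b([=] { pin_to_cpu(cpu_b); divide_loop(iters); });
        a.join();
        b.join();
        return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
    }

    int main() {
        const long iters = 200'000'000;
        // Assumed topology: CPU 0 and CPU 1 are distinct physical cores,
        // CPU 0 and CPU 8 are SMT siblings of the same core.
        std::printf("two physical cores: %.2f s\n", run_pair(0, 1, iters));
        std::printf("two SMT siblings  : %.2f s\n", run_pair(0, 8, iters));
        return 0;
    }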
If you have contention around a single part of the CPU then yes, SMT will not help you. The single FPU was an issue on the first Niagara processor as well, but it still had great throughput per socket unless all processes were fighting for the FPU.
If, however, you have multiple FPUs on your processor, then it might be useful to enable SMT. As usual, it pays to tune the hardware to the workload you have. For integer-heavy workloads you might prefer SMT (there are options for up to 8 threads per physical core out there), up to the point where either cache misses or backend exhaustion become the limit.
Current processors contain one FPU per core. When you have a couple of FPU-heavy programs in a system, SMT makes sense, because it allows you to keep the FPU busy while other, lighter threads play in the sand elsewhere.
OTOH, when you run an HPC node, everything you run wants the FPU, the vector units, and the kitchen sink in that general area. Enabling SMT just makes the queue longer, and nothing is processed faster in reality.
So, as a result, SMT makes sense sometimes and is detrimental to performance at other times. Benchmarking, profiling, and system tuning are the key. We generally disable SMT on our systems because it lowers performance when the node is fully utilized (which is most of the time).
I'm not really sure why you say "one FPU per core". are you talking about the programmer-visible model? all current/mainstream processors have multiple and parallel FP FUs, with multiple FP/SIMD instructions in flight. afaik, the inflight instrs are from both threads if SMT is enabled (that is, tracked like all other uops). I'm also not sure why you say that enabling SMT "makes the queue longer" - do you mean pipeline? Or just that threads are likely to conflict?
At this point, especially with backside power, I wonder how much cache stalls on one processor result in less thermal throttling both on that processor and neighboring ones.
Maybe we should just be letting these procs take their little naps?
This leads, in the extreme, to the idea of a huge array of very simple cores, which I believe is something that has been tried but never really caught on.
Sounds like gpu to me.
The Xeon Phi was a “manycore” x86 design with lots of tiny CPU cores, something like the original Pentium, but with the addition of 512-bit SIMD and hyperthreading:
https://en.m.wikipedia.org/wiki/Xeon_Phi
IIRC the first Phi had SMT4 in a round-robin fashion, similar to the Cell PPUs. To make a core run at full speed, you had to schedule 4 threads on it.
The very, very first Phi still had its ROPs and texture units, being essentially a failed GPU with identifying marks filed off (yes, the first units were Larrabee prototypes with video outputs unpopulated)
Such a shame. I'd love to base a workstation on those.
Seems to be a hobby I never quite act upon - to misuse silicon and make it work as a workstation.
GPUs are actually SMT'd to the extreme. For example, Intel's Xe-HPG has 8-wide SMT. Other vendors have even bigger SMT: RDNA2 can have up to 16 threads in flight per core.
That description reminds me of GreenArrays' (https://www.greenarraychips.com) Forth chips that have 144 cores – although they call them "computers" because they're more independent than regular CPU cores, and eg. each has its own memory and so on. Each "computer" is very simple and small – with a 180nm geometry they can cram 8 of them in 1mm^2, and the chip is fairly energy-efficient.
Programming for these chips is apparently a bit of a nightmare, though. Because the "computers" are so simple, even e.g. calculating MD5 turns into a fairly tricky proposition, as you have to spread the algorithm across multiple computers with very small amounts of memory, so something that would be very simple on a more classic processor turns into a very low-level multithreaded ordeal.
Worth noting that the GreenArrays chip is 15 years old. 144 cores was a BIG DEAL back then. I wonder what a similar architecture compiled with a modern process could achieve. 1440 cores? More?
those weren't "real" cores. you know what current chip has FUs that it falsely calls "cores"? that's right, Nvidia GPUs. I think that's the answer to your question (pushing 20k).
it's a very interesting product - sort of the lovechild of a 1980s bitsliced microcode system optimized for application-specific pipelines (systolic array, etc).
do you know whether it's had any serious design wins? I can easily imagine it being interesting for missile guidance, maybe high-speed trading, password cracking/coin mining. you could certainly do AI and GPU work with it, but I'm not sure it would have an advantage, even given the same number of transistors and memory resources.
Calling all TIS-100 fans
Transputers.
I wonder if instead of having SMT, processors could briefly power off the unused ALUs/FPUs while waiting for something further up the pipeline, and focus on reducing heat and power consumption rather than maximizing utilization.
I consider SMT a relic left over from the days when CPU design was all about performance per square millimeter. We are in the process of replacing that goal with performance per watt, or rather slowly realizing that our goals shifted quite a while ago.
I really don't expect SMT to stay around much longer. Even more so with timing side-channel/crosstalk issues lurking, and big/little architectures offering more parallelism per chip area where single-thread performance isn't in the spotlight. Or perhaps the marketing challenge of removing a feature that was once the pride of the company is so big that SMT stays forever.
Intel is removing SMT from their next gen mobile processor.
My guess is this will help them improve ST perf. We will see how well it works, and whether AMD will follow.
They basically do: it's pretty common to clock gate inactive parts of the ALU, which reduces their power consumption greatly. Modern processor power usage is very workload-dependent for this reason.
Could you (or do they) put the "extra" LUs right next to the parts of the chip with the highest average thermal dissipation, to even out the thermal load across the chip?
Or stack them vertically, so the least consistently used parts of the chip are farthest away from the heat sink, delaying throttling.
Intel’s hyperthreading is really a write pipe hack.
It’s not so much cache misses as allowing the core to run something else while the write completes.
This is why some code scales poorly and other code achieves near linear speed ups.
Why would the core have to wait for the write to complete?
A core stalls on a write only if the store buffer is full. As hyperthreads share the store buffer, SMT makes store stalls more likely, not less (but still unlikely to be the bottleneck).
This makes sense: the Series X has GDDR RAM, which has substantially worse latency than DDR/LPDDR. SMT can help cover that latency, and the higher GDDR bandwidth supplies the extra memory bandwidth needed to feed both threads.
Anecdotally, mkp224o (.onion vanity address miner, supposedly compute-bound with little memory access) runs about 5-10% faster on my 24-core AMD with 48 threads than with 24 threads. However, I haven't tried the same benchmark with SMT disabled in firmware.
Oddly, the latency hasn't improved much - CAS is often 5-10 ns for DDR2/3/4/5. Bus width, transfers per second, queueing, and power per bit transferred and stored have all improved, but if a program depends on something that's not in cache and was poorly predicted, RAM latency is the issue.