It’s been fun being just a curious bystander in this industry for many years.
Every now and then Moore’s law hits a roadblock. Some experts see that as a clear sign that it’s reaching its end. Others say it’s already dead, because the price per transistor has actually increased. Others say it’s physics: we can approach Y, but past X nm it can’t be done.
Then you read others who claim that Intel has just been lazily enjoying its near-monopoly for the past decade and was caught off guard by TSMC’s extreme-ultraviolet prowess. Or people who really know how the sausage is made, like Jim Keller, enthusiastically stating that we are nowhere near any major fundamental limitation and can expect at least a 1000x improvement in the years to come.
Anyway, it’s really fun to watch, like I said. Hard to think of another field with such rollercoaster-like forecasting that has still delivered steady, unparalleled growth for decades.
The limitations are very real. Dennard scaling has been dead since the mid-2000s (that is, power use per unit area has been increasing, even though energy per logic operation is still dropping at leading-edge nodes), which means an increasing fraction of all silicon has to be "dark": power-gated and only used for the rare accelerated workload. Additionally, recent nodes have seen very little improvement in SRAM cell size, which is what register files and caches are built from. So perhaps we'll be seeing relatively smaller caches per core in the future, plus the addition of eDRAM (either on-die or on a separate chiplet) as a new, slower L4 level to partially cope with that.
What if it went the other way and you got much larger die area dedicated to caches or even on-chip RAM, since that usage is relatively cheaper from a power/heat point of view? Or is the process different enough between the two that it just doesn't make sense to have them interwoven like that?
The point of SRAM, especially at the L1/L2 level, is extremely high bandwidth and extremely low latency (a few clock cycles). So it is not really an option to put it somewhere else, although L3 and, as mentioned, other lower levels can be and already are being put either on separate chiplets in the same package with an extremely fast ring, or directly on top of the die (3D stacking).
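To get a feel for why those caches have to stay small and close, here is a rough pointer-chasing sketch in C (a common way to eyeball load-to-use latency; the buffer sizes and timing code are just illustrative, and the exact numbers will vary a lot by machine):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Simple xorshift64 PRNG so we don't depend on RAND_MAX being large. */
    static uint64_t rng = 88172645463325252ULL;
    static uint64_t rnd(void) {
        rng ^= rng << 13; rng ^= rng >> 7; rng ^= rng << 17;
        return rng;
    }

    /* Walk a random cycle through a buffer of the given size. While the
       buffer fits in L1/L2, every dependent load hits nearby SRAM in a few
       cycles; once it spills to DRAM, each load costs on the order of 100 ns. */
    static double ns_per_load(size_t bytes, size_t steps) {
        size_t n = bytes / sizeof(size_t);
        size_t *next = malloc(n * sizeof(size_t));
        if (!next) return -1.0;
        for (size_t i = 0; i < n; i++) next[i] = i;
        for (size_t i = n - 1; i > 0; i--) {      /* Sattolo's shuffle: one big cycle */
            size_t j = rnd() % i;
            size_t t = next[i]; next[i] = next[j]; next[j] = t;
        }
        struct timespec t0, t1;
        size_t p = 0;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t s = 0; s < steps; s++) p = next[p];   /* serially dependent loads */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        volatile size_t sink = p; (void)sink;             /* keep the loop alive */
        free(next);
        return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / (double)steps;
    }

    int main(void) {
        printf("32 KiB buffer:  %.1f ns/load\n", ns_per_load(32u << 10, 10000000));
        printf("256 MiB buffer: %.1f ns/load\n", ns_per_load(256u << 20, 10000000));
        return 0;
    }

The gap between those two numbers is exactly the latency that L1/L2 exist to hide, and it's why they can't simply be moved further away or made arbitrarily big.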
Would it be possible to use big, fat CPU registers instead of cache? Then there would be no wasted clock cycles and no delay.
A compiler AND processor design amateur here. (Latter in school.)
Once you have enough registers, adding more means either lower active utilization per instruction (a bad use of space compared to fast, pipelined access to the cached stack) or higher levels of parallel instruction dispatch (much greater complexity, and even bigger penalties on branch mispredictions).
Then you have to update the instruction set, which may be impractical given how tightly register fields are packed into current instruction encodings.
Ergo, growing the register banks is a major architecture and platform change, requiring redesign from hardware through software, with heavy end-user impact and a fair chance of actually decreasing performance.
In contrast, anything that improves caching performance is a big non-disruptive win.
What if you use register windows, or special renaming of architectural registers to internal ones? https://en.wikipedia.org/wiki/Register_window
The stack held in the L1 cache is essentially that: a shifting, fast-access working memory area.
Then think of registers as just part of the fetch & store pipelines for staging operations on stack values.
Forth goes all in with this approach.
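To illustrate that point (hypothetical example; exact codegen depends on the compiler and target): when a C function keeps more values live than the ISA has registers, the compiler spills the extras into the stack frame, and those spill slots are L1-resident almost by definition, so they already behave like slightly slower registers.

    /* 18 accumulators kept live across the loop, more than the ~15 usable
       general-purpose registers on x86-64. A typical compiler keeps the
       hottest values in registers and spills the rest to the stack frame;
       those slots are hit every iteration and stay in L1, which is the
       "fast, pipelined access to the cached stack" trade-off described above. */
    long spill_demo(const long *p, long n) {
        long a0=0,a1=0,a2=0,a3=0,a4=0,a5=0,a6=0,a7=0,a8=0;
        long a9=0,a10=0,a11=0,a12=0,a13=0,a14=0,a15=0,a16=0,a17=0;
        for (long i = 0; i + 18 <= n; i += 18) {
            a0+=p[i];     a1+=p[i+1];   a2+=p[i+2];   a3+=p[i+3];   a4+=p[i+4];
            a5+=p[i+5];   a6+=p[i+6];   a7+=p[i+7];   a8+=p[i+8];   a9+=p[i+9];
            a10+=p[i+10]; a11+=p[i+11]; a12+=p[i+12]; a13+=p[i+13];
            a14+=p[i+14]; a15+=p[i+15]; a16+=p[i+16]; a17+=p[i+17];
        }
        return a0+a1+a2+a3+a4+a5+a6+a7+a8+a9+a10+a11
             + a12+a13+a14+a15+a16+a17;
    }

Adding more architectural registers would shave off some of those spills, but since the spill slots already sit a few cycles away in L1, the win is small relative to the cost of a re-architected ISA.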
Registers are quite expensive in area and power, because several of them have to be accessible at once from many places (read and write ports).
If you add more registers, the cost per register increases rapidly, and you very quickly hit your limits.
If you make registers wider, that's still very expensive, and you introduce extra steps to get to your data most of the time.
So no, you can't do that in a reasonable way.
Thank you!
CPU registers are built from either SRAM or even larger flip-flops; they have the same problem.
Yeah. The analogy for cache that I like to use is a table at the library. If you think about doing research (the old fashioned way) by looking through a library shelf by shelf and bringing books to your table to read through more closely. If you have a bigger table you can store more books which can speed your lookup times since you don’t need to get up and go back and forth to the shelves.
But at some point making your table larger just defeats the purpose of the library itself. Your table becomes the new library, and you have to walk around on it and look up things in these piles of books. So you make a smaller table in the middle of the big table.
Your fundamental limitation is how small you can make a memory cell, not how big you want to make a cache. That’s akin to printing the books in a smaller type size so you can fit more on the same size table.
well, sorta, since caches are just sram+tag logic. you can parallelize tables, so that each remains fast, but it costs you power/heat. the decoder inherent to sram is what introduces the size-speed tradeoff.
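To make "sram + tag logic" concrete, here is roughly what a lookup does in a toy direct-mapped cache (not any real design; 32 KiB with 64-byte lines is just a common-looking configuration):

    #include <stdbool.h>
    #include <stdint.h>

    /* Toy direct-mapped cache: 32 KiB capacity, 64-byte lines, 512 sets.
       The index bits drive the SRAM row decoder (the bigger the array, the
       slower that decode and wordline get), and the stored tag is compared
       against the address tag to decide hit or miss. */
    #define LINE_BYTES  64
    #define NUM_LINES   (32 * 1024 / LINE_BYTES)

    typedef struct {
        bool     valid;
        uint64_t tag;
        uint8_t  data[LINE_BYTES];
    } cache_line_t;

    static cache_line_t cache[NUM_LINES];

    bool cache_lookup(uint64_t addr, uint8_t *out_byte) {
        uint64_t offset = addr % LINE_BYTES;                /* byte within the line */
        uint64_t index  = (addr / LINE_BYTES) % NUM_LINES;  /* which SRAM row       */
        uint64_t tag    = addr / (LINE_BYTES * NUM_LINES);  /* the rest of the addr */

        cache_line_t *line = &cache[index];
        if (line->valid && line->tag == tag) {              /* the "tag logic" part */
            *out_byte = line->data[offset];
            return true;                                    /* hit */
        }
        return false;                                       /* miss: go to the next level */
    }

Making the cache bigger means more index bits and a physically larger array to decode into, which is where the size-speed tradeoff comes from; adding parallel tables (ways) keeps each array small but multiplies the tag compares and the power.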
I was ignoring the details on how SRAM works in favour of thinking about it physically. Most of those details just affect the average cell size at the end of the day.
The other physical aspect we’re dealing with is propagation delay and physical distance. That’s where the library analogy really shines: if there’s a minimum size to a book and a minimum size of you (the person doing the research) this corresponds roughly to minimum cell sizes and minimum wire pitch, so you’re ultimately limited in the density you can fit within a given volume.
Really good analogy!
The caches are already ~75% of the die area; you can't significantly increase that. On-die RAM is also relatively unlikely due to process differences. My best guess is more 3D cache chips. If we can get the interconnects small enough and fast enough, I could see a future where the logic is stacked on top of a dozen (physical) layers of cache.
Stacking is a heat problem, and heat has been the PRIMARY system limit for over a decade.
2.5D is just too easy and effective - we're going to have lots more chiplets, and only the cool ones will get stacked.
AMD's stacked cache is a significant capacity increase and gives a huge boost in certain gaming scenarios, to the point of a ~100% uplift in certain games that rely on huge caches.
I wonder if we’ll see compressed data transmission at some point.
fast compression is way too slow.
remember, we're talking TB/s these days.
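Rough numbers for scale (typical figures, not measurements): a fast software decompressor like LZ4 manages a few GB/s per core, so even a few dozen cores doing nothing but decompression land in the hundreds of GB/s, while on-package cache and stacked-cache links are quoted at 1 TB/s and up. Anything like this would have to live in dedicated hardware right next to the memory path to avoid becoming the bottleneck.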
Could be useful for sparse data structures.
Good question - but it would have to be the kind that reduces latency, not just the kind that saves bandwidth. Maybe there is a way to achieve that.
I'm ignorant of this space, but it seems like the obvious solution for heat dissipation is to layer lattices and not solid layers, in order to increase the overall surface area of the chip. I assume the manufacturing is too difficult...?
That's one of the promises of 3D stacked transistors, yes.
Limitations in existing processes, sure. But not limitations in physics. If E=mc^2, we've got a lot of efficiencies still to find.
Fusion-based computing FTW!
That is just mainstream reporting.
If one actually went and read the paper being referred to, or looked at the context, it was always the same thing: it was all about the economics, all the way back to the early 90s. We can't do node X because it would be too expensive to sustain a new node every two years.
The smartphone era (meaning post-iPhone launch) essentially meant we ship an additional ~2 billion pocket computers every year, tablets included. That is 5x the most optimistic projection for the traditional PC model of 400M units / year (which we never reached). And that is ignoring the server market, networking market, GPU market, AI market, etc. In terms of transistors, and in revenue or profits, the whole TAM (Total Addressable Market) went up at least 10x more than those projections. Which is essentially what scaled us from 22nm to today's 3nm, and all the way to 2nm and 1.4nm, and underpins my projection of 1nm by 2030 as well. I even wrote on HN around 2015 that I had a hard time seeing how we could sustain this past 3nm, at a time when a trillion-dollar company was thought to be impossible.
On the other side of things, the cost projection for the next node (e.g. 2nm) and the next-next node (e.g. 1.4nm) has always turned out higher than reality. As with any large project, it is better to ask for and project more in case shit hits the fan (Intel 10nm). But every time, TSMC has executed very well.
So as you can see, there is a projection mismatch at both ends, which is why the "clear signs" of progress coming to an end keep being wrong.
I just want to note that this figure keeps being thrown around. It was Jim Keller comparing, at the time, Intel 14nm (which is somewhere close to TSMC N10) to a hypothetical physics limit. At 3nm we are at least 4x past that. Depending on how you want to measure it, we could be down to less than 100x of headroom by 2030.
The AI trend could carry us forward to maybe 2035. But we don't have another product category like the iPhone. Servers at hyperscalers are already at a scale where growth is slowing. We will again need to substantially lower the development cost of leading nodes (my bet is on the AI / software side) and find some product that continues to grow the TAM. Maybe autonomous vehicles will finally be a thing by the 2030s? (I doubt it, but just throwing out some ideas.)
TSMC or ASML? Or both? I am not trying to be dismissive, just curious about who deserves the credit here.
TSMC, otherwise Intel and Samsung would not be chasing TSMC.
Arguably, the current race is down to TSMC making the right decision on high-NA EUV (i.e., to stick with low-NA). It's not as if Intel couldn't have acquired EUV; they just chose not to.
It's predominantly TSMC. I do get quite tired, and sometimes annoyed, when 99.99999% of the internet, HN included, states that it is just "ASML". As if buying those TwinScan machines were enough to make you the world's best leading-edge foundry.
Remember,
1. TSMC didn't beat Intel because they had the newer EUV first. They beat Intel before the whole thing started.
2. If having EUV machines from ASML were enough, Samsung would be a solid second, given they use or will use EUV for NAND and DRAM. And yet they are barely competing.
3. It is not as if GlobalFoundries dropped out of the leading-edge race for no reason.
4. TSMC has always managed to work around any roadblock when ASML failed to deliver on its promises on time.
5. Quoting the CEO of ASML, half joking but also half true: "Don't ask us how those EUV machines are doing. Ask TSMC; they know that thing better than we do."
Of course, there is a large number of smaller companies around the whole pure-play foundry business, which TSMC's ex-CEO calls the Grand Alliance. You need every party to perform well for it to happen. This is somewhat different from Samsung and Intel, both of which are (more or less) much more vertically integrated.
It’s a massive supply chain, so, yes, both. But also a hundred other companies. TSMC and other foundries bring together many technologies from many companies (and no doubt a lot of their own) to ship a full foundry solution (design-technology-cooptimization, masks, lithography, packaging, etc).
However, there is a big difference between those "~2 billion pocket computers every year, tablets included" and regular computers, so to speak.
They are mostly programmed in managed languages, where the respective runtimes and the OS collaborate to distribute the computing across all available cores in the best way possible, with little intervention required on the developer's side.
Additionally, the OS frameworks and language runtimes collaborate to take advantage of each specific set of CPU capabilities in an almost transparent way.
Quite different from regular POSIX and Win32 applications coded in C and C++, where everything needs to be explicitly taken care of, which is kind of what prevents most of the cool CPU features from taking off; they sit there idle most of the time.
I was under the impression that distributing workloads across many CPU cores (or HW threads) is done at the process and thread level by the OS? That gives managed and unmanaged languages the same benefits.
Managed languages provide higher-level primitives that make it easier to create a multi-threaded application. But isn't that still manually coded in the mainstream managed languages?
I'm thinking of inherently CPU-intensive custom workloads. UI rendering and IO operations become automatically distributed with little intervention.
Or am I missing something, where there is "little intervention required from the developers side" to create multi-threaded apps?
You are missing the part where ART, the Swift/Objective-C runtime, and things like Grand Central Dispatch also take part in the decision process.
So the schedulers can decide in a more transparent way what runs where, especially on the Android side, where the on-device JIT/AOT compilers are part of the loop.
Additionally, there is more effort on having the toolchains exploit SIMD capabilities, whereas at the C and C++ level one is expected to write that code explicitly.
Yes, auto-vectorization isn't as good as writing the code explicitly, but the latter means that only a niche set of developers actually cares to write any of it.
Hence why frameworks like Accelerate exist: even if a JIT isn't part of the picture, the framework takes the best path depending on the available hardware.
Likewise, higher-level managed frameworks offer a better distribution of parallel processing across CPU, GPU or NPU, which again, on classical UNIX/Win32 in C and C++, has to be explicitly programmed for.
Such higher-level frameworks can of course also be provided in those languages, e.g. CUDA and SYCL, however then we start discussing the programmer culture needed to adopt that kind of tooling in classical LOB applications.
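For a concrete sense of the gap being described, a small C sketch (AVX chosen purely as an illustration; a real build would need feature detection and fallbacks, and whether the first version gets auto-vectorized depends on compiler, flags and aliasing):

    #include <stddef.h>
    #include <immintrin.h>   /* x86 AVX intrinsics, illustration only */

    /* Plain C: the compiler may or may not auto-vectorize this. */
    void saxpy_scalar(float *y, const float *x, float a, size_t n) {
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    /* Explicit SIMD: 8 floats per iteration with AVX. This is the kind of
       per-instruction-set code that has to be written and maintained by
       hand in C/C++, which is the "niche set of developers" point above. */
    void saxpy_avx(float *y, const float *x, float a, size_t n) {
        __m256 va = _mm256_set1_ps(a);
        size_t i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256 vx = _mm256_loadu_ps(x + i);
            __m256 vy = _mm256_loadu_ps(y + i);
            _mm256_storeu_ps(y + i, _mm256_add_ps(_mm256_mul_ps(va, vx), vy));
        }
        for (; i < n; i++)           /* scalar tail */
            y[i] = a * x[i] + y[i];
    }

A managed runtime or a framework like Accelerate hides that second version behind a call that picks the best available path at run time, which is the transparency being argued for.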
I don't know these, but from a quick googling it still looks like explicit multi-threading? Albeit with higher level primitives than in older languages, but still explicit?
I'm not sure I see a hard dividing line between older languages and managed ones as far as auto-vectorization? Sure, a higher-level language might make it easier for the compiler since it knows more about potential side effects, but simple and local C code doesn't have any side effects either.
Accelerate looks nice, but it still looks like it has to be called explicitly in the user code?
I'm not sure I understand, can you give more explicit examples?
My point here isn't that managed languages don't give big benefits over C. I prefer Python and C# when those can be used.
It's more that I don't see "automatic parallel processing" as a solved problem?
Sure, we get better and better primitives for multi-threading, and there are more and more high-level parallel libraries like you mentioned. But for most cases, the programmer still has to explicitly design the application to take advantage of multiple cores.
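To make "still explicit, just with better primitives" concrete, here is what that looks like with OpenMP from C (picked only as a stand-in for GCD's concurrentPerform, Java streams, and the like; the decomposition is still the programmer's job, the runtime only handles spreading it across whatever cores exist):

    #include <stdio.h>
    #include <omp.h>

    /* The programmer has to decide that this loop is parallel, that its
       iterations are independent, and that `sum` is a reduction. Given
       that, the runtime schedules the chunks across all available cores. */
    int main(void) {
        const long n = 100000000;
        double sum = 0.0;

        #pragma omp parallel for reduction(+:sum)
        for (long i = 1; i <= n; i++)
            sum += 1.0 / (double)i;

        printf("partial harmonic sum = %f (up to %d threads)\n",
               sum, omp_get_max_threads());
        return 0;
    }

So yes, the parallelism is still opted into explicitly; the "little intervention" part is mostly about the runtime handling core counts and scheduling without the application caring.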
I remember reading around the 300nm transition that Moore’s law was all over because of wavelengths and physics. No one was talking about multiple patterning, probably because it was prohibitively expensive. Inconceivable, much like trillion-dollar companies in the early 2000s.
I remember a quote from von Braun about how he had learned to use the word 'impossible' with the greatest caution.
When you have a significant fraction of the GDP of a superpower dedicated to achieving some crazy engineering task, it almost certainly can be done. And I wouldn’t bet against our hunger for better chips.
Totally agree.
There will be fancier iPhones with on-board, offline large language models and other foundation models to talk to, solving all kinds of tasks for you that would require a human assistant today.
As Jim Keller himself famously put it, Moore's law is still fine. Furthermore, the number of people predicting the end of Moore's law doubles every 18 months, thus following Moore's law itself.
It is fun to watch and keep track of - keeping in mind it has also been an insane amount of work by an insane number of people, with an insane amount of budget thrown at the problems. You can do quite a bit in software "as a hobby"; this field is not that.
Aren't Intel, TSMC and Samsung all customers (and investors) of ASML, which is actually the manufacturer and developer of the EUV (extreme ultraviolet) machines this refers to? Basically, at most they might have a slight exclusivity deal, but given the ownership structure you can imagine that this will not really matter in the long run. With the willingness to spend the money on new nodes, they will have the technology too.