It’s been fun being just a curious bystander in this industry for many years.
Every now and then Moore’s law hits a roadblock. Some experts see that as a clear sign that it’s reaching its end. Others say it’s already dead, because the price per transistor has actually increased. Others say it’s physics: we can approach Y, but past X nm it can’t be done.
Then you read others who claim that Intel has just been lazily enjoying its near-monopoly for the past decade and was caught off guard by TSMC’s extreme-ultraviolet prowess. Or people who really know how the sausage is made, like Jim Keller, enthusiastically stating that we are nowhere near any major fundamental limitation and can expect at least a 1000x improvement in the years to come.
Anyway, it’s really fun to watch, like I said. Hard to think of another field with such rollercoaster-like forecasting that has still delivered steady, unparalleled growth for decades.
The limitations are very real. Dennard scaling has been dead since the mid-2000s (that is, power use per unit area has been increasing, even though energy per logic operation is still dropping at leading-edge nodes), which means an increasing fraction of all silicon has to be "dark": power-gated and only used for the rare accelerated workload. Additionally, recent nodes have seen very little improvement in SRAM cell size, which is what register files and caches are built from. So perhaps we'll be seeing relatively smaller caches per core in the future, plus the addition of eDRAM (either on-die or on a separate chiplet) as a new, slower L4 level to partially cope with that.
What if it went the other way and you got much larger die area dedicated to caches or even on-chip RAM, since that usage is relatively cheaper from a power/heat point of view? Or is the process different enough between the two that it just doesn't make sense to have them interwoven like that?
The point of SRAM, especially at the L1/L2 level, is extremely high bandwidth and extremely low latency (a few clock cycles). So it is not really an option to put it somewhere else, although L3 and, as mentioned, other lower levels can be and already are being put either on separate chiplets in the same package with an extremely fast ring, or directly on top of the die (3D stacking).
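To get a feel for why those caches have to stay small and close, here is a rough pointer-chasing sketch in C (a common way to eyeball load-to-use latency; the buffer sizes and timing code are just illustrative, and the exact numbers will vary a lot by machine):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Simple xorshift64 PRNG so we don't depend on RAND_MAX being large. */
    static uint64_t rng = 88172645463325252ULL;
    static uint64_t rnd(void) {
        rng ^= rng << 13; rng ^= rng >> 7; rng ^= rng << 17;
        return rng;
    }

    /* Walk a random cycle through a buffer of the given size. While the
       buffer fits in L1/L2, every dependent load hits nearby SRAM in a few
       cycles; once it spills to DRAM, each load costs on the order of 100 ns. */
    static double ns_per_load(size_t bytes, size_t steps) {
        size_t n = bytes / sizeof(size_t);
        size_t *next = malloc(n * sizeof(size_t));
        if (!next) return -1.0;
        for (size_t i = 0; i < n; i++) next[i] = i;
        for (size_t i = n - 1; i > 0; i--) {      /* Sattolo's shuffle: one big cycle */
            size_t j = rnd() % i;
            size_t t = next[i]; next[i] = next[j]; next[j] = t;
        }
        struct timespec t0, t1;
        size_t p = 0;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t s = 0; s < steps; s++) p = next[p];   /* serially dependent loads */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        volatile size_t sink = p; (void)sink;             /* keep the loop alive */
        free(next);
        return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / (double)steps;
    }

    int main(void) {
        printf("32 KiB buffer:  %.1f ns/load\n", ns_per_load(32u << 10, 10000000));
        printf("256 MiB buffer: %.1f ns/load\n", ns_per_load(256u << 20, 10000000));
        return 0;
    }

The gap between those two numbers is exactly the latency that L1/L2 exist to hide, and it's why they can't simply be moved further away or made arbitrarily big.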
Would it be possible to use big, fat CPU registers instead of cache? Then there would be no wasted clock cycles and no delay.
A compiler AND processor design amateur here. (Latter in school.)
Once you have enough registers, adding more means either lower active utilization per instruction (a bad use of space compared to fast, pipelined access to the cached stack) or higher levels of parallel instruction dispatch (much greater complexity, and even bigger penalties on branch mispredictions).
Then you have to update the instruction set, which may be impractical given how tightly register fields are packed into current instruction encodings.
Ergo, growing the register banks is a major architecture and platform change, requiring redesign from hardware through software, with heavy end-user impact and a fair chance of actually decreasing performance.
In contrast, anything that improves caching performance is a big non-disruptive win.
What if you use register windows, or special renaming of architectural registers to internal ones? https://en.wikipedia.org/wiki/Register_window
The stack held in the L1 cache is essentially that: a shifting, fast-access working memory area.
Then think of registers as just part of the fetch & store pipelines for staging operations on stack values.
Forth goes all in with this approach.
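To illustrate that point (hypothetical example; exact codegen depends on the compiler and target): when a C function keeps more values live than the ISA has registers, the compiler spills the extras into the stack frame, and those spill slots are L1-resident almost by definition, so they already behave like slightly slower registers.

    /* 18 accumulators kept live across the loop, more than the ~15 usable
       general-purpose registers on x86-64. A typical compiler keeps the
       hottest values in registers and spills the rest to the stack frame;
       those slots are hit every iteration and stay in L1, which is the
       "fast, pipelined access to the cached stack" trade-off described above. */
    long spill_demo(const long *p, long n) {
        long a0=0,a1=0,a2=0,a3=0,a4=0,a5=0,a6=0,a7=0,a8=0;
        long a9=0,a10=0,a11=0,a12=0,a13=0,a14=0,a15=0,a16=0,a17=0;
        for (long i = 0; i + 18 <= n; i += 18) {
            a0+=p[i];     a1+=p[i+1];   a2+=p[i+2];   a3+=p[i+3];   a4+=p[i+4];
            a5+=p[i+5];   a6+=p[i+6];   a7+=p[i+7];   a8+=p[i+8];   a9+=p[i+9];
            a10+=p[i+10]; a11+=p[i+11]; a12+=p[i+12]; a13+=p[i+13];
            a14+=p[i+14]; a15+=p[i+15]; a16+=p[i+16]; a17+=p[i+17];
        }
        return a0+a1+a2+a3+a4+a5+a6+a7+a8+a9+a10+a11
             + a12+a13+a14+a15+a16+a17;
    }

Adding more architectural registers would shave off some of those spills, but since the spill slots already sit a few cycles away in L1, the win is small relative to the cost of a re-architected ISA.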
Registers are quite expensive in area and power, because several of them have to be accessible at once from many places (read and write ports).
If you add more registers, the cost per register increases rapidly, and you very quickly hit your limits.
If you make registers wider, that's still very expensive, and you introduce extra steps to get to your data most of the time.
So no, you can't do that in a reasonable way.
Thank you!
CPU registers are built from either SRAM or even larger flip-flops; they have the same problem.
Yeah. The analogy for cache that I like to use is a table at the library. If you think about doing research (the old fashioned way) by looking through a library shelf by shelf and bringing books to your table to read through more closely. If you have a bigger table you can store more books which can speed your lookup times since you don’t need to get up and go back and forth to the shelves.
But at some point making your table larger just defeats the purpose of the library itself. Your table becomes the new library, and you have to walk around on it and look up things in these piles of books. So you make a smaller table in the middle of the big table.
Your fundamental limitation is how small you can make a memory cell, not how big you want to make a cache. That’s akin to printing the books in a smaller type size so you can fit more on the same size table.
well, sorta, since caches are just sram+tag logic. you can parallelize tables, so that each remains fast, but it costs you power/heat. the decoder inherent to sram is what introduces the size-speed tradeoff.
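To make "sram + tag logic" concrete, here is roughly what a lookup does in a toy direct-mapped cache (not any real design; 32 KiB with 64-byte lines is just a common-looking configuration):

    #include <stdbool.h>
    #include <stdint.h>

    /* Toy direct-mapped cache: 32 KiB capacity, 64-byte lines, 512 sets.
       The index bits drive the SRAM row decoder (the bigger the array, the
       slower that decode and wordline get), and the stored tag is compared
       against the address tag to decide hit or miss. */
    #define LINE_BYTES  64
    #define NUM_LINES   (32 * 1024 / LINE_BYTES)

    typedef struct {
        bool     valid;
        uint64_t tag;
        uint8_t  data[LINE_BYTES];
    } cache_line_t;

    static cache_line_t cache[NUM_LINES];

    bool cache_lookup(uint64_t addr, uint8_t *out_byte) {
        uint64_t offset = addr % LINE_BYTES;                /* byte within the line */
        uint64_t index  = (addr / LINE_BYTES) % NUM_LINES;  /* which SRAM row       */
        uint64_t tag    = addr / (LINE_BYTES * NUM_LINES);  /* the rest of the addr */

        cache_line_t *line = &cache[index];
        if (line->valid && line->tag == tag) {              /* the "tag logic" part */
            *out_byte = line->data[offset];
            return true;                                    /* hit */
        }
        return false;                                       /* miss: go to the next level */
    }

Making the cache bigger means more index bits and a physically larger array to decode into, which is where the size-speed tradeoff comes from; adding parallel tables (ways) keeps each array small but multiplies the tag compares and the power.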
I was ignoring the details on how SRAM works in favour of thinking about it physically. Most of those details just affect the average cell size at the end of the day.
The other physical aspect we’re dealing with is propagation delay and physical distance. That’s where the library analogy really shines: if there’s a minimum size to a book and a minimum size of you (the person doing the research) this corresponds roughly to minimum cell sizes and minimum wire pitch, so you’re ultimately limited in the density you can fit within a given volume.
Really good analogy!
The caches are already ~75% of the die area; you can't significantly increase that. On-die RAM is also relatively unlikely due to process differences. My best guess is more 3D cache chips. If we can get the interconnects small enough and fast enough, I could see a future where the logic is stacked on top of a dozen (physical) layers of cache.
Stacking is a heat problem, and heat has been the PRIMARY system limit for over a decade.
2.5D is just too easy and effective - we're going to have lots more chiplets, and only the cool ones will get stacked.
AMD's stacked cache is a significant capacity increase and gives a huge boost in certain gaming scenarios, to the point of a ~100% uplift in certain games that rely on huge caches.
I wonder if we’ll see compressed data transmission at some point.
fast compression is way too slow.
remember, we're talking TB/s these days.
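Rough numbers for scale (typical figures, not measurements): a fast software decompressor like LZ4 manages a few GB/s per core, so even a few dozen cores doing nothing but decompression land in the hundreds of GB/s, while on-package cache and stacked-cache links are quoted at 1 TB/s and up. Anything like this would have to live in dedicated hardware right next to the memory path to avoid becoming the bottleneck.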
Could be useful for sparse data structures.
Good question - but it would have to be the kind that reduces latency, not just the kind that saves bandwidth. Maybe there is a way to achieve that.
I'm ignorant of this space, but it seems like the obvious solution for heat dissipation is to layer lattices and not solid layers, in order to increase the overall surface area of the chip. I assume the manufacturing is too difficult...?
That's one of the promises of 3D stacked transistors, yes.
Limitations in existing processes, sure. But not limitations in physics. If E=mc^2, we've got a lot of efficiencies still to find.
Fusion-based computing FTW!
That is just mainstream reporting.
If one actually went and read the paper being referred to, or looked at the context, it was always the same thing: it was all about the economics, all the way back to the early 90s. We can't do node X because it would be too expensive to sustain a new node every two years.
The smartphone era (meaning post-iPhone launch) essentially meant we ship an additional ~2 billion pocket computers every year, tablets included. That is 5x the most optimistic projection for the traditional PC model of 400M units / year (which we never reached). And that is ignoring the server market, networking market, GPU market, AI market, etc. In terms of transistors, and in revenue or profits, the whole TAM (Total Addressable Market) went up at least 10x more than those projections. Which is essentially what scaled us from 22nm to today's 3nm, and all the way to 2nm and 1.4nm, and underpins my projection of 1nm by 2030 as well. I even wrote on HN around 2015 that I had a hard time seeing how we could sustain this past 3nm, at a time when a trillion-dollar company was thought to be impossible.
On the other side of things, the cost projection for the next node (e.g. 2nm) and the next-next node (e.g. 1.4nm) has always turned out higher than reality. As with any large project, it is better to ask for and project more in case shit hits the fan (Intel 10nm). But every time, TSMC has executed very well.
So as you can see, there is a projection mismatch at both ends, which is why the "clear signs" of progress coming to an end keep being wrong.
I just want to note that this figure keeps being thrown around. It was Jim Keller comparing, at the time, Intel 14nm (which is somewhere close to TSMC N10) to a hypothetical physics limit. At 3nm we are at least 4x past that. Depending on how you want to measure it, we could be down to less than 100x of headroom by 2030.
The AI trend could carry us forward to maybe 2035. But we don't have another product category like the iPhone. Servers at hyperscalers are already at a scale where growth is slowing. We will again need to substantially lower the development cost of leading nodes (my bet is on the AI / software side) and find some product that continues to grow the TAM. Maybe autonomous vehicles will finally be a thing by the 2030s? (I doubt it, but just throwing out some ideas.)
TSMC or ASML? Or both? I am not trying to be dismissive, just curious about who deserves the credit here.
TSMC, otherwise Intel and Samsung would not be chasing TSMC.
Arguably, the current race is down to TSMC making the right decision on high-NA EUV (i.e., to stick with low-NA). It's not as if Intel couldn't have acquired EUV; they just chose not to.
It's predominantly TSMC. I do get quite tired, and sometimes annoyed, when 99.99999% of the internet, HN included, states that it is just "ASML". As if buying those TwinScan machines were enough to make you the world's best leading-edge foundry.
Remember,
1. TSMC didn't beat Intel because they had the newer EUV first. They beat Intel before the whole thing started.
2. If having EUV machines from ASML were enough, Samsung would be a solid second, given they use or will use EUV for NAND and DRAM. And yet they are barely competing.
3. It is not as if GlobalFoundries dropped out of the leading-edge race for no reason.
4. TSMC has always managed to work around any roadblock when ASML failed to deliver on its promises on time.
5. Quoting the CEO of ASML, half joking but also half true: "Don't ask us how those EUV machines are doing. Ask TSMC; they know that thing better than we do."
Of course, there is a large number of smaller companies around the whole pure-play foundry business, which TSMC's ex-CEO calls the Grand Alliance. You need every party to perform well for it to happen. This is somewhat different from Samsung and Intel, both of which are (more or less) much more vertically integrated.
It’s a massive supply chain, so, yes, both. But also a hundred other companies. TSMC and other foundries bring together many technologies from many companies (and no doubt a lot of their own) to ship a full foundry solution (design-technology-cooptimization, masks, lithography, packaging, etc).
However, there is a big difference between those "~2 billion pocket computers every year, tablets included" and regular computers, so to speak.
They are mostly programmed in managed languages, where the respective runtimes and the OS collaborate to distribute the computing across all available cores in the best way possible, with little intervention required on the developer's side.
Additionally, the OS frameworks and language runtimes collaborate to take advantage of each specific set of CPU capabilities in an almost transparent way.
Quite different from regular POSIX and Win32 applications coded in C and C++, where everything needs to be explicitly taken care of, which is kind of what prevents most of the cool CPU features from taking off; they sit there idle most of the time.
I was under the impression that distributing workloads across many CPU cores (or HW threads) is done at the process and thread level by the OS? That gives managed and unmanaged languages the same benefits.
Managed languages provide higher-level primitives that make it easier to create a multi-threaded application. But isn't that still manually coded in the mainstream managed languages?
I'm thinking of inherently CPU-intensive custom workloads. UI rendering and IO operations become automatically distributed with little intervention.
Or am I missing something, where there is "little intervention required from the developers side" to create multi-threaded apps?
You are missing the part where ART, the Swift/Objective-C runtime, and things like Grand Central Dispatch also take part in the decision process.
So the schedulers can decide in a more transparent way what runs where, especially on the Android side, where the on-device JIT/AOT compilers are part of the loop.
Additionally, there is more effort on having the toolchains exploit SIMD capabilities, whereas at the C and C++ level one is expected to write that code explicitly.
Yes, auto-vectorization isn't as good as writing the code explicitly, but the latter means that only a niche set of developers actually cares to write any of it.
Hence why frameworks like Accelerate exist: even if a JIT isn't part of the picture, the framework takes the best path depending on the available hardware.
Likewise, higher-level managed frameworks offer a better distribution of parallel processing across CPU, GPU or NPU, which again, on classical UNIX/Win32 in C and C++, has to be explicitly programmed for.
Such higher-level frameworks can of course also be provided in those languages, e.g. CUDA and SYCL, however then we start discussing the programmer culture needed to adopt that kind of tooling in classical LOB applications.
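For a concrete sense of the gap being described, a small C sketch (AVX chosen purely as an illustration; a real build would need feature detection and fallbacks, and whether the first version gets auto-vectorized depends on compiler, flags and aliasing):

    #include <stddef.h>
    #include <immintrin.h>   /* x86 AVX intrinsics, illustration only */

    /* Plain C: the compiler may or may not auto-vectorize this. */
    void saxpy_scalar(float *y, const float *x, float a, size_t n) {
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    /* Explicit SIMD: 8 floats per iteration with AVX. This is the kind of
       per-instruction-set code that has to be written and maintained by
       hand in C/C++, which is the "niche set of developers" point above. */
    void saxpy_avx(float *y, const float *x, float a, size_t n) {
        __m256 va = _mm256_set1_ps(a);
        size_t i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256 vx = _mm256_loadu_ps(x + i);
            __m256 vy = _mm256_loadu_ps(y + i);
            _mm256_storeu_ps(y + i, _mm256_add_ps(_mm256_mul_ps(va, vx), vy));
        }
        for (; i < n; i++)           /* scalar tail */
            y[i] = a * x[i] + y[i];
    }

A managed runtime or a framework like Accelerate hides that second version behind a call that picks the best available path at run time, which is the transparency being argued for.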
I don't know these, but from a quick googling it still looks like explicit multi-threading? Albeit with higher level primitives than in older languages, but still explicit?
I'm not sure I see a hard dividing line between older languages and managed ones as far as auto-vectorization? Sure, a higher-level language might make it easier for the compiler since it knows more about potential side effects, but simple and local C code doesn't have any side effects either.
Accelerate looks nice, but it still looks like it has to be called explicitly in the user code?
I'm not sure I understand, can you give more explicit examples?
My point here isn't that managed languages don't give big benefits over C. I prefer Python and C# when those can be used.
It's more that I don't see "automatic parallel processing" as a solved problem?
Sure, we get better and better primitives for multi-threading, and there are more and more high-level parallel libraries like you mentioned. But for most cases, the programmer still has to explicitly design the application to take advantage of multiple cores.
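To make "still explicit, just with better primitives" concrete, here is what that looks like with OpenMP from C (picked only as a stand-in for GCD's concurrentPerform, Java streams, and the like; the decomposition is still the programmer's job, the runtime only handles spreading it across whatever cores exist):

    #include <stdio.h>
    #include <omp.h>

    /* The programmer has to decide that this loop is parallel, that its
       iterations are independent, and that `sum` is a reduction. Given
       that, the runtime schedules the chunks across all available cores. */
    int main(void) {
        const long n = 100000000;
        double sum = 0.0;

        #pragma omp parallel for reduction(+:sum)
        for (long i = 1; i <= n; i++)
            sum += 1.0 / (double)i;

        printf("partial harmonic sum = %f (up to %d threads)\n",
               sum, omp_get_max_threads());
        return 0;
    }

So yes, the parallelism is still opted into explicitly; the "little intervention" part is mostly about the runtime handling core counts and scheduling without the application caring.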
I remember reading around the 300nm transition that Moore’s law was all over because of wavelengths and physics. No one was talking about multiple patterning, probably because it was prohibitively expensive. Inconceivable, much like trillion-dollar companies in the early 2000s.
I remember a quote from von Braun about how he had learned to use the word 'impossible' with the greatest caution.
When you have a significant fraction of the GDP of a superpower dedicated to achieving some crazy engineering task, it almost certainly can be done. And I wouldn’t bet against our hunger for better chips.
Totally agree.
There will be fancier iPhones with on-board, offline large language models and other foundation models to talk to, solving all kinds of tasks for you that would require a human assistant today.
As Jim Keller himself famously put it, Moore's law is still fine. Furthermore, the number of people predicting the end of Moore's law doubles every 18 months, thus following Moore's law itself.
It is fun to watch and keep track of - keeping in mind it has also been an insane amount of work by an insane number of people, with an insane amount of budget thrown at the problems. You can do quite a bit in software "as a hobby"; this field is not that.
Aren't Intel, TSMC and Samsung all customers (and investors) of ASML, which is actually the manufacturer and developer of the EUV (extreme ultraviolet) machines this refers to? Basically, at most they might have a slight exclusivity deal, but given the ownership structure you can imagine that this will not really matter in the long run. With the willingness to spend the money on new nodes, they will have the technology too.