HN comments for: Intel Processor Instability Causing Oodle Decompression Failures

Ochi

56 replies

1d6h

2024-02-23 14:08:37 UTC

So ideally, we should disable hyper threading to mitigate security issues and now also disable turbo mode to mitigate memory corruption issues. Maybe we should also disable C states to avoid side-channel attacks and disable efficiency cores to avoid scheduler issues... and at some point we are back to a feature set from 20+ years ago. :P

secondcoming

34 replies

1d5h

2024-02-23 15:00:46 UTC

Or just disable overclocking.

ajross

24 replies

1d5h

2024-02-23 15:07:34 UTC

Why is this downvoted? That's exactly what's happening here. The affected devices are being overclocked, and the instructions at the end of the linked support document detail how to find the correct limits for your CPU and set them in your BIOS.

sandworm101

17 replies

1d5h

2024-02-23 15:30:01 UTC

I think it is because overclocking has become so normal that it is an expected feature on most chips. Being told to disable it is like being told to disable the supercharger on your new Ferrari: you are no longer getting what you thought you had paid for.

ajross

7 replies

1d3h

2024-02-23 17:02:02 UTC

Ferrari for sure warrants their cars as sold. But if you take it to a mod shop and put in an aftermarket turbo that damages your valves, you don't go whining to HN with an article with "Ferrari Engine Instability" in the title, do you?

I don't know what you want Intel to do here. They tell you upfront what the power and clock limits are on the parts. But the market has a three decade history of people pushing the chips a little past their limit for fun and profit, so they "allow" it even if they know it won't work for everything.

rygorous

3 replies

22h16m

2024-02-23 22:25:11 UTC

(Oodle maintainer here.) This issue only occurs on some small fraction of machines, but on those that we've had access to, it reproduces with BIOS defaults and no user-specified overclocking. It turns out several of these mainboards will overclock and set other values out of spec even at BIOS defaults.

I don't have a problem with end users experiencing instability once they manually overclock (that's how it goes), but CPU + mainboard combinations experiencing typical OC symptoms with out-of-the-box settings is just not OK.

This appears to be an arms race between mainboard vendors all going further and further past spec by default because it gives better benchmark and review scores and their competition does it. Intel for their part are themselves also dialing in their parts more aggressively (and, presumably although I don't know for sure, with smaller margins) over time, and they are for sure aware that this is happening, because a) even had they not known already (which they did) they would have learned about this months ago when we first contacted them about this issue, b) technically out of spec or not, as long as it seems to work fine for users and makes their parts look better in reviews, they're not going to complain.

However, it turns out, it does not work fine for at least some small fraction of machines. I have no idea what that percentage is, but it's high enough that googling for say "Intel 13900K crash" yields plenty of relevant results. Some of this will be actual intentional overclockers but, given how boards default to some extend of out-of-spec overclocking enabled, it's unlikely to be all of them.

Meanwhile we (and other SW vendors) are getting a noticeable uptick in crash reports on, specifically, recent K-series Intel CPUs, and it's not something we can sanely work around because the issue manifests as code randomly misbehaving and it's not even when doing anything fancy. The Oodle issue in particular is during LZ77-family decompression, which is to say, all integer arithmetic (not even multiplies, just adds, shifts and logic ops), loads/stores and branches. This is the bare essentials. If it was an issue with say AVX2, we could avoid AVX2 code paths on that family of machines (and preferably figure out what exactly is going wrong so we can come up with a more targeted workaround). But there is no sane plan B for "integer ALU ops, load/stores and branches don't work reliably under load". If we can't rely on that working, there is not enough left for us to work around bugs with!

I realize this all looks like finger-pointing, but this is truly beyond our capacity to work around in a sane way in SW, with what we know so far anyway. Maybe there is a much more specific trigger involved that we could avoid, but if so, we haven't found it yet.

Either way, when it's easy to find end user machines that are crashing at stock settings, things have gone too far and Intel needs to sit down with their HW partners and get everyone (themselves included) to de-escalate.

ajross

1 replies

21h39m

2024-02-23 23:02:57 UTC

I realize this all looks like finger-pointing

You think that might have something to do with you having put "Intel Processor Instability" in the title of a whitepaper on an issue that you already root caused to motherboard settings? I mean, did you want to troll a big flame war? Because this is how you troll a big flame war.

rygorous

0 replies

20h27m

2024-02-24 00:14:25 UTC

It's an issue that, while the mainboard is involved, happens to occur on (at least) the 3 best-selling mainboard vendors compatible with that family of CPUs, at stock settings, so that you can take an affected CPU, swap it through a selection of the most popular mainboards compatible with said CPU and see the same kind of instability problems.

I don't think it's unreasonable to call that Intel's problem, maybe not in terms of culpability (but truly, nobody cares) but definitely in the sense this is doing damage to their brand. If the mainboards are all out of spec then they need to talk about this publicly, rein them in, start a certification program, whatever. Being publicly completely fine with this as long as it results in good review scores but then starting to go "well actually..." when there's stability issues on a small fraction of sold units is not a good look.

mips_r4300i

0 replies

13h55m

2024-02-24 06:46:50 UTC

As a data point, I just built a new dev machine with a 14900K on a new ASUS board.

Out of the box with default settings, it was pushing 320W through the CPU in stress tests.

I use my machine for FPGA compiles so I need reliability. I learned that ASUS Multicore Enhancement is not the only thing that must be disabled, you must manually enter the power limits.

Now my compiles take exactly the same length of time but use at least 100W less power.

I am glad to know that with your field data, I've inadvertently sidestepped a potentially catastrophic bug. I don't want to release an FPGA bitstream to users with flipped bits. And the FPGA tools already crash on their own enough.

kuschku

2 replies

1d3h

2024-02-23 17:35:16 UTC

These Motherboards are Intel certified. If I get a mod shop to install a ferrari certified part, I expect the part to work.

xcv123

0 replies

1d1h

2024-02-23 18:54:46 UTC

Ferrari does not allow modifications of their cars. If you take it to a mod shop, they will void the warranty and you will be banned from purchasing a new Ferrari.

ajross

0 replies

1d1h

2024-02-23 18:53:43 UTC

Meh. So you're in the "Intel should take affirmative action to prevent overclocking" camp. And as mentioned the response to that is that they've tried that (on multiple occasions, using multiple techniques) and people freaked out about that too. They can't win, I guess.

crazygringo

6 replies

1d4h

2024-02-23 16:03:24 UTC

Is that true?

I thought the entire premise of overclocking was that it's not officially supported and it may break things.

The whole point is that you're not paying for it and it's entirely at-risk.

Because if you do want a higher level of guaranteed performance, you do need to pay for a faster chip (if it exists).

Kluggy

3 replies

1d3h

2024-02-23 16:49:43 UTC

CPU manufacturers certainly hold the line you stated but motherboard venders have jumped over the line and now sell motherboards that overlock for the end user entirely transparently.

It’s fair for the end user who bought a motherboard that promises a higher clock speed to expect that clock speed.

crazygringo

2 replies

1d3h

2024-02-23 17:37:11 UTC

Do these motherboards explicitly provide a warranty that covers not just damage from overclocking but also CPU errors?

If you can provide links, I'd be curious to see what guarantees they make. "What's fair" depends very specifically on what language they use.

Red_Leaves_Flyy

0 replies

1d2h

2024-02-23 17:52:54 UTC

From my limited knowledge the motherboard manufacturers hide behind disclaimers. Iirc even using fast ram at their rated clock speed with a cpu that does not support that speed is a warranty violation.

Kluggy

0 replies

1d2h

2024-02-23 18:14:45 UTC

https://www.asus.com/motherboards-components/motherboards/tu... has "ASUS Multicore Enhancement" bios setting which defaulted to Auto which is documented as "This item allows you to maximize the overclocking performance optimized by ASUS core ratio settings."

They have now entered the AI bubble with

https://www.asus.com/microsite/motherboard/Intelligent-mothe...

MSI has a similar setting, although I don't know exactly what models have it nor what it's called

simondotau

0 replies

23h47m

2024-02-23 20:54:37 UTC

The whole point is that you're not paying for it

Tell that to anyone who paid extra for a K-series Intel chip.

sandworm101

0 replies

1d3h

2024-02-23 17:35:47 UTC

Well, Ferrari also tells people not to break speed limits. But if their cars started breaking apart at 85mph they would still be blamed. This might not be warranty repair, intel is probably not liable legally, but this should have impact on their reputation: intel put out a chip that does not handle overclocking very well. Ok. I'll remember that when I am shopping for my next chip.

johnklos

0 replies

1d2h

2024-02-23 18:32:22 UTC

If you put your foot to the floor in a supercharged car, you're going to eventually have to let it up lest you melt things or you burn all your oil because your rings aren't making contact with the cylinder walls any more. It's an apt metaphor since the same is true of CPUs. You can't run a CPU at 400 watts continuously for more than a handful of seconds at a time.

The problem is that Intel has normalized it so much that all their high end CPUs do this, and apparently do it often. It's not unexpected that they might be too close to the point where things are melting, so to speak.

I'd rather slower and more stable any day - I chose a Ryzen 7900 over a 7900X intentionally - but that isn't what all the marketing out there is trying to sell. The fancy motherboards, the water coolers, the highly clocked memory all account for lots of markup, so that's what's marketed. I'm not a fan.

It is worth noting a distinction between the terms "overclocking" and "turbo clocking". "Overclocking" has traditionally meant running the clock "over" the rating. "Turbo clocking" is now built in to almost every CPU out there. One technically can void your warranty, whereas the other doesn't.

Since we're mostly technical people here, we should use the appropriate term where the context makes that choice more accurate. It's like virus and Trojan - we SHOULD be technically correct, but that doesn't mean highly technical people aren't still calling Trojans viruses now and then.

bee_rider

0 replies

1d4h

2024-02-23 16:03:25 UTC

I don’t think there’s a great car analogy because the ecosystems and stakes are different.

These chips require motherboards to function, and these unlocked chips get their configuration from the motherboard. There’s no analogous entity to Ferrari the company here, it is like you bought an engine from one company, a gearbox from another, and the gearbox had a “responsiveness enhancement” setting that always redlined your RPMs or something (I don’t know cars).

cduzz

5 replies

1d2h

2024-02-23 18:03:07 UTC

I think "overclock" implies that the end-user is doing something that's out-of-spec for the thing they're operating.

This "I can run a core at a faster speed" is a documented feature so not really overclocking.

TylerE

4 replies

1d1h

2024-02-23 18:46:15 UTC

That's literally overclocking. You're clocking it at a rate over the nameplate value. Just because the BIOS is factory-unlocked doesn't really change anything.

LoganDark

2 replies

1d1h

2024-02-23 19:15:29 UTC

Yeah. Intel advertises the ability to overclock, but that doesn't mean overclocking is in spec. It just means Intel allows you to run it out of spec if you so choose. The spec says you can set the clock multiplier, it doesn't say anything above the stock range will actually be stable.

TylerE

1 replies

23h26m

2024-02-23 21:15:09 UTC

Plus almost always people are tweaking voltages and such also.

LoganDark

0 replies

22h45m

2024-02-23 21:56:45 UTC

This article is about automatic / enabled-by-default overclocking, which isn't actually specified by Intel but is done by the motherboard manufacturers anyway. At least the "GAMING GAMING GAMING" oriented ones like MSI and friends.

As an example of motherboard manufacturers going outside specifications, my MSI motherboard has a built-in option to change BCLK, which is the clock reference for the entire PCIe bus. Changing it not only overclocks the CPU, but also the GPU's connection (not the GPU itself), as well as the NVMe SSD.

This was so not-endorsed by Intel that they quickly pushed microcode that shuts the CPU down if it detects BCLK tampering.

In response, MSI added a dropdown that allows you to downgrade the microcode of the CPU.

So yeah. Very not within specifications.

cduzz

0 replies

7h37m

2024-02-24 13:04:29 UTC

If intel sells a "3.2ghz cpu" and also advertises that it can run, thermals allowing, a core or two at 4.2ghz, I don't consider that 4.2ghz core "overclocked" as much as "this chip is engineered to have a variety of clocks as advertised." The chip is made from the factory to operate in a couple different ways, just like my car may have a transmission that allows the engine to spin at a couple different speeds, as duty cycle demand.

If I run the chip in a way not documented by the manufacturer, or modify the ECU to allow the turbo to generate more boost, those are both unsupported modifications, and I'd consider either of those "overclocking"

Kon-Peki

4 replies

1d3h

2024-02-23 17:32:37 UTC

But isn't overclocking the entire point of buying the K version of these chips?

secondcoming

1 replies

1d2h

2024-02-23 17:52:45 UTC

Yes it is but there's more to overclocking than just the CPU. You also need adequate cooling and fine-tuning of parameters I'll never truly understand. There are so many moving parts that you're not guaranteed anything. It seems like the CPUs were actually running at their overclocked speeds, but the rest of the system couldn't keep up.

szundi

0 replies

2024-02-23 19:42:17 UTC

Also might need to raise voltage etc

rygorous

1 replies

23h37m

2024-02-23 21:04:25 UTC

Definitely not. These are supposed to be higher-quality bins that also ship with higher stock clock rates (both base and boost) and are rated for them.

I don't know how common this is across the whole population of PC buyers, but personally, I have for sure bought K-series parts then not clocked them past their stock settings, trusting that they are rated for it and deeply uninterested in any OCing past that. (I prefer my machines stable, thank you very much.)

bbarnett

0 replies

23h29m

2024-02-23 21:12:04 UTC

Decades ago I visited a fellow Amiga user's house. He had an overclocked 68060 Apollo board.

He was so happy with the speed. Would not stop telling everyone, and talking about it. Yet as I watched him demo it, it rebooted every minute or so. Most unstable thing ever.

Sure it booted in 2 seconds, and he just went about his merry way, but.. what?! Guy could have still overclocked a little less and had stability, but nope.

Some overclockers are weird.

zenonu

1 replies

1d2h

2024-02-23 17:46:36 UTC

Intel is already running their CPUs at the red line. We're seeing the margin breaking down as Intel tries to remain competitive. The latest 14900KS can even pull > 400W. It's utter insanity.

refulgentis

0 replies

2024-02-23 19:50:46 UTC

I wish I kept up with them better, I swear every 3 months I see a headline that is "Intel says N nodes in {N-1 duration - 3 months}. I think I just saw 5 nodes in 4 years? And we've had 2 in the last 4? Sigh.

bee_rider

1 replies

1d5h

2024-02-23 15:10:59 UTC

At least the built in “multicore enhancement” type overclocks that are popular nowadays with motherboard manufacturers.

I wonder if the old style “bump it up and memtest” type overclocking would catch this. Actually, what is the good testing tool nowadays? Does memtest check AVX frequencies?

wmf

0 replies

1d1h

2024-02-23 19:14:59 UTC

Intel Performance Maximizer is from Intel so I'd hope it has good tests.

bee_rider

16 replies

1d6h

2024-02-23 14:20:08 UTC

IMO it is worth noting that the “turbo mode,” as you call it, seems to be an overlock that some motherboards do by default. Not the stock boost frequencies.

The hyperthread and c-state stuff, eh, if you want to run code that might be a virus you will have to limit your system. I dunno. It would be a shame if we lost the ability to ignore that advice. Most desktops are single-user after all.

dist-epoch

5 replies

1d6h

2024-02-23 14:36:30 UTC

Intel should police their own ecosystem.

ajross

3 replies

1d6h

2024-02-23 14:41:45 UTC

They have, in the past. People (including posters here) absolutely freaked out about clock-locked processors and screamed about the needless product differentiation of selling "K" CPUs at a premium.

People want to overclock. Gamers want to see big numbers. If gamers don't do it their motherboard vendors will. It's not a market over which Intel is going to have much control, really.

Note that you don't, in general, see this kind of silly edgelord clocking in the laptop segments.

dist-epoch

2 replies

1d5h

2024-02-23 14:52:45 UTC

Overclocking is ok.

Out of the box default overclocking is not, this aspect should be policed.

ajross

1 replies

1d5h

2024-02-23 15:05:49 UTC

FWIW, there's no evidence that this is an "out of the box default" configuration on any of this hardware. Almost certainly these are users who clicked on the "Mega Super Optimizzzz!!!" button in their BIOS settings. And again, overclocking support on gaming motherboards is a feature that consumers want, and will pay for. So of course the vendors are going to provide it.

rygorous

0 replies

23h32m

2024-02-23 21:09:55 UTC

Oodle maintainer here, we had two people that hit the issue offer to run some experiments for us. Neither were doing any overclocking before and both tried numerous things including resetting to BIOS defaults and also updating their BIOS (there was a known [to Intel] issue affecting some ASUS boards that had been fixed in a BIOS update in spring of 2023, and we were asked to rule it out.)

This issue doesn't affect every such machine, but both people affected by the issue that consented to run tests for us still had the issue reproduce after flashing BIOS to current and with BIOS default settings for absolutely everything.

Among the settings enabled by default on some boards: current limit set to 511 amps (...wat), long duration power limit set to 350W (Intel spec: 125W), short duration power limit also set to 350W (Intel spec: 253W), "MultiCore Enhancement" which is extra clock boosting past what the CPUs do themselves set to "Auto" not "Off", and some others.

jnxx

0 replies

1d5h

2024-02-23 14:59:51 UTC

Why does this reminds me in this big, extremely profitable company that made something every American needs in a while, which seems to have abandoned all sanity in their processes? Looks like Intel and Boeing are on a similar path....

jrockway

3 replies

1d5h

2024-02-23 15:18:42 UTC

Remember that you run a lot of untrusted code on your single-user desktop through Javascript on websites. Javascript can do all those side channel attacks like Spectre and Meltdown.

VyseofArcadia

1 replies

1d5h

2024-02-23 15:28:42 UTC

Maybe you do, but some of us use NoScript[0] and whitelist sites we trust.

I'm not affiliated with NoScript. I just think it's insane that we run oodles of code to display web pages.

[0] https://noscript.net/

Workaccount2

0 replies

1d3h

2024-02-23 17:10:33 UTC

Using no-script made me realize how unchained the Internet has become. Sites with upwards of 15 different domains all running whatever JS they want on your machine. Totally insane.

bee_rider

0 replies

1d5h

2024-02-23 15:26:07 UTC

There are almost certainly unmitigated Spectre-style bugs hiding in modern hardware. People who don’t block JavaScript by default are impossible to protect anyway.

jnxx

2 replies

1d5h

2024-02-23 14:57:27 UTC

The hyperthread and c-state stuff, eh, if you want to run code that might be a virus you will have to limit your system.

So, you are trusting all web pages you view? Because these are unknown code running on your box which probably has some beefy private data.

bee_rider

0 replies

1d5h

2024-02-23 15:05:52 UTC

I run noscript and try to be selective about which pages I enable.

Wowfunhappy

0 replies

1d5h

2024-02-23 15:03:58 UTC

I know some people browse the web while gaming, but I don't. For the gaming use case, I legit want a toggle that says "yes, all the code I'm running is trusted, now please prioritize maximum performance at all costs." For all I care this mode can cut the network connection since I don't do multiplayer.

I imagine people doing e.g. heavy number crunching might want something similar.

blibble

1 replies

1d6h

2024-02-23 14:26:55 UTC

turbo boost is an advertised feature of the chip

these chips that have been specially binned because they are supposedly stable at those frequencies (within an envelope set by intel)

if intel can't get it to work they shouldn't be selling these chips at all

bee_rider

0 replies

1d5h

2024-02-23 14:57:00 UTC

Unless I misread the blog post, there doesn’t seem to be any issue with the stock turbo behavior.

alwayslikethis

0 replies

1d6h

2024-02-23 14:31:41 UTC

Provided enough cooling, a chip that can boost to its turbo frequency for a few seconds should also run stably at that frequency indefinitely. Nowadays these boost clocks are so high that there is often not much gained by pushing any further.

whoisthemachine

0 replies

1d4h

2024-02-23 16:20:46 UTC

Good, fast, cheap. Choose two.

vondur

0 replies

1d4h

2024-02-23 16:10:45 UTC

It seems like problems occur from different firmware from the various motherboard manufacturers. I have a motherboard with a Ryzen 7950x and it would randomly not boot. I'd have to remove the battery from the system, let it fully reset, and then it would work again. Finally an update to the firmware fixed that bug.

rkagerer

0 replies

1d1h

2024-02-23 19:36:35 UTC

Right, when they still knew how to make reliable hardware instead of cramming in features that aren't fully thought out and come with questionable tradeoffs to hit the bleeding edge.

paulmd

0 replies

1d1h

2024-02-23 19:06:48 UTC

haha, knew it wouldn't take long for the AMD fanboys to get winding up on how awful this is gonna be.

https://news.ycombinator.com/item?id=39479081

Somehow people think that it's a strawman, but people like parent comment actually think and post like this lol

eqvinox

32 replies

1d11h

2024-02-23 09:32:34 UTC

The article is a bit unclear on whether this happens with standard/default settings, tough that's probably because they don't know themselves. The workarounds changing things from "Auto" to "disabled" or even increasing voltage settings certainly seems like it also applies with defaults, and isn't some overclocking/tuning side effect.

If that is the case… ouch.

scrlk

26 replies

1d11h

2024-02-23 09:37:48 UTC

This sounds like motherboard manufacturers pushing aggressive OOTB performance settings, likely in excess of the Intel spec.

eqvinox

25 replies

1d10h

2024-02-23 09:44:37 UTC

That's… an assumption. At least 3 motherboard vendors are affected, and going by the Gigabyte/MSI workarounds at the end of the article, it looks like things need to be adjusted away from Intel defaults.

…it'll need a statement from Intel for some clarity on this…

scrlk

18 replies

1d10h

2024-02-23 09:52:00 UTC

"Intel's default maximum TDP for the 13900K is 253 watts, though it can easily consume 300 watts or more when given a higher power limit. In our testing, manually setting the power limit to 275–300 watts and the amperage limit to 350A, proved to be perfectly stable for our 13900K. That required going into the advanced CPU settings in the BIOS to change the PL1/PL2 limits — called short and long duration power limits in our particular case. The motherboard's default "Auto" power and current limits meanwhile created instability issues — which correspond to a power limit of 4,096 watts and 4,096 amps." [0]

The motherboard manufacturers are setting default/auto power and current limits that are way outside of Intel's specs (253 W, 307 A) [1].

[0] https://www.tomshardware.com/pc-components/cpus/is-your-inte...

[1] https://www.intel.com/content/www/us/en/content-details/7438... (see pg. 98 and 184, the 13900K/14900K is 8P + 16E 125 W)

mschuster91

7 replies

1d9h

2024-02-23 10:44:25 UTC

amperage limit to 350A

Jesus. By German electrical code, you need a 70 mm² cross-section of copper to transfer that kind of current without the cable heating up to a point that it endangers the insulation. How do mainboard manufacturers supply that kind of current without resistive loss from the traces frying everything?

bonzini

3 replies

1d9h

2024-02-23 10:51:09 UTC

Those electrical code cross sections are for 350A at 230V, corresponding to about 80 kW (400V is the same as it's actually three 230V wires).

Processors operate at about 1V. At 300W it's enough to use a much smaller cross section, which is split across many traces.

coryrc

2 replies

1d9h

2024-02-23 10:55:17 UTC

That's not how I^2R losses work. Voltage is not relevant.

planede

0 replies

1d5h

2024-02-23 15:25:20 UTC

I agree. However voltage is relevant for insulation, which also affects how heat can dissipate for the wire, and might also be relevant for a failing wire, when higher and higher voltage can build up at the point of failure (not sure if it's a common engineering consideration outside of fuses, which are designed to fail).

bonzini

0 replies

6h30m

2024-02-24 14:11:29 UTC

At 300W there's only so much power that can become heat. With 80 kW flowing through the wire, if the insulation melts due to excessive wire resistance you call the fire brigade.

michaelt

0 replies

1d9h

2024-02-23 11:31:48 UTC

The traces are extremely short. Look at a modern motherboard and you'll find a bank of capacitors and regulators about 2cm away from the CPU socket.

If you've got 4 layers of 2oz copper, and you make the positive and negative traces 10mm wide, you'll only be dissipating 28 watts when the CPU is dissipating 300 watts. And most motherboards have more than 4 layers and have space for more than 10mm of power trace width. And there's a bunch of forced air cooling, due to that 300 watts of heat the CPU is producing.

Electrical code doesn't let buildings use cables that dissipate 28 watts for 2cm of distance because it would be extremely problematic if your 3m long EV charge cable dissipated 4200 watts.

crote

0 replies

1d1h

2024-02-23 18:53:38 UTC

Bursty current spikes, short and fat traces, using the motherboard as a heat sink, active cooling, and allowing the temperature to rise quite a bit. If you look at thermal camera videos[0], it pretty clear where all the heat is going (although a significant part of that is coming from the voltage regulators).

On the other hand, your national electrical code is going to assume you're running that 350A cable at peak capacity 24/7, right next to other similarly-loaded cables, stuffed in an isolated wall, for very long runs - and it still has to remain at acceptable temperatures during a hot summer day.

[0]: https://www.youtube.com/watch?v=YyDMlXEZqb0

coryrc

0 replies

1d9h

2024-02-23 10:54:17 UTC

That code is for round wire (minimal surface area per volume) that can be placed inside insulation in walls.

This 350A is flat conductors (maximal surface area thus heat dissipation) and very short (not that much power to dissipate so the things it connects to have a significant effect on heat dissipation).

eqvinox

5 replies

1d10h

2024-02-23 10:20:05 UTC

Your [0] also says:

It's not exactly clear why the 13900K suffers from these instability problems, and how exactly downclocking, lowering the power/current limits, and undervolting prevent further crashes. Clearly, something is going wrong with some CPUs. Are they "defective" or merely not capable of running the out of spec settings used by many motherboards?

scrlk

4 replies

1d10h

2024-02-23 10:38:39 UTC

Are they "defective" or merely not capable of running the out of spec settings used by many motherboards?

I'd wager good money on the latter. Why would Intel validate their CPUs against power and current limits that are outside of spec? The users reporting issues probably have CPUs that just made it in to the performance envelope to be binned as a 13900K, so running out of spec settings on these weaker chips results in instability.

It's cases like this where I wish Intel didn't exit the motherboard space, they were known to be reliable but typically at the cost of having a more limited feature set.

eqvinox

1 replies

1d7h

2024-02-23 12:47:57 UTC

I'd wager good money on the latter.

I don't disagree, but I'm cautious about making a call with the current information available. For example: yes, a "4096W / 4096A" power limit sounds odd, but it's not an automatic conclusion that this limit is intended to work to protect the CPU. Instead, it is a function that allows building a system with a particular PSU dimension — it would be odd if that were overloaded to protect the chip itself. Maybe it is, maybe it isn't.

It's also very much possible that the M/B vendors altered other defaults, but… I don't see information/confirmation on that yet. It used to be that at least one of the settings is the original CPU vendor default, but last I looked at these things was >5 years ago :(.

It's cases like this where I wish Intel didn't exit the motherboard space,

Full ACK.

sirn

0 replies

1d6h

2024-02-23 14:28:31 UTC

Modern CPUs have many limits to protect the CPU and later the clock behavior. For example, clock limit, current limit (IccMax, 307A for 13900K), long power limit (PL1, 125W), short power limit (PL2, 253W), transient peak limit (PL3), overcurrent limit (PL4), thermal limit (TjMax, 100c), Fast Throttle threshold (aka Per-Core Thermal Limit, 107c), etc. It also has Voltage/Frequency curves (V/F curve) to map how much voltage needs to drive a certain frequency.

Intel 13900K has a fused V/F curve until its maximum Turbo Boost 2.0 (5.5 GHz) in all cores, and two cores at its Thermal Velocity Boost (aka favored cores, 5.8 GHz). How much to boost depends on By Core Turbo Ratio. For stock 13900K, this is 5.8 GHz for 2 cores, and 5.5 GHz for up to 8 cores with E-cores capped at 4.3 GHz.

As you may have noticed, the CPU has a very coarse Turbo Ratio beyond the first 2 cores. This is to allow the clock to be regulated by one of the limits rather than a fixed number. In reality, 253W PL2 can sustain around 5.1 GHz all P-cores, and after 56 seconds it will switch to 125W PL1 which should give it around 4.7 GHz-ish (IIRC).

This is why when a motherboard manufacturer decides to set PL1=PL2=4096 without touching other limits, it results in a higher number in benchmark. The CPU will consume as much power as it can to boost to 5.5 GHz, until it hits one of the other limits (usually 100c TjMax). This is how we ended up in this mess in the consumer market.

Xeon, on the other hand, has a very conservative and granular Turbo Ratio. My Xeon w9-3495x do have a fused All Core Boost that does not exceed PL1 (56 cores 2.9 GHz at 350W), which makes PL2 exist only for AVX512/AVX workload.

(Side note: I always think that PL1=PL2=4096W is dumb since performance gain is marginal at best, and always set PL1=PL2=253W in all machines that I assembled. I think even PL1=PL2=125W makes sense for the most usage. I do overclock my Xeon to sustain PL1=PL2=420W though (this is around 3.6 GHz, which is enough to make it faster than 64-cores Threadripper 5995WX))

RetroTechie

1 replies

1d9h

2024-02-23 11:32:48 UTC

I'd wager good money on the latter.

Don't guess, measure! The proper action here would be to change BIOS settings from their default / "auto" settings to per-Intel-spec safe ones. Same for RAM, and on systems with known good power supplies, CPU cooling, software installs etc. Then one of the following will happen:

a) BIOS ignores user settings & problem persists.

b) BIOS applies user settings & problem goes away.

c) BIOS applies user settings but problem persists.

Cases a & b count as "faulty BIOS" (motherboard manufacturer caused). Case c counts as "faulty CPU", and replacement cpu may or may not fix that.

No need to guess. Just do the legwork on systems where problem occurs & power supply, RAM, CPU cooling & OS install can be ruled out. Sadly, no doubt there's many systems out there where that last condition doesn't hold.

vdaea

0 replies

1d8h

2024-02-23 12:34:59 UTC

I have a 13900K. The default BIOS settings set a maximum wattage of 4096W (!!!) that makes Prime95 fail. If I change the settings back to 253W, what Intel says is the maximum wattage, Prime95 stops failing.

Still, I don't know if I should RMA. I got the K version because I intended to overclock in the future. And all of this sounds like I won't be able to. I think increasing the voltage a little bit makes the system more stable. I have to play with it. (Really, if someone can say whether I should RMA or not, I would appreciate some input)

Edit: decided to RMA. I have no patience for a CPU that cost me +600€

michaelt

2 replies

1d9h

2024-02-23 11:10:19 UTC

> The motherboard manufacturers are setting default/auto power and current limits that are way outside of Intel's specs

The CPU only draws as much power as it needs, though?

I mean, if you plug a 20 watt phone into a 60 watt USB-C power supply, or a 60 watt laptop into a 100 watt USB-C power supply the device doesn't get overloaded with power. It draws no more current than it needs.

The motherboard's power limits should state the amount of power the PCB traces and buck regulators are rated to provide to the socket - and if that's more than the processor needs that's good, as it avoids throttling.

scrlk

0 replies

1d8h

2024-02-23 12:15:33 UTC

* User is running a demanding application (e.g. game).

* CPU clock speed increases (turbo boost), as long as the CPU isn't hitting: 1) Tj_MAX (max temp before thermal throttling kicks in); 2) the power and current limits specified by the motherboard (in this case, effectively disabled by the out of spec settings).

* Weaker chips will require more power to hit or maintain a given turbo clock speed: with the power and current limits disabled, the CPU will attempt to draw out of spec power & current, causing issues for the on die fully-integrated voltage regulator (noting that there's also performance/quality variance for the FIVR), resulting in the user experiencing instability.

rfoo

0 replies

1d9h

2024-02-23 11:26:10 UTC

The problem is these processors are unstable if they are not properly throttled.

Of course users, especially enthusiast motherboard consumers, hate throttling, hence the default.

wongarsu

0 replies

1d7h

2024-02-23 13:22:36 UTC

Drawing peak power far in excess of the TDP is what all Intel processors have been designed to do for many years now.

Some consider it cheating the benchmarks, but the justification is that TDP is the Thermal Design Power. It's about the cooling system you need, not the power delivery. If you make reasonable assumptions about the thermal inertia of the cooling system you can Turbo Boost at higher power and hope the workload is over before you are forced to throttle down again.

Any mainboard that sets power limits to the TDP would be considered wrong by both the community and Intel. This looks like a solid indication that the issue is with Intel

lupusreal

3 replies

1d9h

2024-02-23 11:01:29 UTC

At least 3 motherboard vendors are affected

Boss says, "Do the thing." Engineer says, "The thing is out of spec!" Boss says, "Competitor is doing the thing already and it works." Engineer does the thing.

eqvinox

1 replies

1d7h

2024-02-23 12:57:20 UTC

Yes, but also no. The motherboard market is "timing-competitive", your product needs to be ready when the CPU launches, especially for the kind of flagship CPU that this specific issue is about. You can't wait and see what the competitors are doing.

lupusreal

0 replies

1d6h

2024-02-23 14:33:04 UTC

Fair point. Maybe "This sort of thing worked fine in the past."

crote

0 replies

1d1h

2024-02-23 18:55:59 UTC

Or perhaps they're all copying from the same reference design?

paulmd

1 replies

1d1h

2024-02-23 19:28:58 UTC

all three motherboard vendors enabling some out-of-spec defaults wouldn't actually be surprising, though?

people forget, blowing up AM5 cpus wasn't just as Asus thing... they were just the most ham-handed with the voltages. Everyone was operating out of spec, there were chips that blew up on MSI and Gigabyte boards, and it wasn't just X3D either.

Intel is no different - nobody enforces the power limit out of the box, and XMP will happily punch voltages up to levels that result in eventual degradation/electromigration of processors (on the order of years). Every enthusiast knows that CPU failures are "rare" and yet either has had some, or knows someone who's had some in their immediate circles. Because XMP actually has caused non-trivial degradation even on most DDR4 platforms.

In fact it's entirely possible that this is an electromigration issue right here too - notice how this affects some 13700Ks and 13900Ks too? Those chips have been run for a year or two now. And if the processors were marginal to begin with, and operated at out-of-spec voltages (intentionally or not)... they could be starting to wear out a little bit under the heaviest loads. Or the memory controllers could be starting to lose stability at the highest clocks (under the heaviest loads). That's a thing that's not uncommon on 7nm and 5nm tier nodes.

rygorous

0 replies

2024-02-23 20:41:37 UTC

This is blowing up now, but the first report of this kind of issue that reached me (I'm the current Oodle maintainer) was in spring of last year. We've been trying to track it down (and been in contact with Intel) since then. The page linked in the OP has been up since December.

Epic Games Tools is B2B and we don't generally get bug reports from end users (although later last year, we did have 2 end users write to us directly because of this problem - first time this has happened for Oodle that I can think of, and I've been working on this project since 2015). Point being, we're normally at least one level removed from end user bug reports, so add at least a few weeks while our customers get bug reports from end users but haven't seen enough of them yet to get in touch with us (this is a rare failure that only affects a small fraction of machines).

13900Ks have been out since late Oct 2022. It's possible that this doesn't show up on parts right out of the box and takes a few months. It's equally plausible that it's been happening for some people for as long as they've had those CPUs, and the first such customers just bought their new machines late 2022, maybe reported a bug around the holidays/EOY that nobody looked at until January, and then it took another 2-3 months for 3-4 other similar crashes to show up that ultimately resulted in this case getting escalated to us.

yetihehe

3 replies

1d11h

2024-02-23 09:35:22 UTC

It seems like it happens only on some select cpu specimens (apparently works after replacing cpu with another one of the same model), so probably a small binning failure?

eqvinox

1 replies

1d11h

2024-02-23 09:38:52 UTC

I'm not sure I would call a binning failure "small" — mostly because I can't remember this ever happening before. Binning is a core aspect of managing yields. And it seems that this is breaking for a sufficient number of people to have a game tooling vendor investigate. How many bug reports would it take to get them into action?

rygorous

0 replies

1d1h

2024-02-23 19:04:53 UTC

Person who actually did the investigation here. It took exactly one bug report.

RAD/Epic Games Tools is a small B2B company. Oodle has one person working full-time on it, namely me, and I do coding, build/release engineering, docs, tech support, the works. There's no multiple support tiers or anything like that, all issues go straight into my inbox. Oodle Data in particular is a lossless data compression API and many customers use two entry points total, "compress" and "decompress".

I get a single-digit number of support requests in any given month, most of which is actually covered in the docs and takes me all of 5 minutes to resolve to the customer's satisfaction. The 3-4 actual bug reports I get in any given year, I will investigate.

wmf

0 replies

1d3h

2024-02-23 17:13:37 UTC

This is the usual silicon lottery. Every chip will work at stock settings. Some will be stable when overclocked and some won't.

rygorous

0 replies

2024-02-23 19:53:24 UTC

(I'm Oodle maintainer and did most of this investigation.)

For the majority of systems "in the wild", I don't know. We had two people with affected machines contact us and consent to do some testing for us, and in both cases the issue still reproduced after resetting the BIOS settings to defaults.

mbrumlow

27 replies

1d7h

2024-02-23 12:57:25 UTC

I recently built a new system with a i9 149kf and a Ausus Formula motherboard. For a VFIO system so I could run windows and play some games.

It was a nightmare to get running stable. None is the default settings the motherboard used worked. Games crashed, kernel and emacs compiles failed.

End result I had to cap turbo to 5.4ghz on a 6ghz chip, and enable settings that capped max watts and temperature for throttling to 90c.

System seems stable now. Can get sustained 5.4ghz without throttling and enjoying games at 120fps with 4k resolution.

Even though it is working I do feel a way about not being able to run the system at any of the advertised numbers I paid for.

JohnBooty

13 replies

1d6h

2024-02-23 14:16:11 UTC

    enable settings that capped max watts and temperature for throttling to 90c.

You were going above 90C before???

My first thought is that seems insane, but is apparently normal for that chip, according to Intel: "your processor supports up to 100°C and any temperatures below it are normal and expected"

https://community.intel.com/t5/Processors/i9-14900K-temperat...

That is just wild though. On one hand you should obviously get the performance that was advertised and that you paid for. On the other hand IMO operating a CPU at 90-100C is just insane. It really feels like utter desperation on Intel's part.

I would be curious what kind of cooling setup you have.

crote

5 replies

1d2h

2024-02-23 18:23:05 UTC

Temperatures like that have been fairly normal for a few generations now - both for Intel and AMD. It might look insane compared to what you were used to seeing a decade ago, but it's actually not that crazy.

First, the temperature sensors got a lot better. Previously you only had one sensor per core/cpu, and it was placed wherever there was space - nowadays it'll have dozens of sensors placed in the most likely hotspots. A decade ago a 70C temp meant that some parts of the CPU were closer to 90C, whereas nowadays a 90C temp means the hottest part is actually 90C.

Second, the better sensors allow more accurate tuning. While 100C might be totally fine, 120C is probably already going to cause serious damage. The problem here is that you can't just rely on a somewhat-distant sensor to always be a constant 20C below the peak value: it's also going to be lagging a bit. It might take a tenth of a second for that temp spike in the hotspot to reach the sensor, but in the time between the spike starting and the temp at the sensor raising enough to trigger a downthrottle you could've already caused serious damage. A decade ago that meant leaving some margin for safety, but these days they can just keep going right up to the limit.

It's also why overclocking simply isn't really a "thing" anymore. Previous CPUs had plenty of safety margin left for risk-takers to exploit, modern CPUs use up all that margin by automatically overclocking until it hits either a temperature limit or a power draw limit.

JoshTriplett

3 replies

2024-02-23 19:50:39 UTC

Exactly. Temperature measurements are a lot like available memory measurements in that regard. People wonder why the OS uses up all available memory, and it's because the OS knows that empty memory is useless, while memory used to cache disk is potentially useful (and can always be discarded when that memory is needed for something else). So, the persistent state of memory is always "full".

Similarly, processors convert thermal headroom to performance, until they run out of thermal headroom. So if you improve the cooling on a processor that has work to do (rather than sleeping), it will use up that cooling and perform better, rather than performing the same and running cooler.

(Mobile processors operate differently, since they need to maintain a much tighter thermal envelope to not burn the user. And processors can also target power levels rather than thermals. But when a processor is in its default "run as fast as possible" mode, its normal operating temperature will be close to the 100C limit.)

jtriangle

1 replies

23h26m

2024-02-23 21:15:17 UTC

There's a side benefit to this as well, your cooling solution is more effective at 90C than it is at 40C, you know, highschool physics, deltaT and all that.

wtallis

0 replies

18h18m

2024-02-24 02:23:44 UTC

That also comes with the side effect that the transistors are leakier at high temperature, leading to higher power for the same performance as compared with operating at a lower temperature. This effect is significant: in mobile systems it can easily be the case that turning on a fan (or speeding it up) leads to a net decrease in power consumption because the power saved by making the chip less leaky is more than the power spent running the fan.

hulitu

0 replies

2h15m

2024-02-24 18:26:23 UTC

People wonder why the OS uses up all available memory

This is really no excuse to have Windows eating 12GB of RAM with an Edge (with 2 static tabs), excel some antivirus and Teams running. This (12GB) is just sick.

hulitu

0 replies

1h16m

2024-02-24 19:25:49 UTC

but it's actually not that crazy.

Unless you take reliability into account.

bee_rider

2 replies

1d5h

2024-02-23 15:29:58 UTC

Hey, is there a cooling solution that sprays water on some sort of heat spreader and lets it evaporate? Kidding. Kinda. But actually, is that possible?

smolder

0 replies

1d3h

2024-02-23 16:49:49 UTC

Heat pipes sort of do that naturally without any active "spraying". They contain a fluid that phase changes and carries heat away. Closed loop water coolers have the active flow of water you want for maximum effect. I don't think your idea would be an improvement on that.

Analemma_

0 replies

1d4h

2024-02-23 16:37:44 UTC

That's essentially what vapor chamber coolers are. It's a sealed unit where the working fluid evaporates on the CPU end, absorbing a ton of heat, and then condenses on the other side, before going back to do the cycle again. Because the heat of vaporization is so large, these can move a lot more heat than ordinary heat sinks.

legosexmagic

1 replies

1d6h

2024-02-23 14:30:02 UTC

the amount of cooling you get is proportional to the difference of component temperature to ambient temperature. thats why modern chips are engineered to run much hotter.

jnxx

0 replies

1d5h

2024-02-23 15:01:48 UTC

Until Dust Puppy kills 'em

jcalvinowens

0 replies

1d2h

2024-02-23 17:52:30 UTC

On the other hand IMO operating a CPU at 90-100C is just insane.

No it isn't, the manufacturer literally says it's normal! I think people who spend as much money on cooling setups as the chip are the insane ones.

My favorite story: I once put Linux on a big machine that had been running windows, and discovered dmesg was full of thermal throttling alerts. Turns out, the heatsink was not in contact with the CPU die because it had a nub that needed to occupy the same space as a little capacitor.

I'd been using that machine to play X-plane for over two years, and I never noticed. It was not meaningfully slower: the throttling events would only happen every ten or so seconds. I'm still using it today, although with the heatsink fixed :)

I have a garage machine with a ca. 2014 Haswell that's been running full tilt at 90C+ for a good bit of its life. It just won't die.

dist-epoch

0 replies

1d6h

2024-02-23 14:38:12 UTC

For both Intel/AMD 100C is now a target, not a limit.

CooCooCaCha

5 replies

1d6h

2024-02-23 13:47:22 UTC

Do you think that happened because you had insufficient cooling?

Its hard to cool these new chips. AMD included.

Hikikomori

2 replies

1d6h

2024-02-23 13:54:29 UTC

For different reasons though. AMD's chiplets produce heat in a small area which makes it hard to transfer heat quickly. Intel just use a shitload more power and thus more heat.

CooCooCaCha

1 replies

1d3h

2024-02-23 16:45:52 UTC

That’s not entirely it though. Modern AMD and Intel chips are built to run at their thermal/frequency limits and will jump to those limits at a moments notice in order to maximize performance.

So unless you have powerful cooling you will hit the thermal limit.

Hikikomori

0 replies

1d2h

2024-02-23 17:54:53 UTC

I didnt say that it was everything that matters, just commenting on the difference between them.

kijin

1 replies

1d6h

2024-02-23 13:52:27 UTC

Even if GP's cooling setup was less than ideal, the chip should have throttled itself to a stable frequency instead of crashing left and right.

cesarb

0 replies

1d6h

2024-02-23 14:05:05 UTC

It might not have throttled fast enough. Without sufficient thermal mass (or with insufficient heat transfer to that thermal mass, for instance if the thermal paste is misapplied), it might heat up too fast for the sensors to keep up.

doubled112

2 replies

1d7h

2024-02-23 13:33:34 UTC

What I'm not happy about is the marketing around turbo boost.

You know how ISPs used to sell "up to X Mbps"? Same idea. Your chip will turbo boost "up to 6.00 GHz".

It's basically automated overclocking, and as you learned, sometimes it can't even do it in a stable fashion. Some of those chips will never clock "up to 6.00 GHz" but they didn't lie. "up to"

wtallis

1 replies

1d1h

2024-02-23 19:17:24 UTC

It's particularly bad when they stop telling you what clock speeds are achievable with more than one core active. At best these days you get a "base clock" spec that's very slow that doesn't correspond to any operating mode that occurs in real life. You used to get a table of x GHz for y active cores, but then the core counts got too large and the limits got fuzzier.

And laptops have another layer of bullshit, because the theoretical boost clocks the chip is capable of will in practice be limited by the power delivery and cooling provided by that specific machine, and the OEMs never tell you what those limits are. So they'll happily take an extra $200 for another 100MHz that you'll never see for more than a few milliseconds while a different model with a slower-on-paper CPU with better cooling can easily be more than 20% faster.

kilolima

0 replies

2024-02-23 20:01:55 UTC

Yes, this is the situation for my dell laptop's i7-1165G7. Alleged turbo boost to 4.7ghz! In reality, it will hit that for a sec and then throttle to ~1ghz. I had to disable the turbo boost AND two cores in bios to let it even achieve ~2.0ghz speeds consistently. It's a total scam. Turns out my 8th-Gen i5 laptop is almost the same speed on benchmarks, just because it's a few mm thicker with better cooling.

BobbyTables2

1 replies

1d5h

2024-02-23 14:47:16 UTC

Seems scary that 10% difference in clock frequency is makes/breaks stability.

How much margin is really there?

rygorous

0 replies

1d1h

2024-02-23 19:37:30 UTC

Dynamic switching power (i.e. the fraction of the chip's power consumption from actually switching transistors, as opposed to just "being on") scales with V^2 * f, where V=voltage and f=frequency, and V in turn depends on f, where higher frequencies need higher voltage. Not really linearly (It's Complicated(tm)), but it's not a terrible first-order approximation, which makes the dynamic switching power have a roughly cubic dependency on frequency.

Therefore, 1.1x the frequency at the high end (where switching power dominates) is 1.33x the power draw.

Those final few hundred MHz really hurt. Conversely, that's also why you see "Eco" power profiles with a major reduction in power draw that cost you maybe 5-10% of your peak performance.

phantomwhiskers

0 replies

22h11m

2024-02-23 22:30:36 UTC

I also recently built a system with the 14900KF on an ASUS TUF motherboard and NZXT Kraken 360 cooler, and so far I haven't experienced any issues running everything at default BIOS settings (defaulted to 5.7GHz). I haven't seen temps above 70C yet, although granted I also haven't seen CPU load go above 40%, and haven't tried running any benchmarking software.

I'm curious about what you are using for cooling, as 90C at 5.4ghz seems way off compared to what I am seeing on my processor, but it could just be that I'm not pushing my processor quite as hard even with the higher clock rate.

callalex

0 replies

1d3h

2024-02-23 17:38:09 UTC

Did you try cleaning everything and re-mounting the cooler with new paste? I’ve seen similar behavior when people mess up and get bubbles in their paste. Do you see wildly different temperature readouts for different cores?

mattgreenrocks

19 replies

1d8h

2024-02-23 11:49:50 UTC

I can't say I'm surprised by this at all.

I bought my 4790k's ASUS TUF board awhile back because I wanted something basic enough and wasn't interested in overclocking or tweaking. The BIOS had other ideas. I had to manually configure a lot more things just to avoid overclocking, including setting RAM timing and going through each BIOS setting to ensure it wasn't overclocking in some way. The "optimal" setting would turn on aggressive changes like playing with bus speed multipliers, etc.

layer8

12 replies

1d7h

2024-02-23 12:55:09 UTC

Few people buy a K processor who aren’t interested in overclocking and tweaking. I wouldn’t be surprised if the BIOS of a gaming mainboard sets the “optimal” defaults on that basis, since the gaming market is all about benchmarks.

phil21

4 replies

1d7h

2024-02-23 12:59:56 UTC

I'm pretty much the same as OP. I almost always buy the K version of the processor, but never intend to overclock. I just figure I want the theoretical ability to, and the more volume they have on those SKUs the less likely they are to take it away entirely.

That or I'm just rewarding shitty corporate product segmentation behavior. I never can quite decide.

I do agree over the recent years getting a "boring" higher-end configuration is getting more and more difficult.

bonton89

1 replies

1d6h

2024-02-23 14:10:39 UTC

K chips often came with higher default clocks and definitely have better resale value so they're often worth buying even if you don't overclock.

smolder

0 replies

1d3h

2024-02-23 17:10:34 UTC

Yes, the overclockable chips are better-binned/faster chips even without enabling overclocking. (Unless you're talking about X3D chips, which have most overclocking features turned off due to thermal limitations of stacked cache.)

ryukoposting

0 replies

6h2m

2024-02-24 14:39:05 UTC

I bought a Ryzen 2600X with no intention of overclocking it, because it had a higher boost clock than a 2600 and it was on sale for almost the same price. I would guess similar lines of reasoning apply to K-SKU intel buyers.

ThatPlayer

0 replies

13h37m

2024-02-24 07:04:00 UTC

Similarly I get them because that's what Microcenter includes in their motherboard/cpu/ram combos. Still cheaper than getting the non-K version with everything else individually.

Sweepi

3 replies

1d5h

2024-02-23 14:51:16 UTC

"Few people buy a K processor who aren’t interested in overclocking and tweaking." The opposite is true: Most people who buy a 'K' CPU dont do any tweaking, I would bet a majority does not even activate things like XMP. The 'K' SKUs are 1) The highest SKU in the Linup in a given Class 2) They are faster then the non-'K' SKUs out of the box.

crote

2 replies

1d2h

2024-02-23 18:35:49 UTC

I would bet a majority does not even activate things like XMP

I highly doubt that. XMP is pretty much mandatory to get even remotely close to the intended performance. Without XMP your DDR4 memory isn't going beyond 2400MHz - but you almost have to try to find a motherboard, memory, or CPU which can't run at 3200MHz or even higher. It has all been designed for speeds like that, it's just not part of the official DDR4 spec.

It's less critical with DDR5, but you're still expected to enable it.

paulmd

1 replies

1d1h

2024-02-23 19:12:49 UTC

nevertheless, both AMD and Intel refuse to warranty processors operated outside of the spec, including when done via XMP/Expo. AMD has gone so far as to add an e-fuse in recent generations that permanently marks processors that have been operated outside the official spec.

https://www.extremetech.com/computing/amds-new-threadripper-...

As much as enthusiasts would like this to be "normalized" - from the perspective of the vendor it is not, they are very clear that this is something they do not cover. And it will become more and more of a problem as generations go forward - electromigration is happening faster and faster (sometimes explosively, in the case of AMD).

But it is quite difficult to get a gamer to understand something when their framerate depends on not understanding it.

https://semiengineering.com/uneven-circuit-aging-becoming-a-...

https://semiengineering.com/3d-ic-reliability-degrades-with-...

https://semiengineering.com/mitigating-electromigration-in-c...

GD-106: Overclocking AMD processors, including without limitation, altering clock frequencies / multipliers or memory timing / voltage, to operate beyond their stock specifications will void any applicable AMD product warranty, even when such overclocking is enabled via AMD hardware and/or software. This may also void warranties offered by the system manufacturer or retailer. Users assume all risks and liabilities that may arise out of overclocking AMD processors, including, without limitation, failure of or damage to hardware, reduced system performance and/or data loss, corruption or vulnerability.

GD-112: Overclocking memory will void any applicable AMD product warranty, even if such overclocking is enabled via AMD hardware and/or software. This may also void warranties offered by the system manufacturer or retailer or motherboard vendor. Users assume all risks and liabilities that may arise out of overclocking memory, including, without limitation, failure of or damage to RAM/hardware, reduced system performance and/or data loss, corruption or vulnerability.

bee_rider

0 replies

23h10m

2024-02-23 21:31:51 UTC

I wonder if XMP is typically enabled by reviewers or on marketing slides.

dmvdoug

1 replies

1d4h

2024-02-23 16:23:34 UTC

since the gaming market is all about benchmarks.

Why is that? I’m not a gamer so legit asking. It would seem to me that what would be most important is do the actual games that exist perform well, not some random, hypothetical maximum performance that benchmarks can game.

Modified3019

0 replies

1d3h

2024-02-23 16:48:08 UTC

My impression is that people looking at gaming benchmarks are looking at comparisons of FPS and frame times taken from just running recent high end games, which sometimes have settings to run through a repeatable demo for exactly this purpose.

kllrnohj

0 replies

1d3h

2024-02-23 17:39:00 UTC

The K chips aren't just unlocked, they're also significantly faster out of the box. I'd guess very few K owners have any intention of overclocking, especially as the gains are very small, and instead just want the higher out of box performance

ajross

5 replies

1d6h

2024-02-23 13:46:13 UTC

As others are pointing out, that "basic enough" CPU is in fact aimed directly at the overclocking market, as (likely) is the motherboard you put it on. This isn't basic at all, this is a high end tweaker rig. It's just a decade old tweaker rig.

mattgreenrocks

4 replies

1d6h

2024-02-23 13:54:39 UTC

Fair enough. I build it to last 7-10 years typically, so happy to spend a little more on a quality board.

What's the go-to basic mobo brand/board for non-tweakers these days?

ajross

2 replies

1d6h

2024-02-23 14:34:56 UTC

There's not a lot, honestly. Pretty much all discrete motherboards are gaming rigs of some form. The basic computer for general users is now a "laptop" (which tend to work quite well for general gaming, FWIW). But the low end choices from the regular suspects (Gigabyte, MSI, Asus) are generally fine in my experience. You do occasionally get a weird/flawed device, like you do with many product areas.

Arrath

1 replies

23h8m

2024-02-23 21:33:16 UTC

Yeah it really seems the market has bifurcated into "DIY build-a-computer" targeted towards gamers, bedazzled with RGB and all that jazz, and "Buy a used/refurb Dell mini-atx office desktop computer", assuming they don't just default to 'buy a laptop' as you point out.

mips_r4300i

0 replies

13h42m

2024-02-24 06:59:28 UTC

At this point, if you want the fastest DDR5 ram (for singlethreaded stuff it's a huge help) then all you can get are RGB xxtreme gamer sticks.

My office glows at night because the RGB dimms stay lit up in sleep mode. But they are fast.

rpcope1

0 replies

1d4h

2024-02-23 15:43:43 UTC

Supermicro is usually a good bet.

londons_explore

14 replies

1d11h

2024-02-23 09:37:00 UTC

If Oodle has control of this code, the logical thing for them to do, when they detect a decompression checksum failure, is to re-decompress the same data (perhaps single threaded rather then multithreaded).

Sure, the user has a broken CPU, but if you can work around it and still let the user play their games, you should.

lifthrasiir

2 replies

1d10h

2024-02-23 09:46:13 UTC

As noted in the linked page, this issue would affect any heavy use of CPU. Oodle happened to be optimized well to hit this issue earlier than most other applications, but nothing can't be really trusted at that point. There is a reason that they recommend to disable overclocking if possible, because such issue is in general linked to the instability due to excessive overclocking.

barrkel

1 replies

1d10h

2024-02-23 10:14:44 UTC

Retry is not an unusual response to unreliable hardware, and all hardware is ultimately unreliable.

Software running at scale in the cloud is written to be resilient to errors of this nature; jobs are cattle, if jobs get stuck or fail they are retried, and sometimes duplicate jobs are started concurrently to finish the entire batch earlier.

lifthrasiir

0 replies

1d10h

2024-02-23 10:25:31 UTC

Cloud machines have a way better guarantee about such errors though. You will eventually see some errors at scale, but that error rate can be reliably quantified and handled accordingly.

Consumer machines are comparably wild. Remember that this issue was mainly spotted from Unreal error messages. Some do too much overclocking without enough testing, which will eventually harm the hardware anyway. Some happen to live in places where single-error upsets are more frequent (for example, high altitude or more radioactive bedrock). Some have an insufficient power supply that causes erratic behaviors only on heavy load. All those conditions can be handled in principle, but are much harder to do so in practice. So giving up is much more reasonable in this context.

yetihehe

1 replies

1d11h

2024-02-23 09:41:16 UTC

Yes, but then the processor will fail at another task during the game and corrupt some other memory. The only solution for unstable processor is to make it stable or replace.

mike_hock

0 replies

1d10h

2024-02-23 09:46:09 UTC

Yes. Props to Oodle for not passing on the hot potato but trying to get the root cause fixed. This hack would have been the easy way out for them so their product doesn't get blamed.

mnw21cam

1 replies

1d6h

2024-02-23 13:50:00 UTC

I'll add to the chorus of other responses that the whole computer is generally unreliable.

However, the really useful thing that this software could do is make the error message much better - explaining the likely thing that is causing the decompression to fail, and advising the computer should be fixed.

rygorous

0 replies

21h24m

2024-02-23 23:17:34 UTC

That error comes out of Unreal Engine proper, not Oodle, and is meant to be shown in cases when compressed shader data is corrupted in the game data files on disk in a way that hasn't been caught by package validation. Which is to say, it's not exactly common.

Oodle itself has some diagnostics hooked up to logs but none of that is user-facing. All the user-facing stuff needs to get handed off multiple times to get from low-level IO plumbing to somewhere that even knows how to display a user-facing error message to begin with.

The most common cause for that error message was, and continues to be (except on the relatively small number of affected machines), that compressed shader data on disk is corrupted. If and when we have a handle as to what actually causes the problem and a minimally-invasive fix (as opposed to the list of several different anecdotal "what if you try X?" that we got from Intel HW lab folks), we'll try to detect affected machines and point them to a website with instructions. For now, it's just a random error message that happens to frequently show up on machines encountering this issue, and all that would change if we changed the error message was that we'd confuse end users more and have to list two different error messages on that page instead of one.

crest

1 replies

1d10h

2024-02-23 10:22:04 UTC

I disagree. This code is just lucky enough to be able to detect the data corruption, but without very deep understanding afterward the whole system state has to be assumed to be corrupted. You can retry until the code doesn't detect data corruption, but you have to assume other state is also corrupted. The right thing would be to scream loudly at the user that their system is unstable and to expect (future) data corruption unless the hardware + firmware (or its configuration) is fixed.

Sure it's unpleasant to be the messenger of bad news, but the alternatives are far worse unless the system is just a dedicated game console without any background processes (which isn't how those CPUs are used).

BlueTemplar

0 replies

1d9h

2024-02-23 11:19:08 UTC

Yeah, I had a RAM issue that might or might not have involved turning on XMP, with eventually as a result several RAM sticks with errors (BadRAM is amazing BTW, sadly, *nix-only), and worse, corrupted storage partitions !

Was a real pain to deal with the fallout...

zX41ZdbW

0 replies

1d5h

2024-02-23 15:22:24 UTC

We have the same check for silent data corruption in ClickHouse. After it detects a checksum mismatch, it also tries to check if it is caused by a single bit flip. If it is, we can provide a more precise diagnostic for a user about possible causes: https://github.com/ClickHouse/ClickHouse/issues?q=is%253Aiss...

Then the natural question arises: if we detect that it is a single bit flip, should we "un-flip" that bit, fix the data, and continue? The answer is: no. These types of errors should be explicit. They successfully help to detect broken RAM and broken network devices, that have to be replaced. However, the error is fixed automatically anyway by downloading the data from a replica.

rygorous

0 replies

1d1h

2024-02-23 19:26:19 UTC

(I'm the person who did most of the investigation.)

A relatively major realization during the investigation was that a different mystery bug that also seemed to be affecting many Unreal Engine games, namely a spurious "out of video memory" error reported by the graphics driver, seemed to be occurring not just on similar hardware, but in fact the exact same machines.

For a public example, if you google for "gamerevolution the finals crash on launch" and "gamerevolution the finals out of video memory", you'll find a pair of articles describing different errors, one resulting from an Oodle decompression error, and one from the graphics driver spuriously reporting out-of-memory errors, both posted on the same day with the same suggested fix (lower P-core max clock multiplier).

That's the problem right there in a nutshell. It's not just Oodle detecting spurious errors during its validation. Other code on the same machine is glitching too. And "just try repeating" is not a great fix because we can't trust the "should we repeat?" check any more on that machine than we can trust any of the other consistency checks that we already know are spuriously failing at a high rate.

Many known HW issues you can work around in software just fine, but frequent spurious CPU errors don't fall into that category.

flohofwoe

0 replies

1d10h

2024-02-23 10:33:50 UTC

That same hardware bug will also result in other corruption in other places where it's not detected, which may then spiral into much more catastophic behaviour. Oodle is just special in that it detects the corruption and throws an error (which is the right thing to do in this situation IMHO).

The ball is in Intel's court, such faulty CPUs should never have made it out into the wild.

cesaref

0 replies

1d10h

2024-02-23 09:48:54 UTC

quote: However, this problem does not only affect Oodle, and machines that suffer from this instability will also exhibit failures in standard benchmark and stress test programs.

It sounds like a hardware issue, i'm guessing over-agressive memory/cpu tuning, underpowered PSU triggering off behaviour etc. The fact that replacing the processor makes the problem go away does not in itself point to the processor as the issue - you may find that changing the memory also 'fixes' the problem.

aChattuio

0 replies

1d10h

2024-02-23 10:23:45 UTC

Needs to be fixed by microcode, BIOS update or recall from Intel and partners

lifthrasiir

6 replies

1d10h

2024-02-23 09:44:11 UTC

This page doesn't seem to be linked from any other public page, so I think it was a response to unwanted complaints from users who tried to track the "oodle" thing in the error log---like SQLite back in 2006 [1].

[1] https://news.ycombinator.com/item?id=36302805

atesti

3 replies

2024-02-23 19:49:25 UTC

There are some pages that are not linked, wondered what happened to these products

https://www.radgametools.com/granny.html https://www.radgametools.com/iggy.htm https://www.radgametools.com/milesperf.htm

rygorous

2 replies

2024-02-23 20:18:47 UTC

Granny, Iggy and Miles are all discontinued as stand-alone products. We're still providing support to existing customers but not selling any new licenses.

pixelpoet

1 replies

2024-02-23 20:27:21 UTC

While we've got you, any chance you'll attend another demoparty here in Germany? :)

Big thanks for your awesome blog, learnt much from it over the years.

rygorous

0 replies

22h1m

2024-02-23 22:40:47 UTC

Thanks!

Chance, sure, it's just a matter of logistics. Revision is a bit tricky since it's usually shortly after GDC, a very busy time in the game engine/middleware space I work in, so not usually when I feel up to a pair of international flights. :) Best odds are for something between Christmas and New Year's Eve since that's when I'm usually in Germany visiting family and friends anyway.

eqvinox

1 replies

1d10h

2024-02-23 09:46:11 UTC

It's linked from https://www.radgametools.com/tech.htm (click "support" at the top, look next to "Oodle" logo → "Note: If you are having trouble with an Intel 13900K or 14900K CPU, please [[read this page]].")

lifthrasiir

0 replies

1d10h

2024-02-23 09:47:46 UTC

Ooh, thank you! I looked so long at the Oodle section and skimmed other sections as well (even searched for the `oodleintel.htm` link in their source codes), but somehow missed that...

franzb

6 replies

1d9h

2024-02-23 10:50:39 UTC

Reminds me of this saga I went through as an early adopter of AMD Threadripper 3970X:

https://forum.level1techs.com/t/amd-threadripper-3970x-under...

HN discussion: https://news.ycombinator.com/item?id=22382946

Ended up investigating the issue with AMD for several months, was generously compensated by AMD for all the troubles (sending motherboards and CPUs back and forth, a real PITA), but the outcome is that I've been running since then with a custom BIOS image provided by AMD. I think at the end the fault was on Gigabyte's side.

rkagerer

4 replies

1d1h

2024-02-23 18:55:53 UTC

Holy cow I had no idea CPU vendors would do this for you.

zare_st

1 replies

2024-02-23 19:57:16 UTC

Supermicro gave us same type of assistance. Then new feature of bifurcation did not work correctly. Without it, enterprise telecommunications peripheral that costs 10x more than 4 socket Xeon motherboard can't run at nominal speed, and it was ran on real lines, not test data.

They sent us custom BIOSes until it got stabilized and said they'll put the patch in the following BIOS releases.

The thing is neither Intel nor AMD nor Supermicro can test edge cases at max usage in niche environments without paying money, but they would really love to claim with backup they can be integrated for such solutions. If Intel wants to test stuff in space for free they have to cooperate with NASA; the alternative is in-house launch.

deepsun

0 replies

22h38m

2024-02-23 22:03:24 UTC

NASA has super-elaborate testbeds and simulators. Maybe producers can provide some format/interfaces/simulators for users, users would write test-cases for it, and give back to providers to run in-house.

If users pay seven figures+ it might make sense.

devmor

1 replies

1d1h

2024-02-23 19:35:34 UTC

When you’re not only helping them debug their own hardware but are also spending money on their ridiculously overpriced HEDT platform, it probably makes them want to keep you happy.

zitterbewegung

0 replies

2024-02-23 19:50:04 UTC

That is true and also lots of people use OCaml

rwmj

0 replies

1d4h

2024-02-23 15:45:12 UTC

Reminded me of the Intel Skylake bug found by the OCaml compiler developers: https://tech.ahrefs.com/skylake-bug-a-detective-story-ab1ad2...

zvmaz

5 replies

1d9h

2024-02-23 11:27:01 UTC

Does it mean that a formally verified piece of software like seL4 can still fail because of a potential "bug" in the hardware?

hmottestad

2 replies

1d9h

2024-02-23 11:41:27 UTC

I would assume that software can always fail in the event of a bug in the hardware. That's why systems that are really redundant, for instance flight control computers, have several computers that have to form a consensus of sorts.

eqvinox

1 replies

1d8h

2024-02-23 12:37:52 UTC

It doesn't even need a bug in the hardware; cosmic rays or alpha particles can also cause the same type of issue. For those, making systems redundant is indeed a good solution.

For the situation of an actual (consistent) hardware bug, redundancy wouldn't help… the redundant system would have the same bug. Redundancy only helps for random-style issues. (Which, to be fair, the one we're talking about here seems to be.)

davrosthedalek

0 replies

1d7h

2024-02-23 13:03:26 UTC

That's why some redundant systems use alternative implementations for the parallel paths. Less likely that a hardware bug will manifest the same way in all implementations.

rygorous

0 replies

1d1h

2024-02-23 19:13:16 UTC

Absolutely, yes.

It can also misbehave without any hardware bugs due to glitching. Rates of incidence of this must be quite low or that would be considered a HW bug, but it's never zero. Run code for enough hours on enough machines collecting stack traces or core dumps on crashes and you will notice that there's a low base rate of failures that make absolutely no sense. (E.g. a null pointer dereference literally right after a successful non-null pointer check 2 instructions above it in the disassembly.)

You will also notice that many machines in a big fleet that log such errors do so exactly once and never again, but some reoccur several times and have a noticeably elevated failure rate even though they're running the exact same code as everyone else. This too is normal. These machines are, due to manufacturing variation on the CPU, RAM, or whatever, much glitchier than the baseline. Once you've identified such a machine, you will want to replace it before it causes any persistent data corruption, not just transient crashes or glitches.

flumpcakes

0 replies

1d9h

2024-02-23 11:40:28 UTC

I would assume that _any_ software, formally verified or not, could fail due to a hardware problem. A cosmic ray could flip a bit in a CPU register. The chances of that happening, and that effecting anything in any meaningful way is probably astronomically low. We probably have thousands of hardware failures every day and don't notice them. This is why I think rust in a kernel is probably a bad idea if it doesn't change from the default 'panic on error'.

yetihehe

5 replies

1d11h

2024-02-23 09:32:54 UTC

TL;DR: some motherboards by default overclock too much on some intel processors, causing instability.

haunter

1 replies

1d10h

2024-02-23 09:45:13 UTC

default overclock too much

Per the article MSI literally suggests to OC the CPU to fix the problem

xcv123

0 replies

1d9h

2024-02-23 10:59:51 UTC

Article recommends disabling overclocking. The MSI recommendation is only to increase voltage.

eqvinox

1 replies

1d11h

2024-02-23 09:35:53 UTC

That's not my interpretation. Cf the following:

For MSI:

Solution A): In BIOS, select "OC", select "CPU Core Voltage Mode", select "Offset Mode", select "+(By PWM)", adjust the voltage until the system is stable, recommend not to exceed 0.025V for a single increase.

This really sounds like the Intel defaults are broken too.

yetihehe

0 replies

1d11h

2024-02-23 09:39:16 UTC

Yes, that or insufficient quality checks, meaning some units will fail, some will work. Apparently it was only a subset of each model failing.

worewood

0 replies

1d11h

2024-02-23 09:34:10 UTC

It's MCE all over again

tibbydudeza

5 replies

1d8h

2024-02-23 12:05:53 UTC

Why I chose a i9 13900 (non K) variant rather- my PC earns me money as a freelance software dev so I can't stand weird issues like this

svantana

1 replies

1d7h

2024-02-23 12:48:07 UTC

As another software dev, I would pay big money for the "worst possible computer" that exhibits all of the glitches and issues that end users see. It's so annoying to get bug reports that I can't reproduce.

tibbydudeza

0 replies

1d7h

2024-02-23 12:57:18 UTC

I had my time during my embedded days - did a site visit 1000 km away and discovered no wonder the serial port and scanner/printer is going wonky.

No shielding, earth - using the crappiest/cheapest PC they could get instead of using the recommended kit as the sales droid wanted a bigger commission.

Said call me when you replaced the h/w - I walked out and went to the airport. They never called me.

fabianhjr

1 replies

1d4h

2024-02-23 16:34:07 UTC

If that was the case why not go for Ryzen + ECC memory?

tibbydudeza

0 replies

1d3h

2024-02-23 17:12:13 UTC

Got ECC memory in my server - I am a value for money - my previous kit was i7-6700 system 48GB so I really sweated it until Jetbrains let me know "She canno go more Captain".

DDR4/Intel motherboards are cheaper than AM5/DDR5 - also a Ryzen laptop foobarred on my daughter so to me Intel kit was just more stable - no weird XMP issues or overclocking to the nines.

codexon

0 replies

1d1h

2024-02-23 19:40:00 UTC

I got a 13900 non-K on a Linux server and it randomly locked up the system after a month.

op00to

3 replies

1d8h

2024-02-23 12:15:34 UTC

This sounds a lot like the behavior I see when I have over locked my processor too far, and try to run AVX heavy workloads! Cranking down the frequency during AVX seems to stabilize things.

Arech

2 replies

1d7h

2024-02-23 12:54:05 UTC

Had the same experience overclocking old AMD Phenom II a while ago. Worked flawlessly in all publicly available test software I tried, until I run some custom heavily vectorized code, which eventually required to shave off almost all overclocking :D

op00to

1 replies

1d5h

2024-02-23 14:59:59 UTC

There's a way (at least on my Intel) to tell the processor to clock down a certain number of steps depending on whether AVX is being executed. So, for the majority of stuff that didn't use AVX I let 'er rip, but when AVX is running it clocks down a couple steps. I could use less voltage, and this CPU is fast enough. I think it's a 13900k.

Arech

0 replies

1d5h

2024-02-23 15:03:26 UTC

Ah, that's an interesting feature of new CPUs, I didn't know about it! Thanks for telling!

johnklos

3 replies

1d2h

2024-02-23 17:59:06 UTC

For years I've had this impression that Intel CPUs were, to put it simply, trying too hard. I administer servers for various companies, and some use Intel even though I generally recommend AMD or non-x86.

A pattern I've noticed is that some of the AMD systems I administer have never crashed or panicked. Several are almost ten years old and have had years of continuous uptime. Some have had panics that've been related to failing hardware (bad memory, storage, power supply), but none has become unstable without the underlying cause eventually being discovered.

Intel systems, on the other hand, have had panics that just have had no explanation, have had no underlying hardware failures, and have had no discernible patterns. Multiple systems, running an OS and software that was bit-for-bit identical to what has been running on AMD systems, have panicked. Whereas some of the AMD systems that had bad memory had consumer motherboards with non-ECC memory, the Intel systems have typically been Supermicro or Dell "server" systems with ECC.

In one case two identical Supermicro Xeon D systems with ECC were paired with two identical Steamroller (pre-Ryzen) AMD systems. All systems provided primary and backup NAT, routing, firewalling, DNS, et cetera. The Xeon systems were put in place after the AMD systems because certain people wanted "server grade" hardware, which is understandable, and low power AMD server systems weren't a thing in that time period. Over the course of several years, the Xeon systems had random panics, whereas one of the AMD systems had a failed SSD, but no unplanned or unexplained panic or outage, and the other had never had a panic or unplanned reboot in all the years it was in continuous service.

Had I collected information more deliberately from the very beginning of these side-by-side AMD and Intel installations, I'd have something more than anecdotal, but I'm comfortable calling the conclusion real: multiple generations of Intel systems, even with server hardware and ECC, have issues with random crashes and panics, on the order of perhaps one every year or two. I do not see a similar instability on AMD, though.

With brand new Intel CPUs taking substantially more power than similarly performing AMD CPUs, we have a more literal example of what I think is the underlying cause: Intel is trying way too hard to get every tiny bit of performance out of their CPUs, often to the detriment of the overall balance of the system. Between the not insignificantly higher number of CPU vulnerabilities on Intel due to shortcuts illustrated by the performance losses from enabling mitigations, and the rather shocking power draw of stock Intel CPUs that have turbo boosting enabled, I can't recommend any Intel system for any use where stability matters.

crote

1 replies

1d1h

2024-02-23 19:03:52 UTC

On the other hand, I'm currently dealing with an AMD system which seems to randomly hard reboot every couple of days, as if someone pressed the power button for 5 seconds.

It could keep running for 70 hours, or it could crash twice in 4 hours. Stress-testing CPU, GPU, memory, and storage doesn't invoke a crash, but it'll crash when all I'm running is a single Firefox tab with HN open.

Maybe I got unlucky, or maybe you got lucky. Who knows, really.

xcv123

0 replies

1d1h

2024-02-23 19:30:21 UTC

If it's not the system hardware could it be power instability? Brownouts will trigger a reset. A UPS with power conditioner will fix that.

sysoleg

0 replies

1h33m

2024-02-24 19:08:05 UTC

That's interesting. Do you have any examples of Linux kernel panic messages?

enraf

3 replies

1d9h

2024-02-23 11:40:23 UTC

I got one of the faulty 13900k, at least in my case I can confirm that the fault appeared using the default settings for pl1/pl2.

I was doing reinforcement learning on that system and it was always crashing, I spent quite a bit of time trying to find the problem, swapped the CPU for a 13700kf I was using in another PC, the problem was solved.

So I contact Intel to start the RMA process, Intel said that the MSI motherboard I was using doesn't support Linux, I emailed them the official Intel GitHub repo with the microcode that enables the support, they switched agents at that point but I was clear to me at that moment that Intel was trying their best to avoid the RMA, luckily I live in Europe, so I contacted my local consumer protection agency and did the RMA through them, in the meanwhile I saw a good offer for a 7950x + motherboard in an online retailer, bought it and sold in the second market my old motherboard and the RMA 13900k when I got it.

Not buying Intel ever again, I was using Intel because they sponsor some projects in DS but damn.

hopfenspergerj

2 replies

1d7h

2024-02-23 13:34:04 UTC

I’ve had instability with my 7700k since I bought it, and 16 months of bios updates haven’t helped. Maybe this latest generation of processors just has more trouble than older, simpler designs.

smolder

0 replies

1d3h

2024-02-23 17:03:12 UTC

Possibly. I would start swapping parts around at that point. Different memory, different CPU, or different motherboard. Just 1 more anecdote, but my r7-7700x has been a dream (won the silicon lottery). It runs at the maximum undervolt & RAM at 6000 with no stability problems.

acdha

0 replies

1d5h

2024-02-23 15:27:41 UTC

Intel has been struggling with CPU performance for a decade, and has been trying to regain their position in absolute performance and performance/{price,watt} comparisons. I think that means they’re being less conservative than they used to be on the hardware margins and also that their teams are likely demoralized, too.

adamc

3 replies

1d3h

2024-02-23 16:54:22 UTC

While I appreciate their point of view, from a consumer pov this would definitely be a failure of their software, since an implicit requirement is that it has to run on the customer machines. People aren't going to throw away their CPU for this, they are going to return the game (if possible), and certainly express the bad user experience they had with it.

xcv123

2 replies

1d1h

2024-02-23 18:59:55 UTC

This is a hardware fault causing other software to fail. Intel and mainboard manufacturers have recommended workarounds. Customers are not stupid and they know who is at fault.

adamc

1 replies

2024-02-23 19:48:46 UTC

I predict you are wrong and there will be returns. It's not really a question of stupidity, but of what options are available to them.

xcv123

0 replies

2024-02-23 19:58:56 UTC

No they just follow the provided instructions, go into the BIOS setup, and fix the settings. These are $1k CPUs. Purchased by enthusiasts, not retards. If their CPU is unstable they will want to fix it.

vdaea

2 replies

1d9h

2024-02-23 11:14:01 UTC

I have a 13900K and I am affected. Out of the box BIOS settings cause my CPU to fail Prime95, and it's always the same CPU cores failing. Lowering the power limit slightly will make it stable. I intended to better refrigerate the CPU and change the power limit back to the default and if the problems continued I would RMA the CPU, but now I'm not so sure that the BIOS is not pushing it beyond the operating limits.

mips_r4300i

1 replies

13h27m

2024-02-24 07:14:40 UTC

Can I ask what mobo vendor? Do you know what power limits the BIOS was targeting that caused the error?

On my Asus/14900k, it was uncapping PL1/2 and I saw absurd temps and power every time anything even touched the CPU. I programmed PL1/2 to 125/253w per Intel ARK and everything normalized.

I did not do Prime95 at the insane default power limits but I suspect similar.

vdaea

0 replies

10h3m

2024-02-24 10:38:17 UTC

MOBO is MSI, setting was "cpu cooler tuning" which was set at 4096W and I had to change to 253W (the limit according to Intel)

jeffbee

2 replies

1d7h

2024-02-23 13:35:41 UTC

Similar experience with an Asus motherboard. With their automatic tuning, instability leading to compiler crashes. Had to manually set the BIOS for sanity.

I believe the problems are compounded by the way their SuperIO controls the cooler, because the crashes were associated with temperature excursions to 100C. It's too slow to ramp up and too quick to ramp down. It is possible to tune this from userspace under Linux. But really the up ramp should be controlled by a leading indicator like the voltage regulator instead of a lagging indicator. Alternately the Linux p-state controller could anticipate the power levels and program a higher fan speed.

mips_r4300i

1 replies

13h22m

2024-02-24 07:19:20 UTC

I'm dealing with this right now (Asus ROG Z790, 14900k, Noctua NH-D15). The stock fan curves seem to be ineffective and also annoying as they are hunting constantly. Single P core temps bounce around constantly causing the fans to be spastic.

I have read increasing the ramp up time would smooth out the fan behavior but your experience says this can cause processor fails.

jeffbee

0 replies

3h30m

2024-02-24 17:11:48 UTC

Exact same setup, I'm afraid. Best solution I have been able to come up with is to read the datasheet of the SuperIO on that board and tune the hysteresis parameters from Linux after boot.

imdsm

2 replies

1d10h

2024-02-23 10:03:03 UTC

1994 all over again!

cwillu

1 replies

1d10h

2024-02-23 10:20:20 UTC

FDIV was a bug in the logical design, this is a over-aggressive clock tuning, I fail to see any resemblance whatsoever beyond intel being inside.

crest

0 replies

1d10h

2024-02-23 10:25:13 UTC

It's no longer black and white like the FDIV bug, but if the default configuration leads to data corruption in heavy SIMD workloads... sure you can reduce clock speed or increase voltage until it works, but unless the mainboards violate the specs this is at least partly an Intel CPU flaw leading to data corruption.

terrelln

1 replies

1d2h

2024-02-23 18:00:34 UTC

We also regularly run into hardware issues with Zstd. Often the decompressor is the first thing to interact with data coming over the network. Or like in this case the decompressor is generally very sensitive to bit-flips, with or without checksumming enabled, so notices other hardware problems more than other processes running on the same host.

One decision that Zstd made was to include only a checksum of the original data. This is sufficient to ensure data integrity. But, it makes it harder to rule out the decompressor as the source of the corruption, because you can't determine if the compressed data is corrupt.

mjevans

0 replies

21h32m

2024-02-23 23:08:58 UTC

Compressed data is like a backup. It's not valid until it's tested.

perryizgr8

1 replies

1d6h

2024-02-23 14:13:54 UTC

If I ever encounter a CPU bug causing problems in my production code, I will consider my life complete. I will be satisfied that I've practiced my profession to a high degree of completeness.

dist-epoch

0 replies

1d5h

2024-02-23 14:49:42 UTC

You should go work for Facebook. At their scale they are encountering CPU bugs daily:

This has resulted in hundreds of CPUs detected for these errors

https://arxiv.org/abs/2102.11245

mhio

1 replies

1d10h

2024-02-23 10:21:48 UTC

This sounds familiar... ye olde pentium III 1.13 GHz

https://www.tomshardware.com/reviews/intel-admits-problems-p...

ManuelKiessling

0 replies

1d8h

2024-02-23 12:06:21 UTC

Unrelated to the actual topic, but kudos to the Tom’s Hardware site that they serve a 24 years old web posting flawlessly.

ezekiel68

1 replies

22h36m

2024-02-23 22:05:46 UTC

Based on the article contents, this doesn't seem to be CPU errata. We already know that overclocking above a certain point will cause OS crashes. This seems to be system instability just below the threshold of crashing. Aggressive power and clock settings manifest as this instability without causing an actual crash.

I don't find this situation much different than needing to dial back BIOS settings when actual crashes are observed.

sedatk

0 replies

22h31m

2024-02-23 22:10:54 UTC

Exactly. I remember when I overcloked my 486DX4-100 to 120Mhz, everything would work fine but the floppy drive. It just wouldn’t work for whatever reason. Never thought it was a CPU issue, I’d just asked for it.

PawBer

1 replies

1d6h

2024-02-23 13:51:31 UTC

Reminds me of this Raymond Chen classic: https://devblogs.microsoft.com/oldnewthing/20050412-47/?p=35...

lostmsu

0 replies

14h45m

2024-02-24 05:56:12 UTC

I wonder why didn't they add a system crash analyzer component that would tell user their CPU is misbehaving (xor eax, eax) to save themselves some hard to debug support volume.

IYasha

1 replies

1d8h

2024-02-23 11:57:55 UTC

Wow. RAD was bought by Epic? I kinda missed that. Feels old. :(

mobilio

0 replies

1d7h

2024-02-23 13:03:05 UTC

Yup

https://www.epicgames.com/site/en-US/news/epic-acquires-rad-...

ryukoposting

0 replies

1d3h

2024-02-23 16:52:19 UTC

I vaguely recall motherboard vendors ignoring Intel's power recommendations a couple years ago, which was causing weird thermal/performance issues (was it Asus?). I get the impression that's what's happening here, again.

newsclues

0 replies

1d8h

2024-02-23 12:28:54 UTC

remember when. intel had intel branded reference boards? I would like a comeback please

dontupvoteme

0 replies

1d1h

2024-02-23 19:17:51 UTC

Decompression failure immediately went to a far grimmer failure mode in my mind..

colombiunpride2

0 replies

2024-02-23 19:46:30 UTC

I wonder if this is partly related to the LGA1700 frame problems that tend to bend the heat spreader.

There are two after market contact frames that drop the temperature around 10 Celsius and ensure flat contact with the head spreader. The stock frame causes the center of the head spreader to dip.

I wonder if the turbo boost is controlled by a Proportional–integral–derivative controller.(PID)

The idea that the parameters are fine tuned to slow down processor speed as it heats up but before it overshoots its maximum threshold.

If those PID values are tuned to assume flat heat spreader/heat sink contact, I can see where a bent heat spreader could cause the cpu to overshoot its safe limit and cause errors.

SergeAx

0 replies

12h0m

2024-02-24 08:41:14 UTC

overly optimistic BIOS settings

First time I see this nice euphemism for overclocking.

Havoc

0 replies

1d9h

2024-02-23 11:17:06 UTC

Same cpus as the unity engine (or was it unreal?) with issues

Not a good look but at least it’s fixable with bios tweaks rather than a silicon flaw that’s permanent

FileSorter

0 replies

1d1h

2024-02-23 19:09:35 UTC

I recently had to RMA my i9-13900KS because it was faulty. I was experiencing some of the weirdest behavior I have ever seen on a PC. For example, 1. Whenever I tried to install Nvidia drivers I would get "7-zip: Data error" 2. A fresh install of Windows would give me SxS error when trying to launch edge 3. I could not open the control panel 4. BSOD loop on boot