Intel identifies cause behind Raptor Lake crashes, says mobile CPUs aren't affected

We've had multiple systems at work that would instantly BSOD upon launching an application that used more than a few cores. No BIOS update is going to fix that; they have been RMAed.
A microcode update very well could, however.

If you think this is all confirmation bias, you're in a state of denial yourself. Why do you think this news is being reported on everywhere?
Because it's a valid issue -- for non-mobile CPUs. A company I'm associated with has several thousand Intel laptops with 13th/14th gen CPUs (I myself have two of them). The failure rates are not out of the ordinary. And the "videos and articles" mentioning laptop failures all tend to be apocryphal, with little to no hard data to back them up. Computers fail. Software fails. The fact that some Intel laptops are crashing in no way validates the thesis here.
 
You are again incorrect. The microcode's calculation of the delivered voltage depends on many factors besides clock speed. And the fact that the crashes are rare tells us the conditions needed to reproduce the problem are equally uncommon.

Using a very common compression algorithm does not sound like some "extremely unlikely and never happening" scenario. That's just one of many.

I agree that there are many factors involved, but considering the massive testing CPUs are supposed to go through, again, Intel must have morons working there.

Also, there have been numerous reports with fairly specific usage scenarios. And Intel's first fix was to advise using default settings that lower the voltage. So the problem has been voltage-related for a long time, and it took Intel this long to figure it out.

You're still not seeing your logical error. For the voltage to be "required", the algorithm would have had to calculate it correctly. This is not happening, however. The algorithm is incorrectly calculating too high a voltage.

Again, if a CPU needs voltage X to operate at frequency Y, then voltage X is the maximum that is ever needed. Simple as that. Now you say the algorithm calculates more voltage than that maximum. That again makes zero sense.
 
A microcode update very well could, however.


Because it's a valid issue -- for non-mobile CPUs. A company I'm associated with has several thousand Intel laptops with 13th/14th gen CPUs (I myself have two of them). The failure rates are not out of the ordinary. And the "videos and articles" mentioning laptop failures all tend to be apocryphal, with little to no hard data to back them up. Computers fail. Software fails. The fact that some Intel laptops are crashing in no way validates the thesis here.
I think you don't understand: those systems were beyond repair. Swapping out the CPU fixed them again for a while.

Yikes, well if these are normal failure rates, I'm never buying an Intel CPU again, lol. My desktop and laptop Zen parts, which I use for gaming and work, almost never crash, and when they do I can trace the cause to some software error.
More importantly, hardware stress testing never reports faults on them, while on Raptor Lake CPUs it's quite easy to produce one.
I wish I had a laptop with an Intel CPU to try this out on.
 
Using a very common compression algorithm does not sound like some "extremely unlikely and never happening" scenario. That's just one of many.
Now you're simply fabricating tales. My Intel work machines -- and those of my colleagues -- run pretty much every compression algorithm known to man, and none are experiencing crashes.

Again, if a CPU needs voltage X to operate at frequency Y, then voltage X is the maximum that is ever needed. Simple as that. Now you say the algorithm calculates more voltage than that maximum. That again makes zero sense.
I'm not sure which word you're failing to understand here. Let me explain with a hypothetical. The microcode, after examining temperature, core count, power profile settings, current software load, and quite a few other factors, performs a lengthy calculation. At one step, where it determines the maximum safe voltage should be lowered by 0.2v, it incorrectly adds that value instead. The delivered voltage now becomes 2.05v rather than the required 1.65v. Voila! A crash results. The CPU didn't need that 2.05v to "operate at frequency Y" ... but it received it anyway.
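In rough code form, a minimal sketch of that kind of sign error -- every name and number here is invented purely for illustration, not anything from Intel's actual (non-public) microcode:

```c
/* Hypothetical only: a toy version of the sign error described above.
 * All names and numbers are invented for illustration. */
#include <stdio.h>

int main(void) {
    double base_max_v = 1.85;  /* assumed baseline voltage ceiling for this state */
    double derate_v   = 0.20;  /* amount the ceiling should be LOWERED under this load */

    double correct_v = base_max_v - derate_v;  /* 1.65 V -- what the core actually needs */
    double buggy_v   = base_max_v + derate_v;  /* 2.05 V -- sign flipped: added instead */

    printf("correct: %.2f V, delivered by the bug: %.2f V\n", correct_v, buggy_v);
    return 0;
}
```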
 
I don't trust Intel one iota. Initial evidence suggests an underlying fault, as alluded to above on Gamers Nexus - maybe an oxidation problem.
So Intel wants to stretch out the mortality curve.
On the server side, where these CPUs are thrashed, the death rate is in double figures, anywhere from 5% to 50%. Companies are not asking IF a CPU will fail but when it will fail.

Intel knows that many home users are not crunching all day, plus some CPUs may be less affected, so such fixes will extend these people's CPUs hopefully until they upgrade = thereby reducing liability.

Anyway, the truth will be out in a couple of months with lab testing. Plus Intel will have to grovel to its partners and their corporate customers. They won't be snow-jobbed like some on here.

Past actions are a good predictor of current and future actions. It's not the first time Intel has lied, cheated and scammed people out of their rights and a fair marketplace.
 
Since when does undervolting mean lower clocks?
I may be wrong here, but not all chips perform well at a given voltage/power level. While lowering the voltage may not affect all users, it may result in some lower-quality chips potentially running slower. The "default" voltage and power would have some sort of buffer to address this inequality so that the chips can perform at a certain expected level.
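A rough sketch of that "buffer" idea, with every number invented purely for illustration:

```c
/* Illustration only: why a one-size-fits-all default voltage carries headroom.
 * Every figure below is an assumption made up for the example. */
#include <stdio.h>

int main(void) {
    double worst_chip_v = 1.30;  /* assumed: the weakest binned sample needs this to hold boost */
    double good_chip_v  = 1.22;  /* assumed: a lucky sample holds the same boost at this */
    double guard_band_v = 0.03;  /* assumed safety margin on top of the worst case */

    double default_v = worst_chip_v + guard_band_v;  /* what every chip ships with */

    printf("shipping default: %.2f V\n", default_v);
    printf("undervolt headroom on a good sample: %.2f V\n", default_v - good_chip_v);
    return 0;
}
```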
 
I don't trust Intel one iota. Initial evidence suggests an underlying fault, as alluded to above on Gamers Nexus - maybe an oxidation problem.
So Intel wants to stretch out the mortality curve.
On the server side, where these CPUs are thrashed, the death rate is in double figures, anywhere from 5% to 50%. Companies are not asking IF a CPU will fail but when it will fail.

Intel knows that many home users are not crunching all day, plus some CPUs may be less affected, so such fixes will extend these people's CPUs hopefully until they upgrade = thereby reducing liability.

Anyway, the truth will be out in a couple of months with lab testing. Plus Intel will have to grovel to its partners and their corporate customers. They won't be snow-jobbed like some on here.

Past actions are a good predictor of current and future actions. It's not the first time Intel has lied, cheated and scammed people out of their rights and a fair marketplace.
Even if I ignore the speculation online, the fact that Intel failed to address the issue earlier casts some doubt on their claim here. I still recall Intel openly highlighting that the issue was solely due to motherboard makers not sticking to the power limits, and disclaiming that it wasn't their problem. But this sounds like their problem, right? And it looks bad on them in terms of credibility with partners and customers. If the issue persists after the microcode rollout next month, I feel their reputation will go down the drain. So time will tell whether they are being honest and have done their due diligence.
The reality is that there will always be bad products from time to time. But what sets you apart from a reliability/credibility standpoint is how you resolve the issue.
 
Now you're simply fabricating tales. My Intel work machines -- and those of my colleagues -- run pretty much every compression algorithm known to man, and none are experiencing crashes.

Perhaps you are either lucky or use CPUs lightly. That Oodle compression thing is one popular scenario where those CPUs crash.

I'm not sure which word you're failing to understand here. Let me explain with a hypothetical. The microcode, after examining temperature, core count, power profile settings, current software load, and quite a few other factors, performs a lengthy calculation. At one step, where it determines the maximum safe voltage should be lowered by 0.2v, it incorrectly adds that value instead. The delivered voltage now becomes 2.05v rather than the required 1.65v. Voila! A crash results. The CPU didn't need that 2.05v to "operate at frequency Y" ... but it received it anyway.

First, they must have a microcode testing program that runs through every possible starting value and gives the maximum result for every combination of starting values. That way it's impossible not to see that there may be excessive voltage.

Secondly, users have been giving pretty specific use cases for the crashes, like running Cinebench. Intel could just run Cinebench and see what the maximum voltage supplied during the run was. If Intel didn't notice that "bug" despite testing, they are either morons or lying.

In other words, there is testing in place that ensures no possible combination gives more than the allowed voltage. Additionally, there is also a hard limit for maximum voltage. Both are very basic stuff. That's why your logic makes zero sense.
 
I don't trust Intel one iota. Initial evidence suggests an underlying fault, as alluded to above on Gamers Nexus - maybe an oxidation problem.
So Intel wants to stretch out the mortality curve.
On the server side, where these CPUs are thrashed, the death rate is in double figures, anywhere from 5% to 50%. Companies are not asking IF a CPU will fail but when it will fail.

Intel knows that many home users are not crunching all day, plus some CPUs may be less affected, so such fixes will extend these people's CPUs hopefully until they upgrade = thereby reducing liability.

Anyway, the truth will be out in a couple of months with lab testing. Plus Intel will have to grovel to its partners and their corporate customers. They won't be snow-jobbed like some on here.

Past actions are a good predictor of current and future actions. It's not the first time Intel has lied, cheated and scammed people out of their rights and a fair marketplace.

Thing is, server CPUs usually run at about 50% lower clocks (averaging 2.2 GHz to 3.5 GHz) and boost never exceeds the chip's specified TDP. So you could say that server chips should have a longer lifespan than desktop parts, since they are tuned differently.

But if the information about failing server chips is correct, then Intel has a serious problem. Usually it's RMAs (returning the product) and perhaps a mass claim, but that won't bankrupt Intel. It's the reputation damage that will likely cost Intel far more than anything else.

Now that AMD is catching up on single-core IPC, or is even ahead of Intel right now, Intel has a problem. A big problem.
 
First, they must have a microcode testing program that runs through every possible starting value and gives the maximum result for every combination of starting values.
Oops! Every possible result from every possible initial condition adds up to far more possibilities than there are atoms in the observable universe. Even for just *one* single initial state, the possible outputs can in some cases be infinite. Google the incomputability of the Turing machine halting problem for details.

In other words, there is testing in place that ensures no possible combination gives more than the allowed voltage. Additionally, there is also a hard limit for maximum voltage. Both are very basic stuff.
I'm sorry, but it's far less "basic" than you believe. The PWM already has a hard maximum voltage limit. But there are many other 'maximum' voltages the CPU can have, depending on load, core usage and temperature, and other factors. To repeat: testing "every possible combination" would take longer than the lifespan of the universe.
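Some quick back-of-the-envelope math on that scale (the test throughput below is an optimistic assumption, purely to show the order of magnitude):

```c
/* Back-of-the-envelope: exhaustively testing even a couple of 64-bit inputs.
 * The throughput figure is an optimistic assumption, not a measured number. */
#include <stdio.h>
#include <math.h>

int main(void) {
    double states_per_input = pow(2.0, 64.0);  /* ~1.8e19 possible values */
    double tests_per_second = 1e9;             /* assume one billion test cases per second */
    double seconds_per_year = 3.156e7;

    double years_one_input = states_per_input / tests_per_second / seconds_per_year;
    printf("one 64-bit input alone: ~%.0f years\n", years_one_input);            /* ~585 years */
    printf("two independent inputs: ~%.1e years\n", years_one_input * states_per_input);
    /* the universe is only ~1.4e10 years old, so two inputs already blows past it */
    return 0;
}
```

And that's just two inputs; real microcode has far more than two.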

Perhaps you are either lucky or use CPUs lightly. That Oodle compression thing is one popular scenario where those CPUs crash.
Now the truth outs. Oodle isn't a compression algorithm as you claimed, but rather a specific library of algorithms ... and one used only for certain video games (e.g. those using the Unreal Engine).

So your claim that all Intel testers needed to do to find this issue was run a common compression algorithm and instantly see a fault is ... well, rubbish.
 
First, they must have a microcode testing program that runs through every possible starting value and gives the maximum result for every combination of starting values. That way it's impossible not to see that there may be excessive voltage.

You would think they'd have that, but I've supported a number of massive programs where the SW test set is...lacking to say the least.

Also consider your sampling: how long is the higher-than-normal voltage actually on the line? We could be talking a few nanoseconds, which is unlikely to get detected even by BIOS/UEFI-level programs; you'd have to throw a scope on the CPU to catch that type of transient.
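A rough sketch of why software polling misses it (both figures below are assumptions for illustration):

```c
/* Illustration: odds of a periodic software poll landing on a nanosecond-scale
 * voltage transient. Both figures are assumed values for the example. */
#include <stdio.h>

int main(void) {
    double spike_duration_s = 10e-9;  /* assume a 10 ns overshoot on the core rail */
    double poll_interval_s  = 1e-3;   /* assume telemetry sampled at 1 kHz */

    double chance_per_spike = spike_duration_s / poll_interval_s;  /* fraction of time the poll can see */

    printf("chance a given spike is ever sampled: %.0e (about 1 in %.0f)\n",
           chance_per_spike, 1.0 / chance_per_spike);
    return 0;
}
```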
 
Oops! Every possible result from every possible initial condition adds up to far more possibilities than there are atoms in the observable universe. Even for just *one* single initial state, the possible outputs can in some cases be infinite. Google the incomputability of the Turing machine halting problem for details.

Then limit the variables so that there are fewer possibilities. Simple as that. If Intel really wants to make simple microcode so complex that it cannot be reliably tested, then it's flawed. They most likely have thresholds for values, and therefore this bug claim is simply ridiculous.

I'm sorry, but it's far less "basic" than you believe. The PWM already has a hard maximum voltage limit. But there are many other 'maximum' voltages the CPU can have, depending on load, core usage and temperature, and other factors. To repeat: testing "every possible combination" would take longer than the lifespan of the universe.

We are still talking about microcode. So if the microcode "tries" to apply too much voltage, you say it's impossible to hard-limit that output? Also, when using thresholds for values, the number of options is well within the calculation limits of today's computers.

Basically what you are saying is that Intel cannot limit the amount of voltage the microcode requests for the CPU. If it were really that hard, we would have had dozens of CPU models fried over the last 10 years.

Now the truth outs. Oodle isn't a compression algorithm as you claimed, but rather a specific library of algorithms ... and one used only for certain video games (e.g. those using the Unreal Engine).

Yeah, agreed.

So your claim that all Intel testers needed to do to find this issue was run a common compression algorithm and instantly see a fault is ... well, rubbish.

Not instantly, but at least they knew that Unreal Engine games using Oodle were causing crashes. Much easier than "random crashes happen in some programs sometimes, and those programs are always different ones".
 
You would think they'd have that, but I've supported a number of massive programs where the SW test set is...lacking to say the least.

Also consider your sampling: how long is the higher-than-normal voltage actually on the line? We could be talking a few nanoseconds, which is unlikely to get detected even by BIOS/UEFI-level programs; you'd have to throw a scope on the CPU to catch that type of transient.

This is about a lack of testing, and that seems to be the problem. Again, it's almost two years after the Raptor Lake launch (CPU samples have been ready for almost three years), and Intel claims the microcode bug was present all that time without them noticing anything?

I really do expect a CPU manufacturer to have excellent tools for monitoring CPU voltages, miles better than the basic motherboard chips that report them.
 
I may be wrong here, but not all chips perform well at a given voltage/power level. While lowering the voltage may not affect all users, it may result in some lower-quality chips potentially running slower. The "default" voltage and power would have some sort of buffer to address this inequality so that the chips can perform at a certain expected level.
I addressed that in a later post, if you'd bother to read it. It could only be that they have been selling, in part, lower-quality chips that needed the higher voltages. I'm not the only lucky one who managed to lower voltages and overclock while keeping the system stable, which means keeping it at default shouldn't be a problem. I don't know whether most chips or only some chips are good, but I haven't had a single crash in the year-plus since I bought mine, and I have been stress testing it thoroughly.
 
To put it another way, Intel's new patch will mean clock speeds will be lower.

That also means any results obtained with Raptor Lake CPUs using the older microcode are NOT valid. Either the CPU is not stable, or the speed is lower than with CPUs running the fixed firmware. I suggest TechSpot leave all Raptor Lake CPUs out of the upcoming Zen 5 article. Now that the CPUs are defective, there is no reason to test them, right?

Undervolting does not mean lower clock speeds.
 
Undervolting does not mean lower clock speeds.
Lower voltages will mean lower clock speeds too. Unless the lower voltages apply to something other than high-clock-speed scenarios, which again makes zero sense. If that really is the case, then Intel really is full of morons, as I said previously.
 
Lower voltages will mean lower clock speeds too. Unless the lower voltages apply to something other than high-clock-speed scenarios, which again makes zero sense.
Which part of "the voltage was calculated incorrectly" did you not understand? We've explained this to you repeatedly. Of course the voltage makes 'zero sense' -- it was never intended to be that high.
 
Which part of "the voltage was calculated incorrectly" did you not understand? We've explained this to you repeatedly. Of course the voltage makes 'zero sense' -- it was never intended to be that high.
No matter how it's calculated, it could be limited to a certain value in the microcode. That's a very simple task. And it took Intel almost two years to notice: "hey, we have too much voltage! We should perhaps limit that!"

To put things in perspective, it took AMD three days to hard-limit (via AGESA) the SoC voltage to 1.3 V after a few 3D V-Cache CPUs burned. Non-3D-cache CPUs could probably handle more.

Once again: if Intel thinks the CPU cannot handle more than x.xx volts, why didn't Intel limit that voltage in the microcode? So that the microcode can never give the CPU more than x.xx volts in any case. That's a trivial thing to do. If what Intel is saying is true, then:

1. Intel "forgot" to limit the maximum voltage in the microcode.

2. It took Intel about two years AFTER the CPUs were released (samples were ready well before that) to notice this kind of "bug", which should never have existed in the first place.

So yeah, Intel is full of morons if that's true. It makes zero sense.

What really happened is the following: Intel thought the CPU could handle x.xx volts, so they limited the voltage to that value. Now Intel has to admit the CPUs cannot handle those x.xx volts. They just cannot admit they supplied too much voltage on purpose, so they decided it's better to talk about a "bug", not a "feature".
 
Once again: if Intel thinks the CPU cannot handle more than x.xx volts, why didn't Intel limit that voltage in the microcode?
They did. As we've repeatedly explained to you. That microcode contains a calculation error that -- in certain rare cases -- miscalculates the limit. Stop being a drama queen.

What really happened is the following ... Intel just cannot admit they supplied too much voltage on purpose, so they decided it's better to talk about a "bug", not a "feature".
Nice conspiracy theory. Got any proof?
 
They did. As we've repeatedly explained to you. That microcode contains a calculation error that -- in certain rare cases -- miscalculates the limit. Stop being a drama queen.
Miscalculates. Limit.

See? If there is a limit, there's no problem with miscalculations. No matter what the result is, the actual voltage cannot exceed the limit. Again, microcode is quite simple to simulate. Running all possible choices does not take too much time because there are not that many possibilities.
Nice conspiracy theory. Got any proof?
Just using my brain. So if the microcode requests, say, 3 volts, does the motherboard gladly deliver it? The CPU would probably burn instantly. That obviously doesn't happen, because there is a maximum limit. So if the microcode has a bug and it wants, say, 3 volts, then what? Well, there is a defined maximum value so that the CPU cannot get more than that, no matter how much is requested. This is so simple that Intel's explanation is utterly stupid. Simple as that. And what is "too much voltage"? You only know what is too much AFTER something breaks down. BEFORE that happens, that "too much voltage" is NOT too much.
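To make the point concrete, a minimal sketch of the clamp I'm describing -- the ceiling value is a made-up placeholder, not any real Intel number:

```c
/* Sketch of the hard clamp being argued for: whatever the voltage calculation
 * produces, the request is capped before it reaches the voltage regulator.
 * The 1.72 V ceiling is an invented placeholder. */
#include <stdio.h>

static double clamp_vid(double requested_v, double absolute_max_v) {
    return (requested_v > absolute_max_v) ? absolute_max_v : requested_v;
}

int main(void) {
    double absolute_max_v = 1.72;  /* assumed "never exceed" value */
    double buggy_request  = 3.00;  /* whatever a miscalculation spits out */

    printf("requested %.2f V, delivered %.2f V\n",
           buggy_request, clamp_vid(buggy_request, absolute_max_v));
    return 0;
}
```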

And now they're saying the microcode has had a bug for around two years, on CPUs Intel has sold something like 100M units of, while there have been reports of broken CPUs for roughly those same two years. Really 🤦‍♂️
 
Miscalculates. Limit.

See? If there is a limit, there's no problem with miscalculations. No matter what the result is, the actual voltage cannot exceed the limit.
I feel like I'm arguing with a character in an early Cronenberg film. There is such a hard limit, enforced by the PWM, which doesn't require calculation and thus is never exceeded.

There are other, lower limits, however, based on what portions of the chip are active and what load(s) they're running. Unavoidably, a dynamic response like this *does* require a calculation. You can argue that Intel shouldn't design chips this way, but you're simply arguing for a return to the fixed-power, fixed-clock rate CPUs of 30 years ago. No thanks. It isn't just Intel taking this approach, but AMD and all other CPU designers also.

Again, microcode is quite simple to simulate. Running all possible choices does not take too much time because there are not that many possibilities.
Now you're just being silly. Intel's microcode is thousands of lines of code with hundreds of execution paths -- and each single path also depends on hundreds of register and other values. One single 64-bit register alone can hold 18 million million million values; the total number of possible scenarios is larger than the number of atoms in the observable universe.

The mere fact that countless millions of these CPUs run code day in and day out, for months and years on end without experiencing the problem is definitive proof it's not a scenario that crops up often. Unless, of course, you're running the Unreal Engine.
 
I feel like I'm arguing with a character in an early Cronenberg film. There is such a hard limit, enforced by the PWM, which doesn't require calculation and thus is never exceeded.

There are other, lower limits, however, based on what portions of the chip are active and what load(s) they're running. Unavoidably, a dynamic response like this *does* require a calculation. You can argue that Intel shouldn't design chips this way, but you're simply arguing for a return to the fixed-power, fixed-clock rate CPUs of 30 years ago. No thanks. It isn't just Intel taking this approach, but AMD and all other CPU designers also.
You finally understood there are limits, and now you say those limits are not enforced because they require calculation. Again, the calculation doesn't matter if there is a limit. No matter what result the calculation gives, if it cannot exceed a certain value, it doesn't really matter.
Now you're just being silly. Intel's microcode is thousands of lines of code with hundreds of execution paths -- and each single path also depends on hundreds of register and other values. One single 64-bit register alone can hold 18 million million million values; the total number of possible scenarios is larger than the number of atoms in the observable universe.
We are talking about voltages here, not that much code execution. What do we need to consider when deciding how much voltage is needed?

- CPU clock speed, that is, the multiplier. That gives at most 60 values.
- Temperature, say in 1-degree intervals; that gives at most 100 values.
- CPU load, say in 1-percent intervals; that gives at most 100 values.
- Consider these for at most 8 P-cores and 4 clusters of 4 E-cores; that gives a number that depends on the setup but is not too big.
- Something else, like amperage.

We are pretty far from even 32-bit integer ranges here. Every possible scenario can easily be calculated because there simply are not that many possibilities. And once every calculation has been run, it's pretty much impossible for a too-high value in any scenario to go unnoticed.
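A rough tally of my own list, under the simplifying assumption that each core or cluster is checked on its own rather than in every combination with the others:

```c
/* Rough count of the scenarios in the list above, assuming each core (or
 * E-core cluster) is swept independently. Purely illustrative. */
#include <stdio.h>

int main(void) {
    long multiplier_steps  = 60;   /* clock multiplier values */
    long temperature_steps = 100;  /* 1-degree intervals */
    long load_steps        = 100;  /* 1-percent intervals */
    long core_domains      = 8 + 4;/* 8 P-cores plus 4 E-core clusters */

    long per_domain = multiplier_steps * temperature_steps * load_steps;  /* 600,000 */
    long total      = per_domain * core_domains;                          /* 7,200,000 */

    printf("per domain: %ld, total: %ld scenarios\n", per_domain, total);
    return 0;
}
```

A few million scenarios is nothing for a modern machine to sweep.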

To put it another way, how did Intel figure out there was a bug? They had to, eh, calculate a huge number of values to notice that some calculation gives too big a value? That's about the only way to notice it.
The mere fact that countless millions of these CPUs run code day in and day out, for months and years on end without experiencing the problem is definitive proof it's not a scenario that crops up often. Unless, of course, you're running the Unreal Engine.
Right. Now you are claiming some CPUs don't break because they never experience this "rare scenario"? In other words, if this scenario happens even once, the CPU is broken? And those whose CPUs have not failed yet simply have never experienced this "rare" scenario, EVER? That's just BS.

Again, some CPU units can handle a certain voltage while other units cannot. It's very evident that Intel thought all CPUs could manage with certain voltage limits, but now it's clear some cannot. Because Intel cannot admit the mistake (the limit was too high), they just say it's a bug.

Intel has already lied about the Raptor Lake issues (claiming motherboard makers were the ones to blame), but now Intel admits it's indeed their own fault. Because Intel has already lied, only *****s consider Intel trustworthy on this issue anymore.
 
now you say those limits are not enforced because they require calculation.
No, I didn't say anything even approximately close to that. Why not read posts before replying to them? It's all there in black and white, above.

We are talking about voltages here, not that much code execution. What do we need to consider when deciding how much voltage is needed? (...list omitted...)
Even your own list generates several million combinations. But it ignores many factors -- it's not just the number of cores alone, but which internal portions of each core are active, and there are multiple temperature readings, rather than one single value.

But by far the worst mistake is pretending that these calculations depend only on "significant" values. You don't see why a 0.1, or even a 0.001, degree temperature change should matter, so you pretend those tiny gradations don't exist. But we're in the world of binary computing here. The difference between 2,147,483,647 and 2,147,483,648 can, in some cases, be astronomical. Write code sometime, particularly assembler code. You'll learn this fast.
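A minimal example of that last point (the wraparound shown is what common compilers do when an out-of-range value is converted to a signed 32-bit integer):

```c
/* Why a difference of one can be "astronomical" in binary arithmetic:
 * 2,147,483,647 is the largest signed 32-bit value; one more wraps around. */
#include <stdio.h>
#include <stdint.h>

int main(void) {
    int64_t just_fits = 2147483647;               /* INT32_MAX */
    int32_t wrapped   = (int32_t)(just_fits + 1); /* converting 2,147,483,648 to int32_t is
                                                     implementation-defined; common compilers
                                                     wrap it to -2,147,483,648 */

    printf("2147483647 fits, but one more becomes %d\n", (int)wrapped);
    return 0;
}
```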
 
No, I didn't say anything even approximately close to that. Why not read posts before replying to them? It's all there in black and white, above.

Again, if there is a hard limit, then when the calculation gives a value that is over or under that limit, the hard limit enforces it. In other words, if the calculation produces a number out of range, then either the CPU should crash (perfect for testing) or the max/min value is used.

It's just a stupid excuse to say the motherboard supplied too much voltage because the CPU microcode had an error that requested it.

Even your own list generates several million combinations. But it ignores many factors -- it's not just the number of cores alone, but which internal portions of each core are active, and there are multiple temperature readings, rather than one single value.

But by far the worst mistake is pretending that these calculations depend only on "significant" values. You don't see why a 0.1, or even a 0.001, degree temperature change should matter, so you pretend those tiny gradations don't exist. But we're in the world of binary computing here. The difference between 2,147,483,647 and 2,147,483,648 can, in some cases, be astronomical. Write code sometime, particularly assembler code. You'll learn this fast.

How many bits are those microcode values? Are they hex or binary? You can easily limit the number of possible combinations that way.

What we are talking about are voltages that broke CPUs. I have developed a good BS detector, and it usually tells me when a manufacturer is talking BS.

You still haven't answered the following questions:

1. Why didn't Intel notice this "bug" before release?
2. Why did it take Intel about two years after release to notice this "bug", despite reports that CPUs were breaking down?
3. How did Intel actually notice this bug? Did Intel check the microcode manually, or did they just brute-force every possible combination?
4. Whatever the answer to number 3 is, how do we explain 1 and 2?

My theory is the only one that actually makes any sense. That is, Intel thought the voltage was OK, but it wasn't.
 