Intermittent Crashing with Strange Temperature Fluctuations

Hey all,

In the past two weeks, I started experiencing a problem that has been baffling me. I apologize for the length of this post, but I'm trying to be as detailed as I can.
It began while I was watching a TV show on my computer (streaming). My laptop (http://speccy.piriform.com/results/sAc2sdkoQHbUKAc6xgrxpI6) experienced a complete system shutdown. I was rather shocked because the only time I've ever seen such a crash was due to overheating, and video streaming barely stresses my rather capable, though outdated, graphics card.
I attempted to power my system back up, but all I got was the indicator lights and keyboard back-lighting coming on and immediately going back off. Again, consistent with overheating. So I let it sit for several minutes. I came back to it, and everything powered back up normally. So while I'm sitting there looking at my temperatures (Core Temp and GPU-Z), thinking everything looks just fine, I hear my fan start picking up speed. Now, normally the laptop has about 6 fan speed settings, including off. I say 'about' because ASUS saw fit to lock fan speed controls, so I can't really do much of anything except guess at what's happening based on the sound. Anyway. Normally, when temperatures reach a particular threshold, the fan simply picks one of these 6 speed settings, and goes directly to that RPM with very little spool-up time. As I'm sitting there, the fan went from speed 1 to full blast over the course of about 10 - 12 seconds, as if someone was slowly turning a knob to crank the voltage. No steps, just gradual spool-up. Very strange.
Stranger still, once it reached max RPM it stayed there, changing for only one reason: once the GPU temperature dropped to 40, the fan would cut off entirely (which is the normal threshold, as it has always been). Then when it climbed back up to 50, it would kick back on full blast (again, normal threshold, but nowhere near normal speed).
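For anyone trying to picture the on/off behavior described here, this is a rough Python sketch of a two-threshold (hysteresis) switch using the 40 and 50 degree figures from the post. The function names are made up for illustration, and whether the real firmware uses "at or above" rather than "strictly above" is an assumption. It says nothing about how ASUS actually implements fan control.

```python
def fan_should_run(temp_c, fan_running, on_at=50, off_at=40):
    """Two-threshold switch: turn on at/above on_at, off at/below off_at.

    The 50/40 defaults come from the post; the >= / <= comparisons
    are an assumption about the exact boundary behavior.
    """
    if temp_c >= on_at:
        return True
    if temp_c <= off_at:
        return False
    # Between the two thresholds: keep whatever state the fan is already in.
    return fan_running


def simulate(temps, fan_running=False):
    """Feed a sequence of temperatures through the switch, return fan states."""
    states = []
    for t in temps:
        fan_running = fan_should_run(t, fan_running)
        states.append(fan_running)
    return states


# Made-up temperature sequence, only to show the gap between thresholds.
print(simulate([45, 50, 45, 40, 45]))
```

The point of the gap between the two thresholds is that the fan does not flap on and off when the temperature hovers around a single value.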
So at this point, I figure that something must've glitched and I reboot my system. Same exact behavior as the initial boot-up, it just took a bit longer for the fan to decide to spool up. I decided to let it be for a while, see if anything would change. Half-hour later, still the same. So now I decide to do a proper boot cycle, flush as much as possible. I power down, unplug, and remove the battery. At this point, I decide to check if maybe it had in fact gotten dirty in the last 6 months since I last cleaned and re-pasted. Nope, quite clean. But for the hell of it I cleaned the fan and cooling fins, and re-lubed the fan (haven't decided to re-paste again because I'm quite sure that isn't at fault, but am happy to do so if the community advises).
So I cross my fingers and boot back up. I let it sit at the desktop for a couple minutes listening for the fan, but nothing. Everything seems to be fine, temperatures at nominal. "OK", I think, "technology gremlins are messing with me" and decide that I can go back to watching my show, since at the very least I need to catch the behavior again before I can really do anything. I watch for at least an hour, checking on temperatures periodically, and everything is fine. Then, like someone snapped their fingers, crash. When I attempt to power up, this time instead of flashing the lights, I get constant lights. Hard drive doesn't spool up, fan blasts at speed setting 5 (just under max). I follow the same procedure as before and unplug/remove battery.
This time when I boot up, the GPU temperatures are not normal. It idles quite high (in the 60s), and even the smallest action (a mouse click, even without opening anything) causes the temperature to spike to 87/88 and build up to as high as 93/94. Now, normal gaming and high usage is between 75 and 85 for my GPU. At 85, the fan kicks into speed 5 and cools it to 75, then drops back down to speed 3/4, depending on CPU temperature (for which I'm honestly not positive what the thresholds are, because the GPU drives fan speed most of the time). During long gaming sessions stressing both CPU and GPU (I'm talking 3+ hours of constant gameplay) the most the GPU ever reaches is 92, maybe 93 on a hot day, and the CPU rarely sees the 70s, mostly staying in the 50-60 range. Both have their safety cut-off set at 105.
At this point, I begin logging my GPU temperatures while reading articles online and watching something on YouTube. It stays in the 87 to 94 range until suddenly it stops; temperatures drop back down to nominal and I don't see it again the rest of the day. I have never seen temperature fluctuations of +/- 20 degrees in a second.
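If it helps anyone compare against their own logs, here is a minimal Python sketch for flagging the kind of jump mentioned above (20 degrees or more between readings about a second apart). It assumes the log has already been loaded into a list of (seconds, temperature) pairs; the function name, the sample data, and the default limits are all hypothetical, and loading the actual GPU-Z log file is left out because the exact column layout isn't shown here.

```python
def find_jumps(samples, max_delta=20.0, max_gap_s=1.0):
    """Return (t_prev, t_cur, delta) for consecutive readings that are at most
    max_gap_s seconds apart and differ by at least max_delta degrees.

    samples: list of (seconds, temp_c) tuples in time order.
    """
    jumps = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        if t1 - t0 <= max_gap_s and abs(c1 - c0) >= max_delta:
            jumps.append((t0, t1, c1 - c0))
    return jumps


# Made-up readings for illustration only, not taken from the attached log.
example = [(0, 62), (1, 88), (2, 90), (3, 65)]
for t_prev, t_cur, delta in find_jumps(example):
    print(f"{t_prev}s -> {t_cur}s: {delta:+} C")
```

A real heatsink can't physically change temperature that fast, so any rows this flags are worth comparing against fan speed and clock readings at the same timestamps.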
The computer worked fine all of the next day, but the day after I heard the fan start acting slightly strange, running on speed 3/4 when I was just browsing the internet (usually 1 or 2). I checked GPU temps, but they were fine. This time it was the CPU running way hotter than usual, and spiking from activity, idling in the high 40s and spiking up to mid 60s. Don't ask me why I didn't log this behavior, my brain just didn't click. But I've been logging ever since and haven't seen a repeat for the CPU yet. Anyway, I decided to stress test, see what a game would do, but it didn't reach above 74. Far below the cut-off of 105. No system crash.
In subsequent days, I managed to log the GPU beginning its erratic behavior while on the internet, so I have both the start and end for the GPU, but only nominal behavior for the CPU. Furthermore, while the temperature reads in the 90s, the air coming out, the feel of my palm rest under which the GPU sits, and the feel of the cooling fins are all consistent with how it feels when my computer is in the 50s. And I guess it's important to mention that during these times, the fan runs in the higher RPM speeds, consistent with the temperature being indicated. It's probably also important to mention that stressing the GPU does not raise the temperatures any further above the already ridiculous temperatures it seems to be running at. Additionally, at no point has the computer crashed while this erratic behavior is being displayed. It only crashes while under seemingly normal operating conditions, both at cooler web-browsing temperatures, and at higher gaming temperatures. Extended idle periods have yet to produce a crash.
The only potential correlation I've been able to note is how I leave my computer when I turn it off. When I unplug it and remove the battery in between sessions, I seem to have 1 or 2 days of normal operation. When I simply power it off and leave it plugged in, I seem to see these issues within 1 or 2 days. It's not a very promising lead. And I have not seen a recurrence of the original fan speeding up symptom.
I thought it might have been a faulty nVidia driver causing this, as I had updated it about 2 weeks before seeing the initial crash, but I have since updated it again, and the problem persists. I have not yet attempted reverting because I feel like other people would have noticed this over two driver releases if that was the problem.

Any and all comments/suggestions are appreciated. My thoughts are:
a) Maybe the temp sensors are malfunctioning, but they've never logged a critical temperature during one of the crashes.
b) The GPU has just begun to wear out from age, but that doesn't explain the CPU behavior (though, of course, there's no log of it... yet)
c) Motherboard is dying, which is probably the most likely, but why is it so intermittent?

I've uploaded the GPU-Z log info showing normal operation under stress (gaming), beginning of erratic behavior, and end. The end happened while browsing the internet; the beginning has happened while gaming and browsing alike. I've attempted to re-create it, but have been unsuccessful. Also, note the strange GPU clock behavior compared to load during the problem versus normal operation. The periodic drops in usage, clocks and temperature during gaming were me checking on temperatures or otherwise alt-tabbing.
 

Attachments

  • GPU Temps_Combined.zip
    116.8 KB
"c) Motherboard is dying, which is probably the most likely, but why is it so intermittent?" The temp sensors are located on the motherboard, so I would say the motherboard needs to be replaced


Yea, that's pretty much what I assumed, but I was really hoping that someone could cheer me up by pointing out something obvious I overlooked :/
Thank you for the reply though.

Since it's a laptop, replacing the motherboard isn't really something I'm looking to do, so is there any advice you could provide for prolonging the current one's life while I save up for a new one?
 
Laptop motherboards are replaced all the time, and can cost less than $150 most of the time. You have replaced the CPU thermal paste. Does or did the motherboard use thermal pads? Sometimes these pads need to be replaced, and thermal paste can't be used in place of the pads.
 
Really? Most of what I hear seems to indicate that swapping a laptop motherboard is somewhat of a nightmare. But considering the relative ease of access I have to the internals on this laptop, I'll look into it. Thanks :)

There are indeed three (if I remember correctly) thermal pads, about 1 sq. cm each, maybe 2 mm thick, towards the end of the heat guides closest to the GPU. They sit between some chassis parts & the end of the heat guide (past the GPU, away from the cooling fins) and the aluminum air guide, which is also part of the fan's chassis assembly, allowing the fan to pull the greatest amount of air over the GPU. I did not replace them as they didn't show any signs of damage, had plenty of stick and were still quite cushy. I'm quite confident they serve minimal heat dissipation purposes, and mostly keep the aluminum piece properly leveled.
 
I repair computers, both laptops and desktops. So I have changed a lot of motherboards in both types. As long as there are no tears or holes in the pads they should be okay
 
I've used AliExpress a few times and got good results. The CPU should be fine, but if the GPU is hosed, at least it may be able to be replaced separately from the motherboard.
 