Random Reboots

navinag

I have a rebooting issue with my new build. After each reboot, Windows tells me that it has recovered from a serious error, and it leaves a minidump. I have the minidump files and will post them if that helps.

My system has never crashed while I was using it. It always seems to happen when I leave it on overnight or during the day, and I find that it has crashed when I come home. I am running a dual-monitor setup. I will change the "restart on Failure" setting and post the BSOD message the next time it occurs.

System information
  • Windows XP Pro X64
  • 2 x HD 500G|WD WD5001AALS
  • PSU ANTEC|EA500 500W
  • MB GIGABYTE GA-EP45-UD3P P45 775
  • CASE COOLERMAS|CAC-T05-UW BLK
  • 8GB total --> MEM 2Gx2|GKS F2-8000CL5D-4GBPQ
  • VGA XFX PVT98WYDFH 9800GTX+ 512M

Processors Information
------------------------------------------------------------------------------------

Processor 1 (ID = 0)
Number of cores 4 (max 4)
Number of threads 4 (max 4)
Name Intel Core 2 Quad Q9550
Codename Yorkfield
Specification Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz
Package Socket 775 LGA (platform ID = 4h)
CPUID 6.7.A
Extended CPUID 6.17
Core Stepping E0
Technology 45 nm
Core Speed 1999.8 MHz (6.0 x 333.3 MHz)
Rated Bus speed 1333.2 MHz
Stock frequency 2833 MHz
Instructions sets MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, EM64T
L1 Data cache 4 x 32 KBytes, 8-way set associative, 64-byte line size
L1 Instruction cache 4 x 32 KBytes, 8-way set associative, 64-byte line size
L2 cache 2 x 6144 KBytes, 24-way set associative, 64-byte line size
FID/VID Control yes
FID range 6.0x - 8.5x
max VID 1.238 V
Features XD, VT
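
The core speed in the dump above is just the multiplier times the bus clock, so the 1999.8 MHz reading does not by itself point to a misconfigured CPU. Here is a minimal sketch of that arithmetic, using only the figures reported above. The function name is made up for illustration, and the factor of 4 for the rated bus speed assumes the quad-pumped front-side bus of this platform.

```python
def core_speed_mhz(multiplier: float, bus_mhz: float) -> float:
    """Core clock is the CPU multiplier times the bus clock."""
    return multiplier * bus_mhz


BUS_MHZ = 333.3    # bus clock taken from the dump
LOW_MULT = 6.0     # bottom of the reported FID range
HIGH_MULT = 8.5    # top of the reported FID range

idle = core_speed_mhz(LOW_MULT, BUS_MHZ)     # matches "Core Speed"
stock = core_speed_mhz(HIGH_MULT, BUS_MHZ)   # matches "Stock frequency"
rated_bus = 4 * BUS_MHZ                      # assumes a quad-pumped FSB

print(f"lowest multiplier : {idle:.1f} MHz")
print(f"highest multiplier: {stock:.0f} MHz")
print(f"rated bus speed   : {rated_bus:.1f} MHz")
```

The 6.0x reading is consistent with the CPU dropping to its lowest multiplier while idle, which the "FID/VID Control yes" entry suggests is enabled.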

Memory SPD
------------------------------------------------------------------------------

DIMM #1

General
Memory type DDR2
Module format Regular UDIMM
Manufacturer (ID) G.Skill (7F7F7F7FCD000000)
Size 2048 MBytes
Max bandwidth PC2-6400 (400 MHz)
Part number F2-8000CL5-2GBPQ

Attributes
Number of banks 2
Data width 64 bits
Correction None
Nominal Voltage 1.80 Volts
EPP yes (1 profiles)
XMP no

Timings table
Frequency (MHz) 266 400
CAS# 4.0 5.0
RAS# to CAS# delay 4 5
RAS# Precharge 4 5
TRAS 10 15
TRC 16 24

EPP profile 1 (full)
Voltage level 2.100 Volts
Address Command Rate 2T
Cycle time 2.000 ns (500.0 MHz)
tCL 5.0 clocks
tRCD 5 clocks (10.00 ns)
tRP 5 clocks (10.00 ns)
tRAS 15 clocks (30.00 ns)
tRC 45 clocks (90.00 ns)
tWR 12 clocks (24.00 ns)
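
The nanosecond figures in the EPP profile follow directly from the cycle time: each timing in clocks is multiplied by 2.000 ns, and the frequency is the inverse of the cycle time. A small sketch of that conversion, with helper names that are illustrative only:

```python
def clocks_to_ns(clocks: int, cycle_time_ns: float) -> float:
    """Convert a timing given in clock cycles to nanoseconds."""
    return clocks * cycle_time_ns


def cycle_time_to_mhz(cycle_time_ns: float) -> float:
    """Convert a cycle time in nanoseconds to a frequency in MHz."""
    return 1000.0 / cycle_time_ns


CYCLE_NS = 2.0  # EPP profile 1 cycle time from the dump
epp_clocks = {"tRCD": 5, "tRP": 5, "tRAS": 15, "tRC": 45, "tWR": 12}

print(f"frequency: {cycle_time_to_mhz(CYCLE_NS):.1f} MHz")
for name, clocks in epp_clocks.items():
    print(f"{name}: {clocks} clocks = {clocks_to_ns(clocks, CYCLE_NS):.2f} ns")
```

Note that the EPP profile asks for 500 MHz at 2.100 V, while the plain SPD table tops out at 400 MHz at 1.80 V. The thread does not say which of the two the board is actually applying.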

I have included the minidump file.

Intel Core 2 Quad Q9550 hardware monitor

Temperature sensor 0 38°C (100°F) [0x3E] (Core #0)
Temperature sensor 1 38°C (100°F) [0x3E] (Core #1)
Temperature sensor 2 40°C (103°F) [0x3C] (Core #2)
Temperature sensor 3 36°C (96°F) [0x40] (Core #3)

GeForce 9800 GTX/9800 GTX+ hardware monitor

Temperature sensor 0 56°C (132°F) [0x38] (GPU Core)
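
The bracketed hex values look like the raw sensor readings. For the CPU cores, each reported temperature equals 100 minus the raw value, which is what you would expect from a sensor that reports the distance below a maximum junction temperature. For the GPU, the raw value is the temperature itself. The sketch below reproduces the numbers; the 100°C maximum is an assumption inferred from the readings, not a confirmed figure for this CPU.

```python
ASSUMED_TJMAX_C = 100  # assumption: inferred from the readings above


def core_temp_c(raw: int, tjmax_c: int = ASSUMED_TJMAX_C) -> int:
    """Core temperature, treating the raw value as degrees below the maximum."""
    return tjmax_c - raw


cpu_raw = {"Core #0": 0x3E, "Core #1": 0x3E, "Core #2": 0x3C, "Core #3": 0x40}
for core, raw in cpu_raw.items():
    print(f"{core}: raw {raw:#04x} -> {core_temp_c(raw)} C")

gpu_raw = 0x38  # the GPU raw value appears to be the temperature directly
print(f"GPU Core: raw {gpu_raw:#04x} -> {gpu_raw} C")
```

If the assumed maximum is off, every core reading shifts by the same amount, so the absolute CPU temperatures should be treated as approximate.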

I ran memtest with no errors.

I am also running the latest BIOS for the motherboard (F8).

I have reseated the memory.

I believe that I have all the updated drivers for all my hardware.

It would be great if I could get some information from the minidump that could narrow the debug process.

Any help would be appreciated. I would be happy to provide any other information that someone might need.

--Navin
 
Your error is 0x0000009C: MACHINE_CHECK_EXCEPTION

This is a hardware issue: an unrecoverable hardware error has occurred. The parameters have different meanings depending on what type of CPU you have but, while diagnostic, rarely lead to a clear solution.

Most commonly it results from overheating, from failed hardware (RAM, CPU, hardware bus, power supply, etc.), or from pushing hardware beyond its capabilities (e.g., overclocking a CPU).

1. Have you overclocked? If so, try easing back on your timings/voltage.

2. Take off the side panel and let a small fan, on its lowest setting, blow into your system. Does it still crash over time? We're checking for a heat issue here.

3. Are all fans working?

4. How long did you run memtest?

5. If you have a multimeter, test your PSU. Craftsman makes a nice digital one for $20.
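
For step 5, one way to judge multimeter readings is to compare each positive rail against its nominal voltage. This is a minimal sketch, assuming the ±5% tolerance commonly cited from the ATX power supply design guide for the +3.3 V, +5 V and +12 V rails. The readings in the example are hypothetical, not measurements from this system.

```python
# Nominal voltage and allowed fractional deviation per rail.
# Assumption: +/-5% tolerance on the positive rails.
RAILS = {
    "+3.3V": (3.3, 0.05),
    "+5V": (5.0, 0.05),
    "+12V": (12.0, 0.05),
}


def rail_in_spec(rail: str, measured_v: float) -> bool:
    """Return True if a measured DC voltage is within tolerance for the rail."""
    nominal, tolerance = RAILS[rail]
    return abs(measured_v - nominal) <= nominal * tolerance


# Hypothetical multimeter readings, for illustration only.
readings = {"+3.3V": 3.31, "+5V": 5.08, "+12V": 11.32}
for rail, volts in readings.items():
    status = "ok" if rail_in_spec(rail, volts) else "OUT OF SPEC"
    print(f"{rail}: {volts:.2f} V {status}")
```

Readings taken while the system is under load tell you more than idle ones, since a weak supply tends to sag only when it is stressed.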
 


1. I am running stock settings and not overclocking anything.
2. I will try the fan experiment next and report back.
3. All my fans are working. I can see their RPM readouts and can physically see them running.
4. I ran memtest for two hours. How long should I run it?
5. Where do you suggest I test my PSU? At the connectors going into the MB, or where it comes out of the wall? Can you clarify?

On a side note, I have only seen this error when I have a screensaver enabled, and my screensaver is one of those virtual aquarium ones. I turned that off today and left the machine off, so the monitors should just go to sleep. If I don't see a failure today, that would lead me to believe that it's my video card somehow. Perhaps dynamic fan speed control is not working on that card while the screensaver is running. Who knows. If that is the case, then running an external fan with the case open should switch that issue off.


What should the average case temperature be? Meaning the temp of the MB and surrounding components? I would expect that to be relatively cool.

I will post the new results of these experiments.
The downside is that at one experiment per day, this takes a long time.

--Navin
 
Correction to my last post: where I wrote that I "left the machine off", I meant that I left the machine on.
 
So I got home this afternoon and found that my system was not sitting at a BSOD. The main difference was that I disabled the screen saver this morning before I went to work. My monitors went to sleep and the video card was not doing anything.

Having said that, I think it might be related to the video card. I have been running Prime95 all afternoon, about 4 hours now, and all 4 cores are cranking and maxing out at 70°C, with memory usage at about 2.2 GB. So far no errors. Tomorrow I will turn the screensaver back on, open the case, and stick a fan in front of it. Who would have thought that the aquarium screensaver would stress my video card that much? I think it would not have, except that I am driving dual monitors. I will post the latest results tomorrow. Does anyone know if there is a bug with the Nvidia 9800GTX+ related to cooling when screen savers are on, perhaps something to do with dynamic fan control? Just a thought; I know it is a bit far-fetched. Perhaps I can crank my Nvidia fan to max and run the same test with the case closed. Is there a way to do that manually?
 
This is good diagnostic work. We'll definitely look forward to the further steps you take and the ensuing results.

As for your question about a cooling problem on the 9800GTX+ when screen savers are on, I have not come across anything along those lines.
 
So I tried to stress my CPU, memory and video card using Prime95 and Video Stability Test. The CPU temps go up to 70°C and my GPU got to 81°C. I cannot imagine that the aquarium screen saver is more intensive than that. This leads me to believe that my experiment (screen saver on, case open, fan blowing in) will still end in a BSOD. If that is the case, then it could be an issue with this screen saver running on XP Pro x64. I will try the normal Windows screen savers later to see if the problem persists. I have been at this computer on and off all afternoon, approx. 7 hours, doing all kinds of crap (mem tests, video tests, CPU tests) with no failures. I would certainly think that this constitutes a stable system... but I could be wrong.

--Navin
 
So... it turns out that I am an *****. I had an epiphany this morning when I awoke: "Where the hell is all the air for the GPU being vented?" Ahhh... how could I be so stupid? I went back to look at my GPU installation. Lo and behold, I forgot to remove the second slot cover from the case to expose the GPU vents. As a result, this was probably making all the components below my GPU run hotter than expected (the lower half of the motherboard, my hard drives, SATA connections, etc.). Although my stress test on the GPU did not show it going above 81°C, that doesn't mean that other components weren't roasting in my case. Other components on the graphics card that don't normally get hot could have been heating up as well. The possibilities are endless: memory getting too hot, hard drives, MB, etc. My case temp did not show anything too hot, though. I would guess that some components that are normally well cooled, and don't even have a temperature monitor, were heating up.

What does this mean? I have to run a new baseline. Hopefully I am right. I will leave the computer on today with the screen saver cranking and see if I get another BSOD. If I am wrong, then it's back to the drawing board.
 
The message that Windows has recovered from a serious error is caused by a faulty hard drive. The only way to fix it is to replace the drive. Sorry.


Check the definition of 0x9C again. It can have several causes, and though the hard drive is a possibility, heat seems to be the issue, and navinag is taking steps to determine whether it is.
 
Will a harddrive diagnostic detect this fault? I will try running one.

It could, since the definition of this error notes that it can be caused by many hardware issues. It would not hurt to run a diagnostic on your HD. My point was that calling this the definitive cause without further information is the proverbial shot in the dark.

Again, the definition and possible causes: This is a hardware issue: an unrecoverable hardware error has occurred. The parameters have different meanings depending on what type of CPU you have but, while diagnostic, rarely lead to a clear solution.

Most commonly it results from overheating, from failed hardware (RAM, CPU, hardware bus, power supply, etc.), or from pushing hardware beyond its capabilities (e.g., overclocking a CPU).
 
One can make some educated guesses, and I think I have made some reasonable ones. I really do think that the problem is from overheating. It may still occur, and I will keep my eye on the hard drives.

I did run a bunch of hard drive diagnostics and found zero errors. I could probably still move the hard drives a bit further apart to increase air flow around them. They are crammed pretty close together.

In the meantime I might clone my main harddrive or do some sort of backup that is easy to restore.

I will post an update in a few days if all is well.
 