How to troubleshoot a defective motherboard

Jskid

Where I last worked we received a new server. I set it up and installed Linux, but each time the OS became corrupted. I searched for damaged files in the file system, tried burning a new installation disk, ran memtest, and contacted the provider to make sure the server was designed to run that particular version of Linux. Eventually my supervisor told me, "When all other troubleshooting fails, it's safe to say it's the motherboard." So the motherboard had arrived defective.

Any suggestions on how to test for a faulty motherboard?

I was retelling this story at a recent job interview, and someone there said, "To take it a step lower, it was probably a heat problem where the controller was sending an invalid signal." Is this true, and what does it mean?
 
"it was probably a heat problem where the controller was sending an invalid signal"... This may mean that the motherboard heat sensor is sending an invalid "overheat" signal to the CPU. This is a very rare problem, though. It's a new server, so have the motherboard replaced.
 
Hmm; "motherboard" failures are rare. A chip may fail (e.g. the onboard graphics controller, USB controllers, or even the BIOS), but that such a failure would induce "OS corruption" (a rather non-specific statement, isn't it?) is hard to believe.

Without specific failure information we can only fall back on gross generalities. As examples I cite PAGE_FAULT_IN_NONPAGED_AREA, IRQL_NOT_LESS_OR_EQUAL, and a rash of other bizarre symptoms. These make you believe that the code has become 'corrupt', and certainly the code really did fail. But the root cause of the failure is one of two unseen errors (most likely and most frequent first):
  1. bad memory
  2. hd write error
Obviously software does not tire out, get brittle, or rust. If it ran once, it will run 2^32 epochs into the future.
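
For the second cause, one rough way to look for silent write errors is to write known data to the suspect disk, read it back, and compare checksums. The following is only a minimal sketch: the function name `write_verify` and its parameters are made up for illustration, and it assumes you can create a scratch file on the disk in question. Note that the read-back may be served from the OS page cache instead of the platters, and a mismatch could just as well point to bad memory, so treat the result as a hint, not a diagnosis.

```python
import hashlib
import os
import tempfile


def write_verify(directory, size_mb=16, chunk=1024 * 1024):
    """Write random data to a scratch file, read it back, compare SHA-256 digests.

    Hypothetical helper for illustration; returns True when the digests match.
    """
    written = hashlib.sha256()
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            for _ in range(size_mb):
                block = os.urandom(chunk)
                written.update(block)
                f.write(block)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the data to the device
        read_back = hashlib.sha256()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(chunk), b""):
                read_back.update(block)
        return written.hexdigest() == read_back.hexdigest()
    finally:
        os.remove(path)  # always clean up the scratch file
```

Calling `write_verify("/mnt/suspect", size_mb=512)` several times (the mount point is just an example) and getting even one `False` would be a strong reason to distrust the disk, the controller, or the RAM.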

Because our Windows client systems are run by millions of naive and untrained people of all ages, Microsoft created tools to make software failures easier to diagnose and analyze. Linux crash dumps are no cakewalk: they take a lot of experience to even read, let alone analyze, and tools are almost non-existent.

As to HEAT being a root cause: absolutely possible, and it occurs frequently. Poorly regulated power is another root cause. BTW: both of these also affect our client systems, and no tools will ever find these as a root cause.
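
A tool won't name heat as the root cause, but on Linux you can at least log what the sensors report and see whether the failures line up with high readings. Here is a small sketch that assumes the kernel exposes thermal zones under `/sys/class/thermal` with temperatures in millidegrees Celsius; not every board or server does, and the function name `read_thermal_zones` is made up for illustration.

```python
from pathlib import Path


def read_thermal_zones(base="/sys/class/thermal"):
    """Return {"zone_name:zone_type": degrees_celsius} for each readable zone.

    Hypothetical helper; assumes the sysfs thermal_zone layout is present.
    """
    readings = {}
    for zone in sorted(Path(base).glob("thermal_zone*")):
        try:
            label = (zone / "type").read_text().strip()
            millidegrees = int((zone / "temp").read_text().strip())
        except (OSError, ValueError):
            continue  # zone missing, unreadable, or reporting garbage
        readings[f"{zone.name}:{label}"] = millidegrees / 1000.0
    return readings
```

Running this from a cron job or a loop during an install and keeping the output gives you something to compare against the time of each crash. An empty result only means no zones were exposed, not that the machine is running cool.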
 