SBS 2003 Crashing regularly

Status
Not open for further replies.
It actually did not crash last weekend, but it did crash again last night with the same exact error. The problem has existed to some degree since the server was built, but it has grown more frequent since then. I have updated and reverted about every piece of software and driver I could think of while trying to eliminate the problem, but it doesn't seem to have helped. We are planning on adding additional memory, which is why I was taking so long to reply. I figured that I would see if that helped first. I still haven't looked at heat, which I need to do as well.
 
OK, good luck. It is looking more like a hardware incompatibility, then. Have you checked for a motherboard BIOS update? Have you tried contacting the motherboard and backup drive manufacturers' technical support directly? Be persistently polite on the phone until you get someone who really does know what they are talking about.

I note from an earlier post you do not have the benefit of a certified combination of hardware and OS. That makes it tough to get any help. But until it is solved you could be at legal risk, not having proper backups in place. I have no less than three backups of critical data - tape (where I can go back 18 months or so by the day), a local pc copy taken at night via SyncBackSE software (absolutely great stuff), and finally a warm standby server, also via SyncBackSE nightly.

When you get some more memory - it should be certified for motherboard compatibility. I would then only replace the memory initially, as just adding more may not eliminate the real problem. If you still get the problem with completely fresh memory, then you can cheerfully add the original as well and try the effect of more memory in total.

As an expert once told me, if you factor in all the costs of time involved in trouble-shooting stuff, you can often exceed the costs of throwing it all out and buying a complete new system......
 
I could definitely believe that last statement.

And I figured that I would correct myself from before: We have 2 Xeon processors, not 4. They show as 4 in the Device Manager, I'm assuming because it's a dual-type processor. Not that this is relevant, but I didn't want to leave that stated incorrectly.

Thanks for the tips on the new memory. I probably would have just added to it if you hadn't said that. And the memory we will get is recommended for our motherboard specifically, so that should be OK.

It may be a while before I post the results, as getting this company to pay for tech upgrades is like pulling teeth. Once we manage to convince them that this will likely save money, and then get it shipped, I'll let everyone know how it turns out.
 
I'd still be a bit surprised if memory were the root of this. I feel it must be connected with the tape drive.

However, relevant or not, let me tell you of an experience of mine with our earlier server (Netware5). It would regularly run 200+ days without a reboot. The only reboots were to replace a tape drive (!) and, once, because the power company wanted to replace the company meter!

However, suddenly I would come in in the morning to find the server had rebooted itself during the night, which you would not know if you didn't spot the elapsed server time suddenly becoming hours rather than tens or hundreds of days. This went on for two to three weeks, during which, like you, several things were tried without success. I finally decided to replace the power supply - quite expensive - but no change! Suddenly, it failed during the day, and the solution was found. It was the big, strong UPS, which always appeared fine, passed all the tests etc, etc. But, the batteries were, actually, shot. Why did it only fail at night ? No idea.

Replaced that and everything was cured. Then we got a new server, where I fought a battle to stay with Novell, when all around me wanted SBS2003. For the first few months, I had to reboot after ten to twelve weeks of running time because of falling free memory, and things were looking sad. But at SP3 Novell finally sorted the memory management, and it has since run for hundreds of days at a time, as always expected. I have no reason to regret the decision to avoid Windows, except once when software that was designed to run on SBS2003 had to be installed on a PC instead.
 
Wow. The UPS? I hardly ever even think about that. It's pretty crazy that that only failed at night, too. How did you ever figure out that that was the problem?

Yeah, my reaction to SBS is pretty much: "Meh." I certainly don't think that it is terrible, but it sure isn't great, either. Like you, I think that the biggest pro for Windows in general is software compatibility. I guess that's to be expected with Microsoft's market penetration.
 
Found it was the UPS by eliminating it. Ran the server for a few days on mains only. By then the UPS was supporting a simple PC, which when power fails does NOT restart automatically. Bingo.

If there are lessons, they are
(a) you cannot necessarily trust hardware tests
(b) suspect everything and anything
(c) test everything possible by substitution
 
Well, I thought that the problem might have been fixed. I found a newer driver version for the RAID card that I had never seen before, and I installed it. The server was more stable than usual for the better part of a week, through backups and all, but it then crashed 3 times this weekend. It reported an error very similar to the old one, but the fourth parameter of the stop error is now slightly different. I looked at the minidumps, and they still seem to point to aac.sys, but they report it a little differently than before. Does any of this point more specifically to something? I can post the new dumps if anyone would like to see them.

The new error in the event log is: Error code 000000d1, parameter1 00000024, parameter2 d0000007, parameter3 00000000, parameter4 f74fba62.

I also just noticed that the minidumps used to say "unable to verify timestamp for aac.sys, probably caused by aac.sys," but they now have "unable to verify timestamp for SCSIPORT.SYS, probably caused by aac.sys" as well as "unable to verify timestamp for hal.dll, probably caused by aac.sys." So while it still appears to be the same problem, it is manifesting itself a bit differently than before.

I saw mention of the IRQL problem for one of the devices that may be causing an issue, but I can find no IRQ conflicts. I also tried to find a new driver for the SCSI controller, thinking that it may be causing a conflict, but I cannot find one any newer than the one we have, even though it was created in 2001.
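For anyone comparing stop errors between crashes, here is a minimal Python sketch that splits the event log line quoted above into its fields and labels them. The labels are the argument meanings Microsoft documents for bug check 0xD1 (DRIVER_IRQL_NOT_LESS_OR_EQUAL); the function names and the exact log format matched by the regular expressions are assumptions based only on the single line posted in this thread. As far as I understand it, the IRQL in that name is the kernel's interrupt request level rather than a hardware IRQ assignment, so finding no IRQ conflicts would not by itself rule it out.

```python
import re

# Argument meanings for bug check 0xD1 (DRIVER_IRQL_NOT_LESS_OR_EQUAL),
# taken from Microsoft's bug check reference.
D1_PARAMETER_LABELS = {
    1: "memory address referenced",
    2: "IRQL at time of reference",
    3: "access type (0 = read, 1 = write)",
    4: "address of the code that referenced the memory",
}


def parse_stop_error(text):
    """Return (code, {parameter number: value}) from an event-log style line.

    The expected format is an assumption based on the one line in this thread.
    """
    code_match = re.search(r"Error code\s+([0-9a-fA-F]+)", text)
    if code_match is None:
        raise ValueError("no 'Error code' field found")
    code = int(code_match.group(1), 16)
    params = {
        int(num): int(value, 16)
        for num, value in re.findall(r"parameter(\d)\s+([0-9a-fA-F]+)", text)
    }
    return code, params


def describe(text):
    """Return printable lines; parameters are labelled only for stop 0xD1."""
    code, params = parse_stop_error(text)
    lines = [f"Stop 0x{code:08X}"]
    for num in sorted(params):
        if code == 0xD1:
            label = D1_PARAMETER_LABELS.get(num, "unknown")
        else:
            label = "see the bug check reference for this code"
        lines.append(f"  parameter{num} = 0x{params[num]:08X}  ({label})")
    return lines


# The line exactly as posted in this thread.
LOG_LINE = (
    "Error code 000000d1, parameter1 00000024, parameter2 d0000007, "
    "parameter3 00000000, parameter4 f74fba62"
)

if __name__ == "__main__":
    print("\n".join(describe(LOG_LINE)))
```

Running the same function over the older event log entries would show which fields changed after the driver update. The fourth parameter is the one worth comparing against the driver load addresses listed in the minidump. This sketch only labels the values and does not interpret them.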

I haven't had a chance to do much of the troubleshooting mentioned earlier, but I thought that the changed error might shed some light on the problem.

Again, any help is appreciated!
 
Is there any possibility that the various cards in your server have been moved to different slots since the OS was installed? This can be particularly important for server OSes and RAID units.

There does seem little doubt that you have a hardware incompatibility somewhere.

One solution to that would be a complete server OS reinstall, but what a nightmare that could be. You could research the risks of an in-place reinstall (i.e. hardware re-detect).

MUCH simpler all round to buy a new server with warrantied hardware support from a reputable manufacturer, set it up with users etc. etc. and then transfer the data across. Keep the old server and update it nightly. It's called a warm standby, and last week I was VERY grateful for it when our main Dell server blew its motherboard. One lesson learned from that... do everything you can to make the servers identical in users/printers/volumes and especially compatible tape drives.

You should point out to your bosses the consequences of losing your server. After 24 hours 50% of companies suffer severe losses or go out of business. After 3 days that rises to 95% of companies. Against that sort of situation, trying to save a few thou does NOT make sense.
 
I agree with you on the hardware issue being the cause. I think that it may be a tape drive that doesn't like the driver, even though it is supposed to be certified with the server we have.

Your suggestion of a warm standby server may be a distinct possibility, because we are shutting down our secondary site in the next month, and we'll have the server from there just lying around. That may just be our answer.

Thanks for all your help, and I'll let everyone know how it goes.
 
Re drivers, have you had a look at www.driverguide.com? I have found this site useful in the past. Free registration is required, then you can search by all sorts of detailed references, plus read contributors' comments.
 
I went ahead and registered for that site. They didn't have the drivers I was looking for earlier, but it seems like a good site to use.
 