SBS 2003 Crashing regularly

Status
Not open for further replies.

chance1138

We are running SBS 2003, and every Friday it seems to crash--usually during our backups. It is always accompanied by this error:
The reason supplied by user WALLACE\Administrator for the last unexpected shutdown of this computer is: System Failure: Stop error
Reason Code: 0x805000f
Bug ID:
Bugcheck String: 0x000000d1 (0x00000024, 0xd0000007, 0x00000000, 0xf74fbe02)
Comment: 0x000000d1 (0x00000024, 0xd0000007, 0x00000000, 0xf74fbe02)

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.

It seems to be a device driver as far as I can tell, but I have loaded new drivers for just about every device that I thought could be associated with this error, and it continues to crash every weekend. The strangest part is that we don't run anything differently on the weekend than we do during the week, yet it rarely crashes during the week. If anyone has any ideas, I would appreciate the help!
 
Is your SBS2003 Service Pack'd? There is SP1 available for SBS2003.

Does Event Viewer tell you anything more about the error?

Are there any scheduled tasks running at the time?

Did your server come pre-installed or did you install any diagnostic software (HP, Dell, etc)? It may be bad sectors on a disk if it's to do with your back-up and the copying of large data amounts. But it's hard to tell.

Out of hours, you need to run some diagnostic tools on the server to determine whether or not it is software or hardware causing the problem.

Good luck and I'll help you further if I can. :grinthumb
 
Yeah, we are actually up to Service Pack 2.
There is a separate System event in the Event Viewer that has the same error code listed, but there are no corresponding errors in the Application or any other Event Log.
There are no scheduled tasks, other than that it seems to crash at a time that kills one or both of our backups (but only on the weekend, not during the week).
We don't have any diagnostic software that I am aware of. If you know of any good ones, I'm all ears!
I appreciate the help and the fast reply!
 
Post some minidump files if you have them (they are in C:\Windows\Minidump). I remember that one version of Norton would crash the system when doing backups.
 
All the dumps crash at aac.sys, which is an Adaptec RAID Miniport Driver. Did you update this driver?

BugCheck D1, {24, d0000007, 0, f74fbe02}
Probably caused by : aac.sys ( aac+4e02 )
f74f7000 f7500820 aac aac.sys Sat Apr 23 00:08:30 2005 (4269217E)
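The debugger lines above can be cross-checked by hand. The sketch below is only an illustration, and it assumes the usual meaning of the four 0xD1 (DRIVER_IRQL_NOT_LESS_OR_EQUAL) parameters: referenced address, IRQL, access type, and the address of the instruction that made the reference. All values are copied from the output above; the function and variable names are made up for this example.

```python
from datetime import datetime, timezone

# Values copied from the debugger output quoted in the post.
BUGCHECK_ARGS = (0x00000024, 0xD0000007, 0x00000000, 0xF74FBE02)
MODULE_START = 0xF74F7000      # aac.sys start address
MODULE_END = 0xF7500820        # aac.sys end address
MODULE_TIMESTAMP = 0x4269217E  # driver timestamp, seconds since 1970 (UTC)


def describe_d1(args, start, end):
    """Interpret the four parameters of a 0xD1 bugcheck (assumed layout)."""
    referenced, irql_raw, access, fault_addr = args
    return {
        "referenced_address": hex(referenced),
        "raw_irql_field": hex(irql_raw),
        "access": "read" if access == 0 else "not a read",
        # True when the faulting instruction lies inside the driver image.
        "inside_module": start <= fault_addr < end,
        # Offset of the faulting instruction from the driver's start address.
        "module_offset": hex(fault_addr - start),
    }


info = describe_d1(BUGCHECK_ARGS, MODULE_START, MODULE_END)
build_time = datetime.fromtimestamp(MODULE_TIMESTAMP, tz=timezone.utc)

print(info)
print("driver timestamp (UTC):", build_time.isoformat())
```

The offset works out to 0x4e02, which matches the "aac+4e02" in the debugger output, and the faulting address falls inside the aac.sys range. The timestamp is printed in UTC here, so it can differ by some hours from the local time the debugger displays.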
 
I had updated to the newest drivers, but I am going to try to reinstall or rollback to see if that helps matters any.
Thanks for your help, and I'll let you know if that resolves our issue. (Though I will have to wait until the weekend, since that is when we experience the crashing.)
 
Well reloading the same version of drivers didn't work. As a matter of fact, it crashed earlier than usual! I guess I'll try an older version, though I can vaguely remember having this problem with our old one too. Does anyone know why the Adaptec drivers would kill everything roughly once a week? I haven't found any related info on their website.
 
I haven't heard of a RAID controller crashing only once a week. Whether this is the cause of the problem is still undetermined, although by the minidump analysis it is the most likely cause.

I would be thinking about re-building the RAID array with a new RAID controller, but that is quite a big thing! Not something you want to rush into doing ...but it may well work.
 
Yeah, that's something we wanted to try to avoid, but if it keeps up with the crashing, we may have to go that route. I'll try an older set of drivers first, just to make sure I don't have to do extra work! Thanks for all the help!
 
Is there anything sharing IRQ 7? One of the dumps has network routines on the stack, so I'm wondering what IRQ your NIC is using. As it's crashing at IRQL 7, it may be a hardware error with the controller card.

BugCheck D1, {24, d0000007, 0, f74fbe02} <-- IRQL 7
 
The Resource listing of the IRQ for my enabled NIC is shown as 24. I don't know that that tells you everything you need to know, though, considering that the RAID card that the minidump points to shows IRQ 48 under Resources.
 
I don't think it's relevant after finding the info below.

IRQL values are divided into two groups: software (0, 1, 2) and hardware (>= 3). Hardware IRQLs are used for device ISRs and the system. They are similar to, but distinct from, the hardware IRQ levels implemented by the i8259 interrupt controller: IRQL is a Windows concept implemented by the OS, not by the hardware.
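To make the grouping above concrete, here is a small Python sketch. It assumes the IRQL sits in the low byte of the second bugcheck parameter (0xd0000007 giving 7), which matches the "IRQL 7" reading earlier in the thread but is an assumption, not something the dump output states. The function name is made up for this example.

```python
def irql_group(irql: int) -> str:
    """Group an IRQL value as described in the post: 0-2 software, >= 3 hardware."""
    if irql < 0:
        raise ValueError("IRQL cannot be negative")
    return "software" if irql <= 2 else "hardware"


# Assumption: the IRQL is carried in the low byte of the second parameter.
raw_param = 0xD0000007
irql = raw_param & 0xFF

print(irql, irql_group(irql))
```

Note that this number is an IRQL, not an IRQ line, which is why the IRQ 24 and IRQ 48 resource assignments do not need to line up with it.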
 
So does the best bet seem to be re-creating the RAID array? If so, would you say that I should try with the same RAID controller, or getting a new one? The thing that still confuses me is that it happens pretty regularly on the weekends, but rarely other times. Nothing is any different at those times, which makes me wonder if I haven't missed what could have been causing the problem. Thank you guys for your help! I'll definitely stick around these forums, as there is a lot of good advice and some very knowledgeable people here.
 
What kind of server is it? Dell or HP? Or something generic?

Look into obtaining some server diagnostic software before re-building the array.
 
It's basically generic. It was built by Supermicro and sold through a local company to my company a couple of years before I got here.

Does anyone know of any good server diagnostic software? I'll look around myself today, but any suggestions would be appreciated. Thanks again!
 
1) there MUST be something different on Fridays, think what it is.
2) if it crashes during backup, then surely that is the prime suspect.

On a Novell server, I have known backup tape hardware to crash the system when old tapes are used. This is just poor design in the tape driver microcode: if the device gets too many read errors, it just crashes instead of sensibly reporting a failure.

By analogy, do you use a certain tape every Friday? Take my advice: change the whole lot of tapes every 12-18 months, clean the tape drive once per week, and replace the tape drive every 3 years. Anything else is false economy.

The same arguments apply if you use some other backup device. It is not hard to believe that a device error will appear as a RAID error. You only have to experience what happens in XP sometimes when you try to write to a bad CD or even floppy disc... BANG! Over goes your entire system. Who to blame? MS, of course.
 
We have three different "weekly" tapes that are used every 1st, 2nd, and 3rd Friday of the month, so it is a different tape every week. And while it usually crashes on Friday, there are occasions when it crashes with the same error during the week. Several of the tapes are brand new, none are older than one year, and we run cleaning jobs every couple of weeks. The tape drive is a possibility, because I believe that it has been roughly three years since they were installed. Does anyone have any ideas on how to determine if that is the cause, before actually replacing the drive?
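One way to check whether the crashes follow a particular weekly tape is to map each crash date to its position in the rotation. The Python sketch below is only an illustration: the function name is made up, and the sample dates are hypothetical placeholders, not dates from this server's event log.

```python
from datetime import date


def weekly_tape_slot(day: date):
    """Return 1 for the 1st Friday of the month, 2 for the 2nd, and so on.

    Returns None when the date is not a Friday.
    """
    if day.weekday() != 4:  # Monday is 0, so Friday is 4
        return None
    return (day.day - 1) // 7 + 1


# Hypothetical crash dates, for illustration only.
crash_dates = [date(2007, 6, 1), date(2007, 6, 8), date(2007, 6, 13)]

for d in crash_dates:
    print(d.isoformat(), d.strftime("%A"), weekly_tape_slot(d))
```

Feeding in the real dates from the System event log would show whether one tape slot accounts for most of the crashes, or whether they are spread evenly across the rotation.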
 
Whilst the tape drive remains a possibility, is the weekly backup different from the other backups, in the sense that it is a full backup as opposed to an incremental/differential?

If this is true, then one might also suspect that, because the full backup reads the whole server or volume, it reads data that other backups and applications never access. That data could be corrupt or unreadable, causing a hard failure.

Or it may simply be the volume of data causing the server's RAID unit to time out improperly while trying to feed the slow tape drive, or a buffer overflow.

You could run a drive test in the hope of finding any unreadable spots. You might also try testing for RAID read problems if the manufacturer can supply software. Finally, if the RAID and disks check out, you are back down to the tape drive.

What model of tape drive is it? DAT48, DAT72 or other? Seagate drives are very popular (supplied with most Dell servers and Dell branded), but their life is limited. I have gone through 3 drives in 10 years. You normally don't have much doubt if it has failed, but sometimes it is not so clear-cut.
 
They are full as opposed to incremental. I hadn't thought of that. I have run chkdsk a few times... are there any other programs to try on the volumes that might be better?

We have two drives--one for system data, and one for everything else. The system drive is a single tape Exabyte VXA-2 drive. The other is a 10 tape carousel Exabyte VXA-3.
 
Chkdsk will tell you nothing. You need software to actually read every byte (ideally a thorough pattern test read/write, but your problem is possibly just a hard read fault).
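As a rough stand-in for such a tool, a short script can at least force a read of every byte of every file, which is roughly what a full backup does. This is a minimal Python sketch with a made-up function name. It reads through the file system rather than the raw device, so it does not touch unallocated space, and some reads may be served from cache.

```python
import os


def read_all_files(root, chunk_size=1024 * 1024):
    """Read every byte of every file under root.

    Returns (bytes_read, failures), where failures is a list of
    (path, error message) pairs for anything that could not be read.
    """
    bytes_read = 0
    failures = []

    def note_walk_error(err):
        # Record directories that could not be listed instead of skipping them silently.
        failures.append((err.filename, str(err)))

    for folder, _subdirs, names in os.walk(root, onerror=note_walk_error):
        for name in names:
            path = os.path.join(folder, name)
            try:
                with open(path, "rb") as handle:
                    while True:
                        chunk = handle.read(chunk_size)
                        if not chunk:
                            break
                        bytes_read += len(chunk)
            except OSError as err:
                failures.append((path, str(err)))
    return bytes_read, failures


# Example call (the path is hypothetical):
# total, failed = read_all_files(r"D:\Data")
```

Any path that shows up in the failures list, or any file at which the server bugchecks, is a candidate for the data that only the full backup ever touches. Run it out of hours, since it puts the same read load on the array as the backup.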

Um, well, Dell and HP both have full server test suites, but yours is neither. Can't suggest anything other than to check the RAID manufacturer's site (if any).

Try appealing on the hardware/networking section of this site.
 
Just one other possibility - heat stress.

Your backup software is what? Most backups compress the data on the fly. That takes huge computing power. On a Novell box normally running at 1 to 2% load, I will suddenly see the load jump to between 50 and 60% when a tape is backing up.

If the backup lasts for anything up to an hour, your processor(s) will get hot. This could cause a crash, or a simple stop if the processor is a Pentium 4, which can halt for thermal protection.

This would also explain why it is mainly the weekend backup that crashes: the others are shorter and don't heat the processor so much. If you have temperature monitoring, watch it during a long backup. At the least, open the case, clean the dust out (you may get a surprise!) and check the fan is giving adequate flow.
 
I've never actually checked the heat at any point, which I will do, but I don't think that's our problem. It is a four Intel Xeon 3.2GHz processor system, and our system monitor rarely reports CPU usage getting above an average of about 1.7%. I'm actually more concerned about RAM, which has been seeing excessive usage more of late. I may look at getting more, since it only has 2GB right now. I will still check the heat, though. Thanks for mentioning that.
 
8 days since you posted - what has happened? We could do with more info, like when did this problem start? Did it start after some software or hardware update, or has it always been this way?
 