I Think I'm Going To Cry
Saturday 12th May, 2007 02:50 Comments: 10
One of my brand new hard disks appears to have died on me. Until I replace it, it appears I can't access my new software RAID array (which kind of makes sense). That means I can't move the data off the other disk that I've been recovering data onto. That disk, possibly because of the dead disk, was displaying error messages, and despite my having moved lots of data off it, the files appear to still be there, so it could be fiddly to move valid data off it - I may have to re-recover it from the original failed array. In addition, until I've moved the data off, I can't easily continue recovering the rest of the data as I'm running out of disks to store it on (I might rebuild my main machine and move some of the data about, as I trust those disks, which are two years old).
I hoped to be able to recover the rest of the data over the weekend, except I can't Remote Desktop into my PC for some reason - the connection died and now all the ports are coming back filtered, but another machine on that hub is still up and running okay, so it must be the machine that's played up. I also discovered during the recovery that I think I have a dodgy disk.
I'm not having much luck. And I'm very worried that, just as it looked like I'd be able to recover most of the data from the failed RAID-5 array, I might have another RAID-5 array - twice the size - going wrong on me. I don't like that all of my data is currently stored on two RAID-5 arrays running with the absolute minimum of hard disks. All I want is a stable fileserver with trustworthy disks storing my data with a bit of redundancy. If one more thing goes wrong, I think I'm going to cry.
EDIT: The disk is definitely dead, you can tell from the scorch marks on the disk that was below it and the burnt out chip on the failed disk (the chip should be flat aside from a small circular indentation in the top right corner of the first pic). Sorry about the quality, they were taken by my cameraphone (and then cropped):
The good news is I only had to reactivate the array and although it says "Failed Redundancy" for now, it does mean I can still get to my data - if necessary - when I'm missing a disk. Think I might leave the fileserver off for now though, play it safe.
Robert - Saturday 12th May, 2007 03:36
And now it appears that my 2 port SATA controller card in my main machine has died on me.
*cries*
Note to self:
If the disk containing part of the RAID-5 volume cannot be reactivated and the volume does not return to the Healthy status, you should replace the failed disk region in the RAID-5 volume.
1. Open Computer Management (Local).
2. In the console tree, click Computer Management (Local), click Storage, and then click Disk Management.
3. Right-click the portion of the RAID-5 volume on the failed disk, click Repair Volume, and then follow the instructions on your screen.
Notes
To perform this procedure on a local computer, you must be a member of the Backup Operators group or Administrators group on the local computer, or you must have been delegated the appropriate authority. To perform this procedure remotely, you must be a member of the Backup Operators group or Administrators group on the remote computer. If the computer is joined to a domain, members of the Domain Admins group might be able to perform this procedure. As a security best practice, consider using Run as to perform this procedure. For more information, see Default local groups, Default groups, and Using Run as.
To open Computer Management, click Start, click Control Panel, double-click Administrative Tools, and then double-click Computer Management.
To replace a disk region in the RAID-5 volume, you must have a dynamic disk with unallocated space that is at least as large as the region to repair. If you do not have a dynamic disk with enough unallocated space, the Repair Volume command is unavailable. (To verify that you have enough space, right-click the disk, click Properties, and then check the size in Unallocated Space. This size may be slightly smaller than shown in the graphical and list views.)
When a member of a RAID-5 volume fails in a severe manner (such as a loss of power or a complete hard disk failure), computers running Windows Server 2003 operating systems can regenerate the data from the remaining members of the RAID-5 volume.
If the RAID-5 failure is due to a power or cabling failure on a single device, you can regenerate the data within the failed member of the RAID-5 volume once the hardware state is restored.
The RAID-5 volume will not display Healthy status in Disk Management until regeneration is complete.
You can only regenerate RAID-5 volumes with Failed Redundancy status. If the RAID-5 volume has Failed status, try to return it to Failed Redundancy status by using the Reactivate Disk command on the remaining disks.
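As a rough illustration of the regeneration described above, here is a minimal Python sketch of XOR parity, the general idea RAID-5 relies on. It assumes a single stripe with the parity block in a fixed last position (real RAID-5 rotates parity across the disks), and it is not how Windows Server 2003 implements it internally; the names `xor_blocks`, `make_stripe` and `regenerate` are made up for this example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks):
    """Return the data blocks plus one parity block (simplified: parity is last)."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def regenerate(stripe, missing_index):
    """Rebuild the block at missing_index from every surviving block in the stripe."""
    survivors = [block for i, block in enumerate(stripe) if i != missing_index]
    return xor_blocks(survivors)


# Three "disks": two hold data, one holds parity.
stripe = make_stripe([b"disk", b"data"])

# Pretend the second disk has died and rebuild its block from the other two.
lost = 1
rebuilt = regenerate(stripe, lost)
assert rebuilt == stripe[lost]
```

Because any one block is the XOR of all the others, the array can keep serving data with one member missing (the "Failed Redundancy" state), but losing a second member leaves nothing to rebuild from.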
Ouch.. I feel for you, mate. But on the other hand, this is what RAID-5 was born for: at least your data will be OK, and SATA controllers are cheap.
I have a spare 2 port controller at my parents' house that's sitting there doing nothing, but I took a closer look and discovered that my fileserver has an onboard Promise SATA RAID controller (4 port) as well as what appears to be a couple of normal Intel onboard SATA connectors, so I'm not too upset about the card as I have plenty of ports to play with. I think I was upset because it was one more thing that had died on me, the final straw.
The good news is everything's looking fairly healthy again, and Chris will sort out the RMA on the faulty disk on Monday so I should hopefully get a new disk on Tuesday. I also had a quick look at the Event Log and it looks like the 2 port controller had been playing up for about 6 hours, so I'm not entirely surprised that it was dead after I moved it between machines. There doesn't appear to be any sign of the RAID array's disk dying/failing to be read, so I'm pretty sure that it failed when I tried to start the server (when I noticed a strange burning smell and quickly held down the power button to turn it off).
And as you point out, that's what RAID-5 is for. It'd just be nice if it could have happened a month from now, seeing as I'm still recovering data from the last RAID-5 array that played up.
the *only* RAID-5 array that's played up: the current problem is the disk, not the array ;)
(we've got to put up a united front after singing the relative praises of software raid :) )
Both of the old hardware RAID-5 arrays used to play up quite regularly*, but (because I was fed up of rebooting and rebuilding) I had a hot spare so it was a bit less dangerous. Sometimes I'd come home to a beeping machine, only to discover it had already rebuilt itself onto the hot spare.
The only reason my software RAID-5 array has played up is due to hardware failure, which would have been a problem for hardware RAID, and disastrous if I weren't using RAID at all. I still think software RAID is less hassle, it's very straightforward to use under Windows 2003, and at least I know how to reactivate/resync the array if something goes wrong (something I'd hoped to test on the second software array with some sample data once it was up and running).
* It's possible they were being dropped when the drives briefly went offline to perform SMART checks; turning off SMART might have made them a lot more stable.
No, no, my point is your (software) RAID hasn't played up at all - it's working exactly as it's meant to. The disk is playing up, not the array ;)
When the disk was initially classed as "Missing" after I booted into Windows, the array appeared as "Failed" in Disk Management, rather than "Failed Redundancy", forcing me to "Reactivate" the array before I could access the data again. It said it had "Failed", but obviously it's up and running again - even without the extra disk for redundancy - so it hasn't failed and lost all my data, which is why I went for the rather vague wording of "played up".
I get the impression it says "Failed Redundancy" if a disk fails while Windows is up and running; and after I shut down the fileserver and brought it up again it remained at "Failed Redundancy". It presumably only goes to "Failed" if it's lost/"Missing" a disk on bootup, but then remembers the state of the array after that.
The RAID array I'm recovering data from definitely played up, and definitely failed, partly because of what I think is a dodgy disk, but not helped by the fact Highpoint's controllers have a nasty habit of dropping drives for no good reason. I still prefer software RAID. If nothing else, the Highpoint controllers don't let you mix and match disks, even if you have two cards and 16 similar disks, while I can theoretically plug a disk (including larger disks) into any controller and use the dynamic disk under Windows. I haven't tried it, but I did think about it: it's presumably possible to use the onboard SATA RAID controller to present a couple of 250GB drives as a striped drive, which Windows would see as a single 500GB drive and let me add to a RAID-5 array of 500GB disks that was missing one disk. Far more flexible!
See I told ya more cooling is worth it! Stick a couple of big fans onto those hard disks. I am impressed you managed to scorch one. I have never seen that before.
I already had a 120mm fan over those 4 disks (directly over the disk that's now dead)!
Oddly enough the other 4 disks are fine, despite - at the time - not having any fans near them (they were running much hotter than the first 4). I've since put two 80mm fans by them to keep them cool.
The problem doesn't really appear to be heat related, I think it was purely some sort of hardware fault in the drive that caused it to fail spectacularly when I turned the fileserver on.