R6 Degraded - 2 simultaneous drive failures

Morgan Samuel posted this 2 weeks ago

Hi all, I have a 48TB Pegasus R6 running RAID 5 (40TB usable) as a periodic backup area on macOS, and I have just had 2 drives fail at the same time. The unit is communicating with the Promise Utility, but I can't really do anything with it except see which drives are marked "DEAD". I haven't been able to find anything in the manual on troubleshooting this, other than replacing the drives:

1) Is it unusual for more than one drive to fail at the same time?

2) How do I tell if the drives are actually dead, or if there is another issue at play?

3) If it's not hardware related, can the Promise Utility repair the issue and then rebuild the whole array, or do I need to just buy 2 new 8TB drives?

R P posted this 2 weeks ago

Hi Morgan,

Two hard drives going bad at the same time is very unusual unless there is a bad batch of drives, but when the drives are old it's more likely. There was a study done by a large datacenter several years back, and one of the more interesting findings was that once one drive had failed, another drive in the same array was often close to failing. I've seen this in action: I got a call from IT many years back complaining that a rebuild in progress had stopped. Looking at it, I saw that a second drive had failed during the rebuild. Luckily it was RAID 6, so I manually started a rebuild onto the spare and told him to replace the other failed drive after the rebuild completed.

You have a RAID 5. Even if we assume both drives are not actually bad, it's still important to do things in the proper order. If one drive failed and then the other, the first drive to drop is stale and you cannot use its data. The procedure would be to force online the last drive that went offline (the one that took the LD offline), then start a rebuild onto the remaining drive, since the stale one can't simply be put back into the array.
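
From Terminal, that would look roughly like the following using the promiseutil CLI that installs alongside the Promise Utility. I'm writing the option names from memory, so treat the flags and slot number below as an illustration and confirm the exact syntax with the CLI's built-in help before running anything:

  promiseutil             # opens the Pegasus command-line interface
  phydrv                  # list the physical drives and their status (OK / DEAD etc.)
  logdrv                  # list the logical drives; the RAID 5 LD will show as Offline
  help phydrv             # shows the action your firmware uses to force a drive online
  phydrv -a online -p 5   # example only: force the last drive that dropped (slot 5 here)
                          # back online, then rebuild onto the other slot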

The event logs should show the order of events and allow the correct procedure to be determined. Without the event logs, you don't know what happened or what to do about it. The important question is what happened: was there a brownout? If so, both drives are probably OK. Do you see lots of bad block errors and command timeout errors on both drives? If so, they are probably bad drives and recovery is problematic.
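
If it's easier than digging through the Promise Utility, the runtime event log can be read from the same CLI. Again this is a sketch; the bare command should work, but check help event if your firmware wants different options:

  promiseutil             # opens the Pegasus command-line interface
  event                   # dump the runtime event log; look at the drive "offline"/"dead"
                          # entries to see which drive dropped first, and watch for repeated
                          # bad block or command timeout entries against either drive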

Morgan Samuel posted this 2 weeks ago

Thanks for the prompt reply! There is nothing in the event logs that shows the time when the drives went offline; for some reason I can only see today's activity. I've powered the R6 down and back up a few times since the drive failures in the hope of the drives coming back online - perhaps this removed all the event history. Either way, I've tried to force the drives back online using Terminal, but had no success.

I think I possibly just need to accept that the drives have died, and replace them. Luckily this is a rolling backup of an active work area, so unless that work area dies between now and when I get new drives in (touch wood), we won't have lost anything. Just wanted to do my due diligence before rushing out to buy 2 new drives. 

Cheers again for the help. 

R P posted this 2 weeks ago

Hi Morgan,

Please check the NVRAM events; these go back a long way.
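
From the promiseutil CLI, something along these lines should list them. The -l nvram option name is from memory, so if it's rejected, help event will show the exact option your firmware uses:

  promiseutil             # opens the Pegasus command-line interface
  event -l nvram          # the NVRAM event log; unlike the runtime log it survives power
                          # cycles, so the original drive-offline entries should still be there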

Also, if you could upload a service report, I could get a better idea of the state of the system.

Or if you prefer, you could delete the array, keep the existing drives, create a new array/LUN, and restore from backups (good job having backups). But given that this is a Pegasus 1 and the drives are old, I think it would be better to replace the dead drives than take the chance that they fail again later and leave you in the same predicament.
