Drives going down like flies

  • Last Post 22 October 2021
Alexander Snelling posted this 12 October 2021

Hi there. I've got terrible problems with a Pegasus 2 R4 2TB x 4 array in RAID 5 (ie 6TB).

It's quite old but never had many problems with it apart from the occasional drive going down (every two years or so).

A week or so ago a drive appeared dead, so I ordered a new one. I turned the RAID off until the new drive arrived, then replaced the dead drive and rebuilt (I've done this several times before and am very familiar with the process). The rebuild was successful and all drives appeared to be good until this morning, when I swapped all the drives into another R4 unit (I use them interchangeably and have been doing this for five years or more). This time, THREE drives showed a red light and were flagged as dead. After some swapping out and restarting, three are now appearing as StaleConfig and the other one is Unconfigured. Terminal says they are online even though they are not, so I cannot attempt to force them online to transfer data off.

I have put each drive into a separate USB bay and run EaseUS recovery software on them all. EaseUS can see most if not all of the files that were on the array by looking at a single drive, but I haven't tried recovering anything. I can't see how the data could be recovered from a single drive, but EaseUS can see it even on the drive that was reported as dead, so there is something there and all hope is not lost!

My main concern is to get this data transferred off, as there is no real backup (yes I know, but this happened between backups). I cannot read the files on the drives even though they are clearly all there.

Is there any hope? It appears the partition map is corrupted or missing and I cannot find a way to recover this.

 

Thank you

R P posted this 13 October 2021

Hi Alexander,

If you have the drives inserted in the same slots as before it should be possible to recreate the array/LUN. But if the disks were moved often, then it's possible that the proper sequence is not known.

Do you have any service reports from before this happened to show what was configured previously? Or can you be sure that they are in the proper sequence order now (this means that all the disks are in the slots where they were when the RAID was created)?

The files are striped across all the disks, so you won't be able to recover them with any normal recovery software.
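To illustrate why a single member disk only holds fragments of each file, here is a minimal Python sketch of RAID 5 style striping. The chunk size, the parity rotation and the function names (`xor_blocks`, `stripe_raid5`) are assumptions made for the illustration; the actual on-disk layout used by the Pegasus controller is not documented in this thread.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together (this is how RAID 5 parity is formed)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def stripe_raid5(data: bytes, n_disks: int = 4, chunk: int = 4):
    """Spread data over n_disks, with one parity chunk per stripe.

    The parity position rotates from the last disk backwards. That rotation
    is illustrative only, not a statement about the real controller.
    """
    disks = [[] for _ in range(n_disks)]
    per_stripe = chunk * (n_disks - 1)
    # Pad the data with zero bytes up to a whole number of stripes.
    padded_len = -(-len(data) // per_stripe) * per_stripe
    data = data.ljust(padded_len, b"\0")
    for stripe_no, start in enumerate(range(0, len(data), per_stripe)):
        chunks = [data[start + i * chunk:start + (i + 1) * chunk]
                  for i in range(n_disks - 1)]
        parity_disk = (n_disks - 1 - stripe_no) % n_disks
        remaining = iter(chunks)
        for d in range(n_disks):
            disks[d].append(xor_blocks(chunks) if d == parity_disk else next(remaining))
    return [b"".join(parts) for parts in disks]


disks = stripe_raid5(b"ABCDEFGHIJKLMNOPQRSTUVWX")
print(disks[0])  # only every third chunk of the original data lands on this disk
```

In this toy layout each disk carries only one chunk in three of the original data, plus parity. That is consistent with recovery software finding traces of files and folder names on one disk while being unable to reassemble complete files from it.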

Alexander Snelling posted this 14 October 2021

Hi RP. Thanks for your reply. 

Unfortunately I don't have this info (unless I have an old service report - I'll look but I doubt one exists). In Promise Utility there appears to be an array number on each drive (0,1,2,3) - is that the sequence? I'll insert using that order - what is the next step?

(Failing that, as it's an R4 there are only a manageable number of permutations (24 if my brain is working) but how can I tell if it's the right sequence chosen?)
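The count of 24 can be checked with a short Python sketch that enumerates every possible slot order for four drives. The slot labels 1 to 4 are just placeholders for the physical drives.

```python
from itertools import permutations

slots = (1, 2, 3, 4)

# Every possible assignment of four drives to four slots.
orders = list(permutations(slots))

print(len(orders))  # 4! = 24
for order in orders[:3]:
    print(order)
```

Any drive whose position is already known for certain reduces the search: with two positions fixed, only 2 of the 24 orders remain.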

 

thanks Alex

Alexander Snelling posted this 14 October 2021

Hi - I've had another look at the drives in Promise Util and also Terminal promiseutil and there appears to be no array present or information available. Is there anything I can do? Surely there must be some sort of flag present to ID which drive is which?

I've seen a couple of threads here where people were told how to revive a staleconfig - is this possible?

I enclose a service report from when this first happened, but this also reports no arrays present.

Attached Files

Alexander Snelling posted this 14 October 2021

Just uploaded the status of my drives in PU. It seems PU is recognising the array's physical drives (3 out of 4) but reporting them as dead, which they are almost certainly not. I feel there is a fix here but I don't know what it is.

These drives contain some unrepeatable footage from an as yet unproduced film (I really don't need a lecture on backing up here, I am more than aware of that). 

R P posted this 14 October 2021

Hi Alexander,

First: This is not guaranteed to work. Also, if there is damage to the partition map, it may work but you still may not see the volume on the desktop and will need to repair the volume with diskutil or DiskWarrior.

----

The image shows the sequence numbers. It's clear that PD3 and PD4 are in the same slots, but PD1 and PD2 have the same drive model number and it's not clear whether they are in the same position or reversed.

Assuming that everything is in the same order, we can proceed...

This will have to be done from the promise CLI.

First, we need to clear the staleconfig status.

cliib> phydrv -a clear -p 1 -t staleconfig

cliib> phydrv -a clear -p 2 -t staleconfig

cliib> phydrv -a clear -p 3 -t staleconfig

PD4 is showing PFA status, we will need to clear that.

cliib> phydrv -a clear -p 4 -t pfa

Now the array can be recreated...

cliib> array -a add -p 1,2,3,4 -l "raid=5,forcesynchronized=yes"

If everything is correct you should see your drive appear on the Mac desktop shortly. Please verify some files; ideally, if you have any videos on the drive, play one and make sure it is OK.

If everything is not right then the array should be deleted immediately.

Do not try to repair the drive with diskutil. If the disk order is wrong, a repair will do damage to the filesystem. If we have the disk order wrong (we don't know about PD1 and PD2), the solution is to delete the array and try again with a different sequence.

Lastly, PD4 was in PFA condition. Most likely SMART said it was failing, so it may go out again. If you don't have a spare drive it would be a good idea to have one on hand; 2TB drives are pretty inexpensive today.

Alexander Snelling posted this 14 October 2021

Hi RP

 

Thank you so much for offering some hope here. I've tried the above and get the following (see attached).

Don't want to mess further without knowing what I might be doing.

Thanks

Alex

R P posted this 14 October 2021

Hi Alexander,

I don't see an attachment.

Alexander Snelling posted this 14 October 2021

Just to give some more background that might be helpful. Originally this array was made of 4x Toshiba drives. One went down a while ago (I can't remember when, but less than a year ago). I still have the old one (now labelled "wrong"), and it still has data on it. The second Toshiba went down a week ago and was replaced too. I still have that one, and again it still has data on it. In other words, I have 6x drives from the same array from before they all disappeared. It seems 3 drives are labelled dead and one Unconfigured (as I just cleared it). Of the two older drives, one is labelled Unconfigured (I cleared the stale status using PU) and the other is labelled as stale. The second spare is almost certainly up to date in terms of data, as nothing was done after the latest rebuild, and I am not convinced there is anything wrong with it.

 

I also have another Pegasus chassis with drive cases, so could use this to test if that is useful.

R P posted this 14 October 2021

Hi Alexander,

Did you replace the drives after they failed and let a new drive rebuild? Did the rebuilds complete?

The default configuration is RAID5, which will stay online with one drive missing, but not two.

So you'll need at least 3 original drives before a recovery is possible.

And we will have to get the sequence correct.

The directions posted won't work unless all of the drives are valid members of the array and we know the sequence order.
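The one-missing-drive limit follows from how RAID 5 parity works: the parity chunk is the XOR of the data chunks in a stripe, so any single missing chunk can be recomputed, but two missing chunks cannot be separated. A minimal Python sketch, with made-up four-byte chunks standing in for the contents of three data drives:

```python
def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


# Illustrative chunk contents (assumed values, not real drive data).
d1, d2, d3 = b"film", b"reel", b"data"
parity = xor_blocks(d1, d2, d3)

# One drive missing (d2): XOR of the survivors and parity gives it back.
rebuilt = xor_blocks(d1, d3, parity)
print(rebuilt)

# Two drives missing (d2 and d3): only their XOR is recoverable,
# which is not enough to tell the two chunks apart.
combined = xor_blocks(d1, parity)
print(combined == xor_blocks(d2, d3))
```

The same arithmetic is why a member that dropped out earlier and was never rebuilt cannot stand in for a current one: its chunks no longer match the parity written after it left.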

Alexander Snelling posted this 14 October 2021

Yes, I replaced one drive twice: once a year or so ago and once last week. It rebuilt successfully both times. The most recent time (last week), the rebuild completed successfully (or so I thought) and then I moved the drives to my other enclosure (I have a Pegasus 1 and 2). At this point, on power up, two drives showed a red light (PU flagged them as dead) and shortly after that another one went down. Unfortunately I didn't record the order of the drives at this point.

I have a spare 2TB drive that I can erase and use as a spare and rebuild using the other three if we can find the correct order. I'm aware this might be time consuming, but the alternative is about a month's worth of hard file wrangling, so this is far preferable.

The issue I have now is that three drives (which I think are OK) are showing as dead. Promiseutil cannot fix them by clearing stale or PFA (it refuses), so I'm not sure how to revive them in order to try to reconfigure the array. Is there a "clear dead" command? I'm convinced they are not dead.

thank you

alex

Alexander Snelling posted this 14 October 2021

Also it has occurred to me I can't remember which chassis originally configured the array - the Pegasus 1 or 2 - if I have the drives in the wrong chassis would that impact a reconfigure?

R P posted this 14 October 2021

Hi Alexander,

"The issue I have now is that three drives (which I think are OK) are showing as dead."

The service report you uploaded shows 3 drives as staleconfig and one drive PFA with none dead.

===============================================================================
PdId Model        Type      Capacity  Location      OpStatus  ConfigStatus
===============================================================================
1    TOSHIBA DT01 SATA HDD  2TB       Encl1 Slot1   Stale     StaleConfig
2    TOSHIBA DT01 SATA HDD  2TB       Encl1 Slot2   Stale     StaleConfig
3    ST2000DM008- SATA HDD  2TB       Encl1 Slot3   Stale     StaleConfig
4    ST2000DM001- SATA HDD  2TB       Encl1 Slot4   PFA       Unconfigured

Without accurate information about the state of the drives a recovery solution won't be possible.
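One way to keep an accurate record of drive state between attempts is to save each `phydrv` listing as text and parse it. The sketch below is a hypothetical helper, not a Promise tool: the function name `parse_phydrv` and the regular expression are assumptions, and the pattern only recognises the drive type (`SATA HDD`) and the ConfigStatus values that appear in this thread.

```python
import re

# Pattern built from the listings quoted in this thread (an assumption;
# other firmware versions may print other columns or status values).
ROW = re.compile(
    r"^(?P<pd_id>\d+)\s+"
    r"(?P<model>.+?)\s+"
    r"(?P<type>SATA HDD)\s+"
    r"(?P<capacity>\S+)\s+"
    r"(?P<location>Encl\d+ Slot\d+)\s+"
    r"(?P<op_status>.+?)\s+"
    r"(?P<config_status>StaleConfig|Unconfigured|Array\d+ No\.\d+)\s*$"
)


def parse_phydrv(text: str):
    """Return one dict per drive row; header and separator lines are skipped."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            rows.append(match.groupdict())
    return rows


REPORT = """\
PdId Model Type Capacity Location OpStatus ConfigStatus
1 TOSHIBA DT01 SATA HDD 2TB Encl1 Slot1 Stale StaleConfig
2 TOSHIBA DT01 SATA HDD 2TB Encl1 Slot2 Stale StaleConfig
3 ST2000DM008- SATA HDD 2TB Encl1 Slot3 Stale StaleConfig
4 ST2000DM001- SATA HDD 2TB Encl1 Slot4 PFA Unconfigured
"""

rows = parse_phydrv(REPORT)
for row in rows:
    print(row["location"], row["model"], row["op_status"], row["config_status"])
```

Saving the parsed rows with a timestamp after every change would give a history of which model sat in which slot and what status it reported. Rows that do not match the pattern are skipped silently, so the row count should be checked against the number of drives installed.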

Alexander Snelling posted this 15 October 2021

Yes, this changed. I didn't do anything apart from swap an old one in and out to check if the older one was showing any different behaviour or status.

 

Just came in this morning and the array numbers had changed from last night (no3 was at the top)

Swapped drives 1 and 4 over and then got the wrong config again. Swapped them back to original placings and now they seem in the right order. (There are two Toshibas from the original stripe and two Seagate replacements.)

 

I'm thinking I could try putting the last failed Toshiba back in so there are three from the original stripe, then add a newly formatted drive. I could then try to rebuild from there if I can find the right drive placement (could be a long weekend!)

Attached Files

Alexander Snelling posted this 15 October 2021

Problem now is getting the three "dead" drives back online.

Alexander Snelling posted this 15 October 2021

Tried phydrv -a online -p 2

Now all back online and even saw the array and folder structure momentarily but needed to connect another raid to allow me to get these files off. Now it's not mounting but does appear to allow me to rebuild. Ideally I want to back up my data before rebuild...

 

Attached Files

Alexander Snelling posted this 15 October 2021

Current status. Volume mounted this morning but I had to power down to connect a drive to offload media. When restarted volume would not mount.

Rebuild does not seem possible now either. I am assuming the array needs to be deleted and recreated but am aware how dangerous this could be. Have started media patrol. Not going to touch anything until I hear back as I sense this is now quite close to a solution.

cliib> phydrv
===============================================================================
PdId Model        Type      Capacity  Location      OpStatus  ConfigStatus
===============================================================================
1    ST2000DM001- SATA HDD  2TB       Encl1 Slot1   OK        Unconfigured
2    TOSHIBA DT01 SATA HDD  2TB       Encl1 Slot2   Media Pat Array0 No.1
3    ST2000DM008- SATA HDD  2TB       Encl1 Slot3   OK        Array0 No.2
4    TOSHIBA DT01 SATA HDD  2TB       Encl1 Slot4   OK        Array0 No.3

R P posted this 15 October 2021

Hi Alexander,

If the array is degraded when you boot the Pegasus you will have to accept the array before it will come online.

The CLI command is...

array -a accept -d 0

I would suggest copying the files off and not worry about the rebuild for now.

R P posted this 15 October 2021

Hi Alexander,

"Just came in this morning and the array numbers had changed from last night (no3 was at the top)"

This is not possible, the drives cannot move themselves.

Alexander Snelling posted this 15 October 2021

Hi RP

"Just came in this morning and the array numbers had changed from last night (no3 was at the top)"

"This is not possible, the drives cannot move themselves."

 

I'm not suggesting the drives moved themselves. I intentionally physically swapped 1 and 4 around in order to get the right order (ie Array 0,1,2,3) as I thought that might be important - I suspect it isn't. Sorry, the time difference is making this doubly difficult, but I so appreciate what you are doing. As I said in another post, I think (hope) I am nearly there, but don't want to speak too soon. Will focus on getting the media offloaded first using

array -a accept -d 0

Alexander Snelling posted this 15 October 2021

Now getting this:

cliib> phydrv
===============================================================================
PdId Model        Type      Capacity  Location      OpStatus  ConfigStatus
===============================================================================
1    ST2000DM001- SATA HDD  2TB       Encl1 Slot1   OK        Unconfigured
2    TOSHIBA DT01 SATA HDD  2TB       Encl1 Slot2   OK        Array0 No.1
3    ST2000DM008- SATA HDD  2TB       Encl1 Slot3   OK        Array0 No.2
4    TOSHIBA DT01 SATA HDD  2TB       Encl1 Slot4   Media Pat Array0 No.3

cliib> array -a accept -d 0
Accepting this array can result in offline logical drives and lost data
The disk array does not have an incomplete condition to accept
