VTrak M500i All drives marked STALE

  • Last Post 12 November 2024
Homer Ajax posted this 02 November 2024

The array was last online in September, working fine with no major problems. It was powered down gracefully through the CLU.

When it was powered back up, all the logical configuration was gone and physical drives were all marked as Stale config.  No events were recorded in the NVRAM or RAM to explain this.

The drives have all the array data. I have two enclosures and can move arrays by simply moving drives, so I suspect I just need to clear something in the chassis. Unfortunately I followed a misleading guide and cleared StaleConfig on one drive, which simply wiped its data.

There was only 1 logical drive with a 14 drive RAID50 and a hot spare.

I would really appreciate help or a CLI guide with instructions. This array was used to store old data that might be important in the future.

Thank you,

Homer

 

TECH data

===============================================================================
PdId Model        Type  CfgCapacity Location       OpStatus   ConfigStatus
===============================================================================
1    Hitachi HDS7 SATA  931.32GB    Encl1 Slot1    Stale      StaleConfig
2    Hitachi HDS7 SATA  931.32GB    Encl1 Slot2    Stale      StaleConfig
3    Hitachi HDS7 SATA  931.32GB    Encl1 Slot3    Stale      StaleConfig
4    Hitachi HDS7 SATA  931.32GB    Encl1 Slot4    Stale      StaleConfig
5    Hitachi HDS7 SATA  931.32GB    Encl1 Slot5    Stale      StaleConfig
6    Hitachi HDS7 SATA  931.32GB    Encl1 Slot6    Stale      StaleConfig
7    Hitachi HDS7 SATA  931.32GB    Encl1 Slot7    Stale      StaleConfig
8    Hitachi HDS7 SATA  931.32GB    Encl1 Slot8    Stale      StaleConfig
9    Hitachi HDS7 SATA  931.32GB    Encl1 Slot9    Stale      StaleConfig
10   Hitachi HDS7 SATA  931.32GB    Encl1 Slot10   Stale      StaleConfig
11   Hitachi HDS7 SATA  931.32GB    Encl1 Slot11   Stale      StaleConfig
12   Hitachi HDS7 SATA  931.32GB    Encl1 Slot12   Stale      StaleConfig
13   Hitachi HDS7 SATA  931.32GB    Encl1 Slot13   Stale      StaleConfig
14   Hitachi HDS7 SATA  931.32GB    Encl1 Slot14   Stale      StaleConfig
15   Hitachi HDS7 SATA  931.32GB    Encl1 Slot15   OK         Unconfigured

 

==============================================================================
LdId Alias       OpStatus    Capacity  Stripe RAID    CachePolicy     SYNCed
==============================================================================
There are not any logical drives.

 

 

 

 BootLoaderVersion     : 2.02.0000.00   BootLoaderBuildDate   : Mar 3, 2011
 FirmwareVersion       : 2.39.0000.00   FirmwareBuildDate     : Mar 3, 2011
 SoftwareVersion       : 2.39.0000.00   SoftwareBuildDate     : Mar 3, 2011

 

3190  Ctrl 1         Info     Sep 20, 2024 21:00:34 The system is started
3191  Fan 1 Enc 1    Minor    Sep 20, 2024 21:00:50 PSU fan is malfunctioning
3192  Port 1 Ctrl 1  Info     Sep 20, 2024 21:01:25 Host interface link is up
3193  Port 2 Ctrl 1  Info     Sep 20, 2024 21:01:26 Host interface link is up
3194  Ctrl 1         Info     Nov 1, 2024 16:31:28  The system is started
3195  Port 1 Ctrl 1  Info     Nov 1, 2024 16:32:50  Host interface link is up

R P posted this 05 November 2024

Hi Homer,

Under normal conditions a lost array/LD can simply be recreated. This requires that the array and LD details be known.

In this case we know only that there is a RAID50 on a 14-drive array with 1 hot spare. But the post details make this questionable.

Unfortunately I followed a misleading guide and cleared StaleConfig on one drive, which simply wiped its data.

There is one drive unconfigured, which is what you get when a staleconfig is removed. So that would seem to indicate that PD15 was an array member and that it was a 15-drive array with no spare, which was a common configuration.

Can you clarify which drive you removed the StaleConfig status from?

 

Homer Ajax posted this 12 November 2024

I appreciate you getting back to me. Here's the info you were requesting.

This was the condition of the array immediately after the failure:

I then followed the Promise product manual/knowledgebase which basically instructed me to:

(Stale – The physical drive contains obsolete disk array information. Click on the Clear tab. )

I chose drive 15 as a guinea pig and cleared its Stale condition. I was expecting that drive to come back online and indicate a degraded offline array. However, it simply came back as an unconfigured new drive. This is the current state:

I'm confident that this was configured as a Single 12TB usable RAID 50 array with a hot spare.  I'm much less confident regarding the array configuration, specifically the stripe size I chose.  
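As a quick sanity check on the "12TB usable" figure, the arithmetic can be sketched as below. This is a minimal sketch under stated assumptions: the drives are nominal 1 TB units (the 931.32GB shown in the listing is what 10^12 bytes looks like when counted in binary gigabytes), axles are equal-sized RAID5 sets of at least 3 drives, and each axle gives up one drive's worth of capacity to parity. The names `Layout` and `raid50_layouts` are illustrative only and do not come from any Promise tool.

```python
from dataclasses import dataclass

DRIVE_TB = 1.0  # assumption: nominal 1 TB (decimal) per drive


@dataclass(frozen=True)
class Layout:
    drives: int
    axles: int

    @property
    def drives_per_axle(self) -> int:
        return self.drives // self.axles

    def usable_tb(self) -> float:
        # Each RAID5 axle loses one drive's worth of capacity to parity.
        return (self.drives - self.axles) * DRIVE_TB


def raid50_layouts(drives: int) -> list[Layout]:
    """Even RAID50 splits: at least 2 axles, at least 3 drives per axle."""
    return [Layout(drives, a) for a in range(2, drives // 3 + 1) if drives % a == 0]


for n in (14, 15):
    for lay in raid50_layouts(n):
        print(f"{n} drives, {lay.axles} axles x {lay.drives_per_axle}: "
              f"{lay.usable_tb():.0f} TB usable")
```

Under these assumptions a 14-drive array with 2 axles and a 15-drive array with 3 axles both come out at 12 TB usable, so the capacity alone may not settle whether the original array had 14 or 15 members.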

I would like a bit more information regarding recreating the array. Are these the correct steps?

1. Clear stale config on all Physical drives

2. Identify the old hot spare.

3. Build a new 14-member array without initializing, using identical settings (would the wrong settings simply not work, or would they destroy the array?)

4. Build a new Logical drive with identical settings (what are the caveats here?  also same question regarding wrong settings)

5. Last question: according to the Promise documentation, there's a reserved sector on all physical drives that contains all the information I need to recover. That data isn't encrypted and could be recovered; however, the documentation is vague about where exactly this data is stored and whether it's clear text or obfuscated in some way.

I can make an exact physical duplicate of an array physical drive, dump it with dd and view the raw data. Is that an option?

Thank you much,

Homer

P.S. I was wondering if you knew where a physical drive's stale condition is recorded? I presume it's saved directly on the drives to prevent a bad drive being used in another chassis. Can I use a hex editor to simply remove that flag without destroying the config?

 

R P posted this 12 November 2024

Hi Homer,

Since all drives are marked stale and there was only 1 powerdown/reboot event, most likely all the drives have consistent data.

Your procedure looks sound. Regarding the questions...

3. Build a new 14-member array without initializing, using identical settings (would the wrong settings simply not work, or would they destroy the array?)

An array is just a group of disks, there are no array settings per se, the settings belong to the logical drive(s).

Initialization, as you surmise, will wipe the data from the disks.

4. Build a new Logical drive with identical settings (what are the caveats here?  also same question regarding wrong settings)

This is where settings matter. If you get the stripe size, block size, the number of RAID50 axles, or the sequence order of the drives in the array wrong, then as soon as the sync starts it will overwrite the data. Even if default settings were used, sometimes they change between firmware releases. For example, the original Pegasus default stripe size was 64K and later it became 1M, so if the firmware was updated after the LD was created, a default may have changed. I'm not familiar enough with the 500i to know if any defaults changed across firmware releases. But if the array/LD was created with the firmware installed now (which is the latest release) then everything should be fine WRT default values.

Another issue is that you have a spare. If a disk had failed and a rebuild took place, then the disks will not be in default sequence order, and you need to know the sequence order to recreate an array.

In the M500i, after creating a logical drive, a synchronization should start immediately. So you get one shot.
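To illustrate why those settings matter, here is a minimal sketch of how a generic RAID50 might map a logical block to a drive. The layout used here (RAID0 across axles one chunk at a time, left-asymmetric RAID5 parity rotation inside each axle) and the names `locate` and `order` are assumptions for illustration; the Promise RAID engine's real on-disk layout is proprietary and may differ.

```python
def locate(lba: int, stripe_sectors: int, axles: int,
           drives_per_axle: int, order: list[int]) -> tuple[int, int]:
    """Map a logical sector to (physical slot, sector on that drive).

    Hypothetical layout, NOT the Promise format:
    - chunks are dealt round-robin across axles (RAID0 layer)
    - inside an axle, parity starts on the last drive and rotates
      backwards one drive per row (left-asymmetric RAID5)
    `order` lists the physical slot for each member position.
    """
    chunk, offset = divmod(lba, stripe_sectors)
    axle, axle_chunk = chunk % axles, chunk // axles
    row, idx = divmod(axle_chunk, drives_per_axle - 1)
    parity = drives_per_axle - 1 - (row % drives_per_axle)
    disk = idx if idx < parity else idx + 1  # skip over the parity drive
    member = axle * drives_per_axle + disk
    return order[member], row * stripe_sectors + offset


slots = list(range(1, 15))  # default sequence: member 0 is slot 1, and so on

# Same logical sector, two guesses at the stripe size (64 KiB vs 128 KiB).
print(locate(1000, 128, 2, 7, slots))
print(locate(1000, 256, 2, 7, slots))
```

With the two stripe-size guesses, the same logical sector resolves to a different drive and a different offset, which is why a synchronization that recomputes parity over a mis-specified layout would write parity over what is actually data.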

5. Last question: according to the Promise documentation, there's a reserved sector on all physical drives that contains all the information I need to recover. That data isn't encrypted and could be recovered; however, the documentation is vague about where exactly this data is stored and whether it's clear text or obfuscated in some way.

The disk array information is stored on a disk in an area known as DDF. This is managed by the 500i firmware and is not accessible to the user. In this case we know what it says about all the drives: it says that the DDF is stale.

I can make an exact physical duplicate of an array physical drive, dump it with dd and view the raw data. Is that an option?

The data is striped across all the drives in the RAID50 set, and this is true for any RAID set except a 1-drive RAID0. You won't find any files on an individual drive.

P.S. I was wondering if you knew where a physical drive's stale condition is recorded? I presume it's saved directly on the drives to prevent a bad drive being used in another chassis. Can I use a hex editor to simply remove that flag without destroying the config?

I don't know where the DDF is located on disk, and I don't think it's as simple as resetting a flag.
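For anyone who wants to look at a dd image of one drive, a read-only scan can be sketched as below. It assumes the metadata follows the SNIA Common RAID Disk Data Format, where header blocks begin with the signature 0xDE11DE11 and an anchor header sits in the last block of the drive. Whether the M500i firmware actually uses that layout is not confirmed anywhere in this thread, so an empty result proves nothing. The function name, the 64 MiB tail window and the 512-byte sector size are illustrative choices.

```python
import os

DDF_SIGNATURE = 0xDE11DE11  # SNIA DDF header signature (assumed relevant here)
# Check both byte orders, since the on-disk endianness is not known.
PATTERNS = (DDF_SIGNATURE.to_bytes(4, "big"), DDF_SIGNATURE.to_bytes(4, "little"))


def find_ddf_signatures(image_path: str,
                        tail_bytes: int = 64 * 1024 * 1024,
                        sector: int = 512) -> list[int]:
    """Return byte offsets of sectors in the image's tail that start with the signature.

    Works on a regular image file (for example one produced by dd);
    the file is opened read-only and never modified.
    """
    size = os.path.getsize(image_path)
    start = max(0, size - tail_bytes)
    start -= start % sector  # keep the scan sector-aligned
    with open(image_path, "rb") as f:
        f.seek(start)
        data = f.read()
    return [start + off
            for off in range(0, len(data), sector)
            if data[off:off + 4] in PATTERNS]
```

Running this against images of two or more member drives and comparing the hit offsets is a low-risk way to see whether the metadata is at least consistently placed, since it only ever touches copies.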

Homer Ajax posted this 12 November 2024

Thank you for your response.  I really appreciate your help on such an old product.  

Per your advice, here's additional data.

1. The array was purchased in 2006 (props on hardware quality) and was built back then with the initial firmware.

2. The hot spare drive today is PD12, and originally it was PD15, so at least one drive had failed previously.

3. Would you mind looking into this specific platform? This is a known issue documented on your site. Every firmware version included fixes for it, with the last version, 6.3, purely dedicated to fixes for this problem. Being such a common issue, there should be a procedure somewhere on your intranet. I realize that it's ancient, but I don't require hand holding, just some platform-specific documentation to get me going.

4. In regards to the DDF or superblock being inaccessible, that's true within the Promise environment.  However that restriction is done through software.  It's just a regular HD sector reserved for system data and I can read it in any other environment. 

5. We don't know where the STALE flag is, but the fix is exactly that: clear the glitched flag. Fundamentally we have a good chassis, good drives, a good copy of the DDF, and a good copy of all the data. We simply have to remove the flag that's preventing the acceptance of the good config.

In fact I've dealt with this exact same issue twice before, once with 3ware and once with EMC.  Embarrassingly one of those was when I pulled a good drive from a degraded array (whoopsy :)) The jokes took a long time to live down.  But in both cases they simply patched out the flags from the firmware and the array was online immediately with no issues.

6. Unfortunately your proposal isn't an option. I appreciate why it's not a good idea to permit clearing a STALE flag in the interface itself. However, in this particular case that's exactly what we need to do. Your approach means erasing the DDF to remove a flag, then betting on a one-in-a-million chance that the recreated DDF comes back exactly the same, all while initializing a huge 14-drive array on new firmware with a known glitch in the controller. There are too many variables, literally thousands of valid ways to create this array. The operation is too complex to get right.

It would really be nice to restore the array in place if you find the time.  Without that my only other option is to resort to tools like mdadm or some commercial product to go through an agonizing virtual RAID recovery.

 

 

 

 

 

R P posted this 12 November 2024

Hi Homer,

3. Would you mind looking into this specific platform? This is a known issue documented on your site. Every firmware version included fixes for it, with the last version, 6.3, purely dedicated to fixes for this problem.

Odd then that they don't mention it in the release notes.

Here are the 2.39 (SR 6.3) release notes.

2.2.1 FW Related Fixes
* Fix back‐end command process issues (OutOfResouces).
* Log and report events that cause drive failures in specific areas.
* Updated medium error threshold criteria for marking drive dead.

Here's the 2.38 release notes.

2.2.1    FW Related Fixes
* BT #25373: HDD time-out error handling needs to be consistent with Vess series and Mx10 series in which disk drives need to be marked dead when encountering over 6 time-outs within 30 minutes.
* BT#16859: Spare Disk capacity becomes 8GB.
* BT#22690: Can't initiate rebuild automatically or at BGA when source drive has been changed.

In fact I've dealt with this exact same issue twice before, once with 3ware and once with EMC.  Embarrassingly one of those was when I pulled a good drive from a degraded array (whoopsy :))  But in both cases they simply patched out the flags from the firmware and the array was online immediately with no issues.

There are many potential ways to implement DDF, but I don't know how it's done on Promise arrays. There is an informal body that sets standards for these things, but they are more suggestions than requirements. Every manufacturer implements things differently.

But you do have access to the disks and can examine them.

However a solution of erasing DDF to remove a flag, only to bet on a one-in-a-million chance that your DDF returns back exactly the same.

I've successfully done this numerous times on many storage systems. But having a service report assures that the exact same array+LD can be created and the sequence of events can be determined. We don't have one here. Also we now have a forced-sync option to prevent synchronization so the first attempt need not be the last. There have been many changes since the M500i.

Also the M500i is a few years before my time and I don't have much experience with it or more documentation than what's on the downloads page.

It would really be nice to restore the array in place if you find the time. 

I agree. But what we have discussed are the tools I have available. There is one place that I have access to that might potentially have something of use here. I will look, but this is not a guarantee. In fact, the last few times I've looked there I have not struck paydirt.

Without that my only other option is to resort to tools like mdadm or some commercial product to go through an agonizing virtual RAID recovery.

The Promise RAID engine is not mdadm; it's proprietary and now at v4.0. You won't be able to use standard Linux tools. There is a Promise driver in the modern Linux kernel, the STEX driver. I don't know if it's compatible with the M500i RAID, but I think they are more or less contemporary and it might work.

 
