Here is the process SRM follows during a test failover or failover operation. Please read, understand, and memorize it (if you have to :)), since most of us are unaware of it.
Hope this clarifies your doubts. These steps were written by SRM-Guru CPD.
1. SRM issues the TestFailoverStart or Failover command to the SRA. In this command SRM provides the list of devices that are to be prepared for failover or test failover. It also provides a list of (ESXi) initiators to which these devices are to be presented.
2. Once the SRA completes the command, it informs SRM (through an output XML file that the SRA generates and that SRM listens and waits for) that the device in question is ready for access. At this point the device is supposed to be truly accessible.
3. SRM will now issue a cluster-wide rescan to all ESXi hosts that are designated as recovery hosts. In this step, SRM can wait for a certain period of time before issuing the rescan command. This is called the SRA rescan delay. We introduced this delay specifically to address latency problems with the RecoverPoint SRA, which is known to signal completion of TestFailover/Failover prematurely, in other words before the recovered device is truly ready for READ-WRITE access. This problem with that SRA was discussed close to two years ago, and EMC said they cannot do anything about it because of the asynchronous way that SRA works.
By default the SRA rescan delay is 0, but it is configurable, as we have done in this case.
Note: Setting the SRA rescan delay to 5 minutes is in line with the VMkernel 5 minute path evaluation interval, so that guarantees at least one rescan before the next VMkernel path evaluation.
4. SRM can also issue multiple successive rescan commands followed by a single VMFS refresh command. The number of successive rescans that SRM issues is also configurable. In this case, it is set to 2. The default value is 1.
5. After the rescan is completed, SRM will wait a certain amount of time for notification from VC Server as to which devices on ESX are attached and which ones are not. SRM is only interested in the recovered device in question.
6. On any ESX host where the recovered device is in the detached state, SRM will issue the ATTACH_SCSI_LUN command to have that device attached.
Note: If on an ESX host the recovered device is in APD (All Paths Down) state, the host will not report it as detached. As a result, SRM will not issue a command to that host to have that device attached. This is what is happening on some ESX hosts in this case, as you can see in other comments in the PR.
7. After SRM issues ATTACH_SCSI_LUN on the ESXi hosts that have reported the recovered device in DETACHED state (i.e. these are now the good hosts or, as SRM calls them, ‘the requested hosts’), SRM then issues the QUERY_UNRESOLVED_VOLUMES command to all these hosts.
Note: all good hosts should report the same number of unresolved volumes as the number of recovered devices, or fewer. It would be fewer if, for example, the recovered devices include both VMFS datastores and RDMs.
8. SRM then selects one of these good hosts to issue the resignature command to.
Note: we have configured SRM in this case to use the resignaturing API (and not the old LVM resignaturing method, i.e. LVM.EnableResignature = 1).
9. SRM then issues another Rescan (Refresh = true) to these hosts.
10. SRM then waits for notifications from VC as to which ESX hosts are now seeing the recovered VMFS datastore (that is on the recovered device).
Note: the time SRM waits for these VC notifications of recovered VMFS datastores is configurable.
Note: SRM knows which VMFS datastore should be seen by ESX, and SRM is able to verify whether the good ESX hosts are now seeing the correct (i.e. expected) VMFS datastore.
11. At this stage, SRM proceeds further by removing the placeholder VMs and registering the actual recovered VMs that are on the recovered VMFS datastore.
Note: SRM will only use the good hosts to register these VMs on.
12. If IP customization is required for these VMs, SRM will complete that as well.
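The delay and rescan behaviour in steps 3 and 4 can be sketched as a small planning function. This is only an illustration: `build_rescan_plan`, `sra_rescan_delay_s`, and `rescan_count` are hypothetical names, not SRM settings or API calls. The defaults (delay 0, one rescan) come from the steps above.

```python
def build_rescan_plan(sra_rescan_delay_s=0, rescan_count=1):
    """Return the ordered actions taken after the SRA reports completion.

    Hypothetical helper for illustration only; it is not part of SRM.
    """
    if sra_rescan_delay_s < 0:
        raise ValueError("sra_rescan_delay_s must be >= 0")
    if rescan_count < 1:
        raise ValueError("rescan_count must be >= 1")

    plan = []
    # Optional wait before the first rescan (the SRA rescan delay).
    if sra_rescan_delay_s > 0:
        plan.append(("wait", sra_rescan_delay_s))
    # One or more successive rescans...
    plan.extend(("rescan", i) for i in range(1, rescan_count + 1))
    # ...followed by a single VMFS refresh.
    plan.append(("vmfs_refresh", None))
    return plan


# Example values: the 5 minute delay from the note, and 2 rescans.
for action in build_rescan_plan(sra_rescan_delay_s=300, rescan_count=2):
    print(action)
```

With the defaults, the plan is a single rescan followed by the VMFS refresh, with no wait.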
There are other minor details that I have skipped. But this is pretty much the process.
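To make the APD note in step 6 concrete, here is a minimal sketch of how the attach decision plays out. The state strings and the function `hosts_to_attach` are made up for illustration; SRM actually works from VC notifications, not from a dictionary like this.

```python
def hosts_to_attach(device_state_by_host):
    """Split hosts by the state they report for the recovered device.

    Illustrative states: "detached" or "apd". Hypothetical helper, not SRM code.
    Returns (hosts that get ATTACH_SCSI_LUN, hosts skipped because of APD).
    """
    attach = sorted(
        host for host, state in device_state_by_host.items() if state == "detached"
    )
    # A host in APD does not report the device as detached,
    # so no attach command is issued to it.
    skipped_apd = sorted(
        host for host, state in device_state_by_host.items() if state == "apd"
    )
    return attach, skipped_apd


# Hypothetical host names.
states = {"esx01": "detached", "esx02": "apd", "esx03": "detached"}
attach, skipped = hosts_to_attach(states)
print("ATTACH_SCSI_LUN on:", attach)
print("no attach issued (APD):", skipped)
```

Only the hosts in the first list go on to receive QUERY_UNRESOLVED_VOLUMES in step 7.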
In the test failover cleanup process, SRM follows these steps:
Note: There is no cleanup in the case of an actual failover operation.
1. Unregister the recovered VMs
2. Re-register the placeholder VMs
3. Unmount the recovered VMFS datastore from all good hosts
4. Detach the recovered device from all good hosts (DETACH_SCSI_LUN)
6. Clear the unmount state of the VMFS volume
7. Clear the detach state of the SCSI LUN
8. SRM then issues the TestFailoverStop command to the SRA. The SRA now returns the device (on the storage array) to the state it was in before the TestFailoverStart command was issued and executed.
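The cleanup order matters (unmount before detach, TestFailoverStop last), so it can help to see it as a fixed sequence. The sketch below only encodes the steps listed above in the order given; `CLEANUP_STEPS` and `run_cleanup` are hypothetical names, and the `execute` callback stands in for whatever actually performs each step.

```python
# Steps exactly as listed above, in order. Labels are descriptive only.
CLEANUP_STEPS = [
    "unregister recovered VMs",
    "re-register placeholder VMs",
    "unmount recovered VMFS datastore from all good hosts",
    "detach recovered device from all good hosts (DETACH_SCSI_LUN)",
    "clear unmount state of the VMFS volume",
    "clear detach state of the SCSI LUN",
    "issue TestFailoverStop to the SRA",
]


def run_cleanup(execute):
    """Call execute(step) for each cleanup step in order; return completed steps."""
    completed = []
    for step in CLEANUP_STEPS:
        execute(step)
        completed.append(step)
    return completed


# Example: just print each step as it would run.
run_cleanup(print)
```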