Permanent Device Loss (PDL):

  • A datastore is shown as unavailable in the Storage view.
  • A storage adapter indicates the Operational State of the device as Lost Communication.

All-Paths-Down (APD):

  • A datastore is shown as unavailable in the Storage view.
  • A storage adapter indicates the Operational State of the device as Dead or Error.

PDL:

In vSphere 4.x, an All-Paths-Down (APD) situation occurs when all paths to a device are down. As there is no indication whether this is a permanent or temporary device loss, the ESXi host keeps reattempting to establish connectivity. APD-style situations commonly occur when the LUN is incorrectly unpresented from the ESXi/ESX host. The ESXi/ESX host, still believing the device is available, retries all SCSI commands indefinitely. This has an impact on the management agents, as their commands are not responded to until the device is again accessible. This causes the ESXi/ESX host to become inaccessible/not-responding in vCenter Server.

In vSphere 5.x/6.x, a clear distinction has been made between a device that is permanently lost (PDL) and a transient issue where all paths are down (APD) for an unknown reason.

For example, in the VMkernel logs, if a SCSI sense code of H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x25 0x0 or Logical Unit Not Supported is logged by the storage device to the ESXi 5.x/6.x host, this indicates that the device is permanently inaccessible to the ESXi host, or is in a Permanent Device Loss (PDL) state. The ESXi host no longer attempts to re-establish connectivity or issue commands to the device.

Devices that suffer a non-recoverable hardware error are also recognized as being in a Permanent Device Loss (PDL) state.

Note: Some iSCSI arrays map LUN-to-Target as a one-to-one relationship; that is, there is only ever a single LUN per Target. These arrays do not return the appropriate SCSI sense code, so PDL cannot be detected from sense data on these array types. However, as of ESXi 5.1, the iSCSI initiator attempts to log in to the target again after a dropped session. If the device is not accessible, the storage system rejects the host’s attempt to access the storage. Depending on the response from the array, the host can then mark the device as PDL.

vmkernel log:

++++++++++

2018-01-09T12:42:09.365Z cpu0:32888)ScsiDevice: 6878: Device naa.xxxxxxxxxxxxxxxxxxxxx APD Notify PERM LOSS; token num:1

2018-01-09T12:42:09.366Z cpu1:32916)StorageApdHandler: 1066: Freeing APD handle 0x430180b88880 [naa.xxxxxxxxxxxxxxxxxxxxx]

2018-01-09T12:49:01.260Z cpu1:32786)WARNING: NMP: nmp_PathDetermineFailure:2973: Cmd (0xc1) PDL error (0x5/0x25/0x0) - path vmhba33:C0:T3:L0 device naa.xxxxxxxxxxxxxxxxxxxxx - triggering path evaluation

2018-01-09T12:49:01.260Z cpu1:32786)ScsiDeviceIO: 2651: Cmd(0x439d802ec580) 0xfe, CmdSN 0x4b7 from world 32776 to dev "naa.xxxxxxxxxxxxxxxxxxxxx" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x25 0x0.

2018-01-09T12:49:01.300Z cpu0:40210)WARNING: NMP: vmk_NmpSatpIssueTUR:1043: Device naa.xxxxxxxxxxxxxxxxxxxxx path vmhba33:C0:T3:L0 has been unmapped from the array

After some time passes you will see this message:

2018-01-09T13:13:11.942Z cpu0:32872)ScsiDevice: 1718: Permanently inaccessible device :naa.xxxxxxxxxxxxxxxxxxxxx has no more open connections. It is now safe to unmount datastores (if any) and delete the device.

In this case the LUN was unmapped from the array for this host, which is not a transient issue. Sense data 0x5 0x25 0x0 corresponds to “LOGICAL UNIT NOT SUPPORTED”, which indicates that the device is in a Permanent Device Loss (PDL) state. Once ESXi knows the device is in a PDL state, it does not wait for the device to return.

ESXi checks only the ASC/ASCQ pair; if it is 0x25/0x0 or 0x68/0x0, ESXi marks the device as PDL.
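The ASC/ASCQ check described above can be sketched in Python for use when reading vmkernel logs. This is a minimal illustration, not ESXi's own implementation: the names `parse_sense` and `is_pdl` and the regular expression are assumptions made for this example; only the two ASC/ASCQ pairs and the log line format come from this article.

```python
import re

# ASC/ASCQ pairs that, per this article, cause ESXi to mark a device as PDL.
PDL_ASC_ASCQ = {(0x25, 0x00), (0x68, 0x00)}

# Matches the sense triple in a vmkernel line such as
# "... Valid sense data: 0x5 0x25 0x0."
SENSE_RE = re.compile(
    r"Valid sense data:\s*0x([0-9a-fA-F]+)\s+0x([0-9a-fA-F]+)\s+0x([0-9a-fA-F]+)"
)


def parse_sense(line):
    """Return (sense_key, asc, ascq) as integers, or None if the line has no sense data."""
    match = SENSE_RE.search(line)
    if match is None:
        return None
    return tuple(int(group, 16) for group in match.groups())


def is_pdl(line):
    """True if the line carries one of the PDL ASC/ASCQ pairs; the sense key is ignored."""
    sense = parse_sense(line)
    if sense is None:
        return False
    _sense_key, asc, ascq = sense
    return (asc, ascq) in PDL_ASC_ASCQ


if __name__ == "__main__":
    sample = "failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x25 0x0."
    print(is_pdl(sample))  # True
```

The sense key is parsed but deliberately not used in the decision, mirroring the statement that only ASC/ASCQ is checked.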

 

All-Paths-Down (APD)

If PDL SCSI sense codes are not returned from a device (when unable to contact the storage array, or with a storage array that does not return the supported PDL SCSI codes), then the device is in an All-Paths-Down (APD) state, and the ESXi host continues to send I/O requests until the host receives a response.

As the ESXi host is not able to determine if the device loss is permanent (PDL) or transient (APD), it indefinitely retries SCSI I/O, including:

  • Userworld I/O (hostd management agent)
  • Virtual machine guest I/O
    Note: If an I/O request is issued from a guest, the guest operating system should time out and abort the I/O.

Due to the nature of an APD situation, there is no clean way to recover.

  • The APD situation needs to be resolved at the storage array/fabric layer to restore connectivity to the host.
  • All affected ESXi hosts may require a reboot to remove any residual references to the affected devices that are in an APD state.

Note:

  • Performing a vMotion migration of unaffected virtual machines is not possible, as the management agents may be affected by the APD condition, and the ESXi host may become unmanaged. As a result, a reboot of an affected ESXi host forces an outage to all non-affected virtual machines on that host.
  • vSphere 6.0 and later include VM Component Protection (VMCP) as part of vSphere HA. VMCP protects virtual machines from storage-related events, specifically Permanent Device Loss (PDL) and All Paths Down (APD) incidents.

vmkernel log:

++++++++++

2018-01-10T13:04:26.803Z cpu1:32896)StorageApdHandlerEv: 110: Device or filesystem with identifier [naa.xxxxxxxxxxxxxxxxxxxxx] has entered the All Paths Down state.

2018-01-10T13:04:26.818Z cpu0:32896)StorageApdHandlerEv: 110: Device or filesystem with identifier [naa.xxxxxxxxxxxxxxxxxxxxx] has entered the All Paths Down state.

vobd log:

+++++++

2018-01-10T13:04:26.905Z: [scsiCorrelator] 475204262us: [esx.problem.storage.connectivity.lost] Lost connectivity to storage device naa.xxxxxxxxxxxxxxxxxxxxx. Path vmhba33:C0:T1:L0 is down. Affected datastores: "Green".

2018-01-10T13:04:26.905Z: [scsiCorrelator] 475204695us: [esx.problem.storage.connectivity.lost] Lost connectivity to storage device naa.xxxxxxxxxxxxxxxxxxxxx. Path vmhba33:C0:T0:L0 is down. Affected datastores: "Grey".
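When many devices are involved, it can help to scan the vmkernel log and sort devices by the state they were last reported in. The following Python sketch does this using the two message formats quoted in this article ("APD Notify PERM LOSS" for PDL and "has entered the All Paths Down state" for APD). The function name `classify_devices` and both regular expressions are assumptions for illustration, and the rule that a device stays PDL once a PDL message has been seen is a simplification, not documented ESXi behavior.

```python
import re

# Message formats taken from the log excerpts in this article.
PDL_RE = re.compile(r"Device (?P<dev>\S+) APD Notify PERM LOSS")
APD_RE = re.compile(
    r"identifier \[(?P<dev>[^\]]+)\] has entered the All Paths Down state"
)


def classify_devices(lines):
    """Map each device identifier to "PDL" or "APD" based on vmkernel log lines.

    Simplification: once a device has been reported as PDL, later APD
    messages for the same device do not change its classification.
    """
    state = {}
    for line in lines:
        match = PDL_RE.search(line)
        if match:
            state[match.group("dev")] = "PDL"
            continue
        match = APD_RE.search(line)
        if match and state.get(match.group("dev")) != "PDL":
            state[match.group("dev")] = "APD"
    return state


if __name__ == "__main__":
    import sys

    # Usage: python classify.py /path/to/vmkernel.log (the path is supplied by the caller)
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        for device, kind in sorted(classify_devices(log).items()):
            print(device, kind)
```

Devices reported as APD need the connectivity problem fixed at the array or fabric layer, while devices reported as PDL are candidates for the cleanup procedure that follows.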

To clean up an unplanned PDL:

  1. Power off all running virtual machines on the datastore and unregister them from vCenter Server.
  2. From the vSphere Client, go to the Configuration tab of the ESXi host, and click Storage.
  3. Right-click the datastore being removed, and click Unmount.
    The Confirm Datastore Unmount window displays. When the prerequisite criteria have been passed, the OK button appears.
    If you see this error when unmounting the LUN:

    Call datastore refresh for object <name_of_LUN> on vCenter server <name_of_vCenter> failed

    You may have a snapshot LUN presented. To resolve this issue, remove that snapshot LUN on the array side.

  4. Perform a rescan on all of the ESXi hosts that had visibility to the LUN.
    Note: If there are active references to the device or pending I/O, the ESXi host still lists the device after the rescan. Check for virtual machines, templates, ISO images, floppy images, and raw device mappings that may still have an active reference to the device or datastore.
  5. If the LUN is still in use and is available again, go to each host, right-click the LUN, and click Mount.
    Note: One possible cause of an unplanned PDL is that the LUN ran out of space, causing it to become inaccessible.