Types of memory errors:

    • DIMM Error
      • ECC(Error Correcting Code) Error
        • Multibit = Uncorrectable
          • At POST: the DIMM is mapped out by the BIOS, and the OS does not see the DIMM
          • At runtime: usually causes an OS reboot
        • Singlebit = Correctable
          • OS continues to see memory, performance could degrade
      • Parity Error
      • SPD (Serial Presence Detect) Error
    • Configuration Error
      • Unpaired DIMMs
      • Mismatch errors
        • Unsupported DIMMs
        • Unsupported DIMM population
    • Identity unestablishable error
      • Check and update the catalog

We need to understand what correctable and uncorrectable errors are in order to troubleshoot memory-related issues on a UCS server.

Whether a particular error is correctable or uncorrectable depends on the strength of the ECC code employed within the memory system. Dedicated hardware is able to fix correctable errors when they occur with no impact on program execution.
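
To make the correctable/uncorrectable distinction concrete, here is a minimal sketch of a single-error-correcting, double-error-detecting (SECDED) code over 4 data bits: a Hamming(7,4) code plus one overall parity bit. This is an illustration only and is not the code used by any particular UCS memory controller; real ECC DIMMs protect much wider words. The function names `encode` and `decode` are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit SECDED codeword.

    Index 0 holds the overall parity bit; indexes 1..7 are the
    Hamming(7,4) positions, with check bits at positions 1, 2 and 4.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # makes the total parity even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                 # XOR of the positions of all set bits
    overall_odd = sum(code) % 2 == 1

    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        # Exactly one bit flipped; the syndrome is its position
        # (0 means the overall parity bit itself was hit).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return "uncorrectable", None

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


if __name__ == "__main__":
    word = encode(0b1011)
    single = list(word)
    single[5] ^= 1                          # one flipped bit
    double = list(word)
    double[2] ^= 1
    double[6] ^= 1                          # two flipped bits
    print(decode(word), decode(single), decode(double))
```

A single flipped bit is repaired transparently, while two flipped bits can only be detected. A stronger code would move that boundary, which is why the classification depends on the ECC scheme in use.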

DIMMs with correctable errors are not disabled and remain available for the OS to use. The Total Memory and Effective Memory are the same (taking memory mirroring into account). UCSM reports these DIMMs with an operability state of Degraded, while the overall operability is Operable with correctable errors.

Uncorrectable errors generally cannot be fixed and may make it impossible for the application or operating system to continue execution. DIMMs with uncorrectable errors are disabled, and the OS does not see that memory. The UCSM operState changes to “Inoperable” in this case.
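
The two paragraphs above can be summarized in a small model. This is a sketch under stated assumptions, not UCSM's actual logic: the `Dimm` class, its fields, and the function names are hypothetical, memory mirroring is not modeled, and the state names are simply the ones quoted in the text.

```python
from dataclasses import dataclass


@dataclass
class Dimm:
    slot: str
    size_gb: int
    error: str = "none"   # hypothetical field: "none", "correctable" or "uncorrectable"


def oper_state(dimm):
    """Map a DIMM's error type to the operability state described above."""
    if dimm.error == "uncorrectable":
        return "Inoperable"   # DIMM is disabled; the OS does not see it
    if dimm.error == "correctable":
        return "Degraded"     # DIMM stays enabled and usable
    return "Operable"


def memory_summary(dimms):
    """Return (total_gb, effective_gb), ignoring memory mirroring."""
    total = sum(d.size_gb for d in dimms)
    # Only DIMMs disabled by uncorrectable errors reduce effective memory.
    effective = sum(d.size_gb for d in dimms if d.error != "uncorrectable")
    return total, effective
```

With three 16 GB DIMMs, of which one has a correctable error and one an uncorrectable error, this model gives 48 GB total and 32 GB effective: the correctable error leaves the capacity untouched.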

To Check Errors from CLI

These commands are useful when troubleshooting errors from CLI.

scope server x/y -> show memory detail
scope server x/y -> show memory-array detail
scope server x/y -> scope memory-array x -> show stats history memory-array-env-stats detail

From memory array scope you can also get access to DIMM.

scope server X/Y -> scope memory-array Z -> scope dimm N

From there, you can obtain per-DIMM statistics or reset the error counters.

bdsol-6248-06-B /chassis/server/memory-array/dimm # reset-errors
bdsol-6248-06-B /chassis/server/memory-array/dimm* # commit-buffer
bdsol-6248-06-B /chassis/server/memory-array/dimm # show stats memory-error-state
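
When the same reset has to be repeated for several DIMMs, it can help to generate the command sequence rather than type it each time. The following is a minimal sketch that only builds the list of CLI lines shown above; the function name `dimm_reset_commands` is hypothetical, the lowercase `scope dimm` form is assumed from the CLI prompt path, and how the lines are sent to the fabric interconnect (for example over an SSH session) is left out.

```python
def dimm_reset_commands(chassis, blade, array, dimm):
    """Build the UCSM CLI lines that reset one DIMM's error counters.

    All four identifiers must be positive integers; the commands
    themselves are the ones shown in the transcript above.
    """
    for name, value in (("chassis", chassis), ("blade", blade),
                        ("array", array), ("dimm", dimm)):
        if not isinstance(value, int) or value < 1:
            raise ValueError(f"{name} must be a positive integer, got {value!r}")
    return [
        f"scope server {chassis}/{blade}",
        f"scope memory-array {array}",
        f"scope dimm {dimm}",
        "reset-errors",
        "commit-buffer",                    # apply the pending change
        "show stats memory-error-state",    # confirm the counters afterwards
    ]


if __name__ == "__main__":
    for line in dimm_reset_commands(1, 3, 1, 7):
        print(line)
```

Keeping the builder separate from the transport makes it easy to review the exact commands before anything is run against a production system.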

If you see a correctable error reported that matches the information above, the problem can be corrected by resetting the BMC instead of reseating or resetting the blade server. Resetting the BMC does not impact the OS running on the blade.

Use these Cisco UCS Manager CLI commands:

UCS1-A# scope server x/y
UCS1-A /chassis/server # scope bmc
UCS1-A /chassis/server/bmc # reset
UCS1-A /chassis/server/bmc* # commit-buffer

With UCSM releases 3.1 and 2.2.7, the thresholds for memory corrected errors have been removed.

Therefore, memory modules (DIMMs) are no longer reported as “Inoperable” or “Degraded” solely due to corrected memory errors.

As per the white paper at http://www.cisco.com/c/dam/en/us/products/collateral/servers-unified-computing/ucs-manager/whitepaper-c11-736116.pdf:

Industry demands for greater capacity, greater bandwidth, and lower operating voltages lead to increased memory error rates. Traditionally, the industry has treated correctable errors in the same way as uncorrectable errors, requiring the module to be replaced immediately upon alert. Given extensive research showing that correctable errors are not correlated with uncorrectable errors, and that correctable errors do not degrade system performance, the Cisco UCS team recommends against immediate replacement of modules with correctable errors. Customers who experience a Degraded memory alert for correctable errors should reset the memory error and resume operation. Following this recommendation avoids unnecessary server disruption. Future enhancements to error management are planned and will help distinguish among various types of correctable errors and identify the appropriate actions, if any, needed.

It is recommended to run at least version 2.1(3c) or 2.2(1b), which include enhancements to UCS memory error management.
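
A version check like this can be scripted when auditing many domains. The sketch below assumes version strings of the form `2.1(3c)`, and it interprets the recommendation as: at least 2.1(3c) on the 2.1 train, at least 2.2(1b) on the 2.2 train, and any later train qualifies. That interpretation and the function names are assumptions, not something stated by Cisco.

```python
import re

# Matches strings such as "2.1(3c)" or "2.2(1b)": major.minor(maintenance + patch letter).
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\((\d+)([a-z]*)\)$")


def parse_version(text):
    """Split a version string into (major, minor, maintenance, patch_letter)."""
    match = _VERSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    major, minor, maint, patch = match.groups()
    return int(major), int(minor), int(maint), patch


def meets_minimum(text):
    """Return True if the version is at or above the recommended minimum."""
    version = parse_version(text)
    train = version[:2]
    if train < (2, 1):
        return False
    if train == (2, 1):
        return version >= (2, 1, 3, "c")
    if train == (2, 2):
        return version >= (2, 2, 1, "b")
    return True   # assumption: later trains already include the enhancement


if __name__ == "__main__":
    for candidate in ("2.1(3a)", "2.1(3c)", "2.2(1a)", "2.2(1b)", "3.1(1e)"):
        print(candidate, meets_minimum(candidate))
```

Tuples compare element by element, so the patch letter only matters when the numeric parts are equal.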

Methods to Clear DIMM Blacklisting Errors:

UCSM GUI

UCSM CLI

UCS-B/chassis/server # reset-all-memory-errors

If the above troubleshooting steps did not help, please raise a support request for assistance.