Memory errors are encountered when an attempt is made to read a memory location and the value read does not match the value that is supposed to be there.

Classification of Memory Errors

Detected Versus Undetected Errors:

A system without error-correcting code (ECC) memory will not detect hardware memory errors. Such errors can silently lead to data corruption, incorrect operation of the operating system or applications, and eventually system failures. Cisco Unified Computing System™ (Cisco UCS®) servers use ECC memory. The error-correcting codes supported by the Intel® Xeon® processors in Cisco UCS servers detect memory errors so that silent data corruption does not occur.
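The difference between a detected and an undetected error can be illustrated with a minimal Python sketch. It uses a single even-parity bit, which is far weaker than the ECC used in servers and is chosen here only for brevity. All names (`store`, `flip_bit`, `read_unprotected`, `read_with_parity`, `DetectedMemoryError`) are hypothetical and do not correspond to any Cisco or Intel interface.

```python
from typing import Tuple


class DetectedMemoryError(Exception):
    """Raised when the stored check bit no longer matches the data."""


def parity(value: int) -> int:
    """Return the even-parity bit of a non-negative integer (XOR of its bits)."""
    return bin(value).count("1") % 2


def store(value: int) -> Tuple[int, int]:
    """Simulate a protected write: keep the data together with its parity bit."""
    return value, parity(value)


def flip_bit(value: int, position: int) -> int:
    """Simulate a single-bit memory error at the given bit position."""
    return value ^ (1 << position)


def read_unprotected(stored_value: int) -> int:
    """No check bits: whatever is in memory is returned, right or wrong."""
    return stored_value


def read_with_parity(stored_value: int, stored_parity: int) -> int:
    """Recompute parity on read and refuse to return data that fails the check."""
    if parity(stored_value) != stored_parity:
        raise DetectedMemoryError("parity mismatch: data was altered in memory")
    return stored_value


if __name__ == "__main__":
    data, check = store(0b10110010)
    corrupted = flip_bit(data, 3)

    # Without protection the wrong value is consumed silently.
    print("unprotected read returned wrong data:", read_unprotected(corrupted) != data)

    # With a check bit the same error is reported instead of being used.
    try:
        read_with_parity(corrupted, check)
    except DetectedMemoryError as exc:
        print("protected read:", exc)
```

A single parity bit detects any odd number of flipped bits but cannot correct them or detect an even number of flips, which is why ECC memory uses stronger codes.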

Hard Versus Soft Errors:

Errors that are caused by a persistent physical defect are traditionally referred to as “hard” errors. A hard error may be caused by an assembly defect such as a solder bridge or cracked solder joint, or may be the result of a defect in the memory chip itself. Rewriting the memory location and retrying the read access will not eliminate a hard error. This error will continue to repeat.

Errors caused by a brief electrical disturbance, either inside the DRAM chip or on an external interface, are referred to as “soft” errors. Soft errors are transient and do not continue to repeat. If the soft error was the result of a disturbance during the read operation, then simply retrying the read may yield correct data. If the soft error was caused by a disturbance that upset the contents of the memory array, then rewriting the memory location will correct the error.

Hard errors are typically detected by memory tests run by the Cisco UCS BIOS at boot time, and any modules containing hard errors are mapped out so that they cannot cause errors during runtime. Cisco UCS servers employ memory patrol scrubbing to automatically detect and correct soft errors during runtime.
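The retry-then-rewrite reasoning described above can be sketched in Python against a simulated memory word. This is an illustration only: the `SimulatedCell` class, its stuck-at fault model, and the `classify` function are assumptions made for this example and do not describe how the Cisco UCS BIOS or memory controller is implemented.

```python
class SimulatedCell:
    """One memory word with an optional stuck-at defect (hypothetical model)."""

    def __init__(self, value: int, stuck_mask: int = 0, stuck_value: int = 0):
        self.stuck_mask = stuck_mask                 # bits that can never change
        self.stuck_value = stuck_value & stuck_mask  # the level those bits are stuck at
        self.pending_read_glitch = 0                 # bits to corrupt on the next read only
        self._data = 0
        self.write(value)

    def write(self, value: int) -> None:
        # Stuck bits ignore whatever is written: a persistent physical defect.
        self._data = (value & ~self.stuck_mask) | self.stuck_value

    def read(self) -> int:
        # A read glitch disturbs a single read and then disappears.
        result = self._data ^ self.pending_read_glitch
        self.pending_read_glitch = 0
        return result

    def upset(self, bit: int) -> None:
        # A transient disturbance that flips a stored bit in the array.
        flipped = self._data ^ (1 << bit)
        self._data = (flipped & ~self.stuck_mask) | self.stuck_value


def classify(cell: SimulatedCell, expected: int) -> str:
    """Classify an error by retrying the read, then rewriting and rereading."""
    if cell.read() == expected:
        return "no error"
    if cell.read() == expected:           # retry alone fixed it
        return "soft (read disturbance)"
    cell.write(expected)                  # rewrite the location
    if cell.read() == expected:
        return "soft (array upset)"
    return "hard"                         # error persists after rewrite and retry


if __name__ == "__main__":
    glitchy = SimulatedCell(0xA5)
    glitchy.pending_read_glitch = 0x01
    upset = SimulatedCell(0xA5)
    upset.upset(2)
    defective = SimulatedCell(0xA5, stuck_mask=0x01, stuck_value=0x00)

    for name, cell in [("glitchy", glitchy), ("upset", upset), ("defective", defective)]:
        print(name, "->", classify(cell, 0xA5))
```

In a real system the expected value is not known in advance; the ECC check bits play that role, and the sketch passes `expected` explicitly only to keep the example short.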

Correctable Versus Uncorrectable Errors:

Whether a particular error is correctable or uncorrectable depends on the strength of the ECC employed in the memory system. Dedicated hardware fixes correctable errors as they occur, with no impact on program execution. Uncorrectable errors generally cannot be fixed and may make it impossible for the application or operating system to continue running.
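How code strength separates correctable from uncorrectable errors can be shown with a small single-error-correct, double-error-detect (SECDED) code. The Python sketch below uses an extended Hamming code over 4 data bits purely as a teaching example; the codes used by real memory controllers are wider and stronger, and the function names `encode` and `decode` are hypothetical.

```python
from typing import List, Optional, Tuple


def encode(nibble: int) -> List[int]:
    """Encode 4 data bits into an 8-bit extended Hamming codeword.

    Index 0 holds the overall parity bit; indices 1..7 are the classic
    Hamming(7,4) positions (parity at 1, 2, 4 and data at 3, 5, 6, 7).
    """
    code = [0] * 8
    code[3] = (nibble >> 0) & 1
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code


def decode(received: List[int]) -> Tuple[Optional[int], str]:
    """Return (data, status), where status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(received)

    # The syndrome is the XOR of the positions of all set bits; it is zero
    # for a valid codeword and equals the position of a single flipped bit.
    syndrome = 0
    for position in range(1, 8):
        if code[position]:
            syndrome ^= position
    overall_parity_fails = sum(code) % 2 == 1

    if syndrome == 0 and not overall_parity_fails:
        status = "ok"
    elif overall_parity_fails:
        # An odd number of flips, treated as a single-bit error. A syndrome
        # of 0 here means the overall parity bit itself was hit.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with good overall parity: two bits flipped.
        return None, "uncorrectable"

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


if __name__ == "__main__":
    word = encode(0b1011)

    single = list(word)
    single[6] ^= 1
    print("single-bit error:", decode(single))

    double = list(word)
    double[2] ^= 1
    double[6] ^= 1
    print("double-bit error:", decode(double))
```

With this code, any one flipped bit is repaired transparently, while any two flipped bits are reported rather than miscorrected. Three or more flips can exceed what the code guarantees, which is the sense in which correctability depends on code strength.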

Cisco UCS B-Series and C-Series Operating in UCSM 2.2 and 3.1:

To reset the memory-error counters on a Cisco UCS B-Series or C-Series server in UCSM 2.2 and 3.1, run the following commands on the CLI:

ca-1-A# scope server 1/8

ca-1-A /chassis/server # reset-all-memory-errors

ca-1-A /chassis/server* # commit

Cisco UCS B-Series and C-Series Operating in UCSM 2.1:

To reset the memory-error counters on a Cisco UCS B-Series or C-Series server in UCSM 2.1, run the following commands on the CLI:

Switch-A # scope server 1/1

Switch-A /chassis/server # scope memory-array 1

Switch-A /chassis/server/memory-array # scope dimm 2

Switch-A /chassis/server/memory-array/dimm # reset-errors
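Because the UCSM 2.1 sequence above operates on one DIMM at a time, it may be convenient to generate the command list for several DIMMs. The following Python sketch only builds text from the four commands shown above. The function name `reset_commands_2_1` and its parameters are hypothetical, and it assumes that each generated sequence is entered starting from the top-level prompt; verify the server, memory-array, and DIMM IDs against your own system before using the output.

```python
from typing import Iterable, List


def reset_commands_2_1(server: str, memory_array: int, dimms: Iterable[int]) -> List[List[str]]:
    """Build one command sequence per DIMM, mirroring the UCSM 2.1 steps above.

    Assumption: every sequence is typed starting from the top-level prompt.
    """
    sequences = []
    for dimm in dimms:
        sequences.append([
            f"scope server {server}",
            f"scope memory-array {memory_array}",
            f"scope dimm {dimm}",
            "reset-errors",
        ])
    return sequences


if __name__ == "__main__":
    # Example IDs only; substitute the values for the affected server and DIMMs.
    for sequence in reset_commands_2_1("1/1", 1, [2, 3]):
        print("\n".join(sequence))
        print()
```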

Cisco UCS C-Series Rack Servers Operating in Standalone Mode:

To reset the memory-error counters on a Cisco UCS C-Series Rack Server operating in standalone mode, run the following commands on the CLI:

C240-FCH092779J# scope reset-ecc

C240-FCH092779J /reset-ecc # set enabled yes

C240-FCH092779J /reset-ecc *# commit

For additional information about memory, refer to this resource:

DIMMs: Reasons to Use Only Cisco Qualified Memory on Cisco UCS Servers