How to identify defective DIMM from EDAC error on Linux

DIMM error is rare, but sometime still happens. It's easy to identify them if they are completely dead, however, if a DIMM has some corrected errors, how to identify it?

I have another article listed memory testing tools on linux, this time, I use EDAC error report utility

Here is an example show you how to identify defective DIMM on an AMD_x64 archtecture machine, syslog reorted kernel error from EDAC (Error Detection and Correction kernel module).

Here is a piece of typical error message from EDAC

kernel: [Hardware Error]: MC4 Error (node 1): DRAM ECC error detected on the NB.
kernel: EDAC amd64 MC1: CE ERROR_ADDRESS= 0xf075b2410
kernel: EDAC MC1: CE page 0xf075b2, offset 0x410, grain 0, syndrome 0xa082, row 6, channel 0, label "": amd64_edac
kernel: [Hardware Error]: Error Status: Corrected error, no action required.
kernel: [Hardware Error]: CPU:6 (10:8:0) MC4_STATUS[-|CE|MiscV|-|AddrV|CECC]: 0x9c414000a0080813
kernel: [Hardware Error]: MC4_ADDR: 0x0000000f075b2410
kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: SRC (no timeout)

You may get confused by the message above, here is a quick way to show you what are they:

The structure of the message is:

 the memory controller                   (MC1)
 Error type                              (CE)
 memory page                             (0xf075b2)
 offset in the page                      (0x410)
 The byte granularity                    (grain 0)
 The error syndrome                      (0xb741)
 memory row                              (row 6) 
 memory channel                          (channel 0) 
 DIMM label                              Not given
 Module name                             amd64_edac

More explain about info given

EDAC is composed of a "core" module (edac_core.ko) and several Memory Controller (MC) driver modules. On a given system, the CORE is loaded and one MC driver will be loaded. Both the CORE and the MC driver (or edac_device driver) have individual versions that reflect current release level of their respective modules. Thus, to "report" on what version a system is running, one must report both the CORE's and the MC driver's versions.

The example server I used in this article has these two edac module loaded:

# lsmod | grep -i edac
amd64_edac_mod         21913  0 
edac_core              46645  4 amd64_edac_mod
edac_mce_amd           15615  1 amd64_edac_mod

Memory Controller (mc) Model, the memory controller's model abstracted in EDAC. Each 'mc' device controls a set of DIMM memory modules. These modules are laid out in a Chip-Select Row (csrowX) and Channel table (chX).

There can be multiple csrows and multiple channels. Memory controllers allow for several csrows, with 8 csrows being a typical value.

Channel, each channel represents a DIMM module. Dual channels allows for 128 bit data transfers to the CPU from memory. Some system supports more channels.

Csrow, Chip-Select Row, shows how memory module assembled, single or dual rank or more, the actual number of csrows depends on the electrical "loading" of a given motherboard, memory controller and DIMM characteristics.

For single rank DIMM module, a pair of DIMMs merge into one csrow, typically, you will see only csrow0, while csrow1 will be empty.

See more detail about EDAC in EDAC error detection and report

Use edac-util tool to identify

See more examples about edac-util

Check MC info and status

# edac-util -vs
edac-util: EDAC drivers are loaded. 2 MCs detected:
  mc0:F10h
  mc1:F10h

Check EDAC error report by edac-util tool

# edac-util -v 
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: ch0: 0 Corrected Errors
mc0: csrow0: ch1: 0 Corrected Errors
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: ch0: 0 Corrected Errors
mc0: csrow1: ch1: 0 Corrected Errors
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: ch0: 0 Corrected Errors
mc0: csrow2: ch1: 0 Corrected Errors
mc0: csrow3: 0 Uncorrected Errors
mc0: csrow3: ch0: 0 Corrected Errors
mc0: csrow3: ch1: 0 Corrected Errors
mc0: csrow4: 0 Uncorrected Errors
mc0: csrow4: ch0: 0 Corrected Errors
mc0: csrow4: ch1: 0 Corrected Errors
mc0: csrow5: 0 Uncorrected Errors
mc0: csrow5: ch0: 0 Corrected Errors
mc0: csrow5: ch1: 0 Corrected Errors
mc0: csrow6: 0 Uncorrected Errors
mc0: csrow6: ch0: 0 Corrected Errors
mc0: csrow6: ch1: 0 Corrected Errors
mc0: csrow7: 0 Uncorrected Errors
mc0: csrow7: ch0: 0 Corrected Errors
mc0: csrow7: ch1: 0 Corrected Errors
mc1: 0 Uncorrected Errors with no DIMM info
mc1: 0 Corrected Errors with no DIMM info
mc1: csrow0: 0 Uncorrected Errors
mc1: csrow0: ch0: 0 Corrected Errors
mc1: csrow0: ch1: 0 Corrected Errors
mc1: csrow1: 0 Uncorrected Errors
mc1: csrow1: ch0: 0 Corrected Errors
mc1: csrow1: ch1: 0 Corrected Errors
mc1: csrow2: 0 Uncorrected Errors
mc1: csrow2: ch0: 0 Corrected Errors
mc1: csrow2: ch1: 0 Corrected Errors
mc1: csrow3: 0 Uncorrected Errors
mc1: csrow3: ch0: 0 Corrected Errors
mc1: csrow3: ch1: 0 Corrected Errors
mc1: csrow4: 0 Uncorrected Errors
mc1: csrow4: ch0: 0 Corrected Errors
mc1: csrow4: ch1: 0 Corrected Errors
mc1: csrow5: 0 Uncorrected Errors
mc1: csrow5: ch0: 0 Corrected Errors
mc1: csrow5: ch1: 0 Corrected Errors
mc1: csrow6: 0 Uncorrected Errors
mc1: csrow6: ch0: 7 Corrected Errors
mc1: csrow6: ch1: 0 Corrected Errors
mc1: csrow7: 0 Uncorrected Errors
mc1: csrow7: ch0: 0 Corrected Errors
mc1: csrow7: ch1: 0 Corrected Errors

This is consistent with the error what we saw in the syslog, same information can also be found in /sys/devices/system/edac/mc/mc[0|1]

# ls /sys/devices/system/edac/mc/mc1
ce_count     csrow0  csrow2  csrow4  csrow6  device   reset_counters    seconds_since_reset  ue_count
ce_noinfo_count  csrow1  csrow3  csrow5  csrow7  mc_name  sdram_scrub_rate  size_mb         ue_noinfo_count

Info Analysis

So, according to above info we get, we can tell the server has two memory controllers, 8 pairs of dual ranked memory modules.

As we know the memory error located at

mc1: csrow6: ch0: 7 Corrected Errors

What it tells us is the physical DIMM:

In the second memory controller(mc1).
Fourth pair of DIMM (csrow6 and csrow7)
ch0 tells it's the first DIMM in the DIMM pair.

Use the info above, you can easily find it according the hardware info of the server(usually you can find the motherboard articheture map)

Another way of locating the defective DIMM(May not work on some systems )

Use dmidecode to show each memory module's address, see more dmidecode examples

#dmidecode -t 20
Handle 0x0058, DMI type 20, 19 bytes
Memory Device Mapped Address
    Starting Address: 0x00F00000000
    Ending Address: 0x00FFFFFFFFF
    Range Size: 4 GB
    Physical Device Handle: 0x0057
    Memory Array Mapped Address Handle: 0x0038
    Partition Row Position: 2
    Interleave Position: 2

Use the memory error address reported in syslog to locate the defective DIMM.

kernel: EDAC amd64 MC1: CE ERROR_ADDRESS= 0xf075b2410

How to identify defective DIMM from EDAC error on Linux

How to identify defective DIMM from EDAC error on Linux

More explain about info given

Check MC info and status

Check EDAC error report by edac-util tool

Info Analysis

Another way of locating the defective DIMM(May not work on some systems )

results matching ""

No results matching ""