Noticed that an older Dell PE860 server filled the system log with lots of messages like these:
Jan 12 06:25:06 srv1 kernel: [390027.492118] EDAC MC0: CE page 0xb041c, offset 0x480, grain 128, syndrome 0x86, row 1, channel 1, label "": i3000 CE
Jan 12 06:25:09 srv1 kernel: [390030.492108] EDAC MC0: CE page 0xb041d, offset 0x0, grain 128, syndrome 0x86, row 1, channel 1, label "": i3000 CE
Jan 12 06:55:05 srv1 kernel: [391826.520108] EDAC MC0: CE page 0xb0494, offset 0x0, grain 128, syndrome 0x86, row 1, channel 1, label "": i3000 CE
Kernel messages are not something you like to see in your logs, and certainly not so many of them. But what the heck do they mean?
They come from the EDAC subsystem. From EdacWiki:
EDAC Stands for “Error Detection and Correction”. The Linux EDAC project comprises a series of Linux kernel modules, which make use of error detection facilities of computer hardware, currently hardware which detects the following errors is supported:
- System RAM errors (this is the original, and most mature part of the project) – many computers support RAM EDAC, (especially for chipsets which are aimed at high-reliability applications), but RAM which has extra storage capacity (“ECC RAM”) is needed for these facilities to operate
- RAM scrubbing – some memory controllers support “scrubbing” DRAM during normal operation. Continuously scrubbing DRAM allows for actively detecting and correcting ECC errors.
- PCI bus transfer errors – the majority of PCI bridges, and peripherals support such error detection
- Cache ECC errors
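The particular error messages above mean that EDAC has detected problems with at least one ECC memory module in this server. The "CE" in each message marks a correctable error: ECC repaired the faulty bits before any software saw them, which is why we had not noticed any problems or crashes of applications or VMs on the server. So this is more of an early warning than an actual failure.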
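To see whether the errors cluster on one location, the log lines can be pulled apart with a small script. The following is only a sketch: the regular expression is written against the i3000 message format shown above (other EDAC drivers and kernel versions word their messages differently), the function names are made up for this post, and the physical address calculation assumes 4 KiB pages.

```python
import re
from collections import Counter

# Matches the i3000-style CE lines from this post; not a general EDAC parser.
CE_LINE = re.compile(
    r"EDAC MC(?P<mc>\d+): CE page 0x(?P<page>[0-9a-f]+), "
    r"offset 0x(?P<offset>[0-9a-f]+), grain (?P<grain>\d+), "
    r"syndrome 0x(?P<syndrome>[0-9a-f]+), "
    r"row (?P<row>\d+), channel (?P<channel>\d+)"
)

PAGE_SHIFT = 12  # assumption: 4 KiB pages


def parse_ce_line(line):
    """Return a dict of the CE fields, or None if the line does not match."""
    m = CE_LINE.search(line)
    if m is None:
        return None
    page = int(m.group("page"), 16)
    offset = int(m.group("offset"), 16)
    return {
        "mc": int(m.group("mc")),
        "row": int(m.group("row")),
        "channel": int(m.group("channel")),
        "syndrome": int(m.group("syndrome"), 16),
        # Page frame number shifted up, plus the offset inside the page.
        "address": (page << PAGE_SHIFT) | offset,
    }


def count_by_location(lines):
    """Count corrected errors per (mc, row, channel)."""
    counts = Counter()
    for line in lines:
        event = parse_ce_line(line)
        if event is not None:
            counts[(event["mc"], event["row"], event["channel"])] += 1
    return counts
```

Feeding the system log through `count_by_location` should show at a glance whether everything lands on the same row and channel, as it does in the three sample lines (MC0, row 1, channel 1), which would point at a single module rather than a general problem.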
This EdacWiki page has more information about EDAC memory messages and how to diagnose and deal with these issues.
One particularly interesting takeaway for me: The above page recommends
- Don’t enable BIOS “quick boot”.
- Don’t manually skip BIOS memory check
After enabling the BIOS memory check and rebooting, the number of EDAC messages dropped massively. They were not fully gone, though, which indicates that there is an issue with some memory module. Further checking with Dell OpenManage (in German) confirmed a bad memory module.
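Instead of counting log lines before and after such a change, the running totals can also be read from sysfs. A minimal sketch, assuming the EDAC driver exposes its counters under `/sys/devices/system/edac/mc/mcN/ce_count` (the usual location, but I have not checked it on every kernel); the function name `read_ce_counts` is my own:

```python
from pathlib import Path


def read_ce_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller_name: corrected_error_count} from EDAC sysfs."""
    counts = {}
    root = Path(edac_root)
    if not root.is_dir():
        return counts  # EDAC module not loaded, or path differs on this system
    for mc_dir in sorted(root.glob("mc[0-9]*")):
        counter = mc_dir / "ce_count"
        try:
            counts[mc_dir.name] = int(counter.read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers without a readable counter
    return counts
```

Calling `read_ce_counts()` at intervals and comparing the numbers gives a rough error rate. The counters start again from zero after a reboot, so values from before and after the BIOS change are only comparable over similar uptimes.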