Dell OpenManage on Dell R300 Server – network gone

Installing Dell OpenManage 7.4.0 (the latest version from the Dell apt repository) on a PE860 server running Debian 7 worked without issues and proved to be useful, so I proceeded to do the same on an R300 server. That didn’t go well.

Installing the packages went fine; the next step was starting the OpenManage “dataeng” service. Right after that all hell broke loose: iSCSI connections were cut, VMs using iSCSI disks hung, and SSH connections were dropped. Luckily this wasn’t a remote server.

Console worked, no signs of a kernel crash or kernel errors in the logs. But the system seemed to have lost network connectivity and could not ping anything except localhost, which explains the observed behavior.

As OpenManage wants to run its own SNMP Server, I figured the already running SNMP server might conflict with it. Disabled the Debian snmpd service, left OpenManage dataeng enabled. Reboot, same problem.

Disable OpenManage dataeng service, reboot, no problem. So OpenManage is part of the problem. Need to investigate.

I do not think Debian is part of the problem, because I saw the same behavior about 3 years ago with bigger Dell servers (R710?) and SUSE Linux Enterprise 11.

Installing Dell OpenManage on Debian Linux

Dell OpenManage is a software suite for querying the status of Dell server hardware and configuring it. It lets you read the hardware and system status from the command line or a web interface, for example to determine the hardware configuration or to detect failed hard disks, system fans and so on.

Dell provides the software free of charge for Windows, VMware ESXi and some Linux distributions. For Debian/Ubuntu Linux there is an APT repository from which the OpenManage software can be installed easily. Dell's installation instructions in brief:

  • Create a file under /etc/apt/sources.list.d/ for the APT repository. The name of the installed Debian/Ubuntu release (e.g. “wheezy” for Debian 7) has to be specified in it
  • Import Dell's GPG key and initialize the repository (“apt-get update”)
  • Install the packages with apt-get install

The package “srvadmin-all” installs all software components. The list of packages that get newly installed is fairly long and includes, among other things, Java and quite a few libraries. If you can do without the web GUI, it may be sufficient to install only the “srvadmin-base” package.

The installation runs without problems. To use the OpenManage functionality afterwards, either reboot the server or start the new “dataeng” service manually once (“service dataeng start”).
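The steps above can be sketched as a small shell script. This is only a sketch under assumptions: the repository URL, the component name, the list file name and the key file path are placeholders that do not come from this post and have to be replaced with the values from Dell's installation instructions. Only the package names, the “wheezy” codename and the “dataeng” service are taken from the text.

```shell
#!/bin/sh
# Placeholders (assumptions, NOT Dell's real values): replace them with the
# values from Dell's installation instructions.
REPO_URL="${REPO_URL:-http://repo.example.invalid/debian}"
REPO_COMPONENT="${REPO_COMPONENT:-openmanage}"
LIST_FILE="/etc/apt/sources.list.d/dell-openmanage.list"   # hypothetical file name
KEY_FILE="/root/dell-repo-key.asc"                         # hypothetical path to the downloaded GPG key

# Print the APT source line for a release codename such as "wheezy".
repo_line() {
    printf 'deb %s %s %s\n' "$REPO_URL" "$1" "$REPO_COMPONENT"
}

# Run the installation. Needs root. The function is only defined here,
# it is not called automatically.
install_openmanage() {
    codename="$1"
    repo_line "$codename" > "$LIST_FILE" || return 1   # step 1: repository file
    apt-key add "$KEY_FILE" || return 1                # step 2: import the GPG key ...
    apt-get update || return 1                         # ... and initialize the repository
    apt-get install -y srvadmin-base || return 1       # step 3: or srvadmin-all incl. web GUI
    service dataeng start                              # start the new service once
}
```

Calling `repo_line wheezy` just prints the source line, so the result can be checked before `install_openmanage wheezy` is run as root. Given what happened on the R300, it is probably wise to start “dataeng” for the first time from the local console rather than over SSH.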

The post “Fehlersuche auf Dell Hardware mit OpenManage” (troubleshooting Dell hardware with OpenManage) shows how to use the OpenManage command line tools. “Dell OpenManage Web-Interface einrichten und benutzen” (setting up and using the Dell OpenManage web interface) describes the OpenManage web interface.

EDAC Linux Kernel Messages

Noticed that an older Dell PE860 server filled the system log with lots of messages like these:

Jan 12 06:25:06 srv1 kernel: [390027.492118] EDAC MC0: CE page 0xb041c, offset 0x480, grain 128, syndrome 0x86, row 1, channel 1, label "": i3000 CE
Jan 12 06:25:09 srv1 kernel: [390030.492108] EDAC MC0: CE page 0xb041d, offset 0x0, grain 128, syndrome 0x86, row 1, channel 1, label "": i3000 CE
Jan 12 06:55:05 srv1 kernel: [391826.520108] EDAC MC0: CE page 0xb0494, offset 0x0, grain 128, syndrome 0x86, row 1, channel 1, label "": i3000 CE

Kernel messages are not something you like to see in your logs, and certainly not so many of them. But what the heck do they mean?

They come from the EDAC subsystem. From EdacWiki:

EDAC Stands for “Error Detection and Correction”. The Linux EDAC project comprises a series of Linux kernel modules, which make use of error detection facilities of computer hardware, currently hardware which detects the following errors is supported:

  • System RAM errors (this is the original, and most mature part of the project) – many computers support RAM EDAC, (especially for chipsets which are aimed at high-reliability applications), but RAM which has extra storage capacity (“ECC RAM”) is needed for these facilities to operate
  • RAM scrubbing – some memory controllers support “scrubbing” DRAM during normal operation. Continuously scrubbing DRAM allows for actively detecting and correcting ECC errors.
  • PCI bus transfer errors – the majority of PCI bridges, and peripherals support such error detection
  • Cache ECC errors

The particular error messages above are corrected errors (“CE”): EDAC has detected problems with at least one ECC memory module in this server, but ECC was able to correct them. Actually kind of pre-detected, because we did not notice any problems or crashes of applications or VMs on the server.
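To see where the errors cluster, the log lines can be counted per memory controller, row and channel. The following is a minimal sketch that assumes the message format shown above. The function name `edac_ce_summary` is made up for this example, and other kernel versions or EDAC drivers may format their messages differently.

```shell
#!/bin/sh
# Count EDAC corrected-error (CE) messages per memory controller, row and channel.
# Reads syslog-style lines on stdin and prints "<count> <MC> row=<n> channel=<n>",
# highest count first.
edac_ce_summary() {
    awk '
        / EDAC MC[0-9]+: CE / {
            mc = ""; row = ""; ch = ""
            for (i = 1; i < NF; i++) {
                # The value follows its keyword; strip the trailing ":" or ",".
                if ($i == "EDAC")    { mc  = $(i + 1); sub(/:$/, "", mc) }
                if ($i == "row")     { row = $(i + 1); sub(/,$/, "", row) }
                if ($i == "channel") { ch  = $(i + 1); sub(/,$/, "", ch) }
            }
            count[mc " row=" row " channel=" ch]++
        }
        END { for (k in count) print count[k], k }
    ' | sort -rn
}
```

Typical use would be something like `edac_ce_summary < /var/log/kern.log`. The log file location is an assumption and depends on the syslog configuration. If all messages point to the same row and channel, as in the excerpt above, a single module is the likely suspect.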

This EdacWiki page has more information about EDAC memory messages and how to diagnose and deal with these issues.

One particularly interesting takeaway for me: the above page recommends

  • Don’t enable BIOS “quick boot”.
  • Don’t manually skip BIOS memory check

After enabling the BIOS memory check and rebooting, the number of EDAC messages dropped massively. Not fully gone, though. This indicates that there is an issue with some memory module. Further checking with Dell OpenManage confirmed a bad memory module.
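Besides watching the log, the EDAC drivers also expose running error counters in sysfs, which makes it easier to check whether errors are still being counted after a change. This is a minimal sketch that assumes the usual layout under /sys/devices/system/edac/mc with one “mcN” directory per memory controller. Whether these files exist depends on the kernel and on an EDAC driver being loaded, and the function name is made up for this example.

```shell
#!/bin/sh
# Print corrected (CE) and uncorrected (UE) error counters per memory controller.
# The base directory can be passed as first argument, e.g. for testing.
edac_counters() {
    base="${1:-/sys/devices/system/edac/mc}"
    for mc in "$base"/mc*; do
        # Skip anything without a readable counter (also covers "no match").
        [ -r "$mc/ce_count" ] || continue
        printf '%s ce=%s ue=%s\n' "${mc##*/}" \
            "$(cat "$mc/ce_count")" \
            "$(cat "$mc/ue_count" 2>/dev/null)"
    done
}
```

Running `edac_counters` twice some hours apart and comparing the “ce” values shows whether corrected errors are still accumulating.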