Lab: Debian 8

Today I set up a test server with Debian 8.3 “jessie” on a Dell PE1850. Nothing spectacular.

I deliberately waited to see how Debian 8 would evolve and whether major problems would surface. There is this “systemd” thing, which I view with some skepticism. And I have my own memories of Debian major-release upgrades: bad ones (6 => 7) and terrible ones (4 => 5). Current regression bugs do not exactly strengthen my confidence either.

I had planned to phase out Debian on my own servers and switch to CentOS, the Red Hat Enterprise clone, but I am holding off on that for now: core infrastructure should not be migrated without good reason.

I also need hands-on experience with Debian 8, because some customers run Debian 7 and will have to upgrade in the foreseeable future.

I am going to switch my standard DevOps tool from Puppet to Ansible, and this test server is perfect for that.

Last but not least, the server will be used for experiments with OpenStack and Docker.

Dell OpenManage on Dell R300 Server – network gone

After Dell OpenManage 7.4.0 (the latest version from the Dell apt repository) had installed without issues on a PE860 server running Debian 7 [lang:german] and proved to be useful [lang:german], I proceeded to do the same on an R300 server. That didn’t go well…

Installing the packages went fine; the next step was starting the OpenManage “dataeng” service. Right after that, all hell broke loose: iSCSI connections were cut, VMs using iSCSI disks hung, SSH connections were dropped. Luckily this wasn’t a remote server.

The console still worked, and there were no signs of a kernel crash or kernel errors in the logs. But the system seemed to have lost network connectivity and could not ping anything except localhost, which explains the observed behavior.

As OpenManage wants to run its own SNMP server, I figured the already running SNMP server might conflict with it. I disabled the Debian snmpd service and left OpenManage dataeng enabled. Reboot, same problem.

With the OpenManage dataeng service disabled instead: reboot, no problem. So OpenManage is part of the problem. Needs further investigation.

I do not think Debian is part of the problem, because I saw the same behavior about three years ago with bigger Dell servers (R710?) and SUSE Linux Enterprise 11.

Installing Dell OpenManage on Debian Linux

Dell OpenManage is a software suite for status monitoring and configuration of Dell server hardware. It lets you read out the hardware and system status via command line or web interface, e.g. to determine the hardware configuration or to detect failed hard disks, system fans and the like.

Dell offers the software free of charge for Windows, VMware ESXi and some Linux distributions; for Debian/Ubuntu Linux there is an APT repository from which the OpenManage software can easily be installed. Dell’s installation instructions in a nutshell:

  • A file in /etc/apt/sources.list.d/ is created for the APT repository. The name of the installed Debian/Ubuntu release (e.g. “wheezy” for Debian 7) has to be specified here
  • Import Dell’s GPG key and initialize the repository (“apt-get update”)
  • Install the packages with apt-get install

The package “srvadmin-all” installs all software components. The list of packages to be installed is quite long and includes, among other things, Java and a number of libraries. If you can do without the web GUI, installing only the “srvadmin-base” package may be sufficient.

The installation runs without problems. To use the OpenManage functionality afterwards, the server either has to be rebooted, or the new “dataeng” service has to be started manually once (“service dataeng start”).
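The steps can be condensed into a short command sequence for Debian 7 “wheezy”. Note that the repository URL, the file name under /etc/apt/sources.list.d/ and the GPG key ID below are assumptions recalled from Dell’s community repository of that era, not taken from Dell’s instructions; verify them against Dell’s current documentation before use:

```shell
# Sketch of the installation steps for Debian 7 "wheezy".
# Repository URL, file name and GPG key ID are assumptions; verify first.
DIST=wheezy
SRC="deb http://linux.dell.com/repo/community/deb/latest ${DIST} openmanage"
echo "$SRC"
# As root:
#   echo "$SRC" > /etc/apt/sources.list.d/linux.dell.com.sources.list
#   gpg --keyserver pool.sks-keyservers.net --recv-key 1285491434D8786F
#   gpg -a --export 1285491434D8786F | apt-key add -
#   apt-get update
#   apt-get install srvadmin-base    # or srvadmin-all for the full suite
#   service dataeng start            # or reboot
```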

Fehlersuche auf Dell Hardware mit OpenManage (troubleshooting Dell hardware with OpenManage) shows how to use the OpenManage command-line tool. Dell OpenManage Web-Interface einrichten und benutzen (setting up and using the Dell OpenManage web interface) describes the OpenManage web interface.

EDAC Linux Kernel Messages

Noticed that an older Dell PE860 server filled the system log with lots of messages like these:

Jan 12 06:25:06 srv1 kernel: [390027.492118] EDAC MC0: CE page 0xb041c, offset 0x480, grain 128, syndrome 0x86, row 1, channel 1, label "": i3000 CE
Jan 12 06:25:09 srv1 kernel: [390030.492108] EDAC MC0: CE page 0xb041d, offset 0x0, grain 128, syndrome 0x86, row 1, channel 1, label "": i3000 CE
Jan 12 06:55:05 srv1 kernel: [391826.520108] EDAC MC0: CE page 0xb0494, offset 0x0, grain 128, syndrome 0x86, row 1, channel 1, label "": i3000 CE

Kernel messages are not something you like to see in your logs, and certainly not so many of them. But what the heck do they mean?

They come from the EDAC subsystem. From EdacWiki:

EDAC Stands for “Error Detection and Correction”. The Linux EDAC project comprises a series of Linux kernel modules, which make use of error detection facilities of computer hardware, currently hardware which detects the following errors is supported:

  • System RAM errors (this is the original, and most mature part of the project) – many computers support RAM EDAC, (especially for chipsets which are aimed at high-reliability applications), but RAM which has extra storage capacity (“ECC RAM”) is needed for these facilities to operate
  • RAM scrubbing – some memory controllers support “scrubbing” DRAM during normal operation. Continuously scrubbing DRAM allows for actively detecting and correcting ECC errors.
  • PCI bus transfer errors – the majority of PCI bridges, and peripherals support such error detection
  • Cache ECC errors

The particular error messages above mean that EDAC has detected problems with at least one ECC memory module in this server. Detected early, in a sense: we had not noticed any problems or crashes of applications or VMs on the server.
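To see at a glance which DIMM row/channel the messages cluster on, the relevant fields can be extracted with standard tools. A small sketch, run here on the log excerpt above; on a live system, feed /var/log/syslog through the same filter:

```shell
# Count EDAC corrected-error (CE) messages per DIMM row/channel.
summarise_edac() {
  grep 'EDAC MC' | sed -n 's/.*\(row [0-9]*, channel [0-9]*\).*/\1/p' | sort | uniq -c
}

# Run on the excerpt above; on a live box: summarise_edac < /var/log/syslog
summarise_edac <<'EOF'
Jan 12 06:25:06 srv1 kernel: [390027.492118] EDAC MC0: CE page 0xb041c, offset 0x480, grain 128, syndrome 0x86, row 1, channel 1, label "": i3000 CE
Jan 12 06:25:09 srv1 kernel: [390030.492108] EDAC MC0: CE page 0xb041d, offset 0x0, grain 128, syndrome 0x86, row 1, channel 1, label "": i3000 CE
EOF
```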

This EdacWiki page has more information about EDAC memory messages and how to diagnose and deal with these issues.

One particularly interesting takeaway for me: the above page recommends

  • Don’t enable BIOS “quick boot”.
  • Don’t manually skip the BIOS memory check.

After enabling the BIOS memory check and rebooting, the number of EDAC messages dropped massively. They were not fully gone, though, which indicates that there is an issue with some memory module. Further checking with Dell OpenManage [lang:german] confirmed a bad memory module.

Debian 7.9 network driver regression

So I thought I’d use this quiet weekend to install the pending Debian 7.9 updates on a couple of servers. That worked fine, except for the last and most important one. Of course.

We’re talking about 2 physical hosts in a remote datacenter. One of them running Xen and 5 VMs, the other one being a warm-spare backup for the first. All servers (2 hosts, 5 VMs) run Debian 7 “wheezy” which was installed at 7.0 and got upgraded all the way to 7.8 without problems.

I was already composing an “update done” email when I noticed that the console to that last server (the host running the VMs) was frozen. No pings to the system, and the VMs were not reachable either. Kicked out of the network, had to trigger Ctrl-Alt-Del remotely as remote access was cut off.

The machine came back, and I could log in and view logs. After a few minutes it happened again: dead, reboot, repeat. The smell of trouble. I quickly grabbed the logs before it froze once again. There (edited):

Jan  2 19:12:45 srv4 kernel: [  551.326415] NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
Jan  2 19:12:45 srv4 kernel: [  551.327507] Pid: 0, comm: swapper/0 Not tainted 3.2.0-4-amd64 #1 Debian 3.2.73-2+deb7u1
Jan  2 19:12:45 srv4 kernel: [  551.327549] Call Trace:
Jan  2 19:12:45 srv4 kernel: [  551.327574]  <IRQ>  [<ffffffff81046ded>] ? warn_slowpath_common+0x78/0x8c

A kernel stack trace, don’t you love to see that on production servers… At least that gave me something to google for, but the results were discouraging. It looks like an old kernel 2.6.x bug from around 2010-2012 that hit users of various Linux distributions: basically an ethernet driver crash cutting off connectivity.

But: Debian 7.9 has kernel 3.2.0, just like 7.8 before which did not show this problem. How the hell did this issue resurface? Maybe because Xen is involved. And no one else got hit by that?

Anyway, as $CUSTOMER tends to get unhappy when his servers are no longer reachable, I needed a quick solution or at least a workaround. Passing certain kernel boot parameters might be an option.

The workaround for now is to disable gigabit autonegotation for this interface and force it to 100Mb/s full-duplex using this command:

ethtool -s eth0 speed 100 duplex full autoneg off

This is a workaround, not a fix. But it does prevent the machine from crashing after a few minutes. Need to investigate further.

To me, this looks like a regression in Debian 7.9.

Update Jan 3 2016

Some more googling for NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out turned up various bug tracker reports from Debian, Ubuntu, Red Hat and others, the earliest from 2008, the latest from 2015. Affected network drivers were r8169, e1000 and bnx. Several hypotheses and suggested fixes were posted, including:

  • passing kernel boot options such as pcie_aspm=off (didn’t work for everyone)
  • try a newer kernel
  • try to compile driver from source
  • Jumbo Frames >1500 may be involved (not here)
  • Xen 4.1 may be involved (my theory)

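For reference, a kernel boot option such as pcie_aspm=off would be set on Debian 7 via /etc/default/grub. This is only a sketch of what the first suggestion would involve; it was not applied on this server:

```shell
# /etc/default/grub (excerpt); sketch only, not applied here
GRUB_CMDLINE_LINUX="pcie_aspm=off"
# afterwards, as root: update-grub && reboot
```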
Since this is a production server, I will certainly not be playing reboot games with kernel options that may or may not work. I won’t compile custom kernels or drivers either. I made the ethtool call described above permanent with these settings in /etc/network/interfaces:

auto  eth0
iface eth0 inet static
  address   x.x.x.x
  netmask   x.x.x.x
  gateway   x.x.x.x
  up sleep 5; ethtool -s eth0 speed 100 duplex full autoneg off