So I thought I’d use this quiet weekend to install the pending Debian 7.9 updates on a couple of servers. Worked fine, except for the last and most important one. Of course.
We’re talking about 2 physical hosts in a remote datacenter. One of them running Xen and 5 VMs, the other one being a warm-spare backup for the first. All servers (2 hosts, 5 VMs) run Debian 7 “wheezy” which was installed at 7.0 and got upgraded all the way to 7.8 without problems.
I was already composing an “update done” email when I noticed that the console to that last server (the host running the VMs) was frozen. No pings to the system, and the VMs were not reachable either. Kicked out of the network, had to trigger Ctrl-Alt-Del remotely as remote access was cut off.
The machine came back, I could login and view logs. After a few minutes it happened again: dead, reboot, repeat. The smell of trouble. Quickly grabbed the logs before it froze once again. There (edited):
Jan 2 19:12:45 srv4 kernel: [ 551.326415] NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
Jan 2 19:12:45 srv4 kernel: [ 551.327507] Pid: 0, comm: swapper/0 Not tainted 3.2.0-4-amd64 #1 Debian 3.2.73-2+deb7u1
Jan 2 19:12:45 srv4 kernel: [ 551.327549] Call Trace:
Jan 2 19:12:45 srv4 kernel: [ 551.327574] <IRQ> [<ffffffff81046ded>] ? warn_slowpath_common+0x78/0x8c
[...]
Kernel stack trace, don’t you love to see that on production servers.. At least that gave me something to google for, but the results were discouraging. Looks like an old kernel 2.6.x bug from around 2010-2012 that hit users of various Linux distributions. Basically an ethernet driver crash cutting off connectivity.
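To check how often a host has logged this timeout, something along these lines should do. It is only a sketch: count_tx_timeouts is a made-up helper name, and the log path in the usage comment assumes the default Debian rsyslog setup where kernel messages end up in /var/log/kern.log.

```shell
# Hypothetical helper (not part of any package): counts NETDEV WATCHDOG
# transmit timeouts for one interface in kernel log text read from stdin.
count_tx_timeouts() {
    iface=$1
    # grep -c prints the number of matching lines; it exits non-zero when
    # that number is 0, so "|| true" keeps the function's exit status clean.
    grep -c "NETDEV WATCHDOG: ${iface} (.*): transmit queue [0-9]* timed out" || true
}

# Usage on the affected host (log path is an assumption):
#   count_tx_timeouts eth0 < /var/log/kern.log
```

The driver name in parentheses is matched loosely on purpose, so the same check also catches the timeout for other drivers.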
But: Debian 7.9 has kernel 3.2.0, just like 7.8 before which did not show this problem. How the hell did this issue resurface? Maybe because Xen is involved. And no one else got hit by that?
Anyway, as $CUSTOMER tends to get unhappy if his servers are not reachable anymore, I needed a quick solution or at least a workaround. Passing certain kernel boot parameters might be an option.
The workaround for now is to disable gigabit autonegotiation for this interface and force it to 100Mb/s full-duplex using this command:
ethtool -s eth0 speed 100 duplex full autoneg off
This is a workaround, not a fix. But it DOES seem to prevent the machine from crashing after a few minutes. Need to investigate.
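To confirm that the forced settings are actually in effect, the output of a plain "ethtool eth0" can be checked for the Speed, Duplex and Auto-negotiation lines. The following is a minimal sketch; link_is_forced_100 is a hypothetical helper name, and it assumes the usual ethtool output format.

```shell
# Hypothetical helper: reads `ethtool <iface>` output on stdin and succeeds
# only if the link reports 100Mb/s, full duplex and autonegotiation off.
link_is_forced_100() {
    settings=$(cat)
    printf '%s\n' "$settings" | grep -q 'Speed: 100Mb/s' &&
    printf '%s\n' "$settings" | grep -q 'Duplex: Full' &&
    printf '%s\n' "$settings" | grep -q 'Auto-negotiation: off'
}

# Usage on the affected host:
#   ethtool eth0 | link_is_forced_100 && echo "workaround active"
```

A check like this is handy after a reboot, when it is not obvious whether the interface came up with the forced settings or fell back to gigabit autonegotiation.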
To me, this looks like a regression in Debian 7.9.
Update Jan 3 2016
Some more googling for “NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out” returned various bugtracker messages from Debian, Ubuntu, RedHat and others. Earliest from 2008, latest from 2015. Affected network cards were r8169, e1000 and bnx. Several hypotheses and suggested fixes were posted, including:
- passing kernel boot options such as pcie_aspm=off (didn’t work for everyone)
- try a newer kernel
- try to compile driver from source
- Jumbo Frames >1500 may be involved (not here)
- Xen 4.1 may be involved (my theory)
Since this is a production server, I will certainly not be playing reboot games with kernel options that may or may not work. I won’t compile custom kernels or drivers either. I made the ethtool call described above permanent with these settings in the network interface configuration:
auto eth0
iface eth0 inet static
    address x.x.x.x
    netmask x.x.x.x
    gateway x.x.x.x
    up sleep 5; ethtool -s eth0 speed 100 duplex full autoneg off