There is a design flaw in Broadcom’s “bnx2” NetXtreme II BCM5709 PCI Express NICs (not to be confused with the older PCI-X version, BCM5708) and the BCM56314 and BCM56820 switch-on-a-chip OEM Ethernet switches.
These NICs are extremely popular: Dell and HP use them throughout their PowerEdge and ProLiant standalone and blade server ranges. The OEM switching chips are not that common, but they are found e.g. in Dell’s high-end top-of-rack PowerConnect 6248 Gigabit Ethernet and PowerConnect 8024 10G switches.
The flaw is in the flow control (802.3x) implementation and results in a switch-wide or network-wide loss of connectivity. As is common in major failures, there is more than one underlying cause.
The problem begins when a host with a BCM5709 crashes or otherwise stops accepting data from the NIC’s RX buffer. The NIC is not quiesced on host system crash and when its buffer fills up, it starts sending flow control PAUSE frames. Arguably, this is a design flaw in itself – the situation is handled properly e.g. by the BCM5708 NIC.
Now, this event alone would not cause a network-wide disruption. PAUSE frames are transmitted point-to-point – between a NIC and a switch port. When a switch receives a PAUSE frame on a port, it simply buffers or drops frames destined for that port. Under no circumstances should a switch propagate PAUSE frames, or generate and transmit them by itself. The standard does actually allow a switch to send them, to accommodate blocking switching matrices, but it’s insanely, criminally brain dead. No respectable network equipment vendor allows the switch to send PAUSE frames.
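For reference, this is roughly what an 802.3x PAUSE frame looks like on the wire. The sketch below only builds the bytes and does not send anything. The function names are mine. The constants (reserved multicast destination 01:80:C2:00:00:01, EtherType 0x8808, opcode 0x0001, a 16-bit pause time counted in units of 512 bit times) come from the standard.

```python
import struct

# Reserved multicast address used for MAC Control frames; switches must not forward it.
PAUSE_DST_MAC = bytes.fromhex("0180c2000001")
MAC_CONTROL_ETHERTYPE = 0x8808
PAUSE_OPCODE = 0x0001
MIN_FRAME_NO_FCS = 60  # minimum Ethernet frame size, excluding the 4-byte FCS


def build_pause_frame(src_mac: bytes, quanta: int) -> bytes:
    """Build a PAUSE frame (without FCS). Illustrative helper, not from any library."""
    if len(src_mac) != 6:
        raise ValueError("src_mac must be exactly 6 bytes")
    if not 0 <= quanta <= 0xFFFF:
        raise ValueError("quanta must fit in 16 bits")
    header = PAUSE_DST_MAC + src_mac + struct.pack(
        "!HHH", MAC_CONTROL_ETHERTYPE, PAUSE_OPCODE, quanta
    )
    # The rest of the frame is zero padding up to the minimum frame size.
    return header.ljust(MIN_FRAME_NO_FCS, b"\x00")


def pause_duration_seconds(quanta: int, link_bps: int) -> float:
    """One pause quantum is 512 bit times at the link speed."""
    return quanta * 512 / link_bps
```

Even the maximum pause time of 0xFFFF quanta lasts only about 33.6 ms on a Gigabit link, so a port stays blocked only as long as the sender keeps repeating the frame. A quanta value of 0 tells the peer to resume immediately.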
However, this is exactly what the BCM56314 and BCM56820 switches do. Not only is flow control enabled by default (since software 3.x for the BCM56314 and since the first release for the BCM56820), it is configured to send PAUSE frames by default as well! This behavior cannot be disabled on its own – flow control has to be switched off globally, as no per-port setting is available.
When the culprit host’s NIC starts sending PAUSE frames, the switch also sends PAUSE frames to every port trying to communicate with that NIC, causing a switch-wide loss of connectivity. It gets worse if flow control is left enabled on inter-switch connections – as you would expect, all ports on all switches eventually cease to forward frames.
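The cascade can be illustrated with a toy model. This is only a sketch under simplifying assumptions of my own: traffic is a fixed list of paths, every switch sends PAUSE upstream as soon as one of its egress links is paused, and shared buffer exhaustion (which in practice makes things worse) is ignored. The host and switch names are made up.

```python
def paused_links(flows, crashed_host, no_fc_links=frozenset()):
    """Return the directed links (sender, receiver) on which the sender ends up paused.

    flows        -- list of paths, each a list of node names from source to destination
    crashed_host -- host whose NIC floods its switch port with PAUSE frames
    no_fc_links  -- directed links on which flow control is disabled
    """
    paused = set()
    # Step 1: the crashed host's NIC pauses the switch port feeding it.
    for path in flows:
        for sender, receiver in zip(path, path[1:]):
            if receiver == crashed_host:
                paused.add((sender, receiver))
    # Step 2: a node that cannot drain a paused egress link pauses whatever feeds it.
    changed = True
    while changed:
        changed = False
        for path in flows:
            for up, node, down in zip(path, path[1:], path[2:]):
                upstream = (up, node)
                if ((node, down) in paused and upstream not in paused
                        and upstream not in no_fc_links):
                    paused.add(upstream)
                    changed = True
    return paused


def blocked_hosts(flows, paused):
    """Source hosts whose own uplink is paused; PAUSE is per link, so they can send nothing."""
    return {path[0] for path in flows if (path[0], path[1]) in paused}


# Hypothetical topology: a and b on sw1; c, d and the crashed host h on sw2.
FLOWS = [
    ["a", "sw1", "sw2", "h"],
    ["b", "sw1", "sw2", "c"],  # never talks to h, but shares the inter-switch link
    ["d", "sw2", "h"],
    ["c", "sw2", "sw1", "a"],
]
```

With flow control enabled everywhere, `blocked_hosts(FLOWS, paused_links(FLOWS, "h"))` contains `b` even though `b` never sends anything to `h`. Passing `no_fc_links={("sw1", "sw2")}` confines the damage to `d`, the only host on the same switch that talks to `h`.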
In my experience, Broadcom (and Dell) have been notoriously difficult to work with when it comes to firmware. There is practically no chance they will do anything about this (namely, fix BCM5709’s firmware, disable flow control by default and introduce the possibility to separately disable TX/RX flow control – per-port or at least globally).
Below are some guidelines on how to avoid this issue:
Disable flow control on all switches. There is a caveat here, though: some vendors (e.g. of enterprise iSCSI solutions) actually require – for reasons unknown – RX and TX flow control to be enabled. Other than that, most networks do not need flow control at all.
If you can’t disable flow control on all switches, at least disable it on your core switches. If you use it in the core, you’re Doing It Wrong™.
If you can’t disable flow control on the switches at all, disable TX flow control (TX PAUSE) on all BCM5709 NICs. However, I found these settings to be utterly unreliable: disabling TX flow control on some BCM5709s resulted in a complete loss of connectivity with some, but not all (!), BCM56314 and BCM56820-based switches.
Do not use BCM56314 and BCM56820-based OEM switches (e.g. Dell PowerConnect 6248, M8024, 8024F). Get your switches from a respectable network hardware vendor – there are quite a few these days. It’s all about support: switches are core business for networking vendors. They have in-house teams familiar with their specific implementation and they’re able to quickly provide software patches if required. With OEM switches you’re pretty much on your own. Chances are there’s no one in your vendor’s team familiar with the actual software source code, since only the original manufacturer has access.
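On Linux hosts, the NIC-side pause settings can be inspected with `ethtool -a <interface>` and changed with `ethtool -A <interface> tx off`. Below is a small sketch for auditing a fleet of hosts. The helper names are mine, and the parser assumes the usual "Name: on/off" layout printed by common ethtool versions, which may differ on your system. Keep in mind the caveat above: the TX setting did not behave reliably for me against these switches.

```python
import subprocess


def parse_pause_settings(ethtool_output: str) -> dict:
    """Parse `ethtool -a` output into e.g. {'autonegotiate': True, 'rx': True, 'tx': False}.

    Assumes lines of the form 'RX:  on'; anything else (such as the
    'Pause parameters for eth0:' header) is ignored.
    """
    settings = {}
    for line in ethtool_output.splitlines():
        key, sep, value = line.partition(":")
        value = value.strip().lower()
        if sep and value in ("on", "off"):
            settings[key.strip().lower()] = (value == "on")
    return settings


def read_pause_settings(interface: str) -> dict:
    """Run `ethtool -a` for one interface (usually needs root) and parse the result."""
    result = subprocess.run(
        ["ethtool", "-a", interface],
        capture_output=True, text=True, check=True,
    )
    return parse_pause_settings(result.stdout)


def disable_tx_pause_command(interface: str) -> list:
    """Command line that stops the NIC from sending PAUSE frames."""
    return ["ethtool", "-A", interface, "tx", "off"]
```

For example, `read_pause_settings("eth0").get("tx")` tells you whether the NIC is currently allowed to send PAUSE frames. The command returned by `disable_tx_pause_command` can be passed to `subprocess.run` once you have verified on a test host that it does not cut connectivity.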
UPDATE (2011-10-27): Sven Ulland of Opera Software has just posted some excellent research on this issue on the linux-poweredge mailing list.
It seems the problem has been noticed by Broadcom and at least partially addressed in the 4.x switch firmware. This is an excerpt from the 4.x release notes for the 80xx series:
Asymmetric flow control is implemented for the PC8024X, PCM8024, PCM6348, PC70XX, and PCM8024-k switches. The switch does not generate pause frames when congested. It will honor pause frames as per industry standards.
I have yet to check if it actually works. Sadly, 62xx switches have been omitted – a grave mistake since they are much more abundant than the 80xx series.
NIC-wise, Sven has confirmed that the bnx2 driver v2.0.18c (distributed with firmware 6.0.17 for the 5709) fixes the issue (i.e. the NICs no longer send PAUSE frame floods on host system crashes). The driver is included in the mainline kernel 2.6.37.
Unfortunately, I can still easily reproduce the problem on vanilla 2.6.38, but I do recommend that you try it yourself: it seems to be network dependent.