Flow control flaw in Broadcom BCM5709 NICs and BCM56xxx switches

There is a design flaw in Broadcom’s “bnx2” NetXtreme II BCM5709 PCI Express NICs (not to be confused with the older PCI-X version, BCM5708) and the BCM56314 and BCM56820 switch-on-a-chip OEM Ethernet switches.

These NICs are extremely popular: Dell and HP use them throughout their PowerEdge and ProLiant standalone and blade server ranges. The OEM switching chips are less common, but they are found, for example, in Dell’s high-end top-of-the-rack PowerConnect 6248 Gigabit Ethernet and PowerConnect 8024 10G switches.

The flaw is in the flow control (802.3x) implementation and results in a switch-wide or network-wide loss of connectivity. As is common in major failures, there is more than one underlying cause.

The problem begins when a host with a BCM5709 crashes or otherwise stops accepting data from the NIC’s RX buffer. The NIC is not quiesced on host system crash and when its buffer fills up, it starts sending flow control PAUSE frames. Arguably, this is a design flaw in itself – the situation is handled properly e.g. by the BCM5708 NIC.

Now, this event alone would not cause a network-wide disruption. PAUSE frames are transmitted point-to-point – between a NIC and a switch port. When a switch receives a PAUSE frame on a port, it simply buffers or drops frames destined for that port. Under no circumstances should a switch propagate PAUSE frames, or generate and transmit them by itself. The standard actually allows this, to accommodate blocking switching matrices, but it’s insanely, criminally brain dead. No respectable network equipment vendor allows the switch to send PAUSE frames.
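
To make this concrete, here is a minimal Python sketch of what an 802.3x PAUSE frame looks like on the wire and how long a single frame can silence a link. The destination address, EtherType and opcode are the standard values; the function names and the example source MAC are made up for illustration.

```python
import struct

# Standard 802.3x MAC Control values.
PAUSE_DST_MAC = bytes.fromhex("0180c2000001")  # link-local multicast, not forwarded by compliant bridges
ETHERTYPE_MAC_CONTROL = 0x8808
OPCODE_PAUSE = 0x0001
MIN_FRAME_NO_FCS = 60  # 64-byte minimum Ethernet frame minus the 4-byte FCS


def build_pause_frame(src_mac: bytes, quanta: int) -> bytes:
    """Return a PAUSE frame (without FCS) asking the link peer to stop
    transmitting for `quanta` pause quanta (0 means "resume now")."""
    if len(src_mac) != 6:
        raise ValueError("src_mac must be 6 bytes")
    if not 0 <= quanta <= 0xFFFF:
        raise ValueError("quanta must fit in 16 bits")
    header = PAUSE_DST_MAC + src_mac + struct.pack("!H", ETHERTYPE_MAC_CONTROL)
    payload = struct.pack("!HH", OPCODE_PAUSE, quanta)
    # The rest of the frame is zero padding up to the minimum frame size.
    return (header + payload).ljust(MIN_FRAME_NO_FCS, b"\x00")


def pause_seconds(quanta: int, link_bps: int) -> float:
    """One pause quantum is 512 bit times at the link speed."""
    return quanta * 512 / link_bps


if __name__ == "__main__":
    frame = build_pause_frame(bytes.fromhex("001122334455"), 0xFFFF)  # hypothetical MAC
    print(frame.hex())
    print(pause_seconds(0xFFFF, 1_000_000_000))
```

At 1 Gb/s the maximum of 0xFFFF quanta works out to roughly 33.5 ms, so a link only stays silenced if the sender keeps refreshing the PAUSE. That is consistent with the floods of PAUSE frames described here.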

However, this is exactly what the BCM56314 and BCM56820 switches do. Not only is flow control enabled by default (since software 3.x for the BCM56314 and since the first release for the BCM56820), it is configured to send PAUSE frames by default as well! This behavior cannot be disabled on its own – flow control has to be switched off globally, as a per-port setting is not available.

When the culprit host’s NIC starts sending PAUSE frames, the switch also sends PAUSE frames to every port trying to communicate with that NIC, causing a switch-wide loss of connectivity. It gets worse if flow control is left enabled on inter-switch connections: as you would expect, all ports on all switches eventually cease to forward frames.
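
The cascade can be illustrated with a toy model in Python. This is only a sketch under simplifying assumptions, not a description of the actual switch firmware: flows are fixed device-to-device paths, buffers are ignored, and a switch is assumed to pause any neighbour that sends it traffic destined for an already paused link. All device names are hypothetical.

```python
def paused_links(flows, crashed):
    """Return the set of directed links (u, v) on which u has been told to
    stop sending to v. `flows` is a list of paths such as
    ["h1", "sw1", "crashed"]: hosts at the ends, switches in between."""
    paused = set()
    # The crashed host's NIC pauses whatever feeds it traffic.
    for path in flows:
        for u, v in zip(path, path[1:]):
            if v == crashed:
                paused.add((u, v))
    # A switch whose egress link is paused pauses the ingress links feeding it.
    changed = True
    while changed:
        changed = False
        for path in flows:
            for u, s, v in zip(path, path[1:], path[2:]):
                if (s, v) in paused and (u, s) not in paused:
                    paused.add((u, s))
                    changed = True
    return paused


def blocked_flows(flows, paused):
    """PAUSE stops all traffic on a link, so any flow crossing a paused
    link is blocked, whatever its destination."""
    return [p for p in flows if any(link in paused for link in zip(p, p[1:]))]


if __name__ == "__main__":
    flows = [
        ["h1", "sw1", "crashed"],
        ["h1", "sw1", "h2"],            # never touches the crashed host
        ["h3", "sw2", "sw1", "crashed"],
        ["h3", "sw2", "h4"],            # stays entirely on another switch
        ["h2", "sw1", "sw2", "h4"],
    ]
    paused = paused_links(flows, "crashed")
    for flow in blocked_flows(flows, paused):
        print("blocked:", " -> ".join(flow))
```

In this example h1 and h3 lose all connectivity, including the h3 to h4 flow that never leaves sw2, because the pause travels back over the inter-switch link. Only the h2 to h4 flow survives.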

In my experience, Broadcom (and Dell) have been notoriously difficult to work with when it comes to firmware. There is practically no chance they will do anything about this (namely, fix BCM5709’s firmware, disable flow control by default and introduce the possibility to separately disable TX/RX flow control – per-port or at least globally).

Below are some guidelines on how to avoid this issue:

  • Disable flow control on all switches. There is a caveat here: some vendors (e.g. of enterprise iSCSI solutions) actually require – for reasons unknown – RX and TX flow control to be enabled. Other than that, most networks do not need flow control at all.

  • If you can’t disable flow control on all switches, at least disable it on your core switches. If you use it in the core, you’re Doing It Wrong™.

  • If you can’t disable flow control on the switches at all, disable TX flow control (TX PAUSE) on all BCM5709 NICs. However, I found these settings to be utterly unreliable: disabling TX flow control on some BCM5709s resulted in a complete loss of connectivity with some, but not all (!), BCM56314 and BCM56820-based switches.

  • Do not use BCM56314 and BCM56820-based OEM switches (e.g. Dell PowerConnect 6248, M8024, 8024F). Get your switches from a respectable network hardware vendor – there are quite a few these days. It’s all about support: switches are core business for networking vendors. They have in-house teams familiar with their specific implementation and they’re able to quickly provide software patches if required. With OEM switches you’re pretty much on your own. Chances are there’s no one in your vendor’s team familiar with the actual software source code, since only the original manufacturer has access.
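
For the NIC-side workaround on Linux, the pause settings can be inspected and changed with ethtool (`ethtool -a` to show, `ethtool -A` to set). The following Python sketch wraps those two calls, assuming ethtool is installed and the script runs with sufficient privileges; the helper names and the interface name `eth0` are placeholders, and the parser assumes the usual "key: on/off" output layout.

```python
import subprocess
from typing import Dict, List


def pause_command(iface: str, rx: bool, tx: bool) -> List[str]:
    """Build the argv for `ethtool -A <iface> rx on|off tx on|off`."""
    return ["ethtool", "-A", iface,
            "rx", "on" if rx else "off",
            "tx", "on" if tx else "off"]


def parse_pause_output(text: str) -> Dict[str, bool]:
    """Parse `ethtool -a <iface>` output into {"RX": True, "TX": False, ...}.
    Lines that do not end in on/off (such as the header) are skipped."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        value = value.strip()
        if sep and value in ("on", "off"):
            settings[key.strip()] = (value == "on")
    return settings


def disable_tx_pause(iface: str) -> Dict[str, bool]:
    """Turn off PAUSE transmission on the NIC, keep RX on, and return the
    settings reported afterwards so they can be checked."""
    subprocess.run(pause_command(iface, rx=True, tx=False), check=True)
    shown = subprocess.run(["ethtool", "-a", iface], check=True,
                           capture_output=True, text=True)
    return parse_pause_output(shown.stdout)


if __name__ == "__main__":
    print(disable_tx_pause("eth0"))  # "eth0" is a placeholder interface name
```

Reading the settings back matters here because, as noted above, the result is hit and miss on the 5709s: verify both the reported state and actual connectivity after the change.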

UPDATE (2011-10-27): Sven Ulland of Opera Software has just posted some excellent research on this issue on the linux-poweredge mailing list.

It seems the problem has been noticed by Broadcom and at least partially addressed in the 4.x switch firmware. This is an excerpt from the 4.x release notes for the 80xx series:

Release 4.1.0.19
(…)
Asymmetric flow control is implemented for the PC8024X, PCM8024, PCM6348, PC70XX, and PCM8024-k switches. The switch does not generate pause frames when congested. It will honor pause frames as per industry standards.

I have yet to check if it actually works. Sadly, 62xx switches have been omitted – a grave mistake since they are much more abundant than the 80xx series.

NIC-wise, Sven has confirmed that the bnx2 driver v2.0.18c (distributed with firmware 6.0.17 for the 5709) fixes the issue (i.e. the NICs no longer send PAUSE frame floods on host system crashes). The driver is included in the mainline kernel 2.6.37.

Unfortunately, I can still easily reproduce the problem on vanilla 2.6.38, but I do recommend that you try it yourself: it seems to be network dependent.


5 Responses to Flow control flaw in Broadcom BCM5709 NICs and BCM56xxx switches

  1. sshoxx says:

    all servers from dell with this chipset are having these problems. is there a solution to fix it on the server itself (not on the switches)? I found a workaround for RHEL (ring buffer change – https://bugzilla.redhat.com/show_bug.cgi?id=640026) but not for windows based systems! this problem only exists on windows 2008 x64 and 2008 R2 systems – not on windows 2003 x86/x64 when teaming or bonding is configured (BACS).

    are there other tips or guidelines on how to avoid this issue?

    greetings sshoxx

  2. tgr says:

    Hi sshoxx,

    I’m afraid there’s currently no definitive server-side fix.

    Try disabling TX flow control if it’s possible on Windows – but be aware it’s hit and miss on the 5709s and it depends on the particular NIC and switch – I’ve been able to make it work e.g. with some 5709s connected to some 6248s but not others from the same batch.

    I’ve been recently told that the updated Linux driver/firmware may fix the issue on some networks – however, I can reproduce it even with that driver. Perhaps there’s also an updated version for Windows?

    I’d really recommend disabling flow control on the switches if you can. There’s usually no reason to keep it enabled.

  3. racreek says:

    Hello, any further updates on this issue? We have been experiencing it on our Windows network while Flow Control is turned on. We have turned it off for now and are planning to apply firmware updates on our blade servers which have Broadcom BCM95709C 10/100/1000BASET Quad Port NICs. Any word if the A116.4.5 update resolves the problem? Thank you!

  4. ag says:

    Can you explain the comment or point me to some reference material…

    > If you can’t disable flow control on all switches, at least disable it on your core switches. If you use it in the core, you’re Doing It Wrong.

  5. Neal says:

    This is a great article. In a heavy bandwidth environment or a high database environment this is crucial to follow. Another great article that I read has also helped us basically eliminate this problem altogether. Thank you for your assistance! Here is the link.

    http://blogs.msdn.com/b/psssql/archive/2010/02/21/tcp-offloading-again.aspx
