Increasing TCP’s initial congestion window

It’s been a while since TCP’s initial congestion window was last increased. Recently ICWND10 – raising the initial window to 10 segments – has been proposed as an IETF draft by Google. But how does it work in the real world and what are the implications for HTTP latency – the main reason for considering ICWND10?

In fact a larger initial window has been known and used for years on web proxies for satellite connections, although RFC 2488 (Enhancing TCP Over Satellite Channels using Standard Mechanisms) recommends a somewhat lower value. Top content providers pioneered its use in high-performance web serving, and there’s a good chance that it will become a standard.

TCP uses two main window-based mechanisms: the receive window and the congestion window. The former is the receiver-side limit on how much data it is willing to accept; the latter is the sender-side limit on how much unacknowledged data may be in flight. At any moment the sender can have at most min(cwnd, rwnd) bytes outstanding before it has to wait for acknowledgements.
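
In RFC 5681 terms, that limit can be written as a one-line helper (a sketch; the function name is mine):

/* A sender may have at most min(cwnd, rwnd) bytes of
 * unacknowledged data in flight at any time. */
static unsigned int usable_window(unsigned int cwnd, unsigned int rwnd)
{
        return cwnd < rwnd ? cwnd : rwnd;
}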

In a perfect world (no packet loss, orderly arrival of segments, compliant devices, etc.) increasing the ICWND lowers the initial latency – fewer round trips are required to transfer the same amount of data.

The congestion window is best expressed in multiples of the MSS. RFC 3390 defines the allowed initial cwnd as:

min (4*MSS, max (2*MSS, 4380 bytes))

Hence the standard IW is 3 segments on Ethernet-based networks with a 1500-byte MTU: the MSS is 1500 - IP header (20) - TCP header (20) = 1460 bytes (without TCP options).
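
Plugging in that MSS shows where the 3 comes from: min(4*1460, max(2*1460, 4380)) = 4380 bytes, i.e. three full segments. A quick sketch of the formula (the helper is mine, not taken from any stack):

#include <stdio.h>

/* RFC 3390 initial window in bytes: min(4*MSS, max(2*MSS, 4380)) */
static unsigned int rfc3390_iw_bytes(unsigned int mss)
{
        unsigned int lower = (2 * mss > 4380) ? 2 * mss : 4380;
        unsigned int upper = 4 * mss;

        return (upper < lower) ? upper : lower;
}

int main(void)
{
        unsigned int mss = 1460;        /* Ethernet, MTU 1500, no TCP options */
        unsigned int iw = rfc3390_iw_bytes(mss);

        /* prints: IW = 4380 bytes = 3 segments */
        printf("IW = %u bytes = %u segments\n", iw, iw / mss);
        return 0;
}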

A related parameter is ssthresh (the slow start threshold). RFC 5681 states that slow start is used while cwnd < ssthresh and congestion avoidance once cwnd > ssthresh; when the two are equal, the sender may use either. The default congestion avoidance algorithm in Linux 2.6.19+, CUBIC, ships with its initial_ssthresh parameter set to 0 (see /sys/module/tcp_cubic/parameters/initial_ssthresh), which the kernel treats as “don’t override” – ssthresh stays effectively infinite, so a new connection starts in slow start – unless an ssthresh metric cached from a previous connection to the same destination overrides it:

# ip route show 10.0.0.1 table cache
10.0.0.1 from 192.168.0.1 via 192.168.0.2 dev eth0
cache mtu 1500 rtt 685ms rttvar 320ms ssthresh 12 cwnd 11 advmss 1460 hoplimit 64 initcwnd 10

Metric caching may be disabled: /proc/sys/net/ipv4/tcp_no_metrics_save. Slow start is also used after a connection has been idle, unless this behavior is disabled in /proc/sys/net/ipv4/tcp_slow_start_after_idle.

During congestion avoidance, cwnd is incremented according to RFC 5681 section 3.1: the recommended increment is cwnd += SMSS*SMSS/cwnd per ACK (roughly one SMSS per RTT), where SMSS is the sender-side MSS; the cwnd += min(N, SMSS) increment, with N being the number of newly acknowledged bytes, is the slow start rule. CUBIC does not use the ACK-clocked congestion avoidance increment. Instead, the window follows a cubic function of the time elapsed since the last congestion event; it does not rely on the ACKed byte count, allowing the window to grow at the same rate for low- and high-latency flows.
See http://netsrv.csc.ncsu.edu/export/cubic_a_new_tcp_2008.pdf.
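
For illustration, here is a rough sketch of that growth function; the function name, units and constants are ballpark values rather than anything lifted from the kernel:

#include <math.h>

/* CUBIC window growth (Ha, Rhee, Xu 2008):  W(t) = C*(t - K)^3 + w_max
 * w_max - window (in MSS) just before the last congestion event
 * t     - seconds elapsed since that event
 * beta  - fraction of w_max kept after a loss (~0.7 in current stacks)
 * K     - time at which W(t) climbs back to w_max, chosen so that
 *         W(0) == beta * w_max, i.e. K = cbrt(w_max * (1 - beta) / C)
 */
static double cubic_window(double t, double w_max)
{
        const double beta = 0.7;
        const double C = 0.4;
        double K = cbrt(w_max * (1.0 - beta) / C);

        return C * pow(t - K, 3.0) + w_max;
}

Note that the ACKed byte count never appears – only the time since the last loss does.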

Let’s take a look at how IW10 works in practice.

A word of warning: don’t change any TCP/IP tunables if you don’t know what you’re doing. TCP/IP stacks in major OSes auto-tune and adapt well enough for most scenarios.

Understand the implications of your changes. TCP is a complex protocol; its specification is spread over dozens of RFCs. Every modification affects the behavior of several other mechanisms – e.g. increasing the ICWND increases the burstiness of your traffic.

SOHO routers may not keep up with 5+ bursty TCP flows with IW10 on a broadband link – and IW10 performs significantly worse than IW3 under high packet loss.

Several factors other than OS tunables may affect flow behavior:

  • TCP offload features may alter segmentation – details depend on the NIC/driver combination.

  • Nagle’s algorithm, i.e. data buffering. While previously transmitted data remains unacknowledged, the algorithm holds back small (sub-MSS) segments instead of sending them immediately. This behavior was fine for interactive sessions on slow links, but it introduces significant latency for HTTP traffic. It may be disabled at the application level (TCP_NODELAY socket option).

  • TCP_CORK socket option (Linux-specific) inhibits transmission of partial (< MSS) segments. The output is uncorked (flushed) either by the application via setsockopt() or after a 200 ms ceiling. Both TCP_NODELAY and TCP_CORK are shown in the sketch below.

  • packet loss and/or segments arriving out of order (see RFC 2018 TCP Selective Acknowledgment Options)

  • writev() vs. sendfile() – YMMV, but on Linux, with Nagle and cork switched off, writev tends to send fewer and larger segments, which is great for TSO (see below), while sendfile prefers smaller segments.

  • bandwidth-delay product (BDP) – expressed as bandwidth*RTT. It is the maximum amount of data in transit on a path:

    10 Mbit/s (≈ 1.25 MB/s), 10 ms RTT: 1.25 MB/s × 0.01 s = 12.5 kB
    10 Mbit/s, 300 ms RTT: 1.25 MB/s × 0.3 s = 375 kB

    Hence, increasing the ICWND won’t help much on low-latency networks, as there can only be BDP bytes of data on a path at any given moment.

Don’t expect to see any improvement when testing on your LAN. However, on high-bandwidth, high-latency networks (LFNs) the difference may be substantial.
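
To make the Nagle and cork items above concrete, here is a minimal sketch of toggling both options on an already connected Linux socket (error handling omitted; the function name is mine):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

static void prepare_http_socket(int fd)
{
        int on = 1, off = 0;

        /* TCP_NODELAY: send data as soon as it is written instead of
         * waiting for previously transmitted data to be acknowledged. */
        setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));

        /* TCP_CORK: hold back partial (< MSS) segments while the
         * response is being assembled. */
        setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));

        /* ... write()/writev()/sendfile() the headers and body here ... */

        /* Uncork: flush whatever is still buffered (the kernel would do
         * it by itself after the 200 ms ceiling mentioned above). */
        setsockopt(fd, IPPROTO_TCP, TCP_CORK, &off, sizeof(off));
}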

Another problem is that the initial receive window is a bit on the low side on Linux. The code is quite self-explanatory:

net/ipv4/tcp_output.c

        /* Set initial window to value enough for senders,
         * following RFC2414. Senders, not following this RFC,
         * will be satisfied with 2.
         */
        if (mss > (1 << *rcv_wscale)) {
                int init_cwnd = 4;
                if (mss > 1460 * 3)
                        init_cwnd = 2;
                else if (mss > 1460)
                        init_cwnd = 3;
                if (*rcv_wnd > init_cwnd * mss)
                        *rcv_wnd = init_cwnd * mss;
        }

So by default, with a 1460-byte MSS, neither branch of the if/else chain raises or lowers init_cwnd, it stays at 4, and the initial receive window is clamped to 4*MSS = 5840 bytes.

Linux 2.6.33+ allows setting a custom IRWND. On Windows, the initial window is 65k by default, without window scaling and timestamps (which may be switched on by setting Tcp1323Opts to 3 in HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters). On Solaris, window size and scaling depend on the tcp_recv_hiwat tunable.

As this is a demonstration, not a benchmark, I made several assumptions that won’t always hold in the real world:

  • no packet loss
  • a single flow
  • packets arriving in order (selective ACKs not required)
  • paths not saturated.

That said, it’s not a lab demo; data was collected from real paths on the Internet.

I used three paths:

  • avg RTT 7 ms (~350 km intercity, Ethernet)
  • avg RTT 33 ms (~1500 km international, Ethernet)
  • avg RTT 255 ms (3G – UMTS)

The MSS on all paths was 1448 (MTU 1500, IP header: 20 octets, TCP header: 20, TCP options: 12). All flows were HTTP.

The “sender” was Linux (2.6.32) running CUBIC; the “receiver” was Windows XP (65k receive window). At first I’d had some reservations about instrumentation accuracy on Windows, especially hi-res timing, but I verified the results on Solaris.

All values on the graphs are averages from 10 samples to account for jitter. Each data point is a TCP segment, received or transmitted – depending on the graph type. “Data transferred” values are, of course, absolute. “default” is initcwnd=3, “iw10” is initcwnd=10.

Let’s take a look at the “sender” first:
[graph: 10k-7ms-w-xmit]
With iw10 all 10 kilobytes are sent just after the handshake:

client                          server
-----------------------------------------------
SYN                      →
                         ←      SYN+ACK
ACK, [HTTP request]      →
                         ←      ACK, [HTTP response]

Wait a moment… didn’t I just say the MSS was 1448 and each data point was a TCP segment? I did. Then how come over 10000 bytes were sent in one segment?
I also said it was going to be a real-world demo. This is LSO (large send offload) in effect, or – more accurately – TSO (TCP segmentation offload). The driver communicates the actual MSS to the NIC, the TCP stack on the host OS hands over large segments, and the NIC splits them into MSS-sized segments on the wire – the trace above captures them before the split, which is why single “segments” of over 10000 bytes show up. LSO tremendously reduces the interrupt rate and the number of bus transactions to the NIC. Two things to beware of:

  • for LSO to work, a large amount of data should be available for transmission at a given point in time. This is best assured by sending with a gather function like writev() – see the sketch after this list. It’s an elegant solution for serving small objects, as it allows sending the app-layer header and the content with a single syscall.

  • LSO is a form of bufferbloat, so it does affect RTTs.
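
A minimal sketch of that gather approach – the function and its arguments are illustrative, not taken from any particular server:

#include <sys/types.h>
#include <sys/uio.h>

/* Send an HTTP header and a small body with a single syscall, so the
 * TCP stack sees the whole object at once and TSO gets large buffers
 * to slice up. A real server would loop on short writes. */
static ssize_t send_small_object(int fd, const char *hdr, size_t hdr_len,
                                 const char *body, size_t body_len)
{
        struct iovec iov[2] = {
                { .iov_base = (void *)hdr,  .iov_len = hdr_len },
                { .iov_base = (void *)body, .iov_len = body_len },
        };

        return writev(fd, iov, 2);
}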

I decided to keep TSO on – in the real world, almost all recent Gigabit Ethernet NICs (like the ubiquitous bnx2 – Broadcom NetXtreme II, found on most HP/Dell x86 servers) and all 10GE NICs provide TSO and it’s on by default.

[graph: 10k-7ms-w-recv]
Not much difference on the receiving side.

Let’s go to 50 kB:
[graph: 50k-7ms-w-xmit]
As expected, on a low-latency flow the difference is not as profound with a 50 kB object – even on the sender. The RTT is relatively short and the sending side doesn’t have to wait long for ACKs from the client.

This is how it looks on the receiving side:
[graph: 50k-7ms-w-recv]
The difference is next to none. Why? Same reason – short RTT, fast ACKs.

Let’s move to data from the 33 ms RTT flows:
[graph: 10k-33ms-w-xmit]
[graph: 10k-33ms-w-recv]
Over 50 ms saved (at the receiver) on a 10 kB object.

For the 50 kB object:
[graph: 50k-33ms-w-xmit]
[graph: 50k-33ms-w-recv]
Almost an 80 ms difference.

Now, testing over-the-air on 3G/UMTS is a bit tricky. Telcos love to force customers through transparent HTTP proxies/caches and other assorted nastiness to paint the grass green and boost TCP performance. You need to find the right combination of source/destination ports, TCP options, etc. to get around all that.

As expected, jitter is extremely high on 3G, making the 10 kB object graph pretty useless:
[graph: 10k-255ms-w-recv]
The 50k graph, however, is accurate enough and shows significant improvement:
[graph: 50k-255ms-w-recv]
And finally, a comparison between writev() and sendfile():

writev:
[graph: 50k-33ms-w-xmit]
sendfile:
[graph: 50k-33ms-s-xmit]
The difference, as I mentioned above, is segmentation.

A few things to remember:

  • this was HTTP traffic. HTTP is not the only protocol in the Internet and small object serving is not the only possible workload – investigate how icwnd10 affects all your services.

  • again, raising the icwnd increases flow burstiness – if/when everyone starts using it, Internet traffic will be more bursty

  • TCP tuning is one of the last things you should be doing if your CSS, JS, HTML, etc. objects are not optimized for size and speed or too many files have to be loaded to render a page.

  • there’s an ongoing discussion about IW10 on the TCPM list with some very fair points.

