Increasing TCP’s initial congestion window

It’s been a while since TCP’s initial congestion window was last increased. Recently ICWND10 – increasing the window further to 10 – has been proposed as an IETF draft by Google. But how does it work in the real world and what are the implications for HTTP latency – the main reason for considering ICWND10?

In fact it’s been known and used for years on web proxies for satellite connections, although RFC 2488 (Enhancing TCP Over Satellite Channels using Standard Mechanisms) recommends a somewhat lower value. Top content providers pioneered its use in high-performance web serving and there’s a good chance that it will become a standard.

TCP uses two main window-based mechanisms: the receive window and the congestion window. The former is the receive-side limit. The latter is the number of segments that will be sent before waiting for an acknowledgement.

In a perfect world (no packet loss, orderly arrival of segments, compliant devices, etc.) increasing the ICWND lowers the initial latency – fewer round trips are required to transfer the same amount of data.
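To put rough numbers on that, here is a minimal sketch that counts how many flights (round trips) an idealized slow start needs to deliver an object. It assumes no loss, a window that doubles every round trip, an MSS of 1448 bytes and a receive window that never limits the sender. The function name and the object sizes are illustrative only.

```python
import math


def round_trips(object_bytes, initial_window, mss=1448):
    """Count flights needed to deliver object_bytes under idealized slow start.

    Assumptions: no loss, every segment is ACKed (so the window doubles
    each round trip), and the receive window is never the limit.
    """
    segments_left = math.ceil(object_bytes / mss)
    cwnd = initial_window
    trips = 0
    while segments_left > 0:
        segments_left -= cwnd  # send one full congestion window
        cwnd *= 2              # slow start: window doubles per round trip
        trips += 1
    return trips


for size_kb in (10, 50):
    size = size_kb * 1024
    print(size_kb, "kB:",
          round_trips(size, 3), "flights with IW3,",
          round_trips(size, 10), "flights with IW10")
```

Under these assumptions a 10 kB object takes two flights with IW3 and one with IW10, and a 50 kB object takes four versus three. Real stacks with delayed ACKs, loss or a small receive window will behave differently.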

The congestion window is best expressed in multiples of the MSS. RFC 3390 defines the allowed initial cwnd as:

min (4*MSS, max (2*MSS, 4380 bytes))

Hence the standard IW is 3 on Ethernet-based networks with a 1500 MTU: 1500 - IP header - TCP header = 1460 (without TCP options).
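The formula is easy to check for a few MSS values. The sketch below is a direct transcription of the RFC 3390 expression. Rounding the byte limit down to whole segments is my simplification, and the function names are made up for this example.

```python
def initial_window_bytes(mss):
    """RFC 3390 upper bound on the initial congestion window, in bytes."""
    return min(4 * mss, max(2 * mss, 4380))


def initial_window_segments(mss):
    """Whole MSS-sized segments that fit in the initial window (rounded down)."""
    return initial_window_bytes(mss) // mss


for mss in (536, 1448, 1460, 4000):
    print("MSS", mss, "->", initial_window_bytes(mss), "bytes,",
          initial_window_segments(mss), "segments")
```

For an MSS of 1460 this gives 4380 bytes, i.e. 3 segments. Small segments (536 bytes) get 4 and very large ones (4000 bytes) only 2.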

A related parameter is the ssthresh (slow start threshold). RFC 5681 states that slow start is used while cwnd < ssthresh and congestion avoidance while cwnd > ssthresh; when the two are equal, either may be used. The default congestion avoidance algorithm in Linux 2.6.19+, CUBIC, sets the initial ssthresh to 0 (see /sys/module/tcp_cubic/parameters/initial_ssthresh), so initially congestion avoidance is used, unless an ssthresh metric higher than cwnd is cached from a previous connection:

# ip route show table cache from via dev eth0
cache mtu 1500 rtt 685ms rttvar 320ms ssthresh 12 cwnd 11 advmss 1460 hoplimit 64 initcwnd 10

Metric caching may be disabled: /proc/sys/net/ipv4/tcp_no_metrics_save. Slow start is also used after a connection has been idle, unless this behavior is disabled in /proc/sys/net/ipv4/tcp_slow_start_after_idle.

During congestion avoidance, the cwnd is incremented according to RFC 5681 section 3.1. However, CUBIC does not follow the recommended formula cwnd += min(N, SMSS) or cwnd += SMSS*SMSS/cwnd where N is the number of ACKed bytes and SMSS is the sender-side MSS. Instead, the window is set according to a cubic function of time since the last congestion; it does not rely on the ACKed byte count, allowing the window to grow at the same rate for low- and high-latency flows.
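For illustration, here is a rough sketch of the cubic window function. The equations and the constants C = 0.4 and beta = 0.7 are taken from RFC 8312, not from this post, and the Linux implementation uses scaled integer arithmetic, so treat this as a model of the curve's shape rather than of the kernel code.

```python
C = 0.4      # scaling constant (RFC 8312 value, an assumption here)
BETA = 0.7   # multiplicative decrease factor (RFC 8312 value)


def cubic_k(w_max):
    """Seconds needed to grow back to w_max after a congestion event."""
    return (w_max * (1 - BETA) / C) ** (1.0 / 3.0)


def cubic_window(t, w_max):
    """Window (in segments) t seconds after the last congestion event.

    w_max is the window size just before that event.
    """
    return C * (t - cubic_k(w_max)) ** 3 + w_max


w_max = 100.0
for t in (0.0, 2.0, cubic_k(w_max), 6.0):
    print("t = %.2f s  window = %.1f segments" % (t, cubic_window(t, w_max)))
```

The window starts at beta * w_max right after the loss, flattens out as it approaches w_max, then probes upward again. The RTT does not appear anywhere in the function, which is the point made above.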

Let’s take a look at how IW10 works in practice.

A word of warning: don’t change any TCP/IP tunables if you don’t know what you’re doing. TCP/IP stacks in major OSes auto-tune and adapt well enough for most scenarios.

Understand the implications of your changes. TCP is a complex protocol and its specification is spread over dozens of RFCs. Every modification affects the behavior of several other mechanisms – e.g. increasing the ICWND increases the burstiness of your traffic.

SOHO routers may not keep up with 5+ bursty TCP flows with IW10 on a broadband link – and IW10 performs significantly worse than IW3 when the packet loss rate is high.

Several factors other than OS tunables may affect flow behavior:

  • TCP offload features may alter segmentation – details depend on the NIC/driver combination.

  • Nagle’s algorithm, i.e. data buffering. The algorithm holds back small (sub-MSS) segments until all previously transmitted data has been acknowledged. This behavior was fine for interactive sessions on slow links, but it introduces significant latency for HTTP traffic. It may be disabled at the application level (TCP_NODELAY socket option).

  • TCP_CORK socket option (Linux-specific) inhibits transmission of partial (< MSS) segments. The output is uncorked (flushed) either by the application via setsockopt() or after a 200 ms ceiling.

  • packet loss and/or segments arriving out of order (see RFC 2018 TCP Selective Acknowledgment Options)

  • writev() vs. sendfile() – YMMV, but on Linux, with Nagle and cork switched off, writev tends to send fewer and larger segments, which is great for TSO (see below), while sendfile prefers smaller segments.

  • bandwidth-delay product (BDP) – expressed as bandwidth*RTT. It is the maximum amount of data in transit on a path:

    10 Mbit/s, 10 ms RTT = 12.5 kB
    10 Mbit/s, 300 ms RTT = 375 kB

    Hence, increasing the ICWND won’t help much on low-latency networks, as there can only be BDP bytes of data on a path at any given moment.

Don’t expect to see any improvement when testing on your LAN. However, on high-bandwidth, high-latency networks (LFNs) the difference may be substantial.
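The BDP figures above can be reproduced with a few lines. This is a minimal sketch: the function name is made up, and the 1460-byte MSS used to convert bytes into segments is an assumption.

```python
def bdp_bytes(bandwidth_bits_per_s, rtt_ms):
    """Bandwidth-delay product in bytes (integer arithmetic)."""
    return bandwidth_bits_per_s * rtt_ms // 8000  # /8 bits per byte, /1000 ms per s


MSS = 1460  # assumed segment size

for rtt_ms in (10, 300):
    bdp = bdp_bytes(10_000_000, rtt_ms)
    print("10 Mbit/s, %d ms RTT: %d bytes, about %d full segments"
          % (rtt_ms, bdp, bdp // MSS))
```

At 10 ms the path holds 12500 bytes, roughly 8 segments, so an initial window of 10 already exceeds it. At 300 ms it holds 375000 bytes, so a larger initial window has plenty of room to help.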

Another problem is that the initial receive window is a bit on the low side on Linux. The code is quite self-explanatory:


        /* Set initial window to value enough for senders,
         * following RFC2414. Senders, not following this RFC,
         * will be satisfied with 2.
         */
        if (mss > (1 << *rcv_wscale)) {
                int init_cwnd = 4;
                if (mss > 1460 * 3)
                        init_cwnd = 2;
                else if (mss > 1460)
                        init_cwnd = 3;
                if (*rcv_wnd > init_cwnd * mss)
                        *rcv_wnd = init_cwnd * mss;
        }

so by default, on 1460 MSS flows, the IRWND is set to 4*MSS = 5840 bytes.

Linux 2.6.33+ allows setting a custom IRWND. On Windows, the initial window is at 65k by default, without window scaling and timestamps (which may be switched on by setting Tcp1323Opts to 3 in \HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters). On Solaris, window size and scaling depend on the tcp_recv_hiwat tunable.

As this is a demonstration, not a benchmark, I made several assumptions unlikely to be always true in the real world:

  • no packet loss
  • a single flow
  • packets arriving in order (selective ACKs not required)
  • paths not saturated.

That said, it’s not a lab demo; data was collected from real paths on the Internet.

I used three paths:

  • avg RTT 7 ms (~350 km intercity, Ethernet)
  • avg RTT 33 ms (~1500 km international, Ethernet)
  • avg RTT 255 ms (3G – UMTS)

The MSS on all paths was 1448 (MTU 1500, IP header: 20 octets, TCP header: 20, TCP options: 12). All flows were HTTP.

The “sender” was Linux (2.6.32) running CUBIC, the “receiver” was Windows XP (65k receive window). At first I’d had some reservations about instrumentation accuracy on Windows, especially hi-res timing, but I verified the results on Solaris.

All values on the graphs are averages from 10 samples to account for jitter. Each data point is a TCP segment, received or transmitted – depending on the graph type. “Data transferred” values are, of course, absolute. “default” is initcwnd=3, “iw10” is initcwnd=10.

Let’s take a look at the “sender” first:
With iw10 all 10 kilobytes are sent just after the handshake:

client                   server
SYN →                    SYN+ACK →
ACK, [HTTP request] →    ACK, [HTTP response] →

Wait a moment… didn’t I just say the MSS was 1448 and each data point was a TCP segment? I did. Then how come over 10000 bytes were sent in one segment?
I also said it was going to be a real-world demo. This is LSO (large send offload) in effect, or – more accurately – TSO (TCP segmentation offload). The driver communicates the actual MSS to the NIC and the TCP stack on the host OS sends large segments for the NIC to split into MSS-sized segments. LSO tremendously reduces the interrupt rate and the amount of bus transactions to the NIC. Two things to beware of:

  • for LSO to work, a large amount of data should be available for transmission at a point in time. This is best assured by sending with a gather function like writev(). It’s an elegant solution for serving small objects, as it lets you send the app layer header and the content with a single syscall.

  • LSO is a form of bufferbloat, so it does affect RTTs.
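The gather-write idea from the first point can be sketched as follows. This is an illustration, not the code used for the measurements: it uses Python's os.writev() on a local socket pair, and the header and body contents are made up.

```python
import os
import socket


def send_response(sock, header, body):
    """Hand header and body to the kernel with a single writev() call.

    Returns the number of bytes accepted. A real server would loop,
    because writev() may accept fewer bytes than requested.
    """
    return os.writev(sock.fileno(), [header, body])


# Demo on a local socket pair instead of a real TCP connection.
left, right = socket.socketpair()
header = b"HTTP/1.1 200 OK\r\nContent-Length: 5\r\n\r\n"
body = b"hello"
sent = send_response(left, header, body)
received = right.recv(4096)
left.close()
right.close()
print(sent, "bytes sent in one syscall")
```

Because both buffers reach the stack in one call, it can build full-sized segments (or one large TSO segment) instead of a small header segment followed by the content.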

I decided to keep TSO on – in the real world, almost all recent Gigabit Ethernet NICs (like the ubiquitous bnx2 – Broadcom NetXtreme II, found on most HP/Dell x86 servers) and all 10GE NICs provide TSO and it’s on by default.

Not much difference on the receiving side.

Let’s go to 50 kB:
As expected, on a low-latency flow the difference is not as profound with a 50 kB object – even on the sender. The RTT is relatively short and the sending side doesn’t have to wait long for ACKs from the client.

This is how it looks on the receiving side:
The difference is next to none. Why? Same reason – short RTT, fast ACKs.

Let’s move to data from the 33 ms RTT flows:
Over 50 ms saved (at the receiver) on a 10 kB object.

For the 50 kB object:
Almost an 80 ms difference.

Now, testing over-the-air on 3G/UMTS is a bit tricky. Telcos love to force the customers to use transparent HTTP proxies/caches and other assorted nastiness to paint the grass green and boost TCP performance. You need to find the right combination of source/destination ports, TCP options, etc. to get around all that.

As expected, jitter is extremely high on 3G, making the 10 kB object graph pretty useless:
The 50k graph, however, is accurate enough and shows significant improvement:
And finally, a comparison between writev() and sendfile():

The difference, as I mentioned above, is segmentation.

A few things to remember:

  • this was HTTP traffic. HTTP is not the only protocol on the Internet and small object serving is not the only possible workload – investigate how icwnd10 affects all your services.

  • again, raising the icwnd increases flow burstiness – if/when everyone starts using it, Internet traffic will be more bursty

  • TCP tuning is one of the last things you should be doing if your CSS, JS, HTML, etc. objects are not optimized for size and speed or too many files have to be loaded to render a page.

  • there’s an ongoing discussion about IW10 on the TCPM list with some very fair points.

Posted in IT | Tagged , | 1 Comment

The folly of process existence checking

One of the most common mistakes when setting up service monitoring (besides defining lots of unnecessary probes with low thresholds, constantly giving false positives) is checking if a process exists. Let’s say we serve Fubar for our customers. It’s serviced by two daemons, /usr/sbin/{fubard,fubard-spool} and perhaps requires cron.

Do the customers care if an instance of /usr/sbin/fubard is running on your system? No. They just wish to have Fubar ready and available and they aren’t keen on the innards of your setup.

Should you care if the /usr/sbin/fubard process is running? Only if you’re trying to solve a problem with the Fubar service.

Probes checking whether fubard, fubard-spool and cron are running are misleading to say the least. What if the fubard process exists but is frozen by a bug? What if crond exists, but got stuck on I/O? Or fubard-spool seems to be there, but actually is in Z-state?

What you should do is query services exactly as their clients would and check for valid output. Design your software with intrinsic support for instrumentation (probing). Even if it’s closed software or OS components you’re monitoring, there’s always a way to check if it’s actually working. In our example, you can monitor crond simply by setting up a job touching a file every minute and monitoring that file’s mtime in your probe.
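A probe for the crond example might look like the sketch below. It assumes a cron job touches a heartbeat file every minute. The file name, the 120-second tolerance and the function name are all hypothetical.

```python
import os
import tempfile
import time


def heartbeat_is_fresh(path, max_age_s=120, now=None):
    """Return True if the heartbeat file was modified within max_age_s seconds."""
    now = time.time() if now is None else now
    try:
        age = now - os.stat(path).st_mtime
    except OSError:
        return False  # a missing or unreadable file counts as a failure
    return age <= max_age_s


# Demo with a temporary file standing in for the cron-touched heartbeat.
with tempfile.TemporaryDirectory() as tmp:
    heartbeat = os.path.join(tmp, "cron-heartbeat")  # hypothetical name
    open(heartbeat, "w").close()
    fresh = heartbeat_is_fresh(heartbeat)
    ten_minutes_ago = time.time() - 600
    os.utime(heartbeat, (ten_minutes_ago, ten_minutes_ago))
    stale = heartbeat_is_fresh(heartbeat)
    missing = heartbeat_is_fresh(os.path.join(tmp, "no-such-file"))

print(fresh, stale, missing)
```

The tolerance should be a couple of job intervals, so that a single late run does not raise a false alarm.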

That said, you should not avoid monitoring discrete components of your software. In our example, monitoring the client side output of the Fubar service is obviously the goal, but you should also set up probes checking fubard, fubard-spool and crond operation.

If anything is wrong with the service, you will (hopefully) be able to determine which component is at fault simply by looking at your monitoring system’s dashboard. That’s usually one look at a web page – it does matter when your, um, mobile ventilation system has been, erm, impacted.

PS. all of the above also applies to checking PID files.


UDP Fragmentation Offload bug in 2.6.32.x

There’s a nasty bug in 2.6.32.x: when UFO (UDP Fragmentation Offload) is enabled on an interface and datagrams follow the software fallback path, NFS sessions get corrupted. It’s easy to reproduce – just copy a large (few hundred MB) file to a UDP NFS mount. The transfer will stall within 1-5 minutes and your copy process on the client will go into D state – stuck on nfs_wait_bit_uninterruptible.

This patch – Herbert Xu: “udp: Fix bogus UFO packet generation” – fixes the problem in 2.6.35-rc4.

Sadly, it’s not fixed in, though I reported it back in early October.

The workaround is obvious: disable UFO.

BTW: this will also hit you if you use qdev on KVM, as it enables several offload features in the guests, including UFO.

UPDATE: still not fixed in See bug #33972.


KVM or Xen?

Short answer: KVM

Long answer:

this question is actually easier to answer than you may think, particularly nowadays.

I’ve been working with KVM and Xen in three types of environments:

  • small, with one host and two or three guests
  • medium, with 50+ hosts and 1:5 host/guest ratio
  • large, with 300+ hosts, the ratio sometimes at 1:10 or even 1:20.

Between them, these guests run almost all possible workloads – CPU bound, I/O bound, network-heavy, you name it.

I’ll focus exclusively on the open-source, vanilla variants of both technologies and on Linux as host and guest OS.

I’m not going to cover performance in-depth, as it was discussed, flamed and benchmarked ad nauseam. There are, however, some myths I’d like to dispel. These myths seem to have originated from the times when KVM was in early beta stage development.

You can get decent performance from both platforms – it’s a question of how you use them, how well planned your virtualization setup is, how much you are overprovisioning, etc. Remember, Xen had quite a head start, but KVM has finally shown signs of maturing. I’ve considered it stable enough to use in production since late 2009.

Latency introduced by Xen and KVM is usually not a problem in most Internet/web environments. Here’s a test I did a year ago, comparing layer 7 (HTTP) latency on bare-metal and Xen. Expect similar results from KVM.

Bare metal:

Xen domU with pinned vcpus:

Disk I/O is pretty much the same – since late 2009 virtio-blk performance in KVM has been vastly improved and in most cases (magnetic disks, arrays, single SSDs) you can expect current hardware to be the limit.

Network I/O is a different thing altogether. Xen domUs with paravirtualized network drivers (yes! I’m comparing PV domUs to KVM guests!) can almost do low latency wire speed on Gigabit Ethernet interfaces. In current qemu-kvm versions there are two feasible options: emulated e1000 and virtio-net.

The general consensus is that e1000 introduces considerably lower (2-4x) latency than virtio-net, while the latter gives higher throughput. e1000, as emulated by qemu-kvm, is unfortunately a huge interrupt hog. I couldn’t reach more than 600 Mbit/s with nttcp using e1000 (on a Core2-based Xeon), yet with virtio-net 900+ Mbit/s is easily achievable. Emulated e1000 is so interrupt intensive that when a virtualized firewall with two e1000 interfaces was given a single VCPU by mistake, it immediately started dropping large packets and even the console became noticeably less responsive. Changing the interfaces to virtio solved the problem.

As of mid-2010, I’m using virtio-net exclusively and have no complaints about performance or stability.

There’s a new device API since qemu-kvm 0.12 – qdev. Instead of the usual -net/-net pair you can use:

-device virtio-net-pci,netdev=tap1 \
-netdev type=tap,id=tap1,ifname=tap1

qdev may be used for other devices:

-device virtio-blk-pci,drive=vda \
-drive if=none,id=vda,file=/dev/foo

It’s not just different option naming, the underlying mechanism is different. Sometimes you can expect performance improvements!

Ease of use – these days Xen is pretty much plug and play, though KVM has the edge here. With Xen, you have to run the hypervisor on bare metal. The host system – dom0 – is actually a special, privileged guest. As KVM is a part of the mainline Linux kernel and as much as possible is done in userland via qemu-kvm, running guests is (almost) as simple as running regular unix daemons. Unless, that is, you use libvirt.

Frankly, libvirt does more harm than good in most circumstances. If you’d like to run a large shop, chances are your setup is going to be non-standard and you’ll end up fighting libvirt’s quirky configuration interface. If all you want is one host and a guest or two, libvirt is too complex anyway – you can achieve the same thing with a simple init script. The only setup where I’d consider libvirt is a mixed KVM/Xen environment – but even then, preparing custom scripts is something to consider.

Community support – that’s the single most important aspect, more important than technical details and this is where Xen fails miserably.

It’s the fruit of the decision made a few years ago by Xen devs to base Linux dom0 and domU code on 2.6.18. Of course, Xen patches have not been included in the mainline kernel. There is a so called “xenified kernel”, i.e. patched 2.6.18. 2.6.18 was a stable, long-term support kernel and everything would be peachy, but Linux development went on. New features were added, people started to use iSCSI in production environments – and iSCSI definitely matured after 2.6.18.

Some (desperate) moves were made to adapt the patches to newer kernels. Novell/OpenSUSE forward-ported them and xenified kernels 2.6.2x and later 2.6.3x were made available. Xenified 2.6.26 even made its way to Debian Lenny.

The results were miserable. These new xenified kernels were so unstable they crashed regularly under high load – mostly SMP domUs, but dom0 crashes were not infrequent. Paravirt ops (only for domU) became usable around 2.6.30, but early pv_ops kernels also crashed under load. To this day there’s no dom0 support in mainline kernels. The dom0 kernel version list reads like a D-Day attack plan.

A few months ago they started to use 2.6.32.x as their official patch base – which is good. Too little, too late, though.

Check this out: xenified 2.6.27 “super stable”?! No, but seriously… The strategy:

  • “Constantly refactor code for upstream submission
  • Stabilize existing Dom0 branch (…)
  • Rebase existing branches against newer kernels to keep up”.

When I tried to report issues and assist in debugging – at that time I had multiple domUs crashing constantly – I usually got one of the following answers (at least that’s the gist of it):

  • use UP guests
  • try pv_ops
  • we don’t have time.

Running UP guests may be fine for cheap VPSes, but if you require decent performance under moderate to high load, SMP is simply a must – especially if you need your guest to be responsive when doing heavy I/O. SMP is a fact of life, world+dog have been using multi-core CPUs for quite a few years. Not supporting it properly in virtualized environments is a gross mistake.

When I got that reply, I decided to investigate other options for all my setups. It was late 2009 / early 2010 and KVM was already looking good. After extensive testing, I started to migrate VMs in March and so far I’ve been nothing but happy with KVM’s performance and stability.

Sure, I hit some bugs – two were particularly nasty. There was a huge memleak in virtio-blk that caused qemu-kvm processes to grow by 1 GB per day. KVM devs turned out to be quite helpful – I assisted in debugging the issue as usual and they had a working patch ready in a day. Later on, there was a problem (one of many) with pvclock in the kernel – one of my VMs crashed every few weeks. It was fixed before I reported it, along with other pvclock improvements in 2.6.32.x.

The community is superb – both on IRC and on the mailing lists. You get the feeling that KVM is actively developed and bugs are fixed all the time. KVM is also Red Hat’s virtualization technology of choice – they acquired its creator, Avi Kivity’s Qumranet, back in 2008. With RH behind it, KVM is here to stay and even if some code ends up exclusively in RHEL, most of it is completely free.

Some suggestions:

  1. KVM is actually two different parts – the kernel code and qemu-kvm, the userland process based on QEMU (IIRC, a merge is considered for the future). The kernel part is just around 50k lines of code and as much as possible is done in userland. This means you can use some of the standard debugging tools (not all! e.g. Valgrind doesn’t work well with qemu-kvm).

  2. Choose your hardware carefully. VT-x or similar (x86 hardware virtualization) is required. You need IOMMU (VT-d or similar) for device pass-through. If you’re purchasing new equipment, go for Nehalem-based systems. Guests on pre-Nehalem CPUs are known to exhibit bad performance e.g. in software build environments. That said, usually any recent Core2-based or later system will work with KVM. Remember, it is not only the CPU that has to support virtualization – motherboard and BIOS support is just as important. If you’re using server-grade hardware from popular vendors, you most likely have that support by default. Be careful with no-name machines and dedicated server services. If in doubt, always confirm with your vendor if the machine actually supports hardware virt. As vendors often don’t know their stuff, make sure you can try before you buy – or return.

  3. Plan. This is true for every virtualization technology.

    • What hardware resources are at your disposal?
    • What should be the host to guest ratio with your current and expected usage profiles?
    • Do you want local storage or a SAN?
    • What about fault tolerance, should it be service-based, guest-based or host-based?
    • Are you going to do live guest migration?
    • Do you really want to virtualize everything?
    • What exactly are your load profiles?

    You have to know the answers before you begin implementation. If you just happily throw everything in one basket, you’ll end up with mediocre performance at best. Don’t mix guests with different load profiles on one host! That’s a common mistake – I’ve seen environments where small web servers, generating almost no CPU or I/O load but heavily using the network and RAM, were put on one physical host along with batch processing systems saturating CPU and disk. The result? Web servers need low I/O latency – in this case the load from the batch crunchers caused constant latency problems. The crunchers didn’t have enough network bandwidth to do their thing efficiently. Putting the web servers on one set of hosts, the batch processors on another set – and benchmarking them to see how many I could put on one host – did wonders for performance.

  4. Measure and record everything. “Graph first, ask questions later”. Simply getting a current snapshot of crucial system stats is not enough to fully assess and comprehend the situation. Host foobar’s CPU core 2 is at 60%, it’s doing 1000 context switches per second and its I/O svctm is 5 ms. OK, you can probably say that this host is not unusually overloaded, but is that actually normal usage? What *is* normal? What are the trends?

  5. Use SMP guests (unless you’re running a “constrained” workload and don’t care about performance). Don’t take “we don’t support SMP” for an answer, ever.

  6. Use virtio-blk and virtio-net. It’s the only feasible, long-term solution.

  7. Use raw devices! You’ll probably get tempted to use images like QCOW2 – don’t bother. Performance is quite poor and there are flushing issues in current versions. Use LVM and remember to set cache=none in drive definitions in qemu-kvm. With cache=none data won’t be cached twice (in host and guest page cache) – devices will be opened with O_DIRECT. The “noop” scheduler on the guests may help – YMMV.

  8. If you care about data integrity, read about barriers – what they are, how they are used by Linux, your filesystem, qemu-kvm, etc.

  9. Once again: KVM guests are [almost] normal processes. You can use standard tools like taskset. If you run CPU-intensive software (e.g. Java) – confine it to a subset of physical CPU cores on the host. If you do that, bear in mind qemu-kvm VCPUs are actual threads. Use “info cpus” in qemu-kvm’s monitor to get their ids and use taskset to bind them to cores.

  10. Avoid using libvirt or other helpers.

  11. Use recent kernels and qemu-kvm! Linux 2.6.32.x is the current long term support kernel – make sure you have the most recent minor version, as there’s a lot of bug fixing going on in that branch. If you encounter bugs, try a fresh mainline kernel before reporting. Don’t use legacy versions of qemu-kvm. What you have in your “stable” distros is probably too old, unless it’s RHEL – they tend to backport a lot.
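The VCPU pinning from suggestion 9 can also be scripted. The sketch below parses the thread ids out of the monitor’s “info cpus” output and pins each VCPU thread with os.sched_setaffinity(), which does what taskset -p does (Linux only). The sample output format is an assumption based on what qemu-kvm printed at the time, and the thread ids and core numbers are made up.

```python
import os
import re

# Matches lines such as "* CPU #0: pc=0x... (halted) thread_id=3134".
# The exact monitor output format is an assumption.
VCPU_LINE = re.compile(r"CPU #(\d+):.*thread_id=(\d+)")


def parse_vcpu_threads(info_cpus_output):
    """Map VCPU index -> host thread id from "info cpus" output."""
    return {int(m.group(1)): int(m.group(2))
            for m in VCPU_LINE.finditer(info_cpus_output)}


def pin_vcpus(vcpu_threads, core_for_vcpu):
    """Bind each VCPU thread to one host core (requires suitable privileges)."""
    for vcpu, tid in vcpu_threads.items():
        os.sched_setaffinity(tid, {core_for_vcpu[vcpu]})


sample = (
    "* CPU #0: pc=0xffffffff8100b8d2 (halted) thread_id=3134\n"
    "  CPU #1: pc=0xffffffff8100b8d2 (halted) thread_id=3135\n"
)
threads = parse_vcpu_threads(sample)
print(threads)
# pin_vcpus(threads, {0: 2, 1: 3})  # would pin VCPU 0 to core 2, VCPU 1 to core 3
```

The pin_vcpus() call is left commented out because the thread ids in the sample do not exist on your host.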


Backend control in Varnish

Recently I’ve been working on shortening the path between web clients and the app servers. One of the steps was to remove the common combination of Apache httpd and JK, sitting between Varnish and Tomcat backends. The rationale behind that decision warrants a separate post, but for now let’s focus on one aspect: load balancing.

Varnish does load balancing with its directors. Round-robin and random algorithms are available, the latter with IP based or hash based stickiness, and now (in trunk) a DNS based director. There’s no direct backend control, though – you can’t (yet) disable a backend.

One solution is to configure an unused static file as the target of the health probes:

.probe = {
   .url = "/static/file";
   .interval = 5s;
   .timeout = 5s;
   .window = 5;
   .threshold = 3;
}

If you wish to disable a backend, remove the static file. After a few failed probes Varnish will mark the backend as sick and no more traffic will be sent to it until the file reappears.

The downside is that you have to wait until the configured probe threshold is reached. It’s not very practical if you have lots of backends and want to disable them in sequence (when you’re rolling out an upgrade). You also lose the possibility to use real dynamic actions for health probes.
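To estimate that wait, here is a small simulation of the probe window. It assumes Varnish considers a backend healthy while at least .threshold of the last .window probes succeeded, and that every recent probe was good before the file is removed. The function name is made up.

```python
from collections import deque


def probes_until_sick(window=5, threshold=3):
    """Count consecutive failed probes needed to mark a healthy backend sick."""
    if threshold < 1 or threshold > window:
        raise ValueError("threshold must be between 1 and window")
    recent = deque([True] * window, maxlen=window)  # all recent probes were good
    failed = 0
    while sum(recent) >= threshold:  # still enough good probes in the window
        recent.append(False)         # next probe fails, oldest result drops out
        failed += 1
    return failed


interval_s = 5
needed = probes_until_sick(window=5, threshold=3)
print(needed, "failed probes, roughly", needed * interval_s, "seconds")
```

With the settings above it takes 3 failed probes, so up to about 15 seconds per backend before traffic stops. That adds up quickly in a rolling upgrade.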

I had to have some form of backend control via Varnish CLI. This is what I came up with:

backend.disable directorname backendname
  Disable backend backendname in the director specified by
  directorname.  No new client requests will be sent to the
  backend, but health probes will continue. If all backends
  in a director are disabled, normal grace rules will apply.

backend.enable directorname backendname
  Enable a previously disabled backend backendname in the director
  specified by directorname.  Traffic to the backend will resume
  immediately, unless the backend is sick.

  List all backends in all directors, their administrative and health
  status in the following format:

    director: directorname backend: backendname <Enabled|Disabled>

The patch adds backend control commands to the CLI. They work for all backends that are members of a director. Controlling directorless backends – “simple” directors – is not possible.

It works for me (TM) in production and it’s fine as an interim patch, but it will never be committed to the 2.1 branch. There’s an issue with backend naming when multiple VCLs are loaded.

Varnish backend objects survive VCL reloads if their definitions in the VCL remain unchanged. However, if a definition changes, a new instance of the changed object is created. It’s possible then to have multiple instances of the same backend or director.

A consistent naming scheme for those instances is required. It has not been decided yet, but it may look like this:


It seems reasonable to always disable all instances – that’s exactly what I do in my patch. There’s a catch, though: if you disable a backend and load a new VCL, a new instance of each director will be created with all its backends in their default state (enabled). You have to do backend.disable again before vcl.use.

phk has agreed to look into this issue before 3.0.



Frobnitzer calibration

One, two, three… we’re not on the air yet, are we?

OK, I suppose I’m not the blogger type. Let’s hope I can post something once in a while.

Prepare for tenacious rants on IT Ops and other stuff I find interesting!
