The fairy tale of paid hardware support

Yes, you heard me right. Run-of-the-mill hardware support contracts are getting less and less useful.

We’ve seen severe cost cuts throughout tech support departments at several major server hardware vendors as the late-2000s “financial crisis” unfolded. Entire call centers and first line support teams have been merged and relocated. New procedures have been implemented with a single purpose in mind: to delay (or avoid) the actual shipment of spare parts to the customer and the booking of local support techs. Well, at least it looks that way.

Let’s take a look at the pre-2010 support procedure of a major server vendor, at the highest standard level of support offered:

  1. Receive a service request, the clock starts ticking.
  2. Ask additional questions if necessary.
  3. Acknowledge the problem.
  4. Book spare parts and/or a support technician, done.

Now it looks more like this:

  1. Receive a service request, the clock starts ticking.

  2. Stall for time: make the customer update each and every piece of firmware in the machine, even the most obscure and unrelated to the problem. If the machine is completely unavailable, go to step 3.

  3. Stall some more: make the customer book a pointless visit to the data center or engage the DC staff in equally pointless circus tricks: physically cycle the power or remove and re-insert the machine into its slot if it’s a modular affair. Don’t forget to open the device and move some stuff around from slot to slot, like DIMMs, line cards or other modules. Hello? If there’s been a major hardware failure, components must be replaced, not moved around. This is not poker and you’re not shuffling cards! The clock’s still ticking.

  4. Under no circumstances accept evidence of component failure from industry-acclaimed hardware testing utilities. Make the customer download a 3.9 GB DVD image with an obscure internal test suite – which fails to detect even common and straightforward failures.

  5. All this requires several days’ worth of “work” for all involved: meanwhile, your service request will be passed from person to person and from team to team – a sort of ticket tag. Every new techie assigned to your case knows next to nothing about it, despite the full description, data, logs, etc. available in their ticketing system. Which is just fine – unless you have more than three servers and two switches to worry about and deal with hardware failures on a daily basis in a large infrastructure.

  6. Finally, when all other options have been exhausted, book CRU parts or a tech. Unless…

…it’s firmware. In fact there’s so much of it: every major hardware component has a uC and runs some kind of firmware. Even the damn fans. That wouldn’t be bad by itself, if not for one little problem they don’t tell you about: hardware support contracts only really cover – you guessed it – hardware. Firmware is rarely prepared by your vendor. Rather, it’s a mix of obscure pieces of code written by disgruntled developers hired by outside contractors, old proprietary PoC OEM code (remote video consoles, anyone?) and some LGPLed libraries, typically mangled beyond recognition. That’s why it’s nearly impossible for the vendor to fix any serious firmware bugs within the few hours specified in your contract.

Remember the DRAC/MC? The firmware was clearly FUBAR and yet all Dell techs could do was offer you a replacement DRAC/MC module – which unsurprisingly fixed nothing, since the problems were not related to hardware. Remember the EDAC issues on HP BL460c G1 blades? HP chose to hide its head in the sand on that one. Apparently, when faced with grave firmware quality issues, it’s sometimes cheaper for the vendor to wait for a new product line rather than fix the old one.

Oh, and there’s something else: firmware upgrades are ultimately your responsibility, even if requested by the vendor’s tech support. Let’s say you have internal procedures related to firmware upgrades – a stable and proven update schedule. Then you encounter a critical bug in production and the vendor tells you to update the firmware – just because. You have two choices: refuse to update immediately and get your ticket dropped due to non-compliance with the vendor’s recommendations, or update with shortened (or no) internal QA, risking further outages if the new version is a dud (and that happens all too often).

So, what to do?

Apart from some specialized infrastructures and quite a few corner cases, there are basically four types of IT shops:

  • large enterprise / govt. Here, you’re bound by internal policy or market regulations. Usually it means little choice: no free (as in beer) software, everything certified, configured and deployed as per vendor-approved specifications. Delays are often acceptable, and the responsibility for hardware may be ceded to the vendor. On the plus side, there’s usually quite a lot of hardware, internal stock of spare parts or spare devices – and the biggest players get dedicated support teams at their vendors, exclusive access to developers and custom software builds to boot. Vendor/contractor techs take care of at least some parts of system implementation and maintenance.

  • educational / HPC. Similar to large enterprise, but without most market regulations.

  • conventional SMEs. IT is often fully outsourced. If not, there are one or a few machines for internal use (mail, website, file sharing, etc.), usually without spares or redundancy.

  • startup / high scalability web. Lots of machines, sometimes as many as at a large enterprise. Cheap x86/x64 hardware is used in large quantities and considered unreliable (reliability is achieved through software, by design). Free software and custom, in-house solutions are accepted and encouraged.

There is no one-size-fits-all solution. But here’s what you can do in most situations.

Implement an in-house hardware certification program. Get evaluation hardware from your vendors and test the living daylights out of it – and I don’t just mean a single pass of memtest86! Check firmware quality, test possible failure modes, check whether that hot-plug hardware really is hot-pluggable. Run both synthetic hardware stressors and your real load profile on the machine. Try out the features you’re unlikely to ever use – in a few months it may turn out you need them after all. Check how your OS of choice behaves on the evaluation machine, if it’s a server. Search the web for any signs of hardware-related issues people may be having. Don’t forget about driver quality. Once you’ve certified the hardware for production use, only order that particular model until it’s phased out by the vendor. Repeat the certification process for every new model.
The main rule is: small bugs are likely to be patched in future firmware updates, but severe issues will never get fixed. If the firmware is a complete mess now, don’t count on it being sorted out in an update. In other words, what you buy is what you get.
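A hedged sketch of what the synthetic half of such a burn-in run could look like on a Linux evaluation box – the tools, device names and durations here are illustrative assumptions, not a prescription, and your real load profile still has to be run separately:

# baseline drive health before the run
smartctl -a /dev/sda
# CPU and memory stress for two days
stress-ng --cpu 0 --vm 4 --vm-bytes 75% --timeout 48h --metrics-brief
# random I/O torture on a scratch disk for 24 hours (destroys data on /dev/sdb!)
fio --name=burnin --filename=/dev/sdb --ioengine=libaio --direct=1 \
    --rw=randrw --bs=4k --iodepth=32 --numjobs=8 --time_based --runtime=86400
# compare the error counters afterwards
smartctl -a /dev/sda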

Consider NOT using blade servers. Um, but everyone likes blades, right? Blades are good. Blades minimize costs, reduce cabling, increase density and shoot double rainbows out of the cooling ports. But think of it this way. A standard blade enclosure is 10U with 16 slots; 16 standalone 1U servers need 16U – 60% more rack space – and that’s without counting the modular blade network switches or other I/O modules. If you run HPC clusters, you can get double modules packing two servers into a single slot.

However, if anything happens to the enclosure, you may lose all 16 (or 32) servers at once. Power, cooling and management are usually redundant, but firmware, backplanes and the comm buses between the modules aren’t. It gets even better – and this is something people tend to forget about these days, as it’s apparently passé to work longer than two years for a single company and very fashionable to work for an emerging startup: when the hardware gets EOLed, you lose support not just for the servers, but for the enclosure as well. If a component fails, well, good luck on ebay – or you’ll have to immediately decommission 16 machines before the whole thing goes down. If you have 16 standalone servers instead, you only lose a single machine to a critical failure in that machine’s chassis. You can phase out and sell the devices one by one or even decide to keep some of them running well past their EOL and designate the rest as spares. While you can obviously do the same with blades, each enclosure comes at a big premium – which is fine if you’re a large enterprise or a very large startup, but if not, it may hurt a bit financially – and 16 servers may just as well constitute a significant portion of your hardware base.

Introduce a coordinated program of periodic updates to mitigate issues stemming from hardware obsolescence. In every healthy economy (yeah, I know…), hardware is cheaper than people and their time. Five-year-old commodity servers or disk arrays are antique by any standard. Even if they’re still working, you can definitely do more on fewer machines – with less power consumption and higher density.

And finally, since you can’t really be safe from firmware failures, diversify your hardware base: consider getting hardware from more than one vendor. That way, you’re more or less protected from firmware bugs that would otherwise drop your infrastructure like a domino. Unless, of course, the vendors share the same OEM code (LOMs anyone? tape robot controllers?). There’s one catch, though, that you need to include in your risk calculation: it may be harder to get support in a heterogeneous system, as you provide the vendors with a great excuse. Manufacturer X will always cling like a limpet to the fact that you have equipment from manufacturer Y in your network, SAN, rack or whatever. Welcome to the blame game. Several major server and storage vendors will even go so far as to ask you, when you file a ticket, for the full list of equipment in the rack in which the device in question is installed. As if the EM interference from other gear in the rack somehow affected their hardware – well then, congratulations on the design!

Ultimately, there’s one thing to remember: hardware support contracts suck. When you shell out for support, what you actually pay for is the privilege of getting your phone call answered, not for getting your machine back up and running. Design your infrastructure resilience around this fact.


Hi-end audio for nerds – part 1

Audiophiles and hi-end audio equipment are the subject of much controversy. Everyone seems to have an opinion, ranging from utter disbelief in anything audiophile to complete trust in whatever the manufacturers have to offer. Unsurprisingly, a lot of FUD and fables arose over the years on both sides.

I’ll try to provide a more or less moderate point of view in this article – along with a few facts from recent audio gear history and a primer on what this whole audio quality thing is actually about. I’ve been on both sides of the control room glass for quite a few years (you’d be surprised how many audiophile myths persist even in the pro audio crowd). I also have a penchant for good music and good quality sound, so I guess it puts me right on the fence. Disclaimer: I’m not a full-time engineer. Feel free to correct me.

Note that I’ll only be covering stereo – for several reasons. It’s 5.1 times harder to get room acoustics right with multichannel and there aren’t that many good albums out there in surround – as it’s also 5.1 times harder to mix a multichannel album! Besides, the greatest albums of all time in pretty much any genre are stereo (or even mono).

Part 1: Earing the difference

Hearing is like any other sense in that it can be trained. It’s true that an experienced engineer can and will hear and identify both obvious artifacts (comb filtering, phase cancellation, etc.) and subtleties like differences between A/D converters. As with other skills, ear training cannot really be taught – it must be learned through years of practice. Good engineers will familiarize themselves with the “sound character” (timbral qualities) of every piece of equipment they use.

Bear in mind, though, that subtleties are just that – subtle. They’re quite easy to distinguish with a trained ear on a solo track, but in a mix it’s a completely different story. The best example is real vs. sampled grand piano. Sampled pianos have gotten really good in the last few years, but it’s still relatively easy to tell them apart from the real thing – provided we hear the piano track solo. It’s much harder (or impossible) to tell the difference in the mix, unless the sampled version is really, really bad, like from a cheap rompler.

Many aspects of sound can be measured with extreme accuracy. The thing is, it’s difficult to precisely describe timbre. It’s all harmonics and partials, envelopes and modulation, and it can be sampled and represented numerically – otherwise, sound reproduction would be impossible. However, timbre is also a sensation. Subjective assessment and psychoacoustics (and, in many cases, psychology) have their say here. Mankind has only recently started to grasp the nature of timbre and some commonly accepted beliefs have already been proven false.

This is exactly why we have listening tests. The de facto standard is double-blind ABX testing. Double-blind means that neither the test subjects nor the people administering the test know which stimulus is being presented. ABX is the method of presenting stimuli: A and B are the two alternatives and X is randomly either A or B, changing between trials – the subject has to identify which of the two X is.

Note: there’s a story on how blind tests supposedly don’t work – a respected audio engineer purportedly discovered, in a ten-minute non-blind listening test, an artifact in a codec that passed 20,000 double-blind, triple-stimulus trials with hidden reference (as defined by ITU BS.1116) by 60 expert listeners at the Swedish Radio. I’ve seen this story quoted hundreds of times in print and on the Internet, sometimes by quite knowledgeable people. It actually comes from this article in Stereophile and there’s not a single piece of evidence available to corroborate it (other than “an audio recording played at (…) the 91st AES convention” mentioned – you guessed it – in the article). Even if it were true, it might as well mean that everybody had heard the artifact but the fact got lost in the average results.

All comparisons must be done at precisely matched levels. Human ears are sensitive to variations as low as 0.5 dB and louder sources are always perceived as better in direct comparisons.

Not all ears are alike: the standard human hearing range (20 Hz – 20 kHz) differs from person to person and decreases significantly with age (adults rarely hear above 15 – 18 kHz). What’s often missed, however, is that there are dips and peaks within the hearing range, e.g. a person may be less sensitive to frequencies around 4 kHz, while another person may have a sensitivity peak there, influencing perception of speech and music material in that range.

If you’d like to know your hearing range, take an audiometric test. Be aware that few facilities offer proper high-frequency audiometry (above 8 kHz). You certainly can’t use a tone generator on your computer to test: consumer sound cards can’t accurately reproduce very high frequencies. What you actually hear with that tone generator set to 20 kHz is just aliasing noise. Moreover, audiometric headphones have to be properly calibrated.

Using standard PC sound cards (particularly those in laptops) is a bad idea anyway for any audio-related testing. Apart from poor sound quality (cheap analog and digital components and an abundance of noise problems, especially at low listening levels) these cards usually have a built-in limiter to protect the speakers, so you may experience compressor pumping – unwanted audible gain changes. Many an engineer got bitten by this: the customers would listen to the mix on their laptops, hear severe compressor pumping and complain to the engineer, who in turn would not hear it on the studio setup. Turning down the level usually helps.

Let’s get back to audio gear. The usual hi-end audio setup is: loudspeakers set up in some kind of a listening room, connected to an amplification device (integrated or not), connected to a sound source, connected to a media transport (combined or not). I’m going to cover most of these components in that order, which (to me) reflects their relative influence on the overall sound character.

Coming up next: $10,000 loudspeakers – we won’t get fooled again (or will we?)


Flow control flaw in Broadcom BCM5709 NICs and BCM56xxx switches

There is a design flaw in Broadcom’s “bnx2” NetXtreme II BCM5709 PCI Express NICs (not to be confused with the older PCI-X version, BCM5708) and the BCM56314 and BCM56820 switch-on-a-chip OEM Ethernet switches.

These NICs are extremely popular: Dell and HP use them throughout their PowerEdge and ProLiant standalone and blade server ranges. The OEM switching chips are not that common, but are found e.g. in Dell’s high-end top-of-rack PowerConnect 6248 Gigabit Ethernet and PowerConnect 8024 10G switches.

The flaw is in the flow control (802.3x) implementation and results in a switch-wide or network-wide loss of connectivity. As is common in major failures, there is more than one underlying cause.

The problem begins when a host with a BCM5709 crashes or otherwise stops accepting data from the NIC’s RX buffer. The NIC is not quiesced on host system crash and when its buffer fills up, it starts sending flow control PAUSE frames. Arguably, this is a design flaw in itself – the situation is handled properly e.g. by the BCM5708 NIC.

Now, this event alone would not cause a network-wide disruption. PAUSE frames are transmitted point-to-point – between a NIC and a switch port. When a switch receives a PAUSE frame on a port, it simply buffers or drops frames destined for that port. Under no circumstances should a switch propagate PAUSE frames or generate and transmit them by itself. The standard actually allows it, to accommodate blocking switching matrices, but it’s insanely, criminally brain dead. No respectable network equipment vendor allows the switch to send PAUSE frames.

However, this is exactly what the BCM56314 and BCM56820 switches do. Not only is flow control enabled by default (since software 3.x for the BCM56314 and since the first release for the BCM56820), it is configured to send PAUSE frames by default as well! This behavior cannot be disabled – flow control has to be switched off globally, as per-port setting is not available.

When the culprit host’s NIC starts sending PAUSE frames, the switch also sends PAUSE frames to every port trying to communicate with that NIC, causing a switch-wide loss of connectivity. It gets worse if flow control is left enabled on inter-switch connections – predictably, all ports on all switches eventually cease to forward frames.

In my experience, Broadcom (and Dell) have been notoriously difficult to work with when it comes to firmware. There is practically no chance they will do anything about this (namely, fix BCM5709’s firmware, disable flow control by default and introduce the possibility to separately disable TX/RX flow control – per-port or at least globally).

Below are some guidelines on how to avoid this issue:

  • Disable flow control on all switches. There is a caveat here, though: some vendors (e.g. of enterprise iSCSI solutions) actually require – for reasons unknown – RX and TX flow control to be enabled. Other than that, most networks do not need flow control at all.

  • If you can’t disable flow control on all switches, at least disable it on your core switches. If you use it in the core, you’re Doing It Wrong™.

  • If you can’t disable flow control on the switches at all, disable TX flow control (TX PAUSE) on all BCM5709 NICs – see the ethtool sketch after this list. However, I found these settings to be utterly unreliable: disabling TX flow control on some BCM5709s resulted in a complete loss of connectivity with some, but not all (!), BCM56314 and BCM56820-based switches.

  • Do not use BCM56314 and BCM56820-based OEM switches (e.g. Dell PowerConnect 6248, M8024, 8024F). Get your switches from a respectable network hardware vendor – there are quite a few these days. It’s all about support: switches are core business for networking vendors. They have in-house teams familiar with their specific implementation and they’re able to quickly provide software patches if required. With OEM switches you’re pretty much on your own. Chances are there’s no one in your vendor’s team familiar with the actual software source code, since only the original manufacturer has access.
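On the NIC side, flow control is normally inspected and changed with ethtool; a minimal sketch (the interface name is an example, and as noted above the setting proved unreliable on some BCM5709s):

ethtool -a eth0                      # show current pause parameters
ethtool -A eth0 autoneg off tx off   # stop the NIC from transmitting PAUSE frames
ethtool -A eth0 rx off               # optionally stop honoring received PAUSE frames too
ethtool -S eth0 | grep -i pause      # pause-related counters (names vary by driver)

Remember to make the change persistent in your distribution’s network configuration, or it will be lost at the next reboot.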

UPDATE (2011-10-27): Sven Ulland of Opera Software has just posted some excellent research on this issue on the linux-poweredge mailing list.

It seems the problem has been noticed by Broadcom and at least partially addressed in the 4.x switch firmware. This is an excerpt from the 4.x release notes for the 80xx series:

Release 4.1.0.19
(…)
Asymmetric flow control is implemented for the PC8024X, PCM8024, PCM6348, PC70XX, and PCM8024-k switches. The switch does not generate pause frames when congested. It will honor pause frames as per industry standards.

I have yet to check if it actually works. Sadly, 62xx switches have been omitted – a grave mistake since they are much more abundant than the 80xx series.

NIC-wise, Sven has confirmed that the bnx2 driver v2.0.18c (distributed with firmware 6.0.17 for the 5709) fixes the issue (i.e. the NICs no longer send PAUSE frame floods on host system crashes). The driver is included in the mainline kernel 2.6.37.

Unfortunately, I can still easily reproduce the problem on vanilla 2.6.38, but I do recommend that you try it yourself: it seems to be network dependent.
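To check which driver and firmware a given NIC is actually running (the interface name is an example):

ethtool -i eth0                  # look at the driver, version and firmware-version fields
modinfo bnx2 | grep '^version'   # version of the bnx2 module installed on disk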


My favorite books on hackers

Just a personal top list. I’m not a security expert, but some of the guys featured in these books were more proficient than some white hats are today. And yes, the word hacker is used in a pejorative sense nowadays. A cracker was someone who cracked copy protection in the Atari/Commodore/Amiga days. Deal with it.

So, on to the list:

  1. Underground: Tales of Hacking, Madness and Obsession on the Electronic Frontier (1997) by Suelette Dreyfus with research by Julian Assange, about the most prolific hackers in the Australian computer underground (with guest appearances by the Brits and the Americans) in the late 80s and early 90s. I first read it some ten years ago and couldn’t put it down. This is, hands down, the best book on hackers ever written, period. Superb research, excellent narrative, immensely readable for both computer experts and the non-technical types. Ms. Dreyfus, Mr. Assange – hats off! Oh, and did I mention the book has been available as a free download since 2001, without any DRM crap?

  2. Beating the System – Hackers, Phreakers and Electronic Spies (1992) by Owen Bowcott and Sally Hamilton. It’s about the 80s European hacker scene and (in)famous early hackers like Edward Singh – or the West German hacker ring (Urmel, Pengo, Hagbard, etc.) who passed western military information to the KGB. Hagbard’s body was later found in a forest near Hannover, burned to death with gasoline. The death was considered a suicide. I first read this book when it came out and re-read it some 20 times during the next ten years. A great piece of old school no-BS journalism. It’s definitely for a non-technical reader, explaining what hacking is all about and what makes a hacker tick, but there’s LOTS of interesting information on the 80s hacking community, their exploits and their personalities. If I were to recommend just one book on hackers to a non-technical reader, this would be it.

  3. The Cuckoo’s Egg: Tracking a Spy Through the Maze of Computer Espionage (1989) by Clifford Stoll. Well, what can be said – it’s a classic. My only complaint about this book is that at 320+ pages it runs a tad long and feels a bit stretched.

  4. Kingpin: How One Hacker Took Over the Billion-Dollar Cybercrime Underground (2011) by Kevin Poulsen, about the black-hat-turned-grey-hat-turned-black-hat hacker Max Butler. I didn’t particularly like Poulsen’s writing style, but this book is a must-read for the following reason: it’s not about the late-eighties naive and idealist “ethical” hackers. It’s about real criminals and the massive opportunities for fraud today’s interconnected world provides them. The sheer scale of their operations – and the fact that the most successful Eastern European fraudster was never caught – are just mind-blowing.

  5. Out of the Inner Circle: A Hacker’s Guide to Computer Security (1985) by Bill Landreth. This is a technical guide, obviously terribly outdated, but if you want to know how hacking was done in the early eighties, before the Internet and widespread TCP/IP, or if you’re just nostalgic, it’s your go-to book.

Honorary mention: the 1998 German movie 23, based on the story of the West German hacking/espionage ring. Although it takes liberties and has been criticized by some of the participants in the actual events, it’s the only film that portrays high-profile hacking with any degree of realism. Fun fact: Poland’s own Zbigniew Zamachowski plays a KGB officer in this flick.

And just to remember the original meaning of the word hacker, may I recommend Where Wizards Stay Up Late: The Origins Of The Internet (1998) by Katie Hafner and Matthew Lyon. It’s the story of BBN, IMPs, and – most importantly – the people who made ARPANET possible. It’s a bit dry and reads somewhat like a history textbook, but it’s immensely informative.

On a side note: in Kingpin, it was mentioned that a CERT team had been brought in to extract the disk crypto key from Max Butler’s computer while it was still on (the police had reportedly distracted Butler so that he didn’t have time to cut the power). I’m wondering what method they used:

  1. He just accidentally left a console unlocked – very likely; both the perps and law enforcement made trivial mistakes during the course of the story.

  2. They used a FireWire dongle to spoof an SBP-2 device and read the key from memory via DMA – also likely, law enforcement has been doing that for years.

  3. They did a cold boot attack – very unlikely, IMHO.

  4. There was no attack on the live system at all and the guy just used a weak passphrase – likely, but the info in the book does not corroborate that.

  5. There was a LE backdoor in the crypto software – maybe…


Barriers, Caches, Filesystems

With the recent proliferation of ext4 as the new “default” Linux filesystem there’s been much talk of write barrier support. The flurry of post-2.6.18 barrier-related development in most storage subsystems has left some novice users and administrators perplexed. I hope I can clear it up a bit with this primer/refresher.

If you’re familiar with the basics of I/O caching, just skip to the “Barriers” section.

Barriers have long been implemented in the kernel, e.g. in ext3, XFS and ReiserFS 3. However, they are disabled by default in ext3. Up until recently, there was no barrier support for anything other than simple devices.
 

Two words: data safety

Let’s take a look at the basic path data takes through the storage layers during a write-out in a modern setup: a userland process writes to a filesystem, the data passes through the VFS/page cache, the block layer and the device driver to the storage controller, and finally on to the physical drives.

Several of these layers/components have caches of their own: the OS page cache and buffer cache, the controller write cache and the write cache on the drives themselves. There may be other caches in the path, but this is the usual setup. The page cache is bypassed if data is written in O_DIRECT mode.

When a userland process writes data to a filesystem, it’s paramount (unless explicitly requested otherwise) that the data makes it safely to physical, non-volatile media. This is a part of the “D” in ACID. Generally, data may be lost if it’s in volatile storage during hardware failure (e.g. power loss) or software crash.
 

Caches

The OS page cache (a subsystem of the VFS cache) and the buffer cache are in the host’s RAM, so obviously volatile. The risk here is that the page cache is relatively large compared to the other caches – and it can’t survive even an OS crash.

The storage controller write cache is present in most mid- and hi-end controllers and/or HBAs working in modes other than initiator-target: RAID HBAs, DAS RAID boxen, SAN controllers, etc. Every vendor seems to have their own name for it:

  • BBWC (Battery-Backed Write Cache)
  • BBU (Battery-Back-Up [Write Cache])
  • Array Accelerator (BBWC – in HPese)
  • FBWC (Flash-Backed Write Cache)

As the names suggest, BBWC is simply some memory and a rechargeable battery, usually in one or more proprietary FRU modules. In hi-end storage systems, the battery modules are hot-swappable, in mid-end systems a controller has to be shut down for battery replacement. RAID HBAs require host down time for battery maintenance unless you have hot-swap slots and multiple HBAs serving multiple paths.

FBWC is the relatively new generation of volatile cache where the battery assembly is replaced with NAND flash storage – not unlike today’s SSDs – and a replaceable capacitor bank that holds enough charge to allow data write-out from DRAM to flash in case of power failure.

Both types of cache have their drawbacks: BBWC needs constant battery monitoring and re-learning. Re-learning is a recurring process: the controller fully cycles (discharges and recharges) the battery to learn its absolute capacity – which obviously deteriorates with time and usage (cycles). While re-learning, write cache must be disabled (since at some point in the process the battery will be almost completely discharged and unable to power the BBWC memory). This is a severe periodic performance penalty for write-heavy workloads, unless there’s a redundant battery and/or controller to take over. Good controllers allow the administrator to customize re-learn schedules. The batteries must be replaced every few months or years.

Flash-based write cache is also subject to deterioration: the dreaded maximum write count for flash memory cells (however, flash is used only on power failure). The backup capacitors degrade over time. The NAND modules and the capacitor bank must be monitored and replaced if necessary.

Write cache on physical media (disk drives) is almost always volatile. Most enterprise SSDs and some consumer SSDs (e.g. the Intel 320 series, but not the extremely popular X25-M series) have backup capacitors.

Modern disks have 16-64 megabytes of cache. The problem with this type of cache is that not all drives will flush it reliably when requested. SCSI and SAS drives do the right thing: the “SYNCHRONIZE CACHE” (opcode 35) command is a part of the SCSI standard. PATA drives would usually outright lie about it to cheat on benchmarks. SATA does have the “FLUSH CACHE EXT” command, but whether the drive actually acts on it depends on the vendor. Get SCSI/SAS drives for mission critical data – nothing new here.

One more caveat with disk write cache is that the controller software – to ensure data durability – MUST guarantee that all data flushed out of the controller write cache is committed to non-volatile media. In other words, when the OS requests a flush and the controller returns success, the data MUST have already been committed to non-volatile media. This is why disk write cache MUST be disabled if BBWC or other form of controller cache is enabled – the controller cache must be flushed directly to non-volatile media and not to another layer of volatile cache.
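For reference, this is how drive-level write cache is usually inspected and switched off on Linux – a sketch with example device names; with a hardware RAID controller in the path you would normally do this through the controller’s own management tool instead:

hdparm -W /dev/sda                        # show the write cache state on a (S)ATA drive
hdparm -W 0 /dev/sda                      # disable it
sdparm --get WCE /dev/sdb                 # show the WCE (write cache enable) bit on a SCSI/SAS drive
sdparm --clear WCE /dev/sdb               # clear it
cat /sys/class/scsi_disk/*/cache_type     # what the kernel thinks: "write back" or "write through"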

Software RAID with JBOD is a special case: there is no controller cache, only the drive cache, the OS page cache and buffer cache.
 

Barriers

Think of write barriers on Linux as a unified approach to flushing and forced I/O ordering.

Consider a setup that stacks several of these layers – say, a filesystem on an LVM logical volume on top of dm-crypt, sitting on an MD RAID array built from SATA disks.

This is a bit on the extreme side, but ponder for a moment how many layers of I/O (and caches) the data has to pass through to be stored on the physical disks.

If the filesystem is barrier-aware and all I/O layers support barriers/flushes, an fs transaction followed by a barrier is committed (flushed) to persistent storage (disks). All requests issued prior to the barrier must be satisfied before continuing. Also, an fsync() or a similar call will flush the write caches of the underlying storage (fsync() without barriers does NOT guarantee this!). Barrier bios (block I/Os) actually do two flushes: one before the bio and one afterwards. It’s possible to issue an empty barrier bio to flush only once.

Barriers ensure critical transactions are committed to persistent media and committed in the right order, but they incur a – sometimes severe – performance penalty.

Let’s get back to our two hardware setups: software RAID on JBOD and hardware RAID with BBWC.

Since barriers force write-outs to persistent storage, disk write cache can be safely enabled for MD RAID if the following conditions are met:

  • the filesystem supports barriers and they are enabled
  • the underlying I/O layers support barriers/flushes (see below)
  • the disks reliably support cache flushes.

However, on hardware RAID with BBWC, the cache itself is (quasi-)persistent. Since RAID controllers do implement the SYNCHRONIZE CACHE command, each barrier would flush the entire write cache, negating the performance advantage of BBWC. It’s recommended to disable barriers if – and only if – you have healthy BBWC. If you disable barriers, you must monitor and properly maintain your BBWC.
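In practice this boils down to a couple of mount options; a sketch with example devices (the per-filesystem defaults are listed in the matrix below):

mount -o noatime,barrier=1 /dev/md0 /data       # ext3/ext4 on MD RAID: make sure barriers are on
mount -o noatime,barrier=0 /dev/sda1 /data      # ext3/ext4 on healthy, monitored BBWC: barriers off
mount -o noatime,nobarrier /dev/sdb1 /scratch   # XFS spells the latter "nobarrier"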

Full support for barriers on various virtual devices has been added only recently. This is a rough matrix of barrier support in vanilla kernel versions:

Barrier support                                               Kernel version
I/O barrier support                                           2.6.9
ext3                                                          2.6.9
reiserfs                                                      2.6.9
SATA                                                          2.6.12
XFS – barriers enabled by default                             2.6.16
ext4 – barriers enabled by default                            2.6.26
DM – simple devices (i.e. a single underlying device)         2.6.28
loop                                                          2.6.30
DM – rewrite of the barrier code                              2.6.30
DM – crypt                                                    2.6.31
DM – linear (i.e. standard LVM concatenated volumes)          2.6.31
DM – mpath                                                    2.6.31
virtio-blk (only really safe with O_DIRECT backing devices)   2.6.32
DM – dm-raid1                                                 2.6.33
DM – request based devices                                    2.6.33
MD barrier support on all personalities *                     2.6.33
barriers removed and replaced with FUA / explicit flushes     2.6.37

* Note: previously barriers were only supported on MD raid1. This patch can be easily applied to 2.6.32.

As of 2.6.37, block layer barriers have been removed from the kernel for performance reasons. They have been completely superseded by explicit flushes and FUA requests.

FUA is Force Unit Access: an I/O request flag which ensures the transferred data is written directly to (or read from) persistent media, regardless of any cache settings.

Explicit flushes are just that – write cache flushes explicitly requested by a filesystem. In fact, the responsibility for safe request ordering has been completely moved to filesystems. The block layer or TCQ/NCQ can safely reorder requests if necessary, since the filesystem will issue flush/FUA requests for critical transactions anyway – and wait for their completion before proceeding.

These changes eliminate the barrier-induced request queue drains that significantly affected write performance. Other I/O requests (e.g. ones not tied to a transaction) can be issued to a device while a transaction is still being processed.

However, as 2.6.32.x is the longterm kernel for several distros, barriers are here to stay (at least for a few years).
 

Filesystems

Barriers/flushes are supported on most modern filesystems: ext3, ext4, XFS, JFS, ReiserFS 3, etc. ext3/4 are unique in that they support three data journaling modes: data={ordered,journal,writeback}.

data=journal essentially writes data twice: first to the journal and then to the data blocks.

data=writeback is similar to journaling on XFS, JFS, or ReiserFS 3 before Linux 2.6.6. Only internal filesystem integrity is preserved: only metadata is journaled, and data may be written to the filesystem out of order. Metadata changes are first recorded in the journal and a commit block is written. After the journal has been updated, metadata and data write-outs may proceed. data=writeback can be a severe security risk: if the system crashes while appending to a file – after the metadata has been committed (and additional data blocks allocated), but before the new data has actually been written to those blocks – then after journal recovery the file may contain whatever those blocks held before, including data from previously deleted files – from any user.

Note: ReiserFS 3 supports data=ordered since 2.6.6 and it’s the default mode. XFS does support ordering in specific cases, but it’s neither always guaranteed nor enforced via the journaling mechanism. There is some confusion about that, e.g. this Wikipedia article on ext3 and this paper [PDF] seem to contradict what a developer from SGI stated (the paper seems flawed anyway, as an assumption is made that XFS is running in ordered mode, based on the result of one test).

data=ordered only journals metadata, like writeback mode, but groups metadata and data changes together into transactions. Transaction blocks are written together, data first, metadata last.

With barriers enabled, the order looks more or less like this:

  1. the transaction is written
  2. a barrier request is issued
  3. the commit block is written
  4. another barrier is issued

There is a special case on ext4 where the first barrier (between the transaction and the commit block) is omitted: the journal_async_commit mount option. ext4 supports journal checksumming – if the commit block has been written but the checksum is incorrect, the transaction will be discarded at journal replay. With journal_async_commit enabled the commit block may be written without waiting for the transaction write-out. There’s a caveat: before this commit the barrier was missing at step 4 in async commit mode. The patch adds it, so that now there’s a single empty barrier (step 4) after the commit block instead of a full barrier (two flushes) around it.
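To make the journaling modes concrete, here is a sketch of /etc/fstab entries – the devices, mount points and the exact option mix are illustrative only:

/dev/vg0/data      /data      ext4   noatime,data=ordered,barrier=1     0  2
/dev/vg0/scratch   /scratch   ext3   noatime,data=writeback,barrier=1   0  2
/dev/vg0/mail      /mail      ext3   noatime,data=journal               0  2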

ext3 tends to flush more often than ext4. By default both ext3 and ext4 are mounted with data=ordered and commit=5. On ext3 this means not only the journal, but effectively all data is committed every 5 seconds. However, ext4 introduces a new feature: delayed allocation.

Note: delayed allocation is by no means a new concept. It’s been used for years e.g. in XFS; in fact ext4 behaves similarly to XFS in this regard.

New data blocks on disk are not immediately allocated, so they are not written out until the respective dirty pages in the page cache expire. The expiration is controlled by two tunables:

/proc/sys/vm/dirty_expire_centisecs
/proc/sys/vm/dirty_writeback_centisecs

The first variable determines the expiration age – 30 seconds by default as of 2.6.32. On expiration, dirty pages are queued for write-out. The second variable controls the wakeup frequency of the “flush” kernel threads, which process the queues.

You can check the current cache sizes:

grep ^Cached: /proc/meminfo # page cache size
grep ^Dirty: /proc/meminfo # total size of all dirty pages
grep ^Writeback: /proc/meminfo # total size of actively processed dirty pages

Note: The VFS cache (e.g. dentry and inode caches) can be further examined by viewing the /proc/slabinfo file (or with the slabtop util which gives a nice breakdown of the slab count, object count, size, etc).

Note: before 2.6.32 there was a well-known subsystem called pdflush: global kernel threads for all devices, spawned and terminated on demand (the rule of thumb was: if all pdflush threads have been busy for 1 second, spawn another thread; if one of the threads has been idle for 1 second, terminate it). It’s been replaced with per-BDI (per-backing-device-info) flushers – one flush thread per logical backing device.

On top of all that, there was the dreaded pre-2.6.30 “ext4 delayed allocation data loss” bug/feature. Workarounds were introduced in 2.6.30, namely the auto_da_alloc mount option, enabled by default.

You should also take into consideration the size of the OS page cache. These days machines have a lot of RAM (32+ or 64+ GB is not uncommon). The more RAM you have, the more dirty pages can be held in RAM before flushing to disk. By default, Linux 2.6.32 will start writing out dirty pages when they reach 10% of RAM. On a 32 GB machine this is 3.2 GB of uncommitted data in write-heavy environments, where you don’t hit the time based constraints mentioned above – quite a lot to lose in the event of a system crash or power failure.

This is why it’s so important to ensure data integrity in your software by flushing critical data to disks – e.g. by fsync()ing (though at the application level you may only hope the filesystem, the OS and the devices will all do the right thing). This is why database systems have been doing it for decades. Also, this is one of the reasons why some database vendors recommend placing transaction commit logs on a separate filesystem. The synchronous load profile of the commit log would otherwise interfere with the asynchronous flushing of the tablespaces: if the logs were kept on a single filesystem along with the tablespaces, every fsync would flush all dirty pages for that filesystem, killing I/O performance.

Note: fsync() is a double-edged sword in this case. fsyncing too often will reduce performance (and spin up devices). That’s why only critical data should be fsynced.
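For a shell-level illustration of the same idea (file names are examples), GNU dd can force the data out of the page cache – and, with barriers working, out of the device caches – before it exits:

dd if=dump.sql of=/data/dump.sql bs=1M conv=fsync    # a single fsync() after the whole copy
dd if=dump.sql of=/data/dump.sql bs=1M oflag=dsync   # synchronous write of every block - far slower
sync                                                 # the sledgehammer: flush everything dirty, system-wide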

Dirty page flushing can be tuned – traditionally with these two tunables:

/proc/sys/vm/dirty_background_ratio
/proc/sys/vm/dirty_ratio

Both values are expressed as a percentage of RAM. When the amount of dirty pages reaches the first threshold (dirty_background_ratio), write-outs begin in the background via the “flush” kernel threads. When the second threshold is reached, processes will block, flushing in the foreground.

The problem with these variables is their minimum value: even 1% can be too much. This is why another two controls were introduced in 2.6.29:

/proc/sys/vm/dirty_background_bytes
/proc/sys/vm/dirty_bytes

They’re equivalent to their percentage based counterparts. Both pairs of tunables are exclusive: if either is set, its respective counterpart is reset to 0 and ignored. These variables should be tuned in relation to the BBWC memory size (or disk write cache size on MD RAID). Lower values generate more I/O requests (and more interrupts), significantly decrease sequential I/O bandwidth but also decrease random I/O latency. The idea is to find a sweet spot where BBWC would be used most effectively: the ideal I/O rate should not allow BBWC to overfill or significantly under-fill. Obviously, this is hit/miss and only theoretically achievable under perfect conditions. As usual, you should tune and benchmark for your specific workload.
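A hedged example of sizing these against the controller cache – the 512 MB BBWC and the exact thresholds are assumptions, not recommendations:

sysctl -w vm.dirty_background_bytes=268435456   # 256 MB: start background write-out early
sysctl -w vm.dirty_bytes=536870912              # 512 MB: block writers before the BBWC could be swamped
grep . /proc/sys/vm/dirty_background_bytes /proc/sys/vm/dirty_bytes   # verify

Put the final values in /etc/sysctl.conf to make them persistent.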

When benchmarking, remember ext3 has barriers disabled by default. A direct comparison of ext3 to ext4 with default mount options is usually quite pointless. ext4 offers an increased level of data protection at the cost of speed. Likewise, directly comparing ext3 in ordered mode to a filesystem offering only metadata journaling may not yield conclusive results. Some people got their benchmarks wrong.

Note: I did that kind of benchmark a while ago: the goal was to measure system file operations (deliberately on default settings), not sequential throughput or IOPS – and ext4 was faster anyway.

All in all, it’s your data! Test everything yourself with your specific workloads, hardware and configuration. Here’s a simple barrier test workload to get you going.


The systemd fallacy

(…) So, get yourself a copy of The Linux Programming Interface, ignore everything it says about POSIX compatibility and hack away your amazing Linux software. It’s quite relieving!

Lennart Poettering @ fosdem.org

systemd is wrong on so many levels I hardly know where to start. Perhaps its single most important design fault is that it was conceived with a blatant disregard for servers. The author’s manifesto and the “systemd for admins” series provide good insight into his motives for designing systemd.

He goes on and on about how you can save 3 seconds here and 5 seconds there by parallel and delayed service startup – systemd actually has a feature to measure system boot time. The question is: who cares? Desktop users, yes. Embedded users, maybe. Server users? Nope. It doesn’t matter if a server comes up in 96.3 seconds instead of 33.1. What counts is if it stays up and is not too cumbersome to maintain.

So how are systemd’s goals achieved? Basically, by throwing well-proven Unix paradigms out the window and openly admitting it. Yes, Unix was designed 42 years ago. And no, it’s not broken. I’m not a die-hard traditionalist, nor am I even that reluctant to adopt new solutions, but Unix remains the single most successful server OS design in the world for a reason – still used today in various forms after those 42 years. It’s simple, elegant and it works. The mainstay of its design is simplicity and modularity. One program for one task; easy interconnection between programs. Yet we are expected to ditch all that for something new and shiny.

One of the design goals of systemd is to get rid of shell scripts in the boot process and… rewrite everything in C, as the author doesn’t seem to be very fond of grep being called 77 times and awk 92 times during his system boot. Now, why do we have shell scripts in the boot process? They’re simple. They’re easy to read. Every single competent un*x admin knows at least the basics of shell scripting. There is almost complete control over the entire boot process and anything can be changed in a few seconds. Of course, one can argue it’s almost as easy to change systemd’s C code, recompile and reinstall. I’ll let you in on a little secret: when do you usually need to change something in the boot process? When something doesn’t work right. No matter if you’re comfortable at your desk with your triple 30″ screens or in the data center trenches after an all-nighter gone horribly wrong – you need to fix the problem pronto. The last thing you want to worry about is instrumenting, debugging and rebuilding C code at the core of your OS.

The second design goal seems to be incredible and unwarranted intentional complexity. The single most important process in userland is supposed to be clean, small and efficient. Let’s take a look at what systemd is supposed to supervise:

  • restarting processes after they crash. sysvinit doesn’t do that and we don’t have restartd or a thousand other programs for it. Oh, wait…

  • collecting information on daemon crashes. Nowadays most daemons have their own crash report formats, logging to syslog, stderr, directly to text log files, to binary dumps, etc. Good luck making the authors conform to a single standard. And good luck with all the corner cases.

  • keeping control (via cgroups) over processes detached from their parents. But for that we already have, well… cgroups?

  • delayed/on-demand service startup. “on most machines where sshd might be listening somebody connects to it every other month or so.” says the author. On a workstation – maybe. How much RAM are you going to save by delaying the startup of a few daemons? If they’re unused, they’ll be swapped out anyway. To support on-demand startup of network services, yet another functionality already available elsewhere had to be implemented within systemd: inetd.

  • dependency-based service management. To the author, dependency-based management is redundant. The problem is, every boot process is dependency based. Think services, not processes. Your services depend on their filesystems having been mounted, the filesystems depend on the underlying devices having been initialized, and so forth. We’ve already had rudimentary dependency-based service management in System V (see the rc directory sketch after this list)! S31fancyd and S31foobar depended on S30whatsit for setup. At teardown, only with K10foobar and K10fancyd down could the system proceed with K20whatsit. Servers are unlike desktops in that server boot time counts from the moment you press the big red button to the moment that server actually starts providing all its services. Or in other words: if you’re waiting, who cares if it’s in parallel or in series? It doesn’t matter if e.g. ftpd is allowed to start before /home/ftp is mounted and files can be served. Besides, an administrator may choose to stop S30whatsit without stopping S31fancyd – and he or she probably knows what they’re doing. It’s much harder to force service actions with systemd: you end up constantly fighting its decisions.

  • systemd creates autofs mount points and starts daemons before their filesystems are available (obviously, fs operations will block until then). Sounds horrible, right? This is going to be an administration nightmare. There is no way to do autofs right. If anything goes wrong with the underlying I/O or autofs itself, you’re left with an unusable system. Even on Solaris, which arguably has the most reliable automounter implementation available. Incorporating autofs into PID 1 and the boot process (and hanging services off it) guarantees problems.

  • listening to hardware changes introduces potential stability and security issues – and there already are [more or less] working facilities acting on hardware events.

  • communication via D-Bus. D-Bus is _very_ desktop-oriented. It’s not called Desktop Bus for nothing. It’s designed for portability – not speed, reliability or simplicity. There are dozens of simple, robust message passing and IPC protocols, but this is by far one of the most complicated, perhaps second only to CORBA/IIOP. Instead of letting this abomination die, or at least stay confined to the desktop, it’s actually going to be incorporated into the boot process. Daemon developers are encouraged to use it. Let’s put it in the kernel while we’re at it.
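For reference, the rudimentary dependency management mentioned in the service-management point above looks like this on a System V style box – the script names reuse the hypothetical examples from that point:

ls /etc/rc3.d/    # start scripts, run in lexical order at boot:
                  #   S30whatsit  S31fancyd  S31foobar  ...
ls /etc/rc0.d/    # kill scripts, run in lexical order at shutdown:
                  #   K10fancyd  K10foobar  K20whatsit  ...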

systemd is overcomplicated and bloated with unnecessary features, almost as if someone was trying to implement a second kernel in userland. It looks like it was designed by someone who has never seen anything other than their own workstation. It’s a nice exercise in self-managing systems, and with its kitchen sink approach it’s certainly worth a look by desktop/embedded vendors as an alternative to sysvinit or in-house inits, but that’s it.

The Linux userland boot process should be reviewed and cleaned up in most major distributions, perhaps even standardized (or perhaps not – we currently have about four major userland boot systems, all in major distros, and some diversity here is actually welcome).

However, systemd is not the way to go. It would set us back a decade. Let’s hope it doesn’t catch on – just like upstart or the first implementation of devfs back in 2000. It’s hardly surprising that so many people are drinking the kool-aid – systemd makes a lot of lofty promises. With Red Hat’s financial backing and all the propaganda (I can’t even call it PR), it’s going to be an arduous fight, but remember: after all, upstart made it into Ubuntu; devfs even made it into the kernel. Not all hope is lost.


Cloud computing

This is hilarious. Todd Hoff on the Amazon outage:

“Be a really big customer so Amazon* will help you specifically with your problems. This seemed to help Heroku a lot. I noticed in the Amazon developer forums a lot of people forgot to do this and didn’t get the personal help they needed.”

* it’s true for all vendors.

Spot on, mate! I’m not joking, this is rule number one: paid support isn’t worth squat unless you’re a big customer – and by big I mean big enough to sue and win. Or to make the jolt of pain go right to the top if you vote with your money and switch vendors. Even then try to avoid vendor lock-in at all costs. And stop outsourcing your entire infrastructure while you’re at it.


Content authorization with Varnish

I’ve been asked about this so many times that I thought I should just post it here. It’s actually very simple to do using restarts.

The problem: you need to check if a user is authorized for an object (which may or may not already be cached by Varnish) by means of an external application.

The solution: the following VCL will pass GET requests from the users to the authorization app. You can modify the URLs, e.g. insert a custom query string if required by the app.

The request is then either denied (if the auth app returns anything other than a 200) or restarted and served from the real backend or from cache.

This is only an example; you can extend it to cache authorization responses, add a control header if you use restarts anywhere else in your VCL, etc.

sub vcl_recv {
        if (req.url ~ "^/authorized_content") {
          if (req.restarts == 0) {
            # first pass: send the request to the authorization app
            set req.backend = authorization_backend;
            return(pass);
          } else {
            # after the restart: serve from the real backend (or the cache)
            set req.backend = real_backend;
            set req.url = regsub(req.url, "_authorize_me", "");
          }
        }
}

sub vcl_fetch {
        if (req.url ~ "^/authorized_content" && req.restarts == 0) {
          if (beresp.status == 200) {
            # the auth app approved: restart and fetch the real object
            restart;
          } else {
            # anything other than 200 from the auth app: deny
            error 403 "Not authorized";
          }
        }
}
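A quick way to exercise it once the VCL is loaded – the hostname and path are examples:

# should print 200: the auth app said yes, the request was restarted and served from the real backend or cache
curl -s -o /dev/null -w "%{http_code}\n" http://cache.example.com/authorized_content/report.pdf
# an unauthorized request should come back as 403 "Not authorized"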

Linux filesystems – small file performance on HDDs

A handy chart for quick reference.

The benchmark is quite workload-specific; I measured sequential operations on large sets of small files (rather than random reads/writes on a single large file) – an approach similar to filebench or bonnie++.

Methodology:

  1. hardware: 4-core x86-64, 10 15kRPM SAS HDDs in RAID 10, LSI MegaRAID, write cache enabled.

  2. software: Linux 2.6.32.x, irqbalance, mkfs.* etc.

  3. popular Linux filesystems: ext{2,3,4}, XFS, ReiserFS 3.

  4. JFS was intentionally left out: it performs so poorly with small files (3-4x worse than XFS), including it in the benchmark made no sense at all.

  5. All filesystems were created with default values and mounted with the noatime flag only.

  6. Data set: each process operated on its individual set of 350,000 files with an average size of 50 kB, 200 files per directory (a generation sketch follows after this list).

  7. VFS cache was dropped (or warmed up) where needed.

  8. Operations labeled on the chart are not discrete (syscalls). Rather, compound operations were performed on the data sets, i.e.

    • unlink and stat are: opendir, readdir, unlink or stat in sequence, closedir
    • write is: mkdir (if necessary), open, write, close, chown, chmod
    • rewrite is: mkdir (if necessary), open, unlink and the rest as with write
    • read is: opendir, readdir, open/read/close in sequence.

  9. All operations were in sequential (i.e. readdir) order (rather than random).

  10. Operation times are in seconds.

  11. All values were averaged from 10 samples.

  12. Op times are capped at 450 seconds on the chart; for real times see the raw data (CSV format).

  13. ext2 is included in the benchmark, as some systems still use it when journaling is not required.

  14. This benchmark is valid only for HDDs. On SSDs, the differences are significantly less pronounced due to no (low) seek time penalty.

  15. YMMV.
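A sketch of how one process’s data set (item 6) can be generated and the caches dropped between runs (item 7) – paths are examples, and the files here are a fixed 50 kB rather than averaging 50 kB:

# 350,000 files / 200 per directory = 1,750 directories
for d in $(seq 1 1750); do
  mkdir -p dataset/dir$d
  for f in $(seq 1 200); do
    dd if=/dev/urandom of=dataset/dir$d/file$f bs=1k count=50 2>/dev/null
  done
done
sync && echo 3 > /proc/sys/vm/drop_caches   # drop the page/VFS caches before a cold-cache run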

Conclusions: nothing new here, obviously. XFS simply sucks for small files, ext3 is slow, ext2 is faster than ext3, ReiserFS 3 performs nicely with everything but concurrent writes (it uses a global lock for writes: the Big Kernel Lock up to 2.6.33 and now a mutex-based solution). ext4 is the new hotness.

The chart:
[chart: small file performance benchmark]


Pure VCL cookie-based sticky sessions in Varnish 2.1

Some time ago I decided to drop Apache httpd from one of my setups. The httpd was no longer used for anything but mod_jk, which only did load balancing (with sticky sessions) between multiple clusters of Tomcat application servers. mod_jk is a particularly nasty kludge and there was no reason for keeping that layer of load balancers – Varnish does LB well enough.

The problem is Varnish 2.1.x does not support cookie-based stickiness. While it’s perfectly possible to implement it in C as a director, I wanted to try a pure VCL solution. One caveat is that req.backend can’t be set from variables in VCL – you have to explicitly enumerate all backends.

The algorithm is similar to mod_jk behavior:

  1. extract app server hostname from the JSESSIONID cookie as set by Tomcat: JSESSIONID=session_id.server_hostname

  2. If a hostname is present and matches one of the app servers, use it as the backend, otherwise use the main director (with all app servers).

  3. If a single backend is set and it’s not healthy, use a cluster-specific director (with all app servers in the same session replication cluster as that backend). The request will go to another app server in the same cluster.

  4. If the cluster-specific director is not healthy (all backends sick), fall back to the main director.

An example VCL snippet:

sub get_lb_worker_from_cookie {
        if (req.http.Cookie ~ "JSESSIONID=") {
          set req.http.X-JS-LBC = regsub(req.http.Cookie,
                                        "^.*?JSESSIONID=([^;]*);*.*$", "\1");

          set req.http.X-JS-LBAS = regsub(req.http.X-JS-LBC,
                                         "^.+\.(.+)$", "\1");
        }

        log "JSESSIONID cookie: " req.http.X-JS-LBC;
        log "LB AS from JSESSIONID: " req.http.X-JS-LBAS;

        unset req.http.X-JS-LBC;
}

sub set_lb_worker_from_cookie_site1 {
        if (req.http.X-JS-LBAS ~ "^as1$") {
          set req.backend = site1_as1;
        } else if (req.http.X-JS-LBAS ~ "^as2$") {
          set req.backend = site1_as2;
        } else if (req.http.X-JS-LBAS ~ "^as3$") {
          set req.backend = site1_as3;
        } else if (req.http.X-JS-LBAS ~ "^as4$") {
          set req.backend = site1_as4;
        } else {
          set req.backend = site1_main_director;
        }

        if (!req.backend.healthy
            && req.backend != site1_main_director) {
          log "Backend " req.backend " sick, forcing to cluster director";

          if (req.http.X-JS-LBAS ~ "^as[12]$") {
            set req.backend = site1_cluster1_director;
          } else if (req.http.X-JS-LBAS ~ "^as[34]$") {
            set req.backend = site1_cluster2_director;
          }
        }

        if (!req.backend.healthy
            && req.backend != site1_main_director) {
          log "Cluster director " req.backend " sick, forcing to main";
          set req.backend = site1_main_director;
        }

        unset req.http.X-JS-LBAS;

        log "Backend set to: " req.backend;
}

sub vcl_recv {
        if (req.url ~ "^/site1/sessions_required_here/") {
          call get_lb_worker_from_cookie;
          call set_lb_worker_from_cookie_site1;
        }
}
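A quick smoke test once the backends and directors are defined – the hostname, path and session id are examples:

# a request carrying a Tomcat session cookie ending in ".as2" should stick to backend site1_as2
curl -s -o /dev/null -w "%{http_code}\n" \
     -H "Cookie: JSESSIONID=0123456789ABCDEF.as2" \
     http://cache.example.com/site1/sessions_required_here/
varnishlog | grep -E 'VCL_Log|Backend'   # watch the log lines emitted by the VCL above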