Yes, you heard me right. Run-of-the-mill hardware support contracts are getting less and less useful.
We’ve seen severe cost cuts throughout tech support departments at several major server hardware vendors as the late-2000s “financial crisis” unfolded. Entire call centers and first-line support teams have been merged and relocated. New procedures have been implemented with a single purpose in mind: to delay (or avoid) shipping spare parts to the customer and booking local support techs. Well, at least it looks that way.
Let’s take a look at the pre-2010 support procedure of a major server vendor, at the highest standard level of support offered:
- Receive a service request, the clock starts ticking.
- Ask additional questions if necessary.
- Acknowledge the problem.
- Book spare parts and/or a support technician, done.
Now it looks more like this:
1. Receive a service request; the clock starts ticking.
2. Stall for time: make the customer update each and every piece of firmware in the machine, even the most obscure bits, entirely unrelated to the problem. If the machine is completely unavailable, go to step 3.
3. Stall some more: make the customer book a pointless visit to the data center, or engage the DC staff in equally pointless circus tricks: physically cycle the power, or remove and re-insert the machine into its slot if it’s a modular affair. Don’t forget to open the device and move some stuff around from slot to slot – DIMMs, line cards or other modules. Hello? If there’s been a major hardware failure, components must be replaced, not moved around. This is not poker and you’re not shuffling cards! The clock’s still ticking.
4. Under no circumstances accept evidence of component failure from industry-acclaimed hardware testing utilities. Make the customer download a 3.9 GB DVD image with an obscure internal test suite – one that fails to detect even common and straightforward failures.
All this requires several days’ worth of “work” for everyone involved: meanwhile, your service request gets passed from person to person and from team to team – a sort of ticket tag. Every new techie assigned to your case knows next to nothing about it, despite the full description, data, logs, etc. available in their ticketing system. Which is just fine – unless you run a large infrastructure, have more than three servers and two switches to worry about, and deal with hardware failures on a daily basis.
Finally, when all other options have been exhausted, book CRU (customer-replaceable unit) parts or a tech. Unless…
…it’s firmware. In fact there’s so much of it: every major hardware component has a microcontroller and runs some kind of firmware. Even the damn fans. That wouldn’t be bad by itself, if not for one little problem they don’t tell you about: hardware support contracts only really cover – you guessed it – hardware. Firmware is rarely written by your vendor. Rather, it’s a mix of obscure pieces of code written by disgruntled developers hired by outside contractors, old proprietary PoC OEM code (remote video consoles, anyone?) and some LGPLed libraries, typically mangled beyond recognition. That’s why it’s nearly impossible for the vendor to fix any serious firmware bug within the few hours specified in your contract.

Remember the DRAC/MC? The firmware was clearly FUBAR, and yet all Dell techs could do was offer you a replacement DRAC/MC module – which unsurprisingly fixed nothing, since the problems were not hardware-related. Remember the EDAC issues on HP BL460c G1 blades? HP chose to hide its head in the sand on that one. Apparently, when faced with grave firmware quality issues, it’s sometimes cheaper for the vendor to wait for a new product line than to fix the old one.

Oh, and there’s something else: firmware upgrades are ultimately your responsibility, even if requested by the vendor’s tech support. Let’s say you have internal procedures for firmware upgrades – a stable and proven update schedule. Then you hit a critical bug in production and the vendor tells you to update the firmware – just because. You have two choices: refuse to update immediately and get your ticket dropped for non-compliance with the vendor’s recommendations, or update with shortened (or no) internal QA, risking further outages if the new version is a dud (and that happens all too often).
So, what to do?
Apart from some specialized infrastructures and quite a few corner cases, there are basically four types of IT shops:
- Large enterprise / government. Here, you’re bound by internal policy or market regulations. Usually that means little choice: no free (as in beer) software; everything certified, configured and deployed to vendor-approved specifications. Delays are often acceptable, and responsibility for hardware may be ceded to the vendor. On the plus side, there’s usually quite a lot of hardware, an internal stock of spare parts or spare devices – and the biggest players get dedicated support teams at their vendors, exclusive access to developers and custom software builds to boot. Vendor/contractor techs take care of at least some parts of system implementation and maintenance.
- Educational / HPC. Similar to large enterprise, but without most of the market regulations.
- Conventional SMEs. IT is often fully outsourced. If not, there are one or a few machines for internal use (mail, website, file sharing, etc.), usually without spares or redundancy.
- Startup / high-scalability web. Lots of machines, sometimes as many as at a large enterprise. Cheap x86/x64 hardware is used in large quantities and considered unreliable (reliability is achieved through software, by design). Free software and custom, in-house solutions are accepted and encouraged.
There is no one-size-fits-all solution. But here’s what you can do in most situations.
Implement an in-house hardware certification program. Get evaluation hardware from your vendors and test the living daylights out of it – and I don’t just mean a single pass of memtest86! Check firmware quality, test possible failure modes, verify that the hot-plug hardware really is hot-plug. Run both synthetic hardware stressors and your real load profile on the machine. Try out the features you’re unlikely to ever use; in a few months it may turn out you need them after all. Check how your OS of choice behaves on the evaluation machine, if it’s a server. Search the web for any signs of hardware-related issues people may be having. Don’t forget about driver quality. Once you’ve certified the hardware for production use, order only that particular model until it’s phased out by the vendor. Repeat the certification process for every new model.
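To make the “test the living daylights out of it” part a bit more concrete, here’s a minimal burn-in sketch along those lines. It assumes a Linux box with stress-ng and memtester installed and enough privileges to read the kernel log; the workload sizes, durations and tool choices are illustrative placeholders, not a complete certification plan – your real load profile and failure-mode tests still have to come from your own environment.

```python
#!/usr/bin/env python3
"""Minimal burn-in sketch for an in-house hardware certification pass.

Assumes a Linux host with stress-ng and memtester installed, and enough
privileges to read the kernel log. Durations and sizes are placeholders.
"""
import subprocess
import sys

# Illustrative burn-in steps; a real certification run would add disk,
# network and hot-plug tests plus the production load profile.
BURN_IN_STEPS = [
    # Load every CPU and ~75% of RAM for an hour.
    ["stress-ng", "--cpu", "0", "--vm", "2", "--vm-bytes", "75%", "--timeout", "3600s"],
    # An independent memory pass: 1 GiB, 3 iterations.
    ["memtester", "1024M", "3"],
]


def kernel_errors():
    """Return kernel log lines at 'err' severity or worse (MCE, EDAC, I/O errors)."""
    out = subprocess.run(["dmesg", "--level=err,crit,alert,emerg"],
                         capture_output=True, text=True, check=False)
    return [line for line in out.stdout.splitlines() if line.strip()]


def main():
    baseline = set(kernel_errors())  # errors already present before the run

    for cmd in BURN_IN_STEPS:
        print("running:", " ".join(cmd))
        if subprocess.run(cmd, check=False).returncode != 0:
            sys.exit("FAIL: {} exited non-zero".format(cmd[0]))

    new_errors = [line for line in kernel_errors() if line not in baseline]
    if new_errors:
        print("FAIL: new kernel errors logged during burn-in:")
        print("\n".join(new_errors))
        sys.exit(1)
    print("PASS: burn-in finished without new kernel errors")


if __name__ == "__main__":
    main()
```

Run something like this for days rather than minutes and keep the logs – a machine that quietly accumulates MCE or EDAC noise under load is exactly the kind of evidence the vendor’s 3.9 GB test DVD tends to miss.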
The main rule is: small bugs are likely to be patched in future firmware updates, but severe issues will never get fixed. If the firmware is a complete mess now, don’t count on it being sorted out in an update. In other words, what you buy is what you get.
Consider NOT using blade servers. Um, but everyone likes blades, right? Blades are good. Blades minimize costs, reduce cabling, increase density and shoot double rainbows out of the cooling ports. But think of it this way: a standard blade enclosure is 10U with 16 slots, while 16 standalone 1U servers eat 16U – 60% more rack space – and that’s without counting the modular blade network switches or other I/O modules. If you run HPC clusters, you can even get double modules packing two servers in a single slot.
However, if anything happens to the enclosure, you may lose all 16 (or 32) servers at once. Power, cooling and management are usually redundant, but firmware, backplanes and the communication buses between the modules aren’t. It gets even better – and this is something people tend to forget these days, now that it’s apparently passé to stay at one company for more than two years and very fashionable to work for an emerging startup: when the hardware gets EOLed, you lose support not just for the servers, but for the enclosure as well. If a component fails then, well, good luck on eBay – or you’ll have to immediately decommission 16 machines before the whole thing goes down.

If you have 16 standalone servers instead, a critical failure in one chassis costs you a single machine. You can phase out and sell the devices one by one, or even decide to keep some of them running well past their EOL and designate the rest as spares. While you can obviously do the same with blades, each enclosure comes at a big premium – which is fine if you’re a large enterprise or a very large startup, but if not, it may hurt a bit financially – and 16 servers may just as well constitute a significant portion of your hardware base.
Introduce a coordinated program of periodic hardware refreshes to mitigate issues stemming from hardware obsolescence. In every healthy economy (yeah, I know…), hardware is cheaper than people and their time. Five-year-old commodity servers or disk arrays are antique by any standard. Even if they’re still working, you can definitely do more on fewer machines – with less power consumption and higher density.
And finally, since you can’t really be safe from firmware failures, diversify your hardware base: consider getting hardware from more than one vendor. That way, you’re more or less protected from firmware bugs that would otherwise drop your infrastructure like a row of dominoes. Unless, of course, the vendors share the same OEM code (LOMs, anyone? tape robot controllers?). There’s one catch, though, that you need to include in your risk calculation: it may be harder to get support in a heterogeneous system, as you hand the vendors a great excuse. Manufacturer X will always cling like a limpet to the fact that you have equipment from manufacturer Y in your network, SAN, rack or whatever. Welcome to the blame game. When you file a ticket, several major server and storage vendors will even go so far as to ask for the full list of equipment in the rack where the device in question is installed. As if the EM interference from other gear in the rack somehow affected their hardware – well then, congratulations on the design!
Ultimately, there’s one thing to remember: hardware support contracts suck. When you shell out for support, what you actually pay for is the privilege of getting your phone call answered, not getting your machine back up and running. Design your infrastructure’s resilience around this fact.