Barriers, Caches, Filesystems

With the recent proliferation of ext4 as the new “default” Linux filesystem, there’s been much talk of write barrier support. The flurry of post-2.6.18 barrier-related development in most storage subsystems has left some novice users and administrators perplexed. I hope I can clear it up a bit with this primer/refresher.

If you’re familiar with the basics of I/O caching, just skip to the “Barriers” section.

Barriers have long been implemented in the kernel, e.g. in ext3, XFS and ReiserFS 3. However, they are disabled by default in ext3. Up until recently, there was no barrier support for anything other than simple devices.
 

Two words: data safety

Let’s take a look at the basic path data takes through the storage layers during a write-out in a modern storage setup:

[Figure: write path through the storage layers]

Some of these layers/components have their own caches:

[Figure: caches along the write path]

There may be other caches in the path, but this is the usual setup. The page cache is bypassed if data is written in O_DIRECT mode.

When a userland process writes data to a filesystem, it’s paramount (unless explicitly requested otherwise) that the data makes it safely to physical, non-volatile media. This is a part of the “D” in ACID. Generally, data may be lost if it’s in volatile storage during a hardware failure (e.g. power loss) or a software crash.
 

Caches

The OS page cache (a subsystem of the VFS cache) and the buffer cache live in the host’s RAM, which is obviously volatile. The risk here is that the page cache is relatively large compared to the other caches, and its contents can’t survive an OS crash.

The storage controller write cache is present in most mid- and high-end controllers and/or HBAs working in modes other than initiator-target: RAID HBAs, DAS RAID boxen, SAN controllers, etc. Every vendor seems to have their own name for it:

  • BBWC (Battery-Backed Write Cache)
  • BBU (Battery-Back-Up [Write Cache])
  • Array Accelerator (BBWC – in HPese)
  • FBWC (Flash-Backed Write Cache)

As the names suggest, BBWC is simply some memory and a rechargeable battery, usually in one or more proprietary FRU modules. In high-end storage systems the battery modules are hot-swappable; in mid-range systems a controller has to be shut down for battery replacement. RAID HBAs require host downtime for battery maintenance unless you have hot-swap slots and multiple HBAs serving multiple paths.

FBWC is the relatively new generation of volatile cache where the battery assembly is replaced with NAND flash storage – not unlike today’s SSDs – and a replaceable capacitor bank that holds enough charge to allow data write-out from DRAM to flash in case of power failure.

Both types of cache have their drawbacks: BBWC needs constant battery monitoring and re-learning. Re-learning is a recurring process: the controller fully cycles (discharges and recharges) the battery to learn its absolute capacity – which obviously deteriorates with time and usage (cycles). While re-learning, write cache must be disabled (since at some point in the process the battery will be almost completely discharged and unable to power the BBWC memory). This is a periodic severe performance penalty for write-heavy workloads, unless there’s a redundant battery and/or controller to take over. Good controllers allow the administrator to customize re-learn schedules. The batteries must be replaced every few months or years.

Flash-based write cache is also subject to deterioration: the dreaded maximum write count for flash memory cells (however, flash is used only on power failure). The backup capacitors degrade over time. The NAND modules and the capacitor bank must be monitored and replaced if necessary.

Write cache on physical media (disk drives) is almost always volatile. Most enterprise SSDs and some consumer SSDs (e.g. the Intel 320 series, but not the extremely popular X25-M series) have backup capacitors.

Modern disks have 16-64 megabytes of cache. The problem with this type of cache is that not all drives will flush it reliably when requested. SCSI and SAS drives do the right thing: the “SYNCHRONIZE CACHE” command (opcode 0x35) is a part of the SCSI standard. PATA drives have usually outright lied about flushing to cheat on benchmarks. SATA does have the “FLUSH CACHE EXT” command, but whether the drive actually acts on it depends on the vendor. Get SCSI/SAS drives for mission-critical data. Nothing new here.

One more caveat with disk write cache is that the controller software, to ensure data durability, MUST guarantee that all data flushed out of the controller write cache is committed to non-volatile media. In other words, when the OS requests a flush and the controller returns success, the data MUST have already been committed to non-volatile media. This is why disk write cache MUST be disabled if BBWC or another form of controller cache is enabled: the controller cache must be flushed directly to non-volatile media and not to another layer of volatile cache.

Software RAID with JBOD is a special case: there is no controller cache, only the drive cache, the OS page cache and buffer cache.
 

Barriers

Think of write barriers on Linux as a unified approach to flushing and forced I/O ordering.

Consider the following setup:

[Figure: example I/O stack with multiple layers and caches]

This is a bit on the extreme side, but ponder for a moment how many layers of I/O (and caches) the data has to pass through to be stored on the physical disks.

If the filesystem is barrier-aware and all I/O layers support barriers/flushes, an fs transaction followed by a barrier is committed (flushed) to persistent storage (disks). All requests issued prior to the barrier must be satisfied before continuing. Also, an fsync() or a similar call will flush the write caches of the underlying storage (fsync() without barriers does NOT guarantee this!). Barrier bios (block I/Os) actually do two flushes: one before the bio and one afterwards. It’s possible to issue an empty barrier bio to flush only once.

Barriers ensure critical transactions are committed to persistent media and committed in the right order, but they incur a – sometimes severe – performance penalty.

Let’s get back to our two hardware setups: software RAID on JBOD and hardware RAID with BBWC.

Since barriers force write-outs to persistent storage, disk write cache can be safely enabled for MD RAID if the following conditions are met:

  • the filesystem supports barriers and they are enabled
  • the underlying I/O layers support barriers/flushes (see below)
  • the disks reliably support cache flushes.

However, on hardware RAID with BBWC, the cache itself is (quasi-)persistent. Since RAID controllers do implement the SYNCHRONIZE CACHE command, each barrier would flush the entire write cache, negating the performance advantage of BBWC. It’s recommended to disable barriers if – and only if – you have healthy BBWC. If you disable barriers, you must monitor and properly maintain your BBWC.

Full support for barriers on various virtual devices has been added only recently. This is a rough matrix of barrier support in vanilla kernel versions:

Kernel version  Barrier support
2.6.9           I/O barrier support
2.6.9           ext3
2.6.9           reiserfs
2.6.12          SATA
2.6.16          XFS – barriers enabled by default
2.6.26          ext4 – barriers enabled by default
2.6.28          DM – simple devices (i.e. a single underlying device)
2.6.30          loop
2.6.30          DM – rewrite of the barrier code
2.6.31          DM – crypt
2.6.31          DM – linear (i.e. standard LVM concatenated volumes)
2.6.31          DM – mpath
2.6.32          virtio-blk (only really safe with O_DIRECT backing devices)
2.6.33          DM – dm-raid1
2.6.33          DM – request based devices
2.6.33          MD barrier support on all personalities *
2.6.37          barriers removed and replaced with FUA / explicit flushes

* Note: previously barriers were only supported on MD raid1. This patch can be easily applied to 2.6.32.
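
To check a whole I/O stack against the matrix above, the version numbers can be put into a small lookup. This is only a sketch: the feature labels are my own shorthand, the versions are copied from the matrix, and it assumes plain dotted version strings such as "2.6.32" (no distro suffixes). It also ignores the 2.6.37 switch to flush/FUA.

```python
# Minimum vanilla kernel version per feature, copied from the matrix above.
# The keys are illustrative shorthand, not kernel identifiers.
BARRIER_SUPPORT_SINCE = {
    "block layer": "2.6.9",
    "ext3": "2.6.9",
    "reiserfs": "2.6.9",
    "SATA": "2.6.12",
    "loop": "2.6.30",
    "dm-crypt": "2.6.31",
    "dm-linear": "2.6.31",
    "dm-mpath": "2.6.31",
    "virtio-blk": "2.6.32",
    "dm-raid1": "2.6.33",
    "md (all personalities)": "2.6.33",
}


def parse_version(version):
    """Turn '2.6.32' into (2, 6, 32) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def supports_barriers(feature, kernel):
    """True if `kernel` is at or past the release that added `feature`."""
    return parse_version(kernel) >= parse_version(BARRIER_SUPPORT_SINCE[feature])


def missing_layers(stack, kernel):
    """Return the layers of an I/O stack that lack barrier support."""
    return [layer for layer in stack if not supports_barriers(layer, kernel)]
```

For example, missing_layers(["ext3", "dm-linear", "md (all personalities)"], "2.6.32") reports the MD layer as the weak link, which matches the note about raid1-only support before 2.6.33.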

As of 2.6.37, block layer barriers have been removed from the kernel for performance reasons. They have been completely superseded by explicit flushes and FUA requests.

FUA is Force Unit Access: an I/O request flag which ensures the transferred data is written directly to (or read from) persistent media, regardless of any cache settings.

Explicit flushes are just that – write cache flushes explicitly requested by a filesystem. In fact, the responsibility for safe request ordering has been completely moved to filesystems. The block layer or TCQ/NCQ can safely reorder requests if necessary, since the filesystem will issue flush/FUA requests for critical transactions anyway – and wait for their completion before proceeding.

These changes eliminate the barrier-induced request queue drains that significantly affected write performance. Other I/O requests (e.g. without a transaction) can be issued to a device while a transaction is still being processed.

However, as 2.6.32.x is the longterm kernel for several distros, barriers are here to stay (at least for a few years).
 

Filesystems

Barriers/flushes are supported on most modern filesystems: ext3, ext4, XFS, JFS, ReiserFS 3, etc. ext3/4 are unique in that they support three data journaling modes: data={ordered,journal,writeback}.

data=journal essentially writes data twice: first to the journal and then to the data blocks.

data=writeback is similar to journaling on XFS, JFS, or ReiserFS 3 before Linux 2.6.6. Only internal filesystem integrity is preserved and only metadata is journaled; data may be written to the filesystem out of order. Metadata changes are first recorded in the journal and a commit block is written. After the journal has been updated, metadata and data write-outs may proceed. data=writeback can be a severe security risk: if the system crashes while appending to a file, after the metadata has been committed (and additional data blocks allocated), but before the data has been written (data blocks overwritten with new data), then after journal recovery that file may contain blocks filled with data from previously deleted files – from any user.

Note: ReiserFS 3 supports data=ordered since 2.6.6 and it’s the default mode. XFS does support ordering in specific cases, but it’s neither always guaranteed nor enforced via the journaling mechanism. There is some confusion about that, e.g. this Wikipedia article on ext3 and this paper [PDF] seem to contradict what a developer from SGI stated (the paper seems flawed anyway, as an assumption is made that XFS is running in ordered mode, based on the result of one test).

data=ordered only journals metadata, like writeback mode, but groups metadata and data changes together into transactions. Transaction blocks are written together, data first, metadata last.

With barriers enabled, the order looks more or less like this:

  1. the transaction is written
  2. a barrier request is issued
  3. the commit block is written
  4. another barrier is issued
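
Why the first barrier matters can be shown with a toy model of the four steps above. This is a deliberately simplified sketch, not how jbd works internally: the CachedDisk class and the block names are hypothetical, and the only assumption is that a drive may write out any subset of its volatile cache before power is lost.

```python
from itertools import combinations


class CachedDisk:
    """Toy disk: writes land in a volatile cache until a flush is issued."""

    def __init__(self):
        self.durable = set()   # blocks on non-volatile media
        self.cache = []        # acknowledged writes still in volatile cache

    def write(self, block):
        self.cache.append(block)

    def flush(self):
        """Barrier/flush: everything written so far becomes durable."""
        self.durable.update(self.cache)
        self.cache.clear()

    def crash_states(self):
        """All on-media states a power cut could leave behind.

        The drive may have written out any subset of its cache, in any
        order, so every subset is a possible outcome.
        """
        states = []
        for size in range(len(self.cache) + 1):
            for landed in combinations(self.cache, size):
                states.append(self.durable | set(landed))
        return states


def commit(disk, use_barriers):
    """Run steps 1-4 and return the crash states possible right after step 3."""
    disk.write("transaction")        # 1. the transaction is written
    if use_barriers:
        disk.flush()                 # 2. a barrier request is issued
    disk.write("commit block")       # 3. the commit block is written
    states = disk.crash_states()     # what a crash at this point could leave
    if use_barriers:
        disk.flush()                 # 4. another barrier is issued
    return states


def has_torn_commit(states):
    """A commit block on media without its transaction is a corrupt journal."""
    return any("commit block" in s and "transaction" not in s for s in states)
```

Without barriers, has_torn_commit(commit(CachedDisk(), use_barriers=False)) is true: the commit block can reach the platters before the transaction it vouches for. With barriers that state cannot occur.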

There is a special case on ext4 where the first barrier (between the transaction and the commit block) is omitted: the journal_async_commit mount option. ext4 supports journal checksumming – if the commit block has been written but the checksum is incorrect, the transaction will be discarded at journal replay. With journal_async_commit enabled the commit block may be written without waiting for the transaction write-out. There’s a caveat: before this commit the barrier was missing at step 4 in async commit mode. The patch adds it, so that now there’s a single empty barrier (step 4) after the commit block instead of a full barrier (two flushes) around it.
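
The checksum idea can be illustrated in a few lines. This is a toy model only: the real journal format and checksum algorithm are different, and zlib.crc32 plus the dictionary layout are stand-ins I picked for the sketch.

```python
import zlib


def make_commit_block(transaction_blocks):
    """Commit record carrying a checksum over the transaction's blocks."""
    return {"checksum": zlib.crc32(b"".join(transaction_blocks))}


def replay(transaction_blocks_on_disk, commit_block):
    """Return the blocks to apply at replay, or [] if the transaction is discarded."""
    if commit_block is None:
        return []                                  # never committed
    actual = zlib.crc32(b"".join(transaction_blocks_on_disk))
    if actual != commit_block["checksum"]:
        return []                                  # commit landed before its data
    return list(transaction_blocks_on_disk)
```

If the commit block made it to disk but part of the transaction did not, the checksum no longer matches and replay() drops the transaction, which is what makes it tolerable to skip the wait between the two writes.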

ext3 tends to flush more often than ext4. By default both ext3 and ext4 are mounted with data=ordered and commit=5. On ext3 this means not only the journal, but effectively all data is committed every 5 seconds. However, ext4 introduces a new feature: delayed allocation.

Note: delayed allocation is by no means a new concept. It’s been used for years e.g. in XFS; in fact ext4 behaves similarly to XFS in this regard.

New data blocks on disk are not immediately allocated, so they are not written out until the respective dirty pages in the page cache expire. The expiration is controlled by two tunables:

/proc/sys/vm/dirty_expire_centisecs
/proc/sys/vm/dirty_writeback_centisecs

The first variable determines the expiration age (30 seconds by default as of 2.6.32). On expiration, dirty pages are queued for write-out. The second variable controls the wakeup frequency of the “flush” kernel threads, which process the queues.
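
Both values are in hundredths of a second, which is easy to misread. A small sketch for converting them follows; the function names are mine, and the "expiry plus one wakeup interval" bound is a rough estimate of how long an untouched dirty page can sit before the flusher picks it up, not a kernel guarantee.

```python
import os


def read_centisecs(name, vm_dir="/proc/sys/vm"):
    """Read a *_centisecs tunable and return its value in seconds."""
    with open(os.path.join(vm_dir, name)) as f:
        return int(f.read().strip()) / 100


def worst_case_age(vm_dir="/proc/sys/vm"):
    """Rough upper bound (seconds) before an untouched dirty page is queued.

    A page must first expire, then wait for the next flusher wakeup.
    """
    expire = read_centisecs("dirty_expire_centisecs", vm_dir)
    wakeup = read_centisecs("dirty_writeback_centisecs", vm_dir)
    return expire + wakeup
```

The vm_dir parameter exists only so the functions can be pointed at a directory of test files; on a live system the default path applies.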

You can check the current cache sizes:

grep ^Cached: /proc/meminfo # page cache size
grep ^Dirty: /proc/meminfo # total size of all dirty pages
grep ^Writeback: /proc/meminfo # total size of actively processed dirty pages
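
If you’d rather collect these counters from a script (say, to log them during a benchmark), the same file can be parsed directly. A minimal sketch, with function names of my own choosing; it takes the file contents as text so it can be tried on a sample.

```python
def parse_meminfo(text):
    """Map /proc/meminfo field names to sizes in bytes (values are in kB)."""
    sizes = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            value = int(fields[0])
            # Lines without a unit (e.g. HugePages_Total) are plain counts.
            sizes[name.strip()] = value * 1024 if fields[1:] == ["kB"] else value
    return sizes


def cache_summary(text):
    """Pick out the three counters discussed above."""
    info = parse_meminfo(text)
    return {key: info[key] for key in ("Cached", "Dirty", "Writeback")}
```

On a live system: cache_summary(open("/proc/meminfo").read()).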

Note: The VFS cache (e.g. dentry and inode caches) can be further examined by viewing the /proc/slabinfo file (or with the slabtop util which gives a nice breakdown of the slab count, object count, size, etc).

Note: before 2.6.32 there was a well-known subsystem called pdflush: global kernel threads for all devices, spawned and terminated on demand (the rule of thumb: if all pdflush threads have been busy for 1 second, spawn another thread; if one of the threads has been idle for 1 second, terminate it). It’s been replaced with per-BDI (per-backing-device-info) flushers: one flush thread per logical device (one for each filesystem).

On top of all that, there was the dreaded pre-2.6.30 “ext4 delayed allocation data loss” bug/feature. Workarounds were introduced in 2.6.30, namely the auto_da_alloc mount option, enabled by default.

You should also take into consideration the size of the OS page cache. These days machines have a lot of RAM (32+ or 64+ GB is not uncommon). The more RAM you have, the more dirty pages can be held in RAM before flushing to disk. By default, Linux 2.6.32 will start writing out dirty pages when they reach 10% of RAM. On a 32 GB machine this is 3.2 GB of uncommitted data in write-heavy environments, where you don’t hit the time based constraints mentioned above – quite a lot to lose in the event of a system crash or power failure.
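
The arithmetic scales linearly with RAM, which is worth seeing for a few machine sizes. A quick sketch, assuming the 10% background threshold mentioned above and treating the percentage as a share of total RAM, as this article does:

```python
GIB = 1024 ** 3


def background_threshold(ram_bytes, background_ratio=10):
    """Dirty data allowed to accumulate before background write-out starts."""
    return ram_bytes * background_ratio // 100


for ram_gib in (8, 32, 64, 256):
    at_risk = background_threshold(ram_gib * GIB) / GIB
    print(f"{ram_gib:>4} GiB RAM -> up to {at_risk:.1f} GiB dirty before write-out")
```

For 32 GiB this gives the 3.2 GB figure from the paragraph above; at 256 GiB the same default leaves over 25 GiB exposed.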

This is why it’s so important to ensure data integrity in your software by flushing critical data to disks – e.g. by fsync()ing (though at the application level you may only hope the filesystem, the OS and the devices will all do the right thing). This is why database systems have been doing it for decades. Also, this is one of the reasons why some database vendors recommend placing transaction commit logs on a separate filesystem. The synchronous load profile of the commit log would otherwise interfere with the asynchronous flushing of the tablespaces: if the logs were kept on a single filesystem along with the tablespaces, every fsync would flush all dirty pages for that filesystem, killing I/O performance.

Note: fsync() is a double-edged sword in this case. fsyncing too often will reduce performance (and spin up devices). That’s why only critical data should be fsynced.
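
For reference, this is roughly what “fsyncing critical data” looks like from an application. A minimal sketch with a hypothetical helper name; it only asks the kernel to push the data out, and whether the bytes really reach non-volatile media still depends on the barrier and cache settings discussed earlier.

```python
import os


def durable_write(path, data):
    """Write `data` to `path` and ask the kernel to push it to stable storage."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:                       # os.write may write fewer bytes than asked
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)                      # returns once the kernel reports the flush done
    finally:
        os.close(fd)
```

Usage: durable_write("commit.log", b"txn 42 committed\n"). Calling this for every write would be exactly the “fsyncing too often” case from the note above, so reserve it for data you cannot afford to lose.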

Dirty page flushing can be tuned – traditionally with these two tunables:

/proc/sys/vm/dirty_background_ratio
/proc/sys/vm/dirty_ratio

Both values are expressed as a percentage of RAM. When the amount of dirty pages reaches the first threshold (dirty_background_ratio), write-outs begin in the background via the “flush” kernel threads. When the second threshold is reached, processes will block, flushing in the foreground.

The problem with these variables is their minimum value: even 1% can be too much. This is why another two controls were introduced in 2.6.29:

/proc/sys/vm/dirty_background_bytes
/proc/sys/vm/dirty_bytes

They’re equivalent to their percentage-based counterparts, but take an absolute size in bytes. Both pairs of tunables are mutually exclusive: if either is set, its respective counterpart is reset to 0 and ignored. These variables should be tuned in relation to the BBWC memory size (or disk write cache size on MD RAID). Lower values generate more I/O requests (and more interrupts) and significantly decrease sequential I/O bandwidth, but also decrease random I/O latency. The idea is to find a sweet spot where BBWC would be used most effectively: the ideal I/O rate should not allow BBWC to overfill or significantly under-fill. Obviously, this is hit-or-miss and only theoretically achievable under perfect conditions. As usual, you should tune and benchmark for your specific workload.
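
How the two pairs of tunables interact can be written down as a small decision function. This is a simplified model under the assumptions of this article (percentages taken of total RAM, a non-zero *_bytes value overriding its ratio); the function and state names are made up for illustration.

```python
def effective_limit(ram_bytes, ratio, limit_bytes):
    """Resolve one ratio/bytes pair: a non-zero *_bytes value wins."""
    if limit_bytes:
        return limit_bytes
    return ram_bytes * ratio // 100


def writeback_state(dirty_bytes, ram_bytes, background_ratio, dirty_ratio,
                    background_bytes=0, max_dirty_bytes=0):
    """Classify the current amount of dirty data against both thresholds."""
    background = effective_limit(ram_bytes, background_ratio, background_bytes)
    hard = effective_limit(ram_bytes, dirty_ratio, max_dirty_bytes)
    if dirty_bytes >= hard:
        return "foreground"   # writers block and flush themselves
    if dirty_bytes >= background:
        return "background"   # flusher threads write out asynchronously
    return "idle"
```

With 32 GiB of RAM and ratios of 10 and 20, 1 GiB of dirty data is "idle"; set background_bytes to 256 MiB (e.g. to match a controller cache) and the same 1 GiB already triggers background write-out.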

When benchmarking, remember ext3 has barriers disabled by default. A direct comparison of ext3 to ext4 with default mount options is usually quite pointless. ext4 offers an increased level of data protection at the cost of speed. Likewise, directly comparing ext3 in ordered mode to a filesystem offering only metadata journaling may not yield conclusive results. Some people got their benchmarks wrong.
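
Before comparing numbers, it helps to write down whether barriers are actually on for each mount. A sketch of that check, using the defaults described in this article and the common barrier/nobarrier option spellings; the exact options accepted vary by filesystem and kernel version, so treat the table as an assumption to verify against your kernel’s documentation.

```python
# Defaults as described in this article; other filesystems are left out.
DEFAULT_BARRIERS = {"ext3": False, "ext4": True, "xfs": True}


def barriers_enabled(fstype, options):
    """Decide whether barriers are on for a mount, given its option string."""
    enabled = DEFAULT_BARRIERS[fstype]
    for opt in options.split(","):
        if opt in ("barrier", "barrier=1"):
            enabled = True
        elif opt in ("nobarrier", "barrier=0"):
            enabled = False
    return enabled
```

barriers_enabled("ext3", "rw,data=ordered") and barriers_enabled("ext4", "rw,noatime") give different answers, which is the whole problem with default-vs-default comparisons.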

Note: I did that kind of benchmark a while ago: the goal was to measure system file operations (deliberately on default settings), not sequential throughput or IOPS – and ext4 was faster anyway.

All in all, it’s your data! Test everything yourself with your specific workloads, hardware and configuration. Here’s a simple barrier test workload to get you going.
