THE RANT / THE SCHPLOG
Schmorp's POD Blog a.k.a. THE RANT
a.k.a. the blog that cannot decide on a name

This document was first published 2015-10-08 13:54:06, and last modified 2015-12-30 19:49:40.

NOTES: at the moment, this article will be updated as new information comes in. I decided to publish because a lot of people are looking for this info. Update 2015-11-14: I am on it again, and will gather more data in the coming months. Update 2015-11-17: added a MINI FAQ at the end. Update 2015-12-13: Updated stable f2fs git repository. Update 2015-12-30: Neither kernel 4.3 nor 4.4 is stable; added Linux 3.18 Ubuntu kernel source; improved f2fs module install command sequence. 2016-08-25: Re-tested BTRFS on 4.4+ with extremely good results, making BTRFS the recommended option, see last section.

SMR Archive Drives, Fast, Now

I've got a thing for big drives, as I've got a thing for large filesystems and backups.

Not surprisingly, that got me interested in the (comparatively) dirt cheap Seagate 8TB SMR drives. Asshole companies as big companies are, drive manufacturers do not grow tired of throwing content-free white papers at us and praising this new technology, but can't even be bothered to document some basic facts about their drives.

This means people who want to know how these drives work (to be able to maybe optimize filesystems for them) might have to resort to boring holes into their drives to gain insights.

I won't write in detail about SMR technology, let me just summarise the Seagate 8TB parameters: The drive has zones between 15 and 40MB in size. These zones can either be written in one go, as many times as you want, or they can be rewritten from some point inside the zone to its end, but only a limited number of times (probably once, maybe not even that).

Writes that do not fall into these categories will be cached, and the disk will then read the zone, patch it, and write it out again, at a later time.

This cache is 25GB in size, and can store up to 250000 separate writes, so you can write single sectors randomly, very fast, for a while - and then the drive might need some hours or days to actually fix everything up in the background.

The catch is that you don't know where these zones are, the firmware is very dumb and unoptimized, and even buggy.

So, let's see if we can get speed out of them, and how.

Is my drive faulty? What behaviour is normal?

First of all, these drives can show very scary behaviour, and most are not faulty, even though they often certainly look like broken drives. The most common problem is I/O errors, e.g.:

[  174.738171] ata1.00: device reported invalid CHS sector 0
[  174.738175] ata1: EH complete
[  208.784935] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[  208.784950] ata1.00: failed command: WRITE DMA EXT
[  208.784952] ata1.00: cmd 35/00:c0:40:d5:3b/00:3a:02:00:00/e0 tag 13 dma 7700480 out
[  208.784952]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  208.784954] ata1.00: status: { DRDY }
[  208.784956] ata1: hard resetting link

These errors come in various shapes and sizes (30s freezes, usually), and in most cases do not indicate a problem with the drive, but are due to problems with the Linux kernel. In most cases they are no indicator of permanent data loss, but a filesystem check might be in order. Unless your drive develops bad sectors or repeatedly unreadable sectors, it is most likely a SATA issue of some sort that can be solved (see below).

Another common but harmless behaviour is that the drive stops responding - you can power cycle it, reboot, and so on, and neither the BIOS nor the OS will see it. In other words, the drive looks completely and irrevocably broken.

In such a case do not panic - the drives sometimes go into a state where they do some kind of emergency recovery - when they are powered, they clean their internal journal, which usually takes minutes - if your drive makes regular quiet "clicks", then most likely it is just busy. Keep it powered on for an hour at least, then power cycle it (and/or reboot, to be sure), and see if it is back. Typically, it will be as good as new after such an event.

Kernel woes

Knowing what is normal, let's find out what we can do about it.

Between us and the drive is Linux, which normally has quite a capable SATA driver, but in this case, there are a number of kernel bugs that cause these drives to fail mysteriously when you write a lot of data, making the drives seem bad.

In addition, newer versions of Linux allow larger write sizes, which can trigger data corruption with these drives.

To make a long story short, the only kernel newer than 3.16 (and probably 3.10) that will work with these drives is 3.18.x (I usually get my kernels from Ubuntu).

Kernel 4.2 and 4.3 might also work, if you reduce the write size, and newer kernels will/might/hopefully could have an automatic workaround, but 3.18.21 is a safe bet.

So, either get 3.18.21 and feel safe, or get 4.2 (or later) and reduce the maximum write size of your drive(s) to 512KiB or less:

# workaround for silent data corruption on smr
for dev in /sys/block/sd?/; do
   echo 127 >"$dev"/queue/max_sectors_kb
done

This reduces the write size for all drives, but frankly, you will not see a difference in performance (the SMR drive, for example, still performs at 200MB/s with a write size (max_sectors_kb) of 16 or more). But then, maybe your SATA interface is even shittier than mine. Even then, Linux survived nicely until now with 512KiB as the default maximum size.
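
If you want this setting to survive reboots and hotplugging without rerunning the loop, a udev rule should do the trick - a minimal sketch (the file name is arbitrary, and you might want to match only your SMR drives instead of all sd* disks):

# /etc/udev/rules.d/99-smr-max-sectors.rules
ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sd[a-z]", ATTR{queue/max_sectors_kb}="127"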

If you want to check kernel compatibility, you should have a look at the kernel bugreport for these drives.

It's not clear whether disabling NCQ will also help with these drives (NCQ usually works, but when you run into drive timeouts, disabling it might help further). I think reducing the queue depth does help with performance, and fully disabling NCQ (via the libata.force=4:noncq boot argument, for example - the 4 is a placeholder for the port number) might help as well.
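
If you want to experiment with the queue depth before disabling NCQ outright, it can be changed at runtime via sysfs - a hedged sketch (replace sdX with your SMR drive; a depth of 1 effectively disables NCQ for just that drive):

cat /sys/block/sdX/device/queue_depth    # the default is usually 31 or 32
echo 4 >/sys/block/sdX/device/queue_depth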

I also found that my 6Gbps SATA ports cause errors sooner or later, so I also force the link to 3Gbps (libata.force=4:3.0Gbps), but that is probably not a big issue, as the kernel will reduce the speed on its own sooner or later, and it likely only affects a subset of ports.
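
For the record, to make such libata.force settings permanent on a Debian/Ubuntu-style system, the usual place is the kernel command line - a sketch, not my literal config (the 4 is again a placeholder for your port number):

# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet libata.force=4:noncq,4:3.0Gbps"
# then regenerate the boot configuration
update-grub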

Filesystem woes

Now, let's have a look at the filesystems. I tried:

XFS, btrfs, ext4, ZFS, nilfs

And they all suck badly with these drives. Specifically that means that, depending on mount options, you can write 200-600GB of data, then the drives will slow down to a few MB/s, which, for an 8TB drive, is simply too slow.

Putting this into perspective - if you mostly read from these drives (which is typical), and you only write in batches of a few hundred GB each day or so, then these drives might be very good performers with any filesystem - they can write 100-200MB/s over the whole surface, much better than most normal drives.

But if you write tebibytes, or do big backups, they just won't work for you.

Now, with XFS, btrfs and ext4, this is not surprising, as they were designed for completely different drive technologies. And even though ext4 has some switches for SMR drives, they don't seem to have anything to do with actually available SMR drives, as they are mostly ineffective. (NOTE: see the last section for news about BTRFS).

ZFS is a different beast, of course. As usual, the ZFS fanboys were quick to announce that "ZFS may be the only file system that can use SMR drives with NO LOSS in performance" (here, p. 11).

And just as usual, it's an utter fabrication (and actually wrong, as we will see) - ZFS actually performs worse than XFS on these drives.

A further problem with these filesystems is that they all tend to fragment badly over time, so when you delete and add files regularly, the disk will become extremely slow over time.

That leaves nilfs. This filesystem should be a perfect match (apart from it sucking in other ways), because it almost treats the drive as a tape device, writing a single circular log, i.e. all writes are sequential. Yet it, too, only manages to squeeze 600GB of data onto the disk before becoming unbearably slow. I was unable to investigate further what the reasons were - I suspect it might leave gaps, probably for alignment, which means death for these drives.

In any case, you do not want nilfs on these drives for other reasons - when you delete files, it might have to shuffle up to 8TB of data (4TB on average) around before actually being able to reuse the space. That means it can literally take days for a deleted file to actually free associated disk space (but at least it will not slow down over time, in theory).

So if all these are failures, is there a filesystem to save the day?

Enter f2fs

There is, and from a completely (to me) unexpected corner. The flash friendly file system, not made for rotational media, is capable of driving these drives at high speed (~100MB/s) over the whole surface. It needs its GC time, and it still has a rather unfinished feel to it, but in recent weeks, the f2fs team has made patch after patch, for every trace and report I sent, to make f2fs better on these drives.

The Bad

Of course, the road to there is stony. While the f2fs in 3.18.x is very fast (faster than later versions) with these drives, it is not very good at keeping your data, and not good at all at not freezing your box.

But if you want to make experiments, you can install the standard f2fs-tools from your distro, and:

mkfs.f2fs -s64 -t0 -a0 /dev/sdX
mount -t f2fs -onoinline_data,noatime,flush_merge,no_heap /dev/sdX /mnt

And then experiment. Make sure you don't expect your data to be persistent until you run sync or umount and it returns successfully - in one test I wrote 2.1TiB to the disk, let it idle for hours, and then sync hung. After a forced reboot, I got a drive with 0.7TB of data, in perfect condition. Also, do not run fsck.f2fs if you want to keep your data - at least not unless you use the f2fs-tools from git.

And while f2fs in newer kernels such as 4.2 is more stable, it also becomes extremely slow with these drives.

The Good

But fret not, this is all brand new stuff, so there is a way out.

F2fs in Linux 4.3 is quite capable with these drives, and Linux 4.4 (or 4.5) might finally have all the patches to make the SATA interface for these drives stable (you still need current f2fs-tools).

In the meantime, here's what you can do: install a Linux 3.18.21 kernel and current f2fs-tools, and compile an updated f2fs module.

1. Get Linux 3.18.21

Can't help you much with that, either compile your own kernel from source, or get a 3.18.21 kernel via your favourite package manager. I use Ubuntu kernels (on Debian GNU/Linux) that I get from here.

E.g. for amd64, do something like this:

BASE=http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.18.21-vivid
wget $BASE/linux-headers-3.18.21-031821-generic_3.18.21-031821.201509020527_amd64.deb
wget $BASE/linux-headers-3.18.21-031821_3.18.21-031821.201509020527_all.deb
wget $BASE/linux-image-3.18.21-031821-generic_3.18.21-031821.201509020527_amd64.deb
dpkg -i linux-*-3.18.21-031821*.deb

2. Get current f2fs-tools

git clone --depth 1 git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs-tools.git

Then configure / make install, or whatever you want to do to install it.
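
If you are unsure, the usual autotools dance looks roughly like this (a sketch - it assumes the tree ships an autogen.sh; adjust to what the checkout actually contains):

cd f2fs-tools
./autogen.sh
./configure
make
make install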

The current f2fs-tools do not just have the minor advantage of not completely corrupting the filesystem, they also allocate a lot less space for overprovisioning, that is, a freshly formatted 8TB f2fs partition doesn't lose 400GB out of the box, but only 80GB.

3. Get a current f2fs module

This is harder. At the moment, the f2fs team (or rather Jaegeuk Kim) maintains a linux-3.18 branch (just for me, so cool!). You can build it using the following commands, if you have an Ubuntu kernel (I used 3.18.21 from the mainline PPA in this example; it probably works fine with other 3.18 kernels, although if you want to be sure, use 3.18.21):

git clone -b linux-3.18.y --depth 1 git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs-stable.git/
cd f2fs-stable
KVER=3.18.21-031821
# copy the updated f2fs headers into the installed kernel headers
rsync -avPR include/linux/f2fs_fs.h include/trace/events/f2fs.h /usr/src/linux-headers-$KVER/.
cd fs/f2fs
# build and install the f2fs module out-of-tree against those headers
make -C /lib/modules/$KVER-generic/build/ M=$PWD clean
make -C /lib/modules/$KVER-generic/build/ M=$PWD modules
make -C /lib/modules/$KVER-generic/build/ M=$PWD modules_install
make -C /lib/modules/$KVER-generic/build/ M=$PWD clean
depmod

Then you can rmmod f2fs; modprobe -v f2fs, and verify that the module loaded is the newly installed one (for me, /lib/modules/3.18.21-031821-generic/extra/f2fs.ko, not the one in kernel/fs/f2fs.ko).
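
A quick way to double-check which file modprobe will actually load:

modinfo -n f2fs
# should print something like /lib/modules/3.18.21-031821-generic/extra/f2fs.ko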

Note that the above will overwrite files in your linux-headers-3.18.21-031821 package, but you probably already realised that.

With this, you can write 7TB or more to the disk, at an average speed of about 100MB/s, sometimes much more, sometimes a bit less, using the mount options above.

Disk Full

So why does f2fs achieve good speed, and what are the trade-offs?

In addition to writing data mostly linearly, which is important the first time you write to these drives, the -s64 mkfs option above instructs f2fs to (mostly) handle disk space in blocks of 128MB (64 segments of 2MB each), each of which spans 3-8 zones. So when disk space is freed, the garbage collector will read a 128MB block and linearly copy it elsewhere, freeing a full 128MB block.

This does not work perfectly at the moment - sometimes f2fs fills 2MB chunks randomly, and this spells disaster - but it only happens when the filesystem is very low on space.

Looking at the /sys/kernel/debug/f2fs/status file can help indicate any issues. The Dirty: line shows the number of dirty 2MB segments, that is, segments that need cleaning up. The Free: line shows the number of free 2MB segments. While this is not a foolproof method, a high Dirty number and a low Free number mean that the fs has little contiguous space to play with, and a garbage collection might be in order.
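
For a quick look at just those two numbers, something along these lines should do (debugfs needs to be mounted, and you need to be root):

grep -E 'Dirty|Free' /sys/kernel/debug/f2fs/status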

Unfortunately, this area of f2fs is not yet fully sorted out. By default, f2fs garbage collects in the background, but, also by default, at glacially slow speeds.

I found it best to force the GC manually, which fortunately is possible in current f2fs versions. To do that, I use the following script:

#!/usr/bin/perl

use Coro::AIO;

while () {
   my $gc = pack "i", 8;
   # ask f2fs (via stdin) to do a round of garbage collection - hardcoded f2fs GC ioctl
   ioctl STDIN, 0xf506, $gc
      or last;
   unpack "i", $gc
      or last;
   # sync the filesystem after each round
   aio_syncfs *STDIN;
}

(No, I don't want to write this in C, and I don't feel bad about hardcoding ioctl numbers.)
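
Since the ioctls are done on stdin, the script is pointed at the filesystem via a redirect, i.e. invoked roughly like this (the script name is whatever you saved it as):

./f2fs-gc </mountpoint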

What this does is invoke the GC to do some work, and then immediately sync the fs. While this isn't optimal from a performance standpoint, it will avoid making your system very sluggish, and will make the garbage collection process proceed at near maximum speed.

Note that you don't have to do this unless your disk becomes very full and/or you delete a lot of data and want to reuse the freed space.

I did verify in multiple experiments that f2fs does indeed not degrade over time - a GC after deleting data will make it fast again.

Metadata read performance (i.e. ls -l, find)

Of course, with f2fs being designed for flash devices, you might expect extremely slow performance in some areas, such as when listing files. I have some big directories (hundreds of thousands of files and subdirectories), and just listing them can take tens of minutes.

This is bad, but not phenomenally worse than traditional filesystems - without dmcache, the 20TB XFS filesystem on the 8-disk RAID in the box I am writing this blog entry on needs 1.5 hours for a find over the whole disk when it is full.

Unfortunately dmcache cannot (currently) be used on SMR drives, because it can cause millions of small random writes after an unclean shutdown, which can take weeks to complete with these drives (it sure takes hours on my non-SMR raid volume).

Nevertheless, f2fs has some options to improve directory performance which I haven't explored yet, so maybe there is a positive surprise here still in store. In any case, while f2fs does not perform as well as XFS on metadata reads, it performs way better than you might guess from it being designed for flash.

Things to try with f2fs

You can experiment yourself, for example, with different -s sizes (they don't have to be powers of two). Smaller -s parameters make garbage collection more efficient from the fs perspective (but maybe slower). You won't see a performance difference until the disk has been used (by adding and deleting files), though, so while -s1 might feel blazingly fast, you might have to pay with 1-2MB/s write speeds after a few weeks of use.
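
For example (hypothetical values, same other flags as above; the section size is -s times the 2MB segment size):

mkfs.f2fs -s16  -t0 -a0 /dev/sdX   # 32MB sections - easier GC, but likely slower on SMR after some use
mkfs.f2fs -s128 -t0 -a0 /dev/sdX   # 256MB sections - more data to copy per GC pass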

MINI FAQ

This section just lists some random questions that keep coming up, and some that don't keep coming up.

Can you summarise the situation with different kernels?

According to my knowledge, old kernels (3.10 from Debian Jessie for example) work just fine with these drives. The only problem is that f2fs in these versions is quite unstable and buggy.

Sometime after 3.10 (probably in 3.17) a bug was introduced that increasingly made accessing these drives flaky.

This was fixed in 3.18, and broken again (but differently) in 3.19.

Linux 4.2 mostly works with the drives, but needs adjustments for the max_sectors_kb (apparently a firmware bug in the drives). Mostly means that there are still stability problems.

Linux 4.3 has a fix for the max_sectors_kb problem, but again isn't stable with these drives (tested up to 4.3.3). F2fs in this version, however, is quite stable.

Linux 4.4-rc7 still isn't completely stable with these drives, often causing a lot of I/O errors that can cause the drive to play dead for a while.

Thus the recommended version is Linux 3.18.21 (with a custom f2fs module).

How can I use these disks in a RAID?

If you want to use them in a RAID 0 or RAID 1 config, you can go ahead, there is nothing special about these disks. The only potential issue is that these drives have a tendency to need more time for some operations, which can make them time out with some RAID controllers. This behaviour is not uncommon with modern desktop disks (the relevant search term for other disks is "TLER"), but somewhat more likely with these disks.
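
If the timeouts come from the Linux side (software RAID or plain use), you can also give the SCSI layer more patience per command - a hedged workaround, not a fix (the default of 30 seconds matches the 30s freezes mentioned above):

echo 180 >/sys/block/sdX/device/timeout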

Other RAID levels are strongly discouraged, both by the manufacturers themselves and by common sense reasoning: While the drives should work correctly in a RAID 5 or similar setting, they will almost certainly show extremely bad performance. The reason is that higher RAID levels split writes into many chunks which are much smaller than a zone, and these are often not written "in one go", forcing the disks to turn, say, a 64kB write into a 30MB read/write cycle, with a corresponding 1000-fold slowdown. While caching can and does alleviate this a lot, the slowdown is still tremendous, so a one-day rebuild can become a multi-week rebuild, for example.

That means you can certainly try and it might work, but you likely won't be happy with the results.

How about snapraid?

Snapraid should be fine as long as the parity disk is not an SMR disk (you can combine two normal 4TB disks into a parity disk, for example). Also, since snapraid is kind of an off-line RAID 4, it might be workable to use an SMR disk as the parity disk, although it is hard to tell what kind of slowdown you might encounter - you can try it out and tell me if you wish.

I get frequent timeouts/scary SATA errors, is my disk broken?

Likely not, this is probably a kernel issue - read the full article.

My disk is not recognized by my BIOS/OS, is it broken?

Likely not - keep it powered on and connected for an hour at least, power cycle and see if it recovers. Repeat this a few times before considering it broken.

df shows a lot of space in use after a fresh f2fs format, is this normal?

Yes, f2fs does not hide the reserved space it uses, unlike e.g. ext4 (format the same partition with ext4 and you will see 0 bytes used, but a much smaller overall filesystem size). Make sure to use the most current f2fs-tools, which should result in about 100GB of reserved space on an 8TB disk.

f2fs is quite slow at directory operations, can I use dm-cache to speed it up?

In theory, dm-cache is a great tool to speed up metadata operations for rotational media. There are two issues with it though, the second of which makes it completely impractical to use with SMR drives.

The first issue is that dm-cache ignores the data type of I/O operations (filesystems can and do specify whether an I/O operation involves metadata or file data, for example) and simply caches everything. This can be alleviated with some tricks and configuration, such as priming the cache with an fsck while the cache is configured to aggressively build its cache.

The second, show-stopper issue is that dm-cache will write all cached blocks out after an unclean shutdown (it doesn't matter whether the cache is in writethrough or writeback mode). While this is slow with normal rotational disks (an hour with a typical cache), with SMR drives it results in a similar slowdown as e.g. RAID 5, because every, say, 64kB write causes a 30MB read-write cycle on average. This can mean a slow/unusable server for weeks after an unclean shutdown.

UPDATE 2016-08-25: F2FS and BTRFS medium/long-term results

Here are some long-term results: while f2fs performed well, it generated an enormous amount of fragmentation over time (as would be expected), but has no way to deal with it long-term, so over time, f2fs performance did deteriorate regardless of the underlying drive technology.

Changes in my backup set-up freed the two SMR drives, so I decided to give BTRFS another try, using kernel 4.4.x (and in the meantime, have used 4.6.x and 4.7.x as well).

The results are astonishing - while BTRFS in 3.18 performed about as badly as XFS and other filesystems, with major write speed degradation, this is now more or less completely gone: Not only was I able to write almost 16TB of data in one session to a two-device 8TB SMR setup (single, not raid, although dup, raid0 and raid1 profiles should work just as well) with steady write speeds of between 80 and 150MB/s, BTRFS also performs extremely well in near-disk-full conditions (using 99% or more of the available space without the performance degradation seen with e.g. XFS or F2FS). No special mkfs.btrfs flags are needed for this, either.

The space-efficiency, coupled with the fact that BTRFS also keeps a data checksum and has online scrubbing, makes it the recommended filesystem for archival-type setups with these disks - although with some increased maintenance, BTRFS would also perform well as a general-purpose setup on SMR drives (exception: databases and similar workloads).

I would not, at this point, use BTRFS without a backup - BTRFS itself is pretty solid at this point, but it is very sensitive to hardware problems (such as when a disk loses stable data after a power outage), and the repair tools are well documented to easily destroy your whole filesystem if you use them wrongly.

The remaining problems with BTRFS - low peak write throughput compared to other filesystems and the need for some constant maintenance (BTRFS does not defragment itself) - are independent of the underlying SMR technology and will doubtlessly improve over time.

Uncommon solutions for BTRFS

There are lots of resources to solve common problems with BTRFS. Here I will give a few tips on long-term maintenance instead.

Regular Scrubbing

If you are into archival storage, you probably know how important reading your data regularly is, to give the disk a chance to replace weak sectors, and to detect problems before they become too big.

With BTRFS, you can do this online either as a low-priority job (taking much longer, but not slowing down normal usage as much):

btrfs scrub start -c3 /mountpoint

Or as a higher priority job:

btrfs scrub start -c2 -n7 /mountpoint

The first command usually reads at about 100MB/s on my filesystems, while the second job reads near the possible I/O speed (100-180MB/s on my SMR disks, 500MB/s on my bigger raid5 setups, 2700MB/s on my NVMe SSD).

It is possible to suspend scrubbing:

btrfs scrub cancel /mountpoint

And to resume it later:

btrfs scrub resume -c3     /mountpoint # low priority
btrfs scrub resume -c2 -n7 /mountpoint # high priority

You can see if there are any problems found with:

btrfs scrub status /mountpoint

And if there are, the syslog will have more details.

Scrubbing will replace faulty metadata blocks if your profile is dup, and pinpoint faulty data blocks with full path to the file in your syslog.

Internal fragmentation

For my purposes, I define internal fragmentation as space allocated but not usable by the filesystem. In BTRFS, each time you delete files, the space used by those files cannot be reused for new files automatically.

I use the following cron job every three hours for my SMR archival storage to reduce internal fragmentation:

exec >/dev/null 2>&1 ionice -c3 btrfs fi ba start -dusage=99 -dlimit=2 /mountpoint

This will roughly move two gigabytes of data per invocation (two chunks, to avoid btrfs moving the same chunk over and over). The frequency should be adjusted to the amount of writing the device receives, as BTRFS will move this data whether it's needed or not.
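
If you want to check whether such a balance run is still in progress, or stop one that gets in the way, the usual subcommands apply:

btrfs balance status /mountpoint
btrfs balance cancel /mountpoint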

You can get an idea on whether it's useful or not by looking at:

btrfs fi us -g /mountpoint

Data,single: Size:14314.01GiB, Used:14312.66GiB
Unallocated: 531.99GiB

This line shows how much of the disk is allocated to data (Size) and how much of that is actually used (Used).

The difference between the two is a measure of internal fragmentation. In the case above, it's less than 2GB, which is about perfect, due to the cron job. Here are some of my other BTRFS filesystems, with similarly acceptable values, meaning the difference between allocated and used is much lower than the amount of unallocated space still left:

Data,single: Size:6707.01GiB, Used:6550.06GiB
Unallocated: 4441.77GiB

Data,single: Size:13226.00GiB, Used:13135.03GiB
Unallocated: 13838.56GiB

Data,single: Size:58.00GiB, Used:55.42GiB
Unallocated: 30.94GiB

It's not a hard requirement to do this maintenance regularly, but doing so spares you from waiting for hours for a balance to clean things up when the disk is full - and of course it also reduces the number of times you get unexpected disk full errors. As a side note, this can also be useful to prolong the life of your SSD, because it allows the SSD to reuse space not needed by the filesystem (although there is a trade-off: frequent balancing is bad, no balancing is bad, the sweet spot is somewhere in between).

Also, the above only works on the data sections of the filesystem, not the metadata sections. You can keep an eye on the metadata values, but since they are usually very small in comparison (on my nearly 16TB setup, BTRFS uses about 24GB of metadata for 5 million files), the required frequency of maintenance is very low.

External fragmentation

External fragmentation in BTRFS happens each time a file is updated in-place, or when files are appended to over time. Typical problematic cases are VM disk images, databases and torrents.

For some of these (databases, disk images, some torrent clients), it is possible to switch off copy-on-write, but this usually cripples performance on SMR drives and also switches off data checksums, which are great for archival purposes.
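
For completeness, switching off copy-on-write is normally done per directory with chattr, and only affects files created afterwards - a sketch (the directory name is just an example; remember this also disables checksumming for those files):

mkdir /mountpoint/vm-images
chattr +C /mountpoint/vm-images
# files created in here from now on are nodatacow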

The alternative is to defragment files regularly, e.g. after downloading a torrent. Fortunately, BTRFS has a tool for this as well, e.g.:

btrfs fi def -t 128m -r file-or-directory

This is a living document

This document is just a kind of draft (I haven't even proofread it once), and will be updated when f2fs development goes along. I normally would delay writing this blog entry, but given the amount of pain these drives cause currently, I decided to publish my findings, so people have a chance of getting good performance out of these drives, now.