THE RANT / THE SCHPLOG
Schmorp's POD Blog a.k.a. THE RANT
a.k.a. the blog that cannot decide on a name

This document was first published 2015-07-12 04:56:30, and last modified 2015-07-12 04:56:30.

How to Scan Directories, Fast: the Tricks of aio_scandir

In this article, I want to delve into IO::AIO's aio_scandir function, and describe which tricks it uses to scan directories fast (and why it can be used to scan directory hierarchies even faster). These optimisations have nothing to do with Perl and could be used in implementations of any language.

Now, what do I mean by that? First, scanning a directory is easy, you opendir it, readdir all the names, and you are done. This is quite fast, and hard to optimize. Things get a lot more interesting if you want to find which entries in a directory are subdirectories, and which aren't.

The basic solution in Perl looks something like this - basically, read all names, lstat them and see which one are directories:

opendir my $dirfh, $path
   or die "$path: $!";

local $_;
my (@dirs, @nondirs);

while (readdir $dirfh) {
   /^\.\.?$/ and next; # skip . and ..
   lstat "$path/$_" or next;

   if (-d _) {
      push @dirs, $_;
   } else {
      push @nondirs, $_;
   }
}

This works, but for large directories, this is significantly slower than it could be. So let's see what aio_scandir does about this.

Trick 1: use aio_readdirx

aio_scandir is (currently) written in Perl, but to get the list of names in the directory, it uses aio_readdirx, which returns all names in the directors, and possibly more. aio_readdirx (which is implemented in the lower level C library libeio) supports a number of flags:

READDIR_DENTS

Instead of returning just names, return the name, the type, and the inode number for each entry.

READDIR_DIRS_FIRST

Sort the list so the likely directories come first.

READDIR_STAT_ORDER

Sort the list so that calling lstat on each entry in order will likely be optimal.

READDIR_FOUND_UNKNOWN

This is a result flag, and is set when there where any entry of type unknown.

The description for the aio_readdirx in the IO::AIO manpage gives a much more verbose explanation for these.

So how are these implemented, and how are they useful?

Trick 1a: inode numbers

Basically all readdir implementations return an inode number (which is not returned by perl's readdir). In most UNIX filesystems, each file (and directory) has a unique inode number, and often, this inode number is a block number where the stat information can be found, or is a good proxy for the position of this information. So if you sort by inode number and stat in order, you are likely to get a faster response than stat'ing them in other orders.

Trick 1b: type fields

Many (but not all) readdir implementations return an additional field, the D_TYPE field, which basically returns the type directly. If it is available, its usefulness depends on the filesystem: Many Linux filesystems fully support it. My current favourite, XFS, basically doesn't - it sets DT_DIR on . and .., and gives everything else a DT_UNKNOWN, which basically means "go figure!".

Detecting whether type fields are available is very hard - it's probably easiest to write an autoconf test. libeio itself relies on _XOPEN_SOURCE, which, of course, is very unreliable on BSDs, so uses some additional guessing.

A minor optimisation involving type fields is the READDIR_FOUND_UNKNOWN flag - if it isn't returned by aio_readdirx, then aio_scandir knows that all the entries have a type, so the type is known with certainty, resulting in a quick grep for the subdirectories.

Trick 1c: sorting and guessing

From the flags and the directory information, aio_readdir then compiles a sort order, which is then used to sort the names into a hopefully good ordering. The sort algorithm used is a variant of MSD radix sort, which is stable, easy to implement, and works very well on the often very non-random data returned by readdir.

Type and inode sorting is straightforward, but in the absence type information, which names are likely directories? Here we need to use a heuristic, which, in thew case of libeio, are:

If a name starts with a dot, it's a likely directory (that might not be true, but dot-entries are usually rare, and usually have directories among them).

Otherwise if the name does not contain a dot, then it is a likely directory and then shorter names tend to be more likely to be directories.

Otherwise, it's unlikely to be a directory.

This works in common cases, as directories often do not have suffixes, while files often have one (blabla.txt), and other things tend to be rare.

Trick 2: directory link count

This is a common trick (also used in many find implementations), and consists of looking at the link count of the directory. On UNIX filesystems, every directory (usually with the exception of /), has one hardlink for the . entry, one hardlink for its name entry in the parent directory, and one hardlink for the parent directory (..) entry in each subdirectory.

Or in other words, the number of subdirs in a directory is usually the number of hardlinks minus 2. For example, for the /etc directory on the box where I write this article:

/etc# stat .
Device: fc0ch/64524d    Inode: 24117249    Links: 351
/etc# ls -la | grep ^d | wc -l
351

The ls -la lists 351 directories (including . and ..), of which there are 349 subdirectories. And 349 + 2 equals the link count of 351.

That means if you have a full directory with a link count of 2...

# stat .
Device: 2eh/46d Inode: 10379009716  Links: 2
# ls | wc -l
19457

... then we already know there are no subdirectories here, and can potentially skip 19457 stat calls.

This trick explains why we want to sort entries into likely directories first - if aio_scandir find a n subdirectories, and the link count is n+2, we can immediately stop searching for more subdirs, which can be substantial if you have directory with a million JPG files and a single .thumbnail subdir which gets' stat'ed first.

What about non-POSIX filesystems? These tend to have link counts of 0 or 1, and can be detected rather portably. And since the trick is commonly used, it is bound to be supported by most kernels.

Trick 2: asynchronous lstat

With the sorted list and the number of expected subdirectories, aio_scandir now needs to lstat them. Doing these calls concurrently has drastic advantages: By doing at least one lstat call concurrently with the scanning function means that any necessary processing can overlap with the lstat call, and thus, often hide in the disk I/O latency.

Doing more in parallel often allows operating systems to find more efficient paths for the read/write head (see http://en.wikipedia.org/wiki/Elevator_algorithm). And even for modern SSD disks which laugh at this, command queueing (e.g. NCQ) will benefit even those (and conventional harddisks as well).

This is a configurable parameter to aio_scandir, the default is to do up to four asynchronous lstat calls in parallel.

Trick 3: Don't lstat the path/entry itself, stat path/entry/. instead!

This trick is rather more obscure, but is easy to understand: Assuming the directory entry is a directory and stat'ing the . entry inside gives us either success, in which case the entry at least points to a directory, or an error if it isn't.

The "trick" becomes a trick because most filesystems, including ones that do not directly store type information in directories (or refuse to return this information) can take shortcuts when trying to resolve a path. That means by trying to see "inside" the entry, the filesystem can error out earlier, often without actually retrieving the stat information.

Naturally, there are trade-offs and complications: For directories with a lot of symlinks to other directories (which are not subdirectories), this has to do a lot more work, namely resolving the symlink and stat'ing the resulting path. So for each directory we find, we need to lstat the entry itself to see if it really is a directory.

Fortunately, directories are usually a lot less common than files - even if some hierarchies contain a lot more directories than files, it's usually just hundreds of directories, compared to thousands or millions of files in big directories, those where speed counts most.

Trick 4: fstatat vs. lstat

A minor trick, also not accessible from pure Perl, is to use fstatat instead of lstat: unless you have the luxury of being able to chdir in the target directory, you normally have to call lstat with the full path. This doesn't usually hit the disk, but for thousands or millions of calls, the overhead of resolving the path each time is visible.

Both libeio and IO::AIO abstract the xxxat () interface away using its working directory abstraction, so this is easy to do.

Correctness hits. Correctness hits. Correctness hits. You die.

There are, naturally, some things to watch out for - the whole scanning operation (starting with readdir) is not atomic, so entries can appear, vanish and change at any time during the process.

While nothing can be done about this (the result can always be outdated), aio_scandir can at least try to make some basic guarantees in the presence of activity that changes a directory.

One such guarantee is that, in the list of files returned, there really are no directories unless a file was replaced by a directory during scanning. As basic as this sounds, applying these tricks will not guarantee this, as the appearance of a new directory might cause another existing one to be overlooked when the link count heuristic is used.

To protect against this, aio_scandir stat's the directory before it starts reading the directory, and afterwards. If link count, size, mtime and a few other values have changed, it falls back to the full scanning algorithm. In practise, you also need to consider an mtime that is in the current second as changed, as many filesystems only store timestamps at full second resolution.

Other optimisations

The above is all I can imagine and remember that aio_scandir can take advantage of (if you know of another technique to speed up scanning, tell me). However, aio_scandir is usually used in the context of directory hierarchy scanning, that is, recursively scanning all directories inside a directory.

In that case, more optimisations could be done - the same sort-by-inode trick used for stat'ing inside directories can be used globally, that is, it can be used to scan many directories concurrently, and decide by inode number which directory is scanned next, and also do the stat calls of all these directory scans in an optimal way. If you are interested in these extra optimisation, consider them your homework assignment :)

Unscientific Marks on Benches

How much improvement are we actually talking about, compared to a good find implementation? Well, the savings can be substantial - let's look at two cases. First, a 20TB XFS filesystem (itself a hard case, as it gives no useful D_TYPE information) on an 8 harddisk RAID5 and about 20 million inodes.

On this filesystem, a full scan with empty cache using GNU find takes 1h19m (find /wd -printf "" >/dev/null). The treescan utility that comes with IO::AIO, which naively uses aio_scandir, scans the same filesystem in 50 minutes, thats half an hour less, which is extremely helpful if you sit in front of it and wait for results.

Second case, let's consider SSDs. While scanning a typical SSD is hardly worth any optimisation, what about using an SSD as dm-cache for the same RAID... In the same setup as above, but with a 40GB dm-cache for the filesystem that happens to store all the stat info, aio_scandir still wins over GNU find in terms of absolute speed:

A full scan with empty cache using GNU find takes 64.5s, while the same with treescan only needs 15.8s, over four times faster. This is almost completely due to saved stat calls.

Where treescan loses out is in terms of CPU usage - in the dm-cache case, GNU find uses 3.5s of CPU time, a tiny fraction of the absolute time, while treescan uses 17.3s - which is more than the absolute time due to the extra threads involved.