THE RANT / THE SCHPLOG
Schmorp's POD Blog a.k.a. THE RANT
a.k.a. the blog that cannot decide on a name

This document was first published 2016-03-03 22:27:29, and last modified 2016-03-03 22:27:29.

Detecting a mount point

Today, I just want to dispense the results of a tiny bit of research I did. May it be of some use to you :)

How to scan a directory tree

In the old days, the way to scan a directory tree was to readdir the names in the directory, lstat them, and then recursively do this for any directories you found. The reason you would use lstat instead of stat is that symlinks can point up in the tree, potentially creating loops. Directories themselves cannot point up in the tree, because you can't hardlink them: hard links to directories were originally prohibited precisely because they trick programs into endless recursion.

That doesn't keep bash from doing it the wrong way (echo ** is a sure way to invoke the out-of-memory killer on my system, if I had the patience to wait for it, that is), but most programs working on directory trees (such as tar) do get this right.
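As a rough illustration of the traditional approach (not taken from any particular program), the readdir/lstat recursion might be sketched in C as follows. The name count_tree, the fixed 4096-byte path buffer and the idea of merely counting entries are assumptions made for this example; a real tool would do something useful with each entry and handle errors more carefully.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical helper: counts all entries below path, descending only
   into real directories. Returns -1 if path itself cannot be opened. */
long
count_tree (const char *path)
{
  assert (path != NULL);

  DIR *dir = opendir (path);
  if (!dir)
    return -1;

  long count = 0;
  struct dirent *de;

  while ((de = readdir (dir)))
    {
      /* never follow the self and parent entries */
      if (!strcmp (de->d_name, ".") || !strcmp (de->d_name, ".."))
        continue;

      char child[4096]; /* assumed to be large enough for this sketch */
      int len = snprintf (child, sizeof child, "%s/%s", path, de->d_name);
      if (len < 0 || (size_t)len >= sizeof child)
        continue; /* path too long, skip it */

      struct stat st;
      if (lstat (child, &st) < 0)
        continue; /* vanished or inaccessible, skip it */

      ++count;

      /* lstat reports a symlink as a symlink, never as a directory,
         so symlinks to directories are counted but not entered */
      if (S_ISDIR (st.st_mode))
        {
          long sub = count_tree (child);
          if (sub > 0)
            count += sub;
        }
    }

  closedir (dir);
  return count;
}
```

Calling count_tree (".") returns the number of entries below the current directory. A symlink that points back up the tree is counted once and never entered, which is what keeps the recursion finite.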

The two main cases where this is not enough are -x options and Linux. First, many programs have an -x or --one-file-system option that instructs them not to cross filesystem boundaries. That way, you can use tar --one-file-system to get a tar file of your root filesystem without also adding /proc, for example.

The way this was traditionally implemented is to stat the directory you want to recurse into and its parent directory, and see whether the st_dev member matches. If it doesn't match, the subdirectory is on another filesystem, i.e. a mount point.

Unfortunately, this fails badly on Linux, for two reasons. The first is a bug - st_dev values are not stable (most notably with NFS), that is, you might stat a path, get an st_dev value, and later stat the same path again to get a different st_dev value, even though nothing has changed. This can cause your tree walker to detect a filesystem crossing when there really isn't one.

The other reason is bind mounts - you can create a loop by bind mounting some directory into a subdirectory of itself:

mkdir subdir
mount --bind . subdir
ls subdir/subdir/subdir/subdir/...

This successfully confuses a lot of programs that otherwise get recursion right (or to put it differently, Linux bind mounts broke a lot of programs that worked properly for decades).

So how to avoid it?

The horrible way

One horrible way is to parse /etc/mtab, /proc/mounts, the output of mount, or some other source of mount points and compare the paths with the directory you want to enter, to see if it is a mount point.

This is horrible for so many reasons - comparing paths is hard (some of the components might be symlinks), and there are so many race conditions you can't check for - somebody could mount a new directory, an existing directory could be renamed, and so on.

I'd outright refuse to use this method, even if there were no other solution.

The tricky^Wtrivial way

Fortunately, there is a super-simple way of checking whether a directory is a mount point that is mostly race-free (and can be used for completely race-free scanning with some care). It exploits what some people consider a bug, but what the Linux kernel developers fortunately consider a security feature: while a bind mount might refer to the same filesystem as its parent, it is still a security boundary for rename and similar syscalls.

That means that you can't rename or link across mount point boundaries, even when source and destination are the same filesystem!

A practical way to exploit this is to do some illegal rename and then check the error return (errno).

rename ("somepath/.", "somepath/subdir/.");

If this succeeds, you are in big trouble of course. So it will fail, but how exactly does it fail?

If it fails with EXDEV, then somepath/subdir is a mount point. Otherwise (e.g. EBUSY), it isn't. If you chdir into the subdir before checking (or use the *at functions, such as renameat), and use relative paths, then it is even race-free, i.e. this will reliably tell you whether the current directory is a mount point:

rename ("../.", ".");

Even if somebody mounts something over that directory shortly afterwards, your program can still continue to scan the underlying paths on the same filesystem.
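Put together, the check might be wrapped up roughly like this. It is only a sketch: is_mount_point is a made-up name, and opening the directory and calling renameat relative to that descriptor is just one way to get the effect of the chdir variant without changing the working directory. It relies on the kernel behaviour described above, which is observed rather than guaranteed.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if the directory path is a mount point,
   0 if it is not, and -1 if the check could not be made. */
int
is_mount_point (const char *path)
{
  assert (path != NULL);

  int fd = open (path, O_RDONLY | O_DIRECTORY);
  if (fd < 0)
    return -1;

  /* same as rename ("../.", ".") after a chdir into the directory,
     but resolved relative to fd instead of the working directory */
  int res = renameat (fd, "../.", fd, ".");
  int err = errno; /* save errno, close may overwrite it */

  close (fd);

  if (res == 0)
    return -1; /* the rename is expected to fail, so treat success as an error */

  /* EXDEV: parent and directory are on different mounts;
     anything else (e.g. EBUSY) is taken to mean "same mount" */
  return err == EXDEV;
}
```

A scanner could call is_mount_point (child) before descending into child and skip it when the result is 1. One caveat of this sketch: the root directory has no parent on another mount to compare against, so it should report 0 for /.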

Brilliant, isn't it? So what are the trade-offs? Everything has trade-offs, right?

Well, for one, this isn't exactly POSIX-specified behaviour. Bind mounts and other constructs where this problem happens aren't governed by POSIX anyway, and some OS might someday implement a similar construct and allow cross-mount-point renames. I don't know of any that do at the moment, and it would be dumb to do so, but then...

The bigger issue is that POSIX doesn't actually specify the order in which error conditions are checked. The above rename can fail for a number of reasons, and there is no guarantee that EXDEV will be returned even for mount points - the kernel could simply refuse to use . as source path, regardless of destination. But currently, it is working, everywhere, and that's likely not going to change.

In any case, it's more portable and better than any other alternative I can think of.

Appendix: security boundary?

The reason why the EXDEV error behaviour is not an accident but by design is that it is supposedly usable as a security boundary. As explained by Al Viro, this:

mount --bind /tmp /tmp

gives you a /tmp directory that keeps people from linking from inside /tmp to the outside, and vice versa, which prevents some common exploits.

So this is a supported feature. Of course, there are bugs everywhere, so it might not always work, but that isn't news, is it?