THE RANT / THE SCHPLOG
Schmorp's POD Blog a.k.a. THE RANT
a.k.a. the blog that cannot decide on a name

This document was first published 2015-06-13 05:41:41, and last modified 2015-06-13 05:41:41.

Emulating Linux MIPS in Perl - Part 2: Linux emulation

In this part about the Linux MIPS emulator, I will write about the "Linux Kernel Emulator" part, which is little more than a stub to get dash running, and as such represents kind of the "minimal" API that needs to be implemented to somehow run as an actual program.

Preliminaries

First we need some definitions (I only show a few):

sub ENOENT (){  2 }
sub EBADF  (){  9 }
sub ENOMEM (){ 12 }
sub ENOSYS (){ 89 }

sub O_ACCMODE   (){    0003 }
sub O_RDONLY    (){      00 }
sub O_WRONLY    (){      01 }
sub O_RDWR      (){      02 }
sub O_CREAT     (){  0x0100 } # not fcntl
sub O_NOATIME   (){ 0x40000 }

Then we need some helpers to translate between native errno and open mode values and MIPS ones:

sub errno2mips() {
   $!*1 # wrong, wrong, wrong
}

sub mips2omode($) {
   my $mmode = shift;

   my $omode = 0;

   $omode |= Fcntl::O_RDONLY if ($mmode & O_ACCMODE) == O_RDONLY;
   $omode |= Fcntl::O_WRONLY if ($mmode & O_ACCMODE) == O_WRONLY;
   $omode |= Fcntl::O_RDWR   if ($mmode & O_ACCMODE) == O_RDWR;

   for my $mode (qw(
      APPEND SYNC NONBLOCK CREAT TRUNC EXCL NOCTTY
      ASYNC NOFOLLOW DIRECT DIRECTORY NOATIME
   )) {
      eval "\$omode |= Fcntl::O_$mode if \$mmode & O_$mode";
   }

   $omode
}

While the open mode translation is relatively full-featured, you can see that the errno translation is, kind of, well, unimplemented. That's probably one reason I didn't pack this up and formally publish it (I wrote this sometime in 2011).
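One way the errno translation could be filled in is a lookup table keyed by the native errno value. This is only a sketch, not code from the emulator: the name errno2mips_sketch and the table are made up for illustration, and it lists just a handful of the codes whose MIPS values differ from most other Linux architectures (values as I remember them from the MIPS Linux headers). Anything not in the table is passed through unchanged, which should be right for the classic low-numbered codes as long as the host is Linux as well.

```perl
use strict;
use warnings;
use Errno qw(EDEADLK ENOLCK ENAMETOOLONG ENOSYS ELOOP ENOTEMPTY);

# native errno value => MIPS errno value (partial table, for illustration)
my %native2mips = (
   EDEADLK,      45,
   ENOLCK,       46,
   ENAMETOOLONG, 78,
   ENOSYS,       89,
   ELOOP,        90,
   ENOTEMPTY,    93,
);

# hypothetical replacement for errno2mips: takes an optional native
# errno value and defaults to the current value of $!
sub errno2mips_sketch {
   my $native = @_ ? $_[0] : $! + 0;

   exists $native2mips{$native}
      ? $native2mips{$native}
      : $native
}
```

Because the keys are taken from the Errno module at compile time, the table adapts to whatever numbering the host uses.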

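Next we define an array that maps syscall numbers to function references implementing them, and fill it with a default handler for unimplemented syscalls (nominally returning ENOSYS, although the version shown here warns and dies instead):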

my $enosys = sub {
   warn "unimplemented syscall $r2\n";
   die;
};

my @syscall = ($enosys) x 7000;

According to a high profile glibc author, it's legal for any syscall in POSIX to return ENOSYS, which makes it easy for us, and would make POSIX a bit less useful.

In hindsight, maybe it would be better to do syscall lookups like this, instead of reserving 7000 entries:

$syscall{$nr} || $enosys
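Spelled out a bit more, the hash-based lookup could look like the following sketch. The dispatch function is hypothetical (the emulator does the lookup inside sys), and the default handler here returns -ENOSYS instead of dying.

```perl
use strict;
use warnings;

sub ENOSYS () { 89 }   # MIPS value, as defined earlier

my %syscall;           # syscall number => implementation

# default handler for everything that is not implemented
my $enosys = sub { -ENOSYS() };

$syscall{4020} = sub { $$ };   # getpid

# hypothetical dispatcher: look up the handler, fall back to $enosys
sub dispatch {
   my ($nr, @args) = @_;

   ($syscall{$nr} || $enosys)->(@args)
}
```

Only the syscalls that are actually implemented take up space, and looking up an unknown number does not create an entry.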

Invocation

On MIPS, syscalls use a special instruction (quite interesting given how few instructions MIPS actually has), unsurprisingly called syscall.

The CPU emulator code simply invokes the sys function when it encounters one, which is conceptually simple and mimics the code that a real kernel would execute: it gathers syscall parameters in a highly architecture-specific way, calls the corresponding implementation function, and then passes the return values to the user mode program, again in a highly architecture-specific way.

Let's start with gathering arguments - the first four arguments are in registers R4 to R7 (all registers are represented by Perl variables of essentially the same name), the remaining four are still on the user mode stack (as indicated by R29), so we need to access memory for them:

sub sys {
   my @args =  map $_*1,
      $r4, $r5, $r6, $r7, # first four args in regs
      # extra arguments on stack
      $mem[($r29 + 16) >> ADDR_SHIFT][(($r29 + 16) >> 2) & ADDR_MASK],
      $mem[($r29 + 20) >> ADDR_SHIFT][(($r29 + 20) >> 2) & ADDR_MASK],
      $mem[($r29 + 24) >> ADDR_SHIFT][(($r29 + 24) >> 2) & ADDR_MASK],
      $mem[($r29 + 28) >> ADDR_SHIFT][(($r29 + 28) >> 2) & ADDR_MASK],
   ;

Invoking the actual implementation function is trivial, we simply look up the function in @syscall, using the syscall number in R2, and adorn it with some "strace" output, which is really useful for debugging the thing, as you can imagine:

   $strace = "$r2 (@args)";
   my $retval = $syscall[$r2](@args);
   print STDERR "$$ SYS_$strace = $retval\n" if STRACE;

Syscalls (in Linux) generally return a single word (32 or 64 bit), which is either the result or a negative errno value. This is passed directly to userspace on some architectures, but on MIPS (where Linux has to emulate the pre-existing ABIs), normal results end up in R2 and are flagged as such by R7=0. Errors are also returned in R2 (as the real errno value, not the negated one), but R7 is set to 1 to signal an error.

   if ($retval > -4096 && $retval < 0) {
      $r2 = -$retval;
      $r7 = 1;
   } else {
      $r2 = $retval;
      $r7 = 0;
   }
}

Linux treats values from -4095 to -1 as errno values (which is what the check above implements), and (I haven't checked many architectures) this seems to be pretty universal in Linux.
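To make the convention easy to try out in isolation, here is the same check as a small stand-alone function. The name mips_result is made up for this sketch; it returns the pair that would end up in R2 and R7.

```perl
use strict;
use warnings;

# hypothetical helper: split a Linux-style return value into the
# (R2, R7) pair used by the MIPS syscall ABI
sub mips_result {
   my ($retval) = @_;

   return (-$retval, 1)   # positive errno in R2, error flag set
      if $retval > -4096 && $retval < 0;

   ($retval, 0)           # plain result, no error
}
```

For example, the -89 returned for sigaction in an strace line would reach the program as R2=89 with R7=1, while a large brk result such as 268435456 is passed through with R7=0.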

Finally, here is the full output from the "strace" code above when starting dash. The # at the end is the dash prompt, and the SYS_write after it is the strace output corresponding to writing the prompt to, hmm, STDERR, interesting:

638 SYS_ioctl (0, 540d, f00efe80) = 0
638 SYS_ioctl (1, 540d, f00efe80) = 0
638 SYS_getpid () = 638
638 SYS_sigaction (18, f00efce8, 0) = -89
638 SYS_geteuid () = 0
638 SYS_brk (0) = 268435456
638 SYS_brk (10001000) = 268439552
638 SYS_getppid () = 2582
638 SYS_stat64 (/root/src/mips, f00efca0) = 0
638 SYS_stat64 (., f00efca0) = 0
638 SYS_ioctl (0, 540d, f00efdc0) = 0
638 SYS_ioctl (1, 540d, f00efdc0) = 0
638 SYS_sigaction (2, 0, f00efde0) = -89
638 SYS_sigaction (3, 0, f00efde0) = -89
638 SYS_sigaction (15, 0, f00efe00) = -89
638 SYS_open (/dev/tty, 2002, 0) = 4
638 SYS_fcntl (4, 0, a) = 10
638 SYS_close (4) = 0
638 SYS_fcntl (10, 2, 1) = 0
638 SYS_ioctl (10, 40047477, f00efe18) = 0
638 SYS_getpgrp () = 638
638 SYS_sigaction (24, 0, f00efdf8) = -89
638 SYS_sigaction (27, 0, f00efdf8) = -89
638 SYS_sigaction (26, 0, f00efdf8) = -89
638 SYS_getpgid (0) = 638
638 SYS_ioctl (10, 80047476, f00efe0c) = 0
638 SYS_wait4 (-1, f00efddc, 3, 0) = -10
638 SYS_stat64 (/, f00efce0) = 0
# 638 SYS_write (2, 1000094c, 2) = 2

The alert reader will of course notice that the above output is not actually the string assigned to $strace in the sys function. That's because syscalls can (and most do) override that string with something slightly more useful, as we'll see shortly.

Meet the Syscalls

Now let's see how some of these syscalls are implemented, and why some of the more interesting ones had to be implemented (somehow).

Let's start with a real easy one, to get in the mood, namely syscall number 4001, exit (which corresponds to POSIX::_exit, not the built-in exit):

$syscall[4001] = sub { # exit
   strace "exit ($_[0])";
   exit $_[0];
};

Could hardly be simpler. The strace function is the part that overwrites the $strace string that is later being printed, and works pretty much like sprintf:

sub strace($;@) {
   $strace = $#_
      ? sprintf $_[0], @_[1..$#_]
      : shift;
}

Anyways, let's try the next harder one, which is (hopefully surprisingly), fork:

$syscall[4002] = sub { # fork
   strace "fork";

   my $pid = fork;
   return -errno2mips unless defined $pid;
   $pid
};

The fork syscall is emulated using, well, the normal fork function. If we encountered an error (undefined $pid), we map (cough) that errno value to the MIPS ABI and return it, otherwise, we return the $pid value, which is, as we all remember, the newly created process id in the parent, and 0 in the child.

Let's continue with syscall 4003, read, which is more complex mostly because it has to access memory, and some more:

$syscall[4003] = sub { # read
   my ($fd, $rbuf, $count) = @_;
   strace "read (%d, %x, %d)", $fd, $rbuf, $count;

   $count = sysread $fh[$fd], my $buf, $count;

   memset $rbuf, $buf;

   defined $count ? $count : -errno2mips
};

Still reasonably simple, we use sysread to emulate the read syscall, and then use a helper function called memset to put the result buffer into the emulator memory. There is also a smoking gun here.

Anyways, memset isn't a function to be very proud of, could be optimised a lot (but who are we kidding here), but it does get the job done:

sub memset($$) {
   for (0 .. (length $_[1]) - 1) {
      my $i = $_[0] + $_;
      my $c = unpack "C", substr $_[1], $_, 1;

      my $s = (~$i & 3) << 3;
      $i = \$mem[$i >> ADDR_SHIFT][($i >> 2) & ADDR_MASK];
      $$i = $$i & ~(0xff << $s) | ($c << $s);
   }
}

Since we store 32 bit words and not octets, we do a pretty complicated dance of shifting and masking to only change the octet we need to. An obvious optimisation would be to store aligned groups of four characters as one word, but I'm not the person for this kind of optimisation.

There are also helper functions called memget, which reads memory, and memstr, which reads a 0-terminated string.
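The real memget and memstr are not shown here, but given the memory layout used by memset, they could plausibly look like the following sketch. Everything in it is an assumption made for illustration: the values of ADDR_SHIFT and ADDR_MASK (the real ones come from the CPU emulator), the helper names peek_octet and poke_octet, and the _sketch suffixes.

```perl
use strict;
use warnings;

# Assumed values; the real constants come from the CPU emulator.
sub ADDR_SHIFT () { 16 }                            # 64 KiB pages
sub ADDR_MASK  () { (1 << (ADDR_SHIFT - 2)) - 1 }   # word index in a page

my @mem;   # $mem[page][word], each word holding four big-endian octets

# read one octet (0 if the word was never written)
sub peek_octet {
   my ($addr) = @_;
   my $word = $mem[$addr >> ADDR_SHIFT][($addr >> 2) & ADDR_MASK] || 0;

   ($word >> ((~$addr & 3) << 3)) & 0xff
}

# write one octet, using the same shift-and-mask dance as memset
sub poke_octet {
   my ($addr, $octet) = @_;
   my $shift = (~$addr & 3) << 3;
   my $ref   = \$mem[$addr >> ADDR_SHIFT][($addr >> 2) & ADDR_MASK];

   $$ref = ($$ref || 0) & ~(0xff << $shift) | ($octet << $shift);
}

# hypothetical memget: $len octets starting at $addr, as a byte string
sub memget_sketch {
   my ($addr, $len) = @_;

   pack "C*", map peek_octet($addr + $_), 0 .. $len - 1
}

# hypothetical memstr: all octets up to (not including) the first 0 octet
sub memstr_sketch {
   my ($addr) = @_;
   my $str = "";

   while (my $c = peek_octet($addr++)) {
      $str .= chr $c;
   }

   $str
}
```

A 32 bit pointer stored in emulator memory can then be fetched with unpack "N" applied to a four octet memget result, which is how execve walks its argument vectors.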

Descriptor Woes

And the smoking gun (which the alert reader is already burning to get explained)? That's $fh[$fd], which translates file descriptor numbers to something that passes as a filehandle in Perl.

First of all, to seasoned Perl coders, this might look obvious, but at least I naturally gravitate towards using my IO::AIO module to do raw I/O, and feel right at home with using file descriptors (I avoid FILE * in C like the plague, too). But since my goal was to not use anything not part of a basic Perl installation, I somehow had to cope with perl built-ins.

Well, making an array mapping file descriptors (ints) to file descriptions (Perl file handles) sounds simple enough - until you realise that directories are rather special beasts.

Anyway, since file descriptors are pretty central, there is some code specifically dealing with it.

First, some variables to store file and directory handles:

my @fh;
my @dh; # directory-handles, HACK

Did I mention yet that I don't like directory handles?

Next we have some initialisation code that opens file descriptors 0, 1 and 2 (stdin, out, err) as files and maps them to the same file descriptors in our emulator, by stuffing the handles into $fh[0] and so on:

for my $fd (0..2) {
   open my $fh, "+<&", $fd
      or next;

   $fh[$fd] = $fh;
}

Note that we open copies of these file descriptors, so we can keep reading and writing to stdin/out/err even if the emulated program closes them. This isn't exactly correct, as we signal EOF to external programs only at program end, while it would normally be signalled when the program closes these, but it works well enough.

Next we map file descriptors 3 to 9 in a similar but not identical way:

for my $fd (3..9) {
   open my $fh, "+<&=", $fd
      or next;

   $fh[$fd] = $fh;
}

Here we don't open copies, because we don't need them preserved. If you wonder why just 3 to 9, and not more: the reason is a combination of laziness, and the fact that the POSIX shell only guarantees access to single-digit file descriptors, so this should be enough for shell uses.

In fact, there is another detail: I lied to you. We don't map 0 to 2, and then 3 to 9, we do it in the opposite order. The reason is that, when we first make copies of stdin/out/err, these copies will likely fall into the range 3 to 9, so we would generate a phantom file descriptor in the emulator. By first getting the actual fds 3 to 9, we avoid this problem.

Lastly, mostly for the EBADF check, we have a helper function called fd_valid, which, as the name implies, checks whether an fd is valid, that is, open:

sub fd_valid($) {
   !($_[0] & ~65535)
   && $fh[$_[0]]
}

The last simple helper function is newfd, which simply finds a new empty file descriptor slot and attaches the passed handle to it. POSIX requires that the lowest free fd is used when allocating a new fd.

sub newfd($) {
   my $fd = 0; # start at 0, so a free slot 0 is returned as 0, not undef
   ++$fd while $fh[$fd];
   $fh[$fd] = $_[0];
   $fd
}

The remaining helper is only used in a single case, execve, which is where it will be described.

Now let's use these helper functions in some syscalls, for example, open and close:

$syscall[4005] = sub { # open
   my ($path, $flags, $mode) = @_;
   $path = memstr $path;
   strace "open (%s, %x, %o)", $path, $flags, $mode;

   if (opendir my $dh, $path) {#d#
      open my $fh, "</dev/null" or die;
      my $fd = newfd $fh;
      $dh[$fd] = $dh;
      return $fd;
   }

   sysopen my $fh, $path, mips2omode $flags, $mode
      or return -errno2mips;

   newfd $fh
};
$syscall[4006] = sub { # close
   my ($fd) = @_;
   strace "close ($fd)";
   fd_valid $fd or return -EBADF;

   undef $dh[$fd];#d#
   (close delete $fh[$fd])
      ? 0 : -errno2mips
};

Yes, there really isn't a newline between those syscall definitions. open reads the path from memory, and then has to use either opendir or sysopen, depending on whether the path to open refers to a directory or not, relying on opendir failing for anything that is not a directory - in unix, there is little distinction between directories and files. In fact, in very old unices, some commands or libc functions directly manipulated directories (such as mkdir patching in the directory name directly, or mknod appending a new node to the directory, all by opening a directory for writing). In modern, or in fact not antique, unices, directories cannot be opened for writing anymore.

Ramblings of an old man

Which reminds me of an anecdote - I once wrote a remote updater daemon to maintain my university department's large collection of HP-UX, medium collection of IRIX and small collection of GNU/Linux boxes (the core part of that daemon is still in use in a commercial remote administration program).

When replacing a directory entry (usually a device node or normal file), it would unlink the entry to be updated if it couldn't be replaced atomically by rename.

Now, I knew that, as root, you could unlink directories on HP-UX, which would literally unlink it, that is, it would remove the name, but not clean up the directory, as (at least) the . entry within would still refer to it, leaving it allocated because of a reference count cycle. This is a similar feature to writable directories, and is fortunately no longer found in modern unix clones.

I was quite aware that my remote updater couldn't be used to update directories anyways, so I didn't give this any thought.

Not so my coworker, who tried it by updating the config file, but leaving it to me to try it out in the evening. Result: /usr and a few other vital parts were unlinked on our main file server, not a healthy thing. After the mandatory short panic wave, I realised it's not fatal - all I would need would be an fsck run, which would find the directory and offer to link it into lost+found, so I started to do so.

My bad luck didn't quite end there, as another coworker logged in from home, saw the file server wasn't running, and did what he always did, reboot it. During my fsck run.

I wasn't amused, especially as the box obviously didn't boot up without its /usr hierarchy, so I incanted the necessary magic to boot into single user mode, made sure rlogin/ssh was not running, did my fsck, found /usr and the other directories, and if I hadn't told anybody, nobody would have known of this small hiccup.

Back to close - this syscall is again rather trivial, it nukes the directory handle out of existence, closes the file handle and returns any error. I am sure it doesn't work properly on directory handles, but nothing complained so far.

Now to the most complex one of all, execve. Let's start with the very boring parts, reading all the strings from memory:

$syscall[4011] = sub { # execve
   my ($path, $argv, $envv) = @_;
   $path = memstr $path;

   for my $vec ($argv, $envv) {
      my $addr = $vec;
      $vec = [];
      while () {
         my $ptr = unpack "N", memget $addr, 4
            or last;
         push @$vec, memstr $ptr;
         $addr += 4;
      }
   }

As a quick look at the manpage shows, execve is passed a filename to exec, a vector of argument strings, and a vector of environment variables, all of which are read from memory and placed into the $path, $argv and $envv variables.

Fortunately, and as with other syscalls, we don't have to emulate execve by implementing its function, we merely have to interface to perl's exec built-in. For that, we replace %ENV with the variables from $envv,

   local %ENV;
   /([^=]*)=(.*)/s, $ENV{$1} = $2
      for @$envv;

twiddle the first argument (execve special cases the filename as an extra argument, while perl's exec special cases argv[0]),

   ($path, $argv->[0]) = ($argv->[0], $path);

invoke reify_fds,

   reify_fds;

invoke exec,

   exec {$path} @$argv;

and finish up:

   # not normally printed...
   strace "execve (%s, [%s], [%s])", $path, (join "|", @$argv), (join "|", @$envv);

   -errno2mips
};

The reify_fds step I skipped so quickly above is, in fact, a major complication. The fd numbers inside our emulator do not usually correspond to fd numbers in the unix process the emulator runs as. This impedance mismatch is fixed by some serious fd swapping in reify_fds, which I will show here, but leave up to you to explain:

sub reify_fds {
   my $top = 512;

   for my $fd (0..$#fh) {
      next unless $fh[$fd];

      POSIX::dup2 fileno $fh[$fd], $top + $fd;
      close $fh[$fd];
   }

   for my $fd (0..$#fh) {
      next unless $fh[$fd];

      POSIX::dup2 $top + $fd, $fd;
      POSIX::close $top + $fd;

      open my $fh, "+<&=", $fd
         or die;

      $fh[$fd] = $fh;
   }
}
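For readers who would rather see why two passes are needed than work it out, here is a small simulation. It is one possible reading of reify_fds, not part of the emulator: the kernel's fd table is faked with a plain hash (real fd number to file name), the function names are made up, and it assumes that no real fd at or above $top is in use.

```perl
use strict;
use warnings;

# $table: simulated kernel fd table, real fd number => name of open file
# $map:   emulated fd number => real fd currently backing it

sub renumber_two_phase {
   my ($table, $map, $top) = @_;
   my %t = %$table;

   # phase 1: park every handle at $top + emulated fd, close the original
   for my $efd (sort { $a <=> $b } keys %$map) {
      $t{$top + $efd} = $t{ $map->{$efd} };   # like dup2 $real, $top + $efd
      delete $t{ $map->{$efd} };              # like close $real
   }

   # phase 2: move every parked handle down to its final number
   for my $efd (sort { $a <=> $b } keys %$map) {
      $t{$efd} = delete $t{$top + $efd};      # like dup2 followed by close
   }

   \%t
}

# the tempting single pass, which can overwrite an fd that is still needed
sub renumber_naive {
   my ($table, $map) = @_;
   my %t = %$table;

   for my $efd (sort { $a <=> $b } keys %$map) {
      my $real = $map->{$efd};
      next if $real == $efd;

      $t{$efd} = $t{$real};   # dup2 silently replaces whatever was at $efd
      delete $t{$real};
   }

   \%t
}

# emulated fd 3 is backed by real fd 4 and vice versa
my %table = (3 => "file A", 4 => "file B");
my %map   = (3 => 4, 4 => 3);

my $good = renumber_two_phase(\%table, \%map, 512);
my $bad  = renumber_naive(\%table, \%map);
```

In this example $good ends up with fd 3 referring to "file B" and fd 4 to "file A", as the emulated program expects, while $bad has lost "file A" entirely because the first dup2 overwrote it.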

The Other Syscalls

Many other syscalls had to be implemented - the obvious ones are chdir, unlink, kill, dup and so on, followed by the less obvious, but still pretty standard, brk, ioctl and fcntl. The latter ones are a pain in the ass, but generally, all of them correspond to familiar POSIX calls.

When reading the code, you can look at them. Some of them are named a bit weird, for example newstat or newuname. These are simply newer variants of the same syscall - old libcs (or libcs from other operating systems) might call stat, newer libcs might call newstat. There are also "large file" interfaces, such as stat64.

Then there is pipe (or, specifically, sysm_pipe), which is special because it has two return values. Most unices kind of creep out because of that, and use various weird ad-hoc calling conventions (such as using an extra return register), causing much grief. MIPS uses such an ad-hoc calling convention, by using R3 to return the write fd:

$syscall[4042] = sub { # sysm_pipe
   strace "sysm_pipe ()";

   pipe my $r, my $w
      or return -errno2mips;

   $r = newfd $r;
   $w = newfd $w;

   strace "sysm_pipe ($r, $w)";

   $r3 = $w;
   $r
};

Some syscalls would be even more than just a pain in the ass, such as mmap (especially as not many programs call msync when they should). Fortunately, this implementation of mmap makes both dash and bash quite happy:

$syscall[4090] = sub {
   # SYSCALL_DEFINE6(mips_mmap, unsigned long, addr, unsigned long, len,
   #         unsigned long, prot, unsigned long, flags, unsigned long,
   #         fd, off_t, offset)

   strace "mips_mmap (%x, %d, %x, %x, %d, %d)", @_;
   -ENOSYS
};

Signal handling would be another medium pain in the ass, and consequently, I didn't bother yet.

The remaining weird syscall is getdents64, which is what readdir uses:

$syscall[4219] = sub { # getdents64
   my ($fd, $dirp, $count) = @_;
   strace "getdents64 (%d, %x, %d)", $fd, $dirp, $count;

   my $name = readdir $dh[$fd];

   return 0 unless defined $name;

   my $ino = -1;
   my $type = 0;

   my $entry = pack "NN NN n C Z*",
      $ino >> 32, $ino,
      0, 0, # offset
      (length $name) + 20,
      $type,
      $name;

   memset $dirp, $entry;
   length $entry
};

My version only returns a single directory entry per call, which is valid, but really quite inefficient. It does get the job done, though.
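To make the record layout behind that pack template a bit more tangible, the following sketch packs one entry and decodes it again. The function names are made up, the field layout is my reading of the template above (64 bit inode, 64 bit offset, 16 bit record length, 8 bit type, 0-terminated name, all big-endian), and a perl with 64 bit integers is assumed.

```perl
use strict;
use warnings;

# u64 d_ino, s64 d_off, u16 d_reclen, u8 d_type, char d_name[]
my $DIRENT64 = "NN NN n C Z*";

sub pack_dirent64 {
   my ($ino, $off, $type, $name) = @_;

   pack $DIRENT64,
      ($ino >> 32) & 0xffffffff, $ino & 0xffffffff,
      ($off >> 32) & 0xffffffff, $off & 0xffffffff,
      19 + length($name) + 1,   # 19 header octets, the name, a trailing 0
      $type,
      $name
}

sub unpack_dirent64 {
   my ($ino_hi, $ino_lo, $off_hi, $off_lo, $reclen, $type, $name)
      = unpack $DIRENT64, $_[0];

   return {
      ino    => $ino_hi * 2**32 + $ino_lo,
      off    => $off_hi * 2**32 + $off_lo,
      reclen => $reclen,
      type   => $type,
      name   => $name,
   };
}

my $entry = pack_dirent64 42, 0, 4, "etc";   # type 4 is DT_DIR
my $info  = unpack_dirent64 $entry;
```

As far as I know, a real kernel also rounds the record length up to a multiple of eight, which this sketch (like the syscall above) does not do.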

The End

And that's it for this part. If/when you ever study the source code, the above should give you a pretty good idea on how the Linux "emulation" works, and how few (or many) syscalls you need to make dash run perl's Configure script.

This article turned out to be surprisingly long. In fact, when I was writing this program and was mostly finished with the ELF loader and CPU emulator, and thus pretty much rejoicing at my success, it dawned on me that the Linux emulator was still mostly missing, which surprised me, but in a rather negative way - it turned out to be almost three times as much code as the CPU emulator. This probably explains why it only does the absolute minimum required: I wanted results, and I wanted results now.

The next part will look at the most interesting (to me) block, namely the actual CPU emulator, which isn't very advanced, but is fast enough for me to run configure scripts.