THE RANT /
THE SCHPLOG
Schmorp's POD Blog a.k.a. THE RANT
a.k.a. the blog that cannot decide on a name

This document was first published 2015-07-01 02:03:35, and last modified 2021-06-26 18:03:14.

Emulating Linux-MIPS in Perl - Part 3: MIPS CPU emulation

In this part about the Linux MIPS emulator, I will write about the actual CPU emulation.

I chose MIPS after surveying quite a few other architectures, and MIPS won because the original MIPS I (or, back then, just MIPS of course) architecture is extremely simple, even compared to it's successors, but still well supported by GCC. It also doesn't have a condition code register, which usually simplifies CPU emulation a lot.

The MIPS Architecture

It's surprisingly hard to find an actual nice instruction overview of MIPS I, but as you can see from the linked image, it comfortably fits on a single page.

Most instructions should be self-explaining (at least, if you know some assembly programming). The (slightly) weird features of the MIPS architecture include separate signed/unsigned addition/subtraction instructions (signed ones trap on overflow) and the extra "lo" and "hi" registers each with their own instruction, whose only purpose is to read the results of the previous multiplication or division instruction.

A typical artifact of RISC architectures (such as MIPS) is the "load upper immediate" (lui) instruction, which is used to incrementally load full 32 bit words into registers, which MIPS cannot generally do in a single instruction, and the existence of "load word left"/"load word right" (lwl/lwr) and associated store instructions, which can generate efficient code for unaligned accesses, if the alignment is known at compile time. Hmm, those instructions aren't in the linked reference - did I mention it's hard to get a complete and correct one?

I originally hoped that dash needs neither the annoying "left"/"right" instructions nor the add/sub-with-trap instructions, but I lost on the former.

Mapping MIPS on Perl

Now, Perl isn't exactly a low-level language. One consequence is that 32 bit arithmetic isn't actually easy to come by in Perl. Even though Perl doesn't go overboard and implements multi precision integers by default, thanks to Nicholas Clark, basic arithmetic in Perl tries to do the right thing by using full ranges of integers, unsigned integers and floating point values.

Anyway, let's jump right into the code.

use integer; # minor speed improvement, major correctness improvement for div/divu

Using the integer pragma helps a bit. Next we need some constants:

sub M32       (){ 0xffffffff }
sub B31       (){ 0x80000000 }
sub M16       (){     0xffff }
sub B15       (){     0x8000 }
sub M8        (){       0xff }
sub B7        (){       0x80 }

M32 ("mask 32") is simply a full 32 bit mask, used to "normalise" results of arithmetic expressions into the 32 bit range stored in registers. B31 (bit 31), the topmost bit, is mostly used to extract the sign bit. Same thing with M16 and B15 and M8/B7, just for 16 and 8 bit values, respectively.

The Registers

Looking at the registers we need to represent the CPU state once more shows how simple MIPS is:

my ($pc, $hi, $lo, $delay); # cpu state

my (
   $r0 , $r1 , $r2 , $r3 , $r4 , $r5 , $r6 , $r7 ,
   $r8 , $r9 , $r10, $r11, $r12, $r13, $r14, $r15,
   $r16, $r17, $r18, $r19, $r20, $r21, $r22, $r23,
   $r24, $r25, $r26, $r27, $r28, $r29, $r30, $r31,
);

my $insn;

The $pc is the program counter (or instruction pointer), $hi and $lo are the special hi/lo registers, and the 32 regular registers are simply called $r0 to $r31.

Which leaves $insn and $delay.

Detour: Delay Slots

One big feature of MIPS that it shared with many early RISC architectures is delay slots.

Basically, due to pipelining inside the CPU, the CPU has fetched and decoded the next instruction long before the previous instruction has finished execution.

If the previous instruction is a jump, the means that the CPU will execute the instruction after the jump, even when the jump is taken. Let's see an example:

  6f4:       2442ffff        addiu   v0,v0,-1
  6f8:       1440000a        bnez    v0,0x724
  6fc:       ae220000        sw      v0,0(s1)

The middle instruction here, bnez is a branch that jumps to address 0x724 (somewhere "far" away) if register v0 is not zero. However, regardless of whether the branch is taken or not, the MIPS CPU will always execute all three instructions. A minor complication is that the branch and the delay slot instruction must execute without interruption, too - an interrupt between them could wreak havoc, as the interrupt would return to the delay slot instruction and execute the wrong branch.

Detour: Assembler Macro Magic

The (cough) alert reader will, without doubt, by now have stumbled over two unrelated problems: v0 is not in the list of MIPS registers, and bnez is not a MIPS instruction!

Well, thats the magic of assembly macros: MIPS has a number of aliases for registers, for example, v0 is simply and alias for r2. Most registers have such aliases (there are v1, a0, t0, s0, k0, gp, sp, fp and ra registers). Almost all of these are just conventions, which presumably makes it easier to read or write assembly instructions. To me, it makes it harder to read, and when you work on an emulator, it will quickly piss you off.

Similarly, bnez v0,label is really the bne v0,r0,label instruction, which jumps if v0 is not equal to r0. r0 is always zero in MIPS, so this really jumps when v0 is not equal to zero. Again, makes it easier to code assembly (this time, I think, for real), but can be annoying when you read objdump output.

Detour: Delay Slots (Again!)

Anyway, coming back to delay slots, the MIPS CPU also has delay slots for all load instructions. That means something slightly different, namely that the value loaded by the load instruction is not available in the next instruction, so the compiler needs to find something else to do in the meanwhile.

At the time MIPS was invented, it was assumed that compilers will have no trouble finding something useful for all these delay slots to fill, and as expected, the result is that even today, the most common instruction in MIPS code is nop. It is so common in delay slots, that special casing it in my emulator gives a noteworthy speed improvement. Witness this randomly selected code sample from the dash binary:

  700:       8ea20000        lw      v0,0(s5)
  704:       00000000        nop
  708:       10400006        beqz    v0,0x724
  70c:       00000000        nop
  710:       8f9986c4        lw      t9,-31036(gp)
  714:       00000000        nop
  718:       0320f809        jalr    t9
  71c:       00000000        nop
  720:       8fbc0010        lw      gp,16(sp)
  724:       8e420000        lw      v0,0(s2)
  728:       00000000        nop
  72c:       8c430004        lw      v1,4(v0)
  730:       24420004        addiu   v0,v0,4

Load, nop, branch, nop, load, nop, branch, nop, load, load, nop... Sigh.

The idea of these delay slots was so good that it became the whole theme for the architecture (MIPS stands for "Microprocessor without Interlocked Pipeline Stages", which basically means the CPU will not pause just because an instruction needs results from the previous instruction too early).

In fact, the idea was so good that it mostly vanished with MIPS II and later CPU architectures. Compilers have trouble filling delay slots, and deep pipelines and slow RAM means that we would either need more and more of these delay slots, or more intelligent CPUs. The world chose the latter.

So what about $insn and $delay? The $insn variable is declared globally simply so that it is in scope of everything that needs it, and contains the currently executing instruction. $delay implements the delay slot - basically, the emulator always fetches the next instruction from the $delay variable, and then fills $delay from the next memory cell. This ensures that a branch that changes the $pc will still execute the directly following instruction, because it is already inside $prev.

Detour: Caching, Debugging... and Deparsing

The way the emulator works, as we will see shortly, is by compiling instructions into perl expressions, and then compiling these once.

That means each distinct 32 bit instruction value might be compiled into a separate perl sub. In practice, there are far fewer than 2**32 instructions used, so this pays off well.

For this, the emulator has three hashes:

my %insn2src; # insn to perl source
my %insn2sub; # insn to compiled perl sub
my %sub2insn; # sub+0 to insn code

The first, %insn2src, maps instructions (32 bit numbers) to perl expressions that emulate them. The next, %insn2sub contains the compiled versions of them, and %sub2insn, does the reverse thing - it maps addresses of these code references back to the instruction code.

That's of course only useful for debugging, and in fact, that's what it's used for, namely for printing instructions as they are executed. In fact, the emulator doesn't use these hashes in the expected way - the CPU emulator really only fills in %insn2sub, and the trace code then deparses the perl expression. Since that is super expensive, the result of deparsing is cached:

sub cpu_pr {
   printf " 0 %8x=zr %8x=at %8x=v0 %8x=v1 %8x=a0 %8x=a1 %8x=a2 %8x=a3\n",
              $r0  , $r1  , $r2  , $r3  , $r4  , $r5 ,  $r6  , $r7;
   printf " 8 %8x=t0 %8x=t1 %8x=t2 %8x=t3 %8x=t4 %8x=t5 %8x=t6 %8x=t7\n",
              $r8 ,  $r9 ,  $r10,  $r11,  $r12,  $r13,  $r14,  $r15;
   printf "16 %8x=s0 %8x=s1 %8x=s2 %8x=s3 %8x=s4 %8x=s4 %8x=s6 %8x=s6\n",
              $r16,  $r17,  $r18,  $r19,  $r20,  $r21,  $r22,  $r23;
   printf "24 %8x=t8 %8x=t9 %8x=k0 %8x=k1 %8x=gp %8x=sp %8x=fp %8x=ra\n",
              $r24,  $r25,  $r26,  $r27,  $r28,  $r29,  $r30,  $r31;

   my $code = $sub2insn{$insn+0};
   my $src  = $insn2src{$code} ||= deparse $insn;
   $src =~ s/\s+/ /g;
   printf "%x: %08x (%s)\n", $pc * 4, $code, $src;
}

Typical output would be:

4191d4: 00a21021 ({ $r2 = $r5 + $r2 & 4294967295; })
 0        0=zr fffffff8=at   444724=v0        0=v1 443d96=a0   444708=a1    300=a2 429e5a=a3
 8   429e6e=t0        0=t1   444708=t2   429e5a=t3 429e50=t4        0=t5      0=t6      0=t7
16   443d90=s0        0=s1   44469d=s2   426661=s3 43fa60=s4   426678=s4 426670=s6      7=s6
24 deadbeef=t8   41bf00=t9 deadbeef=k0 deadbeef=k1 447a90=gp f00efe78=sp      a=fp 4191e4=ra
4191d8: 8c440000 ({ $i = $r2 + 0; $r4 = $mem[$i >> 16][$i >> 2 & 16383]; })
 0        0=zr fffffff8=at   444724=v0        0=v1 429c2b=a0   444708=a1    300=a2 429e5a=a3
 8   429e6e=t0        0=t1   444708=t2   429e5a=t3 429e50=t4        0=t5      0=t6      0=t7
16   443d90=s0        0=s1   44469d=s2   426661=s3 43fa60=s4   426678=s4 426670=s6      7=s6
24 deadbeef=t8   41bf00=t9 deadbeef=k0 deadbeef=k1 447a90=gp f00efe78=sp      a=fp 4191e4=ra
4191dc: afa50064 ({ $i = $r29 + 100; $mem[$i >> 16][$i >> 2 & 16383] = $r5; })
 0        0=zr fffffff8=at   444724=v0        0=v1 429c2b=a0   444708=a1    300=a2 429e5a=a3
 8   429e6e=t0        0=t1   444708=t2   429e5a=t3 429e50=t4        0=t5      0=t6      0=t7
16   443d90=s0        0=s1   44469d=s2   426661=s3 43fa60=s4   426678=s4 426670=s6      7=s6
24 deadbeef=t8   41bf00=t9 deadbeef=k0 deadbeef=k1 447a90=gp f00efe78=sp      a=fp 4191e4=ra
4191e0: 0320f809 ({ $r31 = $pc << 2; $pc = $r25 >> 2; })
 0        0=zr fffffff8=at   444724=v0        0=v1 429c2b=a0   444708=a1    300=a2 429e5a=a3
 8   429e6e=t0        0=t1   444708=t2   429e5a=t3 429e50=t4        0=t5      0=t6      0=t7
16   443d90=s0        0=s1   44469d=s2   426661=s3 43fa60=s4   426678=s4 426670=s6      7=s6
24 deadbeef=t8   41bf00=t9 deadbeef=k0 deadbeef=k1 447a90=gp f00efe78=sp      a=fp 4191e4=ra
41bf00: afb90068 ({ $i = $r29 + 104; $mem[$i >> 16][$i >> 2 & 16383] = $r25; })
 0        0=zr fffffff8=at   444724=v0        0=v1 429c2b=a0   444708=a1    300=a2 429e5a=a3
 8   429e6e=t0        0=t1   444708=t2   429e5a=t3 429e50=t4        0=t5      0=t6      0=t7
16   443d90=s0        0=s1   44469d=s2   426661=s3 43fa60=s4   426678=s4 426670=s6      7=s6
24 deadbeef=t8   41bf00=t9 deadbeef=k0 deadbeef=k1 447a90=gp f00efe78=sp      a=fp 4191e4=ra
41bf04: 10000001 ({ $pc += 0; })

Here, between all these register dumps, you can see the actual "instructions" as perl sees them. Here are some more instructions:

41602c: 8fbc0010 ({ $i = $r29 + 16; $r28 = $mem[$i >> 16][$i >> 2 & 16383]; })
416030: 10400006 ({ $pc += 5 if $r2 == 0; })
416034: 02002021 ({ $r4 = $r16 + 0 & 4294967295; })
416038: 8f998498 ({ $i = $r28 + -31592; $r25 = $mem[$i >> 16][$i >> 2 & 16383]; })
41603c: 00000000 ({ })
416040: 0411fcb3 ({ $r31 = $pc << 2, $pc += -846; })
41530c: 24050009 ({ $r5 = 9; })
415310: 3c1c0044 ({ $r28 = 4456448; })
415314: 27bdffd0 ({ $r29 = $r29 + -48 & 4294967295; })
415318: 279c7a90 ({ $r28 = $r28 + 31376 & 4294967295; })
41531c: afbf002c ({ $i = $r29 + 44; $mem[$i >> 16][$i >> 2 & 16383] = $r31; })
415320: afb30024 ({ $i = $r29 + 36; $mem[$i >> 16][$i >> 2 & 16383] = $r19; })
415324: afb20020 ({ $i = $r29 + 32; $mem[$i >> 16][$i >> 2 & 16383] = $r18; })
415328: 00809021 ({ $r18 = $r4 + 0 & 4294967295; })
41532c: afb1001c ({ $i = $r29 + 28; $mem[$i >> 16][$i >> 2 & 16383] = $r17; })
415330: afb00018 ({ $i = $r29 + 24; $mem[$i >> 16][$i >> 2 & 16383] = $r16; })
415334: 00a08021 ({ $r16 = $r5 + 0 & 4294967295; })
415338: afbc0010 ({ $i = $r29 + 16; $mem[$i >> 16][$i >> 2 & 16383] = $r28; })
41533c: 0c105340 ({ $r31 = $pc << 2; $pc = $pc & 1006632960 | 1069888; })
414d00: afb40028 ({ $i = $r29 + 40; $mem[$i >> 16][$i >> 2 & 16383] = $r20; })

The first column is the instruction address, the second is the instruction value, and the rest is the actual perl code that corresponds to that number.

All that is now needed is the code that generates these perl expression, and a loop to call it in a loop.

But first...

Deparsing

Deparsing a perl code reference into source doesn't really have anything to do with the emulator, but I am sure some people will find it interesting to see how it's done, so here is how I do it:

{
   our ($hint_bits, $warning_bits, %hint_hash);

   BEGIN {
     ($hint_bits, $warning_bits, %hint_hash) = ($^H, ${^WARNING_BITS}, %^H);
   }

   sub deparse {
      require B::Deparse;
      my $deparser = new B::Deparse;
      $deparser->ambient_pragmas (
         hint_bits    => $hint_bits,
         warning_bits => $warning_bits,
         '$['         => $[+0,
         integer      => 1,
         '%^H'        => \%hint_hash,
      );
      $deparser->coderef2text ($_[0])
   }
}

Basically, it calls the coderef2text method of the B::Deparse module, with all the rest of the code dealing with warning flags and similar nuisances.

The MIPS Instructions

Special Instructions

The next part of the emulator defines the MIPS instructions. Or rather, the Perl expressions for them. Or rather, templates for the Perl expressions. It's easier to see that in actual code:

my ($s, $t, $i); # "global" for speed

Yeah, ok, that's just some variables declared globally for speed.

my @special = ('die "special ", $insn & 63, " not supported"') x 64;

And, yeah, this wasn't so MIPS-y either. There are two special opcode forms, "special" and "register immediate" instructions. And the @special array contains all the 64 possible "special" instructions. Initially, these die especially well, or in other words, since not all of them are actually defined, the array is filled with die calls. The actual instructions are then assigned to the array:

$special[ 0] = "RD = (RT                       << SA       ) & M32"; # sll

This is special instruction #0, "shift left logical" (sll, basically an unsigned left shift).

The string doesn't look so obviously like Perl code, that's because RD, RT and SA are replaced by the relevant bits/values from the instruction (and M32 is just a constant defined earlier).

In this case, RD is the "d register" from the instruction, which probably stands for "destination register", RS is the "s (or source) register" and SA is the "shift amount". All in all, these special tokens exist:

   IMM  # 16 bit signed immediate (lower 16 bits of insn)
   IMMU # 16 bit unsigned immediate (lower 16 bits of insn)

   SA   # shift amount (bits 6..10)

   RS   # s register (bits 21..25)
   RT   # t register (bits 16..20)
   RD   # d register (bits 11..15)

With this knowledge, you should be able to understand how these templates work:

$special[ 3] = 'RD = ((RT - ((RT & B31) << 1)) >> SA       ) & M32'; # sra
$special[ 4] = 'RD =  (RT                      << (RS & 31)) & M32'; # sllv

Well, maybe it's not that obvious - the emulator stores values as unsigned 32 bit numbers. Some instructions, such as the "shift right arithmetical" (sra) need signed numbers. The expression ($value & MASK) - (($value & SIGNBIT) << 1) converts an unsigned into a signed value. Quickly. Or so I hope. I even used GNU superopt to find possible instruction sequences and benchmarked them. The expression won out against all others, including use of pack/unpack.

Seriously, though, if you know a faster way, let me know!

Anyways, let's look at more pleasant instructions:

$special[16] = 'RD = $hi'; # mfhi
$special[17] = '$hi = RS'; # mthi
$special[18] = 'RD = $lo'; # mflo
$special[19] = '$lo = RS'; # mtlo

mfhi is "move from hi", mtlo is "move to lo", and with this, it should be easy to see what these do.

$special[ 8] = '               $pc = RS >> 2'; # jr
$special[ 9] = 'RD = $pc << 2; $pc = RS >> 2'; # jalr

jr jumps to an address given in a register, jalr does the same, but stores the return address in another register (another RISC-y thing, that is, not having special stack support instructions).

The emulator stores the "word index" of the instruction in $pc, not the byte address, which is why assignments and reads to/from the $pc need an extra shift.

Addition and subtraction are easy, too:

$special[32] = 'die "add instruction unsupported"';
$special[33] = "RD = (RS + RT) & M32"; # addu
$special[34] = 'die "sub instruction unsupported"';
$special[35] = "RD = (RS - RT) & M32"; # subu
#$special[32] = $special[33]; # add # buggy, does not trap
#$special[34] = $special[35]; # sub # buggy, does not trap

As you can see, the addi/subi instructions can "trap", in fact, always trap, because compilers do not generate them. And as you can also see, if I ever meet this naughty compiler which does use them, I have a plan B, too.

Multiplication is a bit more complicated, but not overly so:

$special[24] =  '# mult
   $lo = (RS - ((RS & B31) << 1))
       * (RT - ((RT & B31) << 1));
   $hi = ($lo >> 32) & M32;
   $lo &= M32;
';
$special[25] = ' # multu
   $lo = RS * RT;
   $hi = $lo >> 32;
   $lo &= M32;
';

The signed multiplication has to do a bit more work to convert unsigned to signed, and we have to split the result into the lower and higher 32 bit values, but this is all straightforward.

Logical operations are again very simple:

$special[36] = "RD =  RS & RT       "; # and
$special[37] = "RD =  RS | RT       "; # or
$special[38] = "RD =  RS ^ RT       "; # xor
$special[39] = "RD = (RS | RT) ^ M32"; # nor

Ah yeah, MIPS has a nor instruction. That is clearly a very weird feature, but allows the CPU to do something useful with three registers, while a two-operand not instruction would waste the opportunity.

The last interesting special instructions are sys and break:

$special[12] = "sys";
$special[13] = "die sprintf \"BREAK (%08x)\\\n\", $insn"; # break

sys triggers a syscall (see part 2), and break is used for breakpoints by debuggers and similar applications, none of which concern this emulator.

Register Immediate Instructions

In a similar way to "special" instructions, the register immediate ones are declared:

my @regimm = ('die "regimm $insn not supported"') x 32;

$regimm[ 0] = '                    $pc += IMM - 1  if     RS & 0x80000000'; # bltz
$regimm[16] = '($r31 = $pc << 2), ($pc += IMM - 1) if     RS & 0x80000000'; # bltzal
$regimm[ 1] = '                    $pc += IMM - 1  unless RS & 0x80000000'; # bgez
$regimm[17] = '($r31 = $pc << 2), ($pc += IMM - 1) unless RS & 0x80000000'; # bgezal bal

That's all of them, there really aren't many, and all are branch instructions. Two of them just branch, the other two also store the return address. All of them directly test the s registers sign bit.

Not being a hardware guy, I don't know for sure why these have their own encoding group, but who cares.

Finally, the actual Opcodes

Now we are ready to consume the actual opcode table, or rather array, which maps opcode numbers (bits 26..31 in the instruction) to perl code. The opcodes are not actually strings, but code references that return strings. To see why, consider the first two opcodes, which actually represent the special and register immediate instructions:

$opcode[ 0] = sub { $special[$insn & 63] }; # special
$opcode[ 1] = sub { $regimm[($insn >> 16) & 31] }; # regimm

The subs decode the $insn further and return the corresponding special or register immediate string.

The remaining opcodes are simpler:

$opcode[ 2] = sub { '                 $pc = ($pc & 0x3c000000) | (' . $insn . ' & 0x03ffffff)' }; # j
$opcode[ 3] = sub { '$r31 = $pc << 2; $pc = ($pc & 0x3c000000) | (' . $insn . ' & 0x03ffffff)' }; # jal
$opcode[ 4] = sub { '($pc += IMM - 1) if RS == RT' }; # beq beqz b
$opcode[ 5] = sub { '($pc += IMM - 1) if RS != RT' }; # bne bnez
$opcode[ 6] = sub { '($pc += IMM - 1) if !RS || RS >= 0x80000000' }; # blez
$opcode[ 7] = sub { '($pc += IMM - 1) if  RS && RS <  0x80000000' }; # bgtz

These are all conditions jumps testing different things. And yes, the j instruction really just replaces the lower 28 (not 26!) bits of the instruction pointer.

$opcode[ 8] = sub { die "addi instruction unsupported" }; # addi
$opcode[ 9] = sub { "RT = (RS + IMM) & M32" }; # addiu

Unlike their "special" counterparts which work on registers, these add instructions add a constant signed 16 bit value that is encoded in the instruction. Since the constant is signed there isn't a need for a subiu instruction (I am sure there is an assembly macro for it, though).

Shifts and logical "special" instructions have immediate counterparts, too:

$opcode[10] = sub { 'RT = ((RS - ((RS & B31) << 1))) <  IMM       ' }; # slti
$opcode[11] = sub { 'RT =   RS                       < (IMM & M32)' }; # sltiu
$opcode[12] = sub { 'RT = RS & IMMU' }; # andi
$opcode[13] = sub { 'RT = RS | IMMU' }; # ori
$opcode[14] = sub { 'RT = RS ^ IMMU' }; # xori

And lui really just loads a 32 bit constant into a register:

$opcode[15] = sub { 'RT = IMMU << 16' }; # lui

The fun starts with the "load byte/half-word/word" (lb/lh/lw) instructions, which have to access the @mem array that was explained in detail in part 2:

$opcode[32] = sub {' # lb
   $i = RS + IMM;
   $s = (~$i & 3) << 3;
   $s = $mem[$i >> ADDR_SHIFT][($i >> 2) & ADDR_MASK] >> $s;
   RT = (($s &  M8) - (($s &  B7) << 1)) & M32;
'};
$opcode[33] = sub {' # lh
   $i = RS + IMM;
   $s = (~$i & 2) << 3;
   $s = $mem[$i >> ADDR_SHIFT][($i >> 2) & ADDR_MASK] >> $s;
   RT = (($s & M16) - (($s & B15) << 1)) & M32;
'};
$opcode[35] = sub {' # lw
   $i = RS + IMM;
   RT = $mem[$i >> ADDR_SHIFT][($i >> 2) & ADDR_MASK];
'};

Since our memory representation is based on words, the byte and half-word instructions need to do some shifting, but the "load word" instruction is simple (and most common).

I'll skip the unsigned variants and the store variants of these, and go directly to "load word left/right" (lwl/lwr):

$opcode[34] = sub {' # lwl
   $i = RS + IMM;
   $s = ($i & 3) << 3;
   $i = $mem[$i >> ADDR_SHIFT][($i >> 2) & ADDR_MASK];
   RT = (RT & (M32 >> (32 - $s))) | ($i << $s & M32);
'};
$opcode[38] = sub {' # lwr
   $i = RS + IMM;
   $s = (($i & 3) + 1) << 3;
   $i = $mem[$i >> ADDR_SHIFT][($i >> 2) & ADDR_MASK];
   RT = (RT >> $s << $s) | ($i >> (32 - $s));
'};

They look so simple, but please don't ask how long it took me to get them right. And absolutely do not ask what they do, get a MIPS manual and find out yourself!

That leaves one instruction:

# 0x7c03e83b rdhwr $3, $29 ($29=tls)- emulated by kernel normally, for thread support

That one was a bit of a puzzle - this instruction is understood and generated by binutils, but it doesn't actually exist in MIPS. Basically, this loads the thread local storage area pointer into a register, and when the CPU sees it, it generates an illegal instruction, after which the kernel emulates it.

As a result of not emulating this instruction, the emulator does not actually support threads, because I didn't need it, and that saved a lot of code.

Finally, the CPU

Finally we are ready to see the inner workings of the CPU emulator. Let's start with the most important instruction of them all:

my $NOP = sub { };

The nop is so important that it gets the royal treatment, special case of special case of special case treatment.

Ok, now the real thing:

sub get_insn {
   $insn2sub{$_[0]} ||= do {
      my $old_insn = $insn;

      $insn = $_[0]*1;
      my $src = &{ $opcode[$insn >> 26] };

      $src =~ s/\bIMM\b/($insn & M16) - (($insn & B15) << 1)/ge; # 16 bit signed immediate
      $src =~ s/\bIMMU\b/$insn & M16/ge;                         # 16 bit unsigned immediate

      $src =~ s/\bSA\b/($insn >> 6) & 31/ge;                     # shift amount

      $src =~ s/\bRS\b/'$r' . (($insn >> 21) & 31)/ge;           # s register
      $src =~ s/\bRT\b/'$r' . (($insn >> 16) & 31)/ge;           # t register
      $src =~ s/\bRD\b/'$r' . (($insn >> 11) & 31)/ge;           # d register

      $src =~ s/\$r0 = //g; # optimize away r0 assignments
      $src =~ s/\$r0\b/0/g; # optimise away r0 access

      my $cb = ($insn ? eval "sub { $src }" : $NOP)
         || die "$insn<$src>: $@";

      $sub2insn{$cb+0} = $insn;

      $insn = $old_insn;

      $cb
   }
}

This function contains all the magic needed to convert an instruction number into a code reference implementing it. Let's step through:

sub get_insn {
   $insn2sub{$_[0]} ||= do { ...}
}

This is a common idiom (in my code). The do block computes something the slow way, and the $insn2sub{$_[0]} ||= then builds a cache, so repeated calls are simply a hash lookup and a boolean test.

Since the process is cached, the actual instruction generating code isn't so speed critical:

      my $old_insn = $insn;

For deep reasons, the code saves and restores the $insn variable.

      $insn = $_[0]*1;

It then stores the instruction to be generated in $insn. The *1 simply makes sure it is an actual integer (under use integer).

      my $src = &{ $opcode[$insn >> 26] };

This extracts the opcode index from the instruction and calls one of the opcode subs from earlier.

      $src =~ s/\bIMM\b/($insn & M16) - (($insn & B15) << 1)/ge; # 16 bit signed immediate
      $src =~ s/\bIMMU\b/$insn & M16/ge;                         # 16 bit unsigned immediate

      $src =~ s/\bSA\b/($insn >> 6) & 31/ge;                     # shift amount

      $src =~ s/\bRS\b/'$r' . (($insn >> 21) & 31)/ge;           # s register
      $src =~ s/\bRT\b/'$r' . (($insn >> 16) & 31)/ge;           # t register
      $src =~ s/\bRD\b/'$r' . (($insn >> 11) & 31)/ge;           # d register

These simply use regex search and replacements to replace the IMM, SA, RS etc. identifiers in the instruction string by their actual values.

      $src =~ s/\$r0 = //g; # optimize away r0 assignments
      $src =~ s/\$r0\b/0/g; # optimise away r0 access

And this is an optimisation - $r0 is always zero, so it gets replaced by an actual 0.

      my $cb = ($insn ? eval "sub { $src }" : $NOP)
         || die "$insn<$src>: $@";

This, finally wraps the Perl expression into a sub { } and compiles it (it also special cases the nop instruction). In theory, it could store the string in %insn2src, but that would be extra overhead and we can always deparse it later for fun and profit :)

      $sub2insn{$cb+0} = $insn;

It does store the mapping from code reference to instruction, though. If you use a reference (any reference) in numerical context, you get its address. This isn't usually very useful, except that it is shorter and faster to generate then the stringified version (e.g. ODE(0x9ed218)) and is unique, as long as the value is in program scope.

      $insn = $old_insn;

      $cb

The rest merely restores $insn and returns the function reference (which is then cached in y%insn2sub).

Now we are ready to look at the actual CPU loop, except, cpu_reset is first:

sub cpu_reset($) {
   $pc = $_[0] >> 2;

   $r0  = 0;

          $r1  = $r2  = $r3  = $r4  = $r5  = $r6  = $r7  =
   $r8  = $r9  = $r10 = $r11 = $r12 = $r13 = $r14 = $r15 =
   $r16 = $r17 = $r18 = $r19 = $r20 = $r21 = $r22 = $r23 =
   $r24 = $r25 = $r26 = $r27 = $r28 =        $r30 =
   $hi  = $lo  = 0xdeadbeef;

   $r2  = 0;
   $r29 = STACK;
   $r31 = 0;

   $delay = $NOP; # start with a nop
}

All the registers have an initial value of 0xdeadbeef, which is quite useful for debugging (in the trace output earlier you can see some registers still have this value).

Also, don't be angry when I kept talking about the actual CPU loop while keeping it for the end of this article, but the above really is in the order the program is written in, with the basics first and stuff that depends on it later.

But now, finally, here it is, the actual "CPU":

sub cpu_run() {
   while () {
      $insn = $delay;

      $delay = $mem[$pc >> (ADDR_SHIFT - 2)][$pc & ADDR_MASK];
      unless (ref $delay) {
         $delay =
         $mem[$pc >> (ADDR_SHIFT - 2)][$pc & ADDR_MASK] =
            get_insn $delay;
      }
      ++$pc;

      &$insn;
   }
}

This function is a bit simplified (it optionally does some benchmarking and of course prints every instruction using cpu_pr), but during normal usage, this is the actual code being executed.

The basic skeleton is this, and deals with inastruction fetching and the delay slot:

   while () {
      $insn = $delay;

      $delay = $mem[$pc >> (ADDR_SHIFT - 2)][$pc & ADDR_MASK];

      ...

      ++$pc;

      &$insn;
   }

The loop copies the $delay slot into $insn and then fetches the next instruction from memory. Kind of like the real pipeline in the CPU would do. It then magically comes up with some code reference for the instruction, increments the program counter and executes the instruction. It has to be this order, because this is basically how the real CPU does it.

The code that gives us the code reference is the only remaining trick up my sleeve:

   unless (ref $delay) {
      $delay =
      $mem[$pc >> (ADDR_SHIFT - 2)][$pc & ADDR_MASK] =
         get_insn $delay;
   }

This doesn't actually compile the $insn, it compiles the $delay slot. This is the reason why get_insn saves and restores the $insn variable (really, the whole thing is a design bug).

More importantly, it doesn't just place the code reference in $delay, it also stores it into the main memory.

This means that, as instructions get executed, the emulator replaces their numerical value by code references, which obviously saves time and hash lookups (hash lookups are quite slow).

In theory, this is a bug, as the MIPS program could read the instruction memory and suddenly have code references in its registers, but it works quite fine in practice, as GCC is really well-behaved in this regard.

And that's it - the CPU emulator itself is very small - it mostly consists of declaring variables for registers and strings as opcode templates. The whole emulation is done with two functions - get_insn, which converts instructions into code references, and cpu_run, which basically calls it in a loop and emulates the pipeline.

(Almost) Last Words

There are a myriad of optimisation opportunities (some of which I have used in my vt100/102/132 hardware emulator that I will describe in another series) that I have skipped - I am sure with a lot of effort one could make this twice or three times faster (mostly by grouping extended basic blocks into subs), but that would be a rather pointless exercise, as an implementation in C would be simpler and much, much faster.

But it wouldn't be pure perl!

The Alert Reader

Alright, there are some minor details missing. And in fact, there hasn't been a download link in sight. That's because I want to keep that for the rather much shorter part 4. Stay tuned ... :)