That time I 10x'd a TI-84 emulator's speed by replacing a switch-case

There’s a javascript emulator for the TI-83+, TI-84+, and TI-84+CSE calculators called jsTIfied, written by Christopher Mitchell, founder of the calculator fan site cemetech.net. There aren’t a lot of reasons to use it over a native option if you’ve got one available, but if you don’t, it’s pretty great. I got interested because it was the first emulator to support the TI-84+CSE when that calculator was released in the early 2010s. The CSE was exciting because it retrofitted a 320x240 color display onto the hardware platform of the 84+SE, so all the other hardware and OS interfaces stayed the same except for graphics. I wanted to be one of the first game developers for the CSE, but developing for the calculator without an emulator and debugger is pretty painful, so I tried out jsTIfied with my older calculator ROMs to get a feel for it.

jsTIfied had a problem though: it was too damn slow. These calculators use a z80 processor, which is pretty simple to emulate. But jsTIfied couldn’t even handle emulating the 6MHz calculator models at full speed, and the CSE’s processor was clocked at 15MHz, so it was even worse. jsTIfied is closed source, but I decided that I was going to try and do something about it anyway.

Of course the first thing you want to do when you’re debugging a web app is go to the profiler. There was just one hotspot that dwarfed all the others, and that was the instruction decode and execution switch-block. That’s the sort of thing you’d expect since these calculators don’t have any other complicated hardware to emulate like pixel processing units or audio chips, but it seemed a bit fishy. Yeah javascript is slow, but computers made in the early 2000s could have handled emulating this calculator at full speed with native code. Javascript overhead wasn’t enough to explain it.

So I started digging into the actual code. I had to unminify it first, but I was used to dealing with obfuscated code from Minecraft. The instruction decoder was one giant switch block, with additional nested switch blocks for multi-byte instructions. In most languages this is just fine, since the compiler will turn it into jump tables, so why wasn’t I seeing jump-table performance here? I had a bit of an obsession with javascript performance at the time thanks to my WebGL experiments, and I had already learned that JS engines back then wouldn’t optimize functions above a certain size. Knowing this, I split all the nested switch statements into their own functions and made the parent switch call them, to see if that would take care of things.
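That intermediate refactor looked roughly like this. This is a sketch from memory, not jsTIfied’s real code: the handlers just return mnemonic strings so the shape is visible, and only a couple of opcodes are filled in.

```javascript
// Each prefix's inner switch moves into its own function, so the JIT sees
// several small functions instead of one function over its size limit.

function executeDD(opcode) {            // handles 0xDD-prefixed opcodes
  switch (opcode) {
    case 0x21: return 'ld ix,nn';       // real z80: 0xDD 0x21 is LD IX,nn
    case 0x23: return 'inc ix';         // real z80: 0xDD 0x23 is INC IX
    default:   return 'nop';
  }
}

function execute(opcode, next) {
  switch (opcode) {
    case 0x00: return 'nop';
    case 0xDD: return executeDD(next);  // delegate instead of nesting
    default:   return 'nop';
  }
}
```

The parent switch is still huge, but each nested switch now lives in a function small enough to be considered for optimization.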

Now I needed a way to actually load my code. I quickly spun up a web server on my computer that would pass requests through to the upstream website, but intercept the request for the emulator engine and return my modified code instead. I edited /etc/hosts to point the upstream domain at 127.0.0.1, and I had my code loaded.

Unfortunately, I saw basically no speedup. I was fairly sure at this point that I was within the function size limits, so something else had to be going on. I went digging for low-level details on how javascript engines implement switch statements. Eventually I found a Stack Overflow post from someone trying to do exactly what I was doing: optimize a (different) z80 emulator. That’s when I saw a deeply disturbing comment, with sources cited directly from Chrome’s V8 source code:

@LGB actually in V8 (JS engine used by google chrome) you need to jump through a lot of hoops to get switch case optimized: All the cases must be of same type. All the cases must either be string literals or 31-bit signed integer literals. And there must be less than 128 cases. And even after all those hoops, all you get is what you would have gotten with if-elses anyway (I.E. no jump tables or sth like that). True story.

Check out the post for yourself here https://stackoverflow.com/questions/18830626/should-i-use-big-switch-statements-in-javascript-without-performance-problems#comment27798374_18830724

This is not what you want to hear when you’re looking at an emulator full of switch blocks, every one of which has more than 128 cases. My refactor had let the optimizer take a crack at my functions, but when it got to the switch blocks it said “thanks, but no thanks, I’m good”. That left only one option: wrap every case of every switch in a function, dump them all into an array, and do the lookups myself. So I wrote a script to do just that.

The original code would look something like this:

switch (read_byte(z8.r2[Regs2_PC]++)) {
  case 0x00: // nop
    break;
  case 0x01: // do something?
    break;
  // ...
  case 0xDD: // index register prefix
    switch (read_byte(z8.r2[Regs2_PC]++)) {
      case 0x00: // do something
        break;
      case 0x01: // do something
        break;
      // ...
      case 0xFF:
        break;
    }
    break;
  // ...
  case 0xFF:
    break;
}

Which I then translated to something like this:

let instr_table = new Array(256);
let instr_subtable_DD = new Array(256);

instr_table[0x00] = function() { /* nop */ };
instr_table[0x01] = function() { /* do something, probably */ };
instr_table[0xDD] = function() {
  return instr_subtable_DD[read_byte(z8.r2[Regs2_PC]++)]();
};
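With the tables in place, the decode loop is just one array lookup per opcode. Here’s a self-contained toy version of the whole scheme, with a plain array standing in for the calculator’s memory and a fake accumulator so the handlers do something observable; none of this is jsTIfied’s actual code.

```javascript
const PC = 0;                        // index of the program counter register
const r2 = new Uint16Array(1);       // just the PC for this toy
let acc = 0;                         // toy accumulator
const mem = [0x01, 0x01, 0xDD, 0x00, 0x00]; // fake 5-byte program

const table = new Array(256).fill(function () {});   // default handler: nop
const tableDD = new Array(256).fill(function () {});

table[0x01] = function () { acc += 1; };             // toy "increment" opcode
table[0xDD] = function () {                          // prefix: consume one more
  tableDD[mem[r2[PC]++]]();                          // byte and dispatch again
};
tableDD[0x00] = function () { acc += 100; };

function step() { table[mem[r2[PC]++]](); }          // one lookup per opcode

for (let i = 0; i < 4; i++) step();
// 4 steps consume 5 bytes: two 0x01s, then 0xDD 0x00 as one instruction,
// then a trailing nop
```

Every handler is a tiny monomorphic function, which is exactly the shape JS engines of that era were happy to optimize.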

After all that was done, I had success! The emulator went from slow as molasses to being too fast. I told the original dev about this and he was eager to merge the changes in, but a calculator that’s too fast is bad in its own way, because you can hardly control the thing. So I got him to give me access to the source, wrote a speed governor, and he got it all squared away and pushed up to the site. You can see this yourself at https://www.cemetech.net/projects/jstified/jstified_compressed.js?20170706a; just search for z8oT. You can also see a demo I recorded at the time below, first with the old slow version and then with the code that was far too fast.
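I don’t remember the exact governor I wrote, but the usual shape is to budget emulated cycles against wall-clock time: each tick, run only as many cycles as should have elapsed at the target clock speed, and clamp the budget so a long pause (a background tab, a GC hitch) doesn’t trigger a catch-up burst. A sketch, not the code that shipped:

```javascript
const CLOCK_HZ = 15000000; // TI-84+CSE z80 clock: 15 MHz

// Given elapsed wall time in ms, how many cycles are we allowed to run?
function cycleBudget(elapsedMs, maxMs) {
  const ms = Math.min(elapsedMs, maxMs || 100); // clamp to avoid catch-up bursts
  return Math.floor((ms / 1000) * CLOCK_HZ);
}
```

In practice this gets driven from setInterval or requestAnimationFrame: measure the time since the last tick, take `cycleBudget` of it, and run the dispatch loop until the executed cycles exhaust the budget.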

I had one more trick up my sleeve too: notice that those program-counter increments don’t involve a & 0xFFFF to keep the value within the appropriate 16 bits. That’s because I switched to storing the registers in a Uint16Array, which has that wrapping behavior built in (since it’s backed by real honest-to-goodness u16s). The overflow behavior is defined in the ECMAScript spec, where stored values are converted with ToUint16 and so wrap modulo 2^16, and it works the same everywhere I’ve tried it, but do me a favor and check for yourself. Ultimately this had a marginal performance benefit at best, but it removed a LOT of bitwise ops from the code and made it much harder to mess up the value range of register operations in general.

Before closing, a warning: don’t take this as modern performance advice for your own javascript without checking the V8 and SpiderMonkey sources first. 2013 was a different time, and JS engines have come a long way since then. I really hope they’ve made this better than it was.