That time I wrote a FORTH compiler for my TI84

I’ve got something special in the works for my next Oxide-related post, but while I work on that I want to revisit an old project of mine, uninspiringly named calccomp. The README claims it’s a C compiler but that’s a lie; I never got very far on the C part of the project. What it does have is a custom Z80 assembler and a compiler for my own dialect of FORTH, with a few neat features like the ability to write an infinitely recursive “word” (the FORTH terminology for a subroutine/function) without stack overflows. I’m leaving the project in its unorganized and disheveled state, but it deserves some proper representation.

FORTH?

Yeah, FORTH. or Forth. or forth. Capitalize it however you want really, but I’m all-capsing it as is tradition. If you don’t know, FORTH is a stack based concatenative programming language. Most programming languages use a stack under the hood to store things like local variables and where to return after a function ends, but FORTH makes manual manipulation of that stack a core part of the language. A typical FORTH program looks something like this:

1 DUP + .

FORTH’s grammar is very simple: just string.split() the source code on spaces and you have your tokens ready for interpretation. This program has 4 steps to it:

Anything that isn’t a literal value is called a “word”. DUP, +, and . are all words. Think of them like functions or subroutines.

This sort of programming can be pretty confusing, but it’s also kinda fun. It feels a bit like a puzzle game trying to swap around stack entries the most efficient way possible to do whatever you want to get done.

It’s worth stating that FORTH isn’t a singular language, but more of a family of languages, a bit like lisp. FORTH is one of the simplest languages out there from a syntactic standpoint, and is also relatively easy to write an interpreter or a compiler for. The result is that a lot of people write their own FORTHs with little quirks that reflect the desires and personal flair of whoever made them.

My first exposure to FORTH was through the MineCraft mod RedPower 2, written some years ago by eloraam. She implemented an emulator for a modified 6502 CPU instruction set, and then wrote a FORTH interpreter and compiler on top of that. You could then program this thing in FORTH to interact with the world and various machines in all sorts of ways, from simple redstone equations to complicated item sorting algorithms and flying excavator machines. I was in love with this, and I married this love to my obsession with TI83/84 calculators.

I accidentally wrote an Assembler

Before we get to the FORTH compiler, I want to talk about my assembler. It’d be natural to wonder why I even bothered writing an assembler in the first place. After all, I could just make my compiler feed text into someone else’s assembler and call it a day. That was actually my original plan, and the assembler was kind of an accident.

See, the simplest way to write a compiler is to just turn each token into some assembly code, print it all out, and then shove that into an assembler to do something with it, but that makes implementing optimizations more error prone and just plain annoying. It also means that if your assembly representation of your primitives and core library have any invalid assembly in them, you don’t find out until your assembler starts giving you cryptic errors with line numbers that are difficult to correlate with your actual compiler code.

The solution to both of these problems is to build some data types to represent the assembly language, or as a compiler dev will call it, an Abstract Syntax Tree (AST). Granted, assembly rarely gets very tree-like, but it’s a tree nonetheless. This way, you can ensure that you never accidentally typo your way into invalid assembly code without it getting caught by the host compiler (the compiler that’s compiling your compiler).

Once you’re working with an AST though, you eventually need to serialize it to a file. And at that point you might as well generate the machine code for your target CPU to skip the assembly step and OOPS you wrote your own assembler! So that’s what happened to me, and I just decided to roll with it and add labels and a few other features I wanted.

I primarily use Ben Ryves’ Brass Assembler when I’m writing Z80 code and I stole its variable allocation feature to make the compiler implementation simpler. Here, let me explain with some code:

; Tells the assembler to use 768 bytes starting at
; address 4000 (hexadecimal) for static variable allocation.
.varloc 4000h, 768

; Defines snake_x = 4000h as an assembly-time constant and
; removes 4000h/4001h from the allocation pool
.var 2, snake_x

; Defines snake_y = 4002h as an assembly-time constant and
; removes 4002h/4003h from the allocation pool
.var 2, snake_y

Rather than having to manually allocate all of your static variables, you can give Brass a pool of memory and ask it to slice off pieces of that memory for you without having to worry about where exactly they are. In a language like C, we take this sort of thing for granted, but not all assemblers can do this. I love this feature, and it made writing the compiler easier, so I threw it into the assembler.

For my final assembler trick, I added in a Z80 quasi-quoter. My assembler and compiler are both written in Haskell. Haskell’s got a feature called quasi-quoters, which let you write some custom parser logic that takes in a String and dumps out a Haskell syntax tree. Then you can use this to do Compile Time Shenanigans. Here’s an example from the FORTH compiler:

-- wasm stands for "word assembly", Web Assembly wasn't a thing yet :)
wasm "DUP" = rtni [asm|
  pop hl
  push hl
  push hl
|]

rtni is a function that takes in a Z80 assembly syntax tree, but writing out the syntax tree by hand is annoying. Instead I import a function from my assembler called asm. That’s the quasi-quoter! [asm| whatever |] is a “quasi-quotation”, which tells the Haskell compiler to feed all the text in between the two pipe characters to the asm function. Whatever Haskell code comes back out is what gets compiled, type-checked, and all that good stuff. This way I can have my cake of writing bare assembly but I get all the benefits of having my entire standard library syntax checked statically.

My FORTH at the language level

My FORTH was based on a list of standard FORTH words from some old scanned-in PDF, and I don’t remember which one anymore. It uses the traditional two-stack layout with a data stack and a return stack. The data stack is the stack most of the FORTH words use for inputs and outputs, and any time something refers to “the stack” that’s usually what they mean. The return stack stores the return address for word calls, but there’s also some words you can use to shuffle data between this and the data stack. This is pretty common practice, but you’ve got to be very careful when writing a word that the return stack looks the same at the end as it did at the start or you’re going to end up jumping to who knows where and crashing your system.

I changed the syntax around from traditional FORTH word definition. A lot of FORTHs, defining a word looks something like this:

: MY_WORD 2 * ;

: starts the definition, MY_WORD is the name of the word, and then everything up to the ; is the body. I didn’t really like that at the time, and decided what the world needed was more curly braces, so I ended up with this:

WORD MY_WORD {
  2 *
}

I genuinely don’t remember the reason. While I was at it though, I also added inline assembly:

ASMWORD ENABLE_INTERRUPTS {
  ei ; enable interrupts
}

The body of an ASMWORD is passed directly into my assembler without any extra processing. This was my escape hatch to do anything I couldn’t do with my standard library, or write tight assembly loops for graphics. I’ll talk about this a bit more in the implementation details when I explain calling conventions.

Apparently I used THEN as a terminator for if-statements, so they look a bit funky. check this out,

( pops the stack and executes BODY1 if the value is non-zero (TRUE), else BODY2 )
IF 
  ( BODY1 goes here )
ELSE
  ( BODY2 goes here )
THEN

I also mentioned earlier that I invented a way to do infinite recursion; this was actually an excuse to avoid implementing for/while loops for a bit. I added a word called RECURSE which just jumped back to the start of the current word definition. This looks a bit like tail-call optimization if you squint but it’s much simpler. Traditional tail-call optimization has to go through flow analysis to prove that a function is calling itself and then immediately returning, and then the compiler can choose to just jump to the start of the function again. Since FORTH is so loosey goosey with the concept of “calling a word” and “word arguments”, adding a tool to just jump back to the start of a word is totally chill as long as the word leaves the data and return stacks in a sensible state when it finally does return.

So here’s what a loop that counts down to 0 looks like in my FORTH:

WORD MAIN {
  5 PRINT_UNTIL_ZERO
}

WORD PRINT_UNTIL_ZERO {
  DUP . ( print the number )
  1-    ( subtract 1 )
  DUP 0 < IF
    ( value is less than 0 )
    DROP RETURN
  ELSE
    RECURSE
  THEN
}

How the compiler actually works

The parser is pretty boring so I’m going to skip that.

Calling Conventions

Calling conventions- every language has to have them. To keep things fast I use the hardware stack to store data, pointed to by the SP register. Since the return stack isn’t used as much, I use the slower IX index register to keep track of that. This presents a bit of a problem for calling words because I use the real call instruction, and that puts the return address on our data stack. To solve this, the first thing all words do is pop the return address from the data stack and move it over to the return stack for safe keeping. When returning, they load the top value from the return stack into the HL register and do an indirect jump back to the caller.

Inline Words

The calling conventions impose some overhead, so it’s best to avoid calling words unless they’re long enough to warrant it. Most of the core vocabulary gets inlined as long as the implementation isn’t too big. For some of the larger core stuff like multiply I go halfsies on it, inlining the stack pops and pushes before and after calling the main body of the routine. This is important for the optimizations I’ll talk about later.

Everything else gets the full calling overhead. If my compiler was a bit more complex I could avoid this by automatically doing the halfsies approach for words that don’t reach deep into the stack, but that would have required deeper static analysis than I wanted to figure out.

Tree-Shaking

There’s a lot of stuff in the standard library, but most programs don’t use all of it. For inline words this isn’t a problem; the code is inserted inline if you use it and omitted if it’s not. For everything else, I added dependency tracking. If a word calls another word, it declares the called word as a dependency. When it comes time for code-generation, only the stuff the program actually uses gets inserted into the final binary.

Optimizations

Here’s where I take this from “neat” to “fast enough to write game logic”. Throughout the standard library, I stick to the HL and DE registers as much as possible for stack operations and math operands, and this gives me two really obvious optimizations.

First, pushing a register and popping back into it does nothing but waste CPU time so we can remove this pattern entirely.

; Delet this
push hl
pop hl

; This too
push de
pop de

Second, copying a register to another register is faster if you skip the stack:

; Delet this
push hl
pop de

; Use this instead
ld d,h
ld e,l

This makes a lot of our stack operations just cancel each other out. For example, an unoptimized SWAP DROP like this

; SWAP
pop hl
ex (sp),hl
push hl

; DROP
pop hl

can get turned into this optimized code:

pop hl
ex (sp),hl

And if a previous word ended in push hl, the SWAP DROP is suddenly a single ex (sp),hl instruction.

The compiler devs among you are already yelling “peephole optimization” at the screen and yeah, exactly that. The performance gains from this were so big that I didn’t even bother writing other optimizations because I didn’t need them.

Is it practical?

You bet your ass it is! Here’s Snake for the TI84+CSE, implemented in my FORTH. The project is very TI-calculator focused, but could totally be adapted to other Z80 systems. I have a Z80 computer kit lying around so maybe I’ll do that sometime.

Anyhow, thanks for listening to me ramble on about the machinations of my past.