4. Interpreter

An interpreter is a program that executes the source code directly, without turning it into machine code first.

An interpreter is not a mandatory part of the course project, and indeed compilers don’t usually come with interpreters. That said, unless you’ve found everything so far easy and feel very comfortable with recursion, it’s not recommended to skip this chapter, because the same techniques will be applied to more difficult problems in the next two chapters.

For most languages, it’s easiest to write an interpreter that operates recursively on an AST. Production-quality interpreters like Python’s tend to first translate the AST to a flat sequence of commands, sometimes called ”bytecode”, and interpret that, because that’s more efficient.

We’ll write an AST-based interpreter, because it’s more valuable to practice recursing over the AST at this stage, and because it’s more fun to get your project to run programs sooner rather than later.

Recursion over an AST

Let’s look at the AST for the following expression

if x < 10 then print_int(x + 7) else print_int(0)

The AST has the following types of nodes:

Constant literals: 10, 7 and 0.
Variables: x and print_int
Operators: < and +
Function calls: name(arguments)
An if-then-else expression

We need to define how to compute the result for each of these.

Constants are super easy: just return the constant value.
Variables need to look up the value from somewhere - we’ll get back this.
Operators and function calls need to recursively evaluate their arguments and then apply the operator or function on the results.
If-then-else expressions need to recursively evaluate the condition and then decide which branch to recursively evaluate.

The code for a very simple interpreter, without support for variables and with very poor error checking, might look something like this:

# src/compiler/interpreter.py
from typing import Any
from compiler import ast

type Value = int | bool | None

def interpret(node: ast.Expr) -> Value:
    match node:
        case ast.Literal():
            return node.value

        case ast.BinaryOp():
            a: Any = interpret(node.left)
            b: Any = interpret(node.right)
            if node.op == '+':
                return a + b
            elif node.op == '<':
                return a < b
            else:
                raise ...

        case ast.IfThenElse():
            if interpret(node.condition):
                return interpret(node.then_branch)
            else:
                return interpret(node.else_branch)
        ...

You are now ready to do task 1.

Symbol tables

An interpreter needs some place to look up the values for variables. Such a place is usually called a symbol table. A symbol table is almost always needed when recursing over the AST – the next two chapters (type checker and IR generator) will use one too.

Our first thought might be that a symbol table is just a dictionary that maps variable names to values, but that is not enough.

Consider nested blocks that may define new variables. These variables must only be visible inside that block, i.e. the block defines a scope. Variables defined in a scope may hide i.e. shadow variables of the same name in outer scopes.

Example:

{
    var x = 1;
    {  # <----------- Inner scope begins
        var x = 2;  # Shadows the 'x' in the outer scope
        var y = 3;
        print_int(x);   # Prints 2
        print_int(y);   # Prints 3
    }  # <----------- Inner scope ends
    print_int(x);  # Prints 1
    print_int(y);  # Error: 'y' does not exist in this outer scope
}

We could just ban shadowing (e.g. C# does so, mostly), but shadowing is sometimes useful (e.g. Rust even encourages it).

We’ll definitely want to support shadowing later when we add functions to the language. We at least want to be able to add a global variable even if some function happens to use the same variable name internally.

The solution is a hierarchical symbol table. It contains two things:

locals: a dictionary of variable names to values – contains variables defined in the current scope,
parent: an optional reference to the outer scope’s symbol table – we consult this recursively when a variable is not found in locals.

In the example above, at the first print_int(x), the hierarchical symbol table would look like this:

The top-level symbol table is where global things like our built-in functions and operators are defined. We’ll put them there in task 3.

Tasks

Task 1

Get the interpreter code example working in your project and test that interpreting 2 + 3 works.

You don’t need to improve error checking in cases where the type checker, which we will write in chapter 5, would reject the program. For example, 1 + (2 < 3) tries to add a number and a boolean, which the type checker won’t allow. It’s OK for the interpreter to fail with an ugly error, or do something unexpected in these cases (in Python, 1 + (2 < 3) happens to be 2 because True is converted to 1).

Task 2

Write a class for a hierarchical symbol table (call it e.g. SymTab) and make interpret take a SymTab as a parameter.

Implement variable declarations and blocks in your interpreter.

Task 3

Refactor your interpreter so that built-in functions and operators are defined as symbols in the top-level symbol table.

This makes the code easier to read as library code is no longer mixed with the interpreter code. This is also the structure we want when we add user-defined functions to the language later.

To distinguish between unary and binary operators, you can prefix the unary operators with unary_ in the symbol table.

Finally, implement all operators listed in the language spec.

Hint: you’ll need to extend the type definiton of Value to include callable functions.

Hint: you can keep the assignment operator = as a special case.

Task 4

Consider the logical operators and and or.

In almost all languages, these are short-circuiting: if the left operand is enough to know the result (false for and or true for or) then the right operand is not evaluated at all. E.g. a and f(b) never calls f if a is false.

To see why short-circuiting is useful consider the Python expression len(list) > 0 and list[0] == 123. Without short-circuiting, this would crash if list is empty.

Some old languages like Pascal and Visual Basic initially did not have short-circuiting, but it was since added to both: popular Pascal implementations enabled it by default in the late 90s and Visual Basic got alternative (if unwieldily named) short-circuiting operators AndAlso and OrElse.

Languages with short-circuiting by default still come with a second pair of and/or operators that do not have short-circuiting (list) – in Python (and many others) these are & and |.

The only benefit of not short-circuiting is more efficient code in situations where both checks are so simple that a branching machine instruction is more expensive than a redundant check, but modern compilers can usually figure out when that’s the case and generate the faster version automatically.

There remain some cases where the compiler fails to do this optimization because it cannot see that the second check is always safe to perform. In such cases, the non-short-circuiting operators may still have a valid use in micro-optimization.

Implement short-circuiting for these operators.

Hint: you can either make these operators special cases in the interpreter (recommended), or you can extend the definition of Value to allow macros: special kinds of operators/functions that can choose when and if to evaluate each of their arguments.

Hint: to unit test this, you can use test programs like this:

var evaluated_right_hand_side = false;
true or { evaluated_right_hand_side = true; true };
evaluated_right_hand_side  # Should be false

Task 5

Complete the interpreter by implementing all constructs mentioned in the language spec.

As mentioned in task 1, the interpreter is allowed to fail or do silly things if the program would not pass type checking.

Hint: you can represent the value unit with Python’s None.

Compilers

spring 2024

4. Interpreter

Recursion over an AST

Symbol tables

Tasks