6. IR generator

The intermediate representation (IR) is a formulation of code that is close to machine code. While an AST is a tree, IR code is a list of simple instructions. Later, each IR instruction will be mapped to one or more machine code instructions by the Assembly generator stage.

Each compiler defines their own IR, suitable to the language they are compiling. Here’s an example of some of our IR instructions, corresponding to the expression f(a + 3):

# Load constant 3 to internal variable x1.
LoadIntConst(3, x1)

# Call operator +,
# passing arguments from variables a and x1,
# and storing the result in variable x2
Call(+, [a, x1], x2)

# Call function f with x2, store result in x3
Call(f, [x2], x3)

An IR can be designed to be more high-level i.e. more abstract, or more low-level i.e. closer to machine code. Our IR is quite high-level.

For instance, we assume an infinite set of abstract variables (a, x1, …). We leave it to the Assembly generator to map each variable to a concrete memory location or register.

We also use a single command Call to invoke both basic operators like + and functions like f. The Assembly generator will later transform some Calls into machine instructions like add and others into function call code.

Variables and instructions

IR instructions operate on abstract IR variables (sometimes called ”abstract registers” or ”symbols”).

An IR variable can denote the following kinds of things:

Built-in operations (e.g. + or <)
Other functions (e.g. print_int)
Storage slots for values.

For example, the instruction Call(f, [x], y) operates on the following IR variables:

The IR variable denoting the function (or built-in) to call (f)
The IR variables to read parameters from ([x])
The IR variable to store the return value in (y)

Unlike the variables in our language, IR variables don’t have scopes. We can use the variable names in our language as IR variable names as long as they don’t clash. If the input code contains e.g. var print_int = ..., we will map that variable to a different IR variable, such as print_int2. The same goes for shadowing variables.

Later when we translate IR code to Assembly code, IR variables will correspond to 8-byte (64-bit) storage locations (registers or memory slots). This limitation doesn’t matter for now, since all values in our language can be represented as 8-byte integers.

Exercise (optional)

Write IR code for the expressions a * b, f(g(x + 1)) and { f(x); f(y); }.

Labels and jumps

To compile if-expressions and while-loops, we need the ability to express conditional execution in our IR. We do this the same way it’s done in machine code: by introducing jump instructions, which designate the next instruction to execute. They do this by refering to labels, which are dummy instructions that act as targets for a jump instruction but don’t otherwise do anything.

An unconditional jump instruction Jump(L) means that the next instruction to execute is label L.

A conditional jump instruction CondJump(x, L1, L2) jumps to label L1 if variable x is true, and to L2 otherwise.

For example, here’s IR code implementing the expression if a < b then f() else g():

Call(<, [a, b], x1)    # x1 = a < b
CondJump(x1, L1, L2)   # if x1, jump to L1, else to L2

L1                     # then-branch starts here
Call(f, [], x2)        # f()
Jump(L3)               # skip the code for the else branch

L2                     # else-branch starts here
Call(g, [], x2)        # g()

L3                     # end of if expression

Exercise (optional)

Write IR code for the expression while a < b do f().

IR definition

Let’s define Python classes to represent IR variables and IR instructions.

We could represent IR variables as simple strings, but when using Python’s type hinting, we can get more type-safety if we define a simple class instead.

@dataclass(frozen=True)
class IRVar:
    """Represents the name of a memory location or built-in."""
    name: str

    def __str__(self) -> str:
        return self.name

We set frozen=True as a best practice, and to allow use of IRVar as a dictionary key. We define __str__ to return a cleaner string representation, which will be helpful later.

Now let’s define a base class for IR instructions. We require each instruction to hold a source code location and we define a __str__ (details unimportant) that stringifies instructions to look like they do in this text.

# src/compiler/ir.py

@dataclass(frozen=True)
class Instruction():
    """Base class for IR instructions."""
    location: Location

    def __str__(self) -> str:
        """Returns a string representation similar to
        our IR code examples, e.g. 'LoadIntConst(3, x1)'"""
        def format_value(v: Any) -> str:
            if isinstance(v, list):
                return f'[{", ".join(format_value(e) for e in v)}]'
            else:
                return str(v)
        args = ', '.join(
            format_value(getattr(self, field.name))
            for field in dataclasses.fields(self)
            if field.name != 'location'
        )
        return f'{type(self).__name__}({args})'

Now let’s define the actual instructions. Most of these you’ve already seen in examples.

@dataclass(frozen=True)
class LoadBoolConst(Instruction):
    """Loads a boolean constant value to `dest`."""
    value: bool
    dest: IRVar

@dataclass(frozen=True)
class LoadIntConst(Instruction):
    """Loads a constant value to `dest`."""
    value: int
    dest: IRVar

@dataclass(frozen=True)
class Copy(Instruction):
    """Copies a value from one variable to another."""
    source: IRVar
    dest: IRVar

@dataclass(frozen=True)
class Call(Instruction):
    """Calls a function or built-in."""
    fun: IRVar
    args: list[IRVar]
    dest: IRVar

@dataclass(frozen=True)
class Jump(Instruction):
    """Unconditionally continues execution from the given label."""
    label: Label

@dataclass(frozen=True)
class CondJump(Instruction):
    """Continues execution from `then_label` if `cond` is true, otherwise from `else_label`."""
    cond: IRVar
    then_label: Label
    else_label: Label

@dataclass(frozen=True)
class Label(Instruction):
    """Marks the destination of a jump instruction."""
    name: str

So if your source code is 1+1 then we’d compile that to the following IR:

LoadIntConst(1, x1)
LoadIntConst(1, x2)
Call(+, [x1, x2], x3)
Call(print_int, [x3], x4)  # Added to print the final result

Generating IR

Just like the interpreter and type checker, an IR generator works by recursing over the AST with a symbol table. At each AST node, the IR generator emits one or more IR instructions and generates IR variables as needed to store intermediate results.

Below is partial code for the IR generator. Each ”...” means you get to fill in that part.

# src/compiler/ir_generator.py

from compiler import ast, ir
from compiler.symtab import SymTab
from compiler.types import Bool, Int, Type, Unit

def generate_ir(
    # 'root_types' parameter should map all global names
    # like 'print_int' and '+' to their types.
    root_types: dict[IRVar, Type],
    root_expr: ast.Expr
) -> list[ir.Instruction]:
    var_types: dict[IRVar, Type] = root_types.copy()

    # 'var_unit' is used when an expression's type is 'Unit'.
    var_unit = IRVar('unit')
    var_types[var_unit] = Unit

    def new_var(t: Type) -> IRVar:
        # Create a new unique IR variable and
        # add it to var_types
        ...

    # We collect the IR instructions that we generate
    # into this list.
    ins: list[ir.Instruction] = []

    # This function visits an AST node,
    # appends IR instructions to 'ins',
    # and returns the IR variable where
    # the emitted IR instructions put the result.
    #
    # It uses a symbol table to map local variables
    # (which may be shadowed) to unique IR variables.
    # The symbol table will be updated in the same way as
    # in the interpreter and type checker.
    def visit(st: SymTab[IRVar], expr: ast.Expr) -> IRVar:
        loc = expr.location

        match expr:
            case ast.Literal():
                # Create an IR variable to hold the value,
                # and emit the correct instruction to
                # load the constant value.
                match expr.value:
                    case bool():
                        var = new_var(Bool)
                        ins.append(ir.LoadBoolConst(
                            loc, expr.value, var))
                    case int():
                        var = new_var(Int)
                        ins.append(ir.LoadIntConst(
                            loc, expr.value, var))
                    case None:
                        var = var_unit
                    case _:
                        raise Exception(f"{loc}: unsupported literal: {type(expr.value)}")

                # Return the variable that holds
                # the loaded value.
                return var

            case ast.Identifier():
                # Look up the IR variable that corresponds to
                # the source code variable.
                return st.require(expr.name)

            ... # Other AST node cases (see below)

    # Convert 'root_types' into a SymTab
    # that maps all available global names to
    # IR variables of the same name.
    # In the Assembly generator stage, we will give
    # definitions for these globals. For now,
    # they just need to exist.
    root_symtab = SymTab[IRVar](parent=None)
    for v in root_types.keys():
        root_symtab.add_local(v.name, v)

    # Start visiting the AST from the root.
    var_final_result = visit(root_symtab, root_expr)

    if var_types[var_final_result] == Int:
        ... # Emit a call to 'print_int'
    elif var_types[var_final_result] == Bool:
        ... # Emit a call to 'print_bool'

    return ins

Let’s look at how to handle a few more AST node cases in visit.

IR for most binary operators is generated like this:

case ast.BinaryOp():
    # Ask the symbol table to return the variable that refers
    # to the operator to call.
    var_op = st.require(expr.op)
    # Recursively emit instructions to calculate the operands.
    var_left = visit(st, expr.left)
    var_right = visit(st, expr.right)
    # Generate variable to hold the result.
    var_result = new_var(expr.type)
    # Emit a Call instruction that writes to that variable.
    ins.append(ir.Call(
        loc, var_op, [var_left, var_right], var_result))
    return var_result

Binary operators =, and and or will need special cases. We’ll deal with them later.

Here’s how to generate IR for ”if-then” expressions:

case ast.IfExpr():
    if expr.else_clause is None:
        # Create (but don't emit) some jump targets.
        l_then = new_label()
        l_end = new_label()

        # Recursively emit instructions for
        # evaluating the condition.
        var_cond = visit(st, expr.condition)
        # Emit a conditional jump instruction
        # to jump to 'l_then' or 'l_end',
        # depending on the content of 'var_cond'.
        ins.append(ir.CondJump(loc, var_cond, l_then, l_end))

        # Emit the label that marks the beginning of
        # the "then" branch.
        ins.append(l_then)
        # Recursively emit instructions for the "then" branch.
        visit(st, expr.then_clause)

        # Emit the label that we jump to
        # when we don't want to go to the "then" branch.
        ins.append(l_end)

        # An if-then expression doesn't return anything, so we
        # return a special variable "unit".
        return var_unit
    else:
        ... # "if-then-else" case

Tasks

The sandbox generates IR that you can compare with your own IR generator’s output, but note that there are sometimes multiple reasonable ways to generate IR for a particular AST node.

Task 1

Start writing the IR generator, starting from the example snippets above.

Copy the IR instruction and IRVar class definitions to your project.
Copy the incomplete function generate_ir to your project.
Add the example case ast.BinaryOp(): code.
Implement the missing ”...” parts, except for other AST node types.
Create a way to manually test IR generation for a given input and check that the output looks right for input 1 + 2 * 3.

Task 2

Implement the rest of the IR generator.

Note the following special cases:

Operator = should require the left hand side to be an identifier and emit a Copy instruction.
Operators and and or should short-circuit just like in the Task 4 of the interpreter
As in the interpreter, unary operator names should be prefixed with unary_ to distinguish e.g. the binary minus - and the unary minus unary_-.
Adding operators == and != to root_types is not necessary.

The base language’s IR generator will only look at the type of an IR variable if it is the final result of the program. Therefore it will never look at the types of operators == and !=, except for very silly programs that result in one of those operators. You can ignore such cases and just not add == and != to root_types, or you can add them with some special case placeholder type.

Manual testing is enough at this point. Unit tests are not as valuable here as they were in the previous stages, because the correctness of IR generation is best verified by checking whether the generated IR ”does the right thing”, but we don’t have a way to run the IR yet. We’ll write end-to-end tests after doing Assembly generation.

Compilers

spring 2024