6. IR generator
The intermediate representation (IR) is a formulation of code that is close to machine code. While an AST is a tree, IR code is a list of simple instructions. Later, each IR instruction will be mapped to one or more machine code instructions by the Assembly generator stage.
Each compiler defines their own IR, suitable to the language they are compiling.
Here’s an example of some of our IR instructions, corresponding to the expression
f(a + 3)
:
# Load constant 3 to internal variable x1.
LoadIntConst(3, x1)
# Call operator +,
# passing arguments from variables a and x1,
# and storing the result in variable x2
Call(+, [a, x1], x2)
# Call function f with x2, store result in x3
Call(f, [x2], x3)
An IR can be designed to be more high-level i.e. more abstract, or more low-level i.e. closer to machine code. Our IR is quite high-level.
For instance, we assume an infinite set of abstract variables
(a
, x1
, …). We leave it to the Assembly generator
to map each variable to a concrete memory location or register.
We also use a single command Call
to invoke both basic operators like
+
and functions like f
. The Assembly generator will later transform
some Call
s into machine instructions like add
and others into function call code.
Variables and instructions
IR instructions operate on abstract IR variables (sometimes called ”abstract registers” or ”symbols”).
An IR variable can denote the following kinds of things:
- Built-in operations (e.g.
+
or<
) - Other functions (e.g.
print_int
) - Storage slots for values.
For example, the instruction Call(f, [x], y)
operates on the following IR variables:
- The IR variable denoting the function (or built-in) to call (
f
) - The IR variables to read parameters from (
[x]
) - The IR variable to store the return value in (
y
)
Unlike the variables in our language, IR variables don’t have scopes.
We can use the variable names in our language as IR variable names as long as
they don’t clash. If the input code contains e.g. var print_int = ...
,
we will map that variable to a different IR variable,
such as print_int2
. The same goes for shadowing variables.
Later when we translate IR code to Assembly code, IR variables will correspond to 8-byte (64-bit) storage locations (registers or memory slots). This limitation doesn’t matter for now, since all values in our language can be represented as 8-byte integers.
Exercise (optional)
Write IR code for the expressions a * b
, f(g(x + 1))
and { f(x); f(y); }
.
Labels and jumps
To compile if-expressions and while-loops, we need the ability to express conditional execution in our IR. We do this the same way it’s done in machine code: by introducing jump instructions, which designate the next instruction to execute. They do this by refering to labels, which are dummy instructions that act as targets for a jump instruction but don’t otherwise do anything.
An unconditional jump instruction Jump(L)
means that the next instruction to execute is label L
.
A conditional jump instruction CondJump(x, L1, L2)
jumps to label L1
if variable x
is true,
and to L2
otherwise.
For example, here’s IR code implementing the expression if a < b then f() else g()
:
Call(<, [a, b], x1) # x1 = a < b
CondJump(x1, L1, L2) # if x1, jump to L1, else to L2
L1 # then-branch starts here
Call(f, [], x2) # f()
Jump(L3) # skip the code for the else branch
L2 # else-branch starts here
Call(g, [], x2) # g()
L3 # end of if expression
Exercise (optional)
Write IR code for the expression while a < b do f()
.
IR definition
Let’s define Python classes to represent IR variables and IR instructions.
We could represent IR variables as simple strings, but when using Python’s type hinting, we can get more type-safety if we define a simple class instead.
@dataclass(frozen=True)
class IRVar:
"""Represents the name of a memory location or built-in."""
name: str
def __str__(self) -> str:
return self.name
We set frozen=True
as a best practice, and to allow use of IRVar
as a dictionary key.
We define __str__
to return a cleaner string representation, which will be helpful later.
Now let’s define a base class for IR instructions.
We require each instruction to hold a source code location
and we define a __str__
(details unimportant)
that stringifies instructions to look like they do in this text.
# src/compiler/ir.py
@dataclass(frozen=True)
class Instruction():
"""Base class for IR instructions."""
location: Location
def __str__(self) -> str:
"""Returns a string representation similar to
our IR code examples, e.g. 'LoadIntConst(3, x1)'"""
def format_value(v: Any) -> str:
if isinstance(v, list):
return f'[{", ".join(format_value(e) for e in v)}]'
else:
return str(v)
args = ', '.join(
format_value(getattr(self, field.name))
for field in dataclasses.fields(self)
if field.name != 'location'
)
return f'{type(self).__name__}({args})'
Now let’s define the actual instructions. Most of these you’ve already seen in examples.
@dataclass(frozen=True)
class LoadBoolConst(Instruction):
"""Loads a boolean constant value to `dest`."""
value: bool
dest: IRVar
@dataclass(frozen=True)
class LoadIntConst(Instruction):
"""Loads a constant value to `dest`."""
value: int
dest: IRVar
@dataclass(frozen=True)
class Copy(Instruction):
"""Copies a value from one variable to another."""
source: IRVar
dest: IRVar
@dataclass(frozen=True)
class Call(Instruction):
"""Calls a function or built-in."""
fun: IRVar
args: list[IRVar]
dest: IRVar
@dataclass(frozen=True)
class Jump(Instruction):
"""Unconditionally continues execution from the given label."""
label: Label
@dataclass(frozen=True)
class CondJump(Instruction):
"""Continues execution from `then_label` if `cond` is true, otherwise from `else_label`."""
cond: IRVar
then_label: Label
else_label: Label
@dataclass(frozen=True)
class Label(Instruction):
"""Marks the destination of a jump instruction."""
name: str
So if your source code is 1+1
then we’d compile that to the following IR:
LoadIntConst(1, x1)
LoadIntConst(1, x2)
Call(+, [x1, x2], x3)
Call(print_int, [x3], x4) # Added to print the final result
Generating IR
Just like the interpreter and type checker, an IR generator works by recursing over the AST with a symbol table. At each AST node, the IR generator emits one or more IR instructions and generates IR variables as needed to store intermediate results.
Below is partial code for the IR generator.
Each ”...
” means you get to fill in that part.
# src/compiler/ir_generator.py
from compiler import ast, ir
from compiler.symtab import SymTab
from compiler.types import Bool, Int, Type, Unit
def generate_ir(
# 'root_types' parameter should map all global names
# like 'print_int' and '+' to their types.
root_types: dict[IRVar, Type],
root_expr: ast.Expr
) -> list[ir.Instruction]:
var_types: dict[IRVar, Type] = root_types.copy()
# 'var_unit' is used when an expression's type is 'Unit'.
var_unit = IRVar('unit')
var_types[var_unit] = Unit
def new_var(t: Type) -> IRVar:
# Create a new unique IR variable and
# add it to var_types
...
# We collect the IR instructions that we generate
# into this list.
ins: list[ir.Instruction] = []
# This function visits an AST node,
# appends IR instructions to 'ins',
# and returns the IR variable where
# the emitted IR instructions put the result.
#
# It uses a symbol table to map local variables
# (which may be shadowed) to unique IR variables.
# The symbol table will be updated in the same way as
# in the interpreter and type checker.
def visit(st: SymTab[IRVar], expr: ast.Expr) -> IRVar:
loc = expr.location
match expr:
case ast.Literal():
# Create an IR variable to hold the value,
# and emit the correct instruction to
# load the constant value.
match expr.value:
case bool():
var = new_var(Bool)
ins.append(ir.LoadBoolConst(
loc, expr.value, var))
case int():
var = new_var(Int)
ins.append(ir.LoadIntConst(
loc, expr.value, var))
case None:
var = var_unit
case _:
raise Exception(f"{loc}: unsupported literal: {type(expr.value)}")
# Return the variable that holds
# the loaded value.
return var
case ast.Identifier():
# Look up the IR variable that corresponds to
# the source code variable.
return st.require(expr.name)
... # Other AST node cases (see below)
# Convert 'root_types' into a SymTab
# that maps all available global names to
# IR variables of the same name.
# In the Assembly generator stage, we will give
# definitions for these globals. For now,
# they just need to exist.
root_symtab = SymTab[IRVar](parent=None)
for v in root_types.keys():
root_symtab.add_local(v.name, v)
# Start visiting the AST from the root.
var_final_result = visit(root_symtab, root_expr)
if var_types[var_final_result] == Int:
... # Emit a call to 'print_int'
elif var_types[var_final_result] == Bool:
... # Emit a call to 'print_bool'
return ins
Let’s look at how to handle a few more AST node cases in visit
.
IR for most binary operators is generated like this:
case ast.BinaryOp():
# Ask the symbol table to return the variable that refers
# to the operator to call.
var_op = st.require(expr.op)
# Recursively emit instructions to calculate the operands.
var_left = visit(st, expr.left)
var_right = visit(st, expr.right)
# Generate variable to hold the result.
var_result = new_var(expr.type)
# Emit a Call instruction that writes to that variable.
ins.append(ir.Call(
loc, var_op, [var_left, var_right], var_result))
return var_result
Binary operators =
, and
and or
will need special cases.
We’ll deal with them later.
Here’s how to generate IR for ”if-then” expressions:
case ast.IfExpr():
if expr.else_clause is None:
# Create (but don't emit) some jump targets.
l_then = new_label()
l_end = new_label()
# Recursively emit instructions for
# evaluating the condition.
var_cond = visit(st, expr.condition)
# Emit a conditional jump instruction
# to jump to 'l_then' or 'l_end',
# depending on the content of 'var_cond'.
ins.append(ir.CondJump(loc, var_cond, l_then, l_end))
# Emit the label that marks the beginning of
# the "then" branch.
ins.append(l_then)
# Recursively emit instructions for the "then" branch.
visit(st, expr.then_clause)
# Emit the label that we jump to
# when we don't want to go to the "then" branch.
ins.append(l_end)
# An if-then expression doesn't return anything, so we
# return a special variable "unit".
return var_unit
else:
... # "if-then-else" case
Tasks
The sandbox generates IR that you can compare with your own IR generator’s output, but note that there are sometimes multiple reasonable ways to generate IR for a particular AST node.
Task 1
Start writing the IR generator, starting from the example snippets above.
- Copy the IR instruction and
IRVar
class definitions to your project. - Copy the incomplete function
generate_ir
to your project. - Add the example
case ast.BinaryOp():
code. - Implement the missing ”
...
” parts, except for other AST node types. - Create a way to manually test IR generation for a given input
and check that the output looks right for input
1 + 2 * 3
.
Task 2
Implement the rest of the IR generator.
Note the following special cases:
- Operator
=
should require the left hand side to be an identifier and emit aCopy
instruction. - Operators
and
andor
should short-circuit just like in the Task 4 of the interpreter - As in the interpreter, unary operator names should be prefixed with
unary_
to distinguish e.g. the binary minus-
and the unary minusunary_-
. - Adding operators
==
and!=
toroot_types
is not necessary.
Manual testing is enough at this point. Unit tests are not as valuable here as they were in the previous stages, because the correctness of IR generation is best verified by checking whether the generated IR ”does the right thing”, but we don’t have a way to run the IR yet. We’ll write end-to-end tests after doing Assembly generation.