1. Overview
A compiler is typically split into stages, each taking as input the result of the previous stage. We will look very briefly at each of the stages here and then learn how to build them one by one in the following chapters.
This is a typical ”textbook” arrangement of compiler stages. Production-quality compilers may have more stages, such as various optimizers. It’s also sometimes possible to skip or combine some stages.
Tokenizer
A tokenizer takes source text and splits it into a list of tokens i.e. pieces of text that are meaningful ”words” in the program. It discards whitespace and comments.
Example input:
if a < 10 then {
print_int(3*x); # this here is a comment
}
Example output:
['if', 'a', '<', '10', 'then', '{',
'print_int', '(', '3', '*', 'x', ')', ';', '}']
Parser
A parser takes a list of tokens and produces an abstract syntax tree (AST), which connects operations to their parameters. It discards syntactic helpers such as parentheses and semicolons.
Example input:
['if', 'a', '<', '10', 'then', '{',
'print_int', '(', '3', '*', 'x', ')', ';', '}']
Example output:
Simple interpreted languages can stop here and interpret the AST directly. We will write such an AST interpreter on this course and keep going after that.
Type checker
A type checker looks at the AST to infer and check the types of expressions,
e.g. that the arguments for *
are both numbers, etc.
Type information can optionally be added to the AST: each AST node can be annotated with the type of its result.
Example input:
Example output:
For instance, the type of operator <
is (Int, Int) → Bool
, which means that
it expects two integer parameters and results in a boolean.
Type checking verifies that both parameters (a
and 10
) are indeed integers,
and then it knows that the type of the result of the expression a < 10
is a boolean.
This picture omits some details that we’ll return to later.
IR generator
An intermediate representation (IR) generator (or ”intermediate code generator”) turns an AST (usually a type-annotated AST) into assembly-like machine-independent code.
Each compiler has its own flavor of IR, but they all tend to follow similar principles.
Our IR is a list of simple instructions that operate on fixed-size memory locations called IR variables. Examples of IR instructions:
LoadIntConst(<constant>, <variable>)
: load <constant> into <variable>.Call("+", [<variable 1>, <variable 2>], <variable 3>)
: sum the numbers in <variable 1> and <variable 2>, and store the result in <variable 3>,CondJump(<variable>, <location 1>, <location 2>)
: if <variable> contains true, continue execution at <location 1>, otherwise at <location 2>.
Example input:
Example output:
# We assume 'a' is stored in IR variable 'x1' and 'x' in 'x2'.
LoadIntConst(10, x3) # Load constant 10 to register x3.
Call("<", [x1, x3], x4) # Store result of comparison in x4.
CondJump(x4, L1, L2) # Continue execution from L1
# if x4 is true, otherwise from L2
L1 # Code for label L1 starts here.
LoadIntConst(3, x5) # Load constant 3 to register x5.
Call("*", [x5, x2], x6) # Store result of multiply in x6.
Call("print_int", [x6], x7) # Call 'print_int' with parameter x6
# and store (ignored) result in x7.
L2 # Jumping to label L2 skips the
# "then"-clause.
The IR is still more abstract than machine code: it uses an infinite number of abstract registers instead of mapping values to a finite number of physical registers, and it may use higher-level instructions like ”call” that don’t map to a single machine instruction.
Assembly generator
An assembly generator (or ”machine code generator”) translates IR code into symbolic machine code (assembly code).
It expands abstract instructions into the necessary sequence of machine instructions and maps abstract registers to a limited set of physical registers, or to main memory locations when physical registers are exhausted.
Example input:
LoadIntConst(10, x3)
Call("<", [x1, x3], x4)
CondJump(x4, L1, L2)
L1
LoadIntConst(3, x5)
Call("*", [x5, x2], x6)
Call("print_int", [x6], x7)
L2
Example output:
# Entry into function
pushq %rbp
movq %rsp, %rbp
subq $96, %rsp
# LoadIntConst(123, x1) (source code variable "a")
movq $123, -8(%rbp)
# LoadIntConst(456, x2) (source code variable "x")
movq $456, -16(%rbp)
# LoadIntConst(10, x3)
movq $10, -24(%rbp)
# Call(<, [x, x3], x4)
xor %rax, %rax
movq -8(%rbp), %rdx
cmpq -24(%rbp), %rdx
setl %al
movq %rax, -40(%rbp)
# CondJump(x4, L1, L2)
cmpq $0, -40(%rbp)
jne .L1
jmp .L2
.L1:
# LoadIntConst(3, x5)
movq $3, -48(%rbp)
# Call(*, [x5, x2], x6)
movq -48(%rbp), %rax
imulq -16(%rbp), %rax
movq %rax, -64(%rbp)
# Call(print_int, [x6], x7)
movq -64(%rbp), %rdi
call print_int
movq %rax, -80(%rbp)
.L2:
# Return from function
movq -88(%rbp), %rax
movq %rbp, %rsp
popq %rbp
ret
Don’t worry if you’re not comfortable reading assembly code. We will review it when we get there.
Assembler
The assembler turns symbolic machine code i.e. assembly code into machine code, specific to the target machine and target operating system. This is code that the CPU can understand.
This translation is conceptually very simple and boring – almost a 1:1 lookup of instruction to binary representation. We will use an existing assembler on this course instead of writing our own.
Example input:
pushq %rbp
movq %rsp, %rbp
subq $96, %rsp
movq $1, -8(%rbp)
movq $2, -16(%rbp)
movq $10, -24(%rbp)
xor %rax, %rax
movq -8(%rbp), %rdx
cmpq -24(%rbp), %rdx
setl %al
movq %rax, -40(%rbp)
cmpq $0, -40(%rbp)
jne .L1
jmp .L2
.L1:
movq $3, -48(%rbp)
movq -48(%rbp), %rax
imulq -16(%rbp), %rax
movq %rax, -64(%rbp)
movq -64(%rbp), %rdi
call print_int
movq %rax, -80(%rbp)
.L2:
movq -88(%rbp), %rax
movq %rbp, %rsp
popq %rbp
ret
Example output:
01010101 01001000 10001001 11100101 01001000 10000011
11101100 01100000 01001000 11000111 01000101 11111000
00000001 00000000 00000000 00000000 01001000 11000111
01000101 11110000 00000010 00000000 00000000 00000000
01001000 11000111 01000101 11101000 00001010 00000000
00000000 00000000 01001000 00110001 11000000 01001000
10001011 01010101 11111000 01001000 00111011 01010101
11101000 00001111 10011100 11000000 01001000 10001001
01000101 11011000 01001000 10000011 01111101 11011000
00000000 01110101 00000010 11101011 00100010 01001000
11000111 01000101 11010000 00000011 00000000 00000000
00000000 01001000 10001011 01000101 11010000 01001000
00001111 10101111 01000101 11110000 01001000 10001001
01000101 11000000 01001000 10001011 01111101 11000000
11101000 00000000 00000000 00000000 00000000 01001000
10001001 01000101 10110000 01001000 10001011 01000101
10101000 01001000 10001001 11101100 01011101 11000011
... plus some platform-specific metadata ...
The machine code file produced by the assembler is typically only one module of a full program. These modules are combined into a full executable file by a linker. The language we make on this course will not compile down to multiple modules, but we may link our compiled code to the C standard library to use some of the functions there.