7. Assembly generator
The final step in our compiler is to turn IR code into Assembly language, which an assembler can translate to machine code that the computer can execute.
For serious production languages, it’s usually a bad idea to write an Assembly generator ourselves, because it’s difficult to do that well and there exist many high-quality off-the-shelf compiler backends. We will write a simple Assembly generator here, because it’s valuable to have a clear picture of the task and the problems involved.
First we’ll learn the basics of Assembly code and then we’ll see how to structure the Assembly generator. Since this is a compiler course and not an Assembly language course, you don’t need to become proficient with Assembly language.
Introduction to Assembly code
Different computers understand different machine instructions (they have different ”instruction sets”). We will be compiling for the x86-64 family of processors, since they are currently most common on Windows and Linux PCs and servers. Almost all mobile devices, as well as recent Apple computers and some cloud servers, use ARM processors instead. We’d have to write a separate Assembly generator for them if we wanted to target them too, but the principles remain the same.
A processor has a fixed number of fixed-size
extremely fast memory locations called registers.
In x86-64, the so-called ”general-purpose registers” are
(for historical reasons) named: %rax
, %rbx
, %rcx
,
%rdx
, %rsi
, %rdi
, %r8
, %r9
, …, %r15
.
These are sized 8 bytes i.e. 64-bits each.
General purpose registers can (mostly) be used freely by the program. There are other registers with special behaviour or usage conventions too. We’ll look at some of them later.
Machine instructions perform computations on values held in registers. We will write machine instructions in a language called Assembly code. Each line of Assembly code (usually) corresponds to one machine instruction.
Examples of x86-64 Assembly code lines:
- The instruction
addq %rbx, %rax
adds two 8-byte values in registers%rbx
and%rax
, and stores the result in register%rax
. - The instruction
negq %rcx
negates the 8-byte value in register%rcx
and stores the result in register%rcx
.
Computers can’t read Assembly code directly. They require machine instructions to be in a binary form called machine code. Assembly code is translated into machine code by a program called the assembler.
Main memory
Data that does not fit in a computer’s physical registers can be stored in main memory. Main memory can be thought of as an enormous list of 1-byte (8-bit) memory slots, each having a sequence number (0, 1, 2, …) called an address. Variables or registers that hold memory addresses are often called pointers.
For instance, if we store some 8-byte (64-bit) integer value at memory address 200, it would occupy 8 memory slots with addresses 200 to 207. Then a pointer holding the value 200 could be used to read the value from memory, or to overwrite it with a new value.
This is a simplification, but more than good enough for our purposes.
The movq
machine instruction copies 8 bytes (64 bits) of data between registers
and main memory. Putting a register in parentheses, e.g. (%rax)
,
refers to the memory location pointed to by the register.
Examples:
movq %rax, %rbx
copies 8 bytes from register%rax
to register%rbx
.movq (%rax), %rbx
copies 8 bytes from the memory slots starting from the address held in register%rax
, into register%rbx
.movq %rax, (%rbx)
copies 8 bytes from the register%rax
into the memory slots starting from the address held in register%rbx
.movq %rax, 128(%rbx)
copies 8 bytes from the register%rax
into the memory slots starting from: the address held in register%rbx
plus 128.- (
movq
cannot move directly from one memory location to another)
Exercise (optional)
Write machine instructions that compute the sum of the values in registers
%rax
, %rbx
and %rcx
, and store the result into a memory location
48 bytes ahead of where %rdx
points.
Running Assembly code
Let’s see how to run Assembly code on an x86-64 Linux computer (or virtual machine).
Put the following code into a file asmprogram.s
.
Don’t worry about understanding the details just yet.
# Metadata for debuggers and other tools
.global main
.type main, @function
.extern printf
.extern scanf
.section .text # Begins code and data
# Label that marks beginning of main function
main:
# Function stack setup
pushq %rbp
movq %rsp, %rbp
subq $128, %rsp # Reserve 128 bytes of stack space
# Read an integer into -8(%rbp)
movq $scan_format, %rdi # 1st param: "%ld"
leaq -8(%rbp), %rsi # 2nd param: (%rbp - 8)
callq scanf
# If the input was invalid, jump to end
cmpq $1, %rax
jne .Lend
# *** REPLACE THESE TWO LINES WITH YOUR CODE ***
movq -8(%rbp), %rsi # Copy input number to `%rsi`
imulq $2, %rsi # Double the input
# Call function 'printf("%ld\n", %rsi)'
# to print the number in %rsi.
movq $print_format, %rdi
callq printf
# Labels starting with ".L" are local to this function,
# i.e. another function than "main" could have its own ".Lend".
.Lend:
# Return from main with status code 0
movq $0, %rax
movq %rbp, %rsp
popq %rbp
ret
# String data that we pass to functions 'scanf' and 'printf'
scan_format:
.asciz "%ld"
print_format:
.asciz "%ld\n"
Now use the following commands to compile and run the Assembly program:
gcc -g -no-pie -o asmprogram asmprogram.s
./asmprogram
This command uses the GCC compiler collection to compile the source file asmprogram.s
into the final program asmprogram
,
with debug information (-g
) and without position-independence
(-no-pie
, not important to understand yet).
GCC (by default) links our program with the C standard library.
That’s good, since we use the C function printf
in our program.
Later we’ll make our programs independent of the C standard library
and run the Assembler directly without needing GCC.
Exercise (optional)
Modify asmprogram.s
to read two integers instead of one and
to compute the sum of their squares.
Functions and the stack
When we run out of register space, we need to store data in main memory. Our language so far is simple enough that we could give each IR variable a fixed spot in memory, effectively implementing all IR variables as global variables. That would work, but it would break as soon as we add potentially recursive functions to our language: a second instantiation of a function would overwrite the data of the previous instantiation.
The conventional solution is to place a function’s local variables on the stack, which is an area of memory reserved for function locals. When a function starts, it grows the stack enough to fit its local variables. Before a function returns, it restores the stack to its original size. The stack is also sometimes used to store function parameters when calling other functions.
A calling convention specifies how functions should use registers and stack space. To anticipate adding functions to our language, and to make our machine code easier to analyze with a debugger, we will now see how to generate Assembly that is compliant with the System V 64-bit calling convention, which is what Linux uses.
A function call’s stack frame means the range of stack addresses used
by that instantiation of the function.
When a function is called, the stack pointer register %rsp
points to the beginning of the stack frame.
The following are stored just before the stack frame:
- any function parameters that didn’t fit into registers
- the address of the code to return to when the function is done
Beyond these, a function is free to move %rsp
to grow its stack frame
in order to store any local variables that don’t fit in registers,
or to store function parameters and return addresses when calling other functions.
On most systems, including x86-64 processors, the convention is for
the stack to grow downwards, i.e. growing the stack means
subtracting from the stack pointer %rsp
.
Defining functions
Let’s now look at an example of the conventional way to write functions on Linux. We’ll go through the instructions one by one.
pushq %rbp
movq %rsp, %rbp
subq $32, %rsp # reserve 32 bytes for locals
# ... function implementation goes here ...
movq %rbp, %rsp
popq %rbp
ret
When the function is entered, before the first instruction is executed, the stack looks like this:
( ... caller's stuff ... ) <-- %rbp points somewhere here
[ arguments that don't fit in registers (if any) ]
[ return address ] <-- %rsp points here
Register %rbp
is conventionally the stack base pointer.
It’s used to point to the start of the current function’s stack.
On x86-64, at the start of a function call, %rbp
still points to
the caller’s stack base, so the first thing to do is to fix this.
Our first instruction pushq %rbp
stores the value of the stack base pointer
on the stack, because we intend to write to %rbp
next, and the
calling convention says we must restore %rbp
to its original value
before returning from a function.
The stack now looks like this:
( ... caller's stuff ... ) <-- %rbp still points somewhere here
[ arguments ]
[ return address ]
[ caller's %rbp ] <-- %rsp points here
The 2nd instruction movq %rsp, %rbp
copies %rsp
to %rbp
.
Now %rbp
points to the base of our stack, and we’re free to change %rsp
as needed (e.g. to push function parameters before calling another function).
Having %rbp
always point to the base of the stack frame even as %rsp
may move around
is useful for two reasons:
- We can now conveniently refer to locals and arguments using addresses relative to
(%rbp)
, e.g. the first local will be at-8(%rbp)
and the first argument that didn’t fit in registers will be at16(%rbp)
. - Debuggers can display the call stack correctly when
%rbp
always points to the return address.
( ... caller's stuff ... )
[ arguments ]
[ return address ]
[ caller's %rbp ] <-- %rsp and %rbp both point here
The 3rd instruction subq $32, %rsp
reserves space on the stack
by moving %rsp
downwards by that amount. The required space
(32 bytes in this example) is determined by the compiler based on
the number of local variables.
( ... caller's stuff ... )
[ arguments ]
[ return address ]
[ caller's %rbp ] <-- %rbp points here
[ space for variables ] <-- %rsp points to lowest byte of this
Now the function implementation is free to:
- push and pop more things on the stack, e.g. in order to call other functions,
- use expressions like
-8(%rbp)
and-16(%rbp)
to access local variables, - use expressions like
16(%rbp)
and24(%rbp)
to access arguments that didn’t fit in registers.
When we want to return from the function,
the instructions movq %rbp, %rsp
and popq %rbp
undo what we did at the beginning:
they restore %rsp
and %rbp
to how they were initially.
( ... caller's stuff ... ) <-- %rbp points somewhere here again
[ arguments ]
[ return address ] <-- %rsp points here again
Finally, the instruction ret
returns to the calling function by popping
the return address from the stack and continuing execution from there.
The function must place its return value, if any, in register
%rax
before invoking ret
.
Calling functions
Let’s now look at what this looks like on the caller’s side.
The Linux x86-64 calling convention specifies that 64-bit integer parameters
are passed in registers %rdi
, %rsi
, %rdx
, %rcx
, %r8
and %r9
.
If there are more parameters, they must be pushed on the stack
in last-to-first order, and later popped.
Here’s an example:
movq ..., %rdi
movq ..., %rsi
...
movq ..., %r9
pushq ...
pushq ...
callq function_to_call
addq $16, %rsp
The initial movq
instructions set the parameters that fit in registers,
and the following pushq
instructions push two more parameters
that don’t fit in registers.
The callq
instruction pushes the address of the next instruction to the stack
and continues execution from the called function’s first instruction.
The final addq $16, %rsp
moves the stack pointer upwards by 16 bytes,
which effectively undoes the two pushes that we did.
The calling convention has a requirement that the stack is
aligned to 16 bytes (why?).
This means that when executing callq
(i.e. after stack arguments are pushed),
%rsp
must be a multiple of 16.
If it is not, a subq $8, %rsp
(subtract 8 from %rsp
) instruction
must be added before pushing the arguments to the stack.
There’s one more important point in the calling convention:
a function is allowed to modify any register it wants,
except for %rbp
, %rbx
, %r12
, %r13
, %r14
and %r15
(why?).
These are named callee-saved registers because the called function must
restore their previous values before returning.
The rest of the registers are caller-saved – the caller must
back them up before calling a function if it wants to keep their values.
Exercise (optional)
Write Assembly code for a function that:
- takes two parameters,
- returns the sum of the parameters, plus 10.
Exercise (optional)
Add the function you wrote to the program template
under a new label f
, and change function main
to
- read two input integers instead of one,
- call
f
with these two integers as parameters, - print the return value.
Resources for x86-64 Assembly
Warning
When looking at Assembly code examples on the internet,
be aware that there are two widespread syntaxes:
Intel syntax and AT&T syntax. On this course, we’ve made
the arbitrary choice to use AT&T syntax,
which you’ll recognize by register names starting with %
.
Crucially, the syntaxes have different argument order: in Intel syntax, the output register is written first, in AT&T it’s written last.
Much online material is also old and written for 32-bit x86 computers.
You can recognize this from its use of registers starting with
the letter ’e’ instead of ’r’, e.g. %eax
instead of %rax
.
The instructions and ideas are very similar,
but some instructions need a suffix (like the ’q’ in movq
)
to operate in 64-bit mode, and the calling conventions are different.
The calling conventions are also different in code written for
other operating systems.
- Stanford CS107 slides provide an excellent introduction: lecture 15, lecture 16, lecture 17, lecture 18
- Introduction to the GNU Assembler is a compact tutorial for writing and running Assembly code on a Linux machine, and interfacing with C code.
- Compiler Explorer turns your C/C++ code into Assembly code. Excellent for answering ”how do I write this in Assembly?” questions.
- To see assembly code in AT&T syntax, click on the gear icon and deselect ”Intel syntax”.
- To see a more compact but sometimes more confusing / ”clever” translation, add
-O2
to ”Compiler options” to enable optimizations.
- Short summary of registers (instruction table uses Intel argument order)
- x86-64 reference (detailed and technical, uses Intel argument order)
- System V ABI (Linux calling convention)
- Linux system call reference
Assembly generator
A simple non-optimizing Assembly generator is at its core pretty straightforward: it takes a list of IR instructions and translates each of them into one or more Assembly instructions.
The only wrinkle is how to map the unbounded set of IR variables to the very limited set of available registers to minimize the need to access the stack. This is the register allocation problem.
A good register allocator requires first analyzing the IR in ways that we’ve not gotten to yet, so we’ll write our initial assembly generator without a register allocator: we’ll just store all our variables on the stack. Since modern processors are quite well optimized for reading and writing stack values, we can expect only a 2-5x slowdown compared to using registers efficiently.
Tracking the stack
Let’s create a class that calculates the required stack space and maps each IR variable to a stack location.
We call this class Locals
, because global variables will not use stack space.
(For now, our only global variables are functions like print_int
.)
# src/compiler/assembly_generator.py
class Locals:
"""Knows the memory location of every local variable."""
_var_to_location: dict[ir.IRVar, str]
_stack_used: int
def __init__(self, variables: list[ir.IRVar]) -> None:
... # Completed in task 1
def get_ref(self, v: ir.IRVar) -> str:
"""Returns an Assembly reference like `-24(%rbp)`
for the memory location that stores the given variable"""
return self._var_to_location[v]
def stack_used(self) -> int:
"""Returns the number of bytes of stack space needed for the local variables."""
return self._stack_used
You are now ready to do task 1.
To create an instance of Locals
, we will need to find all IR variables used by
our IR instructions. You can use
this implementation of get_all_ir_variables
,
or implement it yourself.
Generating Assembly code
We’ll now sketch the function that iterates over IR instructions and emits Assembly code. You’ll get to complete this function yourself.
First we need to make some declarations and set up the stack correctly, as described above.
.extern print_int
.extern print_bool
.extern read_int
.global main
.type main, @function
.section .text
main:
pushq %rbp
movq %rsp, %rbp
subq $40, %rsp # ($40 is just an example)
.extern print_int
means that we expect functionprint_int
to be defined in another file..global main
means that the symbolmain
is to be made available to other files..type main, @function
marksmain
as a function. Not essential, but some debugging tools use this information..section .text
means that the following code goes into the ”text” section of the file. That’s the correct place for program code.main:
is a label, similar to IR labels. It sets the symbol ”main” to the address of the next instruction.- The Assembly instructions that set up the function stack as we saw earlier.
The first parameter of subq
must be the amount of stack space
reserved by Locals
.
The rest of the output consists of instructions generated from each IR instruction. Here is a partial implementation:
def generate_assembly(instructions: list[ir.Instruction]) -> str:
lines = []
def emit(line: str) -> None: lines.append(line)
locals = Locals(
variables=get_all_ir_variables(instructions)
)
# ... Emit initial declarations and stack setup here ...
for insn in instructions:
emit('# ' + str(insn))
match insn:
case ir.Label():
emit("")
# ".L" prefix marks the symbol as "private".
# This makes GDB backtraces look nicer too:
# https://stackoverflow.com/a/26065570/965979
emit(f'.L{insn.name}:')
case ir.LoadIntConst():
if -2**31 <= insn.value < 2**31:
emit(f'movq ${insn.value}, {locals.get_ref(insn.dest)}')
else:
# Due to a quirk of x86-64, we must use
# a different instruction for large integers.
# It can only write to a register,
# not a memory location, so we use %rax
# as a temporary.
emit(f'movabsq ${insn.value}, %rax')
emit(f'movq %rax, {locals.get_ref(insn.dest)}')
case ir.Jump():
emit(f'jmp .L{insn.label.name}')
... # Completed in task 2
You are now ready to do task 2.
Intrinsics
When we encounter an instruction like Call(print_int, x1, x2)
,
we will translate it into a call to function print_int
.
However it would be silly (sort of)
to define basic operators like +
as callable functions.
Instead, we define +
as an intrinsic: a special case that
emits the appropriate Assembly code for each call to +
.
We want +
to compile to assembly code that reads arguments from any arbitrary
memory locations A
and B
, and leaves the sum in %rax
:
movq A, %rax # Copy from A to %rax
addq B, %rax # Write (B + %rax) to %rax
We can define other intrinsics similarly.
Sample code
If you have a good grasp of x86-64 or a desire to learn more of it, you are welcome to write all the necessary intrinsics in your code yourself, but you may also download and use this intrinsics.py.
This module defines a dictionary all_intrinsics
, which maps names like +
to functions that emit the given Assembly code. These functions are called like this:
all_intrinsics['+'](IntrinsicArgs(
arg_refs=...,
result_register=...,
emit=...
))
arg_refs
must be a list of registers or memory locations references of the arguments. You can get them fromLocals
.result_register
must be a register name. For now you can always use%rax
and copy the result from there to the appropriate stack location.emit
must be a function that adds a given line of Assembly code to the output.
You are now ready to do task 3.
Running the assembler
The final step to complete your compiler is to pass the generated Assembly code to an assembler, which translates the Assembly code to a binary form that the computer can execute.
You can download assembler.py
and call its function assembler(code, filename)
. This does two things:
- It includes a definition for
print_int
,print_bool
andread_int
. - It invokes the GNU Assembler to produce the finished program.
It does these things slightly differently than how we did it above so as not to depend on the C standard library.
Tasks
Task 1
Complete Locals.__init__
. It must:
- initialize
_var_to_location
to map each IR var to stack locations-8(%rbp)
,-16(%rbp)
, …, - initialize
_stack_used
to the number of bytes used.
Task 2
Add the generate_assembly
example function to your project,
and implement cases for the following IR instructions:
LoadBoolConst
: representtrue
as1
andfalse
as0
Copy
: copy via%rax
becausemovq
can’t have two memory argumentsCondJump
: usecmpq $0, _condition_
to set comparison flags, thenjne _label_
to ”jump if not equal” (i.e. ”condition != 0”), thenjmp _label_
to jump in the other case.
Finally, restore the stack and return from the function as described above.
Check manually that the expression { var x = true; if x then 1 else 2; }
yields sensible-looking Assembly code. We’ll test more thoroughly once
we’ve implemented Call
.
Task 3
Implement the Assembly generation for IR instruction Call
.
Make it use an intrinsic if available, and generate function
calling code otherwise.
Pass arguments according to the calling convention. It’s OK to not implement stack-based arguments, i.e. you may limit the number of arguments to 6, so they all fit in registers.
Task 4
Add assembler.py
to your codebase and
edit function call_compiler
in __main__.py
to run all compiler stages,
including the assembler.
Now try out your finished compiler! 🎉
Task 5 (optional)
While you can rely on Test Gadget for end-to-end testing, you may still want to write your own end-to-end testing framework. This task is optional, but very useful if you’d like to develop your language beyond this course.
With an end-to-end test framework, you can easily write programs in your language, along with inputs and expected outputs, and have the test framework run the compiler and check that the compiled program produces the expected results.
Here is one possible design:
- Read all files from directory
test_programs
. Let’s call these test files. - Let each such test file contain a sequence of test cases separated by lines containing only
---
. - Handle each line of a test case like this:
- If the line is
input ...
then it defines an input to give the program. - If the line is
prints ...
then it defines an expected output that the compiled program should print. - All other lines are considered part of the test program.
- If the line is
- For each test case, define a Python test function like this:
import sys ... # for each test case found: sys.modules[__name__].__setattr__( f'test_{test_case.name}', run_test_case )
where
run_test_case
is a Python function that compiles and runs the test program, giving the required inputs and checking that the expected outputs are printed in the expected order, and that no other outputs are printed.