Compilers

spring 2024

7. Assembly generator

The final step in our compiler is to turn IR code into Assembly language, which an assembler can translate to machine code that the computer can execute.

For serious production languages, it’s usually a bad idea to write an Assembly generator ourselves, because it’s difficult to do that well and there exist many high-quality off-the-shelf compiler backends. We will write a simple Assembly generator here, because it’s valuable to have a clear picture of the task and the problems involved.

First we’ll learn the basics of Assembly code and then we’ll see how to structure the Assembly generator. Since this is a compiler course and not an Assembly language course, you don’t need to become proficient with Assembly language.

Introduction to Assembly code

Different computers understand different machine instructions (they have different ”instruction sets”). We will be compiling for the x86-64 family of processors, since they are currently most common on Windows and Linux PCs and servers. Almost all mobile devices, as well as recent Apple computers and some cloud servers, use ARM processors instead. We’d have to write a separate Assembly generator for them if we wanted to target them too, but the principles remain the same.

A processor has a fixed number of fixed-size extremely fast memory locations called registers. In x86-64, the so-called ”general-purpose registers” are (for historical reasons) named: %rax, %rbx, %rcx, %rdx, %rsi, %rdi, %r8, %r9, …, %r15. These are sized 8 bytes i.e. 64-bits each.

General purpose registers can (mostly) be used freely by the program. There are other registers with special behaviour or usage conventions too. We’ll look at some of them later.

Machine instructions perform computations on values held in registers. We will write machine instructions in a language called Assembly code. Each line of Assembly code (usually) corresponds to one machine instruction.

Examples of x86-64 Assembly code lines:

Computers can’t read Assembly code directly. They require machine instructions to be in a binary form called machine code. Assembly code is translated into machine code by a program called the assembler.

Main memory

Data that does not fit in a computer’s physical registers can be stored in main memory. Main memory can be thought of as an enormous list of 1-byte (8-bit) memory slots, each having a sequence number (0, 1, 2, …) called an address. Variables or registers that hold memory addresses are often called pointers.

For instance, if we store some 8-byte (64-bit) integer value at memory address 200, it would occupy 8 memory slots with addresses 200 to 207. Then a pointer holding the value 200 could be used to read the value from memory, or to overwrite it with a new value.

This is a simplification, but more than good enough for our purposes.

The movq machine instruction copies 8 bytes (64 bits) of data between registers and main memory. Putting a register in parentheses, e.g. (%rax), refers to the memory location pointed to by the register. Examples:

Exercise (optional)

Write machine instructions that compute the sum of the values in registers %rax, %rbx and %rcx, and store the result into a memory location 48 bytes ahead of where %rdx points.

Running Assembly code

Let’s see how to run Assembly code on an x86-64 Linux computer (or virtual machine).

Put the following code into a file asmprogram.s. Don’t worry about understanding the details just yet.

        # Metadata for debuggers and other tools
        .global main
        .type main, @function
        .extern printf
        .extern scanf

        .section .text  # Begins code and data
# Label that marks beginning of main function
main:
        # Function stack setup
        pushq %rbp
        movq %rsp, %rbp
        subq $128, %rsp  # Reserve 128 bytes of stack space

        # Read an integer into -8(%rbp)
        movq $scan_format, %rdi  # 1st param: "%ld"
        leaq -8(%rbp), %rsi      # 2nd param: (%rbp - 8)
        call scanf
        # If the input was invalid, jump to end
        cmpq $1, %rax
        jne .Lend

        # *** REPLACE THESE TWO LINES WITH YOUR CODE ***
        movq -8(%rbp), %rsi  # Copy input number to `%rsi`
        imulq $2, %rsi  # Double the input

        # Call function 'printf("%ld\n", %rsi)'
        # to print the number in %rsi.
        movq $print_format, %rdi
        call printf

# Labels starting with ".L" are local to this function,
# i.e. another function than "main" could have its own ".Lend".
.Lend:
        # Return from main with status code 0
        movq $0, %rax
        movq %rbp, %rsp
        popq %rbp
        ret

# String data that we pass to functions 'scanf' and 'printf'
scan_format:
        .asciz "%ld"
print_format:
        .asciz "%ld\n"

Now use the following commands to compile and run the Assembly program:

gcc -g -no-pie -o asmprogram asmprogram.s

./asmprogram

This command uses the GCC compiler collection to compile the source file asmprogram.s into the final program asmprogram, with debug information (-g) and without position-independence (-no-pie, not important to understand yet).

GCC (by default) links our program with the C standard library. That’s good, since we use the C function printf in our program. Later we’ll make our programs independent of the C standard library and run the Assembler directly without needing GCC.

Exercise (optional)

Modify asmprogram.s to read two integers instead of one and to compute the sum of their squares.

Functions and the stack

When we run out of register space, we need to store data in main memory. Our language so far is simple enough that we could give each IR variable a fixed spot in memory, effectively implementing all IR variables as global variables. That would work, but it would break as soon as we add potentially recursive functions to our language: a second instantiation of a function would overwrite the data of the previous instantiation.

The conventional solution is to place a function’s local variables on the stack, which is an area of memory reserved for function locals. When a function starts, it grows the stack enough to fit its local variables. Before a function returns, it restores the stack to its original size. The stack is also sometimes used to store function parameters when calling other functions.

A calling convention specifies how functions should use registers and stack space. To anticipate adding functions to our language, and to make our machine code easier to analyze with a debugger, we will now see how to generate Assembly that is compliant with the System V 64-bit calling convention, which is what Linux uses.

A function call’s stack frame means the range of stack addresses used by that instantiation of the function. When a function is called, the stack pointer register %rsp points to the beginning of the stack frame.

The following are stored just before the stack frame:

  1. any function parameters that didn’t fit into registers
  2. the address of the code to return to when the function is done

Beyond these, a function is free to move %rsp to grow its stack frame in order to store any local variables that don’t fit in registers, or to store function parameters and return addresses when calling other functions.

On most systems, including x86-64 processors, the convention is for the stack to grow downwards, i.e. growing the stack means subtracting from the stack pointer %rsp.

Defining functions

Let’s now look at an example of the conventional way to write functions on Linux. We’ll go through the instructions one by one.

        pushq  %rbp
        movq   %rsp, %rbp
        subq   $32, %rsp    # reserve 32 bytes for locals

        # ... function implementation goes here ...

        movq   %rbp, %rsp
        popq   %rbp
        ret

When the function is entered, before the first instruction is executed, the stack looks like this:

( ... caller's stuff ... )  <-- %rbp points somewhere here
[ arguments that don't fit in registers (if any) ]
[ return address ]          <-- %rsp points here

Register %rbp is conventionally the stack base pointer. It’s used to point to the start of the current function’s stack. On x86-64, at the start of a function call, %rbp still points to the caller’s stack base, so the first thing to do is to fix this.

Our first instruction pushq %rbp stores the value of the stack base pointer on the stack, because we intend to write to %rbp next, and the calling convention says we must restore %rbp to its original value before returning from a function.

The stack now looks like this:

( ... caller's stuff ... )  <-- %rbp still points somewhere here
[ arguments ]
[ return address ]
[ caller's %rbp ]           <-- %rsp points here

The 2nd instruction movq %rsp, %rbp copies %rsp to %rbp. Now %rbp points to the base of our stack, and we’re free to change %rsp as needed (e.g. to push function parameters before calling another function).

Having %rbp always point to the base of the stack frame even as %rsp may move around is useful for two reasons:

  1. We can now conveniently refer to locals and arguments using addresses relative to (%rbp), e.g. the first local will be at -8(%rbp) and the first argument that didn’t fit in registers will be at 16(%rbp).
  2. Debuggers can display the call stack correctly when %rbp always points to the return address.
( ... caller's stuff ... )
[ arguments ]
[ return address ]
[ caller's %rbp ]          <-- %rsp and %rbp both point here

The 3rd instruction subq $32, %rsp reserves space on the stack by moving %rsp downwards by that amount. The required space (32 bytes in this example) is determined by the compiler based on the number of local variables.

( ... caller's stuff ... )
[ arguments ]
[ return address ]
[ caller's %rbp ]        <-- %rbp points here
[ space for variables ]  <-- %rsp points to lowest byte of this

Now the function implementation is free to:

When we want to return from the function, the instructions movq %rbp, %rsp and popq %rbp undo what we did at the beginning: they restore %rsp and %rbp to how they were initially.

( ... caller's stuff ... )  <-- %rbp points somewhere here again
[ arguments ]
[ return address ]          <-- %rsp points here again

Finally, the instruction ret returns to the calling function by popping the return address from the stack and continuing execution from there. The function must place its return value, if any, in register %rax before invoking ret.

Calling functions

Let’s now look at what this looks like on the caller’s side.

The Linux x86-64 calling convention specifies that 64-bit integer parameters are passed in registers %rdi, %rsi, %rdx, %rcx, %r8 and %r9. If there are more parameters, they must be pushed on the stack in last-to-first order, and later popped.

Here’s an example:

movq ..., %rdi
movq ..., %rsi
...
movq ..., %r9
pushq ...
pushq ...

call function_to_call

addq $16, %rsp

The initial movq instructions set the parameters that fit in registers, and the following pushq instructions push two more parameters that don’t fit in registers.

The call instruction pushes the address of the next instruction to the stack and continues execution from the called function’s first instruction.

The final addq $16, %rsp moves the stack pointer upwards by 16 bytes, which effectively undoes the two pushes that we did.

The calling convention has a requirement that the stack is aligned to 16 bytes (why?). This means that when executing call (i.e. after stack arguments are pushed), %rsp must be a multiple of 16. If it is not, a subq $8, %rsp (subtract 8 from %rsp) instruction must be added before pushing the arguments to the stack.

There’s one more important point in the calling convention: a function is allowed to modify any register it wants, except for %rbp, %rbx, %r12, %r13, %r14 and %r15 (why?). These are named callee-saved registers because the called function must restore their previous values before returning. The rest of the registers are caller-saved – the caller must back them up before calling a function if it wants to keep their values.

Exercise (optional)

Write Assembly code for a function that:

  • takes two parameters,
  • calls another function f2 with the second parameter plus 5,
  • returns the sum of first parameter plus 10.

Exercise (optional)

Add the function you wrote to the program template under a new label f, and change function main to

  • read two input integers instead of one,
  • call f with these two integers as parameters,
  • print the return value.

Resources for x86-64 Assembly

Warning

When looking at Assembly code examples on the internet, be aware that there are two widespread syntaxes: Intel syntax and AT&T syntax. On this course, we’ve made the arbitrary choice to use AT&T syntax, which you’ll recognize by register names starting with %.

Crucially, the syntaxes have different argument order: in Intel syntax, the output register is written first, in AT&T it’s written last.

Much online material is also old and written for 32-bit x86 computers. You can recognize this from its use of registers starting with the letter ’e’ instead of ’r’, e.g. %eax instead of %rax. The instructions and ideas are very similar, but some instructions need a suffix (like the ’q’ in movq) to operate in 64-bit mode, and the calling conventions are different. The calling conventions are also different in code written for other operating systems.

Assembly generator

A simple non-optimizing Assembly generator is at its core pretty straightforward: it takes a list of IR instructions and translates each of them into one or more Assembly instructions.

The only wrinkle is how to map the unbounded set of IR variables to the very limited set of available registers to minimize the need to access the stack. This is the register allocation problem.

A good register allocator requires first analyzing the IR in ways that we’ve not gotten to yet, so we’ll write our initial assembly generator without a register allocator: we’ll just store all our variables on the stack. Since modern processors are quite well optimized for reading and writing stack values, we can expect only a 2-5x slowdown compared to using registers efficiently.

Tracking the stack

Let’s create a class that calculates the required stack space and maps each IR variable to a stack location.

We call this class Locals, because global variables will not use stack space. (For now, our only global variables are functions like print_int.)

# src/compilers/assembly_generator.py

class Locals:
    """Knows the memory location of every local variable."""
    _var_to_location: dict[ir.IRVar, str]
    _stack_used: int

    def __init__(self, variables: list[ir.IRVar]) -> None:
        ...  # Completed in task 1

    def get_ref(self, v: ir.IRVar) -> str:
        """Returns an Assembly reference like `-24(%rbp)`
        for the memory location that stores the given variable"""
        return self._var_to_location[v]

    def stack_used(self) -> int:
        """Returns the number of bytes of stack space needed for the local variables."""
        return self._stack_used

You are now ready to do task 1.

To create an instance of Locals, we will need to find all IR variables used by our IR instructions. You can use this implementation of get_all_ir_variables, or implement it yourself.

Generating Assembly code

We’ll now sketch the function that iterates over IR instructions and emits Assembly code. You’ll get to complete this function yourself.

First we need to make some declarations and set up the stack correctly, as described above.

    .extern print_int
    .extern print_bool
    .extern read_int
    .global main
    .type main, @function

    .section .text

main:
    pushq %rbp
    movq %rsp, %rbp
    subq $40, %rsp  # ($40 is just an example)

The first parameter of subq must be the amount of stack space reserved by Locals.

The rest of the output consists of instructions generated from each IR instruction. Here is a partial implementation:

def generate_assembly(instructions: list[ir.Instruction]) -> str:
    lines = []
    def emit(line: str) -> None: lines.append(line)

    locals = Locals(
        variables=get_all_ir_variables(instructions)
    )

    # ... Emit initial declarations and stack setup here ...

    for insn in instructions:
        emit('# ' + str(insn))
        match insn:
            case ir.Label():
                emit("")
                # ".L" prefix marks the symbol as "private".
                # This makes GDB backtraces look nicer too:
                # https://stackoverflow.com/a/26065570/965979
                emit(f'.L{insn.name}:')
            case ir.LoadIntConst():
                if -2**31 <= insn.value < 2**31:
                    emit(f'movq ${insn.value}, {locals.get_ref(insn.dest)}')
                else:
                    # Due to a quirk of x86-64, we must use
                    # a different instruction for large integers.
                    # It can only write to a register,
                    # not a memory location, so we use %rax
                    # as a temporary.
                    emit(f'movabsq ${insn.value}, %rax')
                    emit(f'movq %rax, {locals.get_ref(insn.dest)}')
            case ir.Jump():
                emit(f'jmp .L{insn.label.name}')
            ...  # Completed in task 2

You are now ready to do task 2.

Intrinsics

When we encounter an instruction like Call(print_int, x1, x2), we will translate it into a call to function print_int. However it would be silly (sort of) to define basic operators like + as callable functions. Instead, we define + as an intrinsic: a special case that emits the appropriate Assembly code for each call to +.

We want + to compile to assembly code that reads arguments from any arbitrary memory locations A and B, and leaves the sum in %rax:

movq A, %rax  # Copy from A to %rax
addq B, %rax  # Write (B + %rax) to %rax

We can define other intrinsics similarly.

Sample code

If you have a good grasp of x86-64 or a desire to learn more of it, you are welcome to write all the necessary intrinsics in your code yourself, but you may also download and use this intrinsics.py.

This module defines a dictionary all_intrinsics, which maps names like + to functions that emit the given Assembly code. These functions are called like this:

all_intrinsics['+'](IntrinsicArgs(
    arg_refs=...,
    result_register=...,
    emit=...
))
  • arg_refs must be a list of registers or memory locations references of the arguments. You can get them from Locals.
  • result_register must be a register name. For now you can always use %rax and copy the result from there to the appropriate stack location.
  • emit must be a function that adds a given line of Assembly code to the output.

You are now ready to do task 3.

Running the assembler

The final step to complete your compiler is to pass the generated Assembly code to an assembler, which translates the Assembly code to a binary form that the computer can execute.

You can download assembler.py and call its function assembler(code, filename). This does two things:

  1. It includes a definition for print_int, print_bool and read_int.
  2. It invokes the GNU Assembler to produce the finished program.

It does these things slightly differently than how we did it above so as not to depend on the C standard library.

Tasks

Task 1

Complete Locals.__init__. It must:

  • initialize _var_to_location to map each IR var to stack locations -8(%rbp), -16(%rbp), …,
  • initialize _stack_used to the number of bytes used.

Task 2

Add the generate_assembly example function to your project, and implement cases for the following IR instructions:

  • LoadBoolConst: represent true as 1 and false as 0
  • Copy: copy via %rax because movq can’t have two memory arguments
  • CondJump: use cmpq $0, _condition_ to set comparison flags, then jne _label_ to ”jump if not equal” (i.e. ”condition != 0”), then jmp _label_ to jump in the other case.

Finally, restore the stack and return from the function as described above.

Check manually that the expression { var x = true; if x then 1 else 2; } yields sensible-looking Assembly code. We’ll test more thoroughly once we’ve implemented Call.

Task 3

Implement the Assembly generation for IR instruction Call. Make it use an intrinsic if available, and generate function calling code otherwise.

Pass arguments according to the calling convention.

It’s OK to not implement stack-based arguments, i.e. you may limit the number of arguments to 6, so they all fit in registers.

Task 4

Add assembler.py to your codebase and make a command compile in __main__.py that runs all compiler stages, including the assembler.

Now try out your finished compiler! 🎉

Task 5

Write an end-to-end test framework where you can easily write programs in your language, along with inputs and expected outputs for the compiled programs.

Here is one possible design:

  • Read all files from directory test_programs. Let’s call these test files.
  • Let each such test file contain a sequence of test cases separated by lines containing only ---.
  • Handle each line of a test case like this:
    • If the line is input ... then it defines an input to give the program.
    • If the line is prints ... then it defines an expected output that the compiled program should print.
    • All other lines are considered part of the test program.
  • For each test case, define a Python test function like this:
    import sys
    ...
      # for each test case found:
          sys.modules[__name__].__setattr__(
              f'test_{test_case.name}',
              run_test_case
          )
    

    where run_test_case is a Python function that compiles and runs the test program, giving the required inputs and checking that the expected outputs are printed in the expected order, and that no other outputs are printed.

Use this system to write end-to-end test cases for your compiler until you’re reasonably confident that it works correctly.