Project

The course project consists of a basic compiler that we build together during the course, plus a set of extra features for you to choose from to raise your grade.

Chapters 2-7 have tasks that guide you in building up a minimally acceptable course project which, if done properly, gives grade 2. To get a better grade, you can pick more features to implement from this page.

General requirements

You should cover all features reasonably well with automated tests (unit tests or end-to-end tests, or both).

Your compiler must target Linux x86-64 or some other machine with real hardware. If it doesn’t target x86-64, you must provide a way to easily run the compiled programs in a suitable emulator on a Linux x86-64 machine. Including an interpreter and maintaining it as you add features is not required, but it gives some extra points.

While the course’s example code is in Python, you are welcome to write your compiler in any language you like. If you use anything not available in standard Ubuntu 22.04 repositories, then please include clear instructions for running your compiler and your automated tests.

You may use your language’s general-purpose standard library features, including the regular expression library. No other libraries or tools meant for building compilers are allowed, e.g. no parser combinators. General-purpose libraries like collection extensions and unit testing libraries are OK. When in doubt, ask. Your compiler may optionally link its outputs with the C standard library.

ChatGPT, Copilot and similar AI systems are not allowed for code generation. You may ask them theory questions if you want, but don’t give them your code to debug or explain, and don’t take any code from them. This is so that you don’t miss out on valuable practice in writing, debugging and understanding code. While AI tools are impressive, they still frequently make mistakes, lead you down undesirable paths, and have trouble remembering the big picture.

Grade requirements

‼️ Note (2024-05-10): Requirements for grades 2-5 were lowered from what is shown here. See the feedback e-mail.

Grade	Minimum points	Example of how to get enough points
1	10	Implement base language poorly or incompletely.
2	18	Implement base language well.
3	30	Implement base language + functions well.
4	50	Implement most recommended features.
5	75	Implement some advanced features or analyses too.

Detailed grading guidelines are below. It’s a good idea to aim for slightly more points than your target grade requires.

The features below are divided into the following categories:

Recommended features
Advanced features
Analyses and optimizations

Some features have other features as prerequisites, but other than that you can pick and choose features in any order you like.

Deadline and submitting

The project’s deadline is March 18, 23:59:59.

To submit your project, send it by e-mail as a .tar.gz or .zip attachment to martin.partel@helsinki.fi. Please also mention your student number in the e-mail.

Recommended features

These features develop the language to be quite close to the C language.

Base language (20p)

You must implement a tokenizer, parser, type checker, IR generator and Assembly generator for the following language constructs, as instructed by the tasks in chapters 2-7. Chapter 4’s interpreter is optional but recommended.

Integer literals
Boolean literals (compiled to values 1 and 0 for now)
Binary operators +, -, *, /, %, ==, !=, <, <=, >, >=, and, or with precedence and left-associativity.
Unary operators - and not
Library functions print_int, read_int and print_bool
Blocks of statements
Local variables with initializers
Assignment statements (right-associative)
if-then and if-then-else
while loops

Functions (15p)

Initially our compiler only compiles a single top-level expression. Make it possible to define new functions as well.

Example:

fun square(x: Int): Int {
    return x * x;
}

fun vec_len_squared(x: Int, y: Int): Int {
    return square(x) + square(y);
}

fun print_int_twice(x: Int) {
    print_int(x);
    print_int(x);
}

print_int_twice(vec_len_squared(3, 4));

Guide:

Currently our language’s top-level syntactic element is one or more top-level expressions. Let’s change it to be a module instead. A module can contain function definitions in addition to the top-level expressions. (In this design, you should not consider modules and function definitions themselves to be expressions).

Make the following structural changes, but don’t write code to actually do anything with function definitions just yet:
- Define AST node types for function definitions and modules.
- Change the parser to emit a module AST node instead of an expression AST node. (For now it can emit a module containing just a top-level expression and no functions.)
- Change the interpreter, type-checker and IR generator to take a module AST node as input (For now they can ignore any functions and just work on the top-level expression.)
- Change the IR generator to return a dictionary of function names to instruction lists. You can hard-code ”main” as the name of the top-level expression.
- Change the Assembly generator to take this dictionary and emit Assembly code for each function.
In the parser, implement the parsing functions for modules and for function definitions.
In the stages that use a symbol table, put all user-defined functions in the symbol table and treat them similarly to how you’ve treated built-in functions so far.
- In the interpreter, map each user-defined function to a Python function that runs the interpreter on the user-defined function’s body.
- In the type-checker, map each user-defined function to a function type based on the type annotations.
- In the IR and Assembly generators, do the same as with built-in functions.
Change the type-checker, IR generator and Assembly generator to process function definitions in addition to the top-level expression.
Implement return ... expressions in all compiler stages.
Make incoming parameters available as variables in all compiler stages.

Also note:

Function calls and function return values must be correctly type-checked.
Recursion and mutual recursion (f calls g, g calls f) must work.
What happens when the function execution runs to the end without executing a return expression? There are several possible design choices, such as:
- Do nothing (call a missing return ”undefined behaviour” - like in C/C++)
- Check at compile-time that this cannot happen (e.g. using a suitable dataflow analysis)
- Check for this at run-time by inserting guard code that crashes the program at the end of each function.
- Make a rule that if the last expression is not a return, and does not end in a semicolon, then its value becomes the implicit return value.
It is not required to make variables declared in top-level expressions available to functions.

Break and continue (5p)

Add expressions break and continue to the language. Just like in Python, break must exit the innermost active loop, and continue must go back to its beginning.

In the interpreter, it’s probably easiest to implement these by abusing Python exceptions.
The types of these expressions can be Unit.
In IR generation, you need to keep track of the current innermost loop’s start and end labels.
Bonus task (+2p): allow break to take an optional result value, and make that the result value of the loop (instead of Unit). The type checker must check that every break of the same loop has a result value of the same type.

Pointers (10p)

A pointer is a value that holds the address of some other value. For instance, saying var x: Int* = &y; means that x shall have type ”int pointer” (Int*) and value ”address of y” (&y).

We can then dereference i.e. use the pointed-to value with a new unary operator *, like this: a = b + *x;.

Example program:

fun square(p: Int*): Unit {
    *p = *p * *p;
}

var x: Int = 3;
square(&x);
print_int(x);  # Prints 9

Requirements:

For each type T, allow a pointer type T*.
Implement the ”address of” unary operator &.
Implement the ”dereference” unary operator *.
Pointers to pointers (e.g. Int**) should work too.
Pointers to functions (e.g. ((Int) => Int)*) should work too.

A dangling pointer, is a pointer whose target has been removed from the stack. The compiled program is allowed to have undefined behaviour if a dangling pointer is dereferenced.

Consider the following function:

fun oops(): Int* {
    var x: Int = 123;
    return &x;
}

This function returns the address of x, which is on the stack. When we return from the function oops, that stack space is freed for other functions to use. Therefore it’s very unclear what variable might later take the place of x.

Some languages like C and C++ allow this but don’t give any guarantees what happens if such a pointer is ever dereferenced.

Other languages like Rust have sophisticated type systems that prevent such mistakes at compile-time.

Still other languages like Go allocate all variables whose address is ever taken on the heap (a separate memory area) instead of the stack. They make that memory available for reuse only after all pointers to it are gone (garbage collection).

Heap allocation (5p)

Prerequisite: pointer

Currently our pointers can only point to values on the stack. That means we can’t safely return a pointer to a value we created in a function, since it would become a dangling pointer as soon as the function returns. The solution is to have a different memory region, usually called the heap, where we can store values that aren’t tied to any single function call’s lifetime.

Our goal is to enable two new constructs in the language:

Creating a value on the heap: var p: Int* = new Int(123) makes p a pointer to an integer value 123 that resides on the heap.
Destroying a value on the heap: delete p marks the memory occupied by the value pointed to by p as free for reuse. This leaves p (and any copies of p) as dangling pointers.

The operating system can give us a large regions of memory for use as the heap, but we still need a system called a memory allocator to keep track of which parts of the heap are in use and which are free.

A memory allocator consists of two functions:

malloc(n): reserves a memory range of n bytes and returns a pointer to it
free(p): takes a pointer p previously returned by malloc, and marks that memory range as available for reuse.

Implementing even a simple memory allocator is a complicated task, so we shall use the allocator in the C standard library.

You can link to the C standard library by passing extra_libraries=['c'] to the assemble function in assembler.py. This will make malloc and free available for calling.

Now you can implement new and delete as unary operators that call malloc and free respectively.

Example program:

fun make_an_int(n: Int): Int* {
    return new Int(n);
}

var p1 = make_an_int(123);
var p2 = make_an_int(456);
print_int(*p1);  # Prints 123
print_int(*p2);  # Prints 456
*p1 = 789;
print_int(*p1);  # Prints 789
print_int(*p2);  # Prints 456
delete p1;
delete p2;

Pointers to structs (10p)

Prerequisites: pointers, heap allocation.

Structs are datastructures with named and typed fields. They’re similar to Python classes, except they can’t have methods.

First you need to add struct type definitions to modules. The following defines type Point to mean a struct with two Int-typed fields x and y:

struct Point { x: Int, y: Int };

Until now, all the values in our language have been 8 bytes in size – small enough to fit in a register, which has been very convenient. Structs are about to change this. We can delay this complication a little by initially only permitting structs to appear behind pointers, like Point*. The task for ”direct structs” lifts this restriction.

We’ll change the new operator to support creating structs in heap memory, with syntax like this:

var p: Point* = new Point { x: 123, y: 456 }

Operator new should ask malloc to allocate enough space for the struct’s fields.

We won’t allow dereferencing struct pointers with * yet, but we do need to access their fields. We’ll do this by adding a left-associative highest-precedence operator ., which is used like in Python: e.g. p.a accesses field a of the struct pointed to by pointer p.

Operator . can also be chained, so e.g. p.a.b accesses field b of the struct pointed to by field a of the struct pointed to by p.

The Assembly implementation for operator . should essentially look up the offset of the field from the struct and adds it to the struct’s address. This requires the Assembly generator to know the left hand side’s type. You’ll need to add type information to IR variables.

Note that this decision makes type-checking an essential part of compilation. We can’t just check types and forget about them like before (aside from needing the top-level type).

Checking and then forgetting about types is called type erasure.

Most compiled languages cannot erase (all) types: when compiling an expression, the types of the variables involved may affect what code is generated. Depending on the language feature using the type information, this differences in code generation may amount entirely to performance differences, or they may affect program behaviour too.

Structs are an example of types affecting performance only. If we wanted to erase all type-information before generating code, we would have to implement structs as some kind of hash table to support any field names on any structs. This would be wildly inefficient.

Function overloading (i.e. ad-hoc polymorphism) is an example of types affecting behaviour. The type of a function argument affects program behaviour by choosing which function implementation to call.

Python and TypeScript are notable in that their type systems are optional and type information is fully erased after checking. This means that their type systems only aim to increase programmer productivity. This approach sacrifices performance and makes some useful language features impossible, but it has advantages too:

The existing dynamically typed language doesn’t have to be changed, so backwards compatibility is assured.
The type system can be looser and even incorrect (”unsound”) in edge cases, which may help make it easier to use
Programmers can add types to existing projects gradually.

In the case of Python, there is a complementary project called Mojo that aims to take the other approach: it preserves Python-like syntax and partial Python compatibility but requires correct typing for excellent performance.

Java also deserves a mention here. Ordinary Java types are not erased, but when type parameters (generics) were introduced in Java version 5, it was decided that they would be erased, mainly for backwards compatiblity and future implementation flexibility. This means that a class or function taking a type parameter T can’t make decisions based on what type T was passed to it (though there is a convention to partially work around this).

C# went the other way and made its type parameters fully reified i.e. represented in compiled code. One consequence is that in C#, you can define an overloaded function where one version requires a List<String> argument, and the other requires a List<int>. In Java, you can’t do this since both types are erased to just List, so the functions’ parameter lists become identical.

Neither approach is strictly superior. Java’s system is arguably easier to understand, to use reflection on, and to generate code for from other languages. C# has a more direct path to generating performant code, and it has an easier time supporting features like ad-hoc polymorphism and extension methods.

Next we need to consider operator &. Until now, operator & has only really made sense on variables, but it is reasonable to also support taking the addresses of struct fields, so e.g. &p.a.b should work. This needs some special case code, since expression p.a.b normally returns the value stored in b, not its address.

Finally, the assignemtn operator = needs to be extended so that p.a.b = x works.

In summary, here is a high-level task list for supporting structs behind pointers:

Add struct type definitions to the language. As with functions, their order in the input shouldn’t matter.
Implement syntax new <structname> { <fields> } to create correctly sized structs on the heap.
Attach type information to IR variables.
Implement operator . so that a .-separated sequence like p.a.b.c where p is an expression for a pointer. (This is generalized in direct structs.)
Extend operator & to support .-separated sequences.
Extend the assignment operator = to support a .-separated sequence on its left.

Example program:

struct Point { x: Int, y: Int };
struct Line { start: Point*, end: Point* };

fun manhattan_len(line: Line*): Int {
    return abs(line.start.x - line.end.x) +
           abs(line.start.y - line.end.y);
}
fun abs(x: Int): Int { return if x >= 0 then x else -x; }

var line1: Line* = new Line {
    start: new Point { x: 1, y: 2 },
    end: new Point { x: 5, y: 9 },
};
line1.end.y = 10;
var end_x = &line1.end.x;
*end_x = 7;
print_int(manhattan_len(line1));

Advanced features

These features are not as commonly seen in languages as the previous ones, and many of these are more difficult too.

Direct structs (15p)

Prerequisite: functions, pointers to structs.

Currently our structs resemble Python objects in that they are always allocated on the heap and always accessed through a pointer. Our next goal is to allow placing structs directly on the stack or in the fields of other structs. We want to be able to copy whole structs between variables and pass them as function arguments (example).

Note that structs don’t necessarily fit in a register, so we can’t pass copies of whole structs around as easily as we’ve done with integers and booleans so far.

Even if we could magically fit large structs into registers, copying them around at every turn would be very inefficient. Consider e.g. the expression a.b + a.c + .... If it made a full copy of a potentially large struct a every time a is mentioned, that would be quite wasteful.

This suggests the following design: we can internally pass around hidden pointers to structs. We’ll call references. When a copy really must to be made, e.g. when passing a struct as a parameter, we make that copy on the stack and pass on a reference to that copy.

References are equivalent to pointers at the Assembly code level, but they are typically separate concepts at the language implementer’s level and sometimes also the programmer’s level.

The exact definition of ”reference” and ”pointer” varies by programming language, and higher level languages tend to only expose one of the two to the programmer. Some lower-level languages like C++ and Rust expose both pointers and references to the programmer.

In C++, a reference and pointer are very similar, with subtle differences like references not being allowed to be null. Usually the choice between them is a matter of convention.

In Rust, pointers are only allowed in code marked as ”unsafe”. Rust’s references work like pointers, but the compiler heavily restricts their use to guarantee that memory errors, like dereferencing dangling pointers, cannot occur. Thanks to their safety, Rust programmers use references a lot while pointers are typically only used internally by libraries that define safe abstractions around them.

Here is a high-level task list for supporting direct structs:

Allow creating a struct on the stack like this: var p = Point { x: 1, y: 2 }; (no new). In the Assembly generator, Locals should reserve enough stack space for the entire struct.
For each type T, allow an internal reference type Ref(T). (The programmer need not ever see this type.)
Whenever a struct-typed IR variable is read, return a reference to it i.e. a Ref(T)-typed IR variable.
Make operator . work on struct pointers T* as well as struct references Ref(T). Accessing a field a.b should result in a reference to that field. Assignments a.b = c should assign to the field pointed to by the reference on the left.
Ensure that assigning to a struct (b = a) makes a copy of the struct instead of copying the reference.
Ensure that structs can be passed to and returned from functions. The x86-64 calling convention is complicated when it comes to structs, so for simplicity, always pass structs via the stack. To return structs, let the caller reserve stack space for the return value and pass the address of that space in register %rdi (so it takes the place of the first argument).
Ensure that structs can contain other structs directly (not just via pointers like before), e.g. struct Line { start: Point, end: Point }. (Note that recursively nested structs should be forbidden, as their size would be infinite.)
Make == and != work for structs. Given two struct references, they should compare the contents of the structs.
Ensure that struct pointers like p: Point* can be dereferenced explicitly too (e.g. (*p).x = 123).

Arrays (5-10p)

Prerequisites: functions, pointers, heap allocation.

Add the following built-in functions and operators:

int_array_create(n) allocates zero-initialized memory for storing n integers and returns a pointer to it.
int_array_delete(a) deallocates a previously allocated array a.
int_array_get(a, i) returns array a element i.
int_array_set(a, i, x) sets array a element i to value x.

Bonus task

If you’ve implemented direct structs, then instead of adding functions int_array_get and int_array_set, you can introduce syntax a[i] to reference an array element (+3p).

You can also introduce new and delete -based syntax for creating and deleting arrays of any type, instead of functions int_array_create and int_array_delete. (+2p)

You may prefer to allocate array memory with the C function calloc, which guarantees the allocated memory is initialized with zeroes. Note that unlike malloc, it takes two parameters.

For full credit, you must take into account the following conditions:

an array is accessed out-of-bounds
an array is deleted twice
an uninitialized array is deleted

You can choose to implement run-time checks for these conditions, or leave them as undefined behaviour and write a detailed design for the run-time checks, including how the required information is stored at run-time, what the generated IR would look like, etc.

Lambdas and closures (20p)

Prerequisite: pointers to structs.

Add inline anonymous functions i.e. ”lambdas” to the language.

Basic example:

var f = new fun(x: Int): Int { return x * x; };
print_int(f(5));  # Prints 25

Lambdas must be able to access variables declared outside of them. These variables form the lambda’s closure.

Example:

var k = 10;
var f = new fun(x: Int): Int { return x * k; }
print_int(f(5));  # Prints 50
delete f;

Closures should contain the variables by reference i.e. modifying a closure variable in the function should modify the original, not a copy.

Example:

var k = 10;
var set_k = new fun(x: Int): Int { k = x; }
set_k(123);
print_int(k);  # Prints 123
delete set_k;

There are no memory safety requirements. The program is allowed to have undefined behaviour if a lambda is called when variables in its closure no longer exist.

Recommended design: the easiest approach is to represent lambdas internally as pointers to heap-allocated structs that contain:

a pointer to the function’s code
pointers to every closure variable

The function can take the pointer to the lambda’s representation as a hidden parameter.

Allocating lambdas on the heap is a common approach taken by almost all high-level languages like Java, C# and Python (although their compilers may in some cases find a way to avoid the heap allocation).

Some languages like C++ and Rust also allow lambdas to reside on the stack. Making this work well requires non-trivial type-system support.

A lambda’s closure determines how large it is, and this must be somehow represented in its type so that when the lambda is passed around, the compiler knows how much space to reserve for it. Working with lambda types that contain information about the closure would be extremely unwieldy for the programmer, so while C++ and Rust do internally generate such types, they don’t expose them to the programmer. Instead they rely on type inference and polymorphism to make working with these hidden lambda types bearable.

C++ and Rust also have ways to allocate lambdas on the heap, leaving it up to the programmer to navigate the tradeoffs in efficiency, program size, programmer convenience and interface flexibility.

Read-only types (10p)

Prerequisite: pointers

For every type T, allow a new read-only type readonly T. During type-checking, check that no value of type readonly T is overwritten.

Example:

var x: readonly Int = 123;  # OK: can initialize readonly
x = 456;  # Error: can't overwrite readonly
var y: Int = x;  # OK: can copy from readonly to non-readonly

Type readonly T* means ”a pointer to a readonly T” while type readonly (T*) would mean ”a readonly pointer to a non-readonly T”.

Interactions of pointers and readonly require care.

You must not allow a value to be overwritten through a read-only pointer.
You must not allow converting a pointer to a readonly into a non-readonly pointer, i.e. you must not allow giving a readonly T* where a T* is expected.
The other way is OK: a T* can be given where a readonly T* is expected.

Example:

var x: Int = 123;
var p1: readonly Int* = &x;  # OK: converting Int* to readonly Int*
var p2: Int* = p1;  # Error: can't convert readonly Int* to Int*

*p1 = 456;  # Error: can't overwrite dereferenced readonly

The rules outlined above can be understood through the intuition ”if I have something marked readonly, then I’ve promised not to modify it (but some other code might modify it)”. This is not the only possible design.

[Your feature here] (?p)

Got some other feature that you’d really like to implement? Want to take the language in an entirely different direction? Ask, and we’ll see if it’s feasible, and how to score it!

Some ideas:

extension methods
generics
type inference, including function parameters and return types
overloading
traits/interfaces
garbage collection
IDE support

Analyses and optimizations

Dataflow framework (10p)

Implement the dataflow framework, reaching definitions and live variables analysis by doing the tasks in chapter 8.

You may want to implement break and continue before doing dataflow analysis, because they make it more convenient to construct interesting dataflow graphs.

break and continue don’t technically add any new capabilities to the language: it was already Turing-complete. You could always rewrite a program using break like this:

while A {
    ...
    if B {
        break;
    }
    ...
}

# can be written as

var done = false;
while A and not done {
    ...
    if B {
        done = true;
    } else {
        ...
    }
}

Similarly continue can be rewritten like this:

while A {
    ...
    if B {
        continue;
    }
    ...
}

# can be written as

while A {
    ...
    if not B {
        ...
    }
}

Report useless writes (5p)

Prerequisite: dataflow framework.

Using the dataflow framework and a suitable analysis, have the compiler print warnings for places where a stack-allocated variable is written but the written value is never read.

Example: in the following code, writing 3 to x is useless.

x = 3;
x = 4;
f(x);

An intraprocedural analysis is enough.

Constant propagation (5p)

Prerequisite: dataflow framework.

Using the dataflow framework, find situations where a variable’s value is a known constant and replace uses of the variable with loads of the constant into a new temporary variable.

Doing constant propagation may open up new opportunities for more constant propagation, so you should repeat it iteratively until no more opportunities are found. However, because our IR requires using constants through Load...Const instructions, don’t do constant propagation on variables that are already only set in a single Load...Const or you’ll end up in an infinite loop.

Finally, remove all Load...Const and Copy that write to the variables that you just removed.

An intraprocedural analysis is enough.

Constant folding (5-10p)

Prerequisite: functions.

Constant folding means evaluating expressions using only constant values, such as 1 + 2 at compile-time.

Let’s define a function or operator to be pure if it only uses its parameters, constants, and other pure functions or operators. (The built-in operators are all pure, but the builtin print_... and read_... functions are not.)

Recursively find AST subtrees that only use constants and pure functions or operators, and use the interpreter to evaluate these subtrees and replace them with literals. It is enough to limit constant folding to AST subtrees that only use base language features and function calls.

Note: if you intend to do the bonus task, read it first, as the recommended approach there is to do constant folding at the IR level instead of on the AST.

Bonus task

Combine constant folding with constant propagation (+5p).

In principle, constant folding can be combined with constant propagation to fold expressions like { var x = 1 + 2; print_int(x); f(x + 3); } where x gets a constant value after folding the 1 + 2, which then enables constant propagation of x into x + 3 making it 3 + 3, which in turn enables constant folding of 3 + 3 into 6. Note that the print_int prevents simply constant-folding the entire block.

Here we do constant folding at the AST level. It’s convenient because our interpreter works on an AST. However, it’s difficult for us to do mix constant propagation with constant folding because our constant propagation works at the IR level.

We have a copule of ways to address this:

We could make constant propagation work at the AST level, either by
1. doing it in some simpler way that doesn’t use dataflow analysis, or
2. by somehow mapping the dataflow analysis results back from IR to AST.
(Recommended) We could write an IR-level interpreter to do constant folding at the IR level. The interpreter should try to run starting from each constant loading instruction and stop if it encounters a variable whose value it can’t know. Any Call instructions that it can evaluate can be replaced by loads of the result.

Implement any of these solutions and do constant propagation and constant folding repeatedly until convergence.

Inlining (5p)

Prerequisite: functions.

Inlining means replacing a function call with the implementation of the function. That is, it transforms a program like

fun f(x: Int) { return x + x; }
print_int(f(123));

into

print_int(123 + 123);

Inlining by itself removes function call overhead at the cost of enlarging the program. This is a worthwhile trade if the called function is quite small.

Often the greater benefit of inlining is that it enables more optimizations e.g. making the resulting program more amenable to intraprocedural analyses.

Make your compiler inline functions whose bodies are smaller than e.g. 20 IR instructions.

Things to remember:

Rename the inlined IR variables to avoid clashes with the caller’s existing IR variables.
After inlining f into g, the size of g has changed, which may change whether it’s worth inlining g into something else. You can inline in depth-first order or in breadth-first order (but note that neither strategy is universally superior).
Remember to test that your inliner doesn’t crash when it sees a recursive or mutually recursive function.

Dead code elimination (5p)

Removing code that does useless work is an optimization that is not very useful directly, but it’s a good cleanup after other optimizations such as constant folding and inlining that may leave useless code behind.

Let’s define dead code as code that only does useless writes and only calls pure operations.

Write an optimization that iteratively removes dead code until it can’t find more.

Grading guidelines

Points are awarded for a working implementation that satisfies the requirements at least reasonably well. All features should include automated tests. Good attempts may get partial credit.

Here are guidelines for penalties for various deficiencies, as long as core requirements are satisfied:

-1..5p per feature for bugs or missed requirements, depending on severity. Documenting the bug reduces the penalty.
-1..10p to final score for exceptionally hard-to-read code.

Don’t worry – this penalty won’t be applied lightly, and no particular style is preferred. Avoid things like cryptic variable names, enormously complicated functions with tons of variables in a large scope, mutable globals etc., and everything will be fine.

Also, Haskellites, please don’t go overboard with the category theory 🙂
-1..5p per feature for poor automated testing.
-15% to final score for missing the deadline, and an additional -15% for every additional 24h of deadline missed thereafter.

Not to be all negative, it’s possible to get bonuses too:

+1..5p to final score, depending on complexity of featureset, for maintaining a working interpreter for all (or most) features, with semantics close to or identical to the compiled code.
+1..5p to final score for exceptionally thorough automated tests, depending on featureset complexity and subtlety.
+1..5p per feature for implementing a feature exceptionally well (e.g. a more advanced version, non-trivial extra error checks, …).
+1..10p if the instructor has severely underestimated the difficulty of the feature.

Grading code and finding bugs is inherently subjective and uncertain, so it’s good to aim a little higher than your target grade’s minimum requirement.

Compilers

spring 2024

Project

General requirements

Grade requirements

Deadline and submitting

Recommended features

Base language (20p)

Functions (15p)

Break and continue (5p)

Pointers (10p)

Heap allocation (5p)

Pointers to structs (10p)

Advanced features

Direct structs (15p)

Arrays (5-10p)

Lambdas and closures (20p)

Read-only types (10p)

[Your feature here] (?p)

Analyses and optimizations

Dataflow framework (10p)

Report useless writes (5p)

Constant propagation (5p)

Constant folding (5-10p)

Inlining (5p)

Dead code elimination (5p)

Grading guidelines