5. Type checker

A type checker verifies that the types of arguments are suitable for the functions and operators that they are passed to.

Our interpreter already checks the types of operations when they are executed, making the language dynamically typed. By adding a type checker, we make our language statically typed, i.e. we check its types at compile time rather than at run time. This is usually a good trade-off.

Interpreted languages tend to be dynamically typed, and compiled languages tend to be statically typed, but there are exceptions both ways, as well as hybrid solutions. E.g. our code examples use dynamically typed Python with a static typing extension.

Advanced interpreters for languages like JavaScript also blur this distinction with their Just-in-time (JIT) compilers: they compile dynamically typed code at run-time, but they can also infer the types most likely to be used in some parts of the code and compile a ”fast path” for that code. This fast path starts with a check like ”if these parameters are both numbers, then run this fast machine code, otherwise run this slower version of the code (or revert to interpreting)”.
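The guard at the start of a fast path can be pictured in plain Python. This is only an illustration of the control flow: a real JIT emits machine code, and the function names below are made up for this example.

```python
def add_generic(a, b):
    # Slow path: rely on fully dynamic dispatch for whatever types arrive.
    return a + b


def add_int_fast(a: int, b: int) -> int:
    # Stand-in for specialized code that assumes both operands are integers.
    return a + b


def add(a, b):
    # Guard: take the fast path only when the speculated types match.
    if type(a) is int and type(b) is int:
        return add_int_fast(a, b)
    return add_generic(a, b)
```

Both paths must compute the same result. The guard only decides which implementation is safe to run.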

A decade or so ago, there was considerable debate about whether dynamic typing (combined with an emphasis on unit testing) was better for programmer productivity than static typing. Now the consensus seems to favor static typing, with most common traditionally dynamically typed languages gaining static typing extensions that have become rather popular. JavaScript has TypeScript, Python has mypy, pyright etc., Ruby has Sorbet, and even PHP has steadily gained new type system features.

Static typing has value in at least the following ways:

Types act as documentation for the code.
It improves the quality of IDE autocompletion.
It catches many errors earlier, before testing or production.
It may help a compiler produce more efficient code.

In contrast, dynamic typing is argued to hold the programmer back less:

Some valid programs cannot be cleanly expressed in a static type system.
It reduces mental overhead during initial development.

The aforementioned examples of static typing extensions to dynamic languages tend to be more flexible than traditional statically typed languages. You can e.g. tell TypeScript to just ignore type errors on a given line. The trade-off is that these systems are a bit too imprecise to help the compiler confidently compile efficient code that’s free of run-time type checks. Future compilers may well overcome this limitation and ”know when they are sure”. In the meantime, JITs can still do a decent job even without static type information, and the other value propositions of static typing remain.

The type checker may annotate the AST with types for use by later compiler stages.

A basic type checker

Type checking works a lot like an interpreter: it recurses over an AST, but instead of returning a value, it returns a type, such as Int or Bool.

Say we’re looking to type an AST node for the operator +. In our language, the operands must always be integers, and the result is always an integer. Therefore the type checking logic for a + AST node is:

Recursively get the types for the left and right nodes.
Check that the left and right operands are of type Int.
Return Int as the type of the + node.

The code might look like this:

# src/compiler/type_checker.py

import compiler.ast as ast
from compiler.types import Int, Type

def typecheck(node: ast.Expr) -> Type:
    match node:
        case ast.BinaryOp():
            t1 = typecheck(node.left)
            t2 = typecheck(node.right)
            if node.op == '+':
                if t1 is not Int or t2 is not Int:
                    raise ...
                return Int
            else:
                ...
        ...

Now let’s consider an if-then-else expression. For it, the type checking logic goes like this:

Check that the type of the condition is Bool.
Get the types of both branches and check that they are the same.
Return the type of both branches.

Code for type-checking if-then-else might look like this:

def typecheck(node: ast.Expr) -> Type:
    match node:
        ...
        case ast.IfThenElse():
            t1 = typecheck(node.condition)
            if t1 is not Bool:
                raise ...
            t2 = typecheck(node.then_branch)
            t3 = typecheck(node.else_branch)
            if t2 != t3:
                raise ...
            return t2
        ...

How do we type-check variables? Just like the interpreter had to look up a variable’s value from a symbol table, a type checker has to look up a variable’s type from a symbol table. The type checker’s symbol table should be hierarchical, just like the interpreter’s.
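A minimal sketch of such a lookup follows. The names BasicType and TypeSymTab are assumptions made for this example, not the course’s required design, and a real implementation would also need a way to declare new variables.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class BasicType:
    name: str


Int = BasicType("Int")
Bool = BasicType("Bool")


@dataclass
class TypeSymTab:
    # Hypothetical symbol table: maps variable names to types, one table per scope.
    types: dict[str, BasicType] = field(default_factory=dict)
    parent: Optional["TypeSymTab"] = None

    def lookup(self, name: str) -> BasicType:
        # Search the innermost scope first, then walk outwards through the parents.
        scope: Optional[TypeSymTab] = self
        while scope is not None:
            if name in scope.types:
                return scope.types[name]
            scope = scope.parent
        raise TypeError(f"unknown variable: {name}")


top = TypeSymTab({"x": Int, "flag": Bool})
inner = TypeSymTab({"x": Bool}, parent=top)  # inner "x" shadows the outer one
```

Here inner.lookup("x") finds the shadowing Bool entry, while inner.lookup("flag") falls through to the outer scope.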

The unit type

The unit type Unit is a type that permits exactly one value: the unit value. In our interpreter, we used None as the unit value. Unit is used as the result type of expressions like while loops, which don’t have a natural result value.

Other than it being hard-coded as the result of certain constructs, Unit behaves just like any other type in our type checker.

More about Unit and related types.

In type theory, types are viewed as sets of permitted values. The unit type is called Unit because it permits just one value, i.e. it’s a unit set.

Many older languages have a special type marker ”void” meaning ”this function returns no value”. Newer languages like Scala, Kotlin and Rust prefer Unit because it’s less of a special case: with Unit, we can treat each expression as returning something.

In languages where all expressions return something, it’s natural to have a different meaning for the empty type Nothing that permits no values. If an expression’s type is Nothing, it means that the expression never returns normally. Examples of expressions that might get type Nothing include infinite loops and functions that terminate the program.
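Python’s static typing extension has a counterpart to Nothing: typing.NoReturn. The sketch below uses two made-up functions to show the idea. A checker such as mypy can use the annotation to see that control never continues past the call.

```python
from typing import NoReturn


def fail(message: str) -> NoReturn:
    # Never returns normally: it always raises.
    raise RuntimeError(message)


def parse_digit(ch: str) -> int:
    if ch.isdigit():
        return int(ch)
    # No return statement is needed here, because fail() never comes back.
    fail(f"not a digit: {ch!r}")
```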

Languages with the concept of a null object reference sometimes use it to double as their unit value. In these languages, the unit type Null permits only the null value. The fact that all object reference types also permit null may cause minor confusion. Examples of languages using null as their unit value are Python (where it’s called None) and Ruby (where it’s called nil).

Exercise (optional)

Sketch type checking code for:

while loops
if-then with no else

Function types

So far we’ve talked about the ”primitive” types Int, Bool and Unit, and we’ve hard-coded how to type-check built-in operators. Let’s generalize so that we can assign types to functions and operators instead of hard-coding them.

A function type is of the form (P1, P2, ...) => R where P1, P2, … are parameter types and R is the return type.

Examples:

operator + has type (Int, Int) => Int
operator < has type (Int, Int) => Bool
operator or has type (Bool, Bool) => Bool
function print_int has type (Int) => Unit

This way we can define the types of most of our built-in functions and operators in the top-level symbol table. This is again very similar to the interpreter, where we defined the built-in functions and operators in the symbol table in Task 3 of chapter 4.

About polymorphism

We currently have a separate print function for every supported type: print_int and print_bool. Suppose we wanted to have a single print function like in Python. What should the type of such a function be?

We’d want print to be able to take many types of value as its parameter and work differently depending on the type. In other words, we’d want the function to be polymorphic. Our type system doesn’t yet have a way to express this.

Operators == and != are similarly problematic. They should work for any pair of values of the same type: we want to be able to compare two integers or two booleans, but not an integer with a boolean.

We have a few options for how to add ==, != and the hypothetical print to our type system, but for now, we won’t implement any of them. We’ll leave == and != as special cases in the type checker code. That is, we hard-code that == and != require the same type on both sides.
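The hard-coded rule for == and != could be sketched as follows. The helper name typecheck_equality and the BasicType class are assumptions for this example. In your type checker, the rule would live inside the binary operator case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BasicType:
    name: str


Int = BasicType("Int")
Bool = BasicType("Bool")


def typecheck_equality(op: str, left: BasicType, right: BasicType) -> BasicType:
    # Special case: any type is accepted, as long as both sides agree.
    if op not in ("==", "!="):
        raise ValueError(f"not an equality operator: {op}")
    if left != right:
        raise TypeError(f"operands of {op} have different types: {left.name}, {right.name}")
    return Bool
```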

Option 1: define a type Any that acts like a ”superclass” for all other types. Then print could have type (Any) => Unit, and == and != could have type (Any, Any) => Bool.

Adding type Any would introduce subtype polymorphism where more specific types (”subtypes”) can be given where less specific types (”supertypes”) are required: e.g. Int can be used where Any is required. This is the same idea as in e.g. Python’s inheritance.

Many real-world languages have subtyping, but recently the trend appears to be to avoid it or to emulate a restricted form of it. Subtyping’s convenience comes at a cost of ambiguous situations and hard conflicts with some other desirable type system features, such as certain kinds of type inference.

Note that type (Any, Any) => Bool for operator == allows you to compare e.g. a Bool with an Int, which might not be desirable.
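To make this concrete, here is a small sketch of argument checking with a subtype relation. The is_subtype and check_call helpers are hypothetical, and Any here is our own type object, not Python’s typing.Any.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BasicType:
    name: str


@dataclass(frozen=True)
class FunType:
    params: tuple[BasicType, ...]
    ret: BasicType


Any = BasicType("Any")
Int = BasicType("Int")
Bool = BasicType("Bool")
Unit = BasicType("Unit")


def is_subtype(sub: BasicType, sup: BasicType) -> bool:
    # Every type is a subtype of itself and of Any.
    return sub == sup or sup == Any


def check_call(fun: FunType, arg_types: list[BasicType]) -> BasicType:
    # Arguments no longer need to match exactly: a subtype is enough.
    if len(arg_types) != len(fun.params) or not all(
        is_subtype(arg, param) for arg, param in zip(arg_types, fun.params)
    ):
        raise TypeError(f"expected {fun.params}, got {arg_types}")
    return fun.ret


print_type = FunType((Any,), Unit)
eq_type = FunType((Any, Any), Bool)
```

Under these definitions, check_call(eq_type, [Bool, Int]) succeeds, which is exactly the permissiveness noted above.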

Examples of languages with subtype polymorphism: C++, C#, Java, Scala, Python.
Examples of languages without subtype polymorphism, or with a restricted version of it: C, Rust, Haskell.

Option 2: make the type of print be (T) => Unit for any type T (or for any type T that obeys some constraint). Similarly, the type of == and != would be (T, T) => Bool for any type T.

This is called parametric polymorphism, or more colloquially either generics or templates, depending on the exact semantics.

Parametric polymorphism is almost ubiquitous, but the implementation details vary significantly. This is a complex topic that we don’t have time to delve into here.

Examples of languages with parametric polymorphism: C++, C#, Java, Scala, Rust, Haskell, Python.
Examples of languages without parametric polymorphism: C, Go (until 2022), extremely old versions of C# and Java.

Option 3: allow multiple versions of function print, == and != with different types.

Some languages permit giving several functions the same name, and the type system determines which one to call based on the argument types. This way e.g. print could have two versions: one taking an Int and another taking a Bool. This is called function overloading or ad-hoc polymorphism.
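A sketch of the simplest possible overload resolution, where a name maps to a list of candidate function types and exactly one must match the argument types. The overloads table and resolve function are assumptions for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BasicType:
    name: str


@dataclass(frozen=True)
class FunType:
    params: tuple[BasicType, ...]
    ret: BasicType


Int = BasicType("Int")
Bool = BasicType("Bool")
Unit = BasicType("Unit")

# Hypothetical table: one name, several alternative types.
overloads: dict[str, list[FunType]] = {
    "print": [FunType((Int,), Unit), FunType((Bool,), Unit)],
}


def resolve(name: str, arg_types: list[BasicType]) -> FunType:
    # Keep only the candidates whose parameter types match exactly.
    matches = [f for f in overloads.get(name, []) if f.params == tuple(arg_types)]
    if len(matches) != 1:
        raise TypeError(f"no unique overload of {name} for {arg_types}")
    return matches[0]
```

Exact matching keeps resolution trivial. It gets harder once subtyping or type parameters allow several candidates to match the same call.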

Real-world languages are split on whether to allow function overloading. On one hand, it’s undeniably convenient. On the other, it non-trivially complicates the language, since a single symbol now needs to have many alternative types, and resolving the correct one is not always simple or even possible.

Examples of languages with ad-hoc polymorphism: C++, C#, Java, Scala.
Examples of languages without ad-hoc polymorphism, or with a restricted version of it: C, Rust, Haskell, most if not all dynamically typed languages.

Option 4: handle print, == and != as special cases in our type checker.

Real-world languages sometimes take this path in the name of (a certain interpretation of) simplicity.

Most notably, for over 10 years (until 2022) the language Go did not have parametric polymorphism, except for a fixed set of special-cased built-ins.

While modern Rust has reasonable flexibility in its restricted form of ad-hoc polymorphism, the way to compose a string from values of different types is still best implemented as a ”macro” i.e. a tiny compiler plugin, rather than a function.

Exercise (optional)

Can you think of code examples (in any language) where ad-hoc polymorphism together with some other form of polymorphism makes it unclear which implementation of a function gets called?

Tasks

Task 1

Define Python classes and objects for the types in the program. There should be a single object for each of Int, Bool and Unit, and there should be a class FunType for building function types.

(We use upper case names for our types for two reasons: because it avoids conflicts with Python types like int, and because it’s the prevailing convention in literature and most real-world languages.)

Task 2

Create a class to represent a hierarchical symbol table that maps variable names to types. You can either generalize the interpreter’s existing SymTab class to be generic in the type of values it holds, or you can copy-paste most of it.

Task 3

Starting from the typecheck function that was partially sketched above, complete support for the following:

literals
variables
untyped variable declarations var x = .... (The next task will handle typed variable declarations var x: Type = ...)
assignment x = ... (the assigned value must have the same type as the variable)
unary and binary operators (with == and != as special cases as discussed above)
function calls
blocks
if and while expressions

To implement typechecking for variables, you’ll need to pass around a symbol table. It should work similarly to the interpreter’s symbol table, with the same scoping logic, name prefix for unary operators, etc.

Task 4

Our language can currently infer the types of variables var x = ... from the initializer code. This is great, but sometimes a programmer might want to specify the type of the variable for readability reasons, and to get a potential type error earlier.

Extend the syntax of var declarations to allow an optional type declaration that looks like this: var name: Type = ....

Example: var x: Int = 1 + 1.

Implement type checking for these type declarations.

Unless you already did so in chapter 3, you’ll need to extend the parser to support type expressions first.

Hint

You may wonder whether to reuse your Python type classes as AST nodes. It is more sustainable to make a separate set of AST nodes ast.TypeExpr and subclasses (analogous to ast.Expr).

This is the way to go if you end up doing type-related optional tasks in the project, because the type expression and the type it evaluates to will no longer always be one-to-one.

Task 5

The following compiler stage will need type information, so let’s add type annotations to the AST.

Add the field type to your ast.Expr base class like this:

from dataclasses import dataclass, field
...

@dataclass
class Expr:
    ...

    type: Type = field(kw_only=True, default=Unit)

(First ensure you don’t already use the field name type in any of your existing AST node classes.)

By saying field(kw_only=True, default=Unit), we instruct @dataclass to not require this field in the constructor and to initialize the field to Unit by default. This way you won’t have to change existing code to explicitly initialize this field. (If you get an error from this line, see this.)

Modify typecheck so that whenever it visits a node, it also assigns the return value to the visited node’s type field.

Hint: Try to avoid having to remember to set node.type in every case. It may be convenient to put the bulk of typecheck into a separate function.

Compilers

spring 2024