Compilers

spring 2024

3. Parser

Parsing means turning a list of tokens into an Abstract Syntax Tree (AST). An AST organizes tokens into a tree where operations are connected to their parameters.


For instance, the expression a + b * c (i.e. token list ['a', '+', 'b', '*', 'c']) turns into an AST that looks like this:

This tree means that the program calls operation + with the following parameters:

  1. the variable a
  2. the result of calling operation * with variables b and c as parameters


An AST for a function call like f(x, g(x), y + 3) looks like this:

This tree means that we call function f with the following parameters:

  1. the variable x
  2. the result of calling function g with variable x as the parameter
  3. the result of calling operation + with variable y and constant 3 as parameters


Other constructs like control flow expressions turn into ASTs similarly. E.g. if x > 10 then print_int(x) else print_int(y) becomes:


We translate token lists into ASTs because an AST expresses clearly and unambiguously what operations need to happen and in which order. This makes the program much easier to analyze, interpret and compile.

You can try out how different expressions turn into ASTs in the sandbox.

Exercise (optional)

Draw ASTs for the following expressions:

  • a + b
  • f(a + b, b + c)
  • f(f(a))
  • f(a * f(b)) + c
  • while i < 100 do i = i + 1

Structuring the parser

Let’s define some classes to hold our AST:

# src/compiler/ast.py

from dataclasses import dataclass

@dataclass
class Expression:
    """Base class for AST nodes representing expressions."""

@dataclass
class Literal(Expression):
    value: int | bool | None
    # (value=None is used when parsing the keyword `unit`)

@dataclass
class Identifier(Expression):
    name: str

@dataclass
class BinaryOp(Expression):
    """AST node for a binary operation like `A + B`"""
    left: Expression
    op: str
    right: Expression

... # You get to define more later

Now we can construct AST nodes like this:

# Construct the AST for "x + 3" using the classes above:
example_ast = BinaryOp(
    left=Identifier(name="x"),
    op="+",
    right=Literal(value=3),
)

Our parser needs to turn a list of tokens into an AST:

# src/compiler/parser.py

from compiler.tokenizer import Token
import compiler.ast as ast

def parse(tokens: list[Token]) -> ast.Expression:
    ...

Let’s now see how to implement this function.

Next we’ll start with a simple implementation of parse and add features to it bit by bit. You may find it useful to follow along in your own project and add some unit tests after each step.

Warning: when importing AST classes, don’t accidentally import from Python’s built-in ast module – remember to import from compiler.ast.

Parsing functions

The idea for implementing the parser is this:

  1. Keep track of our position in the token list in a variable pos.
  2. Define a set of parsing functions to parse different kinds of AST subtrees.

For instance, there can be one parsing function for if expressions, another one for function calls, another one for expressions like a + b, and so on.

Parsing functions follow this specification: a parsing function starts reading tokens at pos, advances pos past the tokens it accepted, and returns an AST node describing what it read. If the tokens don’t match what the function expects, it raises an error instead.

Let’s look at an example of a parser that can parse a single ”integer +/- integer” token sequence, such as 3 + 4 or 7 - 2.

The example has two helper functions peek and consume for conveniently looking at the next token and advancing pos. After that, it has two parsing functions:

def parse(tokens: list[Token]) -> ast.Expression:
    # This keeps track of which token we're looking at.
    pos = 0

    # Let's first define two useful helper functions:
    # 'peek' and 'consume'.

    # 'peek()' returns the token at 'pos',
    # or a special 'end' token if we're past the end
    # of the token list.
    # This way we don't have to worry about going past
    # the end elsewhere.
    def peek() -> Token:
        if pos < len(tokens):
            return tokens[pos]
        else:
            return Token(
                location=tokens[-1].location,
                type="end",
                text="",
            )

    # 'consume(expected)' returns the token at 'pos'
    # and moves 'pos' forward.
    #
    # If the optional parameter 'expected' is given,
    # it checks that the token being consumed has that text.
    # If 'expected' is a list, then the token must have
    # one of the texts in the list.
    def consume(expected: str | list[str] | None = None) -> Token:
        nonlocal pos
        token = peek()
        if isinstance(expected, str) and token.text != expected:
            raise Exception(f'{token.location}: expected "{expected}"')
        if isinstance(expected, list) and token.text not in expected:
            comma_separated = ", ".join([f'"{e}"' for e in expected])
            raise Exception(f'{token.location}: expected one of: {comma_separated}')
        pos += 1
        return token

    # This is the parsing function for integer literals.
    # It checks that we're looking at an integer literal token,
    # moves past it, and returns a 'Literal' AST node
    # containing the integer from the token.
    def parse_int_literal() -> ast.Literal:
        if peek().type != 'int_literal':
            raise Exception(f'{peek().location}: expected an integer literal')
        token = consume()
        return ast.Literal(int(token.text))

    # This is our main parsing function for this example.
    # To parse "integer + integer" expressions,
    # it uses `parse_int_literal` to parse the first integer,
    # then it checks that there's a supported operator,
    # and finally it uses `parse_int_literal` to parse the
    # second integer.
    def parse_expression() -> ast.BinaryOp:
        left = parse_int_literal()
        operator_token = consume(['+', '-'])
        right = parse_int_literal()
        return ast.BinaryOp(
            left,
            operator_token.text,
            right
        )

    return parse_expression()

This simple parser works for inputs like 2 + 2 and 55 + 7, but it doesn’t yet support identifiers (e.g. a + 3) or more than one operation (e.g. 1 - 2 + 3). Next we’ll address these deficiencies as we introduce the main techniques and concerns for writing parsing functions.

Lookahead

First, let’s allow identifiers in addition to integer literals, so we can parse a + 3, x + y and similar.

def parse(tokens: list[Token]) -> ast.Expression:
    ...

    def parse_int_literal() -> ast.Literal:
        ... # (same as before)

    def parse_identifier() -> ast.Identifier:
        ... # (very similar to parse_int_literal)

    # Borrowing from mathematics, we say that
    # "an identifier or an integer literal"
    # is called a "term".
    def parse_term() -> ast.Expression:
        if peek().type == 'int_literal':
            return parse_int_literal()
        elif peek().type == 'identifier':
            return parse_identifier()
        else:
            raise Exception(f'{peek().location}: expected an integer literal or an identifier')

    def parse_expression() -> ast.Expression:
        left = parse_term()
        operator_token = consume(['+', '-'])
        right = parse_term()
        return ast.BinaryOp(
            left,
            operator_token.text,
            right
        )

    return parse_expression()

Here parse_term looks ahead to see what kind of token comes next and uses that to choose which parsing function to call. We’ll be seeing this pattern a lot.

Associativity

Ok, how do we allow 1 - 2 + 3 + ...?

First, we need to decide whether we want 1 - 2 + 3 to be understood as left-associative (1 - 2) + 3 or right-associative 1 - (2 + 3).

The convention in existing languages is for most mathematical operators to be left-associative, so let’s do that. This also makes sense for + and -, since it would be very surprising if:

1 - 2 + 3  =  1 - (2 + 3)  =  -4   😱

To parse 1 - 2 + 3 + ..., we need to make parse_expression loop for as long as it finds more + or -:

def parse_expression() -> ast.Expression:
    # Parse the first term.
    left = parse_term()

    # While there are more `+` or '-'...
    while peek().text in ['+', '-']:
        # Move past the '+' or '-'.
        operator_token = consume()
        operator = operator_token.text

        # Parse the operand on the right.
        right = parse_term()

        # Combine it with the stuff we've
        # accumulated on the left so far.
        left = ast.BinaryOp(
            left,
            operator,
            right
        )
    
    return left

By adding each BinaryOp on top of left, we achieve left-associativity.
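As a quick sanity check, here is a minimal sketch of that loop, operating directly on a list of token texts (a simplification; the real parser works on Token objects):

```python
from dataclasses import dataclass

@dataclass
class Literal:
    value: int

@dataclass
class BinaryOp:
    left: object
    op: str
    right: object

def parse_additive(texts: list[str]) -> object:
    # Parse "int ((+|-) int)*" from a list of token texts.
    pos = 0
    left: object = Literal(int(texts[pos]))
    pos += 1
    while pos < len(texts) and texts[pos] in ['+', '-']:
        operator = texts[pos]
        right = Literal(int(texts[pos + 1]))
        pos += 2
        # Each new operation wraps everything parsed so far,
        # which is exactly what makes the result left-associative.
        left = BinaryOp(left, operator, right)
    return left

# 1 - 2 + 3 parses as (1 - 2) + 3:
assert parse_additive(['1', '-', '2', '+', '3']) == BinaryOp(
    BinaryOp(Literal(1), '-', Literal(2)),
    '+',
    Literal(3),
)
```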

Exercise (optional)

Implement a version of parse_expression with right-associativity.
(This is not useless - we’ll want right-associativity for operator = later.)


Precedence

Now let’s add operators * and /.

We’d like operators to have precedence like in mathematics: a + b * c should be understood as a + (b * c). That is, we’d like * and / to have higher precedence than + and -.

Note that this does not contradict associativity. Associativity only determines how operators on the same precedence level are grouped.

The easiest way to implement precedence is to write separate parsing functions for each precedence level, and have lower precedence level functions call higher precedence functions.

Concretely, when parsing a + - expression, we allow the ”terms” to be * / expressions. When parsing a * / expression, we allow the ”factors” to be integer literals or identifiers.

def parse_expression() -> ast.Expression:
    # Same as before
    left = parse_term()
    while peek().text in ['+', '-']:
        operator_token = consume()
        operator = operator_token.text
        right = parse_term()
        left = ast.BinaryOp(
            left,
            operator,
            right
        )
    return left

def parse_term() -> ast.Expression:
    # Same structure as in 'parse_expression',
    # but the operators and function calls differ.
    left = parse_factor()
    while peek().text in ['*', '/']:
        operator_token = consume()
        operator = operator_token.text
        right = parse_factor()
        left = ast.BinaryOp(
            left,
            operator,
            right
        )
    return left

def parse_factor() -> ast.Expression:
    # Same as `parse_term` in the previous version.
    if peek().type == 'int_literal':
        return parse_int_literal()
    elif peek().type == 'identifier':
        return parse_identifier()
    else:
        raise Exception(f'{peek().location}: expected an integer literal or an identifier')
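With these two levels in place, a + b * c parses so that the * ends up deeper in the tree: parse_expression calls parse_term for its operands, and parse_term greedily consumes the whole b * c before returning. Using the AST classes from earlier, the expected result looks like this (a sketch of the output, not part of the parser itself):

```python
from dataclasses import dataclass

@dataclass
class Identifier:
    name: str

@dataclass
class BinaryOp:
    left: object
    op: str
    right: object

# a + b * c groups as a + (b * c): the higher-precedence
# operator sits lower in the tree.
expected = BinaryOp(
    Identifier('a'),
    '+',
    BinaryOp(Identifier('b'), '*', Identifier('c')),
)
assert expected.right.op == '*'
```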

Recursion

Next it’d be nice to support parentheses, so we could write (a + b) * c and get this AST:

In (a + b) * c, the parenthesized expression (a + b) plays the role of a factor of the *. That suggests that we should handle parentheses as one possible case in parse_factor, and that indeed works:

def parse_factor() -> ast.Expression:
    if peek().text == '(':
        return parse_parenthesized()
    elif peek().type == 'int_literal':
        return parse_int_literal()
    elif peek().type == 'identifier':
        return parse_identifier()
    else:
        raise Exception(f'{peek().location}: expected "(", an integer literal or an identifier')

def parse_parenthesized() -> ast.Expression:
    consume('(')
    # Recursively call the top level parsing function
    # to parse whatever is inside the parentheses.
    expr = parse_expression()
    consume(')')
    return expr

Note that our parser is now recursive: parse_parenthesized calls parse_expression, which may end up calling parse_parenthesized again, etc.

We now have a pretty good parser for simple math expressions!
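One consequence worth noting: the parentheses themselves leave no trace in the AST. parse_parenthesized returns the inner expression directly, so (a + b) * c simply produces a tree where the + sits below the *. Using the earlier AST classes (a sketch of the expected output):

```python
from dataclasses import dataclass

@dataclass
class Identifier:
    name: str

@dataclass
class BinaryOp:
    left: object
    op: str
    right: object

# (a + b) * c: the grouping survives as tree structure,
# not as a parenthesis node.
expected = BinaryOp(
    BinaryOp(Identifier('a'), '+', Identifier('b')),
    '*',
    Identifier('c'),
)
assert expected.op == '*'
```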

Other techniques

There are a few more techniques whose existence is good to know about, even if they aren’t needed or recommended on this course.

Backtracking is an alternative to lookahead: it means using try..except instead of if..else to determine which parsing function to call. This technique is not recommended, though it can be necessary for some difficult-to-parse languages.

Grammars are a formal way to specify a language’s syntax. Tools exist to generate parser code automatically from suitable grammar specifications.
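For reference, the expression language we’ve built so far could be described by a grammar roughly like this (an informal EBNF-style sketch; the exact notation varies between tools):

```
expression = term , { ("+" | "-") , term } ;
term       = factor , { ("*" | "/") , factor } ;
factor     = int_literal | identifier | "(" , expression , ")" ;
```

Note how each rule corresponds directly to one of our parsing functions, and the repetition braces { } correspond to the while loops. Handwritten parsers structured this way are called recursive descent parsers.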

Tasks

These tasks will guide you in building up the course project.

It’s a good idea to have passing tests and make a commit between each task.

Task 1

Implement and unit test in your project the parser that we’ve built up above.

Make sure you understand the code.

Clean up some details that we’ve glossed over:

  • Make sure the entire input is always parsed. Don’t allow garbage at the end of the input. Make sure e.g. a + b c gives an error.
  • Handle the case of empty input more gracefully (currently peek() crashes with an IndexError if tokens is empty, because it evaluates tokens[-1]).
  • Remember to also write some test cases for inputs that should fail to parse.

Task 2

Add support for parsing if-then-else expressions like if a then b + c else x * y.

Make the else part optional i.e. also allow if a then b + c.

Remember to write unit tests.

Make sure if-expressions are allowed as parts of other expressions, such as in 1 + if true then 2 else 3.

Task 3

Add support for parsing function calls like f(x, y + z).


Task 4

Add support for some more operators:

  • Remainder operator: %
  • Binary comparison operators: ==, !=, <, <=, >, >=
  • Binary logical operators: and and or
  • Unary operators: not and -. (Be sure to allow chaining: not not x etc.)
  • Assignment = (right-associative!)

Use these precedence levels:

  1. =
  2. or
  3. and
  4. ==, !=
  5. <, <=, >, >=
  6. +, -
  7. *, /, %
  8. Unary - and not
  9. All other constructs: literals, identifiers, if, while, var, blocks, parentheses, function calls.

Bonus task

The code is getting a bit repetitive, isn’t it? Also, what if we wanted to allow the language’s user to specify their own operators?

Generalize the code for binary operators to use a list of precedence levels such as the following, instead of repeating code for each level.

left_associative_binary_operators = [
    ['or'],
    ['and'],
    ['==', '!='],
    ['<', '<=', '>', '>='],
    ['+', '-'],
    ['*', '/', '%'],
]

Task 5

A block is a sequence of expressions inside braces {}, separated and optionally terminated by a semicolon ;. It might look like this:

{
    f(a);
    x = y;
    f(x)
}

Note that the last semicolon is optional. When it’s missing, it means that the last expression in the block is a result expression i.e. it defines the value that the block evaluates to. So e.g. x = { f(a); b } means ”execute f(a) and assign the value b to x”.

If the final semicolon is present, then the parser should insert a Literal(value=None) as the result expression.

Larger example:

{
    while f() do {
        x = 10;
        y = if g(x) then {
            x = x + 1;
            x
        } else {
            g(x)
        };  # <-- (this semicolon will become optional later)
        g(y);
    };  # <------ (this too)
    123
}


Add support for these kinds of brace-delimited blocks to the parser.

Remember to also write tests for cases that should return an error, such as missing semicolons.

Task 6

Unlike Python, we want our language to distinguish between creating new variables and assigning to existing variables. A variable declaration introduces a new variable, and it should look like this:

var x = 123

Add support for parsing variable declarations.

Only allow variable declarations directly inside blocks ({ ... }) and in top-level expressions. (This will make later parts simpler, since we won’t have to consider which scope expressions like if ... then var x = 3 put the x in.)

Task 7

Consider the semicolon in this code:

{
    if ... {
        ...
    };  # <-- this one
    ...
}

It looks weird and unnecessary, right?

Modify your parser to work even if a semicolon is missing after a }.

Note

It is not correct to insert a semicolon after every }. You should not accidentally add a semicolon after the final expression of a block. In other words, { { x } { y } } should parse like { { x }; { y } } and not like { { x }; { y }; }.

Some test cases:

  • { { a } { b } } should be allowed.
  • { a b } should NOT be allowed.
  • { if true then { a } b } should be allowed.
  • { if true then { a }; b } should be allowed.
  • { if true then { a } b c } should NOT be allowed.
  • { if true then { a } b; c } should be allowed.
  • { if true then { a } else { b } 3 } should be allowed.
  • x = { { f(a) } { b } } should be allowed.

Task 8

Let’s add source code locations to AST nodes so that later compiler stages can include locations in their error messages.

Add the field location: Location to the class Expression and pass a location from an appropriate token to each AST node you construct.

Task 9

Implement any remaining syntax features described in the language’s syntax spec, except you can leave typed variable declarations and type expressions for chapter 5 if you wish.

  • Remember to allow multiple top-level expressions.
  • Remember to still write unit tests 🙂