Compilers

spring 2024

3. Parser

Parsing means turning a list of tokens into an Abstract Syntax Tree (AST). An AST organizes tokens into a tree where operations are connected to their parameters.


For instance, the expression a + b * c (i.e. token list ['a', '+', 'b', '*', 'c']) turns into an AST that looks like this:

This tree means that the program calls operation + with the following parameters:

  1. the variable a
  2. the result of calling operation * with variables b and c as parameters


An AST for a function call like f(x, g(x), y + 3) looks like this:

This tree means that we call function f with the following parameters:

  1. the variable x
  2. the result of calling function g with variable x as the parameter
  3. the result of calling operation + with variable y and constant 3 as parameters


Other constructs like control flow expressions turn into ASTs similarly. E.g. if x > 10 then print_int(x) else print_int(y) becomes:


We translate token lists into ASTs because an AST expresses clearly and unambiguously what operations need to happen and in which order. This makes the program much easier to analyze, interpret and compile.

You can try out how different expressions turn into ASTs in the sandbox.

Exercise (optional)

Draw ASTs for the following expressions:

  • a + b
  • f(a + b, b + c)
  • f(f(a))
  • f(a * f(b)) + c
  • while i < 100 do i = i + 1

Structuring the parser

Let’s define some classes to hold our AST:

# src/compiler/ast.py

from dataclasses import dataclass

@dataclass
class Expression:
    """Base class for AST nodes representing expressions."""

@dataclass
class Literal(Expression):
    value: int | bool | None
    # (value=None is used when parsing the keyword `unit`)

@dataclass
class Identifier(Expression):
    name: str

@dataclass
class BinaryOp(Expression):
    """AST node for a binary operation like `A + B`"""
    left: Expression
    op: str
    right: Expression

... # You get to define more later

Now we can construct AST nodes like this:

# Construct the AST for "x + 3" using the classes above:
example_ast = BinaryOp(
    left=Identifier(name="x"),
    op="+",
    right=Literal(value=3),
)

Our parser needs to turn a list of tokens into an AST:

# src/compiler/parser.py

from compiler.tokenizer import Token
import compiler.ast as ast

def parse(tokens: list[Token]) -> ast.Expression:
    ...

Let’s now see how to implement this function.

Next we’ll start with a simple implementation of parse and add features to it bit by bit. You may find it useful to follow along in your own project and add some unit tests after each step.

Warning: when importing AST classes, don’t accidentally import from Python’s built-in ast module – remember to import from compiler.ast.

Parsing functions

The idea for implementing the parser is this:

  1. Keep track of our position in the token list in a variable pos.
  2. Define a set of parsing functions to parse different kinds of AST subtrees.

For instance, there can be one parsing function for if expressions, another one for function calls, another one for expressions like a + b, and so on.

Parsing functions follow this specification: a parsing function starts reading tokens at pos, advances pos past the tokens it accepted, and returns an AST node describing what it read. If the tokens don’t match what the function expects, it raises an error instead.

Let’s look at an example of a parser that can parse a single ”integer +/- integer” token sequence, such as 3 + 4 or 7 - 2.

The example has two helper functions peek and consume for conveniently looking at the next token and advancing pos. After that, it has two parsing functions:

def parse(tokens: list[Token]) -> ast.Expression:
    # This keeps track of which token we're looking at.
    pos = 0

    # Let's first define two useful helper functions:
    # 'peek' and 'consume'.

    # 'peek()' returns the token at 'pos',
    # or a special 'end' token if we're past the end
    # of the token list.
    # This way we don't have to worry about going past
    # the end elsewhere.
    def peek() -> Token:
        if pos < len(tokens):
            return tokens[pos]
        else:
            return Token(
                location=tokens[-1].location,
                type="end",
                text="",
            )

    # 'consume(expected)' returns the token at 'pos'
    # and moves 'pos' forward.
    #
    # If the optional parameter 'expected' is given,
    # it checks that the token being consumed has that text.
    # If 'expected' is a list, then the token must have
    # one of the texts in the list.
    def consume(expected: str | list[str] | None = None) -> Token:
        nonlocal pos
        token = peek()
        if isinstance(expected, str) and token.text != expected:
            raise Exception(f'{token.location}: expected "{expected}"')
        if isinstance(expected, list) and token.text not in expected:
            comma_separated = ", ".join([f'"{e}"' for e in expected])
            raise Exception(f'{token.location}: expected one of: {comma_separated}')
        pos += 1
        return token

    # This is the parsing function for integer literals.
    # It checks that we're looking at an integer literal token,
    # moves past it, and returns a 'Literal' AST node
    # containing the integer from the token.
    def parse_int_literal() -> ast.Literal:
        if peek().type != 'int_literal':
            raise Exception(f'{peek().location}: expected an integer literal')
        token = consume()
        return ast.Literal(int(token.text))

    # This is our main parsing function for this example.
    # To parse "integer + integer" expressions,
    # it uses `parse_int_literal` to parse the first integer,
    # then it checks that there's a supported operator,
    # and finally it uses `parse_int_literal` to parse the
    # second integer.
    def parse_expression() -> ast.BinaryOp:
        left = parse_int_literal()
        operator_token = consume(['+', '-'])
        right = parse_int_literal()
        return ast.BinaryOp(
            left,
            operator_token.text,
            right
        )

    return parse_expression()

This simple parser works for inputs like 2 + 2 and 55 + 7, but it doesn’t yet support identifiers (e.g. a + 3) or more than one operation (e.g. 1 - 2 + 3). Next we’ll address these deficiencies as we introduce the main techniques and concerns for writing parsing functions.

Lookahead

First, let’s allow identifiers in addition to integer literals, so we can parse a + 3, x + y and similar.

def parse(tokens: list[Token]) -> ast.Expression:
    ...

    def parse_int_literal() -> ast.Literal:
        ... # (same as before)

    def parse_identifier() -> ast.Identifier:
        ... # (very similar to parse_int_literal)

    # Borrowing from mathematics, we say that
    # "an identifier or an integer literal"
    # is called a "term".
    def parse_term() -> ast.Expression:
        if peek().type == 'int_literal':
            return parse_int_literal()
        elif peek().type == 'identifier':
            return parse_identifier()
        else:
            raise Exception(f'{peek().location}: expected an integer literal or an identifier')

    def parse_expression() -> ast.Expression:
        left = parse_term()
        operator_token = consume(['+', '-'])
        right = parse_term()
        return ast.BinaryOp(
            left,
            operator_token.text,
            right
        )

    return parse_expression()

Here parse_term looks ahead to see what kind of token comes next and uses that to choose which parsing function to call. We’ll be seeing this pattern a lot.

Associativity

Ok, how do we allow 1 - 2 + 3 + ...?

First, we need to decide whether we want 1 - 2 + 3 to be understood as left-associative (1 - 2) + 3 or right-associative 1 - (2 + 3).

The convention in existing languages is for most mathematical operators to be left-associative, so let’s do that. This also makes sense for + and -, since it would be very surprising if:

1 - 2 + 3  =  1 - (2 + 3)  =  -4   😱

To parse 1 - 2 + 3 + ..., we need to make parse_expression loop for as long as it finds more + or -:

def parse_expression() -> ast.Expression:
    # Parse the first term.
    left = parse_term()

    # While there are more `+` or '-'...
    while peek().text in ['+', '-']:
        # Move past the '+' or '-'.
        operator_token = consume()
        operator = operator_token.text

        # Parse the operand on the right.
        right = parse_term()

        # Combine it with the stuff we've
        # accumulated on the left so far.
        left = ast.BinaryOp(
            left,
            operator,
            right
        )
    
    return left

By adding each BinaryOp on top of left, we achieve left-associativity.
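As a quick sanity check, here is a minimal sketch of that loop, operating directly on a list of token texts (a simplification; the real parser works on Token objects):

```python
from dataclasses import dataclass

@dataclass
class Literal:
    value: int

@dataclass
class BinaryOp:
    left: object
    op: str
    right: object

def parse_additive(texts: list[str]) -> object:
    # Parse "int ((+|-) int)*" from a list of token texts.
    pos = 0
    left: object = Literal(int(texts[pos]))
    pos += 1
    while pos < len(texts) and texts[pos] in ['+', '-']:
        operator = texts[pos]
        right = Literal(int(texts[pos + 1]))
        pos += 2
        # Each new operation wraps everything parsed so far,
        # which is exactly what makes the result left-associative.
        left = BinaryOp(left, operator, right)
    return left

# 1 - 2 + 3 parses as (1 - 2) + 3:
assert parse_additive(['1', '-', '2', '+', '3']) == BinaryOp(
    BinaryOp(Literal(1), '-', Literal(2)),
    '+',
    Literal(3),
)
```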

Exercise (optional)

Implement a version of parse_expression with right-associativity.
(This is not useless - we’ll want right-associativity for operator = later.)


Precedence

Now let’s add operators * and /.

We’d like operators to have precedence like in mathematics: a + b * c should be understood as a + (b * c). That is, we’d like * and / to have higher precedence than + and -.

Note that this does not contradict associativity. Associativity only determines how operators on the same precedence level are grouped.

The easiest way to implement precedence is to write separate parsing functions for each precedence level, and have lower precedence level functions call higher precedence functions.

Concretely, when parsing a + - expression, we allow the ”terms” to be * / expressions. When parsing a * / expression, we allow the ”factors” to be integer literals or identifiers.

def parse_expression() -> ast.Expression:
    # Same as before
    left = parse_term()
    while peek().text in ['+', '-']:
        operator_token = consume()
        operator = operator_token.text
        right = parse_term()
        left = ast.BinaryOp(
            left,
            operator,
            right
        )
    return left

def parse_term() -> ast.Expression:
    # Same structure as in 'parse_expression',
    # but the operators and function calls differ.
    left = parse_factor()
    while peek().text in ['*', '/']:
        operator_token = consume()
        operator = operator_token.text
        right = parse_factor()
        left = ast.BinaryOp(
            left,
            operator,
            right
        )
    return left

def parse_factor() -> ast.Expression:
    # Same as `parse_term` in the previous version.
    if peek().type == 'int_literal':
        return parse_int_literal()
    elif peek().type == 'identifier':
        return parse_identifier()
    else:
        raise Exception(f'{peek().location}: expected an integer literal or an identifier')
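With these two levels in place, a + b * c parses so that the * ends up deeper in the tree: parse_expression calls parse_term for its operands, and parse_term greedily consumes the whole b * c before returning. Using the AST classes from earlier, the expected result looks like this (a sketch of the output, not part of the parser itself):

```python
from dataclasses import dataclass

@dataclass
class Identifier:
    name: str

@dataclass
class BinaryOp:
    left: object
    op: str
    right: object

# a + b * c groups as a + (b * c): the higher-precedence
# operator sits lower in the tree.
expected = BinaryOp(
    Identifier('a'),
    '+',
    BinaryOp(Identifier('b'), '*', Identifier('c')),
)
assert expected.right.op == '*'
```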

Recursion

Next it’d be nice to support parentheses, so we could write (a + b) * c and get this AST:

In (a + b) * c, the parenthesized expression (a + b) plays the role of a factor of the *. That suggests that we should handle parentheses as one possible case in parse_factor, and that indeed works:

def parse_factor() -> ast.Expression:
    if peek().text == '(':
        return parse_parenthesized()
    elif peek().type == 'int_literal':
        return parse_int_literal()
    elif peek().type == 'identifier':
        return parse_identifier()
    else:
        raise Exception(f'{peek().location}: expected "(", an integer literal or an identifier')

def parse_parenthesized() -> ast.Expression:
    consume('(')
    # Recursively call the top level parsing function
    # to parse whatever is inside the parentheses.
    expr = parse_expression()
    consume(')')
    return expr

Note that our parser is now recursive: parse_parenthesized calls parse_expression, which may end up calling parse_parenthesized again, etc.

We now have a pretty good parser for simple math expressions!
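One consequence worth noting: the parentheses themselves leave no trace in the AST. parse_parenthesized returns the inner expression directly, so (a + b) * c simply produces a tree where the + sits below the *. Using the earlier AST classes (a sketch of the expected output):

```python
from dataclasses import dataclass

@dataclass
class Identifier:
    name: str

@dataclass
class BinaryOp:
    left: object
    op: str
    right: object

# (a + b) * c: the grouping survives as tree structure,
# not as a parenthesis node.
expected = BinaryOp(
    BinaryOp(Identifier('a'), '+', Identifier('b')),
    '*',
    Identifier('c'),
)
assert expected.op == '*'
```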

Other techniques

There are a few more techniques whose existence is good to know about, even if they aren’t needed or recommended on this course.

Backtracking is an alternative to lookahead: it means using try..except instead of if..else to determine which parsing function to call. This technique is not recommended, though it can be necessary for some difficult-to-parse languages.

Grammars are a formal way to specify a language’s syntax. Tools exist to generate parser code automatically from suitable grammar specifications.
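For reference, the expression language we’ve built so far could be described by a grammar roughly like this (an informal EBNF-style sketch; the exact notation varies between tools):

```
expression = term , { ("+" | "-") , term } ;
term       = factor , { ("*" | "/") , factor } ;
factor     = int_literal | identifier | "(" , expression , ")" ;
```

Note how each rule corresponds directly to one of our parsing functions, and the repetition braces { } correspond to the while loops. Handwritten parsers structured this way are called recursive descent parsers.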

Tasks

These tasks will guide you in building up the course project.

It’s a good idea to have passing tests and make a commit between each task.

Task 1

Implement and unit test in your project the parser that we’ve built up above.

Make sure you understand the code.

Clean up some details that we’ve glossed over:

  • Make sure the entire input is always parsed. Don’t allow garbage at the end of the input. Make sure e.g. a + b c gives an error.
  • Handle the case of empty input more gracefully (currently peek() crashes with an IndexError if tokens is empty, because it evaluates tokens[-1]).
  • Remember to also write some test cases for inputs that should fail to parse.

Task 2

Add support for parsing if-then-else expressions like if a then b + c else x * y.

Make the else part optional i.e. also allow if a then b + c.

Remember to write unit tests.

Make sure if-expressions are allowed as parts of other expressions, such as in 1 + if true then 2 else 3.

Task 3

Add support for parsing function calls like f(x, y + z).


Task 4

Add support for some more operators:

  • Remainder operator: %
  • Binary comparison operators: ==, !=, <, <=, >, >=
  • Binary logical operators: and and or
  • Unary operators: not and -. (Be sure to allow chaining: not not x etc.)
  • Assignment = (right-associative!)

Use these precedence levels:

  1. =
  2. or
  3. and
  4. ==, !=
  5. <, <=, >, >=
  6. +, -
  7. *, /, %
  8. Unary - and not
  9. All other constructs: literals, identifiers, if, while, var, blocks, parentheses, function calls.

Bonus task

The code is getting a bit repetitive, isn’t it? Also, what if we wanted to allow the language’s user to specify their own operators?

Generalize the code for binary operators to use a list of precedence levels such as the following, instead of repeating code for each level.

left_associative_binary_operators = [
    ['or'],
    ['and'],
    ['==', '!='],
    ['<', '<=', '>', '>='],
    ['+', '-'],
    ['*', '/', '%'],
]

Task 5

A block is a sequence of expressions inside braces {}, separated and optionally terminated by a semicolon ;. It might look like this:

{
    f(a);
    x = y;
    f(x)
}

Note that the last semicolon is optional. When it’s missing, it means that the last expression in the block is a result expression i.e. it defines the value that the block evaluates to. So e.g. x = { f(a); b } means ”execute f(a) and assign the value b to x”.

If the final semicolon is present, then the parser should insert a Literal(value=None) as the result expression.

Larger example:

{
    while f() do {
        x = 10;
        y = if g(x) then {
            x = x + 1;
            x
        } else {
            g(x)
        };  # <-- (this semicolon will become optional later)
        g(y);
    };  # <------ (this too)
    123
}


Add support for these kinds of brace-delimited blocks to the parser.

Remember to also write tests for cases that should return an error, such as missing semicolons.

Task 6

Unlike Python, we want our language to distinguish between creating new variables and assigning to existing variables. A variable declaration introduces a new variable, and it should look like this:

var x = 123

Add support for parsing variable declarations.

Only allow variable declarations directly inside blocks ({ ... }) and in top-level expressions. (This will make later parts simpler, since we won’t have to consider which scope expressions like if ... then var x = 3 put the x in.)

Task 7

Consider the semicolon in this code:

{
    if ... {
        ...
    };  # <-- this one
    ...
}

It looks weird and unnecessary, right?

Modify your parser to work even if a semicolon is missing after a }.

Note

It is not correct to insert a semicolon after every }. You should not accidentally add a semicolon after the final expression of a block. In other words, { { x } { y } } should parse like { { x }; { y } } and not like { { x }; { y }; }.

Some test cases:

  • { { a } { b } } should be allowed.
  • { a b } should NOT be allowed.
  • { if true then { a } b } should be allowed.
  • { if true then { a }; b } should be allowed.
  • { if true then { a } b c } should NOT be allowed.
  • { if true then { a } b; c } should be allowed.
  • { if true then { a } else { b } 3 } should be allowed.
  • x = { { f(a) } { b } } should be allowed.

Task 8

Let’s add source code locations to AST nodes so that later compiler stages can include locations in their error messages.

Add the field location: Location to the class Expression and pass a location from an appropriate token to each AST node you construct.

Task 9

Implement any remaining syntax features described in the language’s syntax spec, except you can leave typed variable declarations and type expressions for chapter 5 if you wish.

  • Remember to allow multiple top-level expressions.
  • Remember to still write unit tests 🙂