3. Parser
Parsing means turning a list of tokens into an Abstract Syntax Tree (AST). An AST organizes tokens into a tree where operations are connected to their parameters.
For instance, the expression a + b * c
(i.e. token list ['a', '+', 'b', '*', 'c']
) turns into
an AST that looks like this:
This tree means that the program calls operation +
with the following parameters:
- the variable
a
- the result of calling operation
*
with variablesb
andc
as parameters
An AST for a function call like f(x, g(x), y + 3)
looks like this:
This tree means that we call function f
with the following parameters:
- the variable
x
- the result of calling function
g
with variablex
as the parameter - the result of calling operation
+
with variabley
and constant3
as parameters
Other constructs like control flow expressions turn into ASTs similarly.
E.g. if x > 10 then print_int(x) else print_int(y)
becomes:
We translate token lists into ASTs because an AST expresses clearly and unambiguously what operations need to happen and in which order. This makes the program much easier to analyze, interpret and compile.
You can try how different expressions turn into ASTs in the sandbox
Exercise (optional)
Draw ASTs for the following expressions:
a + b
f(a + b, b + c)
f(f(a))
f(a * f(b)) + c
while i < 100 do i = i + 1
Structuring the parser
Let’s define some classes to hold our AST:
# src/compiler/ast.py
@dataclass
class Expression:
"""Base class for AST nodes representing expressions."""
@dataclass
class Literal(Expression):
value: int | bool
@dataclass
class Identifier(Expression):
name: str
@dataclass
class BinaryOp(Expression):
"""AST node for a binary operation like `A + B`"""
left: Expression
op: str
right: Expression
... # You get to define more later
Now we can construct AST nodes like this:
# Construct AST for "x + 3" using
example_ast = BinaryOp(
left=Identifier(name="x"),
op="+",
right=Literal(value=3),
)
Our parser needs to turn a list of tokens into an AST:
# src/compiler/parser.py
from compiler.tokenizer import Token
import compiler.ast as ast
def parse(tokens: list[Token]) -> ast.Expression:
...
Let’s now see how to implement this function.
Exercise (optional)
Next we’ll start with a simple implementation of parse
and add features to it bit by bit.
You may find it useful to follow along in your own project and add some unit tests
after each step.
Warning: when importing AST classes, don’t accidentally import from Python’s
built-in ast
module – remember to import from compiler.ast
.
Parsing functions
The idea for implementing the parser is this:
- Keep track of our position in the token list in a variable
pos
. - Define a set of parsing functions to parse different kinds of AST subtrees.
For instance, there can be one parsing function for if
expressions, another one
for function calls, another one for expressions like a + b
, and so on.
The implementations of parsing functions vary, but they must all follow this specification:
- If the tokens starting at
pos
match the things that the parsing function wants to parse, then movepos
forward past the matching tokens and return an AST representing what was parsed. - Otherwise, raise an error.
Let’s look at an example of a parser that can parse a single ”integer +/- integer” token sequence,
such as 3 + 4
or 7 - 2
.
The example has two helper functions peek
and consume
for conveniently looking
at the next token and advancing pos
. After that, it has two parsing functions:
parse_int_literal
- parses a single integer literalparse_expression
- usesparse_int_literal
to parse an integer literal, then requires a+
or-
, then usesparse_int_literal
to parse another integer literal.
def parse(tokens: list[Token]) -> ast.Expression:
# This keeps track of which token we're looking at.
pos = 0
# Let's first define two useful helper functions:
# 'peek' and 'consume'.
# 'peek()' returns the token at 'pos',
# or a special 'end' token if we're past the end
# of the token list.
# This way we don't have to worry about going past
# the end elsewhere.
def peek() -> Token:
if pos < len(tokens):
return tokens[pos]
else:
return Token(
location=tokens[-1].location,
type="end",
text="",
)
# 'consume(expected)' returns the token at 'pos'
# and moves 'pos' forward.
#
# If the optional parameter 'expected' is given,
# it checks that the token being consumed has that text.
# If 'expected' is a list, then the token must have
# one of the texts in the list.
def consume(expected: str | list[str] | None = None) -> Token:
token = peek()
if isinstance(expected, str) and token.text != expected:
raise Exception(f'{token.location}: expected "{expected}"')
if isinstance(expected, list) and token.text not in expected:
comma_separated = ", ".join([f'"{e}"' for e in expected])
raise Exception(f'{token.location}: expected one of: {comma_separated}')
pos += 1
return token
# This is the parsing function for integer literals.
# It checks that we're looking at an integer literal token,
# moves past it, and returns a 'Literal' AST node
# containing the integer from the token.
def parse_int_literal() -> ast.Literal:
if peek().type != 'int_literal':
raise Exception(f'{peek().location}: expected an integer literal')
token = consume()
return ast.Literal(int(token.text))
# This is our main parsing function for this example.
# To parse "integer + integer" expressions,
# it uses `parse_int_literal` to parse the first integer,
# then it checks that there's a supported operator,
# and finally it uses `parse_int_literal` to parse the
# second integer.
def parse_expression() -> ast.BinaryOp:
left = parse_int_literal()
operator_token = consume(['+', '-'])
right = parse_int_literal()
return ast.BinaryOp(
left,
operator_token.text,
right
)
return parse_expression()
This simple parser works for 2 + 2
and 55 + 7
etc,
but it doesn’t yet support identifiers (e.g. a + 3
) nor
more than one operation (e.g. 1 - 2 + 3
). Next we’ll address these
deficiencies as we introduce the main techniques and concerns for writing
parsing functions.
Lookahead
First, let’s allow identifiers in addition to integer literals, so we can parse
a + 3
, x + y
and similar.
def parse(tokens: list[Token]) -> ast.Expression:
...
def parse_int_literal() -> ast.Literal:
... # (same as before)
def parse_identifier() -> ast.Identifier
... # (very similar to parse_int_literal)
# Borrowing from mathematics, we say that
# "an identifier or an integer literal"
# is called a "term".
def parse_term() -> ast.Expression:
if peek().type == 'int_literal':
return parse_int_literal()
elif peek().type == 'identifier':
return parse_identifier()
else:
raise Exception(f'{peek().location}: expected an integer literal or an identifier')
def parse_expression() -> ast.Expression:
left = parse_term()
operator_token = consume(['+', '-'])
right = parse_term()
return ast.BinaryOp(
left,
operator_token.text,
right
)
return parse_expression()
Here parse_term
looks ahead to see what kind of token comes next,
and uses that to choose which parsing function to call.
We’ll be seeing this pattern a lot.
Associativity
Ok, how do we allow 1 - 2 + 3 + ...
?
First, we need to decide whether we want 1 - 2 + 3
to be understood as
left-associative (1 - 2) + 3
or right-associative 1 - (2 + 3)
.
The convention in existing languages is for most mathematical operators to be
left-associative, so let’s do that. This also makes sense for +
and -
, since
it would be very surprising if:
1 - 2 + 3 = 1 - (2 + 3) = -4 😱
To parse 1 - 2 + 3 + ...
, we need to make parse_expression
loop for as long as
it finds more +
or -
:
def parse_expression() -> ast.Expression:
# Parse the first term.
left = parse_term()
# While there are more `+` or '-'...
while peek().text in ['+', '-']:
# Move past the '+' or '-'.
operator_token = consume()
operator = operator_token.text
# Parse the operator on the right.
right = parse_term()
# Combine it with the stuff we've
# accumulated on the left so far.
left = ast.BinaryOp(
left,
operator,
right
)
return left
By adding each BinaryOp
on top of left
, we achieve left-associativity.
Exercise (optional)
Implement a version of parse_expression
with right-associativity.
(This is not useless - we’ll want right-associativity for operator =
soon.)
Hint
Precedence
Now let’s add operators *
and /
.
We’d like operators to have precedence like in mathematics:
a + b * c
should be understood as a + (b * c)
.
That is, we’d like *
and /
to have higher precedence
than +
and -
.
Note that this does not contradict associativity. Associativity only determines how operators on the same precedence level are grouped.
The easiest way to implement precedence is to write separate parsing functions for each precedence level, and have lower precedence level functions call higher precedence functions.
Concretely, when parsing a + -
expression,
we allow the ”terms” to be * /
expressions.
When parsing a * /
expression,
we allow the ”factors” to be integer literals or identifiers.
def parse_expression() -> ast.Expression:
# Same as before
left = parse_term()
while peek().text in ['+', '-']:
operator_token = consume()
operator = operator_token.text
right = parse_term()
left = ast.BinaryOp(
left,
operator,
right
)
return left
def parse_term() -> ast.Expression:
# Same structure as in 'parse_expression',
# but the operators and function calls differ.
left = parse_factor()
while peek().text in ['*', '/']:
operator_token = consume()
operator = operator_token.text
right = parse_factor()
left = ast.BinaryOp(
left,
operator,
right
)
return left
def parse_factor() -> ast.Expression:
# Same as `parse_term` in the previous version.
if peek().type == 'int_literal':
return parse_int_literal()
elif peek().type == 'identifier':
return parse_identifier()
else:
raise Exception(f'{peek().location}: expected an integer literal or an identifier')
Parentheses
Next it’d be nice to support parentheses, so we could write
(a + b) * c
and get this AST:
In (a + b) * c
, the parenthesized expression (a + b)
plays the role of
a factor of the *
. That suggests that we should handle parentheses as one possible case
in parse_factor
, and that indeed works:
def parse_factor() -> ast.Expression:
if peek().text == '(':
return parse_parenthesized()
elif peek().type == 'int_literal':
return parse_int_literal()
elif peek().type == 'identifier':
return parse_identifier()
else:
raise Exception(f'{peek().location}: expected "(", an integer literal or an identifier')
def parse_parenthesized() -> ast.Expression:
consume('(')
# Recursively call the top level parsing function
# to parse whatever is inside the parentheses.
expr = parse_expression()
consume(')')
return expr
Note that our parser is now recursive: parse_parenthesized
calls parse_expression
,
which may end up calling parse_parenthesized
again, etc.
We now have a pretty good parser for simple math expressions!
Other techniques
There are a few more techniques whose existence is good to know about, even if they aren’t needed or recommended on this course.
Backtracking is an alternative to lookahead:
it means using try..except
instead of if..else
to determine which parsing function to call.
This technique is not recommended, though it can be necessary for some difficult-to-parse
languages.
Grammars are a formal way to specify a language’s syntax. Tools exist to generate parser code automatically from suitable grammar specifications.
Tasks
These tasks will guide you in building up the course project.
It’s a good idea to have passing tests and make a commit between each task.
Task 1
Implement and unit test in your project the parser that we’ve built up above.
Make sure you understand the code.
Clean up some details that we’ve glossed over:
- Make sure the entire input is always parsed. Don’t allow garbage at
the end of the input. Make sure e.g.
a + b c
gives an error. - Handle the case of empty input more nicely
(currently
peek()
crashes uglily withIndexError
iftokens
is empty). - Remember to also write some test cases for inputs that should fail to parse.
Task 2
Add support for parsing if-then-else expressions like if a then b + c else x * y
.
Make the else
part optional i.e. also allow if a then b + c
.
Remember to write unit tests.
Make sure if
-expressions are allowed as parts of other expressions,
such as in 1 + if true then 2 else 3
. Also check that nested if
-expressions work.
Task 3
Add support for parsing function calls like f(x, y + z)
.
Hint
Once you’ve parsed something at the highest precedence level,
check if it’s followed by a (
. If it is, then the thing that was
just parsed is the function and what follows is the argument list.
Task 4
Add support for some more operators:
- Remainder operator:
%
- Binary comparison operators:
==
,!=
,<
,<=
,>
,>=
- Binary logical operators:
and
andor
- Unary operators:
not
and-
. (Be sure to allow chaining:not not x
etc.) - Assignment
=
(right-associative! e.g.a = b = c
must parse asa = (b = c)
)
Use these precedence levels:
=
or
and
,==
,!=
<
,<=
,>
,>=
+
,-
*
,/
,%
- Unary
-
andnot
- All other constructs: literals, identifiers,
if
,while
,var
, blocks, parentheses, function calls.
Extra exercise
The code is getting a bit repetitive, isn’t it? Also, what if we wanted to allow the language’s user to specify their own operators?
Generalize the code for binary operators to use a list of precedence levels such as the following, instead of repeating code for each level.
left_associative_binary_operators = [
['or'],
['and'],
['==', '!='],
['<', '<=', '>', '>='],
['+', '-'],
['*', '/'],
]
Task 5
A block is a sequence of expressions inside braces {}
,
separated and optionally terminated by a semicolon ;
.
It might look like this:
{
f(a);
x = y;
f(x)
}
Note that the last semicolon is optional.
When it’s missing, it means that the last expression in the block is a result expression
i.e. it defines the value that the block evaluates to.
So e.g. x = { f(a); b }
means ”execute f(a)
and assign the value b
to x
”.
If the final semicolon is present,
then the parser should insert a Literal(value=None)
as the result expression.
Larger example:
{
while f() do {
x = 10;
y = if g(x) then {
x = x + 1;
x
} else {
g(x)
}; # <-- (this semicolon will become optional later)
g(y);
}; # <------ (this too)
123
}
Add support for these kinds of brace-delimited blocks to the parser.
Remember to also write tests for cases that should return an error, such as missing semicolons.
Task 6
Unlike Python, we want our language to distinguish between creating new variables and assigning to existing variables. A variable declaration introduces a new variable, and it should look like this:
var x = 123
Add support for parsing variable declarations.
Only allow variable declarations directly inside blocks ({ ... }
)
and in top-level expressions. (This will make things simpler later,
since we won’t have to consider which scope
expressions like if ... then var x = 3
put the x
in.)
Task 7
Consider the semicolon in this code:
{
if ... {
...
}; # <-- this one
...
}
It looks weird and unnecessary, right?
Modify your parser to work even if a semicolon is
missing after a }
.
We could go even further. E.g. JavaScript, Haskell, Go and Scala can be written with almost no semicolons, since their parsers can in most cases infer where the semicolons should go, often assisted by newlines.
Go has documented
quite concisely how it does this, but that solution won’t work for us as-is
if we want to keep the distinction between { x }
and { x; }
.
Note
It is not correct to insert a semicolon after every }
.
You should not accidentally add a semicolon after the final expression of a block.
In other words, { { x } { y } }
should parse like { { x }; { y } }
and not like { { x }; { y }; }
.
Some test cases:
{ { a } { b } }
should be allowed.{ a b }
should NOT be allowed.{ if true then { a } b }
should be allowed.{ if true then { a }; b }
should be allowed.{ if true then { a } b c }
should NOT be allowed.{ if true then { a } b; c }
should be allowed.{ if true then { a } else { b } 3 }
should be allowed.x = { { f(a) } { b } }
should be allowed.
Task 8
Let’s add source code locations to AST nodes so that later compiler stages can include locations in their error messages.
Add the field location: Location
to the class Expression
and pass a location from an appropriate token to each AST node you construct.
Task 9
Implement any remaining syntax features described in the language’s syntax spec, except you can leave typed variable declarations and type expressions for chapter 5 if you wish.
- Remember to allow multiple top-level expressions as if they were inside a block. As in blocks, multiple top-level expressions must be separated by semicolons, and optionally end in a semicolon.
- Writing thorough unit tests is still a good idea, to prevent surprises later.