2. Tokenizer
Tokenizing means splitting a source code text into a set of tokens.
For instance, the code snippet
if a <= bee then print_int(123)
would turn into the token list
['if', 'a', '<=', 'bee', 'then', 'print_int', '(', '123', ')']
A token is a word or symbol that has some meaning for the program. Splitting the source code into tokens makes the next stage, parsing, much easier. Otherwise we’d have to do essentially the same work in the parser, and parsers are difficult enough as it is.
Regular expressions
Regular expressions (regexes) are a text matching tool that is very useful in building tokenizers.
If you’ve already used regexes a lot, you can safely skip this section. If you remember regexes from the course ”Models of Computation” but haven’t used them much in practice, give this section a skim. If you’d like a more thorough and interactive introduction than this, try this online tutorial.
A regex is a pattern that matches some input text.
Let’s look at an example of a regex: [Hh]ello (you|there)[!.]?
It matches inputs like hello you
and Hello there!
The regex means:
- The first letter must be an
H
or anh
. - Then we require
ello
. - Then we require either
you
orthere
. - Then we optionally allow a
!
or a.
.
Generally, a regex can use the following operators:
X?
to makeX
optionalX*
to allow zero or more repetitions ofX
X+
to allow one or more repetitions ofX
X|Y
to allow eitherX
orY
(...)
for grouping, like in math and programming[...]
for sets of allowed characters[^...]
for sets of disallowed characters.
to allow any single character
Operators ?
, *
and +
are greedy: they match as much of the text
as possible without preventing the regex as a whole from matching.
To match as little as possible i.e. to make the operators non-greedy,
append a ?
, so e.g. X*
becomes X*?
.
Let’s look at some more examples.
ab?c
- It means ”a”, followed optionally by ”b”, followed by ”c”.
- It matches inputs
ac
andabc
, but not e.g.a
orabcd
.
ab*c
- It means ”a”, followed by zero or more ”b”, followed by ”c”.
- It matches inputs
ac
,abc
,abbbbc
etc.
ab+c
- It means ”a”, followed by one or more ”b”, followed by ”c”.
- It matches inputs
abc
,abbbbc
etc., but notac
.
a(b|B)*c
- It means ”a”, followed by zero or more ”b” or ”B”, followed by ”c”.
- It matches inputs
abc
,aBc
,aBbBbbbBBc
etc.
a(b*|B*)c
- It means ”a”, followed by either zero or more ”b” or zero or more ”B”, followed by ”c”.
- It matches inputs
aBBc
andabbbbc
but notaBbBbc
.
a[0-9xyz]*c
- It means ”a”, followed by zero or more of: a digit, ”x”, ”y”, or ”z”.
- It matches inputs
a7y7c
,axzc
, etc., but not e.g.abc
.
a[^0-9xyz]*c
- It means ”a”, followed by zero or more of anything except a digit, ”x”, ”y”, or ”z”.
- It matches inputs
abc
,abcdc
, etc., but not e.g.a7y7c
.
a.*c
- It means ”a”, followed by anything at all, followed by ”c”.
- It matches any input that starts with an
a
and ends in ac
.
If you’d like to match characters like +
or *
that happen to also be regex operators,
you either need to escape them with \
or put them in angle brackets []
.
For instance, the regexes a(\+|\*)c
and a[+*]c
both match a+c
and a*c
.
Regex implementations often have many more features than what we’ve mentioned here, but they are rarely needed.
There are many online environments that assist in learning and developing regexes, e.g. regexlearn.com/playground. They are very useful for the following exercises, as well as for debugging regexes during the course project.
Exercises (optional)
- Create a regex that matches ”hello”, ”hello!”, ”hello!!”, ”hello!!!” etc.
- Modify that regex to allow the word ”there” after ”hello”, separated by at least one space.
- Create a regex that matches one of the arithmetic operators
+
,-
,*
and/
. - Create a regex that matches integers like
7
,123
, and-95
. - Create a regex that matches quoted strings like
"hi"
and"so cool"
(but not"a"b"
. - Modify that regex to allow escaped double quotes, like in
"hello \"human\""
. - Create a regex that matches what you’d consider a reasonable variable name in a programming language.
Using regexes in Python
There are two primary ways to use regexes:
- Validation: check whether the input (all or part of it) matches a regex.
- Search: find parts of the input that match a regex.
Python has a built-in module re
that you can use to create and use regexes:
import re
### Compile the regex into a regex object ###
r = re.compile(r'ab+c')
### Check whether the entire input matches ###
print(r.fullmatch('abc'))
# Prints an object describing the match
print(r.fullmatch('axc'))
# Prints "None"
### Search a string for matches ###
print(r.search('the abc book'))
# Prints "<re.Match object; span=(4, 7), match='abc'>"
print(r.findall('the abc book. abbbc!'))
# Prints "['abc', 'abbbc']"
### Check whether a pattern occurs at an index ###
print(r.match('the abc book', 4))
# Prints "<re.Match object; span=(4, 7), match='abc'>"
print(r.match('the abc book', 5))
# Prints "None", because it didn't match at index 5
Note that we constructed the regex with a string like r'ab+c'
.
The r
prefix makes it a ”raw string” where backslashes (\
) are
not interpreted by Python as string escape characters. So while '\n'
means ”a newline character”, r'\n'
means ”a backslash and an ’n’”.
This leaves backslashes available for regex escapes,
so to create a regex that matches the string a.b
,
you can write r'a\.b'
instead of having to write 'a\\.b'
.
You can find the extensive reference manual for module re
here.
All widely used programming languages have a regex facility, usually as part of the standard library. They all behave similarly and with similar syntax, with small differences in advanced features.
If you’d like to learn the computer science theory behind regexes, including proofs of what they can and cannot do, and instructions for implementing them, see e.g. Introduction to the Theory of Computation.
Tokenization algorithm
With the right regexes, the code for tokenization is reasonably straightforward:
- Create regexes to recognize whitespace, comments and each kind of token (positive integer, identifier, operator, …).
- Start looking at the input string from the beginning.
- While we’ve not reached the end of the input:
- For each regex: try to match the input at the current location.
- If the whitespace regex matches, skip over the whitespace.
- If the comment regex matches, skip over the comment.
- If some token regex matches, add the token to the result and then skip over it.
- Otherwise raise an error.
Tasks
These tasks will guide you in building up the course project.
Hint: Test-driven development works well here. If you write some test cases before or just after adding a feature to your tokenizer, then you can iterate quickly and confidently.
Task 1
To get started, download the project template,
extract it and follow the instructions in README.md
to set it up.
Create src/compiler/tokenizer.py
and write in it a tokenizer function
tokenize(source_code: str) -> list[str]
that recognizes the following tokens:
- Identifiers i.e. variable names and keywords like
if
,while
, etc.- They can start with any character a-z / A-Z or an underscore (
_
), and any following characters may be a-z / A-Z, digits (0-9) or underscores.
- They can start with any character a-z / A-Z or an underscore (
- Non-negative integer literals like
7
, and123
.- Don’t accept a minus sign
-
as part of an integer literal. We’ll later tokenize the minus sign as an operator. That’s because tokenizing e.g.3-2
as['3', '-', '2']
is easier to handle later than tokenizing it as['3', '-2']
.
- Don’t accept a minus sign
Have it skip whitespace, including newlines. Don’t worry about comments or other token types yet.
Create tests/tokenizer_test.py
and write some tests. Example:
from compiler.tokenizer import tokenize
def test_tokenizer_basics() -> None:
assert tokenize("if 3\nwhile") == ['if', '3', 'while']
...
Task 2
So far we’ve built the tokenizer to return a simple list of strings (['if', '3', ...]
).
For the benefit of later compiler stages, it’s useful to change tokens to be objects that include three things:
- The text of the token.
- The type of the token (integer, operator, punctuation, identifier)
- The source location (e.g. as file, line and column) of the token.
The type information makes writing the parser easier, and the source location can be used in error messages later.
Hint: It’s best to define a separate class for source locations, because locations will be passed along in later compiler stages.
Hint: Python’s dataclasses are worth learning at this point.
About unit tests
Now that you’re turning tokens into objects, you’d have to write full token objects, including file, line and column information, on the right hand side of each comparison in each unit test. That sounds tedious and brittle, right?
Here are two ways to deal with this:
Option 1 (easier for now): Test file, line, column and type information in a few dedicated tests.
For the rest, define a wrapper function for tokenize
that returns just the texts of the tokens.
Option 2 (more useful later): Create a special source location object that is considered equal to all source location objects. Then you can use it as a placeholder in your tests when you don’t care about the location.
Here’s an example where this special object is called L
(short name because it’ll be used a lot):
assert tokenize('aaa 123 bbb') == [
Token(loc=L, type="identifier", text="aaa"),
Token(loc=L, type="int_literal", text="123"),
Token(loc=L, type="identifier", text="bbb"),
]
To define such an L
, you need to define an __eq__
method for your token type so that it returns true if self
or the argument is L
.
Task 3
Extend the tokenizer to recognize the following:
- Operators:
+
,-
,*
,/
,=
,==
,!=
,<
,<=
,>
,>=
- Hint: if you use
|
in your regex, put the longer operators first, because the first matching option wins (not the longest).
- Hint: if you use
- Punctuation:
(
,)
,{
,}
,,
,;
- Skip over one-line comments like
// bla
and# bla
.
Extra exercise
You can support multi-line comments too if you wish. They look like this:
/*
Many lines
of comment
text.
*/
print_int(123)
/* Another
comment. */
Note that regexes are not powerful enough to recognize nested multi-line comments. This is fine – most programming languages don’t support them either.
Hint