2. Tokenizer
Tokenizing means splitting a source code text into a list of tokens.
For instance, the code snippet
if a <= bee then print_int(123)
would turn into the token list
['if', 'a', '<=', 'bee', 'then', 'print_int', '(', '123', ')']
A token is a word or symbol that has some meaning for the program. Splitting the source code into tokens makes the next stage, parsing, much easier. Otherwise we’d have to do essentially the same work in the parser, and parsers are difficult enough as it is.
Regular expressions
Regular expressions (regexes) are a text matching tool that is very useful in building tokenizers.
If you’ve already used regexes a lot, you can safely skip this section. If you remember regexes from the course ”Models of Computation” but haven’t used them much in practice, give this section a skim. If you’d like a more thorough and interactive introduction than this, try this online tutorial.
A regex is a pattern that matches some input text.
Let’s look at an example of a regex: [Hh]ello (you|there)[!.]?
It matches inputs like hello you and Hello there!
The regex means:
- The first letter must be an H or an h.
- Then we require ello.
- Then we require either you or there.
- Then we optionally allow a ! or a .
Generally, a regex can use the following operators:
- X? to make X optional
- X* to allow zero or more repetitions of X
- X+ to allow one or more repetitions of X
- X|Y to allow either X or Y
- (...) for grouping, like in math and programming
- [...] for sets of allowed characters
- [^...] for sets of disallowed characters
- . to allow any single character
Operators ?, * and + are greedy: they match as much of the text as possible without preventing the regex as a whole from matching. To match as little as possible, i.e. to make the operators non-greedy, append a ?, so e.g. X* becomes X*?.
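For instance, here is a small demonstration of the difference in Python (a minimal illustration using the re module, which is introduced below):

import re
# Greedy: .* grabs as much as possible, so the match extends to the last 'c'.
print(re.search(r'a.*c', 'abc abc').group())   # prints "abc abc"
# Non-greedy: .*? grabs as little as possible, so the match stops at the first 'c'.
print(re.search(r'a.*?c', 'abc abc').group())  # prints "abc"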
Let’s look at some more examples.
ab?c
- It means ”a”, followed optionally by ”b”, followed by ”c”.
- It matches inputs ac and abc, but not e.g. a or abcd.
ab*c
- It means ”a”, followed by zero or more ”b”, followed by ”c”.
- It matches inputs ac, abc, abbbbc etc.
ab+c
- It means ”a”, followed by one or more ”b”, followed by ”c”.
- It matches inputs abc, abbbbc etc., but not ac.
a(b|B)*c
- It means ”a”, followed by zero or more ”b” or ”B”, followed by ”c”.
- It matches inputs abc, aBc, aBbBbbbBBc etc.
a(b*|B*)c
- It means ”a”, followed by either zero or more ”b” or zero or more ”B”, followed by ”c”.
- It matches inputs aBBc and abbbbc but not aBbBbc.
a[0-9xyz]*c
- It means ”a”, followed by zero or more of: a digit, ”x”, ”y”, or ”z”, followed by ”c”.
- It matches inputs a7y7c, axzc, etc., but not e.g. abc.
a[^0-9xyz]*c
- It means ”a”, followed by zero or more of anything except a digit, ”x”, ”y”, or ”z”, followed by ”c”.
- It matches inputs abc, abcdc, etc., but not e.g. a7y7c.
a.*c
- It means ”a”, followed by anything at all, followed by ”c”.
- It matches any input that starts with an a and ends in a c.
If you’d like to match characters like + or * that happen to also be regex operators, you either need to escape them with \ or put them in square brackets [].
For instance, the regexes a(\+|\*)c and a[+*]c both match a+c and a*c.
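A quick check in Python (an illustrative snippet; the re module is covered in a later section):

import re
print(re.fullmatch(r'a[+*]c', 'a+c'))  # prints an object describing the match
print(re.fullmatch(r'a[+*]c', 'abc'))  # prints "None": 'b' is neither '+' nor '*'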
Regex implementations often have many more features than what we’ve mentioned here, but they are rarely needed.
There are many online environments that assist in learning and developing regexes, e.g. regexlearn.com/playground. They are very useful for the following exercises, as well as for debugging regexes during the course project.
Exercises (optional)
- Create a regex that matches ”hello”, ”hello!”, ”hello!!”, ”hello!!!” etc.
- Modify that regex to allow the word ”there” after ”hello”, separated by at least one space.
- Create a regex that matches one of the arithmetic operators +, -, * and /.
- Create a regex that matches integers like 7, 123, and -95.
- Create a regex that matches quoted strings like "hi" and "so cool".
- Modify that regex to allow escaped double quotes, like in "hello \"human\"".
- Create a regex that matches what you’d consider a reasonable variable name in a programming language.
Using regexes in Python
There are two primary ways to use regexes:
- Validation: check whether the input (all or part of it) matches a regex.
- Search: find parts of the input that match a regex.
Python has a built-in module re
that you can use to create and use regexes:
import re
### Compile the regex into a regex object ###
r = re.compile(r'ab+c')
### Check whether the entire input matches ###
print(r.fullmatch('abc'))
# Prints an object describing the match
print(r.fullmatch('axc'))
# Prints "None"
### Search a string for matches ###
print(r.search('the abc book'))
# Prints "<re.Match object; span=(4, 7), match='abc'>"
print(r.findall('the abc book. abbbc!'))
# Prints "['abc', 'abbbc']"
### Check whether a pattern occurs at an index ###
print(r.match('the abc book', 4))
# Prints "<re.Match object; span=(4, 7), match='abc'>"
print(r.match('the abc book', 5))
# Prints "None", because it didn't match at index 5
Note that we constructed the regex with a string like r'ab+c'.
The r prefix makes it a ”raw string” where backslashes (\) are not interpreted by Python as string escape characters. So while '\n' means ”a newline character”, r'\n' means ”a backslash and an ’n’”.
This leaves backslashes available for regex escapes, so to create a regex that matches the string a.b, you can write r'a\.b' instead of having to write 'a\\.b'.
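A quick way to see the difference (an illustrative snippet, not part of the project):

print(len('\n'))   # prints 1: a single newline character
print(len(r'\n'))  # prints 2: a backslash and an 'n'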
You can find the extensive reference manual for module re
here.
All widely used programming languages have a regex facility, usually as part of the standard library. They all behave similarly and with similar syntax, with small differences in advanced features.
If you’d like to learn the computer science theory behind regexes, including proofs of what they can and cannot do, and instructions for implementing them, see e.g. Introduction to the Theory of Computation.
Tokenization algorithm
With the right regexes, the code for tokenization is reasonably straightforward:
- Create regexes to recognize whitespace, comments and each type of token.
- Start looking at the input string from the beginning.
- While we’ve not reached the end of the input:
  - For each regex: try to match the input at the current location.
  - If the whitespace regex matches, skip over the whitespace.
  - If the comment regex matches, skip over the comment.
  - If some token regex matches, add the token to the result and then skip over it.
  - Otherwise raise an error.
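A minimal sketch of this loop in Python could look roughly like the following (the regexes, names and error handling are placeholders, not a required form for the project):

import re

# Illustrative regexes only; the project will need one regex per token type.
WHITESPACE_RE = re.compile(r'\s+')
COMMENT_RE = re.compile(r'(//|#)[^\n]*')
TOKEN_RES = [
    re.compile(r'[A-Za-z_][A-Za-z0-9_]*'),  # identifiers and keywords
    re.compile(r'[0-9]+'),                  # integer literals
]

def tokenize(source_code: str) -> list[str]:
    result: list[str] = []
    pos = 0
    while pos < len(source_code):
        # Skip whitespace and comments without producing tokens.
        skip = WHITESPACE_RE.match(source_code, pos) or COMMENT_RE.match(source_code, pos)
        if skip is not None:
            pos = skip.end()
            continue
        # Try each token regex at the current position; take the first that matches.
        for regex in TOKEN_RES:
            match = regex.match(source_code, pos)
            if match is not None:
                result.append(match.group())
                pos = match.end()
                break
        else:
            # No regex matched here: the input contains something we don't recognize.
            raise Exception(f'Tokenization failed near {source_code[pos:pos + 10]!r}')
    return result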
Tasks
These tasks will guide you in building up the course project.
Hint: Test-driven development works well here. If you write some test cases before or just after adding a feature to your tokenizer, then you can iterate quickly and confidently.
Task 1
To get started, download the project template,
extract it and follow the instructions in README.md
to set it up.
Create src/compiler/tokenizer.py
and write in it a tokenizer function
tokenize(source_code: str) -> list[str]
that recognizes the following tokens:
- Identifiers i.e. variable names and keywords like if, while, etc.
  - They can start with any character a-z / A-Z or an underscore (_), and any following characters may be a-z / A-Z, digits (0-9) or underscores.
- Non-negative integer literals like 7 and 123.
  - Don’t accept a minus sign - as part of an integer literal. We’ll later tokenize the minus sign as an operator. That’s because tokenizing e.g. 3-2 as ['3', '-', '2'] is easier to handle later than tokenizing it as ['3', '-2'].
Have it skip whitespace, including newlines. Don’t worry about comments or other token types yet.
Create tests/tokenizer_test.py
and write some tests. Example:
from compiler.tokenizer import tokenize
def test_tokenizer_basics():
assert tokenize("if 3\nwhile") == ['if', '3', 'while']
...
Task 2
So far we’ve built the tokenizer to return a simple list of strings (['if', '3', ...]).
For the benefit of later compiler stages, it’s useful to change tokens to be objects that include three things:
- The text of the token.
- The type of the token (integer, operator, punctuation, identifier).
- The source location (e.g. as file, line and column) of the token.
The type information makes writing the parser easier, and the source location can be used in error messages later.
Hint: It’s best to define a separate class for source locations, because locations will be passed along in later compiler stages.
Hint: Python’s dataclasses are worth learning at this point.
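For example, the location and token types might be sketched roughly like this (the names and fields here are illustrative, not prescribed by the project template):

from dataclasses import dataclass

@dataclass
class SourceLocation:
    file: str
    line: int
    column: int

@dataclass
class Token:
    loc: SourceLocation
    type: str   # e.g. 'int_literal', 'identifier', 'operator', 'punctuation'
    text: str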
About unit tests
Now that you’re turning tokens into objects, you’d have to write token objects on the right hand side, including file, line and column information. That sounds tedious and brittle, right?
Here are two ways to deal with this:
Option 1 (easier for now): Test file, line, column and type information in a few dedicated tests.
For the rest, define a wrapper function for tokenize
that returns just the texts of the tokens.
Option 2 (more useful later): Create a special source location object that is considered equal to all source location objects. Then you can use it as a placeholder in your tests when you don’t care about the location.
Here’s an example where this special object is called L
(short name because it’ll be used a lot):
assert tokenize('aaa 123 bbb') == [
Token(loc=L, type="identifier", text="aaa"),
Token(loc=L, type="int_literal", text="123"),
Token(loc=L, type="identifier", text="bbb"),
]
To define such an L, you need to define an __eq__ method for your source location type so that it returns true if self or the argument is L.
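One possible way to do this, extending the SourceLocation sketch from above (again, the names are illustrative):

from dataclasses import dataclass

@dataclass
class SourceLocation:
    file: str
    line: int
    column: int

    def __eq__(self, other: object) -> bool:
        # The placeholder L is considered equal to every location (and vice versa).
        if self is L or other is L:
            return True
        if not isinstance(other, SourceLocation):
            return NotImplemented
        return (self.file, self.line, self.column) == (other.file, other.line, other.column)

# A throwaway instance used only as a "don't care" placeholder in tests.
L = SourceLocation(file='', line=-1, column=-1)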
Task 3
Extend the tokenizer to recognize the following:
- Operators: +, -, *, /, =, ==, !=, <, <=, >, >=
  - Hint: if you use | in your regex, put the longer operators first, because the first (not the longest) matching option wins. (See the sketch after this list.)
- Punctuation: ( ) { } , ;
- Skip over one-line comments like // bla and # bla.
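For instance, regexes along these lines could work (a sketch; the exact patterns are up to you):

import re
# Longer operators first, so that e.g. '==' is not split into two '=' tokens.
OPERATOR_RE = re.compile(r'==|!=|<=|>=|[+\-*/=<>]')
# Single-character punctuation.
PUNCTUATION_RE = re.compile(r'[(){},;]')
# One-line comments: '//' or '#' followed by anything up to the end of the line.
COMMENT_RE = re.compile(r'(//|#)[^\n]*')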
Bonus task
You can support multi-line comments too if you wish. They look like this:
/*
Many lines
of comment
text.
*/
print_int(123)
/* Another
comment. */
Note that regexes are not powerful enough to recognize nested multi-line comments. This is fine – most programming languages don’t support them either.
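One way to match non-nested multi-line comments is a non-greedy pattern with the DOTALL flag (just a sketch):

import re
# re.DOTALL makes '.' match newlines too; the non-greedy '*?' stops at the
# first '*/' instead of the last one.
MULTILINE_COMMENT_RE = re.compile(r'/\*.*?\*/', re.DOTALL)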
Hint