The Alphabet Problem
When it comes to personal projects, one of my favorite design tenets is maximizing flexibility where possible. This allows me to experiment with different ideas quickly, and I seldom feel creatively shackled to my past decisions. This is why it is important to me that the compiler for my shader language supports Unicode - maybe one day I’ll decide to use emojis as keywords or Nordic runes as operators! This requirement gave rise to many interesting problems, including one I call “the alphabet problem”, which this post is about.
The Alphabet Problem
When parsing a program, a compiler first tokenizes the input. A common approach is to build an NFA from the set of known token patterns, convert it to a DFA, and use the DFA to match tokens in the input. NFAs and DFAs are partially defined by an alphabet - a finite and discrete set of symbols that form the atoms of the language they describe. In common implementations, like the one proposed in this excellent article by Russ Cox, each NFA or DFA state holds a transition table implemented as an array of pointers mapping alphabet characters to next states. This is tenable for small alphabets - a language allowing only ASCII characters would only need to allocate a block of 128 pointers per state.
Can we take a similar approach for any alphabet, say one containing every Unicode character? At the time of this writing, there are over 150,000 characters supported by the Unicode standard. On a 64-bit system, allocating an array of pointers with one element per Unicode character would require about 1.2MB of memory. That’s 1.2MB per state, and remember: the conversion from NFA to DFA can result in a combinatorial explosion in the number of states. This is the alphabet problem: when constructing a DFA, space complexity scales with the number of states multiplied by the size of the alphabet, which can be problematic for large alphabets. Can we do better? As it turns out, we can!
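To put rough numbers on that, here is a tiny Python sketch (my own illustration, not part of the compiler) of an array-backed transition table and the memory math:

import math

# Illustrative sketch of an array-backed transition table: one next-state
# slot per character in the alphabet (None = no transition).
def new_state(alphabet_size):
    return [None] * alphabet_size

# ASCII-only alphabet: 128 slots per state - tiny.
ascii_state = new_state(128)

# Full Unicode: roughly 150,000 assigned characters. With 8-byte pointers,
# that is about 150,000 * 8 bytes per state.
print(f"{150_000 * 8 / 1_000_000:.1f} MB per state")  # -> 1.2 MB per state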
Note: You may have already observed that valid state transitions are usually sparse, and you may be tempted to argue that using a hash map instead of an array for the transition table would alleviate the issue. But this does not help in the worst case. For example, consider the application of regular expressions, and imagine a pattern that matches a range over every CJK character, of which there are tens of thousands. In other words, we can’t assume that state transitions are always sparse!
Ranges to the Rescue
Think about your favorite programming language. There probably aren’t that many types of tokens - the compiler might need to recognize a few keywords, some special symbols, user-defined identifiers, and maybe some literal values. For reference, the first version of my language uses only 38 token types. Intuitively, we shouldn’t need a massive alphabet to match these tokens, even if some can contain virtually any Unicode character.
The solution to the alphabet problem is to represent the alphabet as a set of character ranges, not a set of characters. We can derive the valid ranges from the known patterns used to match each type of token, following these rules:
- Any literal character specified by a token pattern is converted to a range that starts and ends with that same character, e.g. the literal a becomes the range [a, a]
- Any range of characters specified by a token pattern is added directly to the alphabet
For example, consider these token patterns:
KEYWORD_VAR var
OP_ASSIGN =
INTEGER_LIT [0-9]+
IDENTIFIER [a-zA-Z]+
Using the rules above, the alphabet would become
[v, v]
[a, a]
[r, r]
[=, =]
[0, 9]
[a, z]
[A, Z]
Notice that we’ve essentially compressed the alphabet by using ranges. Although this language supports every letter of the English alphabet, there are only 7 elements!
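As a sketch of how these rules might be applied (the pattern decomposition below is hand-encoded for illustration rather than parsed from the patterns, and the names are mine):

# Rule 1: each literal character becomes a one-character range.
def literal(s):
    return [(c, c) for c in s]

patterns = {
    "KEYWORD_VAR": literal("var"),
    "OP_ASSIGN":   literal("="),
    "INTEGER_LIT": [("0", "9")],              # Rule 2: ranges added directly
    "IDENTIFIER":  [("a", "z"), ("A", "Z")],  # Rule 2
}

# Collect the (deduplicated) ranges into the alphabet.
alphabet = {r for ranges in patterns.values() for r in ranges}
print(sorted(alphabet))
# [('0', '9'), ('=', '='), ('A', 'Z'), ('a', 'a'), ('a', 'z'), ('r', 'r'), ('v', 'v')]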
Eliminating Overlap
Problem solved, right? Not quite. Recall that the elements of an alphabet must be discrete, but in the example above, the ranges overlap - [a, a], [r, r], and [v, v] all fall inside [a, z]! Therefore, we need to create an equivalent set of non-overlapping ranges. In this case, equivalent means:
- The non-overlapping set covers the exact same characters as the overlapping set
- Every range in the overlapping set can be expressed as the union of one or more ranges in the non-overlapping set
For example, the overlapping ranges in the example above can be converted to these equivalent, non-overlapping ranges:
[a, a]
[b, q]
[r, r]
[s, u]
[v, v]
[w, z]
[A, Z]
[0, 9]
[=, =]
Let’s design an algorithm to do this using pseudocode. You can find my implementation in C++ here.
The general approach is to break each range into two elements: the starting element and the ending element. We can use a min heap to construct an ordered view of these elements, sorted lexicographically. We also keep track of whether each element was starting or ending and use that as a tie breaker for sorting (for elements containing the same character, starting elements are processed before ending elements). Then, we construct the alphabet by popping each element from the queue and constructing new ranges, like this:
// begin() and end() are mutative operations that start and end ranges
// in the output
// input is a sorted list of tuples (C, S), where C is the character
// and S is a boolean representing whether it starts a range
q = PriorityQueue(input)
while !q.empty():
    curr = q.pop()
    character, start = curr.character, curr.start
    if start:
        begin(character)
    else:
        end(character)
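To make the ordering concrete, here is a small Python sketch (the representation is mine, not the author's) that builds the sorted element stream for the example alphabet. Note how starting elements sort before ending elements at the same character:

import heapq

# Each range contributes a starting and an ending element, sorted by
# character code; False sorts before True in Python, so a starting element
# is processed before an ending element at the same character.
ranges = [("v", "v"), ("a", "a"), ("r", "r"), ("=", "="),
          ("0", "9"), ("a", "z"), ("A", "Z")]

heap = []
for lo, hi in ranges:
    heapq.heappush(heap, (ord(lo), False))  # starting element
    heapq.heappush(heap, (ord(hi), True))   # ending element

while heap:
    code, is_end = heapq.heappop(heap)
    print(chr(code), "end" if is_end else "start")
# -> 0 start, 9 end, = start, = end, A start, Z end, a start, a start,
#    a end, r start, r end, v start, v end, z end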
The foundation is laid, but it will only work if elements alternate between starting and ending (thus, no overlap to begin with). What happens if we get two starting elements or two ending elements in a row?
If we get a starting element and the previous range has not been closed, we need to close it. Similarly, if we get an ending element and the previous range is already closed, we need to start a new one. A good way to think about this is that we are inserting synthetic elements to force an alternating order, ensuring ranges are contiguous and non-overlapping.
// begin() and end() are mutative operations that start and end ranges
// in the output
// isComplete() returns whether the last range in the output has both
// a start and an end
// last() returns the last character pushed to the output
// input is a sorted list of tuples (C, S), where C is the character
// and S is whether it starts a range
q = PriorityQueue(input)
while !q.empty():
    curr = q.pop()
    character, start = curr.character, curr.start
    if start:
        // End the previous range if it is not complete
        if !isComplete():
            end(character - 1)
        begin(character)
    else:
        // Begin the next range if the previous range is already complete
        if isComplete():
            begin(last() + 1)
        end(character)
If two ranges happen to both start or end on the same character, we should not consider them separate elements. For example, the overlapping ranges [a, z] and [y, z] should produce the non-overlapping ranges [a, x] and [y, z]. Notice that, ordinarily, two overlapping ranges are converted to three non-overlapping ranges, but in this case we only produce two because both ranges have the same ending element. Therefore, our algorithm also needs to filter duplicates.
// begin() and end() are mutative operations that start and end ranges
// in the output
// isComplete() returns whether the last range in the output has both
// a start and an end
// last() returns the last character pushed to the output
// input is a sorted list of tuples (C, S), where C is the character
// and S is whether it starts a range
q = PriorityQueue(input)
prev = null
while !q.empty():
    curr = q.pop()
    character, start = curr.character, curr.start
    // Filter duplicates
    if prev != null && prev.character == character && prev.start == start:
        continue
    prev = curr
    if start:
        // End the previous range if it is not complete
        if !isComplete():
            end(character - 1)
        begin(character)
    else:
        // Begin the next range if the previous range is already complete
        if isComplete():
            begin(last() + 1)
        end(character)
Lastly, we need to account for non-contiguous ranges that are spanned by another range. For example, the ranges [a, e], [c, x], and [w, z] should be transformed to [a, b], [c, e], [f, v], [w, x], and [y, z]. Notice that neither f nor v appears in the original ranges, but the range [f, v] must be included to span the entire [c, x] range!
We can do this by incrementing and decrementing an integer depending on whether the current element is starting or ending. If the counter is greater than zero, we know the current element is overlapped by at least one other range, and any gaps between ranges must be filled.
// begin() and end() are mutative operations that start and end ranges
// in the output
// isComplete() returns whether the last range in the output has both
// a start and an end
// last() returns the last character pushed to the output
// input is a sorted list of tuples (C, S), where C is the character
// and S is whether it starts a range
q = PriorityQueue(input)
prev = null
counter = 0
while !q.empty():
    curr = q.pop()
    character, start = curr.character, curr.start
    // If no range is currently open, begin a new range and continue.
    // The remaining code in the loop assumes at least one range is open.
    if counter == 0:
        counter++
        begin(character)
        // Remember this element so duplicates of it are filtered below
        prev = curr
        continue
    // Keep track of whether we are in an overlapping range
    if start:
        counter++
    else:
        counter--
    // Filter duplicates
    if prev != null && prev.character == character && prev.start == start:
        continue
    prev = curr
    if start:
        if isComplete():
            // If there is a gap, fill it in since there is an overlapping
            // range
            if last() + 1 != character:
                begin(last() + 1)
                end(character - 1)
        else:
            // End the previous range if it is not complete
            end(character - 1)
        begin(character)
    else:
        // Begin the next range if the previous range is already complete
        if isComplete():
            begin(last() + 1)
        end(character)
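For reference, here is a runnable Python translation of the pseudocode above. It is my own sketch rather than the author's C++ implementation; the name flatten_ranges and the list-based output representation are made up for illustration.

import heapq

def flatten_ranges(ranges):
    """Convert possibly-overlapping (lo, hi) character ranges into an
    equivalent sorted list of non-overlapping ranges."""
    # Each element is (code, is_end); False sorts before True, so starting
    # elements are processed before ending elements at the same character.
    heap = []
    for lo, hi in ranges:
        heapq.heappush(heap, (ord(lo), False))
        heapq.heappush(heap, (ord(hi), True))

    output = []      # list of [start_code, end_code or None]
    prev = None
    counter = 0      # number of ranges currently open

    def is_complete():
        return bool(output) and output[-1][1] is not None

    def last():
        return output[-1][1]

    def begin(code):
        output.append([code, None])

    def end(code):
        output[-1][1] = code

    while heap:
        code, is_end = heapq.heappop(heap)
        # No range is open: start a fresh one and move on.
        if counter == 0:
            counter += 1
            begin(code)
            prev = (code, is_end)
            continue
        counter += -1 if is_end else 1
        # Filter duplicate start/end elements.
        if prev == (code, is_end):
            continue
        prev = (code, is_end)
        if not is_end:
            if is_complete():
                if last() + 1 != code:   # fill the gap inside a spanning range
                    begin(last() + 1)
                    end(code - 1)
            else:
                end(code - 1)            # close the unfinished previous range
            begin(code)
        else:
            if is_complete():
                begin(last() + 1)        # open the next range
            end(code)

    return [(chr(lo), chr(hi)) for lo, hi in output]

print(flatten_ranges([("a", "e"), ("c", "x"), ("w", "z")]))
# [('a', 'b'), ('c', 'e'), ('f', 'v'), ('w', 'x'), ('y', 'z')]

Running it on the seven overlapping alphabet ranges from the earlier example yields the nine non-overlapping ranges listed above.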
That’s it! Now, we can use our new alphabet to construct an NFA that supports Unicode while using a reasonable amount of memory.
As a final note, this optimization doesn’t come for free; it’s a trade-off. Because we use ranges, we no longer get O(1) lookups into the transition table when traversing the state machine. Instead, we can achieve lookups in O(log K) time using a binary search over the sorted alphabet, where K is the number of ranges in the alphabet.
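As an illustration of that lookup (again my own sketch, not the compiler's code), a binary search over the sorted range starts does the job:

import bisect

# The alphabet is a sorted list of non-overlapping (lo, hi) code point
# ranges; we binary-search for the range whose interval contains a character.
alphabet = [(ord(a), ord(b)) for a, b in
            [("0", "9"), ("=", "="), ("A", "Z"), ("a", "a"), ("b", "q"),
             ("r", "r"), ("s", "u"), ("v", "v"), ("w", "z")]]
starts = [lo for lo, _ in alphabet]

def symbol_index(ch):
    """Return the index of the alphabet range containing ch, or None."""
    i = bisect.bisect_right(starts, ord(ch)) - 1
    if i >= 0 and alphabet[i][0] <= ord(ch) <= alphabet[i][1]:
        return i
    return None

print(symbol_index("t"))  # 6 -> the [s, u] range
print(symbol_index("!"))  # None -> not in the alphabet

The returned index can then be used as the column into each state's (now much smaller) transition table.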