lexing fundamentals

innocentzero

2026-03-14

lexing fundamentals

Lexing

It returns two things, an identifier for the token it just read and possibly some sort of value for it.

We use flex for this. It takes multiple regex rules and for each regex it returns a token and its value.

For each rule, it matches tokens based on the Aho-Corasick algorithm. This is basically a preprocessing table for multiple words, similar to KMP algorithm.

vector<int> jump(const string &s) {
    int len = 0;
    vector<int> v(s.size());
    v[0] = 0;
    for (int i = 1; i < s.size(); i++) {
        if (s[len] == s[i]) {
          len++;
          v[i] = len;
        }
        else if (len != 0) {
          len = v[len - 1]; // check the previous smaller prefixes of the length
                            // matched
          i--; // we don't want to incrememt without it matching
        } else {
          v[i] = 0; // Can't be matched, has to be zero
        }
    }
    return v;
}

vector<int> kmp(const string &text, const string &pattern) {
    vector<int> jumps = jump(pattern);
    vector<int> result;
    int text_idx = 0, pat_idx = 0;
    while (text.size() - text_idx >= pattern.size() - pat_idx) {
        if (text[text_idx] == pattern[pat_idx]) {
            pat_idx++; text_idx++;
        }
        if (pat_idx == pattern.size()) {
            result.push_back(text_idx - pat_idx + 1);
            pat_idx = jumps[pat_idx - 1];
        }
        else if (text_idx < text.size() && text[text_idx] != pattern[pat_idx]) {
            if (pat_idx) {
                pat_idx = jumps[pat_idx - 1];
            } else {
                text_idx++;
            }
        }
    }
    return result;
}

To show that it's linear time, if the increment does not happen for some k iterations, we have already matched the length of atleast k of the pattern and then moved to this spot in the pattern and text. Therefore, it can only do at most ~ 2k iterations for a segment of length k. Hence it is linear in time.

This matches a pattern in a text. We can extend this idea to a DFA for multiple words and transitions based on the input character to achieve what's called as the Aho Corasick automaton.

Usually the longest match is picked. This is also true in (f)lex. In case two of the regexes have the same length, the one above gets matched.

All of this was for theoretical discussion based on string matching. Actual matching is done based on regex rules. There is a conversion from the regex to a DFA using a syntax tree.

We first make the syntax tree of the regex followed by a meta symbol (in this case #).

regex tree

We also assign a number to each of the leaves.

Now the DFA transitions are as follows:

  1. Take the root and mark its firstpos as the unmarked state
  2. While there is an unmarked state s, mark the state s.
  3. For each leaf symbol a (the input letter in the regex) in s, mark a transition from s to the union of followpos(p) where p are the leaf nodes corresponding to a. So if the state s has firstpos 123, with 1 and 3 being a, then transition on reading a to followpos(p)\bigcup followpos(p) where p is 1 and 3 in this case