lexing fundamentals

Lexing

It returns two things, an identifier for the token it just read and possibly some sort of value for it.

We use flex for this. It takes multiple regex rules and for each regex it returns a token and its value.

For each rule, it matches tokens based on the Aho-Corasick algorithm. This is basically a preprocessing table for multiple words, similar to KMP algorithm.

vector<int> jump(const string &s) {
    int len = 0;
    vector<int> v(s.size());
    v[0] = 0;
    for (int i = 1; i < s.size(); i++) {
        if (s[len] == s[i]) {
          len++;
          v[i] = len;
        }
        else if (len != 0) {
          len = v[len - 1]; // check the previous smaller prefixes of the length
                            // matched
          i--; // we don't want to incrememt without it matching
        } else {
          v[i] = 0; // Can't be matched, has to be zero
        }
    }
    return v;
}

vector<int> kmp(const string &text, const string &pattern) {
    vector<int> jumps = jump(pattern);
    vector<int> result;
    int text_idx = 0, pat_idx = 0;
    while (text.size() - text_idx >= pattern.size() - pat_idx) {
        if (text[text_idx] == pattern[pat_idx]) {
            pat_idx++; text_idx++;
        }
        if (pat_idx == pattern.size()) {
            result.push_back(text_idx - pat_idx + 1);
            pat_idx = jumps[pat_idx - 1];
        }
        else if (text_idx < text.size() && text[text_idx] != pattern[pat_idx]) {
            if (pat_idx) {
                pat_idx = jumps[pat_idx - 1];
            } else {
                text_idx++;
            }
        }
    }
    return result;
}

To show that it's linear time, if the increment does not happen for some k iterations, we have already matched the length of atleast k of the pattern and then moved to this spot in the pattern and text. Therefore, it can only do at most ~ 2k iterations for a segment of length k. Hence it is linear in time.

This matches a pattern in a text. We can extend this idea to a DFA for multiple words and transitions based on the input character to achieve what's called as the Aho Corasick automaton.

Usually the longest match is picked. This is also true in (f)lex. In case two of the regexes have the same length, the one above gets matched.

All of this was for theoretical discussion based on string matching. Actual matching is done based on regex rules. There is a conversion from the regex to a DFA using a syntax tree.

We first make the syntax tree of the regex followed by a meta symbol (in this case #).

We also assign a number to each of the leaves.

firstpos is the set of positions where the current node's subtree would start. For * it will be the same as its child, for concat it'll be the lesser of the two. For or node it will be the union of the two. For concat if the first one can null then it's a union.
lastpos is where they end. Rules are the same. For concat if the second one is nullable then it's a union.
followpos is the number that would immediately follow the token represented by the current node. If a and b are concatenated, then firstpos of b is the followpos for lastpos of a. For star node it's the firstpos of the same node. For or node you need to go up.

Now the DFA transitions are as follows:

Take the root and mark its firstpos as the unmarked state
While there is an unmarked state s, mark the state s.
For each leaf symbol a (the input letter in the regex) in s, mark a transition from s to the union of followpos(p) where p are the leaf nodes corresponding to a. So if the state s has firstpos 123, with 1 and 3 being a, then transition on reading a to $\bigcup followpos(p)$ where p is 1 and 3 in this case