2026-03-14
It returns two things, an identifier for the token it just read and possibly some sort of value for it.
We use flex for this. It takes multiple regex rules and for each regex it returns a token and its value.
For each rule, it matches tokens based on the Aho-Corasick algorithm. This is basically a preprocessing table for multiple words, similar to KMP algorithm.
vector<int> jump(const string &s) {
int len = 0;
vector<int> v(s.size());
v[0] = 0;
for (int i = 1; i < s.size(); i++) {
if (s[len] == s[i]) {
len++;
v[i] = len;
}
else if (len != 0) {
len = v[len - 1]; // check the previous smaller prefixes of the length
// matched
i--; // we don't want to incrememt without it matching
} else {
v[i] = 0; // Can't be matched, has to be zero
}
}
return v;
}
vector<int> kmp(const string &text, const string &pattern) {
vector<int> jumps = jump(pattern);
vector<int> result;
int text_idx = 0, pat_idx = 0;
while (text.size() - text_idx >= pattern.size() - pat_idx) {
if (text[text_idx] == pattern[pat_idx]) {
pat_idx++; text_idx++;
}
if (pat_idx == pattern.size()) {
result.push_back(text_idx - pat_idx + 1);
pat_idx = jumps[pat_idx - 1];
}
else if (text_idx < text.size() && text[text_idx] != pattern[pat_idx]) {
if (pat_idx) {
pat_idx = jumps[pat_idx - 1];
} else {
text_idx++;
}
}
}
return result;
}To show that it's linear time, if the increment does not happen for some k iterations, we have already matched the length of atleast k of the pattern and then moved to this spot in the pattern and text. Therefore, it can only do at most ~ 2k iterations for a segment of length k. Hence it is linear in time.
This matches a pattern in a text. We can extend this idea to a DFA for multiple words and transitions based on the input character to achieve what's called as the Aho Corasick automaton.
Usually the longest match is picked. This is also true in (f)lex. In case two of the regexes have the same length, the one above gets matched.
All of this was for theoretical discussion based on string matching. Actual matching is done based on regex rules. There is a conversion from the regex to a DFA using a syntax tree.
We first make the syntax tree of the regex followed by a meta symbol
(in this case #).
We also assign a number to each of the leaves.
firstpos is the set of positions where the current node's subtree
would start. For * it will be the same as its child, for
concat it'll be the lesser of the two. For or node it will be the union
of the two. For concat if the first one can null then it's a
union.
lastpos is where they end. Rules are the same. For concat if the second one is nullable then it's a union.
followpos is the number that would immediately follow the token
represented by the current node. If a and b
are concatenated, then firstpos of b is the followpos for
lastpos of a. For star node it's the firstpos of the same
node. For or node you need to go up.
Now the DFA transitions are as follows:
s, mark the state
s.a (the input letter in the regex)
in s, mark a transition from s to the union of
followpos(p) where p are the leaf nodes
corresponding to a. So if the state s has
firstpos 123, with 1 and 3 being a, then transition on
reading a to
where p is 1 and 3 in this case