lexing fundamentals
Fundamentals and basics of lexing in compilers, with various string algorithms.
lexing fundamentals
Lexing
It returns two things, an identifier for the token it just read and possibly some sort of value for it.
We use flex for this. It takes multiple regex rules and for each regex it returns a token and its value.
For each rule, it matches tokens based on the Aho-Corasick algorithm. This is basically a preprocessing table for multiple words, similar to KMP algorithm.
vector<int> jump(const string &s) {
int len = 0;
vector<int> v(s.size());
v[0] = 0;
for (int i = 1; i < s.size(); i++) {
if (s[len] == s[i]) {
len++;
v[i] = len;
}
else if (len != 0) {
len = v[len - 1]; // check the previous smaller prefixes of the length
// matched
i--; // we don't want to incrememt without it matching
} else {
v[i] = 0; // Can't be matched, has to be zero
}
}
return v;
}
vector<int> kmp(const string &text, const string &pattern) {
vector<int> jumps = jump(pattern);
vector<int> result;
int text_idx = 0, pat_idx = 0;
while (text.size() - text_idx >= pattern.size() - pat_idx) {
if (text[text_idx] == pattern[pat_idx]) {
pat_idx++; text_idx++;
}
if (pat_idx == pattern.size()) {
result.push_back(text_idx - pat_idx + 1);
pat_idx = jumps[pat_idx - 1];
}
else if (text_idx < text.size() && text[text_idx] != pattern[pat_idx]) {
if (pat_idx) {
pat_idx = jumps[pat_idx - 1];
} else {
text_idx++;
}
}
}
return result;
}To show that it's linear time, if the increment does not happen for some k iterations, we have already matched the length of atleast k of the pattern and then moved to this spot in the pattern and text. Therefore, it can only do at most ~ 2k iterations for a segment of length k. Hence it is linear in time.
This matches a pattern in a text. We can extend this idea to a DFA for multiple words and transitions based on the input character to achieve what's called as the Aho Corasick automaton.
Usually the longest match is picked. This is also true in (f)lex. In case two of the regexes have the same length, the one above gets matched.
All of this was for theoretical discussion based on string matching. Actual matching is done based on regex rules. There is a conversion from the regex to a DFA using a syntax tree.
We first make the syntax tree of the regex followed by a meta
symbol (in this case #).
We also assign a number to each of the leaves.
firstpos is the set of positions where the current node's subtree would start. For
*it will be the same as its child, for concat it'll be the lesser of the two. For or node it will be the union of the two. For concat if the first one can null then it's a union.lastpos is where they end. Rules are the same. For concat if the second one is nullable then it's a union.
followpos is the number that would immediately follow the token represented by the current node. If
aandbare concatenated, then firstpos ofbis the followpos for lastpos ofa. For star node it's the firstpos of the same node. For or node you need to go up.
Now the DFA transitions are as follows:
- Take the root and mark its firstpos as the unmarked state
- While there is an unmarked state
s, mark the states. - For each leaf symbol
a(the input letter in the regex) ins, mark a transition fromsto the union of followpos(p) wherepare the leaf nodes corresponding toa. So if the stateshas firstpos 123, with 1 and 3 beinga, then transition on readingato where p is 1 and 3 in this case