A suffix automaton is a powerful data structure that allows solving many string-related problems.
For example, you can search for all occurrences of one string in another, or count the number of different substrings of a given string. Both tasks can be solved in linear time with the help of a suffix automaton.
Intuitively a suffix automaton can be understood as a compressed form of all substrings of a given string.
Impressively, the suffix automaton stores all this information in a highly compressed form.
For a string of length
The linearity of the size of the suffix automaton was first discovered in 1983 by Blumer et al., and in 1985 the first linear algorithms for its construction were presented by Crochemore and Blumer.
A suffix automaton for a given string
In other words:
- A suffix automaton is an oriented acyclic graph. The vertices are called states, and the edges are called transitions between states.
- One of the states $t_0$ is the initial state, and it must be the source of the graph (all other states are reachable from $t_0$).
- Each transition is labeled with some character. All transitions originating from a state must have different labels.
- One or multiple states are marked as terminal states. If we start from the initial state $t_0$ and move along transitions to a terminal state, then the labels of the passed transitions must spell one of the suffixes of the string $s$. Each of the suffixes of $s$ must be spellable using a path from $t_0$ to a terminal state.
- The suffix automaton contains the minimum number of vertices among all automata satisfying the conditions described above.
The simplest and most important property of a suffix automaton is, that it contains information about all substrings of the string
In order to simplify the explanations, we will say that the substring corresponds to that path (starting at
One or multiple paths can lead to a state. Thus, we will say that a state corresponds to the set of strings, which correspond to these paths.
Here we will show some examples of suffix automata for several simple strings.
We will denote the initial state with blue and the terminal states with green.
For the string
For the string
For the string
For the string
For the string
For the string
For the string
Before we describe the algorithm to construct a suffix automaton in linear time, we need to introduce several new concepts and simple proofs, which will be very important in understanding the construction.
Consider any non-empty substring
We will call two substrings
It turns out that in a suffix automaton
We will later describe the construction algorithm using this assumption. We will then see that all the required properties of a suffix automaton, except for minimality, are fulfilled. Minimality follows from Nerode's theorem (which will not be proven in this article).
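To make the notion of end positions concrete, here is a small brute-force sketch. It is only an illustration of the definition, not part of the construction algorithm. The helper names `endpos_bruteforce` and `endpos_equivalent` are hypothetical, and end positions are assumed to be 0-based indices of the last character of each occurrence.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical helper: all 0-based end positions of occurrences of t in s.
// Runs in O(|s| * |t|); meant only to illustrate the definition.
std::set<int> endpos_bruteforce(const std::string& s, const std::string& t) {
    std::set<int> result;
    if (t.empty() || t.size() > s.size()) return result;
    for (std::size_t i = 0; i + t.size() <= s.size(); i++) {
        if (s.compare(i, t.size(), t) == 0)
            result.insert(static_cast<int>(i + t.size() - 1));
    }
    return result;
}

// Two substrings are endpos-equivalent when their end position sets coincide.
bool endpos_equivalent(const std::string& s, const std::string& a,
                       const std::string& b) {
    return endpos_bruteforce(s, a) == endpos_bruteforce(s, b);
}
```

For the string `"abcbc"`, the substrings `"bc"` and `"c"` both end at positions 2 and 4, so they would share one state, while `"b"` ends at positions 1 and 3 and belongs to a different class.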
We can make some important observations concerning the values
Lemma 1:
Two non-empty substrings
The proof is obvious.
If
Lemma 2:
Consider two non-empty substrings
Proof:
If the sets
Lemma 3:
Consider an
Proof:
Fix some
According to Lemma 1, two different
Let's denote by
Consider some state
We also know the first few suffixes of a string
In other words, a suffix link
Here we assume that the initial state
Lemma 4:
Suffix links form a tree with the root
Proof:
Consider an arbitrary state
Lemma 5:
If we construct a tree using the sets
Proof:
The fact that we can construct a tree using the sets
Let us now consider an arbitrary state
which together with the previous lemma proves the assertion:
the tree of suffix links is essentially a tree of sets
Here is an example of a tree of suffix links in the suffix automaton built for the string
Before proceeding to the algorithm itself, we recap the accumulated knowledge, and introduce a few auxiliary notations.
- The substrings of the string $s$ can be decomposed into equivalence classes according to their end positions $endpos$.
- The suffix automaton consists of the initial state $t_0$, as well as of one state for each $endpos$-equivalence class.
- For each state $v$ one or multiple substrings match. We denote by $longest(v)$ the longest such string, and by $len(v)$ its length. We denote by $shortest(v)$ the shortest such substring, and its length by $minlen(v)$. Then all the strings corresponding to this state are different suffixes of the string $longest(v)$ and have all possible lengths in the interval $[minlen(v); len(v)]$.
- For each state $v \ne t_0$ a suffix link is defined as a link that leads to the state corresponding to the suffix of the string $longest(v)$ of length $minlen(v) - 1$. The suffix links form a tree with the root in $t_0$, and at the same time this tree forms an inclusion relationship between the sets $endpos$.
- We can express $minlen(v)$ for $v \ne t_0$ using the suffix link $link(v)$ as:

$$minlen(v) = len(link(v)) + 1$$

- If we start from an arbitrary state $v_0$ and follow the suffix links, then sooner or later we will reach the initial state $t_0$. In this case we obtain a sequence of disjoint intervals $[minlen(v_i); len(v_i)]$, which in union forms the continuous interval $[0; len(v_0)]$.
Now we can proceed to the algorithm itself. The algorithm will be online, i.e. we will add the characters of the string one by one, and modify the automaton accordingly in each step.
To achieve linear memory consumption, we will only store the values
Initially the automaton consists of a single state
Now the whole task boils down to implementing the process of adding one character
- Let $last$ be the state corresponding to the entire string before adding the character $c$. (Initially we set $last = 0$, and we will change $last$ in the last step of the algorithm accordingly.)
- Create a new state $cur$, and assign it $len(cur) = len(last) + 1$. The value $link(cur)$ is not known at this time.
- Now we do the following procedure: We start at the state $last$. While there isn't a transition through the letter $c$, we will add a transition to the state $cur$, and follow the suffix link. If at some point there already exists a transition through the letter $c$, then we will stop and denote this state with $p$.
- If we haven't found such a state $p$, then we have reached the fictitious state $-1$, so we can just assign $link(cur) = 0$ and leave.
- Suppose now that we have found a state $p$, from which there exists a transition through the letter $c$. We will denote the state, to which the transition leads, with $q$.
- Now we have two cases. Either $len(p) + 1 = len(q)$, or not.
- If $len(p) + 1 = len(q)$, then we can simply assign $link(cur) = q$ and leave.
- Otherwise it is a bit more complicated. It is necessary to clone the state $q$: we create a new state $clone$, copy all the data from $q$ (suffix link and transitions) except the value $len$. We will assign $len(clone) = len(p) + 1$.

    After cloning we direct the suffix link from $cur$ to $clone$, and also from $q$ to $clone$.

    Finally we need to walk from the state $p$ back using suffix links as long as there is a transition through $c$ to the state $q$, and redirect all those to the state $clone$.
- In any of the three cases, after completing the procedure, we update the value $last$ with the state $cur$.
If we also want to know which states are terminal and which are not, we can find all terminal states after constructing the complete suffix automaton for the entire string
In the next section we will look in detail at each step and show its correctness.
Here we only note that, since we only create one or two new states for each character of
The linearity of the number of transitions, and in general the linearity of the runtime of the algorithm, is less clear. Both will be proven after we have proven the correctness.
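As a sketch of how terminal states could be marked, the code below relies on the standard observation that the states corresponding to suffixes of the whole string are exactly those on the suffix-link path starting at $last$. The names `Automaton`, `build_with_terminals` and `is_suffix` are hypothetical, the builder is a vector-based restatement of the steps above, and only non-empty suffixes are marked (the initial state is left unmarked by assumption).

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

struct State {
    int len = 0, link = -1;
    bool terminal = false;
    std::map<char, int> next;
};

struct Automaton {
    std::vector<State> st;
    int last = 0;
};

// Builds the automaton of s, then marks terminal states by walking
// suffix links from `last` until the initial state is reached.
Automaton build_with_terminals(const std::string& s) {
    Automaton a;
    a.st.emplace_back();  // state 0 is the initial state t_0
    std::vector<State>& st = a.st;
    for (char c : s) {
        int cur = static_cast<int>(st.size());
        st.emplace_back();
        st[cur].len = st[a.last].len + 1;
        int p = a.last;
        while (p != -1 && !st[p].next.count(c)) {
            st[p].next[c] = cur;
            p = st[p].link;
        }
        if (p == -1) {
            st[cur].link = 0;
        } else {
            int q = st[p].next[c];
            if (st[p].len + 1 == st[q].len) {
                st[cur].link = q;
            } else {
                int clone = static_cast<int>(st.size());
                State copy = st[q];  // copies transitions and suffix link
                copy.len = st[p].len + 1;
                st.push_back(copy);
                while (p != -1 && st[p].next[c] == q) {
                    st[p].next[c] = clone;
                    p = st[p].link;
                }
                st[q].link = st[cur].link = clone;
            }
        }
        a.last = cur;
    }
    for (int p = a.last; p > 0; p = st[p].link) st[p].terminal = true;
    return a;
}

// True if the non-empty string t is a suffix of the string the automaton
// was built for.
bool is_suffix(const Automaton& a, const std::string& t) {
    int v = 0;
    for (char c : t) {
        auto it = a.st[v].next.find(c);
        if (it == a.st[v].next.end()) return false;
        v = it->second;
    }
    return a.st[v].terminal;
}
```

For `"abcbc"` this marks the states reached by `"abcbc"`, `"bcbc"`, `"cbc"`, `"bc"` and `"c"`, while `"ab"` or `"bcb"` end in unmarked states.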
- We will call a transition $(p, q)$ continuous if $len(p) + 1 = len(q)$. Otherwise, i.e. when $len(p) + 1 < len(q)$, the transition will be called non-continuous.

    As we can see from the description of the algorithm, continuous and non-continuous transitions will lead to different cases of the algorithm. Continuous transitions are fixed, and will never change again. In contrast, non-continuous transitions may change when new letters are added to the string (the end of the transition edge may change).
- To avoid ambiguity we will denote the string, for which the suffix automaton was built before adding the current character $c$, with $s$.
- The algorithm begins with creating a new state $cur$, which will correspond to the entire string $s + c$. It is clear why we have to create a new state. Together with the new character a new equivalence class is created.
- After creating a new state we traverse by suffix links starting from the state corresponding to the entire string $s$. For each state we try to add a transition with the character $c$ to the new state $cur$. Thus we append to each suffix of $s$ the character $c$. However we can only add these new transitions if they don't conflict with an already existing one. Therefore as soon as we find an already existing transition with $c$ we have to stop.
- In the simplest case we reached the fictitious state $-1$. This means we added the transition with $c$ to all suffixes of $s$. This also means that the character $c$ hasn't been part of the string $s$ before. Therefore the suffix link of $cur$ has to lead to the state $0$.
- In the second case we came across an existing transition $(p, q)$. This means that we tried to add a string $x + c$ (where $x$ is a suffix of $s$) to the machine that already exists in the machine (the string $x + c$ already appears as a substring of $s$). Since we assume that the automaton for the string $s$ is built correctly, we should not add a new transition here.

    However there is a difficulty. To which state should the suffix link from the state $cur$ lead? We have to make a suffix link to a state in which the longest string is exactly $x + c$, i.e. the $len$ of this state should be $len(p) + 1$. However it is possible that such a state doesn't yet exist, i.e. $len(q) > len(p) + 1$. In this case we have to create such a state by splitting the state $q$.
- If the transition $(p, q)$ turns out to be continuous, then $len(q) = len(p) + 1$. In this case everything is simple. We direct the suffix link from $cur$ to the state $q$.
- Otherwise the transition is non-continuous, i.e. $len(q) > len(p) + 1$. This means that the state $q$ corresponds not only to the suffix of $s + c$ with length $len(p) + 1$, but also to longer substrings of $s$. We can do nothing other than splitting the state $q$ into two sub-states, so that the first one has length $len(p) + 1$.

    How can we split a state? We clone the state $q$, which gives us the state $clone$, and we set $len(clone) = len(p) + 1$. We copy all the transitions from $q$ to $clone$, because we don't want to change the paths that traverse through $q$. Also we set the suffix link from $clone$ to the target of the suffix link of $q$, and set the suffix link of $q$ to $clone$.

    And after splitting the state, we set the suffix link from $cur$ to $clone$.

    In the last step we change some of the transitions to $q$: we redirect them to $clone$. Which transitions do we have to change? It is enough to redirect only the transitions corresponding to all the suffixes of the string $w + c$ (where $w$ is the longest string of $p$), i.e. we need to continue to move along the suffix links, starting from the vertex $p$ until we reach the fictitious state $-1$ or a transition that leads to a different state than $q$.
First we immediately make the assumption that the size of the alphabet is constant.
If this is not the case, then it will not be possible to talk about the linear time complexity.
The list of transitions from one vertex will be stored in a balanced tree, which allows you to quickly perform key search operations and adding keys.
Therefore if we denote with
So we will consider the size of the alphabet to be constant, i.e. each operation of searching for a transition on a character, adding a transition, searching for the next transition - all these operations can be done in
If we consider all parts of the algorithm, then it contains three places in the algorithm in which the linear complexity is not obvious:
- The first place is the traversal through the suffix links from the state
$last$ , adding transitions with the character$c$ . - The second place is the copying of transitions when the state
$q$ is cloned into a new state$clone$ . - Third place is changing the transition leading to
$q$ , redirecting them to$clone$ .
We use the fact that the size of the suffix automaton (both in the number of states and in the number of transitions) is linear. (The proof of the linearity of the number of states is the algorithm itself, and the proof of the linearity of the number of transitions is given below, after the implementation of the algorithm.)
Thus the total complexity of the first and second places is obvious, after all each operation adds only one amortized new transition to the automaton.
It remains to estimate the total complexity of the third place, in which we redirect transitions, that pointed originally to
Thus, each iteration of this loop leads to the fact that the position of the string
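First we describe a data structure that will store all information about a specific state (its length, its suffix link, and its list of transitions).

```cpp
#include <map>
using namespace std;

struct state {
    int len, link;        // length of the longest string of the state, suffix link
    map<char, int> next;  // transitions, keyed by character
};
```

The suffix automaton itself will be stored in an array of these structures.

```cpp
const int MAXLEN = 100000;
state st[MAXLEN * 2];
int sz, last;  // number of states in use, state of the entire current string
```

We give a function that initializes a suffix automaton (creating a suffix automaton with a single state).

```cpp
void sa_init() {
    // Clear transitions left over from a previous build, so that the
    // function can safely be called more than once.
    for (int i = 0; i < sz; i++)
        st[i].next.clear();
    st[0].len = 0;
    st[0].link = -1;
    sz = 1;
    last = 0;
}
```

And finally we give the implementation of the main function, which adds the next character to the end of the current string, rebuilding the machine accordingly.

```cpp
void sa_extend(char c) {
    int cur = sz++;
    st[cur].len = st[last].len + 1;
    int p = last;
    // Add transitions through c until a state that already has one is found.
    while (p != -1 && !st[p].next.count(c)) {
        st[p].next[c] = cur;
        p = st[p].link;
    }
    if (p == -1) {
        st[cur].link = 0;
    } else {
        int q = st[p].next[c];
        if (st[p].len + 1 == st[q].len) {
            // Continuous transition: q can serve as the suffix link target.
            st[cur].link = q;
        } else {
            // Non-continuous transition: split q by cloning it.
            int clone = sz++;
            st[clone].len = st[p].len + 1;
            st[clone].next = st[q].next;
            st[clone].link = st[q].link;
            while (p != -1 && st[p].next[c] == q) {
                st[p].next[c] = clone;
                p = st[p].link;
            }
            st[q].link = st[cur].link = clone;
        }
    }
    last = cur;
}
```

As mentioned above, if you sacrifice memory (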
The number of states in a suffix automaton of the string
The proof is the construction algorithm itself, since initially the automaton consists of one state, and in the first and second iteration only a single state will be created, and in the remaining
However we can also show this estimation without knowing the algorithm.
Let us recall that the number of states is equal to the number of different sets
This bound of the number of states can actually be achieved for each
In each iteration, starting at the third one, the algorithm will split a state, resulting in exactly
The number of transitions in a suffix automaton of a string
Let us prove this:
Let us first estimate the number of continuous transitions.
Consider a spanning tree of the longest paths in the automaton starting in the state
Now let us estimate the number of non-continuous transitions.
Let the current non-continuous transition be
Combining these two estimates gives us the bound
This bound can also be achieved with the string:
Here we look at some tasks that can be solved using the suffix automaton.
For simplicity we assume that the alphabet size
Given a text
We build a suffix automaton of the text
It is clear that this will take
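A minimal sketch of this check follows, assuming a vector-based variant of the builder from the implementation section. The names `build_automaton` and `occurs` are hypothetical. The query simply walks from the initial state along the characters of the pattern and fails as soon as a transition is missing.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

struct State {
    int len = 0, link = -1;
    std::map<char, int> next;
};

// Vector-based restatement of the online construction.
std::vector<State> build_automaton(const std::string& s) {
    std::vector<State> st(1);  // state 0 is the initial state t_0
    int last = 0;
    for (char c : s) {
        int cur = static_cast<int>(st.size());
        st.emplace_back();
        st[cur].len = st[last].len + 1;
        int p = last;
        while (p != -1 && !st[p].next.count(c)) {
            st[p].next[c] = cur;
            p = st[p].link;
        }
        if (p == -1) {
            st[cur].link = 0;
        } else {
            int q = st[p].next[c];
            if (st[p].len + 1 == st[q].len) {
                st[cur].link = q;
            } else {
                int clone = static_cast<int>(st.size());
                State copy = st[q];
                copy.len = st[p].len + 1;
                st.push_back(copy);
                while (p != -1 && st[p].next[c] == q) {
                    st[p].next[c] = clone;
                    p = st[p].link;
                }
                st[q].link = st[cur].link = clone;
            }
        }
        last = cur;
    }
    return st;
}

// True if pattern occurs in the text the automaton was built for.
bool occurs(const std::vector<State>& st, const std::string& pattern) {
    int v = 0;
    for (char c : pattern) {
        auto it = st[v].next.find(c);
        if (it == st[v].next.end()) return false;  // no such path
        v = it->second;
    }
    return true;
}
```

The automaton is built once per text, and each query then costs one map lookup per pattern character.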
Given a string
Let us build a suffix automaton for the string
Each substring of
Given that the suffix automaton is a directed acyclic graph, the number of different ways can be computed using dynamic programming.
Namely, let
I.e.
The number of different substrings is the value
Total time complexity:
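The path-counting idea can be sketched as follows. The names `build_automaton` and `count_distinct_substrings` are hypothetical, and the recurrence used is an assumption spelled out in the comments: the number of paths starting at a state is one (the empty path) plus the path counts of all its successors. States are processed in order of decreasing $len$, which is a valid order because every transition leads to a state with strictly larger $len$.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <numeric>
#include <string>
#include <vector>

struct State {
    int len = 0, link = -1;
    std::map<char, int> next;
};

std::vector<State> build_automaton(const std::string& s) {
    std::vector<State> st(1);
    int last = 0;
    for (char c : s) {
        int cur = static_cast<int>(st.size());
        st.emplace_back();
        st[cur].len = st[last].len + 1;
        int p = last;
        while (p != -1 && !st[p].next.count(c)) {
            st[p].next[c] = cur;
            p = st[p].link;
        }
        if (p == -1) {
            st[cur].link = 0;
        } else {
            int q = st[p].next[c];
            if (st[p].len + 1 == st[q].len) {
                st[cur].link = q;
            } else {
                int clone = static_cast<int>(st.size());
                State copy = st[q];
                copy.len = st[p].len + 1;
                st.push_back(copy);
                while (p != -1 && st[p].next[c] == q) {
                    st[p].next[c] = clone;
                    p = st[p].link;
                }
                st[q].link = st[cur].link = clone;
            }
        }
        last = cur;
    }
    return st;
}

// d[v] = number of paths starting at v, including the empty path.
// The number of distinct non-empty substrings is d[t_0] - 1.
long long count_distinct_substrings(const std::string& s) {
    std::vector<State> st = build_automaton(s);
    std::vector<int> order(st.size());
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return st[a].len > st[b].len; });
    std::vector<long long> d(st.size(), 0);
    for (int v : order) {
        d[v] = 1;
        for (const auto& edge : st[v].next) d[v] += d[edge.second];
    }
    return d[0] - 1;
}
```

For `"abcbc"` the twelve distinct substrings are `a`, `ab`, `abc`, `abcb`, `abcbc`, `b`, `bc`, `bcb`, `bcbc`, `c`, `cb`, `cbc`.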
Alternatively, we can take advantage of the fact that each state
This is demonstrated succinctly below:
```cpp
long long get_diff_strings() {
    long long tot = 0;
    for (int i = 1; i < sz; i++) {
        tot += st[i].len - st[st[i].link].len;
    }
    return tot;
}
```

While this is also
Given a string
The solution is similar to the previous one, only now it is necessary to consider two quantities for the dynamic programming part:
the number of different substrings
We already described how to compute
We take the answer of each adjacent vertex
Again this task can be computed in
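A sketch of the two-quantity dynamic programming is given below, under the assumption that the recurrences are $d[v] = 1 + \sum d[w]$ for the number of paths and $ans[v] = \sum (d[w] + ans[w])$ for their total length, where $w$ ranges over the targets of the transitions of $v$: every path through $w$ becomes one character longer. The function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <numeric>
#include <string>
#include <vector>

struct State {
    int len = 0, link = -1;
    std::map<char, int> next;
};

std::vector<State> build_automaton(const std::string& s) {
    std::vector<State> st(1);
    int last = 0;
    for (char c : s) {
        int cur = static_cast<int>(st.size());
        st.emplace_back();
        st[cur].len = st[last].len + 1;
        int p = last;
        while (p != -1 && !st[p].next.count(c)) {
            st[p].next[c] = cur;
            p = st[p].link;
        }
        if (p == -1) {
            st[cur].link = 0;
        } else {
            int q = st[p].next[c];
            if (st[p].len + 1 == st[q].len) {
                st[cur].link = q;
            } else {
                int clone = static_cast<int>(st.size());
                State copy = st[q];
                copy.len = st[p].len + 1;
                st.push_back(copy);
                while (p != -1 && st[p].next[c] == q) {
                    st[p].next[c] = clone;
                    p = st[p].link;
                }
                st[q].link = st[cur].link = clone;
            }
        }
        last = cur;
    }
    return st;
}

// Total length of all distinct substrings of s, via DP over the automaton.
long long total_length_distinct_substrings(const std::string& s) {
    std::vector<State> st = build_automaton(s);
    std::vector<int> order(st.size());
    std::iota(order.begin(), order.end(), 0);
    // Decreasing len: successors are always processed before predecessors.
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return st[a].len > st[b].len; });
    std::vector<long long> d(st.size(), 0), ans(st.size(), 0);
    for (int v : order) {
        d[v] = 1;  // the empty path
        for (const auto& edge : st[v].next) {
            int w = edge.second;
            d[v] += d[w];
            ans[v] += d[w] + ans[w];  // each path via w grows by one character
        }
    }
    return ans[0];
}
```

For `"abcbc"` the distinct substrings have lengths 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4 and 5, which sum to 31.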
Alternatively, we can, again, take advantage of the fact that each state
```cpp
long long get_tot_len_diff_substings() {
    long long tot = 0;
    for (int i = 1; i < sz; i++) {
        long long shortest = st[st[i].link].len + 1;
        long long longest = st[i].len;
        long long num_strings = longest - shortest + 1;
        long long cur = num_strings * (longest + shortest) / 2;
        tot += cur;
    }
    return tot;
}
```

This approach runs in
Given a string
The solution to this problem is based on the idea of the previous two problems.
The lexicographically
This takes
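One possible way to realize this search is sketched below. The function names are hypothetical, $k$ is assumed to be 1-based, and an empty string is returned when $k$ is out of range. The sketch first counts the paths starting at each state and then descends from the initial state, trying transitions in increasing character order (which is the iteration order of `std::map`) and skipping whole subtrees of paths that rank below $k$.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <numeric>
#include <string>
#include <vector>

struct State {
    int len = 0, link = -1;
    std::map<char, int> next;
};

std::vector<State> build_automaton(const std::string& s) {
    std::vector<State> st(1);
    int last = 0;
    for (char c : s) {
        int cur = static_cast<int>(st.size());
        st.emplace_back();
        st[cur].len = st[last].len + 1;
        int p = last;
        while (p != -1 && !st[p].next.count(c)) {
            st[p].next[c] = cur;
            p = st[p].link;
        }
        if (p == -1) {
            st[cur].link = 0;
        } else {
            int q = st[p].next[c];
            if (st[p].len + 1 == st[q].len) {
                st[cur].link = q;
            } else {
                int clone = static_cast<int>(st.size());
                State copy = st[q];
                copy.len = st[p].len + 1;
                st.push_back(copy);
                while (p != -1 && st[p].next[c] == q) {
                    st[p].next[c] = clone;
                    p = st[p].link;
                }
                st[q].link = st[cur].link = clone;
            }
        }
        last = cur;
    }
    return st;
}

// k-th (1-based) distinct substring of s in lexicographic order,
// or "" if s has fewer than k distinct substrings.
std::string kth_substring(const std::string& s, long long k) {
    std::vector<State> st = build_automaton(s);
    std::vector<int> order(st.size());
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return st[a].len > st[b].len; });
    std::vector<long long> d(st.size(), 0);  // paths from v, incl. empty one
    for (int v : order) {
        d[v] = 1;
        for (const auto& edge : st[v].next) d[v] += d[edge.second];
    }
    if (k < 1 || k > d[0] - 1) return "";
    std::string result;
    int v = 0;
    while (k > 0) {
        for (const auto& edge : st[v].next) {
            int w = edge.second;
            if (k <= d[w]) {  // the answer starts with this character
                result += edge.first;
                v = w;
                k--;          // the string read so far is itself counted
                break;
            }
            k -= d[w];        // skip all substrings continuing through w
        }
    }
    return result;
}
```

For `"abcbc"` the sorted list of distinct substrings is `a`, `ab`, `abc`, `abcb`, `abcbc`, `b`, `bc`, `bcb`, `bcbc`, `c`, `cb`, `cbc`, so `kth_substring("abcbc", 6)` yields `"b"`.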
Given a string
We construct a suffix automaton for the string
Consequently the problem is reduced to finding the lexicographically smallest path of length
Total time complexity is
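A sketch of this greedy walk, with the hypothetical names `build_automaton` and `smallest_cyclic_shift`: the automaton is built for the doubled string, and from the initial state the smallest available transition is taken once per character of the original string. Since every substring of the doubled string that is shorter than the original string has an occurrence that can still be extended, the walk never gets stuck.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

struct State {
    int len = 0, link = -1;
    std::map<char, int> next;
};

std::vector<State> build_automaton(const std::string& s) {
    std::vector<State> st(1);
    int last = 0;
    for (char c : s) {
        int cur = static_cast<int>(st.size());
        st.emplace_back();
        st[cur].len = st[last].len + 1;
        int p = last;
        while (p != -1 && !st[p].next.count(c)) {
            st[p].next[c] = cur;
            p = st[p].link;
        }
        if (p == -1) {
            st[cur].link = 0;
        } else {
            int q = st[p].next[c];
            if (st[p].len + 1 == st[q].len) {
                st[cur].link = q;
            } else {
                int clone = static_cast<int>(st.size());
                State copy = st[q];
                copy.len = st[p].len + 1;
                st.push_back(copy);
                while (p != -1 && st[p].next[c] == q) {
                    st[p].next[c] = clone;
                    p = st[p].link;
                }
                st[q].link = st[cur].link = clone;
            }
        }
        last = cur;
    }
    return st;
}

// Lexicographically smallest rotation of s.
std::string smallest_cyclic_shift(const std::string& s) {
    std::vector<State> st = build_automaton(s + s);
    std::string result;
    int v = 0;
    for (std::size_t i = 0; i < s.size(); i++) {
        // std::map iterates keys in increasing order, so begin() is the
        // transition with the smallest character.
        auto it = st[v].next.begin();
        result += it->first;
        v = it->second;
    }
    return result;
}
```

For example, the rotations of `"baab"` are `baab`, `aabb`, `abba` and `bbaa`, and the walk spells out `aabb`.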
For a given text
We construct the suffix automaton for the text
Next we do the following preprocessing:
for each state
However we cannot construct the sets
To compute them we proceed as follows.
For each state, if it was not created by cloning (and if it is not the initial state
This gives the correct value for each state.
Why is this correct?
The total number of states obtained not via cloning is exactly
Then we apply the following operation for each
Why don't we overcount in this procedure (i.e. count some positions twice)? Because we add the positions of a state to only one other state, so it cannot happen that one state directs its positions to another state twice in two different ways.
Thus we can compute the quantities
After that answering a query by just looking up the value
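The preprocessing can be sketched like this. The field `cnt` and the function names are hypothetical. Each state created as $cur$ starts with a count of one, clones and the initial state start with zero, and the counts are then pushed along the suffix links in order of decreasing $len$.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <numeric>
#include <string>
#include <vector>

struct State {
    int len = 0, link = -1;
    long long cnt = 0;  // number of occurrences of the strings of this state
    std::map<char, int> next;
};

std::vector<State> build_with_counts(const std::string& s) {
    std::vector<State> st(1);
    int last = 0;
    for (char c : s) {
        int cur = static_cast<int>(st.size());
        st.emplace_back();
        st[cur].len = st[last].len + 1;
        st[cur].cnt = 1;  // not a clone: contributes one end position
        int p = last;
        while (p != -1 && !st[p].next.count(c)) {
            st[p].next[c] = cur;
            p = st[p].link;
        }
        if (p == -1) {
            st[cur].link = 0;
        } else {
            int q = st[p].next[c];
            if (st[p].len + 1 == st[q].len) {
                st[cur].link = q;
            } else {
                int clone = static_cast<int>(st.size());
                State copy = st[q];
                copy.len = st[p].len + 1;
                copy.cnt = 0;  // clones start with no positions of their own
                st.push_back(copy);
                while (p != -1 && st[p].next[c] == q) {
                    st[p].next[c] = clone;
                    p = st[p].link;
                }
                st[q].link = st[cur].link = clone;
            }
        }
        last = cur;
    }
    // Propagate counts along suffix links, longest states first.
    std::vector<int> order(st.size());
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return st[a].len > st[b].len; });
    for (int v : order)
        if (st[v].link != -1) st[st[v].link].cnt += st[v].cnt;
    return st;
}

// Number of occurrences of the non-empty pattern in the text.
long long count_occurrences(const std::vector<State>& st,
                            const std::string& pattern) {
    int v = 0;
    for (char c : pattern) {
        auto it = st[v].next.find(c);
        if (it == st[v].next.end()) return 0;
        v = it->second;
    }
    return st[v].cnt;
}
```

The sort could be replaced by a counting sort on $len$ to keep the preprocessing linear; `std::sort` is used here only for brevity.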
Given a text
We again construct a suffix automaton.
Additionally we precompute the position
To maintain these positions we extend the function `sa_extend()`.
When we create a new state
And when we clone a vertex
(since the only other option for a value would be
Thus the answer for a query is simply
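A sketch of this extension, using the field name `first_pos` from the implementation sketches later in this article and the hypothetical function names `build_with_first_pos` and `first_occurrence`. A newly created state gets the end position $len(cur) - 1$, a clone inherits the value of the state it was copied from, and a query returns the 0-based start of the first occurrence, or $-1$ if there is none.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

struct State {
    int len = 0, link = -1;
    int first_pos = -1;  // end position of the first occurrence
    std::map<char, int> next;
};

std::vector<State> build_with_first_pos(const std::string& s) {
    std::vector<State> st(1);
    int last = 0;
    for (char c : s) {
        int cur = static_cast<int>(st.size());
        st.emplace_back();
        st[cur].len = st[last].len + 1;
        st[cur].first_pos = st[cur].len - 1;
        int p = last;
        while (p != -1 && !st[p].next.count(c)) {
            st[p].next[c] = cur;
            p = st[p].link;
        }
        if (p == -1) {
            st[cur].link = 0;
        } else {
            int q = st[p].next[c];
            if (st[p].len + 1 == st[q].len) {
                st[cur].link = q;
            } else {
                int clone = static_cast<int>(st.size());
                State copy = st[q];  // first_pos of q is copied as well
                copy.len = st[p].len + 1;
                st.push_back(copy);
                while (p != -1 && st[p].next[c] == q) {
                    st[p].next[c] = clone;
                    p = st[p].link;
                }
                st[q].link = st[cur].link = clone;
            }
        }
        last = cur;
    }
    return st;
}

// 0-based start index of the first occurrence of the non-empty pattern,
// or -1 if the pattern does not occur.
int first_occurrence(const std::vector<State>& st, const std::string& pattern) {
    int v = 0;
    for (char c : pattern) {
        auto it = st[v].next.find(c);
        if (it == st[v].next.end()) return -1;
        v = it->second;
    }
    return st[v].first_pos - static_cast<int>(pattern.size()) + 1;
}
```

For the text `"abcbc"`, the pattern `"bc"` ends first at position 2, so its first occurrence starts at index 1.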
This time we have to display all positions of the occurrences in the string
Again we construct a suffix automaton for the text
Clearly
Therefore to solve the problem we need to save for each state a list of suffix references leading to it.
The answer to the query will then contain all
Overall, this requires
First, we walk down the automaton for each character in the pattern to find our starting node requiring
We only must take into account that two different states can have the same
Moreover, we can also get rid of the duplicate positions, if we don't output the positions from the cloned states.
In fact a state, that a cloned state can reach, is also reachable from the original state.
Thus if we remember the flag `is_clone` for each state, we can simply ignore the cloned states and only output
Here are some implementation sketches:
```cpp
struct state {
    ...
    bool is_clone;
    int first_pos;
    vector<int> inv_link;
};
```

```cpp
// after constructing the automaton
for (int v = 1; v < sz; v++) {
    st[st[v].link].inv_link.push_back(v);
}
```

```cpp
// output all positions of occurrences
void output_all_occurrences(int v, int P_length) {
    if (!st[v].is_clone)
        cout << st[v].first_pos - P_length + 1 << endl;
    for (int u : st[v].inv_link)
        output_all_occurrences(u, P_length);
}
```

Given a string
We will apply dynamic programming on the suffix automaton built for the string
Let
The answer to the problem will be
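One way to write this dynamic programming is sketched below. The recurrence is stated here as an assumption: if some character of the alphabet has no transition from $v$, one extra character suffices, otherwise the best successor is extended by one. The function names and the explicit `alphabet` parameter are hypothetical. The sketch assumes a non-empty alphabet that contains every character of the string, and it returns only the length of the answer.

```cpp
#include <algorithm>
#include <cassert>
#include <climits>
#include <map>
#include <numeric>
#include <string>
#include <vector>

struct State {
    int len = 0, link = -1;
    std::map<char, int> next;
};

std::vector<State> build_automaton(const std::string& s) {
    std::vector<State> st(1);
    int last = 0;
    for (char c : s) {
        int cur = static_cast<int>(st.size());
        st.emplace_back();
        st[cur].len = st[last].len + 1;
        int p = last;
        while (p != -1 && !st[p].next.count(c)) {
            st[p].next[c] = cur;
            p = st[p].link;
        }
        if (p == -1) {
            st[cur].link = 0;
        } else {
            int q = st[p].next[c];
            if (st[p].len + 1 == st[q].len) {
                st[cur].link = q;
            } else {
                int clone = static_cast<int>(st.size());
                State copy = st[q];
                copy.len = st[p].len + 1;
                st.push_back(copy);
                while (p != -1 && st[p].next[c] == q) {
                    st[p].next[c] = clone;
                    p = st[p].link;
                }
                st[q].link = st[cur].link = clone;
            }
        }
        last = cur;
    }
    return st;
}

// Length of the shortest string over `alphabet` that is not a substring of s.
int shortest_absent_length(const std::string& s, const std::string& alphabet) {
    std::vector<State> st = build_automaton(s);
    std::vector<int> order(st.size());
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return st[a].len > st[b].len; });
    std::vector<int> d(st.size(), 0);
    for (int v : order) {
        int best = INT_MAX;
        bool missing = false;
        for (char c : alphabet) {
            auto it = st[v].next.find(c);
            if (it == st[v].next.end()) { missing = true; break; }
            best = std::min(best, d[it->second]);
        }
        // A missing transition means one more character leaves the automaton.
        d[v] = missing ? 1 : 1 + best;
    }
    return d[0];
}
```

For `"aabba"` over the alphabet `"ab"`, all strings of length one and two occur, but for instance `"aaa"` does not, so the result is 3.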
Given two strings
We construct a suffix automaton for the string
We will now take the string
For this we will use two variables, the current state
Initially
Now let us describe how we can add a character
- If there is a transition from $v$ with the character $T[i]$, then we simply follow the transition and increase $l$ by one.
- If there is no such transition, we have to shorten the current matching part, which means that we need to follow the suffix link: $v = link(v)$. At the same time, the current length has to be shortened. Obviously we need to assign $l = len(v)$, since after passing through the suffix link we end up in a state whose corresponding longest string is a substring.
- If there is still no transition using the required character, we repeat and again go through the suffix link and decrease $l$, until we find a transition or we reach the fictional state $-1$ (which means that the symbol $T[i]$ doesn't appear at all in $S$, so we assign $v = l = 0$).
The answer to the task will be the maximum of all the values
The complexity of this part is
Implementation:
```cpp
string lcs (string S, string T) {
    sa_init();
    for (int i = 0; i < S.size(); i++)
        sa_extend(S[i]);

    int v = 0, l = 0, best = 0, bestpos = 0;
    for (int i = 0; i < T.size(); i++) {
        while (v && !st[v].next.count(T[i])) {
            v = st[v].link;
            l = st[v].len;
        }
        if (st[v].next.count(T[i])) {
            v = st[v].next[T[i]];
            l++;
        }
        if (l > best) {
            best = l;
            bestpos = i;
        }
    }
    return T.substr(bestpos - best + 1, best);
}
```

There are
We join all strings into one large string
Then we construct the suffix automaton for the string
Now we need to find a string in the machine, which is contained in all the strings
Thus we need to calculate the attainability, which tells us for each state of the machine and each symbol







