Given a string $s$ of length $n$.
A repetition is two occurrences of a string in a row.
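In other words a repetition can be described by a pair of indices $i < j$ such that the substring $s[i \dots j]$ consists of two identical strings written one after the other.
The challenge is to find all repetitions in a given string $s$.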
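
To make the definition concrete, here is a minimal brute-force sketch. It is not part of the Main-Lorentz algorithm; the names `is_repetition` and `all_repetitions_naive` are hypothetical helpers introduced only for illustration, and indices are assumed to be 0-based and inclusive.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: true if s[i..j] (inclusive) is a repetition,
// i.e. it has even length and its two halves are equal.
bool is_repetition(const std::string& s, int i, int j) {
    const int len = j - i + 1;
    if (len <= 0 || len % 2 != 0)
        return false;
    const int half = len / 2;
    return s.compare(i, half, s, i + half, half) == 0;
}

// Hypothetical helper: enumerates all repetitions by trying every
// even-length substring. Cubic time in the worst case, so it is only
// useful as a reference on small inputs.
std::vector<std::pair<int, int>> all_repetitions_naive(const std::string& s) {
    const int n = static_cast<int>(s.size());
    std::vector<std::pair<int, int>> res;
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j += 2)
            if (is_repetition(s, i, j))
                res.emplace_back(i, j);
    return res;
}
```

Such a checker can serve as a reference when testing a faster implementation on short random strings.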
The algorithm described here was published in 1982 by Main and Lorentz.
Consider the repetitions in the following example string:
The string contains the following three repetitions:
- $s[2 \dots 5] = abab$
- $s[3 \dots 6] = baba$
- $s[7 \dots 8] = ee$
Another example:
Here there are only two repetitions:
- $s[0 \dots 5] = abaaba$
- $s[2 \dots 3] = aa$
In general there can be up to $O(n^2)$ repetitions in a string of length $n$.
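
As a worked instance of a quadratic number of repetitions, consider the string $a^n$ made of $n$ copies of a single letter (an illustrative choice). Every substring of even length is a repetition, and there are $n - 2l + 1$ substrings of length $2l$, so the total count is:

$$\sum_{l=1}^{\lfloor n/2 \rfloor} (n - 2l + 1) = \left\lfloor \frac{n}{2} \right\rfloor \left\lceil \frac{n}{2} \right\rceil = \Theta(n^2)$$

For $n = 5$ this gives $4 + 2 = 6$ repetitions.
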
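On the other hand this fact does not prevent computing the number of repetitions in $O(n \log n)$ time, because the algorithm can report the repetitions in compressed form, in groups of several at once.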
There is even a concept that describes groups of periodic substrings with tuples of size four. It has been proven that the number of such groups is at most linear with respect to the string length.
Also, here are some more interesting results related to the number of repetitions:
- The number of primitive repetitions (those whose halves are not repetitions) is at most $O(n \log n)$.
- If we encode repetitions with tuples of numbers (called Crochemore triples) $(i,~ p,~ r)$ (where $i$ is the position of the beginning, $p$ the length of the repeating substring, and $r$ the number of repetitions), then all repetitions can be described with $O(n \log n)$ such triples.
- Fibonacci strings, defined as
  $$\begin{align} t_0 &= a, \\ t_1 &= b, \\ t_i &= t_{i-1} + t_{i-2}, \end{align}$$
  are "strongly" periodic. The number of repetitions in the Fibonacci string $f_i$, even when compressed with Crochemore triples, is $O(f_n \log f_n)$. The number of primitive repetitions is also $O(f_n \log f_n)$.
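
The recurrence for Fibonacci strings can be sketched directly in code, which is handy for generating repetition-heavy test inputs. The function name `fibonacci_strings` is a hypothetical helper, and `+` is assumed to mean string concatenation as in the definition above.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: returns t_0, ..., t_k with
// t_0 = "a", t_1 = "b", t_i = t_{i-1} + t_{i-2} (concatenation).
std::vector<std::string> fibonacci_strings(int k) {
    std::vector<std::string> t;
    for (int i = 0; i <= k; i++) {
        if (i == 0)
            t.push_back("a");
        else if (i == 1)
            t.push_back("b");
        else
            t.push_back(t[i - 1] + t[i - 2]);
    }
    return t;
}
```

The lengths of these strings are Fibonacci numbers, so they grow exponentially in $k$; only small values of $k$ are practical.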
The idea behind the Main-Lorentz algorithm is divide-and-conquer.
It splits the initial string into halves, and computes the number of repetitions that lie completely in each half by two recursive calls. Then comes the difficult part. The algorithm finds all repetitions starting in the first half and ending in the second half (which we will call crossing repetitions). This is the essential part of the Main-Lorentz algorithm, and we will discuss it in detail here.
The complexity of divide-and-conquer algorithms is well researched.
The master theorem says that we will end up with an $O(n \log n)$ algorithm if we can compute the crossing repetitions in $O(n)$ time.
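
Written out as a recurrence, and assuming the crossing repetitions of a string of length $n$ are handled in $O(n)$ time, the running time $T(n)$ of the divide-and-conquer scheme satisfies:

$$T(n) = 2\,T\!\left(\frac{n}{2}\right) + O(n) \quad\Longrightarrow\quad T(n) = O(n \log n)$$

This is the same recurrence as for merge sort: there are $O(\log n)$ levels of recursion, and the work on each level sums to $O(n)$.
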
So we want to find all such repetitions that start in the first half of the string, let's call it $u$, and end in the second half, let's call it $v$, so that $s = u + v$.
Their lengths are approximately equal to the length of $s$ divided by two.
Consider an arbitrary repetition and look at the middle character (more precisely the first character of the second half of the repetition).
I.e. if the repetition is a substring $s[i \dots j]$, then the middle character is at position $(i + j + 1) / 2$.
We call a repetition left or right depending on which string this character is located in: the string $u$ or the string $v$.
We will now discuss how to find all left repetitions. Finding all right repetitions can be done in the same way.
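Let us denote the length of the left repetition by $2l$ (i.e. each half of the repetition has length $l$). Consider the first character of the repetition that falls into the string $v$ (it is at position $|u|$ in the string $s$). It coincides with the character $l$ positions before it; let's denote this position $cntr$.
We will fixate this position $cntr$ and look for all repetitions at this position.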
For example:
The vertical line divides the two halves.
Here we fixated the position
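It is clear that if we fixate the position $cntr$, we simultaneously fixate the length of the possible repetitions: $2l = 2(|u| - cntr)$.
Now, how can we find all such repetitions for a fixated $cntr$?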
Let's again look at a visualization, this time for the repetition
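Here we denoted the lengths of the two pieces of the repetition with $l_1$ and $l_2$: $l_1$ is the length of the repetition up to the position $cntr - 1$, and $l_2$ is the length of the repetition from $cntr$ until the end of the first half of the repetition. The total length is $2l = l_1 + l_2 + l_1 + l_2$.
Let us generate necessary and sufficient conditions for such a repetition at position $cntr$ of length $2l = 2(l_1 + l_2) = 2(|u| - cntr)$: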
- Let $k_1$ be the largest number such that the first $k_1$ characters before the position $cntr$ coincide with the last $k_1$ characters in the string $u$:
  $$u[cntr - k_1 \dots cntr - 1] = u[|u| - k_1 \dots |u| - 1]$$
- Let $k_2$ be the largest number such that the $k_2$ characters starting at position $cntr$ coincide with the first $k_2$ characters in the string $v$:
  $$u[cntr \dots cntr + k_2 - 1] = v[0 \dots k_2 - 1]$$
- Then we have a repetition exactly for any pair $(l_1,~ l_2)$ with $l_1 \le k_1$ and $l_2 \le k_2$.
To summarize:
- We fixate a specific position $cntr$.
- All repetitions which we will find now have length $2l = 2(|u| - cntr)$. There might be multiple such repetitions; they depend on the lengths $l_1$ and $l_2 = l - l_1$.
- We find $k_1$ and $k_2$ as described above.
- Then all suitable repetitions are the ones for which the lengths of the pieces $l_1$ and $l_2$ satisfy the conditions:
  $$\begin{align} l_1 &\le k_1, \\ l_2 &\le k_2. \end{align}$$
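
The summary can be sketched with naive character comparisons in place of a fast lookup. This is an illustration under stated assumptions, not the final algorithm: `left_crossing_naive` is a hypothetical name, results are 0-based inclusive index pairs into $s = u + v$, and the bounds $1 \le l_1 \le l - 1$ are assumed so that the repetition really crosses the border and its middle character stays in $u$.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: all left crossing repetitions of s = u + v,
// with k1 and k2 computed by direct scans (quadratic overall).
std::vector<std::pair<int, int>> left_crossing_naive(const std::string& u,
                                                     const std::string& v) {
    const int nu = static_cast<int>(u.size());
    const int nv = static_cast<int>(v.size());
    std::vector<std::pair<int, int>> res;
    for (int cntr = 0; cntr < nu; cntr++) {
        const int l = nu - cntr;  // half-length fixed by cntr
        // k1: how many characters right before cntr match the end of u
        int k1 = 0;
        while (cntr - 1 - k1 >= 0 && u[cntr - 1 - k1] == u[nu - 1 - k1])
            k1++;
        // k2: how many characters starting at cntr match the start of v
        int k2 = 0;
        while (cntr + k2 < nu && k2 < nv && u[cntr + k2] == v[k2])
            k2++;
        // every l1 with l1 <= k1 and l2 = l - l1 <= k2 gives a repetition
        for (int l1 = std::max(1, l - k2); l1 <= std::min(l - 1, k1); l1++) {
            const int start = cntr - l1;
            res.emplace_back(start, start + 2 * l - 1);
        }
    }
    return res;
}
```

For instance, with $u = cac$ and $v = ada$ the only left crossing repetition is $caca$, found at $cntr = 1$ with $k_1 = k_2 = 1$.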
Therefore the only remaining part is how we can compute the values $k_1$ and $k_2$ quickly for every position $cntr$. We can look them up in $O(1)$ after precomputing the Z-function:
- We can find the value $k_1$ for each position by calculating the Z-function for the string $\overline{u}$ (i.e. the reversed string $u$). Then the value $k_1$ for a particular $cntr$ will be equal to the corresponding value of the array of the Z-function.
- To precompute all values $k_2$, we calculate the Z-function for the string $v + \# + u$ (i.e. the string $v$ concatenated with the separator character $\#$ and the string $u$). Again we just need to look up the corresponding value in the Z-function to get the $k_2$ value.
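
The two lookups can be sketched as follows. The names `z_function_sketch`, `all_k1` and `all_k2` are hypothetical, the index arithmetic is one possible reading of "the corresponding value", and the separator `'#'` is assumed not to occur in $u$ or $v$.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Standard Z-function: z[i] is the length of the longest common prefix
// of s and s[i..]; z[0] is left as 0.
std::vector<int> z_function_sketch(const std::string& s) {
    const int n = static_cast<int>(s.size());
    std::vector<int> z(n, 0);
    for (int i = 1, l = 0, r = 0; i < n; i++) {
        if (i <= r)
            z[i] = std::min(r - i + 1, z[i - l]);
        while (i + z[i] < n && s[z[i]] == s[i + z[i]])
            z[i]++;
        if (i + z[i] - 1 > r) {
            l = i;
            r = i + z[i] - 1;
        }
    }
    return z;
}

// k1 for every cntr in [0, |u|): position cntr - 1 of u is position
// |u| - cntr of the reversed string, so k1[cntr] = z[|u| - cntr].
std::vector<int> all_k1(const std::string& u) {
    const int nu = static_cast<int>(u.size());
    const std::string ru(u.rbegin(), u.rend());
    const std::vector<int> z = z_function_sketch(ru);
    std::vector<int> k1(nu, 0);  // k1[0] = 0: nothing lies before position 0
    for (int cntr = 1; cntr < nu; cntr++)
        k1[cntr] = z[nu - cntr];
    return k1;
}

// k2 for every cntr in [0, |u|): position cntr of u is position
// |v| + 1 + cntr of the string v + '#' + u.
std::vector<int> all_k2(const std::string& u, const std::string& v) {
    const int nu = static_cast<int>(u.size());
    const int nv = static_cast<int>(v.size());
    const std::vector<int> z = z_function_sketch(v + '#' + u);
    std::vector<int> k2(nu, 0);
    for (int cntr = 0; cntr < nu; cntr++)
        k2[cntr] = z[nv + 1 + cntr];
    return k2;
}
```

Both arrays are built in linear time, after which each $k_1$ or $k_2$ query is a single array access.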
So this is enough to find all left crossing repetitions.
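For computing the right crossing repetitions we act similarly:
we define the center $cntr$ as the character in $v$ corresponding to the last character in the string $u$.
Then the length $k_1$ will be defined as the largest number of characters before the position $cntr$ (inclusive) that coincide with the last characters of the string $u$, and the length $k_2$ will be defined as the largest number of characters starting at $cntr + 1$ that coincide with the first characters of the string $v$.
Thus we can find the values $k_1$ and $k_2$ by computing the Z-function for the strings $\overline{u} + \# + \overline{v}$ and $v$.
After that we can find the repetitions by looking at all positions $cntr$, and use the same criterion as we had for left crossing repetitions.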
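
A naive sketch of the right case may help to check the index arithmetic. It assumes the following reading, which matches the implementation in this article: $cntr$ lies in $v$, $l = cntr - |u| + 1$, $k_1$ counts matching characters ending at $cntr$ against the end of $u$, and $k_2$ counts matching characters starting at $cntr + 1$ against the start of $v$. The name `right_crossing_naive` is hypothetical.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: all right crossing repetitions of s = u + v as
// 0-based inclusive index pairs into s, using direct scans for k1 and k2.
std::vector<std::pair<int, int>> right_crossing_naive(const std::string& u,
                                                      const std::string& v) {
    const int nu = static_cast<int>(u.size());
    const int nv = static_cast<int>(v.size());
    std::vector<std::pair<int, int>> res;
    for (int c = 0; c < nv; c++) {  // c = cntr - |u| is the index inside v
        const int l = c + 1;        // half-length fixed by cntr
        // k1: characters ending at cntr (inclusive) matched against the end of u
        int k1 = 0;
        while (k1 < l && k1 < nu && v[c - k1] == u[nu - 1 - k1])
            k1++;
        // k2: characters starting at cntr + 1 matched against the start of v
        int k2 = 0;
        while (c + 1 + k2 < nv && v[c + 1 + k2] == v[k2])
            k2++;
        // l1 characters of the first half lie in u, so it starts at |u| - l1
        for (int l1 = std::max(1, l - k2); l1 <= std::min(l, k1); l1++) {
            const int start = nu - l1;
            res.emplace_back(start, start + 2 * l - 1);
        }
    }
    return res;
}
```

Unlike the left case, $l_1 = l$ is allowed here: the second half then starts exactly at the first character of $v$, so the repetition still crosses the border.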
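The implementation of the Main-Lorentz algorithm finds all repetitions in the form of peculiar tuples of size four: $(cntr,~ l,~ k_1,~ k_2)$ in $O(n \log n)$ time.
Notice that if you want to expand these tuples to get the starting and end position of each repetition, then the runtime will be $O(n^2)$, since a string can contain $O(n^2)$ repetitions. This implementation does so, and stores all found repetitions in a vector of pairs of start and end indices.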
```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>
using namespace std;

// Standard Z-function: z[i] is the length of the longest common prefix
// of s and s[i..]; z[0] is left as 0.
vector<int> z_function(string const& s) {
    int n = s.size();
    vector<int> z(n);
    for (int i = 1, l = 0, r = 0; i < n; i++) {
        if (i <= r)
            z[i] = min(r-i+1, z[i-l]);
        while (i + z[i] < n && s[z[i]] == s[i+z[i]])
            z[i]++;
        if (i + z[i] - 1 > r) {
            l = i;
            r = i + z[i] - 1;
        }
    }
    return z;
}

// Z-value lookup that returns 0 for indices outside the array.
int get_z(vector<int> const& z, int i) {
    if (0 <= i && i < (int)z.size())
        return z[i];
    else
        return 0;
}

// All repetitions found so far, as (start, end) index pairs, both inclusive.
vector<pair<int, int>> repetitions;

// Expands one tuple (cntr, l, k1, k2) into explicit repetitions.
// shift is the offset of the current substring in the original string.
void convert_to_repetitions(int shift, bool left, int cntr, int l, int k1, int k2) {
    for (int l1 = max(1, l - k2); l1 <= min(l, k1); l1++) {
        // l1 == l means l2 == 0: a left repetition would lie entirely in u
        if (left && l1 == l) break;
        int pos = shift + (left ? cntr - l1 : cntr - l - l1 + 1);
        repetitions.emplace_back(pos, pos + 2*l - 1);
    }
}

void find_repetitions(string s, int shift = 0) {
    int n = s.size();
    if (n <= 1)  // also stops on the empty string
        return;

    // split s into u + v and solve both halves recursively
    int nu = n / 2;
    int nv = n - nu;
    string u = s.substr(0, nu);
    string v = s.substr(nu);
    string ru(u.rbegin(), u.rend());
    string rv(v.rbegin(), v.rend());

    find_repetitions(u, shift);
    find_repetitions(v, shift + nu);

    // z1, z2 give k1, k2 for left repetitions; z3, z4 for right ones
    vector<int> z1 = z_function(ru);
    vector<int> z2 = z_function(v + '#' + u);
    vector<int> z3 = z_function(ru + '#' + rv);
    vector<int> z4 = z_function(v);

    for (int cntr = 0; cntr < n; cntr++) {
        int l, k1, k2;
        if (cntr < nu) {
            l = nu - cntr;
            k1 = get_z(z1, nu - cntr);
            k2 = get_z(z2, nv + 1 + cntr);
        } else {
            l = cntr - nu + 1;
            k1 = get_z(z3, nu + 1 + nv - 1 - (cntr - nu));
            k2 = get_z(z4, (cntr - nu) + 1);
        }
        if (k1 + k2 >= l)
            convert_to_repetitions(shift, cntr < nu, cntr, l, k1, k2);
    }
}
```
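
If only the number of repetitions is needed, the expansion loop in `convert_to_repetitions` can be replaced by arithmetic on the tuple, which avoids the quadratic blow-up. The following is a minimal sketch of that idea; `count_from_tuple` is a hypothetical helper that mirrors the loop bounds of `convert_to_repetitions` and is not part of the implementation above.

```cpp
#include <algorithm>

// Hypothetical helper: number of repetitions described by one tuple
// (left, l, k1, k2), i.e. the number of valid values of l1.
// Same bounds as the loop in convert_to_repetitions:
//   max(1, l - k2) <= l1 <= min(l, k1), and l1 < l for left repetitions.
long long count_from_tuple(bool left, int l, int k1, int k2) {
    const int lo = std::max(1, l - k2);
    const int hi = std::min(left ? l - 1 : l, k1);
    return hi >= lo ? hi - lo + 1 : 0;
}
```

Summing this value over all tuples gives the total number of repetitions while keeping each tuple at constant cost.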