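Suppose we are given a string $s$ of length $n$. The Z-function for this string is an array of length $n$ where the $i$-th element is equal to the greatest number of characters starting from position $i$ that coincide with the first characters of $s$.
In other words, $z[i]$ is the length of the longest string that is, at the same time, a prefix of $s$ and a prefix of the suffix of $s$ starting at $i$.
Note. In this article, to avoid ambiguity, we assume 0-based indexes; that is, the first character of $s$ has index $0$ and the last one has index $n-1$.
The first element of Z-function, $z[0]$, is generally not well defined. In this article we will assume it is zero (although it doesn't change anything in the algorithm implementation).
This article presents an algorithm for calculating the Z-function in $O(n)$ time, as well as several of its applications.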
For example, here are the values of the Z-function computed for different strings:
- "aaaaa" - $[0, 4, 3, 2, 1]$
- "aaabaab" - $[0, 2, 1, 0, 2, 1, 0]$
- "abacaba" - $[0, 0, 1, 0, 3, 0, 1]$
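The formal definition can be expressed as the following elementary $O(n^2)$ implementation:

```cpp
#include <string>
#include <vector>
using namespace std;

vector<int> z_function_trivial(string s) {
    int n = s.size();
    vector<int> z(n);  // z[0] stays 0 by convention
    for (int i = 1; i < n; i++) {
        // extend the match at position i one character at a time
        while (i + z[i] < n && s[z[i]] == s[i + z[i]]) {
            z[i]++;
        }
    }
    return z;
}
```

We just iterate through every position $i$ and update $z[i]$ for each one of them, starting from $z[i] = 0$ and incrementing it as long as we don't find a mismatch (and as long as we don't reach the end of the string).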
Of course, this is not an efficient implementation. We will now show the construction of an efficient implementation.
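To obtain an efficient algorithm we will compute the values of $z[i]$ in turn from $i = 1$ to $n - 1$, but at the same time, when computing a new value, we'll try to make the best possible use of the previously computed values.
For the sake of brevity, let's call segment matches those substrings that coincide with a prefix of $s$. For example, the value of the desired Z-function $z[i]$ is the length of the segment match starting at position $i$ (and ending at position $i + z[i] - 1$).
To do this, we will keep the $[l, r)$ indices of the rightmost segment match. That is, among all detected segments we will keep the one that ends rightmost. In a way, the index $r$ can be seen as the "boundary" to which our string $s$ has been scanned by the algorithm; everything beyond that point is not yet known.
Then, if the current index (for which we have to compute the next value of the Z-function) is $i$, we have one of two options: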
- $i \geq r$ -- the current position is outside of what we have already processed.

    We will then compute $z[i]$ with the trivial algorithm (that is, just comparing values one by one). Note that in the end, if $z[i] > 0$, we'll have to update the indices of the rightmost segment, because it's guaranteed that the new $r = i + z[i]$ is better than the previous $r$.

- $i < r$ -- the current position is inside the current segment match $[l, r)$.

    Then we can use the already calculated Z-values to "initialize" the value of $z[i]$ to something (it sure is better than "starting from zero"), maybe even some big number.

    For this, we observe that the substrings $s[l \dots r)$ and $s[0 \dots r-l)$ match. This means that as an initial approximation for $z[i]$ we can take the value already computed for the corresponding segment $s[0 \dots r-l)$, and that is $z[i-l]$.

    However, the value $z[i-l]$ could be too large: when applied to position $i$ it could exceed the index $r$. This is not allowed because we know nothing about the characters to the right of $r$: they may differ from those required.

    Here is an example of a similar scenario:

    $$ s = "aaaabaa" $$

    When we get to the last position ($i = 6$), the current match segment will be $[5, 7)$. Position $6$ will then match position $6 - 5 = 1$, for which the value of the Z-function is $z[1] = 3$. Obviously, we cannot initialize $z[6]$ to $3$, it would be completely incorrect. The maximum value we could initialize it to is $1$ -- because it's the largest value that doesn't bring us beyond the index $r$ of the match segment $[l, r)$.

    Thus, as an initial approximation for $z[i]$ we can safely take:

    $$ z_0[i] = \min(r - i,\; z[i-l]) $$

    After having $z[i]$ initialized to $z_0[i]$, we try to increment $z[i]$ by running the trivial algorithm -- because in general, after the border $r$, we cannot know if the segment will continue to match or not.
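To make the clamping step concrete, here is a minimal sketch that records the initial approximation $z_0[i]$ for every position before the character-by-character extension starts. The function name `z_initial_estimates` is hypothetical and introduced only for this illustration; positions with $i \geq r$ are reported as $0$.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper (not part of the article's implementation):
// returns z_0[i], the value z[i] is initialized to before the
// trivial comparison loop runs (0 whenever i >= r).
std::vector<int> z_initial_estimates(const std::string& s) {
    int n = (int)s.size();
    std::vector<int> z(n), z0(n);
    int l = 0, r = 0;  // rightmost segment match [l, r)
    for (int i = 1; i < n; i++) {
        if (i < r) {
            // never trust more than the r - i characters known to match
            z[i] = std::min(r - i, z[i - l]);
        }
        z0[i] = z[i];
        while (i + z[i] < n && s[z[i]] == s[i + z[i]]) {
            z[i]++;
        }
        if (i + z[i] > r) {
            l = i;
            r = i + z[i];
        }
    }
    return z0;
}
```

For the string "aaaabaa" from the example, the entry at position $6$ is $1$ rather than $z[1] = 3$, which is exactly the effect of taking the minimum with $r - i$.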
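Thus, the whole algorithm is split in two cases, which differ only in the initial value of $z[i]$: in the first case it is assumed to be zero, in the second case it is determined by the previously computed values (using the above formula). After that, both branches reduce to the trivial algorithm, which starts immediately after the initial value is set.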
The algorithm turns out to be very simple. Despite the fact that on each iteration the trivial algorithm is run, we have made significant progress, having an algorithm that runs in linear time. Later on we will prove that the running time is linear.
Implementation turns out to be rather concise:
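
```cpp
#include <algorithm>
#include <string>
#include <vector>
using namespace std;

vector<int> z_function(string s) {
    int n = s.size();
    vector<int> z(n);
    int l = 0, r = 0;  // rightmost segment match [l, r)
    for (int i = 1; i < n; i++) {
        if (i < r) {
            // reuse the value computed for the mirrored position i - l
            z[i] = min(r - i, z[i - l]);
        }
        // extend the match with the trivial algorithm
        while (i + z[i] < n && s[z[i]] == s[i + z[i]]) {
            z[i]++;
        }
        if (i + z[i] > r) {
            l = i;
            r = i + z[i];
        }
    }
    return z;
}
```

The whole solution is given as a function which returns an array of length $n$, the Z-function of $s$.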
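Array $z$ is initially filled with zeros. The current rightmost match segment is assumed to be $[0, 0)$ (that is, a deliberately small segment which doesn't contain any $i$).
Inside the loop for $i = 1 \dots n - 1$ we first determine the initial value of $z[i]$: it will either remain zero or be computed using the above formula.
Thereafter, the trivial algorithm attempts to increase the value of $z[i]$ as much as possible.
In the end, if it's required (that is, if $i + z[i] > r$), we update the rightmost match segment $[l, r)$.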
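One way to gain confidence in the implementation is to compare it against the trivial algorithm on many small inputs. The sketch below is an assumption-laden test harness rather than part of the algorithm: the names `z_trivial`, `z_linear` and `agree_on_all_binary_strings` are hypothetical, and the check only covers strings over the two-letter alphabet {a, b}.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Quadratic reference: direct character-by-character comparison.
std::vector<int> z_trivial(const std::string& s) {
    int n = (int)s.size();
    std::vector<int> z(n);
    for (int i = 1; i < n; i++)
        while (i + z[i] < n && s[z[i]] == s[i + z[i]]) z[i]++;
    return z;
}

// Linear version, maintaining the rightmost segment match [l, r).
std::vector<int> z_linear(const std::string& s) {
    int n = (int)s.size();
    std::vector<int> z(n);
    int l = 0, r = 0;
    for (int i = 1; i < n; i++) {
        if (i < r) z[i] = std::min(r - i, z[i - l]);
        while (i + z[i] < n && s[z[i]] == s[i + z[i]]) z[i]++;
        if (i + z[i] > r) { l = i; r = i + z[i]; }
    }
    return z;
}

// Hypothetical check: compares both versions on every string over {a, b}
// of length 1..max_len, encoding each string as the bits of a mask.
bool agree_on_all_binary_strings(int max_len) {
    for (int len = 1; len <= max_len; len++) {
        for (int mask = 0; mask < (1 << len); mask++) {
            std::string s(len, 'a');
            for (int b = 0; b < len; b++)
                if ((mask >> b) & 1) s[b] = 'b';
            if (z_trivial(s) != z_linear(s)) return false;
        }
    }
    return true;
}
```

Calling `agree_on_all_binary_strings(12)` exercises a few thousand strings, including highly periodic ones where the $i < r$ branch is taken often.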
We will prove that the above algorithm has a running time that is linear in the length of the string, that is, $O(n)$.
The proof is very simple.
We are interested in the nested `while` loop, since everything else is just a bunch of constant operations which sums up to $O(n)$.
We will show that each iteration of the `while` loop will increase the right border $r$ of the match segment.
To do that, we will consider both branches of the algorithm:
- $i \geq r$

    In this case, either the `while` loop won't make any iteration (if $s[0] \ne s[i]$), or it will take a few iterations, starting at position $i$, each time moving one character to the right. After that, the right border $r$ will necessarily be updated.

    So we have found that, when $i \geq r$, each iteration of the `while` loop increases the value of the new $r$ index.

- $i < r$

    In this case, we initialize $z[i]$ to a certain value $z_0$ given by the above formula. Let's compare this initial value $z_0$ to the value $r - i$. We will have three cases:

    - $z_0 < r - i$

        We prove that in this case no iteration of the `while` loop will take place.

        It's easy to prove, for example, by contradiction: if the `while` loop made at least one iteration, it would mean that initial approximation $z[i] = z_0$ was inaccurate (less than the match's actual length). But since $s[l \dots r)$ and $s[0 \dots r-l)$ are the same, this would imply that $z[i-l]$ holds the wrong value (less than it should be).

        Thus, since $z[i-l]$ is correct and it is less than $r - i$, it follows that this value coincides with the required value $z[i]$.

    - $z_0 = r - i$

        In this case, the `while` loop can make a few iterations, but each of them will lead to an increase in the value of the $r$ index because we will start comparing from $s[r]$, which will climb beyond the $[l, r)$ interval.

    - $z_0 > r - i$

        This option is impossible, by definition of $z_0$.

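So, we have proved that each iteration of the inner loop makes the $r$ pointer advance to the right. Since $r$ never exceeds $n$, the inner loop makes at most $n$ iterations in total.
As the rest of the algorithm obviously works in $O(n)$, we have proved that the whole algorithm for computing the Z-function runs in linear time.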
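The bound can also be observed empirically. The following sketch instruments the algorithm with a counter; the function name `count_while_iterations` is hypothetical, and it counts only the iterations that actually increment $z[i]$ (the failed final test costs one extra comparison per position).

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical instrumentation: returns how many times z[i]++ executes
// in the linear algorithm, i.e. the number of successful comparisons.
long long count_while_iterations(const std::string& s) {
    int n = (int)s.size();
    std::vector<int> z(n);
    long long iterations = 0;
    int l = 0, r = 0;  // rightmost segment match [l, r)
    for (int i = 1; i < n; i++) {
        if (i < r) z[i] = std::min(r - i, z[i - l]);
        while (i + z[i] < n && s[z[i]] == s[i + z[i]]) {
            z[i]++;
            iterations++;  // i + z[i] is now strictly greater than the old r
        }
        if (i + z[i] > r) { l = i; r = i + z[i]; }
    }
    return iterations;
}
```

For "aaaaa" the counter ends at $4$, whereas the trivial algorithm performs $4 + 3 + 2 + 1 = 10$ increments on the same string.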
We will now consider some uses of Z-functions for specific tasks.
These applications will be largely similar to applications of prefix function.
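To avoid confusion, we call $t$ the string of text, and $p$ the pattern. The problem is: find all occurrences of the pattern $p$ inside the text $t$.
To solve this problem, we create a new string $s = p + \diamond + t$, that is, we concatenate $p$ and $t$ with a separator character $\diamond$ in the middle (we choose $\diamond$ so that it is certainly not present anywhere in $p$ or $t$).
Compute the Z-function for $s$. Then, for any $i$ in the interval $[0, \operatorname{length}(t) - 1]$, consider the corresponding value $k = z[i + \operatorname{length}(p) + 1]$. If $k$ is equal to $\operatorname{length}(p)$, there is an occurrence of $p$ at position $i$ of $t$; otherwise there is no occurrence of $p$ at position $i$ of $t$.
The running time (and memory consumption) is $O(\operatorname{length}(t) + \operatorname{length}(p))$.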
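A minimal sketch of this search follows. The helper name `find_occurrences` and its `sep` parameter are assumptions made for the example; in particular, the caller is assumed to pass a separator character that appears in neither string, and an empty pattern is treated as having no occurrences.

```cpp
#include <algorithm>
#include <string>
#include <vector>

std::vector<int> z_function(const std::string& s) {
    int n = (int)s.size();
    std::vector<int> z(n);
    int l = 0, r = 0;
    for (int i = 1; i < n; i++) {
        if (i < r) z[i] = std::min(r - i, z[i - l]);
        while (i + z[i] < n && s[z[i]] == s[i + z[i]]) z[i]++;
        if (i + z[i] > r) { l = i; r = i + z[i]; }
    }
    return z;
}

// Hypothetical helper: 0-based starting positions of `pattern` in `text`.
// Assumption: `sep` occurs in neither `text` nor `pattern`.
std::vector<int> find_occurrences(const std::string& text,
                                  const std::string& pattern,
                                  char sep = '#') {
    std::vector<int> positions;
    if (pattern.empty()) return positions;  // convention chosen for this sketch
    int m = (int)pattern.size();
    std::vector<int> z = z_function(pattern + sep + text);
    for (int i = 0; i + m <= (int)text.size(); i++) {
        // position i of the text is position i + m + 1 of the joined string
        if (z[i + m + 1] == m) positions.push_back(i);
    }
    return positions;
}
```

Because of the separator, no Z-value in the text part can exceed the pattern length, so comparing with `m` for equality is enough.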
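Given a string $s$ of length $n$, count the number of distinct substrings of $s$.
We'll solve this problem iteratively. That is: knowing the current number of different substrings, recalculate this amount after adding one character to the end of $s$.
So, let $k$ be the current number of distinct substrings of $s$. We append a new character $c$ to $s$. Obviously, there can be some new substrings ending in this new character $c$ (namely, all those strings that end with this symbol and that we haven't encountered yet).
Take a string $t = s + c$ and invert it (write its characters in reverse order). Our task is now to count how many prefixes of $t$ are not found anywhere else in $t$. Let's compute the Z-function of $t$ and find its maximum value $z_{max}$. Obviously, $t$'s prefix of length $z_{max}$ also occurs somewhere in the middle of $t$. Clearly, shorter prefixes also occur.
So, we have found that the number of new substrings that appear when symbol $c$ is appended to $s$ is equal to $\operatorname{length}(t) - z_{max}$.
Consequently, the running time of this solution is $O(n^2)$ for a string of length $n$.
It's worth noting that in exactly the same way we can recalculate, still in $O(n)$ time, the number of distinct substrings when appending a character at the beginning of the string, as well as when removing it (from the end or the beginning).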
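The iterative counting can be sketched as follows. The function name `count_distinct_substrings` is hypothetical; the sketch rebuilds the reversed prefix one character at a time and recomputes its Z-function from scratch at each step, which gives the quadratic total running time.

```cpp
#include <algorithm>
#include <string>
#include <vector>

std::vector<int> z_function(const std::string& s) {
    int n = (int)s.size();
    std::vector<int> z(n);
    int l = 0, r = 0;
    for (int i = 1; i < n; i++) {
        if (i < r) z[i] = std::min(r - i, z[i - l]);
        while (i + z[i] < n && s[z[i]] == s[i + z[i]]) z[i]++;
        if (i + z[i] > r) { l = i; r = i + z[i]; }
    }
    return z;
}

// Hypothetical helper: number of distinct non-empty substrings of s.
long long count_distinct_substrings(const std::string& s) {
    long long total = 0;
    std::string t;  // reverse of the prefix processed so far
    for (char c : s) {
        t.insert(t.begin(), c);  // reverse(prefix + c) = c + reverse(prefix)
        std::vector<int> z = z_function(t);
        int z_max = *std::max_element(z.begin(), z.end());
        // prefixes of t longer than z_max occur nowhere else in t
        total += (long long)t.size() - z_max;
    }
    return total;
}
```

For "abab" the four steps contribute $1$, $2$, $2$ and $2$ new substrings, for a total of $7$.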
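Given a string $s$ of length $n$, find its shortest "compressed" representation, that is: find a string $t$ of shortest length such that $s$ can be represented as a concatenation of one or more copies of $t$.
A solution is: compute the Z-function of $s$, loop through all $i$ such that $i$ divides $n$, and stop at the first $i$ such that $i + z[i] = n$. Then the string $s$ can be compressed to length $i$. If no such $i$ exists, $s$ cannot be compressed and the answer is $n$.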
The proof for this fact is the same as the solution which uses the prefix function.
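The compression check can be sketched in a few lines. The function name `compressed_length` is hypothetical; it returns the length of the shortest block whose repetition gives $s$, and falls back to $n$ when no proper divisor works.

```cpp
#include <algorithm>
#include <string>
#include <vector>

std::vector<int> z_function(const std::string& s) {
    int n = (int)s.size();
    std::vector<int> z(n);
    int l = 0, r = 0;
    for (int i = 1; i < n; i++) {
        if (i < r) z[i] = std::min(r - i, z[i - l]);
        while (i + z[i] < n && s[z[i]] == s[i + z[i]]) z[i]++;
        if (i + z[i] > r) { l = i; r = i + z[i]; }
    }
    return z;
}

// Hypothetical helper: length of the shortest string t such that s is
// a concatenation of one or more copies of t.
int compressed_length(const std::string& s) {
    int n = (int)s.size();
    std::vector<int> z = z_function(s);
    for (int i = 1; i < n; i++) {
        // i must divide n, and the suffix starting at i must match a prefix
        if (n % i == 0 && i + z[i] == n) return i;
    }
    return n;  // s is not a repetition of any shorter block
}
```

The compressed string itself is then `s.substr(0, compressed_length(s))`.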
- CSES - Finding Borders
- eolymp - Blocks of string
- Codeforces - Password [Difficulty: Easy]
- UVA # 455 "Periodic Strings" [Difficulty: Medium]
- UVA # 11022 "String Factoring" [Difficulty: Medium]
- UVa 11475 - Extend to Palindrome
- LA 6439 - Pasti Pas!
- Codechef - Chef and Strings
- Codeforces - Prefixes and Suffixes