Consider a random string of length $n$ over an alphabet $\mathcal{A}$ of size $|\mathcal{A}|=a$ (there are $a^n$ strings in total). What is the expected number of distinct substrings of this string? What is the expected number of distinct substrings of length $k$?
In other words, find the size of the set $$\left\{(s, t) \mid s \in \mathcal{A}^n,\ t \in \mathcal{A}^k,\ t\ \text{is a substring of}\ s \right\}$$ (dividing it by $a^n$ gives the expectation for length $k$).
Is it possible to calculate this number in polynomial time (i.e. $\mathrm{poly}(n,k,a)$)?
I am looking for an exact formula or an algorithm to calculate this number, but asymptotic results are also welcome.
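For small parameters the quantity asked about can be computed directly from its definition. The following is only a brute-force sketch that enumerates all $a^n$ strings, so it is exponential in $n$ and says nothing about the polynomial-time question; the function names are made up for illustration, and the empty substring is assumed not to count.

```python
from fractions import Fraction
from itertools import product


def expected_distinct_substrings(n, k, a):
    """Exact expected number of distinct length-k substrings of a uniformly
    random string of length n over an alphabet of size a (brute force)."""
    total = 0
    for s in product(range(a), repeat=n):
        # Collect the distinct windows of length k in this particular string.
        total += len({s[i:i + k] for i in range(n - k + 1)})
    return Fraction(total, a ** n)


def expected_distinct_substrings_all_lengths(n, a):
    """Sum of the above over k = 1..n (assumption: empty substring excluded)."""
    return sum(expected_distinct_substrings(n, k, a) for k in range(1, n + 1))
```

The numerator `total` is exactly the size of the set of pairs $(s, t)$ written above.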
- This is not an exact answer, but if you take $k=\log n/\log |\mathcal A|$, then each string of length $(1+\epsilon)k$ should be unique. A substring is obtained by fixing left and right endpoints. Most of these substrings are much longer than $k$, so should be unique. This suggests that the number of distinct substrings should be approximately $\binom n2-kn\approx n^2/2-n\log n/\log|\mathcal A|$. – Anthony Quas, Oct 31, 2016 at 20:39
- See Flaxman et al., Strings with maximally many distinct subsequences and substrings, Electronic J. Combinatorics 11 (2004) #R8, 10 pages, www.combinatorics.org/ojs/index.php/eljc/article/download/v11i1r8/pdf – Gerry Myerson, Oct 31, 2016 at 21:54
- @GerryMyerson thanks, but my question is how to sum the number of distinct substrings over all strings, not the maximal number of substrings one can achieve. – Artsem Zhuk, Oct 31, 2016 at 22:45
- I understand that. Did you look at the paper? The title says "maximal", but there's also a discussion of averages. – Gerry Myerson, Nov 1, 2016 at 1:22
- This answer mathoverflow.net/a/150618/17581 may be relevant. In particular, in the comments a generating function counting the words avoiding a certain subword is presented. – Ilya Bogdanov, Nov 3, 2016 at 20:23
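The idea of counting words that avoid a given subword can be turned into a computation by linearity of expectation: the expected number of distinct length-$k$ substrings equals $\sum_{t \in \mathcal{A}^k} \Pr[t \text{ occurs in } s] = a^k - a^{-n}\sum_{t} N_t(n)$, where $N_t(n)$ is the number of length-$n$ strings avoiding $t$. Below is a minimal sketch that computes $N_t(n)$ with a dynamic program over the KMP automaton of $t$ rather than with a generating function (a swapped-in technique, not the one from the linked answer). It still loops over all $a^k$ patterns, so it is polynomial in $n$ but not in $k$; the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def count_avoiding(pattern, n, a):
    """Number of strings of length n over {0, ..., a-1} that do not contain
    `pattern` (a non-empty tuple of symbols) as a contiguous substring."""
    k = len(pattern)
    # Standard KMP failure function of the pattern.
    fail = [0] * k
    j = 0
    for i in range(1, k):
        while j and pattern[i] != pattern[j]:
            j = fail[j - 1]
        if pattern[i] == pattern[j]:
            j += 1
        fail[i] = j

    def step(state, c):
        # state = length of the longest pattern prefix matched so far (< k).
        while state and pattern[state] != c:
            state = fail[state - 1]
        if pattern[state] == c:
            state += 1
        return state

    trans = [[step(q, c) for c in range(a)] for q in range(k)]
    counts = [0] * k  # counts[q] = number of avoiding prefixes ending in state q
    counts[0] = 1
    for _ in range(n):
        new = [0] * k
        for q, ways in enumerate(counts):
            if ways:
                for c in range(a):
                    r = trans[q][c]
                    if r < k:  # reaching state k would complete an occurrence
                        new[r] += ways
        counts = new
    return sum(counts)


def expected_distinct_via_avoidance(n, k, a):
    """Expected number of distinct length-k substrings, by linearity."""
    avoiding = sum(count_avoiding(t, n, a) for t in product(range(a), repeat=k))
    return a ** k - Fraction(avoiding, a ** n)
```

Since $N_t(n)$ depends on $t$ only through its overlap structure, one could presumably group patterns with the same automaton shape, but this sketch does not attempt that.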
2 Answers
The authors of this paper exactly answer that question (in corollary 2.2 of that paper).
It seems like if $\tau$ is a uniformly random word in $\mathcal{A}^n$, then $\tau$ has roughly as many distinct length $k$ substrings as possible (in expectation). That is, this number is within a constant factor of the upper bound $\min(a^k, n-k+1)$.
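Case 1 ($k$ is big): As noted in the comments, if $k \geq (1+ \varepsilon) \log(n)/\log(a)$, then repeated substrings of length $k$ are rare. Any two fixed positions carry the same length-$k$ substring with probability $a^{-k}$, so the expected number of coinciding pairs of positions is at most $\binom{n-k+1}{2} a^{-k}$, and the expected number of distinct substrings is at least $(n-k+1)(1 - n^{-\varepsilon}/2)$. If moreover $k \geq (2+ \varepsilon) \log(n)/\log(a)$, then with high probability no substring of length $k$ appears more than once, so it is very likely that $\tau$ has exactly $n-k+1$ distinct substrings (no expectation needed).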
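Case 2 ($k$ is small): Suppose that $k < (1 + \varepsilon) \log(n)/\log(a)$. Split the length-$n$ word into $\lfloor n/k \rfloor$ disjoint blocks of length $k$, and let $B_1, B_2, \ldots , B_{\lfloor n/k \rfloor}$ denote the substrings of $\tau$ corresponding to these blocks. The $B_i$ are i.i.d., each uniformly drawn from $\mathcal{A}^k$. Thus the expected number of distinct values among the $B_i$ is $a^k \left[ 1 - (1-a^{-k})^{\lfloor n/k \rfloor} \right] \geq a^k \left(1 - \exp\left[-\frac{\lfloor n/k \rfloor}{a^k}\right]\right)$. This is at least $a^k \delta$ for some fixed $\delta >0$ depending only on $\varepsilon$ as long as $n/(ka^k)$ stays bounded away from $0$, for instance when $k \leq (1-\varepsilon) \log(n)/\log(a)$. For $k$ closer to $\log(n)/\log(a)$ the blocks take at most $\lfloor n/k \rfloor$ distinct values, so this argument alone can fall short of $\min(a^k, n-k+1)$ by a factor of up to about $k$.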
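To get a feel for how the block bound compares with the trivial upper bound $\min(a^k, n-k+1)$, here is a small sketch that evaluates both expressions in floating point. The function names are invented for this illustration, and the number of blocks is taken to be $\lfloor n/k \rfloor$ (an assumption for the case where $k$ does not divide $n$).

```python
def block_lower_bound(n, k, a):
    """Expected number of distinct values among floor(n/k) i.i.d. uniform
    blocks of length k: a^k * (1 - (1 - a^-k)^floor(n/k))."""
    blocks = n // k  # assumption: leftover symbols at the end are ignored
    m = a ** k
    return m * (1 - (1 - 1 / m) ** blocks)


def trivial_upper_bound(n, k, a):
    """A string of length n has n-k+1 windows and there are a^k words."""
    return min(a ** k, n - k + 1)


if __name__ == "__main__":
    n, a = 1024, 2
    for k in (2, 5, 8, 10, 12, 20):
        print(k, block_lower_bound(n, k, a), trivial_upper_bound(n, k, a))
```

For small $k$ the two columns are close to each other; as $k$ approaches $\log(n)/\log(a)$ the first column cannot exceed the number of blocks $\lfloor n/k \rfloor$.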
Does this answer the question to your satisfaction?