Expected number of substrings in a random string

Question

Consider a random string of length $n$ over an alphabet $\mathcal{A}$ of size $|\mathcal{A}|=a$ (there are $a^n$ strings in total). What is the expected number of distinct substrings of this string? What is the expected number of distinct substrings of length $k$?
Equivalently, up to the normalizing factor $a^n$, find the size of the set $$\left\{(s, t) \mid s \in \mathcal{A}^n,\ t \in \mathcal{A}^k,\ t\ \text{is a substring of}\ s \right\}.$$ Is it possible to calculate this number in polynomial time (i.e. $\mathrm{poly}(n,k,a)$)?
I am looking for an exact formula or an algorithm to calculate this number, but asymptotic results are also welcome.
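
For small parameters the quantity can be computed by exhaustive enumeration, which may serve as a reference when testing any candidate formula. The following is only a brute-force sketch (exponential in $n$, so not the polynomial-time algorithm asked for); the function name `expected_distinct_substrings` is made up for this illustration.

```python
from fractions import Fraction
from itertools import product


def expected_distinct_substrings(n, a, k):
    """Exact expected number of distinct length-k substrings of a uniformly
    random string of length n over an alphabet of size a (brute force)."""
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    total = 0  # size of the set {(s, t) : t is a length-k substring of s}
    for s in product(range(a), repeat=n):
        # Slices of a tuple are tuples, so they can be collected in a set.
        total += len({s[i:i + k] for i in range(n - k + 1)})
    return Fraction(total, a ** n)


if __name__ == "__main__":
    # Small illustrative run: binary strings of length 4.
    for k in range(1, 5):
        print(k, expected_distinct_substrings(4, 2, k))
```

The variable `total` is exactly the size of the set in the question; dividing by $a^n$ gives the expectation.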

This is not an exact answer, but if you take $k=\log n/\log |\mathcal A|$, then each substring of length $(1+\epsilon)k$ should be unique. A substring is obtained by fixing its left and right endpoints. Most of these substrings are much longer than $k$, so they should be unique. This suggests that the number of distinct substrings should be approximately $\binom n2-kn\approx n^2/2-n\log n/\log|\mathcal A|$.
– Anthony Quas, Commented Oct 31, 2016 at 20:39
See Flaxman et al., Strings with maximally many distinct subsequences and substrings, Electronic J. Combinatorics 11 (2004) #R8, 10 pages, www.combinatorics.org/ojs/index.php/eljc/article/download/v11i1r8/pdf
– Gerry Myerson, Commented Oct 31, 2016 at 21:54
@GerryMyerson thanks, but my question is how to sum the number of distinct substrings over all strings, not what the maximal number of substrings one can achieve is.
– Artsem Zhuk, Commented Oct 31, 2016 at 22:45
I understand that. Did you look at the paper? The title says "maximal", but there's also a discussion of averages.
– Gerry Myerson, Commented Nov 1, 2016 at 1:22
This answer mathoverflow.net/a/150618/17581 may be relevant. In particular, in the comments a generating function counting the words avoiding a certain subword is presented.
– Ilya Bogdanov, Commented Nov 3, 2016 at 20:23
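
The heuristic estimate $n^2/2-n\log n/\log|\mathcal A|$ from the first comment can be compared against a sampled string. This is a rough sketch, assuming letters drawn uniformly with Python's `random` module and an alphabet of at least two letters; the names `count_distinct_substrings` and `heuristic_estimate` are made up here, and a single sample only gives an impression, not an expectation.

```python
import math
import random
import string


def count_distinct_substrings(s):
    """Number of distinct non-empty substrings of s (quadratic, set-based)."""
    n = len(s)
    return len({s[i:j] for i in range(n) for j in range(i + 1, n + 1)})


def heuristic_estimate(n, a):
    """The estimate n^2/2 - n*log(n)/log(a) from the comment; needs a >= 2."""
    return n * n / 2 - n * math.log(n) / math.log(a)


if __name__ == "__main__":
    rng = random.Random(0)  # arbitrary fixed seed, for reproducibility only
    n, a = 200, 2
    alphabet = string.ascii_lowercase[:a]
    s = "".join(rng.choice(alphabet) for _ in range(n))
    print(count_distinct_substrings(s), round(heuristic_estimate(n, a)))
```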

John smith · Accepted Answer · 2017-04-07 05:16:58Z

The authors of this paper answer exactly that question (in Corollary 2.2 of the paper).

Pat Devlin · Accepted Answer · 2016-11-11 17:39:51Z

It seems that if $\tau$ is a uniformly random word in $\mathcal{A}^n$, then in expectation $\tau$ has roughly as many distinct length-$k$ substrings as possible. That is, the expected number is within a constant factor of the upper bound $\min(a^k, n-k+1)$.

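Case 1 ($k$ is big): As suggested in the comments, if $k \geq (1+ \varepsilon) \log(n)/\log(a)$, then repeated substrings of length $k$ are rare. Two fixed starting positions carry the same length-$k$ substring with probability $a^{-k} \leq n^{-(1+\varepsilon)}$, so among the $N = n-k+1$ starting positions the expected number of coinciding pairs is at most $\binom{N}{2} a^{-k} \leq N n^{-\varepsilon}/2$. Since the number of distinct substrings is at least $N$ minus the number of coinciding pairs, its expectation is at least $(1 - n^{-\varepsilon}/2)(n-k+1)$. If moreover $k \geq (2+ \varepsilon) \log(n)/\log(a)$, the expected number of coinciding pairs is at most $n^{-\varepsilon}/2$, so with high probability no substring of length $k$ appears more than once and $\tau$ has exactly $n-k+1$ distinct substrings (no expectation needed).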
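
How close a random word comes to $n-k+1$ distinct substrings for a given $k$ can be probed empirically. Below is a minimal sketch, assuming letters drawn uniformly with Python's `random` module; the helper names `distinct_length_k` and `random_word`, the seed, and the parameter values are all chosen only for illustration.

```python
import random


def distinct_length_k(s, k):
    """Number of distinct length-k substrings of the sequence s."""
    return len({s[i:i + k] for i in range(len(s) - k + 1)})


def random_word(n, a, rng):
    """Uniformly random word of length n over the alphabet {0, ..., a-1}."""
    return tuple(rng.randrange(a) for _ in range(n))


if __name__ == "__main__":
    rng = random.Random(1)  # arbitrary fixed seed
    n, a = 2000, 2
    tau = random_word(n, a, rng)
    # Columns: k, observed distinct substrings, upper bound min(a^k, n-k+1).
    for k in (8, 11, 14, 22, 30):
        print(k, distinct_length_k(tau, k), min(a ** k, n - k + 1))
```

Comparing the second and third columns for several $k$ around $\log(n)/\log(a)$ and $2\log(n)/\log(a)$ shows where repeats stop occurring in a given sample.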

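Case 2 ($k$ is small): Suppose that $k < (1 + \varepsilon) \log(n)/\log(a)$. Split the length-$n$ word into $m = \lfloor n/k \rfloor$ disjoint blocks of length $k$ (discarding any leftover letters), and let $B_1, B_2, \ldots , B_m$ denote the substrings of $\tau$ corresponding to these blocks. The $B_i$ are i.i.d., each uniformly drawn from $\mathcal{A}^k$. A fixed word in $\mathcal{A}^k$ is missed by all $m$ blocks with probability $(1-a^{-k})^m$, so the expected number of distinct values among the $B_i$ is $a^k \left[ 1 - (1-a^{-k})^{m} \right] \geq a^k \left(1 - \exp\left[-\frac{m}{a^k}\right]\right)$. This is at least $a^k \delta$ with $\delta = 1 - e^{-c} > 0$ whenever $m/a^k \geq c$ for a fixed $c > 0$, which holds once $a^k$ is at most of order $n/k$; for $k$ closer to the threshold the same bound is only of order $m \approx n/k$.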
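
The block formula in Case 2 can be evaluated exactly and cross-checked on tiny instances. The sketch below writes $m$ for the number of blocks (about $n/k$); the function names are hypothetical, and the brute-force check is feasible only for very small $a$, $k$, $m$.

```python
import math
from fractions import Fraction
from itertools import product


def expected_distinct_blocks(a, k, m):
    """a^k * (1 - (1 - a^-k)^m): expected number of distinct values among
    m independent uniform draws from the a^k words of length k."""
    num_words = a ** k
    return num_words * (1 - (1 - Fraction(1, num_words)) ** m)


def exp_lower_bound(a, k, m):
    """a^k * (1 - exp(-m / a^k)), obtained from (1 - x)^m <= exp(-m * x)."""
    num_words = a ** k
    return num_words * (1 - math.exp(-m / num_words))


def brute_force_blocks(a, k, m):
    """The same expectation, by enumerating all (a^k)^m block sequences."""
    words = list(product(range(a), repeat=k))
    total = sum(len(set(blocks)) for blocks in product(words, repeat=m))
    return Fraction(total, len(words) ** m)


if __name__ == "__main__":
    n, a = 2000, 2  # illustrative values only
    for k in (4, 8, 11, 14):
        m = n // k
        exact = float(expected_distinct_blocks(a, k, m))
        print(k, a ** k, round(exact, 2), round(exp_lower_bound(a, k, m), 2))
```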

Does this answer the question to your satisfaction?
