4
$\begingroup$

Consider a random string of length $n$ over an alphabet of size $|\mathcal{A}|=a$ ($a^n$ strings in total). What is the expected number of distinct substrings of this string? What is the expected number of distinct substrings of length $k$?
In other words, find the size of the set $$\left\{(s, t)| s \in \mathcal{A}^n, t \in \mathcal{A} ^ k, t\ \text{is a substring of}\ s \right\}$$ (dividing it by $a^n$ gives the expectation for length $k$). Is it possible to calculate this number in polynomial time (i.e. $\mathrm{poly}(n,k,a)$)?
I am looking for an exact formula or algorithm to calculate this number, but asymptotic results are also welcome.
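
For small parameters the quantity can be checked directly from the definition. The following is a minimal brute-force sketch (exponential in $n$, so only a sanity check and not the polynomial-time algorithm asked for); the function names are illustrative, and the letters of $\mathcal{A}$ are represented as the integers $0,\ldots,a-1$.

```python
from fractions import Fraction
from itertools import product


def count_pairs(n: int, k: int, a: int) -> int:
    """Size of {(s, t) : s in A^n, t in A^k, t is a substring of s}."""
    total = 0
    for s in product(range(a), repeat=n):
        # Each distinct length-k window of s contributes one pair (s, t).
        total += len({s[i:i + k] for i in range(n - k + 1)})
    return total


def expected_distinct(n: int, k: int, a: int) -> Fraction:
    """Expected number of distinct length-k substrings of a uniform word."""
    return Fraction(count_pairs(n, k, a), a ** n)


def expected_distinct_all_lengths(n: int, a: int) -> Fraction:
    """Sum over k = 1..n; substrings of different lengths never coincide.

    The empty substring is not counted (an assumption of this sketch).
    """
    return sum(expected_distinct(n, k, a) for k in range(1, n + 1))


if __name__ == "__main__":
    print(count_pairs(3, 1, 2))        # 14 pairs for n=3, k=1, a=2
    print(expected_distinct(3, 1, 2))  # 7/4
```

Returning a `Fraction` keeps the expectation exact, which makes it easier to compare against a conjectured closed form.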

$\endgroup$
6
  • 3
    $\begingroup$ This is not an exact answer, but if you take $k=\log n/\log |\mathcal A|$, then each substring of length $(1+\epsilon)k$ should be unique. A substring is obtained by fixing its left and right endpoints. Most of these substrings are much longer than $k$, so they should be unique. This suggests that the number of distinct substrings should be approximately $\binom n2-kn\approx n^2/2-n\log n/\log|\mathcal A|$. $\endgroup$ Commented Oct 31, 2016 at 20:39
  • 1
    $\begingroup$ See Flaxman et al, Strings with maximally many distinct subsequences and substrings, Electronic J Combinatorics 11 (2004) #R8, 10 pages, www.combinatorics.org/ojs/index.php/eljc/article/download/v11i1r8/pdf $\endgroup$ Commented Oct 31, 2016 at 21:54
  • $\begingroup$ @GerryMyerson thanks, but my question is how to sum the number of distinct substrings over all strings, not the maximal number of substrings one can achieve. $\endgroup$ Commented Oct 31, 2016 at 22:45
  • $\begingroup$ I understand that. Did you look at the paper? The title says "maximal", but there's also a discussion of averages. $\endgroup$ Commented Nov 1, 2016 at 1:22
  • $\begingroup$ This answer mathoverflow.net/a/150618/17581 may be relevant. In particular, in the comments a generating function counting the words avoiding a certain subword is presented. $\endgroup$ Commented Nov 3, 2016 at 20:23

2 Answers

2
$\begingroup$

The authors of this paper exactly answer that question (in corollary 2.2 of that paper).

$\endgroup$
1
$\begingroup$

It seems like if $\tau$ is a uniformly random word in $\mathcal{A}^n$, then $\tau$ has roughly as many distinct length $k$ substrings as possible (in expectation). That is, this number is within a constant factor of the upper bound $\min(a^k, n-k+1)$.

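Case 1 ($k$ is big): As noted in the comments, if $k \geq (1+ \varepsilon) \log(n)/\log(a)$, then repeated substrings of length $k$ are rare. Two fixed length-$k$ windows of $\tau$ agree with probability $a^{-k}$ (even when they overlap), so the expected number of coinciding pairs is $\binom{n-k+1}{2} a^{-k} \leq (n-k+1)\, n^{-\varepsilon}/2$, and the expected number of distinct length-$k$ substrings is at least $(n-k+1)(1 - n^{-\varepsilon}/2)$. The same union bound over pairs shows that once $k \geq (2+ \varepsilon) \log(n)/\log(a)$, with high probability no substring of length $k$ appears more than once, so it is very likely that $\tau$ has exactly $n-k+1$ distinct substrings (no expectation needed).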
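
To get a feel for how rare repeats are when $k$ is large, one can sample random words and compare the average number of distinct length-$k$ substrings with the trivial upper bound $n-k+1$ and with the pair-counting lower bound $(n-k+1) - \binom{n-k+1}{2}a^{-k}$ (each coinciding pair of windows removes at most one distinct substring, and two fixed windows agree with probability $a^{-k}$). This is only a Monte Carlo sketch; the function names, the number of trials and the seed are arbitrary choices, not part of the argument.

```python
import random
from math import comb


def distinct_windows(word, k):
    """Number of distinct length-k substrings of `word` (a list of letters)."""
    return len({tuple(word[i:i + k]) for i in range(len(word) - k + 1)})


def pair_lower_bound(n, k, a):
    """(n-k+1) minus the expected number of coinciding pairs of windows."""
    windows = n - k + 1
    return windows - comb(windows, 2) / a ** k


def simulate(n, k, a, trials=300, seed=0):
    """Average number of distinct length-k substrings over random words."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        word = [rng.randrange(a) for _ in range(n)]
        total += distinct_windows(word, k)
    return total / trials


if __name__ == "__main__":
    n, k, a = 200, 12, 2
    print("upper bound n-k+1:", n - k + 1)
    print("pair lower bound :", pair_lower_bound(n, k, a))
    print("simulated average:", simulate(n, k, a))
```

Up to sampling noise, the simulated average should land between the two bounds.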

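Case 2 ($k$ is small): Suppose that $k < (1 + \varepsilon) \log(n)/\log(a)$. Split the length-$n$ word into $m = \lfloor n/k \rfloor$ disjoint blocks of length $k$, and let $B_1, B_2, \ldots , B_m$ denote the substrings of $\tau$ corresponding to these blocks. These $B_i$ are i.i.d. and each is uniformly drawn from $\mathcal{A}^k$, so a fixed word of $\mathcal{A}^k$ is missed by every block with probability $(1-a^{-k})^m$. By linearity of expectation and $1-x \leq e^{-x}$, the expected number of distinct values for the $B_i$ is $a^k \left[ 1 - (1-a^{-k})^{m} \right] \geq a^k (1 - \exp[-\frac{m}{a^k}])$. This is at least $a^k \delta$ with $\delta = 1 - e^{-c} > 0$ whenever $m \geq c\,a^k$. For $k$ closer to $\log(n)/\log(a)$ the disjoint blocks alone only give a lower bound of order $\min(a^k, n/k)$, so a constant-factor bound in that range would also have to use the overlapping windows.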
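
The block bound is easy to evaluate numerically. The sketch below computes the expected number of distinct blocks and its exponential lower bound, using $\lfloor n/k \rfloor$ blocks when $k$ does not divide $n$ (an assumption of this sketch). Since the blocks are a subset of all length-$k$ windows, both numbers are lower bounds for the expected number of distinct length-$k$ substrings. Function names are illustrative.

```python
from math import exp


def block_expectation(n, k, a):
    """Expected number of distinct values among floor(n/k) disjoint blocks."""
    m = n // k           # number of disjoint length-k blocks
    words = a ** k       # size of A^k
    return words * (1 - (1 - 1 / words) ** m)


def exponential_bound(n, k, a):
    """Lower bound a^k * (1 - exp(-m / a^k)), from 1 - x <= exp(-x)."""
    m = n // k
    words = a ** k
    return words * (1 - exp(-m / words))


if __name__ == "__main__":
    n, a = 1024, 2
    for k in range(1, 13):
        ratio = block_expectation(n, k, a) / a ** k
        print(k, round(block_expectation(n, k, a), 3), round(ratio, 3))
```

Printing the ratio to $a^k$ for increasing $k$ shows how the fraction of $\mathcal{A}^k$ covered by the blocks changes as $k$ approaches $\log(n)/\log(a)$.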

Does this answer the question to your satisfaction?

$\endgroup$
