Counting colours in compressed strings. (English)
Giancarlo, Raffaele (ed.) et al., Combinatorial pattern matching. 22nd annual symposium, CPM 2011, Palermo, Italy, June 27‒29, 2011. Proceedings. Berlin: Springer (ISBN 978-3-642-21457-8/pbk). Lecture Notes in Computer Science 6661, 197-207 (2011).
Summary: Motivated by the problem of counting unique visitors to a website, we consider how to preprocess a string $s[1 \dots n]$ such that later, given a substring’s endpoints, we can quickly count how many distinct characters that substring contains. The smallest reasonably fast previous data structure for this problem takes $n \log σ+ {\mathcal{O}\!\left( {n \log \log n} \right)}$ bits and answers queries in ${\mathcal{O}\!\left( {\log n} \right)}$ time. We give a data structure for this problem that takes $n H_0 (s) + {\mathcal{O}\!\left( {n} \right)} + o (n H_0 (s))$ bits, where $H _{0} (s)$ is the 0th-order empirical entropy of $s$, and answers queries in ${\mathcal{O}\!\left( {\log \ell} \right)}$ time, where $\ell $ is the length of the query substring. As far as we know, this is the first data structure, where the query time depends only on $\ell $ and not on $n$. We also show how our data structure can be made partially dynamic.