Ranked document retrieval with forbidden pattern. (English) Zbl 1432.68120

Cicalese, Ferdinando (ed.) et al., Combinatorial pattern matching. 26th annual symposium, CPM 2015, Ischia Island, Italy, June 29 – July 1, 2015. Proceedings. Cham: Springer. Lect. Notes Comput. Sci. 9133, 77-88 (2015).
Summary: Let \(\mathcal{D}=\{\mathsf {T}_1,\mathsf {T}_2,\dots , \mathsf {T}_D\}\) be a collection of \(D\) string documents of \(n\) characters in total. The forbidden pattern document listing problem asks to report those documents \(\mathcal{D}' \subseteq \mathcal{D}\) which contain the pattern \(P\), but not the pattern \(Q\). The \({\mathsf {top\text{-}}k}\) forbidden pattern query \((P,Q,k)\) asks to report those \(k\) documents in \(\mathcal{D}'\) that are most relevant to \(P\). For typical relevance functions (like document importance, term-frequency, term-proximity), we present a linear space index with worst case query time of \(O(|P|+|Q|+\sqrt{nk})\) for the \({\mathsf {top\text{-}}k}\) problem. As a corollary of this result, we obtain a linear space and \(O(|P|+|Q|+\sqrt{nt})\) query time solution for the document listing problem, where \(t\) is the number of documents reported. We conjecture that any significant improvement over the results in this paper is highly unlikely.
For the entire collection see [Zbl 1314.68012].


68P20 Information storage and retrieval of data
68W32 Algorithms on strings
Full Text: DOI