an:07205114
Zbl 1443.92147
Toivonen, Jarkko; Taipale, Jussi; Ukkonen, Esko
Seed-driven learning of position probability matrices from large sequence sets
EN
Schwartz, Russell (ed.) et al., 17th international workshop on algorithms in bioinformatics, WABI 2017, Boston, MA, USA, August 21--23, 2017. Proceedings. Wadern: Schloss Dagstuhl -- Leibniz Zentrum für Informatik. LIPIcs -- Leibniz Int. Proc. Inform. 88, Article 25, 13 p. (2017).
2017
a
92D20 92-04
motif finding; transcription factor binding site; sequence analysis; Hamming distance; seed
Summary: We formulate and analyze a novel seed-driven algorithm SeedHam for PPM learning. To learn a PPM of length \(\ell\), the algorithm uses the most frequent \(\ell\)-mer of the training data as a seed, and then restricts the learning to the \(\ell\)-mers of the training data that belong to a Hamming neighbourhood of the seed. The PPM is constructed from background-corrected counts of such \(\ell\)-mers using an algorithm that estimates a product of \(\ell\) categorical distributions from a (non-uniform) Hamming sample. The SeedHam method is intended for PPM learning from large sequence sets (up to hundreds of Mbases) containing enriched motif instances. A variant of the method is introduced that decreases contamination from artefact instances of the motif and thereby allows using larger Hamming neighbourhoods. To solve the motif orientation problem in two-stranded DNA we introduce a novel seed finding rule, based on analysis of the palindromic structure of sequences. Test experiments are reported that illustrate the relative strengths of different variants of our methods and show that our algorithm outperforms two popular earlier methods. A C++ implementation of the method is available from \url{https://github.com/jttoivon/seedham/}.
For the entire collection see [Zbl 1372.68022].
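
The seed-and-neighbourhood idea in the summary can be illustrated with a deliberately simplified Python sketch. It is not the SeedHam implementation: it omits the background correction, the estimation procedure for non-uniform Hamming samples, the artefact-reducing variant, and the palindrome-based seed rule for two-stranded DNA. The function names (`count_lmers`, `naive_seed_ppm`), the `radius` and `pseudocount` parameters, and the tie-breaking rule are all assumptions made for illustration.

```python
from collections import Counter

ALPHABET = "ACGT"


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def count_lmers(sequences, ell):
    """Count every length-ell substring consisting only of A, C, G, T."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - ell + 1):
            lmer = seq[i:i + ell]
            if all(base in ALPHABET for base in lmer):
                counts[lmer] += 1
    return counts


def naive_seed_ppm(sequences, ell, radius=1, pseudocount=1.0):
    """Pick the most frequent l-mer as seed and build a PPM from the
    l-mers within Hamming distance `radius` of it (single strand only,
    plain pseudocounted frequencies, no background correction)."""
    counts = count_lmers(sequences, ell)
    if not counts:
        raise ValueError("no l-mers of the requested length found")
    # Assumed tie-break: highest count first, then lexicographic order.
    seed = min(counts, key=lambda k: (-counts[k], k))

    # One pseudocounted count table per motif position.
    columns = [{b: pseudocount for b in ALPHABET} for _ in range(ell)]
    for lmer, n in counts.items():
        if hamming(seed, lmer) <= radius:
            for pos, base in enumerate(lmer):
                columns[pos][base] += n

    # Normalise each column into a categorical distribution.
    ppm = []
    for col in columns:
        total = sum(col.values())
        ppm.append({b: col[b] / total for b in ALPHABET})
    return seed, ppm


if __name__ == "__main__":
    toy = ["ACGTACGT", "TTACGTTT", "GGACGAGG"]
    seed, ppm = naive_seed_ppm(toy, ell=4, radius=1)
    print(seed)
    for column in ppm:
        print(column)
```

Plain counting inside a Hamming neighbourhood skews each column toward the seed's own base, because every neighbour agrees with the seed at most positions. This is presumably the reason the summary refers to a dedicated estimation algorithm for Hamming samples, a step this sketch leaves out.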