×

Better greedy sequence clustering with fast banded alignment. (English) Zbl 1443.92136

Schwartz, Russell (ed.) et al., 17th international workshop on algorithms in bioinformatics, WABI 2017, Boston, MA, USA, August 21–23, 2017. Proceedings. Wadern: Schloss Dagstuhl – Leibniz Zentrum für Informatik. LIPIcs – Leibniz Int. Proc. Inform. 88, Article 3, 13 p. (2017).
Summary: Comparing a string to a large set of sequences is a key subroutine in greedy heuristics for clustering genomic data. Clustering 16S rRNA gene sequences into operational taxonomic units (OTUs) is a common method used in studying microbial communities. We present a new approach to greedy clustering using a tree-like data structure and Four Russians speedup. We evaluate the running time of our method in terms of the number of comparisons it makes during clustering and show in experimental results that the number of comparisons grows linearly with the size of the dataset as opposed to the quadratic running time of other methods. We compare the clusters output by our method to the popular greedy clustering tool UCLUST. We show that the clusters we generate can be both tighter and larger.
For the entire collection see [Zbl 1372.68022].

MSC:

92D20 Protein sequences, DNA sequences
PDFBibTeX XMLCite
Full Text: DOI

References:

[2] Stephen F. Altschul, Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman. Basic local alignment search tool. {\it Journal of molecular biology}, 215(3):403-410, 1990.
[3] J. Gregory Caporaso, Christian L. Lauber, William A Walters, Donna Berg-Lyons, James Huntley, Noah Fierer, Sarah M. Owens, Jason Betley, Louise Fraser, Markus Bauer, et al. Ultra-high-throughput microbial community analysis on the Illumina HiSeq and MiSeq platforms. {\it The ISME journal}, 6(8):1621-1624, 2012.
[4] Robert C. Edgar. Search and clustering orders of magnitude faster than BLAST. {\it Bioin-} {\it formatics}, 26(19):2460-2461, 2010.
[5] J. Felsenstein. PHYLIP-phylogeny inference package (version 3.2). {\it cladistics}, 5:164-166, 1989.
[6] Mohammadreza Ghodsi, Bo Liu, and Mihai Pop.DNACLUST: accurate and efficient clustering of phylogenetic marker genes. {\it BMC bioinformatics}, 12(1):271, 2011.
[7] Dan Gusfield. {\it Algorithms on strings, trees and sequences: computer science and computa-} {\it tional biology}. Cambridge university press, 1997. · Zbl 0934.68103
[8] Vladimir I. Levenshtein.Binary codes capable of correcting deletions, insertions, and reversals. In {\it Soviet physics doklady}, volume 10, pages 707-710, 1966. · Zbl 0149.15905
[9] Weizhong Li and Adam Godzik. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. {\it Bioinformatics}, 22(13):1658-1659, 2006.
[10] William J. Masek and Michael S. Paterson. How to compute string-edit distances quickly. In D. Sankoff and J. B. Kruskal, editors, {\it Time Warps, String Edits, and Macromolecules:} {\it the Theory and Practice of Sequence Comparison}, pages 337-349. Addison-Wesley Publ. Co., Mass., 1983.
[11] William J. Masek and Mike Paterson. A faster algorithm computing string edit distances. {\it J. Comput. Syst. Sci.}, 20(1):18-31, 1980. doi:10.1016/0022-0000(80)90002-1. · Zbl 0436.68044
[12] Gerard Muyzer, Ellen C. De Waal, and Andre G. Uitterlinden. Profiling of complex mi crobial populations by denaturing gradient gel electrophoresis analysis of polymerase chain reaction-amplified genes coding for 16S rRNA. {\it Applied and environmental microbiology}, 59(3):695-700, 1993.
[13] Eugene W. Myers.An {\it O}({\it N D}) difference algorithm and its variations.{\it Algorithmica}, 1(1):251-266, 1986. · Zbl 0639.68054
[14] Gene Myers. A fast bit-vector algorithm for approximate string matching based on dynamic programming. {\it Journal of the ACM (JACM)}, 46(3):395-415, 1999. · Zbl 1065.68663
[15] Temple F. Smith and Michael S. Waterman. Identification of common molecular subse quences. {\it Journal of molecular biology}, 147(1):195-197, 1981.
[16] Julie D. Thompson, Toby Gibson, Des G. Higgins, et al. Multiple sequence alignment using ClustalW and ClustalX. {\it Current protocols in bioinformatics}, pages 2-3, 2002.
[17] Lusheng Wang and Tao Jiang. On the complexity of multiple sequence alignment. {\it Journal} {\it of computational biology}, 1(4):337-348, 1994.
[18] James R. White, Saket Navlakha, Niranjan Nagarajan, Mohammad-Reza Ghodsi, Carl Kingsford, and Mihai Pop. Alignment and clustering of phylogenetic markers-implications for microbial diversity studies. {\it BMC bioinformatics}, 11(1):152, 2010.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.