Designing efficient spaced seeds for SOLiD read mapping. (English) Zbl 1219.92025

Adv. Bioinform. 2010, Article ID 708501, 12 p. (2010).
Summary: The advent of high-throughput sequencing technologies constituted a major advance in genomic studies, offering new prospects in a wide range of applications. We propose a rigorous and flexible algorithmic solution to mapping SOLiD color-space reads to a reference genome. The solution relies, on the one hand, on an advanced method of seed design that uses a faithful probabilistic model of read matches and, on the other hand, on a novel seeding principle especially adapted to read mapping. Our method can handle both lossy and lossless frameworks and is able to distinguish, at the level of seed design, between SNPs and reading errors. We illustrate our approach by several seed designs and demonstrate their efficiency.

MSC:

92C40 Biochemistry, molecular biology
92D10 Genetics and epigenetics
Full Text: DOI

References:

[1] B. Ma, J. Tromp, and M. Li, “PatternHunter: faster and more sensitive homology search,” Bioinformatics, vol. 18, no. 3, pp. 440-445, 2002.
[2] L. Noé and G. Kucherov, “YASS: enhancing the sensitivity of DNA similarity search,” Nucleic Acids Research, vol. 33, no. 2, pp. W540-W543, 2005. · Zbl 05437566 · doi:10.1093/nar/gki478
[3] H. Li, J. Ruan, and R. Durbin, “Mapping short DNA sequencing reads and calling variants using mapping quality scores,” Genome Research, vol. 18, no. 11, pp. 1851-1858, 2008. · doi:10.1101/gr.078212.108
[4] M. Strömberg and W. P. Lee, “MOSAIK read alignment and assembly program,” 2009, http://bioinformatics.bc.edu/marthlab/Mosaik.
[5] E. Rivals, L. Salmela, P. Kiiskinen, P. Kalsi, and J. Tarhio, “MPSCAN: fast localisation of multiple reads in genomes,” in Proceedings of the 9th International Workshop on Algorithms in Bioinformatics (WABI ’09), vol. 5724 of Lecture Notes in Computer Science, pp. 246-260, Philadelphia, Pa, USA, September 2009. · Zbl 05624743 · doi:10.1007/978-3-642-04241-6_21
[6] D. Campagna, A. Albiero, A. Bilardi et al., “PASS: a program to align short sequences,” Bioinformatics, vol. 25, no. 7, pp. 967-968, 2009. · Zbl 05743859 · doi:10.1093/bioinformatics/btp087
[7] Y. Chen, T. Souaiaia, and T. Chen, “PerM: efficient mapping of short sequencing reads with periodic full sensitive spaced seeds,” Bioinformatics, vol. 25, no. 19, pp. 2514-2521, 2009. · Zbl 05744233 · doi:10.1093/bioinformatics/btp486
[8] D. Weese, A.-K. Emde, T. Rausch, A. Döring, and K. Reinert, “RazerS-fast read mapping with sensitivity control,” Genome Research, vol. 19, no. 9, pp. 1646-1654, 2009. · doi:10.1101/gr.088823.108
[9] S. M. Rumble, P. Lacroute, A. V. Dalca, M. Fiume, A. Sidow, and M. Brudno, “SHRiMP: accurate mapping of short color-space reads,” PLoS Computational Biology, vol. 5, no. 5, Article ID e1000386, 2009. · doi:10.1371/journal.pcbi.1000386
[10] H. Lin, Z. Zhang, M. Q. Zhang, B. Ma, and M. Li, “ZOOM! Zillions of oligos mapped,” Bioinformatics, vol. 24, no. 21, pp. 2431-2437, 2008. · Zbl 05511889 · doi:10.1093/bioinformatics/btn416
[11] B. Langmead, C. Trapnell, M. Pop, and S. L. Salzberg, “Ultrafast and memory-efficient alignment of short DNA sequences to the human genome,” Genome Biology, vol. 10, no. 3, article R25, 2009. · doi:10.1186/gb-2009-10-3-r25
[12] H. Li and R. Durbin, “Fast and accurate short read alignment with Burrows-Wheeler transform,” Bioinformatics, vol. 25, no. 14, pp. 1754-1760, 2009. · Zbl 05744088 · doi:10.1093/bioinformatics/btp324
[13] R. Li, C. Yu, Y. Li et al., “SOAP2: an improved ultrafast tool for short read alignment,” Bioinformatics, vol. 25, no. 15, pp. 1966-1967, 2009. · Zbl 05744125 · doi:10.1093/bioinformatics/btp336
[14] S. Hoffmann, C. Otto, S. Kurtz et al., “Fast mapping of short sequences with mismatches, insertions and deletions using index structures,” PLoS Computational Biology, vol. 5, no. 9, Article ID e1000502, 2009. · doi:10.1371/journal.pcbi.1000502
[15] N. Homer, B. Merriman, and S. F. Nelson, “BFAST: an alignment tool for large scale genome resequencing,” PLoS ONE, vol. 4, no. 11, Article ID e7767, 2009. · doi:10.1371/journal.pone.0007767
[16] B. D. Ondov, A. Varadarajan, K. D. Passalacqua, and N. H. Bergman, “Efficient mapping of Applied Biosystems SOLiD sequence data to a reference genome for functional genomic applications,” Bioinformatics, vol. 24, no. 23, pp. 2776-2777, 2008. · Zbl 05743630 · doi:10.1093/bioinformatics/btn512
[17] K. Prüfer, U. Stenzel, M. Dannemann, R. E. Green, M. Lachmann, and J. Kelso, “PatMaN: rapid alignment of short sequences to large databases,” Bioinformatics, vol. 24, no. 13, pp. 1530-1531, 2008. · Zbl 05511686 · doi:10.1093/bioinformatics/btn223
[18] D. R. Bentley, S. Balasubramanian, H. P. Swerdlow et al., “Accurate whole human genome sequencing using reversible terminator chemistry,” Nature, vol. 456, no. 7218, pp. 53-59, 2008. · doi:10.1038/nature07517
[19] G. Kucherov, L. Noé, and M. Roytberg, “Multiseed lossless filtration,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 2, no. 1, pp. 51-61, 2005. · Zbl 05103338 · doi:10.1109/TCBB.2005.12
[20] G. Kucherov, L. Noé, and M. Roytberg, “A unifying framework for seed sensitivity and its application to subset seeds,” Journal of Bioinformatics and Computational Biology, vol. 4, no. 2, pp. 553-569, 2006. · doi:10.1142/S0219720006001977
[21] ABI: a theoretical understanding of 2 base color codes and its application to annotation, error detection, and error correction. Methods for annotating 2 base color encoded reads in the SOLiD™ system, 2008.
[22] ABI: the SOLiD™ 3 system. Enabling the Next Generation of Science, 2009.
[23] B. Ewing and P. Green, “Base-calling of automated sequencer traces using phred. II. Error probabilities,” Genome Research, vol. 8, no. 3, pp. 186-194, 1998.
[24] M. Li, B. Ma, D. Kisman, and J. Tromp, “PatternHunter II: highly sensitive and fast homology search,” Journal of Bioinformatics and Computational Biology, vol. 2, no. 3, pp. 417-439, 2004. · Zbl 02178405 · doi:10.1142/S0219720004000661
[25] Y. Sun and J. Buhler, “Designing multiple simultaneous seeds for DNA similarity search,” Journal of Computational Biology, vol. 12, no. 6, pp. 847-861, 2005. · doi:10.1089/cmb.2005.12.847
[26] B. Brejová, D. G. Brown, and T. Vinar, “Optimal spaced seeds for Hidden Markov Models, with application to homologous coding regions,” in Proceedings of the 14th Symposium on Combinatorial Pattern Matching (CPM ’03), vol. 2676 of Lecture Notes in Computer Science, pp. 42-54, Springer, 2003. · Zbl 1279.92063
[27] L. Zhou, J. Stanton, and L. Florea, “Universal seeds for cDNA-to-genome comparison,” BMC Bioinformatics, vol. 9, article 36, 2008. · Zbl 05326412 · doi:10.1186/1471-2105-9-36
[28] J. Yang and L. Zhang, “Run probabilities of seed-like patterns and identifying good transition seeds,” Journal of Computational Biology, vol. 15, no. 10, pp. 1295-1313, 2008. · doi:10.1089/cmb.2007.0209
[29] G. Kucherov, L. Noé, and M. Roytberg, “Subset seed automaton,” in Proceedings of the 12th International Conference on Implementation and Application of Automata (CIAA ’07), vol. 4783 of Lecture Notes in Computer Science, pp. 180-191, Springer, 2007. · Zbl 1139.68369
[30] G. Kucherov, L. Noé, and M. Roytberg, “Iedera: subset seed design tool,” 2009, http://bioinfo.lifl.fr/yass/iedera.
[31] L. Noé, M. Gîrdea, and G. Kucherov, “Seed design framework for mapping SOLiD reads,” in Proceedings of the 14th Annual International Conference on Research in Computational Molecular Biology (RECOMB ’10), B. Berger, Ed., vol. 6044 of Lecture Notes in Computer Science, pp. 384-396, Springer, Lisbon, Portugal, April 2010.
[32] U. Keich, M. Li, B. Ma, and J. Tromp, “On spaced seeds for similarity search,” Discrete Applied Mathematics, vol. 138, no. 3, pp. 253-263, 2004. · Zbl 1043.92009 · doi:10.1016/S0166-218X(03)00382-2
[33] J. Buhler, U. Keich, and Y. Sun, “Designing seeds for similarity search in genomic DNA,” in Proceedings of the 7th Annual International Conference on Research in Computational Molecular Biology (RECOMB ’03), pp. 67-75, ACM Press, 2003.
[34] S. Burkhardt and J. Kärkkäinen, “Better filtering with gapped q-grams,” Fundamenta Informaticae, vol. 56, no. 1-2, pp. 51-70, 2003. · Zbl 1031.68092
[35] D. Mak, Y. Gelfand, and G. Benson, “Indel seeds for homology search,” Bioinformatics, vol. 22, no. 14, pp. e341-e349, 2006. · doi:10.1093/bioinformatics/btl263
[36] M. Gîrdea, L. Noé, and G. Kucherov, “Read mapping tool for AB SOLiD data,” in Proceedings of the 9th International Workshop on Algorithms in Bioinformatics (WABI ’09), Philadelphia, Pa, USA, September 2009.
[37] O. Gotoh, “An improved algorithm for matching biological sequences,” Journal of Molecular Biology, vol. 162, no. 3, pp. 705-708, 1982.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases these data have been complemented or enhanced by data from zbMATH Open. The list attempts to reflect the references in the original paper as accurately as possible, without claiming completeness or a perfect matching.