×

zbMATH — the first resource for mathematics

DNA-seq error correction based on substring indices. (English) Zbl 1458.92030
Elloumi, Mourad (ed.), Algorithms for next-generation sequencing data. Techniques, approaches, and applications. Cham: Springer. 147-166 (2017).
Summary: Next-generation sequencing (NGS) has revolutionized genomics. NGS technologies produce millions of sequencing reads of a few hundred bases in length. In the following, we focus on NGS reads produced by genome sequencing of a clonal cell population, which has important applications like the de novo genome assembly of previously unknown genomes, for example, recently mutated parasites A. Mellmann et al. [“Prospective genomic characterization of the German enterohemorrhagic Escherichia coli o104:h4 outbreak by rapid next generation sequencing technology”, PLoS ONE 6, No. 7, Article ID e22751, 9 p. (2011; doi:10.1371/journal.pone.0022751)] or newly sequenced genomes [D. P. Locke et al., “Comparative and demographic analysis of orangutan genomes”, Nature 469, 529–533 (2011; doi:10.1038/nature09687)].
For the entire collection see [Zbl 1383.68005].
MSC:
92C40 Biochemistry, molecular biology
92D20 Protein sequences, DNA sequences
PDF BibTeX XML Cite
Full Text: DOI
References:
[1] Abouelhoda, M.I., Kurtz, S., Ohlebusch, E.: Replacing suffix trees with enhanced suffix arrays. J. Discrete Algorithms 2(1), 53-86 (2004) · Zbl 1115.92303
[2] Auton, A., Fledel-Alon, A., Pfeifer, S., Venn, O., Ségurel, L., Street, T., Leffler, E.M., Bowden, R., Aneas, I., Broxholme, J., Humburg, P., Iqbal, Z., Lunter, G., Maller, J., Hernandez, R.D., Melton, C., Venkat, A., Nobrega, M.A., Bontrop, R., Myers, S., Donnelly, P., Przeworski, M., McVean, G.: A fine-scale chimpanzee genetic map from population sequencing. Science 336(6078), 193-198 (2012)
[3] Bart, R., Cohn, M., Kassen, A., McCallum, E.J., Shybut, M., Petriello, A., Krasileva, K., Dahlbeck, D., Medina, C., Alicai, T., Kumar, L., Moreira, L.M., Neto, J.R., Verdier, V., Santana, M.A., Kositcharoenkul, N., Vanderschuren, H., Gruissem, W., Bernal, A., Staskawicz, B.J.: High-throughput genomic sequencing of cassava bacterial blight strains identifies conserved effectors to target for durable resistance. Proc. Natl. Acad. Sci. U. S. A. 109, E1972-1979 (2012)
[4] Butler, J., MacCallum, I., Kleber, M., Shlyakhter, I.A., Belmonte, M.K., Lander, E.S., Nusbaum, C., Jaffe, D.B.: ALLPATHS: de novo assembly of whole-genome shotgun microreads. Genome Res. 18(5), 810-820 (2008)
[5] Chaisson, M.J., Pevzner, P.A.: Short read fragment assembly of bacterial genomes. Genome Res. 18(2), 324-330 (2008)
[6] Ferragina, P., Manzini, G.: Opportunistic data structures with applications. In: FOCS’00, pp. 390-398 (2000)
[7] Glenn, T.C.: Field guide to next-generation DNA sequencers. Mol. Ecol. Resour. 11(5), 759-769 (2011)
[8] Gnerre, S., MacCallum, I., Przybylski, D., Ribeiro, F.J., Burton, J.N., Walker, B.J., Sharpe, T., Hall, G., Shea, T.P., Sykes, S., Berlin, A.M., Aird, D., Costello, M., Daza, R., Williams, L., Nicol, R., Gnirke, A., Nusbaum, C., Lander, E.S., Jaffe, D.B.: High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proc. Natl. Acad. Sci. U. S. A. 108(4), 1513-1518 (2010)
[9] Greenfield, P., Duesing, K., Papanicolaou, A., Bauer, D.C.: Blue: correcting sequencing errors using consensus and context. Bioinformatics 30(19), 2723-2732 (2014)
[10] Ilie, L., Fazayeli, F., Ilie, S.: HiTEC: accurate error correction in high-throughput sequencing data. Bioinformatics 27(3), 295-302 (2011)
[11] Kao, W., Chan, A., Song, Y.: ECHO: a reference-free short-read error correction algorithm. Genome Res. 21(7), 1181 (2011)
[12] Kelley, D.R., Schatz, M.C., Salzberg, S.L.: Quake: quality-aware detection and correction of sequencing errors. Genome Biol. 11(11), R116 (2010)
[13] Koren, S., Schatz, M.C., Walenz, B.P., Martin, J., Howard, J.T., Ganapathy, G., Wang, Z., Rasko, D.A., McCombie, W.R., Jarvis, E.D., Phillippy, A.M.: Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat. Biotechnol. 30(7), 693-700 (2012)
[14] Le, H.S., Schulz, M.H., McCauley, B.M., Hinman, V.F., Bar-Joseph, Z.: Probabilistic error correction for RNA sequencing. Nucleic Acids Res. 41(10), e109 (2013)
[15] Li, H., Durbin, R.: Fast and accurate long-read alignment with burrows-wheeler transform. Bioinformatics 26(5), 589-595 (2010)
[16] Locke, D.P., Hillier, L.W., Warren, W.C., Worley, K.C., Nazareth, L.V., Muzny, D.M., Yang, S.P., Wang, Z., Chinwalla, A.T., Minx, P., Mitreva, M., Cook, L., Delehaunty, K.D., Fronick, C., Schmidt, H., Fulton, L.A., Fulton, R.S., Nelson, J.O., Magrini, V., Pohl, C., Graves, T.A., Markovic, C., Cree, A., Dinh, H.H., Hume, J., Kovar, C.L., Fowler, G.R., Lunter, G., Meader, S., Heger, A., Ponting, C.P., Marques-Bonet, T., Alkan, C., Chen, L., Cheng, Z., Kidd, J.M., Eichler, E.E., White, S., Searle, S., Vilella, A.J., Chen, Y., Flicek, P., Ma, J., Raney, B., Suh, B., Burhans, R., Herrero, J., Haussler, D., Faria, R., Fernando, O., Darre, F., Farre, D., Gazave, E., Oliva, M., Navarro, A., Roberto, R., Capozzi, O., Archidiacono, N., Valle, G.D., Purgato, S., Rocchi, M., Konkel, M.K., Walker, J.A., Ullmer, B., Batzer, M.A., Smit, A.F.A., Hubley, R., Casola, C., Schrider, D.R., Hahn, M.W., Quesada, V., Puente, X.S., Ordonez, G.R., Lopez-Otin, C., Vinar, T., Brejova, B., Ratan, A., Harris, R.S., Miller, W., Kosiol, C., Lawson, H.A., Taliwal, V., Martins, A.L., Siepel, A., RoyChoudhury, A., Ma, X., Degenhardt, J., Bustamante, C.D., Gutenkunst, R.N., Mailund, T., Dutheil, J.Y., Hobolth, A., Schierup, M.H., Ryder, O.A., Yoshinaga, Y., de Jong, P.J., Weinstock, G.M., Rogers, J., Mardis, E.R., Gibbs, R.A., Wilson, R.K.: Comparative and demographic analysis of orangutan genomes. Nature 469, 529-533 (2011)
[17] MacManes, M.D., Eisen, M.B.: Improving transcriptome assembly through error correction of high-throughput sequence reads. PeerJ 1, e113 (2013)
[18] Manber, U., Myers, E.W.: Suffix arrays: a new method for on-line string searches. In: SODA’90, pp. 319-327. SIAM, Philadelphia (1990) · Zbl 0800.68364
[19] Margulies, M., Egholm, M., Altman, W.E., Attiya, S., Bader, J.S., Bemben, L.A., Berka, J., Braverman, M.S., Chen, Y.J., Chen, Z., Dewell, S.B., Du, L., Fierro, J.M., Gomes, X.V., Godwin, B.C., He, W., Helgesen, S., Ho, C.H., Irzyk, G.P., Jando, S.C., Alenquer, M.L.I., Jarvie, T.P., Jirage, K.B., Kim, J.B., Knight, J.R., Lanza, J.R., Leamon, J.H., Lefkowitz, S.M., Lei, M., Li, J., Lohman, K.L., Lu, H., Makhijani, V.B., Mcdade, K.E., Mckenna, M.P., Myers, E.W., Nickerson, E., Nobile, J.R., Plant, R., Puc, B.P., Ronan, M.T., Roth, G.T., Sarkis, G.J., Simons, J.F., Simpson, J.W., Srinivasan, M., Tartaro, K.R., Tomasz, A., Vogt, K.A., Volkmer, G.A., Wang, S.H., Wang, Y., Weiner, M.P., Yu, P., Begley, R.F., Rothberg, J.M.: Genome sequencing in microfabricated high-density picolitre reactors. Nature 437, 376-380 (2005)
[20] McCreight, E.M.: A space-economical suffix tree construction algorithm. J. ACM 23(2), 262-272 (1976) · Zbl 0329.68042
[21] Medvedev, P., Scott, E., Kakaradov, B., Pevzner, P.: Error correction of high-throughput sequencing datasets with non-uniform coverage. Bioinformatics 27(13), i137-i141 (2011)
[22] Mellmann, A., Harmsen, D., Cummings, C.A., Zentz, E.B., Leopold, S.R., Rico, A., Prior, K., Szczepanowski, R., Ji, Y., Zhang, W., McLaughlin, S.F., Henkhaus, J.K., Leopold, B., Bielaszewska, M., Prager, R., Brzoska, P.M., Moore, R.L., Guenther, S., Rothberg, J.M., Karch, H.: Prospective genomic characterization of the German enterohemorrhagic Escherichia coli o104:h4 outbreak by rapid next generation sequencing technology. PLoS ONE 6(7), e22751 (2011)
[23] Minoche, A.E., Dohm, J.C., Himmelbauer, H.: Evaluation of genomic high-throughput sequencing data generated on Illumina HiSeq and genome analyzer systems. Genome Biol. 12(11), R112 (2011)
[24] Myers, E.W.: A fast bit-vector algorithm for approximate string matching based on dynamic programming. J. ACM 46(3), 395-415 (1999) · Zbl 1065.68663
[25] Pevzner, P., Waterman, M.: Multiple filtration and approximate pattern matching. Algorithmica 13(1-2), 135-154 (1995) · Zbl 0831.92015
[26] Pevzner, P.A., Tang, H., Waterman, M.S.: An Eulerian path approach to DNA fragment assembly. Proc. Natl. Acad. Sci. U. S. A. 98(17), 9748-9753 (2001) · Zbl 0993.92018
[27] Quail, M.A., Smith, M., Coupland, P., Otto, T.D., Harris, S.R., Connor, T.R., Bertoni, A., Swerdlow, H.P., Gu, Y.: A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC Genomics 13(1), 341 (2012)
[28] Salmela, L.: Correction of sequencing errors in a mixed set of reads. Bioinformatics 26(10), 1284-1290 (2010)
[29] Salmela, L., Schröder, J.: Correcting errors in short reads by multiple alignments. Bioinformatics 27(11), 1455-1461 (2011)
[30] Salzberg, S.L., Phillippy, A.M., Zimin, A., Puiu, D., Magoc, T., Koren, S., Treangen, T.J., Schatz, M.C., Delcher, A.L., Roberts, M., Marçais, G., Pop, M., Yorke, J.A.: GAGE: a critical evaluation of genome assemblies and assembly algorithms. Genome Res. 22(3), 557-567 (2012)
[31] Savel, D.M., LaFramboise, T., Grama, A., Koyutürk, M.: Suffix-tree based error correction of NGS reads using multiple manifestations of an error. In: Proceedings of the International Conference on Bioinformatics, Computational Biology and Biomedical Informatics, BCB’13, pp. 351:351-351:358. ACM, New York (2013)
[32] Schröder, J., Schröder, H., Puglisi, S.J., Sinha, R., Schmidt, B.: SHREC: a short-read error correction method. Bioinformatics 25(17), 2157-2163 (2009)
[33] Schulz, M.H., Weese, D., Holtgrewe, M., Dimitrova, V., Niu, S., Reinert, K., Richard, H.: Fiona: a parallel and automatic strategy for read error correction. Bioinformatics 30(17), i356-i363 (2014)
[34] Shendure, J., Ji, H.: Next-generation DNA sequencing. Nat. Biotechnol. 26(10), 1135-1145 (2008)
[35] Simpson, J.T., Durbin, R.: Efficient de novo assembly of large genomes using compressed data structures. Genome Res. 22(3), 549-556 (2012)
[36] Stein, L.: The case for cloud computing in genome informatics. Genome Biol. 11(5), 207 (2010)
[37] The 1000 Genomes Project Consortium: A map of human genome variation from population-scale sequencing. Nature 467(7319), 1061-1073 (2010)
[38] Ukkonen, E.: On-line construction of suffix trees. Algorithmica 14(3), 249-260 (1995) · Zbl 0831.68027
[39] Weiner, P.: Linear pattern matching algorithms. In: Proceedings of the 14th Symposium on Switching and Automata Theory, SWAT’73, pp. 1-11. IEEE Computer Society, Washington, DC (1973)
[40] Yang, X., Dorman, K.S., Aluru, S.: Reptile: representative tiling for short read error correction. Bioinformatics 26(20), 2526-2533 (2010)
[41] Yang, X.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.