CUSHAW suite: parallel and efficient algorithms for NGS read alignment. (English) Zbl 1457.68337

Elloumi, Mourad (ed.), Algorithms for next-generation sequencing data. Techniques, approaches, and applications. Cham: Springer. 203-233 (2017).
Summary: Next generation sequencing (NGS) technologies have enabled cheap, large-scale, and high-throughput production of short DNA sequence reads and thereby have promoted the explosive growth of data volume. Unfortunately, the produced reads are short and prone to contain errors that are incurred during sequencing cycles. Both large data volume and sequencing errors have complicated the mapping of NGS reads onto the reference genome and have motivated the development of various aligners for very short reads, typically less than 100 base pairs (bps) in length. As read length continues to increase, propelled by advances in NGS technologies, these longer reads tend to have higher sequencing error rates and more true mutations (including substitutions, insertions, or deletions) to the genome. Such new characteristics make inefficient the aligners, which are optimized for very short reads and support only ungapped alignments or gapped alignments with very limited number of gaps (typically one gap), and thereby call for new aligners with fully gapped alignment supported. In this chapter, we present the CUSHAW software suite for NGS read alignment, which is open-source and consists of three individual aligners: CUSHAW, CUSHAW2, and CUSHAW3. This suite offers parallel and efficient NGS read alignments to large genomes, such as the human genome, by harnessing multi-core CPUs or compute unified device architecture (CUDA)-enabled graphics processing units (GPUs). Moreover, it has the capability to align both base-space and color-space reads and is consistently shown to be one of the best alignment tools through our performance evaluations.
For the entire collection see [Zbl 1383.68005].


68W32 Algorithms on strings
92D20 Protein sequences, DNA sequences
Full Text: DOI


[1] Zerbino, D.R., Birney, E.: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18, 821-829 (2008)
[2] Simpson, J.T., Wong, K., Jackman, S.D., et al.: ABySS: a parallel assembler for short read sequence data. Genome Res. 19, 1117-1123 (2009)
[3] Liu, Y., Schmidt, B., Maskell, D.L.: Parallelized short read assembly of large genomes using de Bruijn graphs. BMC Bioinformatics 12, 354 (2011)
[4] Luo, R., Liu, B., Xie, Y., et al.: SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. Gigascience 1, 18 (2012)
[5] Li, H., Homer, N.: A survey of sequence alignment algorithms for next-generation sequencing. Brief. Bioinform. 11, 473-83 (2010)
[6] Yang, X., Chockalingam, S.P., Aluru, S.: A survey of error-correction methods for next-generation sequencing. Brief. Bioinform. 14, 56-66 (2013)
[7] Peng, Y., Leung, H.C., Yiu, S.M., et al.: Meta-IDBA: a de novo assembler for metagenomic data. Bioinformatics 27, i94-i101 (2011)
[8] Yang, X., Zola, J., Aluru, S.: Parallel metagenomic sequence clustering via sketching and quasi-clique enumeration on map-reduce clouds. In: 25th International Parallel and Distributed Processing Symposium, pp. 1223-1233 (2011)
[9] Nguyen, T.D., Schmidt, B., Kwoh, C.K.: Fast Dendrogram-based OTU Clustering using Sequence Embedding. In: 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, pp. 63-72 (2014)
[10] Smith, A.D., Xuan, Z., Zhang, M.Q.: Using quality scores and longer reads improves accuracy of Solexa read mapping. BMC Bioinformatics 9, 128 (2008)
[11] Li, H., Ruan, J., Durbin, R.: Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 18, 1851-1858 (2008)
[12] Homer, N., Merriman, B., Nelson, S.F.: BFAST: an alignment tool for large scale genome resequencing. PLoS ONE 4, e7767 (2009)
[13] Langmead, B., Trapnell, C., Pop, M., et al.: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10, R25 (2009)
[14] Li, H., Durbin, R.: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1755-1760 (2009)
[15] Liu, Y., Schmidt, B., Maskell, D.L.: CUSHAW: a CUDA compatible short read aligner to large genomes based on the Burrows-Wheeler transform. Bioinformatics 28, 1830-1837 (2012)
[16] Li, R., Yu, C., Li, Y., et al.: SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics 25, 1966-1967 (2009)
[17] Li, H., Durbin, R.: Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26, 589-595 (2010)
[18] Rizk, G., Lavenier, D.: GASSST: global alignment short sequence search tool. Bioinformatics 26, 2534-2540 (2010)
[19] Langmead, B., Salzberg, S.: Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357-359 (2012)
[20] Liu, Y., Schmidt, B.: Long read alignment based on maximal exact match seeds. Bioinformatics 28, i318-i324 (2012)
[21] Liu, Y., Schmidt, B.: CUSHAW2-GPU: empowering faster gapped short-read alignment using GPU computing. IEEE Des. Test 31, 31-39 (2014)
[22] Liu, Y., Popp, B., Schmidt, B.: CUSHAW3: sensitive and accurate base-space and color-space short-read alignment with hybrid seeding. PLoS ONE 9, e86869 (2014)
[23] González-Domínguez, J., Liu, Y., Schmidt, B.: Parallel and scalable short-read alignment on multi-core clusters Using UPC++. PLoS ONE 11, e0145490 (2016)
[24] Marco-Sola, S., Sammeth, M., Guigó, R., et al.: The GEM mapper: fast, accurate and versatile alignment by filtration. Nat. Methods 9, 1885-1888 (2012)
[25] Mu, J.C., Jiang, H., Kiani, A., et al.: Fast and accurate read alignment for resequencing. Bioinformatics 28, 2366-2373 (2012)
[26] Luo, R., Wong, T., Zhu, J., et al.: SOAP3-dp: fast, accurate and sensitive GPU-based short-read aligner. PLoS ONE 8, e65632 (2013)
[27] Li, H.: Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM (2013). arXiv:1303.3997 [q-bio.GN]
[28] Altschul, S.F., Gish, W., Miller, W., et al.: Basic local alignment search tool. J. Mol. Biol. 215, 403-410 (1990)
[29] Burrows, M., Wheeler, D.J.: A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, Palo Alto, CA (1994)
[30] Ferragina, P., Manzini, G.: Indexing compressed text. J. ACM 52, 4 (2005) · Zbl 1323.68261
[31] Novoalign, http://www.novocraft.com/products/novoalign
[32] David, M., Dzamba, M., Lister, D., et al.: SHRiMP2: sensitive yet practical short read mapping. Bioinformatics 27, 1011-1012 (2011)
[33] Needleman, S.B., Wunsch, C.D.: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, 443-453 (1970)
[34] Smith, T.F., Waterman, M.S.: Identification of common molecular subsequences. J. Mol. Biol. 147, 195-197 (1981)
[35] Blom, J., Jakobi, T., Doppmeier, D., et al.: Exact and complete short read alignment to microbial genomes using Graphics Processing Unit programming. Bioinformatics 27, 1351-1358 (2011)
[36] Kiełbasa, S.M., Wan, R., Sato, K., et al.: Adaptive seeds tame genomic sequence comparison. Genome Res. 21, 487-493 (2011)
[37] Ma, B., Tromp, J., Li, M.: PatternHunter: faster and more sensitive homology search. Bioinformatics 18, 440-445 (2002)
[38] Rasmussen, K.R., Stoye, J., Myers, E.W.: Efficient q-gram filters for finding all epsilon-matches over a given length. J. Comput. Biol. 13, 296-308 (2006) · Zbl 1119.92329
[39] Ewing, B., Green, P.: Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 8, 186-194 (1998)
[40] Rognes, T.: Faster Smith-Waterman database searches with inter-sequence SIMD parallelisation. BMC Bioinformatics 12, 221 (2011)
[41] Liu, Y., Schmidt, B.: GSWABE: faster GPU-accelerated sequence alignment with optimal alignment retrieval for short DNA sequences. Concurr. Comput. Pract. Exp. 27, 958-972 (2014). doi:10.1002/cpe.3371
[42] Li, H., Handsaker, B., Wysoker, A., et al.: The sequence alignment/map format and SAMtools. Bioinformatics 25, 2078-2079 (2009)
[43] Highnam, G., Wang, J.J., Kusler, D., et al.: An analytical framework for optimizing variant discovery from personal genomes. Nat. Commun. 6, 6275 (2015)
[44] Sherry, S.T., Ward, M.H., Kholodov, M., et al.: dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 29, 308-311 (2001)
[45] Huang, W., Li, L., Myers, J.R., et al.: ART: a next-generation sequencing read simulator. Bioinformatics 28, 593-594 (2012)
[46] McKenna, A., Hanna, M., Banks, E., et al.: The Genome Analysis Toolkit: a MapReduce framework for analyzing next generation DNA sequencing data. Genome Res. 20, 1297-1303 (2010)
[47] Liu, Y., Schmidt, B., Maskell, D.L.: MSA-CUDA: multiple sequence alignment on graphics processing units with CUDA. In: 20th IEEE International Conference on Application-specific Systems, Architectures and Processors, pp. 121-128 (2009)
[48] Liu, Y., Maskelml, D.L., Schmidt, B.: CUDASW++: optimizing Smith-Waterman sequence database searches for CUDA-enabled graphics processing units. BMC Res. Notes 2, 73 (2009)
[49] Alachiotis, N., Berger, S.A., Stamatakis, A.: Coupling SIMD and SIMT architectures to boost performance of a phylogeny-aware alignment kernel. BMC Bioinformatics 13, 196 (2012)
[50] Liu, Y., Schmidt, B.: SWAPHI: Smith-Waterman protein database search on Xeon Phi coprocessors. In: 25th IEEE International Conference on Application-specific Systems, Architectures and Processors, pp. 184-185 (2014)
[51] Liu, Y., Tran, T.T., Lauenroth, F., et al.: SWAPHI-LS: Smith-Waterman algorithm on Xeon Phi coprocessors for long DNA sequences. In: 2014 IEEE International Conference on Cluster Computing, pp. 257-265 (2014)
[52] Wang, L.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.