An overview of multiple sequence alignments and cloud computing in bioinformatics. (English) Zbl 1300.92067

Summary: Multiple sequence alignment (MSA) of DNA, RNA, and protein sequences is one of the most essential techniques in molecular biology, computational biology, and bioinformatics. Next-generation sequencing technologies are changing the biology landscape, flooding the databases with massive amounts of raw sequence data. MSA of these ever-growing sequence data sets is becoming a significant bottleneck. To realise the promise of MSA for large-scale sequence data sets, existing MSA algorithms must be run in a parallelised fashion, with the sequence data distributed over a computing cluster or server farm. Combining MSA algorithms with cloud computing technologies is therefore likely to improve the speed and quality of MSA and its capability to handle large numbers of sequences. In this review, multiple sequence alignments are discussed, with a specific focus on the ClustalW and Clustal Omega algorithms. Cloud computing technologies and concepts are outlined, and the next generation of cloud-based MSA algorithms is introduced.

MSC:

92D20 Protein sequences, DNA sequences
