
Bioinformatics: organisms from Venus, technology from Jupiter, algorithms from Mars. (English) Zbl 1293.93686

Summary: In this paper, we discuss data sets generated by microarray technology, which makes it possible to measure the activity or expression of thousands of genes in parallel. We discuss the basics of the technology, how to preprocess the data, and how classical and newly developed algorithms can be used to gain insight into the biological processes that generated the data. The algorithms we discuss are Principal Component Analysis, clustering techniques such as hierarchical clustering and Adaptive Quality-Based Clustering, and statistical sampling methods such as Markov chain Monte Carlo and Gibbs sampling. We illustrate these algorithms with several real-life cases: diagnostics and class discovery in leukemia, functional genomics research on the mitotic cell cycle of yeast, and motif detection in Arabidopsis thaliana using DNA background models. We also discuss some bioinformatics software platforms. In the final part of the manuscript, we present some future perspectives on the development of bioinformatics, including some visionary discussions on technology, algorithms, systems biology and computational biomedicine.
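
As a rough illustration of the first algorithm named in the summary, the sketch below applies Principal Component Analysis to a small expression matrix via the singular value decomposition. It is not taken from the paper: the synthetic data, the genes-by-samples layout, the per-gene centering, and the names `pca_expression` and `make_synthetic` are all assumptions made for this example.

```python
import numpy as np


def pca_expression(X, n_components=2):
    """Project samples (columns of X) onto the leading principal components.

    X is assumed to be a genes x samples matrix of (log) expression values.
    """
    # Center each gene (row) across samples (an assumed preprocessing step).
    Xc = X - X.mean(axis=1, keepdims=True)
    # Thin SVD: Xc = U @ diag(s) @ Vt, singular values sorted descending.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Coordinates of each sample along the retained components.
    scores = (s[:n_components, None] * Vt[:n_components]).T
    # Fraction of the total variance captured by each component.
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, explained[:n_components]


def make_synthetic(n_genes=200, n_per_class=10, seed=0):
    """Hypothetical data: two sample classes differing in 20 genes."""
    rng = np.random.default_rng(seed)
    X = rng.normal(0.0, 1.0, size=(n_genes, 2 * n_per_class))
    # Raise the first 20 genes in the second class only.
    X[:20, n_per_class:] += 4.0
    labels = np.array([0] * n_per_class + [1] * n_per_class)
    return X, labels


X, labels = make_synthetic()
scores, explained = pca_expression(X)
```

With data of this kind, the first column of `scores` would be expected to separate the two sample classes, which is the sort of class structure the summary refers to under diagnostics and class discovery.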

MSC:

93E03 Stochastic systems in control theory (general)
92B05 General biology and biomathematics
62H30 Classification and discrimination; cluster analysis (statistical aspects)
