zbMATH — the first resource for mathematics

All fingers are not the same: handling variable-length sequences in a discriminative setting using conformal multi-instance kernels. (English) Zbl 1443.92144
Schwartz, Russell (ed.) et al., 17th international workshop on algorithms in bioinformatics, WABI 2017, Boston, MA, USA, August 21–23, 2017. Proceedings. Wadern: Schloss Dagstuhl – Leibniz Zentrum für Informatik. LIPIcs – Leibniz Int. Proc. Inform. 88, Article 16, 14 p. (2017).
Summary: Most string kernels for comparing genomic sequences rely on the (absolute) positions of features in the individual sequences. This limits their use when comparing variable-length sequences. For example, profiling chromatin interactions by 3C-based experiments results in variable-length genomic sequences (restriction fragments). Here, the exact position-wise occurrence of signals in sequences may not be as important as in the analysis of promoter sequences, which typically have a transcription start site as reference. Existing position-aware string kernels have been shown to be useful for the latter scenario.
In this work, we propose a novel approach for sequence comparison that allows greater positional freedom than most existing approaches, can identify a possibly dispersed set of features when comparing variable-length sequences, and can handle both of the aforementioned scenarios. Our approach, CoMIK, identifies not just the features useful for classification but also their locations in the variable-length sequences, as evidenced by the results of three binary classification experiments, aided by recently introduced visualization techniques. Furthermore, we show that we are able to efficiently retrieve and interpret the weight vector for the complex setting of multiple multi-instance kernels.
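The general idea of comparing variable-length sequences as bags of segments can be illustrated with a minimal sketch. This is not the CoMIK implementation: it assumes a plain normalized set kernel over fixed-size segments with k-mer counts as features, and it does not model the conformal weighting of instances. All function names and the parameters `k`, `size`, and `shift` are hypothetical choices made for illustration.

```python
from collections import Counter
from math import sqrt


def segments(seq, size, shift):
    """Split a sequence into overlapping segments (the instances of its bag).

    A trailing remainder shorter than a full window is dropped in this sketch.
    """
    if len(seq) <= size:
        return [seq]
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, shift)]


def spectrum(segment, k):
    """Count k-mers in a segment; positions inside the segment are ignored."""
    return Counter(segment[i:i + k] for i in range(len(segment) - k + 1))


def segment_kernel(a, b):
    """Linear kernel between two k-mer count vectors."""
    return sum(count * b[kmer] for kmer, count in a.items())


def bag_kernel(x, y, k=2, size=6, shift=3):
    """Unnormalized set kernel: sum of segment kernels over all instance pairs."""
    bag_x = [spectrum(s, k) for s in segments(x, size, shift)]
    bag_y = [spectrum(s, k) for s in segments(y, size, shift)]
    return sum(segment_kernel(a, b) for a in bag_x for b in bag_y)


def normalized_bag_kernel(x, y, **params):
    """Scale the set kernel so that sequence length does not dominate."""
    denom = sqrt(bag_kernel(x, x, **params) * bag_kernel(y, y, **params))
    return bag_kernel(x, y, **params) / denom if denom else 0.0
```

Because every segment of one sequence is compared with every segment of the other, a shared signal contributes to the kernel value wherever it occurs, and the normalization keeps sequences of different lengths comparable.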
For the entire collection see [Zbl 1372.68022].
MSC:
92D20 Protein sequences, DNA sequences
92-08 Computational methods for problems pertaining to biology
Software:
KIRMES; CoMIK