zbMATH — the first resource for mathematics

DRESS: dimensionality reduction for efficient sequence search. (English) Zbl 1405.68465
Summary: Similarity search in large sequence databases is a problem ubiquitous in a wide range of application domains, including searching biological sequences. In this paper we focus on protein and DNA data, and we propose a novel approximate method for speeding up range queries under the edit distance. Our method works in a filter-and-refine manner, and its key novelty is a query-sensitive mapping that transforms the original string space to a new string space of reduced dimensionality. Specifically, it first identifies the \(t\) most frequent codewords in the query, and then uses these codewords to convert both the query and the database to a more compact representation. This is achieved by replacing every occurrence of each codeword with a new letter and by removing the remaining parts of the strings. Using this new representation, our method identifies a set of candidate matches that are likely to satisfy the range query, and finally refines these candidates in the original space. The main advantage of our method, compared to alternative methods for whole sequence matching under the edit distance, is that it does not require any training to create the mapping, and it can handle large query lengths with negligible losses in accuracy. Our experimental evaluation demonstrates that, for higher range values and large query sizes, our method produces significantly lower costs and runtimes compared to two state-of-the-art competitor methods.
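The filter-and-refine idea in the summary can be illustrated with a minimal Python sketch. It is not the authors' implementation: the summary does not say how codewords are defined, how occurrences are scanned, or which threshold is used in the reduced space, so this sketch assumes fixed-length codewords of length k, a greedy left-to-right non-overlapping scan, and a reduced-space threshold that defaults to the original radius. All function and parameter names (top_codewords, reduce_string, range_query, reduced_radius) are hypothetical.

```python
from collections import Counter


def top_codewords(query, k, t):
    """Return the t most frequent length-k substrings of the query (assumed codeword model)."""
    counts = Counter(query[i:i + k] for i in range(len(query) - k + 1))
    return [word for word, _ in counts.most_common(t)]


def reduce_string(s, codewords, k):
    """Replace each codeword occurrence with one new letter and drop everything else.

    Assumption: occurrences are found by a greedy, non-overlapping left-to-right scan.
    """
    letter = {word: chr(ord("a") + i) for i, word in enumerate(codewords)}
    out, i = [], 0
    while i + k <= len(s):
        word = s[i:i + k]
        if word in letter:
            out.append(letter[word])
            i += k  # skip past the matched codeword
        else:
            i += 1  # this character is not part of a codeword, so it is removed
    return "".join(out)


def edit_distance(a, b):
    """Standard dynamic-programming (Levenshtein) edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def range_query(query, database, radius, k=2, t=3, reduced_radius=None):
    """Filter in the reduced space, then refine candidates in the original space."""
    if reduced_radius is None:
        reduced_radius = radius  # assumption: reuse the original radius as the filter threshold
    codewords = top_codewords(query, k, t)
    reduced_query = reduce_string(query, codewords, k)
    candidates = [
        s for s in database
        if edit_distance(reduced_query, reduce_string(s, codewords, k)) <= reduced_radius
    ]
    return [s for s in candidates if edit_distance(query, s) <= radius]


if __name__ == "__main__":
    db = ["ACGTACGTAC", "ACGTACGTAA", "TTTTTTTTTT"]
    print(range_query("ACGTACGTAC", db, radius=1, k=2, t=2))
```

Because the mapping is built from the query alone, the database strings are reduced at query time and no training step is needed. The filter is approximate: a true match can be discarded if its reduced form differs too much from the reduced query, which is why the reduced-space threshold is left as a tunable parameter in this sketch.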
MSC:
68W32 Algorithms on strings
92D20 Protein sequences, DNA sequences
References:
[1] Altschul, S.; Madden, T.; Schäffer, R.; Zhang, J.; Zhang, Z.; Miller, W.; Lipman, D., Gapped blast and psi-blast: a new generation of protein database search programs, Nucleic Acids Res, 25, 3389-3402, (1997)
[2] Altschul, SF; Gish, W.; Miller, W.; Myers, EW; Lipman, DJ, Basic local alignment search tool, J Mol Biol, 215, 403-410, (1990)
[3] Arasu A, Ganti V, Kaushik R (2006) Efficient exact set-similarity joins. In: Proceedings of very large database endowment (PVLDB), pp 918-929
[4] Baeza-Yates, R.; Gonnet, GH, A new approach to text searching, Commun ACM, 35, 74-82, (1992)
[5] Behm A, Vernica R, Alsubaiee S, Ji S, Lu J, Jin L, Lu Y, Li C (2010) UCI Flamingo Package 4.0. http://flamingo.ics.uci.edu/releases/4.0/
[6] Bhadra, R.; Sandhya, S.; Abhinandan, KR; Chakrabarti, S.; Sowdhamini, R.; Srinivasan, N., Cascade psi-blast web server: a remote homology search tool for relating protein domains, Nucleic Acids Res, 34, 143-146, (2006)
[7] Burrows M, Wheeler DJ (1994) A block-sorting lossless data compression algorithm. Tech. Rep. 124, Systems Research Center, Palo Alto, http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.37.6774
[8] Hjaltason, G.; Samet, H., Properties of embedding methods for similarity searching in metric spaces, IEEE Trans Pattern Anal Mach Intell (PAMI), 25, 530-549, (2003)
[9] Jongeneel, CV, Searching the expressed sequence tag (est) databases: panning for genes, Bioinformatics, 1, 76-92, (2000)
[10] Kalafus, KJ; Jackson, AR; Milosavljevic, A., Pash: efficient genome-scale sequence anchoring by positional hashing, Genome Res, 14, 672-678, (2004)
[11] Kent WJ (2002) Resource BLAT-The BLAST-like alignment tool. Genome Res
[12] Kim MS, Whang KY, Lee JG, Lee MJ (2005a) n-gram/2l: a space and time efficient two-level n-gram inverted index structure. In: Proceedings of the 31st international conference on very large data bases, VLDB Endowment, pp 325-336
[13] Kim, YJ; Boyd, A.; Athey, BD; Patel, JM, miblast: scalable evaluation of a batch of nucleotide sequence queries with blast, Nucleic Acids Res, 33, 4335-4344, (2005)
[14] Korf, I.; Gish, W., Mpblast: improved blast performance with multiplexed queries, Bioinformatics, 16, 1052-1053, (2000)
[15] Langmead, B.; Trapnell, C.; Pop, M.; Salzberg, SL; et al., Ultrafast and memory-efficient alignment of short dna sequences to the human genome, Genome Biol, 10, r25, (2009)
[16] Li C, Wang B, Yang X (2007) Vgram: improving performance of approximate queries on string collections using variable-length grams. In: Proceedings of the 33rd international conference on Very large data bases, VLDB Endowment, pp 303-314
[17] Li C, Lu J, Lu Y (2008a) Efficient merging and filtering algorithms for approximate string searches. International conference on data engineering (ICDE)
[18] Li, H.; Ruan, J.; Durbin, R., Mapping short dna sequencing reads and calling variants using mapping quality scores, Genome Res, 18, 1851-1858, (2008)
[19] Li, R.; Li, Y.; Kristiansen, K.; Wang, J., Soap: short oligonucleotide alignment program, Bioinformatics, 24, 713-714, (2008)
[20] Li, Y.; Patel, JM; Terrell, A., Wham: a high-throughput sequence alignment method, ACM Trans Database Syst (TODS), 37, 28, (2012)
[21] Litwin W, Mokadem R, Rigaux P, Schwarz T (2007) Fast ngram-based string search over data encoded using algebraic signatures. In: Proceedings of the very large database endowment (PVLDB), pp 207-218
[22] Liu, B.; Wang, X.; Zou, Q.; Dong, Q.; Chen, Q., Protein remote homology detection by combining Chou's pseudo amino acid composition and profile-based protein representation, Mol Inf, 32, 775-782, (2013)
[23] Meek C, Patel JM, Kasetty S (2003) Oasis: an online and accurate technique for local-alignment searches on biological sequences. In: Proceedings of very large database endowment (PVLDB), vol 29, pp 910-921
[24] Needleman, SB; Wunsch, CD, A general method applicable to the search for similarities in the amino acid sequence of two proteins, J Mol Biol, 48, 443-453, (1970)
[25] Ning, Z.; Cox, AJ; Mullikin, JC, SSAHA: A fast search method for large dna databases, Genome Res, 11, 1725-1729, (2001)
[26] Papapetrou, P.; Athitsos, V.; Kollios, G.; Gunopulos, D., Reference-based alignment in large sequence databases, Proc Very Large Database Endow (PVLDB), 2, 205-216, (2009)
[27] Smith, TF; Waterman, MS, Identification of common molecular subsequences, J Mol Biol, 147, 195-197, (1981)
[28] Tian, Y.; Mceachin, RC; Santos, C.; States, DJ; Patel, JM, Saga: A subgraph matching tool for biological graphs, Bioinformatics, 23, 232-239, (2007)
[29] Traina C, Traina AJM, Seeger B, Faloutsos C (2000) Slim-trees: high performance metric trees minimizing overlap between nodes. International conference on extending database technology (EDBT), pp 51-65
[30] Venkateswaran J, Lachwani D, Kahveci T, Jermaine C (2006) Reference-based indexing of sequence databases. In: International conference on very large databases (VLDB), pp 906-917
[31] Vergoulis, T.; Dalamagas, T.; Sacharidis, D.; Sellis, TK, Approximate regional sequence matching for genomic databases, VLDB J, 21, 779-795, (2012)
[32] Vieira MR, Traina C, Chino FJT, Traina AJM (2004) Dbm-tree: a dynamic metric access method sensitive to local density data. Brazilian symposium on databases (SBBD), pp 163-177
[33] Wandelt S, Starlinger J, Bux M, Leser U (2013) Rcsi: scalable similarity search in thousand(s) of genomes. Proceedings of the VLDB Endowment (PVLDB) p (to appear)
[34] Wu, S.; Manber, U., Fast text searching: allowing errors, Commun ACM, 35, 83-91, (1992)
[35] Yan, X.; Yu, PS; Han, J., Graph indexing based on discriminative frequent structure analysis, ACM Trans Database Syst, 30, 960-993, (2005)
[36] Yang X, Wang B, Li C (2008) Cost-based variable-length-gram selection for string collections to support approximate queries efficiently. In: Proceedings of the 2008 ACM SIGMOD international conference on Management of data, ACM, pp 353-364
[37] Zhang, Z.; Schwartz, S.; Wagner, L.; Miller, W., A greedy algorithm for aligning dna sequences, J Comput Biol, 7, 203-214, (2000)
[38] Zhu, H.; Kollios, G.; Athitsos, V., A generic framework for efficient and effective subsequence retrieval, Proc VLDB Endow (PVLDB), 5, 1579-1590, (2012)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.