A semantic relatedness preserved subset extraction method for language corpora based on pseudo-Boolean optimization. (English) Zbl 1461.68237

Text corpora in natural-language research contain billions of words and the size is growing, which has created the problem of extracting smaller subsets with a minimally changed semantics. Let \(T=\{t_1,\dots,t_n\}\) be a set of tokens (e.g. words) in an annotated text corpus with real-valued unary and binary attributes and semantic relatedness relations \(S^1\in\mathcal{R}^n\), \(S^2\in\mathcal{R}^{n\times n}\), \(S^3\in\mathcal{R}^{n\times n\times n}\); \(X=\{x_1,\dots,x_n\}\in\{0,1\}^n\) be Boolean variables to denote subsets from \(T\). The problem of semantics relatedness preservation in corpora subset extraction is finding an ‘optimal’ (minimal) subset \(X\subset T\) which maximizes \(\sum\limits_{i=1}^ns_i^1{x_i}+\sum\limits_{i,j=1}^ns_{ij}^2x_ix_j+\sum\limits_{i,j,k = 1}^ns_{ijk}^3x_ix_jx_k\) under constraints for attributes (here, one unary and one binary attribute constraint are considered). This NP-hard problem is transformed into the problem of finding the maximum flow in an equivalent graph and solved using the discrete Lagrangian iteration method.


68T50 Natural language processing
90C27 Combinatorial optimization
Full Text: DOI


[1] Yen, T.-H.; Wu, J.-C.; Chang, J.; Boisson, J.; Chang, J., Writeahead: mining grammar patterns in corpora for assisted writing, (Proceedings of ACL-IJCNLP 2015 System Demonstrations (2015)), 139-144
[2] Miller, D.; Biber, D., Evaluating reliability in quantitative vocabulary studies: the influence of corpus design and composition, Int. J. Corpus Linguist., 20, 1, 30-53 (2015)
[3] Caliskan, A.; Bryson, J. J.; Narayanan, A., Semantics derived automatically from language corpora contain human-like biases, Science, 356, 6334, 183-186 (2017)
[4] Aston, G., Acquiring the language of interpreters: a corpus-based approach, (Making Way in Corpus-Based Interpreting Studies (2018), Springer), 83-96
[5] Wen, T.-H.; Gasic, M.; Kim, D.; Mrksic, N.; Su, P.-H.; Vandyke, D.; Young, S., Stochastic language generation in dialogue using recurrent neural networks with convolutional sentence reranking, arXiv preprint
[6] Crossley, S. A.; Dascalu, M.; McNamarac, D. S., How important is size? An investigation of corpus size and meaning in both latent semantic analysis and latent Dirichlet allocation, (30th International Florida Artificial Intelligence Research Society Conference, FLAIRS 2017 (2017), AAAI Press)
[7] Ebeling, S. O., Does corpus size matter? Revisiting ENPC case studies with an extended version of the corpus, Nord. J. Engl. Stud., 15, 3, 33-54 (2016)
[8] Lin, H.; Bilmes, J., Optimal selection of limited vocabulary speech corpora, (Twelfth Annual Conference of the International Speech Communication Association (2011))
[9] Liu, Y.; Iyer, R.; Kirchhoff, K.; Bilmes, J., Svitchboard ii and fisver I: high-quality limited-complexity corpora of conversational English speech, (Sixteenth Annual Conference of the International Speech Communication Association (2015))
[10] Cutting, D.; Kupiec, J.; Pedersen, J.; Sibun, P., A practical part-of-speech tagger, (Proceedings of the Third Conference on Applied Natural Language Processing, Association for Computational Linguistics (1992)), 133-140
[11] Agarwal, P.; Strötgen, J.; Del Corro, L.; Hoffart, J.; Weikum, G., diaNED: time-aware named entity disambiguation for diachronic corpora, (Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), vol. 2 (2018)), 686-693
[12] Hearst, M. A., Automatic acquisition of hyponyms from large text corpora, (Proceedings of the 14th Conference on Computational Linguistics-Volume 2, Association for Computational Linguistics (1992)), 539-545
[13] Evans, C.; Yuan, D., A large corpus for supervised word-sense disambiguation (2017)
[14] Poesio, M., Discourse annotation and semantic annotation in the gnome corpus, (Proceedings of the 2004 ACL Workshop on Discourse Annotation, Association for Computational Linguistics (2004)), 72-79
[15] Garrido, J. M.; Laplaza, Y.; Marquina, M.; Pearman, A.; Escalada, J. G.; Crespo, M.Á. R.; Armenta, A., The I3MEDIA speech database: a trilingual annotated corpus for the analysis and synthesis of emotional speech, (LREC, Citeseer (2012)), 1197-1202
[16] Bowker, L.; Pearson, J., Working with Specialized Language: A Practical Guide to Using Corpora (2002), Routledge
[17] Tangherlini, T. R.; Leonard, P., Trawling in the sea of the great unread: sub-corpus topic modeling and humanities research, Poetics, 41, 6, 725-749 (2013)
[18] Witten, I. H.; Milne, D. N., An Effective, Low-cost Measure of Semantic Re-latedness Obtained from Wikipedia Links (2008), IAAA Press
[19] E.-Y. Ran, D. Yanay, Methods and systems of supervised learning of semantic relatedness, uS Patent 8,909,648, Dec. 9, 2014.
[20] Hulpuş, I.; Prangnawarat, N.; Hayes, C., Path-based semantic relatedness on linked data and its use to word and entity disambiguation, (International Semantic Web Conference (2015), Springer), 442-457
[21] Marelli, M.; Bentivogli, L.; Baroni, M.; Bernardi, R.; Menini, S.; Zamparelli, R., Semeval-2014 task 1: evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment, (Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014) (2014)), 1-8
[22] Boros, E.; Gruber, A., On quadratization of pseudo-Boolean functions, arXiv preprint · Zbl 1403.90512
[23] Billionnet, A.; Minoux, M., Maximizing a supermodular pseudoboolean function: a polynomial algorithm for supermodular cubic functions, Discrete Appl. Math., 12, 1, 1-11 (1985) · Zbl 0583.90067
[24] Rhys, J., A selection problem of shared fixed costs and network flows, Manag. Sci., 17, 3, 200-207 (1970) · Zbl 0203.52505
[25] Shang, Y.; Wah, B. W., A discrete Lagrangian-based global-search method for solving satisfiability problems, J. Glob. Optim., 12, 1, 61-99 (1998) · Zbl 0904.90154
[26] Boriboon, M.; Kriengket, K.; Chootrakool, P.; Phaholphinyo, S.; Purodakananda, S.; Thanakulwarapas, T.; Kosawat, K., Best corpus development and analysis, (2009 International Conference on Asian Language Processing (2009), IEEE), 322-327
[27] Brown, P. F.; Desouza, P. V.; Mercer, R. L.; Pietra, V. J.D.; Lai, J. C., Class-based n-gram models of natural language, Comput. Linguist., 18, 4, 467-479 (1992)
[28] Mozes, S.; Nikolaev, K.; Nussbaum, Y.; Weimann, O., Minimum cut of directed planar graphs in \(o(n \log \log n)\) time, (Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms (2018), SIAM), 477-494 · Zbl 1403.68172
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.