The mechanism of additive composition. (English) Zbl 1456.68218

Summary: Additive composition [P. W. Foltz et al., “The measurement of textual coherence with latent semantic analysis”, Discourse Process 15, No. 2–3, 285–307 (1998; doi:10.1080/01638539809545029); T. K. Landauer and S. T. Dumais, “A solution to Plato’s problem: the latent semantic analysis theory of acquisition, induction, and representation of knowledge”, Psychol. Rev. 104, No. 2, 211–240 (1997; doi:10.1037/0033-295X.104.2.211); J. Mitchell and M. Lapata, “Composition in distributional models of semantics”, Cognit. Sci. 34, No. 8, 1388–1429 (2010; doi:10.1111/j.1551-6709.2010.01106.x)] is a widely used method for computing the meaning of a phrase, which takes the average of the vector representations of its constituent words. In this article, we prove an upper bound for the bias of additive composition, which is the first theoretical analysis of compositional frameworks from a machine learning point of view. The bound is written in terms of collocation strength; we prove that the more exclusively two successive words tend to occur together, the more accurately their additive composition can be guaranteed to approximate the natural phrase vector. Our proof relies on properties of natural language data that are empirically verified, and can be theoretically derived from the assumption that the data is generated by a hierarchical Pitman-Yor process. The theory endorses additive composition as a reasonable operation for calculating the meanings of phrases, and suggests ways to improve additive compositionality, including: transforming the entries of distributional word vectors by a function that meets a specific condition, constructing a novel type of vector representation that makes additive composition sensitive to word order, and utilizing singular value decomposition to train word vectors.
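
A minimal sketch of the operation, not taken from the paper: the toy corpus, window size, PPMI entry transform, and dimensionality below are illustrative assumptions, but the pipeline (count-based vectors, an entry transform, SVD-trained word vectors, then averaging of constituent word vectors) mirrors the setting the summary describes.

import numpy as np

corpus = [
    "the black cat sat on the mat".split(),
    "a black dog chased the black cat".split(),
    "the dog sat on the mat".split(),
]
window = 2

vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Symmetric co-occurrence counts within a fixed window.
counts = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                counts[idx[w], idx[sent[j]]] += 1

# Positive PMI transform of the entries (one common choice of entry
# transform, used here only for illustration).
total = counts.sum()
row = counts.sum(axis=1, keepdims=True)
col = counts.sum(axis=0, keepdims=True)
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log(counts * total / (row * col))
ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)

# Low-dimensional word vectors via truncated SVD.
U, S, _ = np.linalg.svd(ppmi, full_matrices=False)
k = 5
vectors = U[:, :k] * S[:k]

# Additive composition: approximate the phrase "black cat" by the average
# of the vectors of its constituent words.
phrase_vec = (vectors[idx["black"]] + vectors[idx["cat"]]) / 2.0
print(phrase_vec)

In the paper’s terms, such an averaged vector is compared against the natural phrase vector obtained by treating the bigram as a single token; the bound states that the discrepancy shrinks as the two words collocate more exclusively.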

MSC:

68T50 Natural language processing
60F05 Central limit and other weak theorems
60G57 Random measures
62G05 Nonparametric estimation

References:

[1] Arora, S., Li, Y., Liang, Y., & Ma, T. (2016). A latent variable model approach to PMI-based word embeddings. Transactions of the Association for Computational Linguistics, 4, 385-399.
[2] Banea, C., Chen, D., Mihalcea, R., Cardie, C., & Wiebe, J. (2014). Simcompass: Using deep learning word embeddings to assess cross-level similarity. In: Proceedings of SemEval.
[3] Baroni, M., & Zamparelli, R. (2010). Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In: Proceedings of EMNLP.
[4] Blacoe, W., & Lapata, M. (2012). A comparison of vector-based representations for semantic composition. In: Proceedings of EMNLP.
[5] Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77-84. · doi:10.1145/2133806.2133826
[6] Boleda, G., Baroni, M., Pham, T.N., & McNally, L. (2013). Intensionality was only alleged: On adjective-noun composition in distributional semantics. In: Proceedings of IWCS.
[7] Bottou, L. (2012). Stochastic gradient descent tricks. In G. Montavon, G. B. Orr, & K.-R. Müller (Eds.), Neural Networks: Tricks of the Trade. Berlin: Springer.
[8] Burger, M., & Neubauer, A. (2001). Error bounds for approximation with neural networks. Journal of Approximation Theory, 112(2), 235-250. · Zbl 1004.41007 · doi:10.1006/jath.2001.3613
[9] Church, K. W., & Hanks, P. (1990). Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1), 22-29.
[10] Clarke, D. (2012). A context-theoretic framework for compositionality in distributional semantics. Computational Linguistics, 38(1), 41-47. · doi:10.1162/COLI_a_00084
[11] Clauset, A., Shalizi, C. R., & Newman, M. E. J. (2009). Power-law distributions in empirical data. SIAM Review, 51(4), 661-703. · Zbl 1176.62001 · doi:10.1137/070710111
[12] Coecke, B., Sadrzadeh, M., & Clark, S. (2010). Mathematical foundations for a compositional distributional model of meaning. Linguistic Analysis, 36(1), 345-384.
[13] Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P. (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12, 2493-2537. · Zbl 1280.68161
[14] Corral, A., Boleda, G., & Ferrer-i-Cancho, R. (2015). Zipf’s law for word frequencies: Word forms versus lemmas in long texts. PLoS One, 10(7), 1-23. · doi:10.1371/journal.pone.0129031
[15] Dagan, I., Pereira, F., & Lee, L. (1994). Similarity-based estimation of word cooccurrence probabilities. In: Proceedings of ACL. · Zbl 0928.68111
[16] Dinu, G., Pham, N.T., & Baroni, M. (2013). General estimation and evaluation of compositional distributional semantic models. In: Proceedings of the Workshop on Continuous Vector Space Models and their Compositionality.
[17] Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12, 2121-2159. · Zbl 1280.68164
[18] Foltz, P. W., Kintsch, W., & Landauer, T. K. (1998). The measurement of textual coherence with latent semantic analysis. Discourse Process, 15, 285-307. · doi:10.1080/01638539809545029
[19] Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation, 4(1), 1-58. · doi:10.1162/neco.1992.4.1.1
[20] Gnecco, G., & Sanguineti, M. (2008). Approximation error bounds via Rademacher’s complexity. Applied Mathematical Sciences, 2(4), 153-176. · Zbl 1169.42320
[21] Grefenstette, E., & Sadrzadeh, M. (2011). Experimental support for a categorical compositional distributional model of meaning. In: Proceedings of EMNLP.
[22] Guevara, E. (2010). A regression model of adjective-noun compositionality in distributional semantics. In: Proceedings of the Workshop on GEometrical Models of Natural Language Semantics.
[23] Gutmann, M. U., & Hyvärinen, A. (2012). Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of Machine Learning Research, 13(1), 307-361. · Zbl 1283.62064
[24] Ha, L. Q., Sicilia-Garcia, E. I., Ming, J., & Smith, F. J. (2002). Extension of Zipf’s law to words and phrases. In: Proceedings of COLING.
[25] Halko, N., Martinsson, P. G., & Tropp, J. A. (2011). Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2), 217-288. · Zbl 1269.65043 · doi:10.1137/090771806
[26] Harris, Z. S. (1954). Distributional structure. Word, 10, 146-162. · doi:10.1080/00437956.1954.11659520
[27] Hashimoto, K., Stenetorp, P., Miwa, M., & Tsuruoka, Y. (2014). Jointly learning word representations and composition functions using predicate-argument structures. In: Proceedings of EMNLP.
[28] Hashimoto, T., Alvarez-Melis, D., & Jaakkola, T. (2016). Word embeddings as metric recovery in semantic spaces. Transactions of the Association for Computational Linguistics, 4, 273-286.
[29] Iyyer, M., Manjunatha, V., Boyd-Graber, J., & Daumé III, H. (2015). Deep unordered composition rivals syntactic methods for text classification. In: Proceedings of ACL.
[30] Kobayashi, H. (2014), Perplexity on reduced corpora. In: Proceedings of ACL.
[31] Landauer, T. K. (2002). On the computational basis of learning and cognition: Arguments from LSA. In N. Ross (Ed.), The Psychology of Learning and Motivation (Vol. 41). Cambridge.
[32] Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2), 211-240. · doi:10.1037/0033-295X.104.2.211
[33] Landauer, T.K., Laham, D., Rehder, B., & Schreiner, M.E. (1997). How well can passage meaning be derived without using word order? a comparison of latent semantic analysis and humans. In: Proceedings of Annual Conference of the Cognitive Science Society.
[34] Lebret, R., & Collobert, R. (2014). Word embeddings through Hellinger PCA. In: Proceedings of EACL.
[35] Levy, O., & Goldberg, Y. (2014a). Linguistic regularities in sparse and explicit word representations. In: Proceedings of CoNLL.
[36] Levy, O., & Goldberg, Y. (2014b). Neural word embedding as implicit matrix factorization. In: Advances in Neural Information Processing Systems (NIPS) 27, 2177-2185.
[37] Levy, O., Goldberg, Y., & Dagan, I. (2015). Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3, 211-225.
[38] Melamud, O., Goldberger, J., & Dagan, I. (2016). context2vec: Learning generic context embedding with bidirectional LSTM. In: Proceedings of CoNLL.
[39] Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013a). Distributed representations of words and phrases and their compositionality. In NIPS’13 Proceedings of the 26th International Conference on Neural Information Processing Systems (pp. 3111-3119).
[40] Mikolov, T., Yih, W.-t., & Zweig, G. (2013b). Linguistic regularities in continuous space word representations. In: Proceedings of NAACL-HLT.
[41] Miller, G. A., & Charles, W. G. (1991). Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1), 1-28. · doi:10.1080/01690969108406936
[42] Mitchell, J., & Lapata, M. (2008). Vector-based models of semantic composition. In: Proceedings of ACL-HLT.
[43] Mitchell, J., & Lapata, M. (2010). Composition in distributional models of semantics. Cognitive Science, 34(8), 1388-1429. · doi:10.1111/j.1551-6709.2010.01106.x
[44] Montemurro, M. A. (2001). Beyond the Zipf-Mandelbrot law in quantitative linguistics. Physica A: Statistical Mechanics and its Applications, 300(3), 567-578. · Zbl 0978.68126 · doi:10.1016/S0378-4371(01)00355-7
[45] Muraoka, M., Shimaoka, S., Yamamoto, K., Watanabe, Y., Okazaki, N., & Inui, K. (2014). Finding the best model among representative compositional models. In: Proceedings of PACLIC.
[46] Niyogi, P., & Girosi, F. (1999). Generalization bounds for function approximation from scattered noisy data. Advances in Computational Mathematics, 10, 51-80. · Zbl 1053.65506 · doi:10.1023/A:1018966213079
[47] Paperno, D., Pham, N.T., & Baroni, M. (2014). A practical and linguistically-motivated approach to compositional distributional semantics. In: Proceedings of ACL.
[48] Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global vectors for word representation. In: Proceedings of EMNLP.
[49] Pham, N.T., Kruszewski, G., Lazaridou, A., & Baroni, M. (2015). Jointly optimizing word representations for lexical and sentential tasks with the c-phrase model. In: Proceedings of ACL.
[50] Pitman, J. (2006). Combinatorial Stochastic Processes. Berlin: Springer-Verlag. · Zbl 1103.60004
[51] Pitman, J., & Yor, M. (1997). The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. Annals of Probability, 25, 855-900. · Zbl 0880.60076 · doi:10.1214/aop/1024404422
[52] Rothe, S., & Schütze, H. (2015). Autoextend: Extending word embeddings to embeddings for synsets and lexemes. In: Proceedings of ACL-IJCNLP.
[53] Socher, R., Huang, E. H., Pennin, J., & Manning, C. D. (2011). Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. Advances in NIPS, 24, 801-809.
[54] Socher, R., Huval, B., Manning, C.D., & Ng, A.Y. (2012). Semantic compositionality through recursive matrix-vector spaces. In: Proceedings of EMNLP.
[55] Stratos, K., Collins, M., & Hsu, D. (2015). Model-based word embeddings from decompositions of count matrices. In: Proceedings of ACL-IJCNLP.
[56] Takase, S., Okazaki, N., & Inui, K. (2016). Composing distributed representations of relational patterns. In: Proceedings of ACL.
[57] Teh, Y. W. (2006). A hierarchical Bayesian language model based on Pitman-Yor processes. In: Proceedings of ACL.
[58] The BNC Consortium (2007). The British National Corpus, version 3 (BNC XML edition). Distributed by Oxford University Computing Services, http://www.natcorp.ox.ac.uk/
[59] Tian, R., Miyao, Y., & Matsuzaki, T. (2014). Logical inference on dependency-based compositional semantics. In: Proceedings of ACL.
[60] Tian, R., Okazaki, N., & Inui, K. (2016). Learning semantically and additively compositional distributional representations. In: Proceedings of ACL.
[61] Turian, J., Ratinov, L.A., & Bengio, Y. (2010). Word representations: A simple and general method for semi-supervised learning. In: Proceedings of ACL.
[62] Turney, P. D. (2001). Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In: Proceedings of ECML. · Zbl 1007.68551
[63] Turney, P. D., & Pantel, P. (2010). From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37(1), 141-188. · Zbl 1185.68765
[64] Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Berlin: Springer-Verlag. · Zbl 0833.62008 · doi:10.1007/978-1-4757-2440-0
[65] Zanzotto, F. M., Korkontzelos, I., Fallucchi, F., & Manandhar, S. (2010). Estimating linear models for compositional distributional semantics. In: Proceedings of COLING.
[66] Zipf, G. K. (1935). The Psychobiology of Language: An Introduction to Dynamic Philology. Cambridge: M.I.T. Press.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.