
word.alignment: an R package for computing statistical word alignment and its evaluation. (English) Zbl 07311689

Summary: Word alignment has many applications in natural language processing (NLP). To the best of our knowledge, no word alignment package has been available in the R environment. This paper introduces word.alignment, a new R package that implements statistical word alignment as an unsupervised learning method. It uses IBM Model 1 as the translation model, estimating its parameters with the EM algorithm and finding the best alignment with a Viterbi search. It also provides symmetric alignment using three heuristic methods: union, intersection, and grow-diag. In addition, it can build a bilingual dictionary automatically by applying an innovative rule; the generated dictionary is suitable for a number of NLP tasks. The package also provides functions for measuring word alignment quality by comparing an alignment with a gold standard alignment using five metrics. It is easy to install and runs on the most widely used platforms. It is also easy to use, and we show that its results are better in almost all cases than those of some other word alignment tools. Finally, some examples illustrating the use of word.alignment are provided.
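The pipeline named in the summary (IBM Model 1 trained with EM, a Viterbi search for the best alignment, then symmetrization by intersection or union) can be sketched on a toy corpus. The Python code below is only an illustration under stated assumptions: it is not the word.alignment package or its R interface, all function names (train_ibm1, viterbi_align, symmetrize, precision_recall) and the three-sentence corpus are hypothetical, the NULL word and the grow-diag heuristic are omitted for brevity, and precision and recall are assumed stand-ins for the five metrics, which the summary does not name.

```python
from collections import defaultdict


def train_ibm1(pairs, iterations=10):
    """Estimate t(f | e) by EM from (source_tokens, target_tokens) pairs.

    Simplified IBM Model 1: no NULL source word (an assumption of this sketch).
    """
    target_vocab = {f for _, f_sent in pairs for f in f_sent}
    # Uniform initialisation of the lexical translation table.
    t = defaultdict(lambda: 1.0 / len(target_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (f, e) links
        total = defaultdict(float)  # expected count of links leaving e
        # E-step: spread each target word over the source words of its sentence.
        for e_sent, f_sent in pairs:
            for f in f_sent:
                norm = sum(t[(f, e)] for e in e_sent)
                for e in e_sent:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalise the expected counts into probabilities.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t


def viterbi_align(e_sent, f_sent, t):
    """Return links (source_index, target_index) maximising t(f_j | e_i).

    Under Model 1 the best alignment factorises over target positions,
    so the Viterbi search reduces to an independent argmax per target word.
    """
    links = set()
    for j, f in enumerate(f_sent):
        best_i = max(range(len(e_sent)), key=lambda i: t[(f, e_sent[i])])
        links.add((best_i, j))
    return links


def symmetrize(e2f_links, f2e_links):
    """Combine two directional alignments; returns (intersection, union)."""
    flipped = {(i, j) for j, i in f2e_links}  # express both as (e_idx, f_idx)
    return e2f_links & flipped, e2f_links | flipped


def precision_recall(predicted, gold):
    """Precision and recall of predicted links against gold-standard links."""
    hits = len(predicted & gold)
    return hits / len(predicted), hits / len(gold)


# Hypothetical toy parallel corpus (English source, German target).
corpus = [
    (["the", "house"], ["das", "Haus"]),
    (["the", "book"], ["das", "Buch"]),
    (["a", "book"], ["ein", "Buch"]),
]
t_e2f = train_ibm1(corpus)
t_f2e = train_ibm1([(f, e) for e, f in corpus])

e_sent, f_sent = corpus[0]
forward = viterbi_align(e_sent, f_sent, t_e2f)
backward = viterbi_align(f_sent, e_sent, t_f2e)
intersection, union = symmetrize(forward, backward)
print(sorted(intersection), sorted(union))
```

Intersection keeps only the links both directional models agree on and so favours precision, while union keeps every link proposed by either direction and so favours recall; grow-diag, not shown here, starts from the intersection and adds neighbouring links from the union.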

MSC:

65C60 Computational problems in statistics (MSC2010)

References:

[1] Benoit, K.; Watanabe, K.; Wang, H.; Nulty, P.; Obeng, A.; Müller, S.; Matsuo, A., quanteda: an R package for the quantitative analysis of textual data, J Open Source Softw, 3, 30, 774 (2018)
[2] Brown, PF; Cocke, J.; Pietra, SAD; Pietra, VJD; Jelinek, F.; Lafferty, JD; Mercer, RL; Roossin, PS, A statistical approach to machine translation, Comput Linguist, 16, 2, 79-85 (1990)
[3] Brown, PF; Pietra, VJD; Pietra, SAD; Mercer, RL, The mathematics of statistical machine translation: parameter estimation, Comput Linguist, 19, 2, 263-311 (1993)
[4] Brunning JJJ (2010) Alignment models and algorithms for statistical machine translation. Doctoral dissertation. University of Cambridge
[5] Chéragui MA (2012) Theoretical overview of machine translation. In: Proceedings ICWIT, pp 160-169
[6] Daneshgar N, Sarmad M (2019) word.alignment: computing word alignment using IBM model 1 (and symmetrization) for a given parallel corpus and its evaluation. R package version 1.1
[7] Déchelotte D, Schwenk H, Bonneau-Maynard H, Allauzen A, Adda G (2007) A state-of-the-art statistical machine translation system based on moses. In: MT Summit, pp 127-133
[8] Dowle M, Srinivasan A, Short T, Lianoglou S, Saporta R, Antonyan E (2017) data.table: extension of data.frame. R package version 1.10.4-3
[9] Feinerer I, Hornik K (2015) tm: text mining package. R package version 0.6-1
[10] Fraser, A.; Marcu, D., Measuring word alignment quality for statistical machine translation, Comput Linguist, 33, 3, 293-303 (2007) · Zbl 1234.68407
[11] Holmqvist M, Ahrenberg L (2011) A gold standard for English-Swedish word alignment. In: Proceedings of the 18th Nordic conference of computational linguistics (NODALIDA 2011), pp 106-113
[12] Hornik K (2015) NLP: natural language processing infrastructure. R package version 0.1-7
[13] Ildefonso, T.; Lopes, GP, Longest sorted sequence algorithm for parallel text alignment, International conference on computer aided systems theory, 81-90 (2005), Berlin: Springer, Berlin
[14] Jochim, C.; Lioma, C.; Schütze, H., Expanding queries with term and phrase translations in patent retrieval, Information retrieval facility conference, 16-29 (2011), Berlin: Springer, Berlin
[15] Koehn, P., Statistical machine translation (2010), Cambridge: Cambridge University Press, Cambridge · Zbl 1202.68446
[16] Lardilleux A, Lepage Y (2009) Sampling-based multilingual alignment. In: International conference on recent advances in natural language processing (RANLP 2009). Borovets, Bulgaria
[17] Moore RC (2005) A discriminative framework for bilingual word alignment. In: Proceedings of the conference on human language technology and empirical methods in natural language processing. Association for Computational Linguistics, pp. 81-88
[18] Neubig G, Watanabe T, Sumita E, Mori S, Kawahara T (2011) An unsupervised model for joint phrase alignment and extraction. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, vol 1. Association for Computational Linguistics, pp. 632-641
[19] Neubig G, Watanabe T, Mori S, Kawahara T (2012) Machine translation without words through substring alignment. In: Proceedings of the 50th annual meeting of the association for computational linguistics: long papers, vol 1. Association for Computational Linguistics, pp. 165-174
[20] Nie, JY, Cross-language information retrieval, Synth Lect Hum Lang Technol, 3, 1, 1-125 (2010)
[21] Och FJ (2000) Giza++: training of statistical translation models. Technical report, RWTH Aachen, University of Technology
[22] Och FJ, Ney H (2000) A comparison of alignment models for statistical machine translation. In: COLING 2000, volume 2: the 18th international conference on computational linguistics · Zbl 1234.68428
[23] Och, FJ; Ney, H., A systematic comparison of various statistical alignment models, Comput Linguist, 29, 1, 19-51 (2003) · Zbl 1234.68428
[24] Och, FJ; Ney, H., The alignment template approach to statistical machine translation, Comput Linguist, 30, 4, 417-449 (2004) · Zbl 1234.68429
[25] Okita T (2009) Data cleaning for word alignment. In: Proceedings of the ACL-IJCNLP 2009 student research workshop. Association for Computational Linguistics, pp. 72-80
[26] Sasaki, Y., The truth of the F-measure, Teach Tutor Mater, 1, 5, 1-5 (2007)
[27] Simões A, Almeida JJ (2003) NATools - a statistical word aligner workbench. Proces Leng Nat 31(septiembre 2003), 217-224
[28] Supreme Council of Information and Communication Technology (2013) Mizan English-Persian Parallel Corpus
[29] R Core Team (2015) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0
[30] Vulić I, Moens MF (2010) Term alignment, state of the art overview. Technical report, Katholieke Universiteit Leuven LIIR (Language Intelligence and Information Retrieval)
[31] Walker A (2017) openxlsx: read, write and edit XLSX files. R package version 4.0.17
[32] Wang, X., Evaluation of two word alignment systems (2004), Umeå: Institutionen för datavetenskap, Umeå
[33] Wu H, Wang H (2007) Comparative study of word alignment heuristics and phrase-based SMT. In: Proceedings of the MT Summit XI
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.