Haralampieva, Veneta; Caglayan, Ozan; Specia, Lucia
Supervised visual attention for simultaneous multimodal machine translation. (English) Zbl 07566006
J. Artif. Intell. Res. (JAIR) 74, 1059-1089 (2022).

Summary: There has been a surge of research in multimodal machine translation (MMT), where additional modalities such as images are used to improve the translation quality of textual systems. A particular use for such multimodal systems is the task of simultaneous machine translation, where visual context has been shown to complement the partial information provided by the source sentence, especially in the early phases of translation. In this paper, we propose the first Transformer-based simultaneous MMT architecture, an approach not previously explored in simultaneous translation. Additionally, we extend this model with an auxiliary supervision signal that guides its visual attention mechanism using labelled phrase-region alignments. We perform comprehensive experiments on three language directions and conduct thorough quantitative and qualitative analyses using both automatic metrics and manual inspection. Our results show that (i) supervised visual attention consistently improves the translation quality of the simultaneous MMT models, and (ii) fine-tuning the MMT with the supervision loss enabled leads to better performance than training the MMT from scratch. Compared to the state of the art, our proposed model achieves improvements of up to 2.3 BLEU and 3.5 METEOR points.

A minimal illustrative sketch of the supervised attention loss is given after the reference list below.

MSC: 68Txx Artificial intelligence

Keywords: machine translation; neural networks; natural language; vision

Software: Moses; GloVe; BLEU; Meteor; NLTK; Visual Genome; Tensor2Tensor; LXMERT; Faster R-CNN; Flickr30K; ImageNet

Cite: \textit{V. Haralampieva} et al., J. Artif. Intell. Res. (JAIR) 74, 1059--1089 (2022; Zbl 07566006)

Full Text: DOI arXiv

References:
[1] Alinejad, A., Siahbani, M., & Sarkar, A. (2018). Prediction improves simultaneous neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3022-3027, Brussels, Belgium. Association for Computational Linguistics.
[2] Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., & Zhang, L. (2018). Bottom-up and top-down attention for image captioning and visual question answering. In CVPR.
[3] Arivazhagan, N., Cherry, C., Macherey, W., Chiu, C.-C., Yavuz, S., Pang, R., Li, W., & Raffel, C. (2019). Monotonic infinite lookback attention for simultaneous machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 1313-1323.
[4] Arivazhagan, N., Cherry, C., Macherey, W., & Foster, G. (2020). Re-translation versus streaming for simultaneous translation. In Proceedings of the 17th International Conference on Spoken Language Translation, pp. 220-227, Online. Association for Computational Linguistics.
[5] Arthur, P., Cohn, T., & Haffari, G. (2021). Learning coupled policies for simultaneous machine translation using imitation learning. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp. 2709-2719, Online. Association for Computational Linguistics.
[6] Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer normalization. arXiv preprint arXiv:1607.06450.
[7] Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations.
[8] Bangalore, S., Rangarajan Sridhar, V. K., Kolan, P., Golipour, L., & Jimenez, A. (2012). Real-time incremental speech-to-speech translation of dialogs. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 437-445, Montréal, Canada. Association for Computational Linguistics.
[9] Barrault, L., Bougares, F., Specia, L., Lala, C., Elliott, D., & Frank, S. (2018). Findings of the third shared task on multimodal machine translation. In Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers, pp. 308-327, Brussels, Belgium. Association for Computational Linguistics.
[10] Bird, S., & Loper, E. (2004). NLTK: The natural language toolkit. In Proceedings of the ACL Interactive Poster and Demonstration Sessions, pp. 214-217, Barcelona, Spain. Association for Computational Linguistics.
[11] Bub, T., Wahlster, W., & Waibel, A. (1997). Verbmobil: The combination of deep and shallow processing for spontaneous speech translation. In 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 1, pp. 71-74. IEEE.
[12] Caglayan, O. (2019). Multimodal Machine Translation. PhD thesis, Université du Maine.
[13] Caglayan, O., Aransa, W., Bardet, A., García-Martínez, M., Bougares, F., Barrault, L., Masana, M., Herranz, L., & van de Weijer, J. (2017). LIUM-CVC submissions for WMT17 multimodal translation task. In Proceedings of the Second Conference on Machine Translation, pp. 432-439.
[14] Caglayan, O., Aransa, W., Wang, Y., Masana, M., García-Martínez, M., Bougares, F., Barrault, L., & van de Weijer, J. (2016). Does multimodality help human and machine for translation and image captioning? In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pp. 627-633, Berlin, Germany. Association for Computational Linguistics.
[15] Caglayan, O., Ive, J., Haralampieva, V., Madhyastha, P., Barrault, L., & Specia, L. (2020a). Simultaneous machine translation with visual context. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 2350-2361, Online. Association for Computational Linguistics.
[16] Caglayan, O., Madhyastha, P., & Specia, L. (2020b). Curious case of language generation evaluation metrics: A cautionary tale. In Proceedings of the 28th International Conference on Computational Linguistics, pp. 2322-2328, Barcelona, Spain (Online). International Committee on Computational Linguistics.
[17] Calixto, I., Elliott, D., & Frank, S. (2016). DCU-UvA multimodal MT system report. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pp. 634-638.
[18] Calixto, I., & Liu, Q. (2017). Incorporating global visual features into attention-based neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 992-1003.
[19] Calixto, I., Liu, Q., & Campbell, N. (2017). Doubly-attentive decoder for multi-modal neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1913-1924.
[20] Chen, X., Fang, H., Lin, T.-Y., Vedantam, R., Gupta, S., Dollár, P., & Zitnick, C. L. (2015). Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325.
[21] Cho, K., & Esipova, M. (2016). Can neural machine translation do simultaneous translation? arXiv preprint arXiv:1606.02012.
[22] Dalvi, F., Durrani, N., Sajjad, H., & Vogel, S. (2018). Incremental decoding and training methods for simultaneous translation in neural machine translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 493-499, New Orleans, Louisiana. Association for Computational Linguistics.
[23] Delbrouck, J.-B., & Dupont, S. (2017). Modulating and attending the source image during encoding improves multimodal translation. arXiv preprint arXiv:1712.03449.
[24] Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248-255. IEEE.
[25] Denkowski, M., & Lavie, A. (2014). Meteor Universal: Language specific translation evaluation for any target language. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pp. 376-380. Association for Computational Linguistics.
[26] Elbayad, M., Besacier, L., & Verbeek, J. (2020). Efficient wait-k models for simultaneous machine translation. In Proc. Interspeech 2020, pp. 1461-1465.
[27] Elliott, D., Frank, S., Barrault, L., Bougares, F., & Specia, L. (2017). Findings of the second shared task on multimodal machine translation and multilingual image description. In Proceedings of the Second Conference on Machine Translation, Volume 2: Shared Task Papers, pp. 215-233, Copenhagen, Denmark. Association for Computational Linguistics.
[28] Elliott, D., Frank, S., Sima'an, K., & Specia, L. (2016). Multi30K: Multilingual English-German image descriptions. In Proceedings of the 5th Workshop on Vision and Language, pp. 70-74, Berlin, Germany. Association for Computational Linguistics.
[29] Elliott, D., & Kádár, Á. (2017). Imagination improves multimodal translation. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 130-141, Taipei, Taiwan. Asian Federation of Natural Language Processing.
[30] Garg, S., Peitz, S., Nallasamy, U., & Paulik, M. (2019). Jointly learning to align and translate with transformer models. In Conference on Empirical Methods in Natural Language Processing (EMNLP), Hong Kong.
[31] Gu, J., Neubig, G., Cho, K., & Li, V. O. (2017). Learning to translate in real-time with neural machine translation. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pp. 1053-1062, Valencia, Spain. Association for Computational Linguistics.
[32] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778.
[33] Imankulova, A., Kaneko, M., Hirasawa, T., & Komachi, M. (2020). Towards multimodal simultaneous neural machine translation. In Proceedings of the Fifth Conference on Machine Translation, pp. 594-603, Online. Association for Computational Linguistics.
[34] Ive, J., Li, A. M., Miao, Y., Caglayan, O., Madhyastha, P., & Specia, L. (2021). Exploiting multimodal reinforcement learning for simultaneous machine translation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp. 3222-3233, Online. Association for Computational Linguistics.
[35] Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[36] Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A., & Herbst, E. (2007). Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pp. 177-180, Prague, Czech Republic. Association for Computational Linguistics.
[37] Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.-J., Shamma, D. A., et al. (2017). Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1), 32-73.
[38] Libovický, J., & Helcl, J. (2017). Attention strategies for multi-source sequence-to-sequence learning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 196-202.
[39] Libovický, J., Helcl, J., & Mareček, D. (2018). Input combination strategies for multi-source transformer decoder. In Proceedings of the Third Conference on Machine Translation: Research Papers, pp. 253-260.
[40] Liu, L., Utiyama, M., Finch, A., & Sumita, E. (2016). Neural machine translation with supervised attention. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 3093-3102, Osaka, Japan. The COLING 2016 Organizing Committee.
[41] Lu, J., Batra, D., Parikh, D., & Lee, S. (2019). ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems, pp. 13-23.
[42] Lu, J., Yang, J., Batra, D., & Parikh, D. (2016). Hierarchical question-image co-attention for visual question answering. In Advances in Neural Information Processing Systems, pp. 289-297.
[43] Ma, M., Huang, L., Xiong, H., Zheng, R., Liu, K., Zheng, B., Zhang, C., He, Z., Liu, H., Li, X., et al. (2019). STACL: Simultaneous translation with implicit anticipation and controllable latency using prefix-to-prefix framework. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3025-3036, Florence, Italy. Association for Computational Linguistics.
[44] Ma, X., Pino, J. M., Cross, J., Puzon, L., & Gu, J. (2020). Monotonic multihead attention. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020.
[45] Mi, H., Wang, Z., & Ittycheriah, A. (2016). Supervised attentions for neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2283-2288, Austin, Texas. Association for Computational Linguistics.
[46] Niehues, J., Pham, N.-Q., Ha, T.-L., Sperber, M., & Waibel, A. (2018). Low-latency neural speech translation. In Proc. Interspeech 2018, pp. 1293-1297.
[47] Nishihara, T., Tamura, A., Ninomiya, T., Omote, Y., & Nakayama, H. (2020). Supervised visual attention for multimodal neural machine translation. In Proceedings of the 28th International Conference on Computational Linguistics, pp. 4304-4314, Barcelona, Spain (Online). International Committee on Computational Linguistics.
[48] Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311-318. Association for Computational Linguistics.
[49] Pascanu, R., Gulcehre, C., Cho, K., & Bengio, Y. (2014). How to construct deep recurrent neural networks. In 2nd International Conference on Learning Representations, ICLR 2014.
[50] Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543, Doha, Qatar. Association for Computational Linguistics.
[51] Plummer, B. A., Wang, L., Cervantes, C. M., Caicedo, J. C., Hockenmaier, J., & Lazebnik, S. (2015). Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In 2015 IEEE International Conference on Computer Vision (ICCV), pp. 2641-2649.
[52] Press, O., & Wolf, L. (2017). Using the output embedding to improve language models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 157-163.
[53] Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91-99.
[54] Rohrbach, A., Rohrbach, M., Hu, R., Darrell, T., & Schiele, B. (2016). Grounding of textual phrases in images by reconstruction. In European Conference on Computer Vision, pp. 817-834. Springer.
[55] Ryu, K., Matsubara, S., & Inagaki, Y. (2006). Simultaneous English-Japanese spoken language translation based on incremental dependency parsing and transfer. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pp. 683-690, Sydney, Australia. Association for Computational Linguistics.
[56] Satija, H., & Pineau, J. (2016). Simultaneous machine translation using deep reinforcement learning. In ICML 2016 Workshop on Abstraction in Reinforcement Learning.
[57] Specia, L., Frank, S., Sima'an, K., & Elliott, D. (2016). A shared task on multimodal machine translation and crosslingual image description. In Proceedings of the First Conference on Machine Translation, pp. 543-553, Berlin, Germany. Association for Computational Linguistics.
[58] Specia, L., Wang, J., Jae Lee, S., Ostapenko, A., & Madhyastha, P. (2021). Read, spot and translate. Machine Translation, 35(1), 145-165.
[59] Sulubacak, U., Caglayan, O., Grönroos, S.-A., Rouhe, A., Elliott, D., Specia, L., & Tiedemann, J. (2020). Multimodal machine translation through visuals and speech. Machine Translation, 34(2), 97-147.
[60] Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pp. 3104-3112.
[61] Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818-2826.
[62] Tan, H., & Bansal, M. (2019). LXMERT: Learning cross-modality encoder representations from transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 5100-5111, Hong Kong, China. Association for Computational Linguistics.
[63] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998-6008.
[64] Wang, J., & Specia, L. (2019). Phrase localization without paired training examples. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, South Korea. IEEE.
[65] Wang, Q., Li, B., Xiao, T., Zhu, J., Li, C., Wong, D. F., & Chao, L. S. (2019). Learning deep transformer models for machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 1810-1822, Florence, Italy. Association for Computational Linguistics.
[66] Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4), 229-256. · Zbl 0772.68076
[67] Young, P., Lai, A., Hodosh, M., & Hockenmaier, J. (2014). From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2, 67-78.
[68] Zheng, B., Zheng, R., Ma, M., & Huang, L. (2019). Simpler and faster learning of adaptive policies for simultaneous translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 1349-1354.
[69] Zhou, M., Cheng, R., Lee, Y. J., & Yu, Z. (2018). A visual attention grounding neural model for multimodal machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3643-3653, Brussels, Belgium. Association for Computational Linguistics.

This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.
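Remark: The auxiliary supervision described in the summary attaches an extra loss term to the decoder's visual attention over image regions, pushing it towards labelled phrase-region alignments. The following minimal PyTorch sketch illustrates one way such a term can be computed; it is not the authors' released implementation, and all names (attention_supervision_loss, attn, labels, lambda_sup, the 36 Faster R-CNN regions) are illustrative assumptions.

import torch

def attention_supervision_loss(attn_weights, align_labels, eps=1e-8):
    """Cross-entropy between the predicted visual attention distribution and
    normalised phrase-region alignment labels (equal to KL divergence up to
    a constant). All tensors have shape (batch, tgt_len, num_regions)."""
    # Turn the binary region labels into a reference probability distribution.
    ref = align_labels / (align_labels.sum(-1, keepdim=True) + eps)
    # Per-token cross-entropy of the attention weights against the reference.
    ce = -(ref * (attn_weights + eps).log()).sum(-1)   # (batch, tgt_len)
    # Supervise only target tokens that have at least one aligned region.
    mask = (align_labels.sum(-1) > 0).float()
    return (ce * mask).sum() / mask.sum().clamp(min=1.0)

# Toy usage: add the weighted term to the ordinary translation loss.
batch, tgt_len, num_regions = 2, 5, 36      # e.g. 36 Faster R-CNN regions
attn = torch.softmax(torch.randn(batch, tgt_len, num_regions), dim=-1)
labels = (torch.rand(batch, tgt_len, num_regions) > 0.9).float()
translation_loss = torch.tensor(3.2)        # stand-in for the NMT loss
lambda_sup = 0.5                            # illustrative loss weighting
total_loss = translation_loss + lambda_sup * attention_supervision_loss(attn, labels)

Masking unlabelled tokens, as above, is one natural design choice when only some target tokens correspond to annotated phrase-region pairs; the paper's actual loss formulation and weighting should be taken from the original text.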