Residual energy-based models for text. (English) Zbl 07370557

Summary: Current large-scale auto-regressive language models display impressive fluency and can generate convincing text. In this work we start by asking the question: Can the generations of these models be reliably distinguished from real text by statistical discriminators? We find experimentally that the answer is affirmative when we have access to the training data for the model, and guardedly affirmative even if we do not.
This suggests that auto-regressive models can be improved by incorporating such (globally normalized) discriminators into the generative process. We formalize this using the Energy-Based Model framework and show that it indeed improves the generative models, as measured both by perplexity and by human evaluation.
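The residual formulation described above defines a joint model p(x) ∝ p_LM(x)·exp(−E(x)), where p_LM is the base auto-regressive language model and E is an energy function derived from the discriminator (low energy on real-looking text). The sketch below illustrates, with hypothetical toy stand-ins for the language model and the energy function, one simple way such a model can be sampled: draw candidates from the base LM and resample them with weights exp(−E(x)) (self-normalized importance sampling). This is a minimal illustration of the idea, not the paper's actual models or training procedure.

```python
import math
import random

def sample_residual_ebm(lm_sample, energy, n_candidates=10, rng=random):
    """Draw one sample from p(x) ∝ p_lm(x) * exp(-energy(x)).

    lm_sample: () -> text, samples from the base language model
    energy:    text -> float, lower means "more real" under the EBM
    """
    # Draw candidates from the base auto-regressive model ...
    candidates = [lm_sample() for _ in range(n_candidates)]
    # ... then reweight each one by exp(-E(x)) and resample.
    weights = [math.exp(-energy(x)) for x in candidates]
    total = sum(weights)
    return rng.choices(candidates, weights=[w / total for w in weights])[0]

# Toy usage (hypothetical): the "LM" emits one of two strings uniformly,
# and the "energy" strongly prefers the string labeled "real".
toy_lm = lambda: random.choice(["real text", "fake text"])
toy_energy = lambda x: 0.0 if x == "real text" else 5.0
print(sample_residual_ebm(toy_lm, toy_energy, n_candidates=50))
```

Because the importance weights are self-normalized over the drawn candidates, this procedure only needs unnormalized energies; the intractable partition function of the globally normalized model never has to be computed.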


68T05 Learning and adaptive systems in artificial intelligence
Full Text: arXiv Link

