SentencePiece

swMATH ID: 35795
Software Authors: Taku Kudo, John Richardson
Description: SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. This paper describes SentencePiece, a language-independent subword tokenizer and detokenizer designed for Neural-based text processing, including Neural Machine Translation. It provides open-source C++ and Python implementations for subword units. While existing subword segmentation tools assume that the input is pre-tokenized into word sequences, SentencePiece can train subword models directly from raw sentences, which allows us to make a purely end-to-end and language independent system. We perform a validation experiment of NMT on English-Japanese machine translation, and find that it is possible to achieve comparable accuracy to direct subword training from raw sentences. We also compare the performance of subword training and segmentation with various configurations.
Homepage: https://arxiv.org/abs/1808.06226
Source Code: https://github.com/google/sentencepiece
Related Software: ESPnet; Kaldi; Transformers; BERT; Conformer; LibriSpeech; fairseq; TensorFlow; Transformer-XL; ALBERT; RoBERTa; XLNet; Tensor2Tensor; Athena; PIKA; SPGISpeech; GigaSpeech; Lingvo; Espresso; PyTorch-Kaldi
Cited in: 2 Publications

Cited by 19 Authors:
1 Auli, Michael
1 Baines, Mandeep
1 Bhosale, Shruti
1 Birch, Tom
1 Camacho-Collados, José
1 Celebi, Onur
1 Chaudhary, Vishrav
1 Edunov, Sergey
1 El-Kishky, Ahmed
1 Fan, Angela
1 Goyal, Naman
1 Goyal, Siddharth
1 Jorge, Alípio Mário
1 Joulin, Armand
1 Liptchinsky, Vitaliy
1 Loureiro, Daniel
1 Ma, Zhiyi
1 Schwenk, Holger
1 Wenzek, Guillaume

Cited in 2 Serials:
1 Artificial Intelligence
1 Journal of Machine Learning Research (JMLR)

Cited in 1 Field:
2 Computer science (68-XX)
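
Illustration: the description states that SentencePiece trains subword models directly from raw sentences, without a pre-tokenization step. The toy sketch below shows one way this can work, assuming a byte-pair-encoding style learner in which whitespace is replaced by the visible marker "▁" (U+2581) and then treated like any other character. It is not the SentencePiece implementation: the function names (to_symbols, merge_pair, train_merges, encode, decode) are hypothetical, and it omits the Unicode normalization, the leading dummy "▁" prefix, and the unigram language model that the real tool offers.

```python
from collections import Counter

SPACE = "\u2581"  # "▁", the visible marker SentencePiece uses for whitespace


def to_symbols(sentence):
    """Turn a raw sentence into single-character symbols, keeping spaces as ▁."""
    return list(sentence.replace(" ", SPACE))


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def train_merges(sentences, num_merges):
    """Learn up to `num_merges` BPE merges from raw, untokenized sentences."""
    corpus = [to_symbols(s) for s in sentences]
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus.
        counts = Counter()
        for symbols in corpus:
            counts.update(zip(symbols, symbols[1:]))
        if not counts:
            break
        best = max(counts, key=counts.get)  # most frequent pair
        merges.append(best)
        corpus = [merge_pair(symbols, best) for symbols in corpus]
    return merges


def encode(sentence, merges):
    """Segment a raw sentence by replaying the learned merges in order."""
    symbols = to_symbols(sentence)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


def decode(pieces):
    """Detokenize: concatenate the pieces and restore the spaces."""
    return "".join(pieces).replace(SPACE, " ")
```

Because whitespace survives as an ordinary symbol, decode(encode(text, merges)) returns the original text for any input that does not itself contain "▁", including text in languages written without spaces. This is the property that lets the tokenizer and detokenizer work without language-specific rules.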