Vocabulary

The large number of words in natural language vocabularies is a challenge for the vector space representations used in neural networks. Several strategies have been explored to either handle large vocabularies directly or resort to sub-word representations of words.

Vocabulary is the main subject of 47 publications. 22 are discussed here.

Publications

Special Handling of Rare Words:

A significant limitation of neural machine translation models is the computational burden of supporting very large vocabularies. To avoid this, the vocabulary may be reduced to a shortlist of, say, 20,000 words, with the remaining tokens replaced by the unknown word token "UNK". To translate such an unknown word, Luong et al. (2015); Jean et al. (2015) resort to a separate dictionary. Arthur et al. (2016) argue that neural translation models perform worse on rare words and interpolate a traditional probabilistic bilingual dictionary with the prediction of the neural machine translation model. They use the attention mechanism to link each target word to a distribution over source words and weight the word translations accordingly.
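The shortlist, the dictionary back-off, and the lexicon interpolation described above can be illustrated with a minimal sketch. It is not the implementation of any of the cited papers: all function names are hypothetical, picking the most-attended source word for each UNK is a simplifying assumption, and the interpolation weight `lam` is an assumed fixed constant rather than a learned value.

```python
from collections import Counter

UNK = "UNK"

def build_shortlist(corpus, size):
    """Keep the `size` most frequent words of a tokenized corpus."""
    counts = Counter(word for sentence in corpus for word in sentence)
    return {word for word, _ in counts.most_common(size)}

def apply_shortlist(sentence, shortlist):
    """Replace every word outside the shortlist with the UNK token."""
    return [word if word in shortlist else UNK for word in sentence]

def replace_unknowns(target, source, attention, dictionary):
    """Post-process a translation: each UNK is replaced by the dictionary
    translation of its most-attended source word, or by the source word
    itself when the dictionary has no entry (assumed back-off)."""
    output = []
    for i, word in enumerate(target):
        if word == UNK:
            j = max(range(len(source)), key=lambda k: attention[i][k])
            word = dictionary.get(source[j], source[j])
        output.append(word)
    return output

def interpolate_with_lexicon(p_nmt, attention_row, source, lexicon, lam=0.5):
    """Mix the NMT prediction with an attention-weighted lexicon:
    p(e) = lam * sum_j a_j * p_lex(e | f_j) + (1 - lam) * p_nmt(e)."""
    p_lex = Counter()
    for a_j, f_j in zip(attention_row, source):
        for e, prob in lexicon.get(f_j, {}).items():
            p_lex[e] += a_j * prob
    words = set(p_nmt) | set(p_lex)
    return {e: lam * p_lex[e] + (1 - lam) * p_nmt.get(e, 0.0) for e in words}

# Toy usage with made-up data.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
shortlist = build_shortlist(corpus, 3)
reduced = apply_shortlist(["the", "dog", "ran"], shortlist)
restored = replace_unknowns(
    reduced,
    ["der", "Hund", "Berlin"],
    [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1], [0.0, 0.1, 0.9]],
    {"Hund": "dog"},
)
mixed = interpolate_with_lexicon(
    {"UNK": 0.6, "dog": 0.4},
    [0.2, 0.8],
    ["der", "Hund"],
    {"der": {"the": 1.0}, "Hund": {"dog": 0.9, "hound": 0.1}},
)
```

In the toy data, the lexicon shifts probability mass from the UNK token towards "dog", which is the intended effect of the interpolation for rare words.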

Source words such as names and numbers may also be directly copied into the target. Gulcehre et al. (2016) use a so-called switching network to predict either a traditional translation operation or a copying operation aided by a softmax layer over the source sentence. They preprocess the training data to change some target words into word positions of copied source words. Similarly, Gu et al. (2016) augment the word prediction step of the neural translation model to either translate a word or copy a source word. They observe that the attention mechanism is mostly driven by semantics and the language model in the case of word translation, but by location in case of copying.
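The choice between translating and copying can be sketched as a mixture of two distributions. This is a generic illustration under stated assumptions, not the exact architecture of either cited paper: the switch probability `p_translate`, the shortlist distribution, and the attention weights are assumed to be given by the model, and the function name is hypothetical.

```python
def copy_or_translate(p_translate, p_vocab, attention, source):
    """Combine a distribution over the shortlist vocabulary with a copy
    distribution over source positions.

    p_translate: probability (from an assumed switch) of generating a word
    p_vocab:     dict word -> probability under the regular softmax
    attention:   weights over source positions, used as copy probabilities
    source:      source words, aligned with `attention`
    """
    mixed = {word: p_translate * prob for word, prob in p_vocab.items()}
    for weight, word in zip(attention, source):
        mixed[word] = mixed.get(word, 0.0) + (1.0 - p_translate) * weight
    return mixed

# Toy usage: the switch favours copying, so the name is copied.
mixed = copy_or_translate(
    0.3,
    {"the": 0.7, "UNK": 0.3},
    [0.1, 0.9],
    ["Herr", "Schmidt"],
)
best = max(mixed, key=mixed.get)
```

Merging the two distributions by word, as done here, is one design choice; a model may instead keep word outputs and position outputs as separate events.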

Subwords:

Sennrich et al. (2016) split all words into sub-word units, using character n-gram models and a segmentation based on the byte pair encoding compression algorithm. Schuster and Nakajima (2012) developed a similar method, originally for speech recognition, called word piece or sentence piece, that also starts by breaking up all words into character strings and then joins them together to obtain a lower-perplexity unigram language model trained on the data. Kudo and Richardson (2018) present a toolkit for the sentence piece method and describe it in more detail. Kudo (2018) proposes subword regularization, which samples different subword segmentations during training to provide richer data for learning smaller subword units. Morishita et al. (2018) use different granularities of subword segmentation (using 16,000, 1,000, and 300 operations) in the model and during decoding, both for the input words and for the output word conditioning, by summing up the different representations (a single subword from the large vocabulary may decompose into multiple subwords from the smaller vocabularies).
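The byte pair encoding segmentation can be illustrated with a minimal sketch that learns merge operations from word counts and applies them to a new word. The function names, the `</w>` end-of-word marker, and the toy word counts are assumptions for illustration; this is not the code of any cited toolkit.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` by the joined symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def learn_bpe(word_counts, num_merges):
    """Learn up to `num_merges` merge operations from a word-frequency dict."""
    # Start from single characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        vocab = {merge_pair(symbols, best): count for symbols, count in vocab.items()}
    return merges

def segment(word, merges):
    """Segment a (possibly unseen) word by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

# Toy usage with made-up counts.
merges = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 5)
pieces = segment("lowest", merges)
```

The number of merge operations controls the vocabulary size: with few merges the segmentation stays close to characters, with many merges frequent words become single units.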

Ataman et al. (2017) propose a linguistically motivated vocabulary reduction method that models word formation as a sequence of stem and morphemes with a hidden Markov model, which can be optimized for a fixed target vocabulary size. Ataman and Federico (2018) show that this method outperforms byte pair encoding for several morphologically rich language pairs. Banerjee and Bhattacharyya (2018) also note that morphologically inspired segmentation, as provided by a tool called Morfessor (Virpioja et al., 2013), sometimes gives better results than byte pair encoding, and that both methods combined may outperform either.

Nikolov et al. (2018); Zhang and Komachi (2018) extend the idea of splitting up words to logographic languages such as Chinese by breaking up characters based on their romanized version or their decomposition into strokes.

Character-Based Models:

Generating word representations from their character sequences was originally proposed for machine translation by Costa-jussà et al. (2016). They use a convolutional neural network to encode input words, but Costa-jussà and Fonollosa (2016) also show success with character-based language models in machine translation reranking. Chung et al. (2016) propose using a recurrent neural network to encode target words and also propose a bi-scale decoder where a fast layer outputs a character at a time, while a slow layer outputs a word at a time. Ataman et al. (2018); Ataman and Federico (2018) show good results with a recurrent neural network over character trigrams for input words but not output words.
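How a convolutional encoder turns a character sequence into a fixed-size word representation can be sketched in a few lines. This is a minimal illustration under stated assumptions: the weights are random and untrained, the filter width and dimensions are arbitrary, the function name is hypothetical, and additional components that a full model may use on top of the convolution are omitted.

```python
import random

def char_cnn_embedding(word, char_vectors, filters, width=3):
    """Encode a word by sliding each filter over windows of `width`
    consecutive character vectors and max-pooling over all positions."""
    dim = len(next(iter(char_vectors.values())))
    pad = [0.0] * dim
    # Zero-pad both ends so that short words still yield at least one window.
    rows = [pad] * (width - 1)
    rows += [char_vectors.get(c, pad) for c in word]  # unknown chars -> zeros
    rows += [pad] * (width - 1)
    # Each window is the concatenation of `width` character vectors.
    windows = [sum(rows[i:i + width], []) for i in range(len(rows) - width + 1)]
    return [
        max(sum(w * x for w, x in zip(f, window)) for window in windows)
        for f in filters
    ]

# Toy usage with random, untrained parameters (assumed sizes).
rng = random.Random(0)
dim, num_filters, width = 4, 5, 3
alphabet = "abcdefghijklmnopqrstuvwxyz"
char_vectors = {c: [rng.uniform(-1, 1) for _ in range(dim)] for c in alphabet}
filters = [[rng.uniform(-1, 1) for _ in range(dim * width)] for _ in range(num_filters)]
vector = char_cnn_embedding("translation", char_vectors, filters, width)
```

The output has one value per filter regardless of word length, so rare and unseen words receive a representation of the same size as frequent ones.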

Benchmarks

Discussion

New Publications

Durgar El-Kahlout, İlknur and Bektaş, Emre and Erdem, Naime Şeyma and Kaya, Hamza (2019): Translating Between Morphologically Rich Languages: An Arabic-to-Turkish Machine Translation System, Proceedings of the Fourth Arabic Natural Language Processing Workshop
@inproceedings{durgar-el-kahlout-etal-2019-translating,
author = {Durgar El-Kahlout, {\.I}lknur and Bekta{\c{s}}, Emre and Erdem, Naime {\c{S}}eyma and Kaya, Hamza},
title = {Translating Between Morphologically Rich Languages: An {A}rabic-to-{T}urkish Machine Translation System},
booktitle = {Proceedings of the Fourth Arabic Natural Language Processing Workshop},
month = {aug},
address = {Florence, Italy},
publisher = {Association for Computational Linguistics},
url = {https://www.aclweb.org/anthology/W19-4617},
pages = {158--166},
year = 2019
}
El-Kahlout et al. (2019)
Julia Kreutzer and Artem Sokolov (2018): Learning to Segment Inputs for NMT Shows Preference for Character-Level Processing, Proceedings of the International Workshop on Spoken Language Translation (IWSLT)
@inproceedings{iwslt18-Segment-Kreutzer,
author = {Julia Kreutzer and Artem Sokolov},
title = {Learning to Segment Inputs for NMT Shows Preference for Character-Level Processing},
booktitle = {Proceedings of the International Workshop on Spoken Language Translation (IWSLT)},
year = 2018
}
Kreutzer and Sokolov (2018)
Tang, Gongbo and Cap, Fabienne and Pettersson, Eva and Nivre, Joakim (2018): An Evaluation of Neural Machine Translation Models on Historical Spelling Normalization, Proceedings of the 27th International Conference on Computational Linguistics
@inproceedings{C18-1112,
author = {Tang, Gongbo and Cap, Fabienne and Pettersson, Eva and Nivre, Joakim},
title = {An Evaluation of Neural Machine Translation Models on Historical Spelling Normalization},
booktitle = {Proceedings of the 27th International Conference on Computational Linguistics},
month = {aug},
address = {Santa Fe, New Mexico, USA},
publisher = {Association for Computational Linguistics},
url = {https://www.aclweb.org/anthology/C18-1112},
pages = {1320--1331},
year = 2018
}
Tang et al. (2018)
Ugawa, Arata and Tamura, Akihiro and Ninomiya, Takashi and Takamura, Hiroya and Okumura, Manabu (2018): Neural Machine Translation Incorporating Named Entity, Proceedings of the 27th International Conference on Computational Linguistics
@inproceedings{C18-1274,
author = {Ugawa, Arata and Tamura, Akihiro and Ninomiya, Takashi and Takamura, Hiroya and Okumura, Manabu},
title = {Neural Machine Translation Incorporating Named Entity},
booktitle = {Proceedings of the 27th International Conference on Computational Linguistics},
month = {aug},
address = {Santa Fe, New Mexico, USA},
publisher = {Association for Computational Linguistics},
url = {https://www.aclweb.org/anthology/C18-1274},
pages = {3240--3250},
year = 2018
}
Ugawa et al. (2018)
Angli Liu and Katrin Kirchhoff (2018): Context Models for OOV Word Translation in Low-Resource Languages, Annual Meeting of the Association for Machine Translation in the Americas (AMTA)
@inproceedings{AMTA2018-Liu,
author = {Angli Liu and Katrin Kirchhoff},
title = {Context Models for OOV Word Translation in Low-Resource Languages},
booktitle = {Annual Meeting of the Association for Machine Translation in the Americas (AMTA)},
location = {Boston, USA},
year = 2018
}
Liu and Kirchhoff (2018)
Nguyen, Toan and Chiang, David (2018): Improving Lexical Choice in Neural Machine Translation, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)
@InProceedings{N18-1031,
author = {Nguyen, Toan and Chiang, David},
title = {Improving Lexical Choice in Neural Machine Translation},
booktitle = {Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)},
publisher = {Association for Computational Linguistics},
pages = {334--343},
location = {New Orleans, Louisiana},
url = {http://aclweb.org/anthology/N18-1031},
year = 2018
}
Nguyen and Chiang (2018)
Liu, Frederick and Lu, Han and Neubig, Graham (2018): Handling Homographs in Neural Machine Translation, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)
@InProceedings{N18-1121,
author = {Liu, Frederick and Lu, Han and Neubig, Graham},
title = {Handling Homographs in Neural Machine Translation},
booktitle = {Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)},
publisher = {Association for Computational Linguistics},
pages = {1336--1345},
location = {New Orleans, Louisiana},
url = {http://aclweb.org/anthology/N18-1121},
year = 2018
}
Liu et al. (2018)
Pham, Ngoc-Quan and Niehues, Jan and Waibel, Alex (2018): Towards one-shot learning for rare-word translation with external experts, Proceedings of the 2nd Workshop on Neural Machine Translation and Generation
@InProceedings{W18-2712,
author = {Pham, Ngoc-Quan and Niehues, Jan and Waibel, Alex},
title = {Towards one-shot learning for rare-word translation with external experts},
booktitle = {Proceedings of the 2nd Workshop on Neural Machine Translation and Generation},
publisher = {Association for Computational Linguistics},
pages = {100--109},
location = {Melbourne, Australia},
url = {http://aclweb.org/anthology/W18-2712},
year = 2018
}
Pham et al. (2018)
Zhao, Yang and Zhang, Jiajun and He, Zhongjun and Zong, Chengqing and Wu, Hua (2018): Addressing Troublesome Words in Neural Machine Translation, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
@inproceedings{D18-1036,
author = {Zhao, Yang and Zhang, Jiajun and He, Zhongjun and Zong, Chengqing and Wu, Hua},
title = {Addressing Troublesome Words in Neural Machine Translation},
booktitle = {Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing},
address = {Brussels, Belgium},
publisher = {Association for Computational Linguistics},
url = {https://www.aclweb.org/anthology/D18-1036},
pages = {391--400},
year = 2018
}
Zhao et al. (2018)

Character-Based Models

Lee, Jason and Cho, Kyunghyun and Hofmann, Thomas (2017): Fully Character-Level Neural Machine Translation without Explicit Segmentation, Transactions of the Association for Computational Linguistics
@article{TACL1051,
author = {Lee, Jason and Cho, Kyunghyun and Hofmann, Thomas },
title = {Fully Character-Level Neural Machine Translation without Explicit Segmentation},
journal = {Transactions of the Association for Computational Linguistics},
volume = {5},
keywords = {{}},
issn = {2307-387X},
url = {https://transacl.org/ojs/index.php/tacl/article/view/1051},
pages = {365--378},
year = 2017
}
Lee et al. (2017)
Ebrahimi, Javid and Lowd, Daniel and Dou, Dejing (2018): On Adversarial Examples for Character-Level Neural Machine Translation, Proceedings of the 27th International Conference on Computational Linguistics
@inproceedings{C18-1055,
author = {Ebrahimi, Javid and Lowd, Daniel and Dou, Dejing},
title = {On Adversarial Examples for Character-Level Neural Machine Translation},
booktitle = {Proceedings of the 27th International Conference on Computational Linguistics},
month = {aug},
address = {Santa Fe, New Mexico, USA},
publisher = {Association for Computational Linguistics},
url = {https://www.aclweb.org/anthology/C18-1055},
pages = {653--663},
year = 2018
}
Ebrahimi et al. (2018)
Cherry, Colin and Foster, George and Bapna, Ankur and Firat, Orhan and Macherey, Wolfgang (2018): Revisiting Character-Based Neural Machine Translation with Capacity and Compression, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
@inproceedings{D18-1461,
author = {Cherry, Colin and Foster, George and Bapna, Ankur and Firat, Orhan and Macherey, Wolfgang},
title = {Revisiting Character-Based Neural Machine Translation with Capacity and Compression},
booktitle = {Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing},
address = {Brussels, Belgium},
publisher = {Association for Computational Linguistics},
url = {https://www.aclweb.org/anthology/D18-1461},
pages = {4295--4305},
year = 2018
}
Cherry et al. (2018)
Passban, Peyman and Liu, Qun and Way, Andy (2018): Improving Character-Based Decoding Using Target-Side Morphological Information for Neural Machine Translation, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)
@InProceedings{N18-1006,
author = {Passban, Peyman and Liu, Qun and Way, Andy},
title = {Improving Character-Based Decoding Using Target-Side Morphological Information for Neural Machine Translation},
booktitle = {Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)},
publisher = {Association for Computational Linguistics},
pages = {58--68},
location = {New Orleans, Louisiana},
url = {http://aclweb.org/anthology/N18-1006},
year = 2018
}
Passban et al. (2018)
Yang, Zhen and Chen, Wei and Wang, Feng and Xu, Bo (2016): A Character-Aware Encoder for Neural Machine Translation, Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers
@InProceedings{yang-EtAl:2016:COLING,
author = {Yang, Zhen and Chen, Wei and Wang, Feng and Xu, Bo},
title = {A Character-Aware Encoder for Neural Machine Translation},
booktitle = {Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers},
month = {December},
address = {Osaka, Japan},
publisher = {The COLING 2016 Organizing Committee},
pages = {3063--3070},
url = {http://aclweb.org/anthology/C16-1288},
year = 2016
}
Yang et al. (2016)
Luong, Minh-Thang and Manning, Christopher D. (2016): Achieving Open Vocabulary Neural Machine Translation with Hybrid Word-Character Models, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
@InProceedings{luong-manning:2016:P16-1,
author = {Luong, Minh-Thang and Manning, Christopher D.},
title = {Achieving Open Vocabulary Neural Machine Translation with Hybrid Word-Character Models},
booktitle = {Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
month = {August},
address = {Berlin, Germany},
publisher = {Association for Computational Linguistics},
pages = {1054--1063},
url = {http://www.aclweb.org/anthology/P16-1100},
year = 2016
}
Luong and Manning (2016)
Jason Lee and Kyunghyun Cho and Thomas Hofmann (2016): Fully Character-Level Neural Machine Translation without Explicit Segmentation, CoRR
@article{DBLP:journals/corr/LeeCH16,
author = {Jason Lee and Kyunghyun Cho and Thomas Hofmann},
title = {Fully Character-Level Neural Machine Translation without Explicit Segmentation},
journal = {CoRR},
volume = {abs/1610.03017},
url = {http://arxiv.org/abs/1610.03017},
timestamp = {Wed, 02 Nov 2016 09:51:26 +0100},
biburl = {http://dblp.uni-trier.de/rec/bib/journals/corr/LeeCH16},
bibsource = {dblp computer science bibliography, http://dblp.org},
year = 2016
}
Lee et al. (2016)
Eriguchi, Akiko and Hashimoto, Kazuma and Tsuruoka, Yoshimasa (2016): Character-based Decoding in Tree-to-Sequence Attention-based Neural Machine Translation, Proceedings of the 3rd Workshop on Asian Translation (WAT2016)
@InProceedings{eriguchi-hashimoto-tsuruoka:2016:WAT2016,
author = {Eriguchi, Akiko and Hashimoto, Kazuma and Tsuruoka, Yoshimasa},
title = {Character-based Decoding in Tree-to-Sequence Attention-based Neural Machine Translation},
booktitle = {Proceedings of the 3rd Workshop on Asian Translation (WAT2016)},
month = {December},
address = {Osaka, Japan},
publisher = {The COLING 2016 Organizing Committee},
pages = {175--183},
url = {http://aclweb.org/anthology/W16-4617},
year = 2016
}
Eriguchi et al. (2016)

Hybrid / Use of Translation Lexicons

Zi Long and Ryuichiro Kimura and Takehito Utsuro and Tomoharu Mitsuhashi and Mikio Yamamoto (2017): Neural Machine Translation Model with a Large Vocabulary Selected by Branching Entropy, Machine Translation Summit XVI
@inproceedings{mtsummit2017:Long,
author = {Zi Long and Ryuichiro Kimura and Takehito Utsuro and Tomoharu Mitsuhashi and Mikio Yamamoto},
title = {Neural Machine Translation Model with a Large Vocabulary Selected by Branching Entropy},
booktitle = {Machine Translation Summit XVI},
location = {Nagoya, Japan},
url = {https://arxiv.org/pdf/1704.04520.pdf},
year = 2017
}
Long et al. (2017)
Neubig, Graham (2016): Lexicons and Minimum Risk Training for Neural Machine Translation: NAIST-CMU at WAT2016, Proceedings of the 3rd Workshop on Asian Translation (WAT2016) mentioned in Training and Vocabulary
@InProceedings{neubig:2016:WAT2016,
author = {Neubig, Graham},
title = {Lexicons and Minimum Risk Training for Neural Machine Translation: NAIST-CMU at WAT2016},
booktitle = {Proceedings of the 3rd Workshop on Asian Translation (WAT2016)},
month = {December},
address = {Osaka, Japan},
publisher = {The COLING 2016 Organizing Committee},
pages = {119--125},
url = {http://aclweb.org/anthology/W16-4610},
year = 2016
}
Neubig (2016)
Wang, Weiyue and Alkhouli, Tamer and Zhu, Derui and Ney, Hermann (2017): Hybrid Neural Network Alignment and Lexicon Model in Direct HMM for Statistical Machine Translation, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
@InProceedings{wang-EtAl:2017:Short1,
author = {Wang, Weiyue and Alkhouli, Tamer and Zhu, Derui and Ney, Hermann},
title = {Hybrid Neural Network Alignment and Lexicon Model in Direct HMM for Statistical Machine Translation},
booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)},
month = {July},
address = {Vancouver, Canada},
publisher = {Association for Computational Linguistics},
pages = {125--131},
url = {http://aclweb.org/anthology/P17-2020},
year = 2017
}
Wang et al. (2017)
Wang, Xing and Lu, Zhengdong and Tu, Zhaopeng and Li, Hang and Xiong, Deyi and Zhang, Min (2016): Neural Machine Translation Advised by Statistical Machine Translation, arXiv preprint arXiv:1610.05150
@article{wang2016neural,
author = {Wang, Xing and Lu, Zhengdong and Tu, Zhaopeng and Li, Hang and Xiong, Deyi and Zhang, Min},
title = {Neural Machine Translation Advised by Statistical Machine Translation},
journal = {arXiv preprint arXiv:1610.05150},
url = {https://arxiv.org/pdf/1610.05150v2.pdf},
year = 2016
}
Wang et al. (2016)
Thang Luong and Ilya Sutskever and Quoc V. Le and Oriol Vinyals and Wojciech Zaremba (2014): Addressing the Rare Word Problem in Neural Machine Translation, CoRR
@article{DBLP:journals/corr/LuongSLVZ14,
author = {Thang Luong and Ilya Sutskever and Quoc V. Le and Oriol Vinyals and Wojciech Zaremba},
title = {Addressing the Rare Word Problem in Neural Machine Translation},
journal = {CoRR},
volume = {abs/1410.8206},
url = {http://arxiv.org/abs/1410.8206},
timestamp = {Sun, 02 Nov 2014 11:25:59 +0100},
biburl = {http://dblp.uni-trier.de/rec/bib/journals/corr/LuongSLVZ14},
bibsource = {dblp computer science bibliography, http://dblp.org},
year = 2014
}
Luong et al. (2014)
Sébastien Jean and Kyunghyun Cho and Roland Memisevic and Yoshua Bengio (2014): On Using Very Large Target Vocabulary for Neural Machine Translation, CoRR
@article{DBLP:journals/corr/JeanCMB14,
author = {S{\'{e}}bastien Jean and Kyunghyun Cho and Roland Memisevic and Yoshua Bengio},
title = {On Using Very Large Target Vocabulary for Neural Machine Translation},
journal = {CoRR},
volume = {abs/1412.2007},
url = {http://arxiv.org/abs/1412.2007},
timestamp = {Thu, 01 Jan 2015 19:51:08 +0100},
biburl = {http://dblp.uni-trier.de/rec/bib/journals/corr/JeanCMB14},
bibsource = {dblp computer science bibliography, http://dblp.org},
year = 2014
}
Jean et al. (2014)
Hashimoto, Kazuma and Eriguchi, Akiko and Tsuruoka, Yoshimasa (2016): Domain Adaptation and Attention-Based Unknown Word Replacement in Chinese-to-Japanese Neural Machine Translation, Proceedings of the 3rd Workshop on Asian Translation (WAT2016)
@InProceedings{hashimoto-eriguchi-tsuruoka:2016:WAT2016,
author = {Hashimoto, Kazuma and Eriguchi, Akiko and Tsuruoka, Yoshimasa},
title = {Domain Adaptation and Attention-Based Unknown Word Replacement in Chinese-to-Japanese Neural Machine Translation},
booktitle = {Proceedings of the 3rd Workshop on Asian Translation (WAT2016)},
month = {December},
address = {Osaka, Japan},
publisher = {The COLING 2016 Organizing Committee},
pages = {75--83},
url = {http://aclweb.org/anthology/W16-4605},
year = 2016
}
Hashimoto et al. (2016)
Long, Zi and Utsuro, Takehito and Mitsuhashi, Tomoharu and Yamamoto, Mikio (2016): Translation of Patent Sentences with a Large Vocabulary of Technical Terms Using Neural Machine Translation, Proceedings of the 3rd Workshop on Asian Translation (WAT2016)
@InProceedings{long-EtAl:2016:WAT2016,
author = {Long, Zi and Utsuro, Takehito and Mitsuhashi, Tomoharu and Yamamoto, Mikio},
title = {Translation of Patent Sentences with a Large Vocabulary of Technical Terms Using Neural Machine Translation},
booktitle = {Proceedings of the 3rd Workshop on Asian Translation (WAT2016)},
month = {December},
address = {Osaka, Japan},
publisher = {The COLING 2016 Organizing Committee},
pages = {47--57},
url = {http://aclweb.org/anthology/W16-4602},
year = 2016
}
Long et al. (2016)
Chitnis, Rohan and DeNero, John (2015): Variable-Length Word Encodings for Neural Translation Models, Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing
@InProceedings{chitnis-denero:2015:EMNLP,
author = {Chitnis, Rohan and DeNero, John},
title = {Variable-Length Word Encodings for Neural Translation Models},
booktitle = {Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing},
month = {September},
address = {Lisbon, Portugal},
publisher = {Association for Computational Linguistics},
pages = {2088--2093},
url = {http://aclweb.org/anthology/D15-1249},
year = 2015
}
Chitnis and DeNero (2015)

MT Research Survey Wiki

A Comprehensive Survey of Neural and Statistical Machine Translation Research Publications
