The large number of words in natural language vocabularies is a challenge for the vector space representations used in neural networks. Several strategies have been explored to handle large vocabularies, or to avoid them by resorting to sub-word representations of words.
Vocabulary is the main subject of 28 publications.
Special Handling of Rare Words: A significant limitation of neural machine translation models is the computational burden of supporting very large vocabularies. To avoid this, the vocabulary may be reduced to a shortlist of, say, 20,000 words, and the remaining tokens are replaced with the unknown word token "UNK". To translate such an unknown word, Luong et al. (2015) and Jean et al. (2015) resort to a separate dictionary. Arthur et al. (2016) argue that neural translation models are worse for rare words and interpolate a traditional probabilistic bilingual dictionary with the prediction of the neural machine translation model. They use the attention mechanism to link each target word to a distribution of source words and weight the word translations accordingly. Source words such as names and numbers may also be directly copied into the target. Gulcehre et al. (2016) use a so-called switching network to predict either a traditional translation operation or a copying operation aided by a softmax layer over the source sentence. They preprocess the training data to change some target words into word positions of copied source words. Similarly, Gu et al. (2016) augment the word prediction step of the neural translation model to either translate a word or copy a source word. They observe that the attention mechanism is mostly driven by semantics and the language model in the case of word translation, but by location in the case of copying.
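The interpolation of a bilingual dictionary with the neural model's prediction can be illustrated with a minimal sketch. It assumes a fixed interpolation weight `lam` and toy probabilities. The function names, the example dictionary, and the weight are hypothetical and chosen for illustration only. They are not taken from Arthur et al. (2016), who also describe other ways of combining the two distributions.

```python
from typing import Dict, List

Distribution = Dict[str, float]


def lexicon_distribution(attention: List[float],
                         source_words: List[str],
                         dictionary: Dict[str, Distribution]) -> Distribution:
    """Attention-weighted dictionary distribution for one target position.

    p_lex(y) = sum_j attention[j] * p_dict(y | source_words[j])
    """
    p_lex: Distribution = {}
    for weight, src in zip(attention, source_words):
        # Source words missing from the dictionary contribute nothing.
        for tgt, prob in dictionary.get(src, {}).items():
            p_lex[tgt] = p_lex.get(tgt, 0.0) + weight * prob
    return p_lex


def interpolate(p_nmt: Distribution, p_lex: Distribution,
                lam: float) -> Distribution:
    """Linear interpolation: lam * p_lex + (1 - lam) * p_nmt.

    A fixed scalar `lam` is an assumption made for this sketch.
    """
    vocab = set(p_nmt) | set(p_lex)
    return {w: lam * p_lex.get(w, 0.0) + (1.0 - lam) * p_nmt.get(w, 0.0)
            for w in vocab}


# Toy example (all numbers are made up for illustration).
source = ["das", "Haus"]
attention = [0.1, 0.9]  # the decoder attends mostly to "Haus"
dictionary = {
    "das": {"the": 1.0},
    "Haus": {"house": 0.8, "home": 0.2},
}
p_nmt = {"the": 0.2, "house": 0.3, "home": 0.1, "UNK": 0.4}

p_lex = lexicon_distribution(attention, source, dictionary)
p_final = interpolate(p_nmt, p_lex, lam=0.5)
print(max(p_final, key=p_final.get))
```

In this toy setting the neural model alone would prefer "UNK", while the interpolated distribution shifts probability mass to the dictionary translation of the attended source word.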
Subwords: Sennrich et al. (2016) split all words into sub-word units, using character n-gram models and a segmentation based on the byte pair encoding compression algorithm. Schuster and Nakajima (2012) developed a similar method, originally for speech recognition, called WordPieceModel, that also starts by breaking up all words into character strings and then joins them together to obtain a lower-perplexity language model trained on the data. Ataman et al. (2017) propose a linguistically motivated vocabulary reduction method that models word formation as a sequence of stem and morphemes with a hidden Markov model, which can be optimized for a fixed target vocabulary size. Ataman and Federico (2018) show that this method outperforms byte pair encoding for several morphologically rich language pairs.
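The byte pair encoding idea can be sketched in a few lines: start from characters, repeatedly merge the most frequent adjacent symbol pair, and apply the learned merges in order to segment new words. The following is a simplified illustration, not the reference implementation of Sennrich et al. (2016). The function names, the `</w>` end-of-word marker handling, the tie-breaking rule, and the toy word counts are assumptions made for this sketch.

```python
from collections import Counter
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def merge_pair(symbols: Tuple[str, ...], pair: Pair) -> Tuple[str, ...]:
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out: List[str] = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_freqs: Dict[str, int], num_merges: int) -> List[Pair]:
    """Learn up to `num_merges` merge operations from word frequencies."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges: List[Pair] = []
    for _ in range(num_merges):
        pairs: Counter = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair; ties are broken lexicographically (an assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        vocab = {merge_pair(symbols, best): freq
                 for symbols, freq in vocab.items()}
    return merges


def segment(word: str, merges: List[Pair]) -> List[str]:
    """Segment a (possibly unseen) word by replaying the merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy corpus statistics, made up for illustration.
counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe(counts, num_merges=10)
print(segment("lowest", merges))
```

The number of merge operations controls the final vocabulary size: few merges leave words close to character sequences, while many merges turn frequent words into single units and keep rare words split into smaller pieces.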
Character-Based Models: Generating word representations from their character sequences was originally proposed for machine translation by Costa-jussà et al. (2016). They use a convolutional neural network to encode input words, and Costa-jussà and Fonollosa (2016) also show success with character-based language models in reranking machine translation. Chung et al. (2016) propose using a recurrent neural network to encode target words and also propose a bi-scale decoder where a fast layer outputs a character at a time, while a slow layer outputs a word at a time. Ataman et al. (2018) and Ataman and Federico (2018) show good results with a recurrent neural network over character trigrams for input words but not output words.
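As a rough illustration of composing an input word representation from character trigrams with a recurrent network, consider the sketch below. It is only a toy under several assumptions: the trigrams are overlapping and use hypothetical boundary markers `^` and `$`, the network is a plain tanh recurrence with random, untrained weights, and the class and function names are invented here. The cited papers may differ in all of these details.

```python
import math
import random
from typing import Dict, List


def char_trigrams(word: str) -> List[str]:
    """Overlapping character trigrams with boundary markers (an assumption)."""
    padded = "^" + word + "$"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


class TrigramRNNEncoder:
    """Toy tanh RNN that reads trigram embeddings and returns its final state.

    Weights are random and untrained; a real model would learn them jointly
    with the translation model.
    """

    def __init__(self, dim: int = 4, seed: int = 0) -> None:
        self.dim = dim
        self.rng = random.Random(seed)
        self.embeddings: Dict[str, List[float]] = {}
        self.w_in = self._matrix()   # input-to-hidden weights
        self.w_rec = self._matrix()  # hidden-to-hidden weights

    def _matrix(self) -> List[List[float]]:
        return [[self.rng.uniform(-0.5, 0.5) for _ in range(self.dim)]
                for _ in range(self.dim)]

    def _embed(self, trigram: str) -> List[float]:
        # Embeddings are created lazily, one vector per distinct trigram.
        if trigram not in self.embeddings:
            self.embeddings[trigram] = [self.rng.uniform(-0.5, 0.5)
                                        for _ in range(self.dim)]
        return self.embeddings[trigram]

    def encode(self, word: str) -> List[float]:
        h = [0.0] * self.dim
        for trigram in char_trigrams(word):
            x = self._embed(trigram)
            h = [math.tanh(
                    sum(self.w_in[i][j] * x[j] for j in range(self.dim))
                    + sum(self.w_rec[i][j] * h[j] for j in range(self.dim)))
                 for i in range(self.dim)]
        return h


encoder = TrigramRNNEncoder(dim=4, seed=0)
print(char_trigrams("cats"))
print(encoder.encode("cats"))
```

Because the trigram inventory is much smaller than the word vocabulary, a rare or unseen word such as an inflected form still receives a representation built from units it shares with frequent words.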