Search Descriptions

Main Topics

Search Publications





Neural Network Models

Neural network models have received little attention until a recent explosion of research in the 2010s, caused by their success in vision and speech recognition. Such models allow for clustering of related words and flexible use of context.

Neural Network Models is the main subject of 185 publications.


Basic models to use neural networks for machine translation were already proposed in the 20th century, but not seriously pursued due to lack of computational resources (Forcada and Ñeco, 1997; Casacuberta and Vidal, 1997).
The first competitive fully neural machine translation system participated in the WMT evaluation campaign in 2015 (Jean et al., 2015), reaching state-of-the-art performance at IWLST 2015 (Luong and Manning, 2015) and WMT 2016 (Sennrich et al., 2016), The same year, Systran (Crego et al., 2016), Google (Wu et al., 2016), and WIPO (Junczys-Dowmunt et al., 2016) reported large-scale deployments.
Language Models The first vanguard of neural network research tackled language models. A prominent reference for neural language model is Bengio et al. (2003), who implement an n-gram language model as a feed-forward neural network with the history words as input and the predicted word as output. Schwenk et al. (2006) introduce continuous space language models that are training using neural networks to machine translation, and use them in re-ranking, similar to the earlier work in speech recognition (Schwenk, 2007) where they lay out a number of speed-ups. They made their implementation available as a open source toolkit (Schwenk, 2010), which now also supports training on a graphical processing unit (GPU) (Schwenk et al., 2012).
By first clustering words into classes and encoding words as pair of class and word-in-class bits, Baltescu et al. (2014) reduce the computational complexity sufficiently to allow integration of the neural network language model into the decoder. Another way to reduce computational complexity to enable decoder integration is the use of noise contrastive estimation by Vaswani et al. (2013), which roughly self-normalizes the output scores of the model during training, hence removing the need to compute the values for all possible output words. Baltescu and Blunsom (2015) compare the two techniques - class-based word encoding with normalized scores vs. noise-contrastive estimation without normalized scores - and show that the letter gives better performance with much higher speed. As another way to allow straightforward decoder integration, Wang et al. (2013) convert a continuous space language model for a short list of 8192 words into a traditional n-gram language model in ARPA (SRILM) format. Wang et al. (2014) present a method to merge (or "grow") a continuous space language model with a traditional n-gram language model, to take advantage of both better estimate for the words in the short list and the full coverage from the traditional model.
Finch et al. (2012) use a recurrent neural network language model to rescore n-best lists for a transliteration system. Sundermeyer et al. (2013) compare feed-forward with long short-term neural network language models, a variant of recurrent neural network language models, showing better performance for the latter in a speech recognition re-ranking task. Mikolov (2012) reports significant improvements with reranking n-best lists of machine translation systems with a recurrent neural network language model.
Wu et al. (2012) propose to use factored representations of words (using lemma, stem, and part of speech), with each factor encoded in a one-hot vector, in the input to a recurrent neural network language model.
Neural language model are not deep learning models in the sense that they use a lot of hidden layers. Luong et al. (2015) show that having 3-4 hidden layers improves over having just the typical 1 layer.
Translation Models: By including aligned source words in the conditioning context, Devlin et al. (2014) enrich a feed-forward neural network language model with source context Zhang et al. (2015) add a sentence embedding to the conditional context of this model, which are learned using a variant of convolutional neural networks and mapping them across languages. Meng et al. (2015) use a more complex convolutional neural network to encode the input sentence that uses gated layers and also incorporates information about the output context.
Xing et al. (2015) point out inconsistencies in the representation of word embeddings and the objective function for translation transforms between word embeddings, which they address with normalization.
Zhang et al. (2014) learn phrase embeddings using recursive neural networks and auto-encoders and a mapping between input and output phrase to add an additional score to the phrase translations and to filter the phrase table. Hu et al. (2015) use convolutional neural networks to encode the input and output phrase and pass them to matching that computes their similarity. They include the full input sentence context in the and use a learning strategy called curriculum learning that first learns from the easy training examples and then the harder ones.
N-Gram Translation Model: An alternative view of the phrase based translation model is to break up phrase translations into minimal translation units, and employing a n-gram model over these units to condition each minimal translation units on the previous ones. Schwenk et al. (2007) treat each minimal translation unit as an atomic symbol and train a neural language model over it. Alternatively, (Hu et al., 2014) represent the minimal translation units as bag of words, (Wu et al., 2014) break them even further into single input words, single output words, or single input-output word pairs, and Yu and Zhu (2015) use phrase embeddings leaned with an auto-encoder.
Reordering Models: Lexicalized reordering models struggle with sparse data problems when conditioned on rich context. Li et al. (2014) show that a neural reordering model can be conditioned on current and previous phrase pair (encoded with a recursive neural network auto-encoder) to make the same classification decisions for orientation type.

Instead of handing reordering within the decoding process, we may pre-order the input sentence into output word order.

Gispert et al. (2015) use an input dependency tree to learn a model that swaps children nodes and implement it using a feed-forward neural network. Barone and Attardi (2015) formulate a top-down left-to-right walk through the dependency tree and make reordering decisions at any node. They model this process with a recurrent neural network that includes past decisions in the conditioning context.
End-to-End Neural Machine Translation: Kalchbrenner and Blunsom (2013) build a comprehensive machine translation model by first encoding the source sentence with a convolutional neural network, and then generate the target sentence by reversing the process. This approach is sometimes referred to as encoder-decoder. Cho et al. (2014) use recurrent neural networks for the approach. Sutskever et al. (2014) use a LSTM (long short-term memory) network and reverse the order of the source sentence before decoding.
Attention Model: The influential work by Bahdanau et al. (2015) adds an alignment model (so called "attention mechanism") to link generated output words to source words, which includes conditioning on the hidden state that produced the preceding target word. Source words are represented by the two hidden states of recurrent neural networks that process the source sentence left-to-right and right-to-left. Luong et al. (2015) propose variants to the attention mechanism (which they call "global" attention model) and also a hard-constraint attention model ("local" attention model) which is restricted to a Gaussian distribution around a specific input word.
To better model coverage, Tu et al. (2016) add coverage states for each input word by either (a) summing up attention values, scaled by a fertility value predicted from the input word in context, or (b) learning a coverage update function as a feed-forward neural network layer. This coverage state is added as additional conditioning context for the prediction of the attention state. Feng et al. (2016) condition the prediction of the attention state also on the previous context state and also introduce a coverage state (initialized with the sum of input source embeddings) that aims to subtract covered words at each step. Similarly, Meng et al. (2016) separate hidden states that keep track of source coverage and hidden states that keeo track of produced output. Cohn et al. (2016) add a number of biases to model coverage, fertility, and alignment inspired by traditional statistical machine translation models. They condition the prediction of the attention state on absolute word positions, the attention state of the previous output word in a limited window, and coverage (added attention state values) over a limited window. They also add a fertility model and add coverage in the training objective.
Chen et al. (2016); Liu et al. (2016) add supervised word alignment information (obtained with traditional statistical word alignment methods) to training. They augment the objective function to also optimize matching of the attention mechanism to the given alignments.
To explicitly model the trade-off between source context (the input words) and target context (the already produced target words), Tu et al. (2016) introduce an interpolation weight (called "context gate") that scales the impact of the (a) source context state and (b) the previous hidden state and the last word when predicting the next hidden state in the decoder.
Tu et al. (2017) augment the attention model with a reconstruction step. The generated output is translated back into the input language and the training objective is extended to not only include the likelihood of the target sentence but also the likelihood to the reconstructed input sentence.
Language Models in Neural Machine Translation: Traditional statistical machine translation models have a straightforward mechanism to integrate additional knowledge sources, such as a large out of domain language model. It is harder for end-to-end neural machine translation. Gülçehre et al. (2015) add a language model trained on additional monolingual data to this model, in form of a recurrently neural network that runs in parallel. They compare the use of the language model in re-ranking (or, re-scoring) against deeper integration where a gated unit regulates the relative contribution of the language model and the translation model when predicting a word. Sennrich et al. (2016) back-translate the monolingual data into the input language and use the obtained synthetic parallel corpus as additional training data.
Handling Rare and Unknown Words: A significant limitation of neural machine translation models is the computational burden to support very large vocabularies. To avoid this, typically the vocabulary is reduced to a shortlist of, say, 20,000 words, and the remaining token are replaced with the unknown word token "UNK". To translate such an unknown word, Luong et al. (2015); Jean et al. (2015) resort to a separate dictionary. Arthur et al. (2016) argue that neural translation models are worse for rare words and interpolate a traditional probabilistic bilingual dictionary with the prediction of the neural machine translation model. They use the attention mechanism to link each target word to a distribution of source words and weigh the word translations accordingly.
Source words such as names and numbers may also be directly copied into the target. Gulcehre et al. (2016) use a so-called switching network to predict either a traditional translation operation or a copying operation aided by a softmax layer over the source sentence. They preprocess the training data to change some target words into word positions of copied source words. Similarly, Gu et al. (2016) augment the word prediction step of the neural translation model to either translate a word or copy a source word. They observe that the attention mechanism is mostly driven by semantics and the language model in the case of word translation, but by location in case of copying.
To speed up training, Mi et al. (2016) use traditional statistical machine translation word and phrase translation models to filter the target vocabulary for mini batches.
Sennrich et al. (2016) split up all words to sub-word units, using character n-gram models and a segmentation based on the byte pair encoding compression algorithm.
Domain Adaptation: There is often a domain mismatch between the bulk (or even all) of the training data for a translation and its test data during deployment. There is rich literature in traditional statistical machine translation on this topic. A common approach for neural models is to first train on all available training data, and then run a few iterations on in-domain data only (Luong and Manning, 2015), as already pioneered in neural language model adaption (Ter-Sarkisov et al., 2015). To avoid overfitting, Freitag and Al-Onaizan (2016) suggest to use an ensemble of baseline models and adapted models. Kobus et al. (2016) apply an idea initially proposed by Sennrich et al. (2016) - to augment input sentences for register with a politeness feature token - to the domain adaptation problem. They add a domain token to each training and test sentence. Chen et al. (2016) report better results over the token approach to adapt to topics by encoding the given topic membership of each sentence as an additional input vector to the conditioning context of word prediction layer.



Related Topics

New Publications

System Descriptions (incomplete)

  • Junczys-Dowmunt et al. (2016)
  • Chung et al. (2016)
  • Guasch and Costa-jussà (2016)
  • Sánchez-Cartagena and Toral (2016)
  • Bradbury and Socher (2016)

Analytical Evaluation

  • Sennrich (2016)
  • Hirschmann et al. (2016)
  • Schnober et al. (2016)
  • Guta et al. (2015)

Neural Models as Statistical Machine Translation Components

  • Sennrich (2015)
  • Zhang et al. (2016)
  • Peter et al. (2016)
  • Zhang et al. (2015)
  • Setiawan et al. (2015)
  • Lu et al. (2014)
  • Zhai et al. (2014)
  • Liu et al. (2013)
  • Liu et al. (2014)

Language Models

  • Niehues et al. (2016)
  • Chen et al. (2016)
  • Chen et al. (2016)
  • Devlin et al. (2015)
  • Aransa et al. (2015)
  • Auli and Gao (2014)
  • Niehues et al. (2014)
  • Niehus and Waibel (2012)
  • Alkhouli et al. (2015)
  • Wang et al. (2013)

Reordering Models

  • Kanouchi et al. (2016)