Attention Model

The currently dominant model in neural machine translation is the sequence-to-sequence model with attention.

Attention Model is the main subject of 57 publications.


The attention model has its roots in a sequence-to-sequence model.

Cho et al. (2014) use recurrent neural networks for the approach. Sutskever et al. (2014) use an LSTM (long short-term memory) network and feed the source sentence to the encoder in reverse order.
The seminal work by Bahdanau et al. (2015) adds an alignment model (the so-called "attention mechanism") that links each generated output word to source words, conditioned on the hidden state that produced the preceding target word. Source words are represented by the two hidden states of recurrent neural networks that process the source sentence left-to-right and right-to-left. Luong et al. (2015) propose variants of this attention mechanism (which they call the "global" attention model) as well as a hard-constrained attention model (the "local" attention model) that is restricted to a Gaussian distribution around a specific input word.
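The additive attention computation described above can be sketched as follows (weight shapes and names here are illustrative, not taken from any specific implementation):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def bahdanau_attention(prev_state, enc_states, W, U, v):
    """Sketch of additive attention (Bahdanau et al., 2015): score each
    source position against the previous decoder hidden state, then build
    the context vector as the weighted sum of encoder states."""
    # e_j = v^T tanh(W s_{i-1} + U h_j) for each source position j
    scores = np.array([v @ np.tanh(W @ prev_state + U @ h) for h in enc_states])
    weights = softmax(scores)       # alignment distribution over source words
    context = weights @ enc_states  # weighted sum of encoder states
    return weights, context
```

The decoder would then condition its next hidden state and output word prediction on this context vector; in the bidirectional setup, each row of `enc_states` is the concatenation of the forward and backward hidden states for one source word.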
To explicitly model the trade-off between source context (the input words) and target context (the already produced target words), Tu et al. (2016) introduce an interpolation weight (called the "context gate") that scales the impact of (a) the source context state and (b) the previous hidden state and the last output word when predicting the next hidden state in the decoder.
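A minimal sketch of such a context gate, assuming a simple tanh state update (parameter names and the exact gating form are illustrative simplifications):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def context_gate_update(prev_state, prev_word_emb, src_context, params):
    """Sketch of a context gate (Tu et al., 2016): a learned interpolation
    weight z decides how much the next decoder state is driven by the
    source context versus the target-side history (previous state and
    last output word)."""
    Wz, Uz, Cz, W, U, C = params
    # gate computed from all three inputs
    z = sigmoid(Wz @ prev_state + Uz @ prev_word_emb + Cz @ src_context)
    # gated update: z scales the source side, (1 - z) the target side
    new_state = np.tanh(z * (C @ src_context)
                        + (1.0 - z) * (W @ prev_state + U @ prev_word_emb))
    return new_state, z
```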
Tu et al. (2017) augment the attention model with a reconstruction step. The generated output is translated back into the input language, and the training objective is extended to include not only the likelihood of the target sentence but also the likelihood of the reconstructed input sentence.
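The extended objective can be sketched as a weighted sum of the two log-likelihoods (the interpolation weight `lam` is an assumed hyperparameter name, not from the paper):

```python
def reconstruction_objective(log_p_target, log_p_reconstructed, lam=1.0):
    """Sketch of a training objective with a reconstruction term
    (in the spirit of Tu et al., 2017): log-likelihood of the target
    sentence plus the log-likelihood of reconstructing the input,
    weighted by an assumed hyperparameter lam."""
    return log_p_target + lam * log_p_reconstructed
```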

Deep Models:

There are several ways to add layers to the encoder and the decoder of the neural translation model. Wu et al. (2016) first use the traditional bidirectional recurrent neural networks to compute input word representations and then refine them with several stacked recurrent layers. Zhou et al. (2016) alternate between forward and backward recurrent layers. Barone et al. (2017) show good results with 4 stacks and 2 deep transitions each for encoder and decoder, as well as alternating networks for the encoder. There is a large number of variations (including the use of skip connections, the choice of LSTM vs. GRU, and the number of layers of any type) that still need to be explored empirically for various data conditions.
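The alternating-direction scheme of Zhou et al. (2016) can be sketched with a minimal tanh RNN (the layer implementation here is illustrative; real systems use LSTM or GRU cells):

```python
import numpy as np

def simple_rnn_layer(inputs, Wx, Wh):
    """Minimal tanh RNN over a sequence; returns the hidden state at each step."""
    h = np.zeros(Wh.shape[0])
    out = []
    for x in inputs:
        h = np.tanh(Wx @ x + Wh @ h)
        out.append(h)
    return np.array(out)

def alternating_deep_encoder(seq, layers):
    """Sketch of a deep encoder with alternating directions (Zhou et al.,
    2016): odd-numbered layers read the sequence left-to-right, even ones
    right-to-left, each layer consuming the previous layer's outputs."""
    states = seq
    for depth, (Wx, Wh) in enumerate(layers):
        if depth % 2 == 1:            # feed backward layers the reversed sequence
            states = states[::-1]
        states = simple_rnn_layer(states, Wx, Wh)
        if depth % 2 == 1:            # restore the original time order
            states = states[::-1]
    return states
```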


Limiting Hypothesis Generation:

Freitag and Al-Onaizan (2017) introduce threshold pruning to neural machine translation: hypotheses whose score falls below a certain fraction of the best score are discarded, yielding faster decoding while maintaining quality. Hu et al. (2015) modify search in two ways. First, instead of expanding all hypotheses of a stack up to a maximum size N, they expand only the single best hypothesis from any stack at a time; to avoid expanding only short hypotheses, a brevity penalty is introduced. Second, they limit the computation of translation probabilities to words that occur in the phrase table of a traditional phrase-based model for the input sentence, leading to several-fold speed-ups at little loss in quality. Extending this work, Mi et al. (2016) also include top word translations and the most frequent words of the vocabulary in the filter for the prediction computation.
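Threshold pruning is straightforward to sketch: with log-probability scores, discarding hypotheses below a fraction of the best hypothesis' probability amounts to a cutoff at the best score plus the log of that fraction (the function name and hypothesis representation are assumptions):

```python
import math

def threshold_prune(hypotheses, fraction=0.5):
    """Sketch of threshold pruning (Freitag and Al-Onaizan, 2017):
    drop hypotheses whose probability falls below a fraction of the best
    hypothesis' probability. Since scores are log-probabilities, the
    cutoff is best_score + log(fraction)."""
    best = max(score for score, _ in hypotheses)
    cutoff = best + math.log(fraction)
    return [(score, hyp) for score, hyp in hypotheses if score >= cutoff]
```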

Decoding Constraints:

In practical deployment of machine translation, there is often a need to override model predictions with pre-specified word or phrase translations, for instance to enforce required terminology or to support external components. Chatterjee et al. (2017) allow the specification of pre-defined translations for certain input words and modify the decoder to use them, based on input word attention. Hokamp and Liu (2017) modify the decoding algorithm to force the decoder to produce certain specified output strings. Each time one of these output strings is produced, the hypothesis is placed into a different beam, and final translations are picked from the beam containing hypotheses that produced all specified output. Related to this idea, Anderson et al. (2017) mark hypotheses with states in a finite state machine that indicate the subset of constraints (pre-specified translations) that have been satisfied. Hasler et al. (2018) refine this approach by using a linear (rather than exponential) number of constraint-satisfaction states, and also remove attention from words whose constraints have been satisfied. Post and Vilar (2018) split the beam into sub-beams instead of duplicating beams, preventing an increase in decoding time for sentences with such constraints.
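The beam-splitting idea of Post and Vilar (2018) can be sketched as allocating a fixed beam across banks of hypotheses grouped by the number of constraints they have satisfied (the data representation and allocation policy here are simplified assumptions):

```python
def group_into_banks(hypotheses, num_constraints, beam_size):
    """Sketch of dynamic beam allocation (Post and Vilar, 2018): instead of
    one beam per constraint-satisfaction state, a single beam is split into
    banks, one per number of constraints already satisfied, so the total
    beam size stays fixed regardless of the number of constraints."""
    banks = {k: [] for k in range(num_constraints + 1)}
    for hyp in hypotheses:
        banks[hyp["met"]].append(hyp)
    # give each bank an equal share of the beam (simplified policy)
    share = max(1, beam_size // (num_constraints + 1))
    kept = []
    for k in range(num_constraints + 1):
        ranked = sorted(banks[k], key=lambda h: h["score"], reverse=True)
        kept.extend(ranked[:share])
    return kept[:beam_size]
```

This keeps partially constrained hypotheses alive in their own banks, so hypotheses that have satisfied few constraints cannot crowd out those that have satisfied many, and vice versa.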



Related Topics

New Publications

  • Hoang et al. (2017)
  • Yang et al. (2017)
  • Gu et al. (2017)

Attention Model

  • Zhang et al. (2017)
  • Yu et al. (2016)
  • Huang et al. (2016)
  • Mi et al. (2016)
  • Calixto et al. (2017)
  • Press and Wolf (2017)
  • Yang et al. (2017)

Advanced Training

  • Zhang et al. (2016)
  • Stahlberg et al. (2017)
  • Yang et al. (2017)
  • Wiseman and Rush (2016)
  • Kreutzer et al. (2017)
  • Neubig (2016)
  • Cheng et al. (2016)
  • Shen et al. (2016)
  • Do et al. (2015)
  • Huang et al. (2015)
  • Cherry (2016)

Advanced Modelling

  • Tu et al. (2017)
  • Gehring et al. (2017)
  • Oda et al. (2017)
  • Wang et al. (2017)
  • Wang et al. (2016)
  • Sountsov and Sarawagi (2016)
  • Shu and Miura (2016)
  • Liu et al. (2016)


  • Mi et al. (2016)
  • Gu et al. (2017)
  • Zhou et al. (2017)
  • Zhou et al. (2017)
  • Shi and Knight (2017)
  • Kikuchi et al. (2016)
  • Hoang et al. (2017)
  • Ishiwatari et al. (2017)


  • Cromieres (2016)
  • Sennrich et al. (2017)
  • Klein et al. (2017)