Alternative Architectures

While the attentional sequence-to-sequence model is currently the dominant architecture for neural machine translation, other architectures have been explored.

Alternative architectures are the main subject of 44 publications, of which 14 are discussed here.


Kalchbrenner and Blunsom (2013) build a machine translation model by first encoding the source sentence with a convolutional neural network, and then generating the target sentence by reversing the process. A refinement of this was proposed by Gehring et al. (2017), who use multiple convolutional layers in both the encoder and the decoder that do not reduce the length of the encoded sequence but incorporate wider context with each layer.
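
The key property of such a stacked convolutional encoder is that each layer preserves the sequence length while widening the receptive field. The following is a minimal NumPy sketch of one length-preserving layer, not the actual implementation of Gehring et al. (2017); the filter shape and the tanh nonlinearity are illustrative assumptions.

```python
import numpy as np

def conv_layer(X, W):
    """One length-preserving 1D convolution over a token sequence.

    X: (seq_len, d) token embeddings; W: (k, d, d) filter with odd width k.
    Zero-padding by (k-1)//2 on each side keeps the output the same length
    as the input, so layers can be stacked, each widening the context window.
    """
    k, d, _ = W.shape
    pad = (k - 1) // 2
    Xp = np.pad(X, ((pad, pad), (0, 0)))           # pad along the time axis only
    out = np.stack([sum(Xp[i + j] @ W[j] for j in range(k))
                    for i in range(len(X))])       # slide the width-k filter
    return np.tanh(out)

# toy usage: six tokens in, six positions out
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))
W = rng.standard_normal((3, 4, 4)) * 0.1
H = conv_layer(X, W)
print(H.shape)  # (6, 4): same length as the input
```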

Self Attention (Transformer)

Vaswani et al. (2017) replace the recurrent neural networks used in attentional sequence-to-sequence models with multiple self-attention layers (a model called the Transformer), both in the encoder and in the decoder. Chen et al. (2018) compare different configurations of Transformer and recurrent neural networks in the encoder and decoder, report that many of the quality gains are due to a handful of training tricks, and show the best results with a Transformer encoder and an RNN decoder. Emelin et al. (2019) identify a representation bottleneck in the self-attention layers: they must carry through lexical features, which prevents them from focusing on more complex features. To address this, they add shortcut connections from the initial embedding layer to each of the self-attention layers, in both the encoder and the decoder.
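
The core operation of these models is scaled dot-product self-attention, in which every position attends to every other position in the same sequence. A minimal single-head NumPy sketch (the multi-head machinery, masking, and feed-forward sublayers of the full Transformer are omitted):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention (Vaswani et al., 2017).

    X: (seq_len, d_model) input; Wq, Wk, Wv: (d_model, d_k) projections.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (seq_len, seq_len) similarities
    return softmax(scores) @ V                # each position mixes all values

# toy usage: 5 tokens, d_model=8, projected down to d_k=4
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))
Wq = rng.standard_normal((8, 4))
Wk = rng.standard_normal((8, 4))
Wv = rng.standard_normal((8, 4))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 4)
```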
Dehghani et al. (2019) propose a variant, called the Universal Transformer, that does not use a fixed stack of distinct processing layers but instead loops an arbitrary number of times through a single shared processing layer.
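
The structural difference is easy to sketch: instead of a stack of layers with separate parameters, one set of parameters is applied repeatedly. The residual tanh update below is a placeholder for the real attention-plus-transition step, and the fixed loop count stands in for the adaptive halting (ACT) used in the paper.

```python
import numpy as np

def shared_step(H, W):
    # placeholder for one shared "processing layer" (residual update)
    return H + np.tanh(H @ W)

def universal_transformer_encode(X, W, n_steps):
    """Sketch of the Universal Transformer idea (Dehghani et al., 2019):
    the *same* layer (same parameters W) is applied n_steps times,
    rather than stacking n_steps distinct layers."""
    H = X
    for _ in range(n_steps):
        H = shared_step(H, W)
    return H
```

Because the loop reuses one parameter matrix, the depth of processing can be varied at run time without changing the model size.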

Deeper Transformer Models

Naive implementations of deeper Transformer models that simply increase the number of encoder and decoder blocks lead to worse and sometimes catastrophic results. Wu et al. (2019) first train a model with n Transformer blocks, then keep their parameters fixed and add m additional blocks. Bapna et al. (2018) argue that information from earlier encoder layers may be lost, and connect all encoder layers to the attention computation of the decoder. Wang et al. (2019) successfully train deep Transformer models with up to 30 layers by relocating the normalization step to the beginning of each block and by adding residual connections to all previous layers, not just the directly preceding one.
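
The normalization relocation can be illustrated by contrasting the two block orderings. In the original post-norm arrangement, layer normalization sits after the residual addition; in the pre-norm arrangement it moves inside the sublayer branch, leaving an untouched identity path through the whole stack. A minimal NumPy sketch (the sublayer itself is passed in as a function; the real blocks contain attention and feed-forward sublayers):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize each position's feature vector to zero mean, unit variance
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def postnorm_block(x, sublayer):
    # original Transformer ordering: residual add, then normalize
    return layer_norm(x + sublayer(x))

def prenorm_block(x, sublayer):
    # reordered variant: normalize inside the branch, keep the residual
    # path as a pure identity, which is what keeps very deep stacks trainable
    return x + sublayer(layer_norm(x))
```

With pre-norm blocks, the output of a deep stack always contains the input via the chain of identity connections, so gradients reach the lowest layers directly.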

Document Context

Maruf et al. (2018) consider the entire source document as context when translating a sentence: attention is computed over all input sentences, and the sentences are weighted accordingly. Miculicich et al. (2018) extend this work with hierarchical attention, which first computes attention over sentences and then over words; due to its computational cost, this is limited to a window of surrounding sentences. Maruf et al. (2019) also use hierarchical attention, but compute sentence-level attention over the entire document and filter for the most relevant sentences before extending attention to the word level. A gate distinguishes between words in the source sentence and words in the context sentences. Junczys-Dowmunt (2019) translates entire source documents (up to 1,000 words) at a time by concatenating all input sentences, showing significant improvements.
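
The two-level attention idea can be sketched as follows. This is a hypothetical toy illustration in the spirit of these hierarchical models, not any paper's actual architecture: sentences are scored first, then words within each sentence, and the word-level summaries are combined using the sentence weights.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def hierarchical_context(query, sent_vecs, word_vecs):
    """Toy two-level document attention.

    query:     (d,) decoder state
    sent_vecs: (n_sents, d) one summary vector per context sentence
    word_vecs: list of (n_words_i, d) arrays, one per sentence
    """
    sent_scores = softmax(sent_vecs @ query)      # attention over sentences
    ctx = np.zeros_like(query)
    for s, words in zip(sent_scores, word_vecs):
        word_scores = softmax(words @ query)      # attention over words
        ctx += s * (word_scores @ words)          # sentence-weighted summary
    return ctx
```

Restricting `sent_vecs` to a window of surrounding sentences, or pre-filtering it to the highest-scoring sentences, corresponds to the two efficiency strategies described above.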



Related Topics

New Publications


  • Xu et al. (2019)
  • Werlen et al. (2018)


  • Hao et al. (2019) - recurrence
  • Mino et al. (2017) - target attention
  • Zhang et al. (2018) - average attention

Multi-Layer Fusion

  • Wang et al. (2018)

Weakly Recurrent

  • Di Gangi and Federico (2018)

Weight Tying in Embeddings

  • Pappas et al. (2018)
  • Kuang et al. (2018)


  • Gu et al. (2018)
  • Wei et al. (2019)
  • Wang et al. (2018)
  • Libovický and Helcl (2018)

Phrase Model

  • Huang et al. (2018)


  • Kaiser et al. (2018)

Neural Hidden Markov

  • Wang et al. (2018)

Modelling Past and Future

  • Zheng et al. (2018)


  • Bahar et al. (2018)

Gated Memory

  • Cao and Xiong (2018)

Exploiting Deep Representations

  • Dou et al. (2018)


  • Zhang et al. (2018)


  • Jehl and Riezler (2018)
  • Zhang et al. (2018)
  • Voita et al. (2019)
  • Kuang and Xiong (2018)
  • Wang et al. (2018)
  • Tu et al. (2018)
  • Maruf and Haffari (2018)

Sentence-Level Context

  • Wang et al. (2019)


  • Pouget-Abadie et al. (2014)
  • Hill et al. (2014)