The task of applying a trained model to generate a translation is called inference in machine learning, or more commonly decoding in machine translation. This problem is typically solved by beam search.
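Beam search as used in decoding can be sketched as follows. This is a minimal illustration over a toy next-token distribution (the function `next_token_logprobs` and its vocabulary are stand-ins for a real model, not part of any system described here): at each step every partial hypothesis is expanded with every possible next token, and only the highest-scoring partial hypotheses are kept.

```python
import math

# Toy "model": next-token log-probabilities given the prefix so far.
# This fixed distribution is an assumption for illustration only.
def next_token_logprobs(prefix):
    vocab = {"the": 0.5, "cat": 0.3, "</s>": 0.2}
    return {tok: math.log(p) for tok, p in vocab.items()}

def beam_search(beam_size=2, max_len=4):
    # Each hypothesis is (tokens, cumulative log-probability).
    beam = [([], 0.0)]
    finished = []
    for _ in range(max_len):
        # Expand every hypothesis in the beam with every next token.
        candidates = []
        for tokens, score in beam:
            for tok, lp in next_token_logprobs(tokens).items():
                candidates.append((tokens + [tok], score + lp))
        candidates.sort(key=lambda h: h[1], reverse=True)
        # Collect finished hypotheses; keep the beam_size best partial ones.
        beam = []
        for tokens, score in candidates:
            if tokens[-1] == "</s>":
                finished.append((tokens, score))
            elif len(beam) < beam_size:
                beam.append((tokens, score))
        if not beam:
            break
    finished.extend(beam)  # fall back to unfinished hypotheses if needed
    return max(finished, key=lambda h: h[1])
```

On this toy distribution the search returns the empty translation `["</s>"]`, since every added token lowers the cumulative log-probability; real decoders counter this well-known bias toward short output with length normalization or penalties.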

Inference is the main subject of 50 publications. 22 are discussed here.


Freitag and Al-Onaizan (2017) introduce threshold pruning to neural machine translation: hypotheses whose score falls below a certain fraction of the best hypothesis's score are discarded, yielding faster decoding while maintaining quality. Hu et al. (2015) modify search in two ways. First, instead of expanding all hypotheses of a stack of maximum size N, they expand only the single best hypothesis from any stack at a time; to avoid expanding only short hypotheses, a brevity penalty is introduced. Second, they limit the computation of translation probabilities to words that appear in the phrase table of a traditional phrase-based model for the input sentence, leading to several-fold speed-ups at little loss in quality. Extending this work, Mi et al. (2016) also consider top word translations and the most frequent words in the vocabulary as a filter for the prediction computation. Shi and Knight (2017) use dictionaries obtained from statistical alignment models and (unsuccessfully) locality-sensitive hashing to speed up decoding. Devlin (2017) obtains speed-ups with pre-computation and the use of 16-bit floating-point operations.
Monte-Carlo decoding was used by Ott et al. (2018) to analyze the search space and by Edunov et al. (2018) for back-translation.
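The relative threshold pruning of Freitag and Al-Onaizan (2017) described above can be sketched as a single filtering step per beam (the function name and candidate format are hypothetical; scores are assumed to be log-probabilities):

```python
import math

def relative_threshold_prune(hypotheses, fraction=0.5):
    """Keep only hypotheses whose probability is at least `fraction`
    of the best hypothesis's probability. Since scores are
    log-probabilities, the cutoff is best_score + log(fraction)."""
    if not hypotheses:
        return []
    best = max(score for _, score in hypotheses)
    cutoff = best + math.log(fraction)
    return [(tok, s) for tok, s in hypotheses if s >= cutoff]
```

For example, with candidates at probabilities 0.6, 0.25, and 0.05 and `fraction=0.5`, the cutoff is 0.3, so only the first candidate survives. Pruned hypotheses never need to be expanded, which is where the decoding speed-up comes from.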

Limiting Search Space:

Compared to statistical machine translation, neural machine translation may be less adequate, even if more fluent. In other words, the translation may diverge from the input in various ways, such as leaving part of the sentence untranslated or generating unrelated output words. Zhang et al. (2017) propose to limit the search space of the neural decoder to n-best lists generated by a phrase-based system. Khayrallah et al. (2017) extend this to the search lattice.


Reranking:

Niehues et al. (2017) explore the search space considered during decoding. While they find that decoding makes very few search errors, better translation results could be obtained by picking other translations considered during beam search. Liu et al. (2016) rerank the n-best list with a model that generates the output starting with the last word of the sentence, called right-to-left decoding. Their approach was successfully used by Sennrich et al. (2016) in their winning system for the WMT 2016 shared task. Hoang et al. (2017) propose reranking with a model trained in the inverse translation direction and with a language model. Li and Jurafsky (2016) generate more diverse n-best lists by adding a bias term that penalizes too many expansions of a single hypothesis. Stahlberg et al. (2017) use minimum Bayes risk to rerank decoding lattices; this method also allows the combination of SMT and NMT search graphs.
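The common core of these reranking methods is to combine the decoder's own score with the score of a second model on each n-best entry. A minimal sketch, assuming the second model scores the output in reverse order as in Liu et al. (2016) (the names `rerank` and `r2l_score` and the interpolation weight are illustrative, not from any of the cited papers):

```python
def rerank(nbest, r2l_score, weight=0.5):
    """Pick the best entry from an n-best list by interpolating the
    original left-to-right model score with the score a second model
    assigns to the reversed output. `r2l_score` is a stand-in for
    that second model."""
    def combined(hyp):
        tokens, l2r = hyp
        return (1 - weight) * l2r + weight * r2l_score(list(reversed(tokens)))
    return max(nbest, key=combined)
```

Reranking with a model trained in the inverse translation direction (Hoang et al., 2017) has the same shape: only the scoring function for the second term changes.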

Decoding Constraints:

In practical deployment of machine translation, there is often a need to override model predictions with pre-specified word or phrase translations, for instance to enforce required terminology or to support external components. Chatterjee et al. (2017) allow the specification of pre-defined translations for certain input words and modify the decoder to use them, based on input word attention. Hokamp and Liu (2017) modify the decoding algorithm to force the decoder to produce certain specified output strings. Each time such one of the output strings is produced, hypotheses are placed into a different beam, and final translations are picked from the beam that contains hypotheses that produced all specified output. Related to this idea, Anderson et al. (2017) mark hypotheses with states in a finite state machines that indicate the subset of constraints (pre-specified translatins) that have been satisfied. Hasler et al. (2018) refine this approach by using a linear (not exponential) number of constraint satisfaction states, and also remove attention from words whose constraints have been satisfied. Post and Vilar (2018) split up the beam into sub beams, instead of duplicating beams to prevent increase in decoding time for sentences with such constraints. Hu et al. (2019) extends this work with a trie structure to encode constraints, thus improving the handling of constraints that start with the same words, and also improve batching. Song et al. (2019) replace the words with their specified translations in the input and aid the translation of such code-switched data with a pointer network that handles the copying of the specified translations.
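The bookkeeping shared by these constrained-decoding methods is grouping hypotheses by how many constraints they have satisfied, so that partially constrained hypotheses are not crowded out of the beam. A simplified sketch of that grouping step (single-token constraints only; the function name and data layout are assumptions, and this is far from the full grid or banked beam search algorithm):

```python
from collections import defaultdict

def bank_hypotheses(hypotheses, constraints, bank_size=1):
    """Group hypotheses into banks by how many of the required target
    tokens they already contain, keeping the best `bank_size` per bank.
    Decoding then expands every bank, so hypotheses only compete
    against others with the same number of satisfied constraints."""
    banks = defaultdict(list)
    for tokens, score in hypotheses:
        met = sum(1 for c in constraints if c in tokens)
        banks[met].append((tokens, score))
    return {met: sorted(hyps, key=lambda h: h[1], reverse=True)[:bank_size]
            for met, hyps in banks.items()}
```

In Hokamp and Liu (2017) each bank is a full beam, while Post and Vilar (2018) divide a single fixed-size beam into such sub-beams, keeping decoding time independent of the number of constraints.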



Related Topics

New Publications

  • Zhang et al. (2017)
  • Stahlberg et al. (2018)
  • Lin et al. (2018)
  • Stahlberg et al. (2018)
  • Werlen et al. (2018)
  • Hoang et al. (2018)
  • Senellart et al. (2018)
  • Schulz et al. (2018)
  • Ma et al. (2018)
  • Shu and Nakayama (2018)
  • Chen et al. (2018)
  • Geng et al. (2018)
  • Alinejad et al. (2018)
  • Yang et al. (2018)
  • Zhang et al. (2018)
  • Shao et al. (2018)
  • Zhang et al. (2018)
  • Stahlberg and Byrne (2017)
  • He et al. (2017)
  • Dalvi et al. (2018)
  • Iglesias et al. (2018)
  • Gu et al. (2017)
  • Mi et al. (2016)
  • Gu et al. (2017)
  • Zhou et al. (2017)
  • Zhou et al. (2017)
  • Kikuchi et al. (2016)
  • Ishiwatari et al. (2017)