Neural Machine Translation
Statistical Machine Translation
The task of applying a trained model to generate a translation is called inference in machine learning, or more commonly decoding in machine translation. This problem is typically solved by beam search.
Inference is the main subject of 65 publications. 56 are discussed here.
Beam Search Refinements: Hu et al. (2015) modify search in two ways. Instead of expanding all hypotheses of a stack of maximum size N, they expand only the one best hypothesis from any stack at a time. To avoid only expanding the shortest hypothesis, a brevity penalty is introduced. Similarly, Shu and Nakayama (2018) do not generate a fixed number of hypotheses for each partial translation length, but instead organize hypotheses in a single priority queue regardless of their length. They also use a length penalty (called progress penalty) and a prediction of the output length. Freitag and Al-Onaizan (2017) introduce threshold pruning to neural machine translation. Hypotheses whose score falls below a certain fraction of the best score are discarded, which yields faster decoding while maintaining quality. Zhang et al. (2018) explore recombination, a well-known technique in statistical machine translation decoding. They merge hypotheses that share the most recent output words and that are of similar length, thus speeding up decoding and reaching higher quality with fixed beam size. Zhang et al. (2018) apply the idea of cube pruning to neural model decoding, by grouping together hypotheses with the same last output word into so-called "sub cubes". States are expanded sequentially, starting with the highest scoring hypothesis from the highest scoring sub cube, thus obtaining probabilities for subsequent hypotheses. For some hypotheses, states are not expanded when no promising new hypotheses would be generated from them. One problem with beam search is that larger beam sizes lead to earlier generation of the end-of-sentence symbol and thus shorter translations. Kikuchi et al. (2016) force the decoder to produce translations within a pre-specified length range by ignoring completed hypotheses outside that range. They also add a length embedding as an additional input feature to the decoder state progression. He et al. (2016) add a word bonus for each generated word (they also propose to add lexical translation probabilities and an n-gram language model). Murray and Chiang (2018) learn the optimal value for this word bonus. Huang et al. (2017) add a bounded word reward that boosts hypothesis length up to an expected optimal length. Yang et al. (2018) refine this reward and also change the stopping criteria for beam search so that sufficiently many long translations are generated.
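Two of the ideas above, threshold pruning and a bounded word reward, can be illustrated with a minimal sketch. It assumes that hypotheses are stored as (tokens, log-probability) pairs and that the pruning fraction applies to probabilities, so that it becomes an additive margin in log space. The function names, the example hypotheses, and the reward value are hypothetical and not taken from the cited papers.

```python
import math

def threshold_prune(hypotheses, fraction):
    """Keep hypotheses whose probability is at least `fraction` of the best one.

    `hypotheses` is a list of (tokens, log_prob) pairs. In log space the
    relative threshold turns into an additive margin of log(fraction).
    """
    best = max(log_prob for _, log_prob in hypotheses)
    cutoff = best + math.log(fraction)
    return [(tokens, lp) for tokens, lp in hypotheses if lp >= cutoff]

def bounded_word_reward(log_prob, length, expected_length, reward):
    """Add a per-word reward that stops growing at the expected length."""
    return log_prob + reward * min(length, expected_length)

# Hypothetical beam with three partial translations.
hyps = [
    (["the", "house"], math.log(0.5)),
    (["a", "house"], math.log(0.3)),
    (["house", "house"], math.log(0.01)),
]
pruned = threshold_prune(hyps, fraction=0.5)
kept_tokens = [tokens for tokens, _ in pruned]

# A hypothesis longer than the expected length gains no further reward.
short_score = bounded_word_reward(-2.0, length=2, expected_length=3, reward=0.5)
long_score = bounded_word_reward(-2.0, length=5, expected_length=3, reward=0.5)
```

With an unbounded word bonus, the reward term would simply be `reward * length`; the cap is what keeps the decoder from preferring arbitrarily long output.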
Stochastic Search: Monte-Carlo decoding was used by Ott et al. (2018) to analyze the search space and by Edunov et al. (2018) for back-translation.
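Monte-Carlo decoding can be sketched as sampling one word at a time from the model's predicted distribution instead of keeping the highest scoring words. The sketch below assumes a hypothetical callable `next_word_distribution` that maps the output prefix to a dictionary of word probabilities; the toy model is only there to make the example runnable.

```python
import random

def sample_translation(next_word_distribution, max_length=50, seed=None, eos="</s>"):
    """Draw one translation by sampling each word from the model distribution."""
    rng = random.Random(seed)
    output = []
    for _ in range(max_length):
        dist = next_word_distribution(output)
        words = list(dist)
        weights = [dist[w] for w in words]
        word = rng.choices(words, weights=weights, k=1)[0]
        if word == eos:
            break
        output.append(word)
    return output

def toy_model(prefix):
    """Hypothetical stand-in for a trained model: two words, then end of sentence."""
    if len(prefix) >= 2:
        return {"</s>": 1.0}
    return {"hello": 0.5, "world": 0.5}

sample = sample_translation(toy_model, seed=0)
```

Calling the function repeatedly with different seeds yields different samples, which is what makes the method useful for exploring the search space or for generating diverse back-translations.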
Greedy Search: Cho (2016) proposes a variant of greedy decoding where noise is added to the hidden state of the decoder. Multiple passes are performed with different random noise, and the translation with the highest probability assigned by the non-noisy model is picked. Gu et al. (2017) build on this idea to develop a trainable greedy decoding method. Instead of a noise term, they learn an adjustment term that is optimized on sentence-level translation quality (as measured by BLEU) using reinforcement learning.
Fast Decoding: Devlin (2017) obtains speed-ups with pre-computation and use of 16-bit floating point operations. Zhang et al. (2018) remove the normalization in the softmax output word prediction, after adjusting the training objective to perform self-normalization. Hoang et al. (2018) speed up decoding by batching several input sentences, refined k-best extraction with specialized GPU kernel functions, and use of 16-bit floating point operations. Iglesias et al. (2018) also show improvements with such batching. Argueta and Chiang (2019) fuse the softmax and k-best extraction computation. Senellart et al. (2018) build a smaller model with knowledge distillation that allows faster decoding.
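The idea behind self-normalization can be sketched as follows: the training loss penalizes the log of the softmax normalizer for deviating from zero, so that raw output scores can be used as approximate log-probabilities at decoding time without computing the normalization. The squared penalty and the weight `alpha` below are one common formulation and are an assumption here, not necessarily the objective used in the cited work.

```python
import math

def log_partition(logits):
    """Numerically stable log of the softmax normalizer, log sum_i exp(logit_i)."""
    m = max(logits)
    return m + math.log(sum(math.exp(x - m) for x in logits))

def self_normalized_loss(logits, target, alpha=0.1):
    """Negative log-likelihood plus a penalty that pushes log Z towards zero."""
    log_z = log_partition(logits)
    nll = -(logits[target] - log_z)
    return nll + alpha * log_z ** 2

# Scores that are already normalized incur no penalty ...
normalized = [math.log(p) for p in (0.7, 0.2, 0.1)]
loss_normalized = self_normalized_loss(normalized, target=0)

# ... while the same distribution with shifted scores (log Z = 2) is penalized.
shifted = [x + 2.0 for x in normalized]
loss_shifted = self_normalized_loss(shifted, target=0)
```

Once the model is trained this way, the decoder can read `logits[word]` directly as a score and skip the sum over the full vocabulary.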
Limiting Hypothesis Generation: Hu et al. (2015) limit the computation of translation probabilities to words that are in the phrase table of a traditional phrase-based model for the input sentence, leading to several-fold speed-ups at little loss in quality. Extending this work, Mi et al. (2016) also include top word translations and the most frequent words in the vocabulary filter for the prediction computation. Shi and Knight (2017) use dictionaries obtained from statistical alignment models and (unsuccessfully) Locality Sensitive Hashing to speed up decoding.
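A vocabulary filter of this kind can be sketched in a few lines. The sketch assumes a hypothetical `translation_table` that maps each source word to a ranked list of candidate translations, plus a list of frequent target words; the softmax is then computed over the resulting candidate set only. All names and data are illustrative.

```python
import math

def candidate_vocabulary(source_words, translation_table, frequent_words, top_n=2):
    """Collect frequent words and the top translations of each source word."""
    candidates = set(frequent_words)
    for word in source_words:
        candidates.update(translation_table.get(word, [])[:top_n])
    return candidates

def restricted_softmax(logits, candidates):
    """Normalize scores over the candidate words only.

    `logits` maps every vocabulary word to its raw score.
    """
    kept = {w: s for w, s in logits.items() if w in candidates}
    m = max(kept.values())
    z = sum(math.exp(s - m) for s in kept.values())
    return {w: math.exp(s - m) / z for w, s in kept.items()}

# Hypothetical dictionary for a two-word German input sentence.
table = {"das": ["the", "that"], "haus": ["house", "home", "building"]}
candidates = candidate_vocabulary(["das", "haus"], table, ["the", "</s>"], top_n=2)

logits = {"the": 2.0, "that": 0.5, "house": 1.5, "home": 1.0,
          "building": 0.8, "cat": 0.1, "</s>": 0.0}
probs = restricted_softmax(logits, candidates)
```

In a real system the saving comes from computing only the rows of the output layer that belong to candidate words; the dictionary lookup above stands in for that step.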
Limiting Search Space: Compared to statistical machine translation, neural machine translation may be less adequate, even if more fluent. In other words, the translation may diverge from the input in various ways, such as not translating part of the sentence or generating unrelated output words. Zhang et al. (2017) propose to limit the search space of the neural decoder to the search graph generated by a phrase-based system. Khayrallah et al. (2017) extend this to the search lattice.
Reranking: Niehues et al. (2017) explore the search space considered during decoding. While they find that decoding makes very few search errors, better translation results could be obtained by picking other translations considered during beam search. Similarly, Blain et al. (2017) observe that very large beam sizes hurt 1-best decoding but generate higher scoring translations in the n-best list. Liu et al. (2016) rerank the n-best list by training a model that generates the output starting with the last word of the sentence, called right-to-left decoding. Their approach was successfully used by Sennrich et al. (2016) in their winning system in the WMT 2016 shared task. Hoang et al. (2017) propose using a model trained in the inverse translation direction and a language model. Li and Jurafsky (2016) generate more diverse n-best lists by adding a bias term to penalize too many expansions of a single hypothesis. In a refinement, Li et al. (2016) learn the diversity rate with reinforcement learning, using as reward the generation of n-best lists that yield better translation quality after reranking. Stahlberg et al. (2017) use minimum Bayes risk to rerank decoding lattices. This method also allows the combination of SMT and NMT search graphs. Iglesias et al. (2018) show gains for minimum Bayes risk decoding for the Transformer model. Niehues et al. (2016) attach the phrase-based translation to the input sentence and feed that into a neural machine translation decoder. Geng et al. (2018) extend this idea to multi-pass decoding. The output of a regular decoding pass is used as additional input to a second decoding pass. This process is iterated for a fixed number of steps, or stopped based on the decision of a so-called policy network. Zhou et al. (2017) propose a system combination method that combines the output of different translation systems (such as NMT and variants of SMT) in the form of multi-source decoding, i.e., using multiple encoders, one for each system output, feeding into a single decoder that produces the consensus output.
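Minimum Bayes risk reranking can be illustrated on an n-best list rather than a lattice. In this sketch, each candidate is scored by its expected similarity to all other candidates, weighted by their posterior probability under the model. The unigram-overlap F1 used as the similarity is a simple stand-in for a metric such as BLEU, and the example n-best list is made up.

```python
import math
from collections import Counter

def overlap_f1(hyp, ref):
    """Unigram-overlap F1 between two token lists (a stand-in for BLEU)."""
    common = sum((Counter(hyp) & Counter(ref)).values())
    if common == 0:
        return 0.0
    precision = common / len(hyp)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)

def mbr_rerank(nbest):
    """Pick the candidate with the highest expected similarity to the list.

    `nbest` is a list of (tokens, log_prob) pairs; log-probabilities are
    renormalized over the list to obtain posterior weights.
    """
    m = max(lp for _, lp in nbest)
    weights = [math.exp(lp - m) for _, lp in nbest]
    total = sum(weights)
    posteriors = [w / total for w in weights]

    def expected_gain(candidate):
        return sum(p * overlap_f1(candidate, other)
                   for (other, _), p in zip(nbest, posteriors))

    return max(nbest, key=lambda item: expected_gain(item[0]))[0]

nbest = [
    ("the cat sat".split(), math.log(0.4)),
    ("a cat sat".split(), math.log(0.3)),
    ("a cat sat down".split(), math.log(0.3)),
]
choice = mbr_rerank(nbest)
```

In this example the selected translation is not the one with the highest model score: "a cat sat" wins because it is close to both of the other candidates, which together hold more probability mass than the 1-best.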
Decoding Constraints: In practical deployment of machine translation, there is often a need to override model predictions with pre-specified word or phrase translations, for instance to enforce required terminology or to support external components. Chatterjee et al. (2017) allow the specification of pre-defined translations for certain input words and modify the decoder to use them, based on input word attention. Hokamp and Liu (2017) modify the decoding algorithm to force the decoder to produce certain specified output strings. Each time one of these output strings is produced, hypotheses are placed into a different beam, and final translations are picked from the beam that contains hypotheses that produced all specified output. Related to this idea, Anderson et al. (2017) mark hypotheses with states in a finite state machine that indicate the subset of constraints (pre-specified translations) that have been satisfied. Hasler et al. (2018) refine this approach by using a linear (not exponential) number of constraint satisfaction states, and also remove attention from words whose constraints have been satisfied. Post and Vilar (2018) split up the beam into sub-beams instead of duplicating beams, to prevent an increase in decoding time for sentences with such constraints. Hu et al. (2019) extend this work with a trie structure to encode constraints, thus improving the handling of constraints that start with the same words, and also improve batching. Song et al. (2019) replace the words with their specified translations in the input and aid the translation of such code-switched data with a pointer network that handles the copying of the specified translations. Dinu et al. (2019) also present the specified translations as input, but in addition to the original source words, and use a source factor to label input tokens according to three classes: regular input word, input word with specified translation, and specified translation.
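The benefit of a trie for constraints that share their first words can be seen in a small sketch. It stores each constraint as a path of tokens in nested dictionaries and returns the tokens that would continue a constraint after a given prefix. The end marker, the function names, and the example constraints are hypothetical; this is not the data structure of any particular toolkit.

```python
END = "<end-of-constraint>"  # hypothetical marker for a completed constraint

def build_trie(constraints):
    """Store each constraint (a list of tokens) as a path in a nested dict."""
    root = {}
    for phrase in constraints:
        node = root
        for token in phrase:
            node = node.setdefault(token, {})
        node[END] = True
    return root

def next_tokens(trie, prefix):
    """Tokens that continue some constraint after the given constraint prefix."""
    node = trie
    for token in prefix:
        if token not in node:
            return set()
        node = node[token]
    return {token for token in node if token != END}

def is_complete(trie, prefix):
    """True if the prefix spells out a full constraint."""
    node = trie
    for token in prefix:
        if token not in node:
            return False
        node = node[token]
    return END in node

trie = build_trie([["New", "York"], ["New", "Jersey"], ["Boston"]])
starts = next_tokens(trie, [])
after_new = next_tokens(trie, ["New"])
```

Because both constraints beginning with "New" share one node, a hypothesis that has just produced "New" does not have to commit to either of them before the next word is chosen.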
Simultaneous Translation: Integrating speech recognition and machine translation for the real-time translation of spoken language requires decoding algorithms that operate on an incoming stream of input words and produce translations for them before the input sentence is complete, as much as that is possible. Satija and Pineau (2016) propose using reinforcement learning to learn the trade-off between waiting for input and producing output. Cho and Esipova (2016) frame the problem as predicting a sequence of read and write actions, i.e., reading an additional input word and writing out an output word. Gu et al. (2017) optimize the decoding algorithm with reinforcement learning based on this framework. Alinejad et al. (2018) refine this with a prediction operation that predicts the next input words. Dalvi et al. (2018) propose a simpler static read-and-write approach that reads a certain number of input words ahead. Similarly, Ma et al. (2019) use a wait-k strategy that reads a fixed number of words ahead and train a prefix-to-prefix translation model. They argue that their model learns to anticipate missing content. Arivazhagan et al. (2019) integrate the learning of the size of the look-ahead window into the attention mechanism. Their training objective takes both prediction accuracy and look-ahead penalty into account. Similarly, Zheng et al. (2019) also train an end-to-end model that learns translation predictions and look-ahead (i.e., read) operations at the same time. For training, action sequences are generated from the training data with different look-ahead window sizes.
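The read and write framing can be made concrete with the action sequence of a fixed wait-k policy. The sketch below assumes that source and target lengths are known in advance, which is only true in an offline illustration; the function name and action labels are hypothetical.

```python
def wait_k_schedule(source_length, target_length, k):
    """Action sequence of a fixed wait-k policy.

    Before writing target word number t (counting from 1), the policy has
    read min(source_length, t + k - 1) source words.
    """
    actions = []
    read = 0
    for written in range(target_length):
        needed = min(source_length, written + k)
        while read < needed:
            actions.append("READ")
            read += 1
        actions.append("WRITE")
    return actions

schedule = wait_k_schedule(source_length=4, target_length=4, k=2)
# schedule: READ READ WRITE READ WRITE READ WRITE WRITE
```

A larger k delays the first output word but gives the model more context for every prediction, which is the trade-off that the learned policies above try to optimize.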
Lattice Decoding: In the case of off-line speech translation, i.e., for a stored audio file without any real-time requirements, tighter integration of the speech recognition component and the machine translation component may be attempted. A common strategy is to expose the full search graph of the speech recognition system in the form of a word lattice, a method that also works for preserving ambiguity for word segmentation, morphological analysis, or differing byte pair encoding vocabularies. Zhang et al. (2019) propose a novel attention mechanism over lattices. For any given node, it excludes from consideration the nodes in the lattice that cannot be on the same path, and it also incorporates the probabilities of nodes in the lattice.
Interactive Translation Prediction: Another special decoding scenario is interactive translation by machines and humans. In this setup, the machine translation system offers suggestions for word translations, one word at a time, which the human translator either accepts or modifies. Either way, the machine translation system has to propose extensions to the current partial translation. Knowles and Koehn (2016) show that neural methods make better predictions than traditional statistical methods based on search lattices. Wuebker et al. (2016) and Peris et al. (2017) also suggest to force-decode the given partial translation and let the model make subsequent predictions. Knowles et al. (2019) carry out a study with professional translators, showing that interactive translation prediction allows some of them to translate faster. Peris and Casacuberta (2019) extend this technology to other sequence-to-sequence tasks, such as image captioning.
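Force-decoding a partial translation and then letting the model continue can be sketched with a greedy loop. The sketch assumes a hypothetical callable `next_word` that returns the model's best next word for a given output prefix; the bigram lookup table below is a toy stand-in for a trained model.

```python
def complete_prefix(next_word, prefix, max_length=50, eos="</s>"):
    """Keep the translator's prefix fixed and extend it greedily."""
    output = list(prefix)
    while len(output) < max_length:
        word = next_word(output)
        if word == eos:
            break
        output.append(word)
    return output

# Toy stand-in for a model: the next word depends only on the previous one.
BIGRAMS = {None: "the", "the": "house", "house": "is", "home": "is",
           "is": "small", "small": "</s>"}

def toy_next_word(prefix):
    last = prefix[-1] if prefix else None
    return BIGRAMS.get(last, "</s>")

unconstrained = complete_prefix(toy_next_word, [])
after_edit = complete_prefix(toy_next_word, ["the", "home"])
```

Here the translator has replaced the suggested word "house" with "home", and the remaining words are predicted conditioned on that choice. In a neural decoder, the prefix words would also be fed through the decoder to update its state before the first new prediction is made.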