Neural Machine Translation
Neural machine translation models are typically trained to predict the words of sentence pairs from a parallel corpus, with cross-entropy loss as the objective function.
Training is the main subject of 54 publications. 25 are discussed here.
Adjusting the Learning Rate: An active topic of research is optimization methods that adjust the learning rate of gradient descent training. Popular methods are Adagrad (Duchi et al., 2011), Adadelta (Zeiler, 2012), and currently Adam (Kingma and Ba, 2015).
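To make the idea of an adaptive learning rate concrete, here is a minimal NumPy sketch of one Adam update step, following the update rule of Kingma and Ba (2015); the function name and default hyperparameters are illustrative, not taken from any particular toolkit.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: exponential moving averages of the gradient (m)
    and its square (v) yield a per-parameter effective learning rate."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)   # bias correction for the first moment
    v_hat = v / (1 - beta2 ** t)   # bias correction for the second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

The division by the square root of the second-moment estimate is what shrinks the step size for parameters with consistently large gradients and enlarges it for parameters with small ones.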
Sequence-Level Optimization: Shen et al. (2016) introduce minimum risk training, which allows for sentence-level optimization with metrics such as the BLEU score. A set of possible translations is sampled, and their relative probabilities are used to compute the expected loss (the probability-weighted BLEU scores of the sampled translations). They show large gains on a Chinese-English task. Neubig (2016) also showed gains when optimizing towards smoothed sentence-level BLEU, using a sample of 20 translations. Hashimoto and Tsuruoka (2019) optimize towards the GLEU score and speed up training by vocabulary reduction. Wiseman and Rush (2016) use a loss function that penalizes the gold standard falling off the beam during training. Ma et al. (2019) also consider the point where the gold standard falls off the beam, but record the loss for this initial sequence prediction and then reset the beam to the gold standard at that point. Edunov et al. (2018) compare various word-level and sentence-level optimization techniques but see only small gains from the best-performing sentence-level minimum risk method over alternatives. Xu et al. (2019) use a mix of gold-standard and predicted words in the prefix, employing an alignment component to keep the mixed prefix and the target training sentence in sync. Zhang et al. (2019) gradually shift from matching the ground truth towards so-called word-level oracles obtained with Gumbel noise and sentence-level oracles obtained by selecting the BLEU-best translation from the n-best list produced by beam search.
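The core computation in minimum risk training can be sketched as follows: given the model log-probabilities and BLEU scores of a sample of translations, renormalize a sharpened distribution over the sample and compute the probability-weighted risk (here 1 - BLEU). The function name and the smoothing factor `alpha` are illustrative assumptions, not from a specific implementation.

```python
import math

def expected_risk(sample_logprobs, sample_bleus, alpha=0.005):
    """Minimum-risk objective over a sample of translations:
    renormalize sharpened model probabilities over the sample,
    then weight the per-translation risk (1 - BLEU) by probability."""
    scaled = [alpha * lp for lp in sample_logprobs]
    m = max(scaled)                      # subtract max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    z = sum(weights)
    probs = [w / z for w in weights]
    return sum(p * (1.0 - bleu) for p, bleu in zip(probs, sample_bleus))
```

Minimizing this quantity pushes probability mass towards translations with high BLEU, without requiring the metric to be differentiable.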
Right-to-Left Training: Several researchers report that translation quality for the right half of the sentence is lower than for the left half and attribute this to exposure bias: during training, a correct prefix is used to make word predictions (so-called teacher forcing), while during decoding only the previously predicted words can be used. Wu et al. (2018) show that this imbalance is to a large degree due to linguistic reasons: it occurs for right-branching languages like English and Chinese, but the opposite holds for left-branching languages like Japanese.
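The contrast between teacher forcing and free-running decoding can be illustrated with a toy decoder loop; the dictionary-based "model" below is purely hypothetical and only serves to show how an early error compounds when the model conditions on its own predictions.

```python
def decode(step_fn, gold, teacher_forcing):
    """Toy decoder loop. With teacher forcing, the next prediction is
    conditioned on the gold prefix; without it, on the model's own
    (possibly erroneous) previous prediction -- the exposure bias."""
    output, prev = [], "<s>"
    for t in range(len(gold)):
        pred = step_fn(prev)
        output.append(pred)
        prev = gold[t] if teacher_forcing else pred
    return output
```

With a toy model that mistranslates "a" as "x" and has never seen "x" as context, teacher forcing recovers after the error, while free-running decoding gets stuck repeating it.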
Adversarial Training: Wu et al. (2017) introduce adversarial training to neural machine translation: a discriminator is trained alongside a traditional machine translation model to distinguish between machine translation output and human reference translations, and the ability to fool the discriminator is used as an additional training objective for the translation model. Yang et al. (2018) propose a similar setup but add a BLEU-based training objective to neural translation model training. Cheng et al. (2018) employ adversarial training to address the problem of robustness, which they identify in the observation that 70% of translations change when an input word is replaced with a synonym. They aim for more robust behavior by adding synthetic training data in which one of the input words is replaced with a synonym (a neighbor in embedding space) and by using a discriminator that predicts from the encoding of an input sentence whether it is an original or an altered source sentence.
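The synonym-substitution step used to create perturbed source sentences can be sketched as a nearest-neighbor lookup in embedding space; the function names and the tiny embedding table in the usage example are illustrative assumptions, not part of Cheng et al.'s actual pipeline.

```python
import numpy as np

def nearest_synonym(word, embeddings):
    """Return the vocabulary word whose embedding has the highest cosine
    similarity to `word` (excluding the word itself)."""
    v = embeddings[word]
    best, best_sim = None, -2.0
    for other, u in embeddings.items():
        if other == word:
            continue
        sim = float(np.dot(v, u) / (np.linalg.norm(v) * np.linalg.norm(u)))
        if sim > best_sim:
            best, best_sim = other, sim
    return best

def perturb(sentence, position, embeddings):
    """Create a synthetic training sentence by swapping in a synonym."""
    out = list(sentence)
    out[position] = nearest_synonym(out[position], embeddings)
    return out
```

Pairs of original and perturbed sentences then supply both the extra training data and the positive/negative examples for the discriminator.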
Knowledge Distillation: Several techniques change the loss function to reward word predictions that not only closely match the training data but also closely match the predictions of a previous model, called the teacher model. Khayrallah et al. (2018) use a general-domain model as the teacher to avoid overfitting to in-domain data during domain adaptation by fine-tuning. Wei et al. (2019) use the models that achieved the best results at previous checkpoints during training to guide training.
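A common way to combine the two signals is to interpolate the cross-entropy against the gold word with the cross-entropy against the teacher's word distribution; the sketch below assumes this interpolated form, and the weight `alpha` is an illustrative hyperparameter.

```python
import numpy as np

def distill_loss(student_logits, teacher_probs, gold_index, alpha=0.5):
    """Interpolated distillation loss for one word prediction:
    (1 - alpha) * CE(student, gold word) + alpha * CE(student, teacher)."""
    logits = student_logits - np.max(student_logits)   # numerical stability
    log_probs = logits - np.log(np.sum(np.exp(logits)))
    ce_gold = -log_probs[gold_index]                   # match the training data
    ce_teacher = -np.sum(teacher_probs * log_probs)    # match the teacher
    return (1.0 - alpha) * ce_gold + alpha * ce_teacher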
Faster Training: Ott et al. (2018) improve training speed with 16-bit arithmetic and larger batches, which lead to less idle time due to reduced variance in processing batches on different GPUs. They scale up training to 128 GPUs.