Domain adaptation has been widely studied in traditional statistical machine translation. These techniques have been carried over, and new ones developed, to adapt neural machine translation models to a domain or to other stylistic aspects.
Adaptation is the main subject of 64 publications. 39 are discussed here.
There is often a domain mismatch between the bulk (or even all) of the training data for a translation system and its test data during deployment. There is a rich literature in traditional statistical machine translation on this topic.
Fine Tuning: A common approach for neural models is to first train on all available training data, and then run a few iterations on in-domain data only (Luong and Manning, 2015), as already pioneered in neural language model adaptation (Ter-Sarkisov et al., 2015). Servan et al. (2016) demonstrate the effectiveness of this adaptation method with small in-domain sets consisting of as few as 500 sentence pairs. Etchegoyhen et al. (2018) evaluate the quality of such domain-adapted systems using subjective assessment and post-editor productivity measures. Chu et al. (2017) argue that training on a small amount of in-domain data leads to overfitting and suggest mixing in-domain and out-of-domain data during adaptation. Freitag and Al-Onaizan (2016) identify the same problem and suggest using an ensemble of baseline models and adapted models to avoid overfitting. Peris et al. (2017) consider alternative training methods for the adaptation phase but do not find consistently better results than the traditional gradient descent training. Vilar (2018) leaves the general model parameters fixed during fine tuning, and only updates an adaptation layer in the recurrent states. Michel and Neubig (2018) only update an additional bias term in the output softmax. Thompson et al. (2018) explore which parameters (input embedding, recurrent state propagation, etc.) may be left unchanged while still obtaining good adaptation results. Dakwale and Monz (2017); Khayrallah et al. (2018) regularize the training objective to include a term that penalizes departure from the word predictions of the un-adapted baseline model. Barone et al. (2017) use the L2 norm between baseline parameter values and adapted parameter values as a regularizer in the objective function, in addition to dropout techniques. Thompson et al. (2019) show superior results with a technique called elastic weight consolidation that also tends to preserve model parameters that were important for general model translation quality.
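The parameter-distance regularizer mentioned above can be illustrated with a small sketch. It is not taken from any of the cited papers: the function names, the use of the squared L2 distance, and the values of `strength` and `lr` are assumptions chosen for illustration, and the parameters are plain lists of floats rather than real model weights.

```python
def l2_to_baseline(adapted, baseline):
    """Squared L2 distance between adapted and baseline parameters.

    Assumption: the squared distance is used here because its gradient
    is simple; the cited work may define the penalty differently.
    """
    return sum((a - b) ** 2 for a, b in zip(adapted, baseline))


def regularized_loss(task_loss, adapted, baseline, strength=0.1):
    """In-domain loss plus a penalty for drifting away from the baseline."""
    return task_loss + strength * l2_to_baseline(adapted, baseline)


def sgd_step(adapted, baseline, task_grad, lr=0.1, strength=0.1):
    """One gradient step on the regularized objective.

    The gradient of strength * (a - b)**2 with respect to a is
    2 * strength * (a - b), which pulls each parameter back toward b.
    """
    return [
        a - lr * (g + 2.0 * strength * (a - b))
        for a, b, g in zip(adapted, baseline, task_grad)
    ]


if __name__ == "__main__":
    baseline = [1.0, -2.0]   # hypothetical general-domain parameters
    adapted = [1.5, -2.0]    # hypothetical parameters after some fine tuning
    print(regularized_loss(0.8, adapted, baseline))
    print(sgd_step(adapted, baseline, task_grad=[0.0, 0.0]))
```

With a zero task gradient, the step moves only the parameter that has drifted from the baseline, which is the intended effect of the penalty.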
Curriculum training: Wees et al. (2017) adopt curriculum training for the adaptation problem. They start with a corpus consisting of all data, and then train on smaller and smaller subsets that are increasingly in-domain, as determined by a language model. Kocmi and Bojar (2017) employ curriculum training by first training on simpler sentence pairs, measured by the length of the sentences, the number of coordinating conjunctions, and the frequency of words. Platanios et al. (2019) show that a refined scheme that selects data of increasing difficulty based on the training progress converges faster and gives better performance for Transformer models. Zhang et al. (2019) explore various other curriculum schedules based on difficulty, including training on the hard examples first. Kumar et al. (2019) learn a curriculum for data of different degrees of noisiness with reinforcement learning, using gains on the validation set as rewards.
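The schedule of shrinking, increasingly in-domain subsets can be sketched as follows. This is a minimal illustration under stated assumptions: the in-domain scores are taken as given (the language model scorer is not shown), and the function name `curriculum_subsets` and the halving factor `shrink=0.5` are hypothetical choices, not values from the cited work.

```python
def curriculum_subsets(scored_pairs, epochs, shrink=0.5):
    """Yield one training subset per epoch, each smaller and more in-domain.

    scored_pairs: list of (sentence_pair, score), where a higher score
    means the pair looks more in-domain. How the score is computed
    (for example with a language model) is outside this sketch.
    """
    ranked = sorted(scored_pairs, key=lambda item: item[1], reverse=True)
    size = len(ranked)
    for _ in range(epochs):
        # Keep only the currently highest-scoring part of the corpus.
        yield [pair for pair, _ in ranked[:size]]
        size = max(1, int(size * shrink))


if __name__ == "__main__":
    # Toy corpus: the pair "p7" has the highest in-domain score.
    corpus = [(f"p{i}", float(i)) for i in range(8)]
    for epoch, subset in enumerate(curriculum_subsets(corpus, epochs=3)):
        print(epoch, subset)
```

The first epoch sees the full corpus, and each later epoch sees only the top-scoring fraction of the previous one.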
Sentence-level adaptation: Before translating a sentence, Farajian et al. (2017); Li et al. (2018) propose to fetch a few similar sentences and their translations from a parallel corpus and adapt the neural translation model to this subsampled training set. Similarly, using only monolingual source side data, Chinea-Rios et al. (2017) subsample sentences similar to the sentences in a document to be translated and perform a self-training step. Self-training first translates the source text and then adapts the model to this synthetic parallel corpus. Gu et al. (2018) modify the model architecture to include the retrieved sentence pairs. These sentence pairs are stored in a neural key-value memory and words from these sentence pairs may be either copied over directly or fused with predictions of the baseline neural machine translation model. Zhang et al. (2018) extract phrase pairs from the retrieved sentence pairs, and add a bonus to hypotheses during search if these contain them. Bapna and Firat (2019) retrieve similar sentence pairs from a domain-specific corpus at inference time and provide these as additional conditioning context. Kothur et al. (2018) show that machine translation systems can be adapted instantly to the post-edits of a translator working through a single document. They show gains with both fine-tuning to edited sentence pairs and adding new word translations via fine-tuning. Wuebker et al. (2018) build personalized translation models in a similar scenario. They modify only the output layer predictions and use group lasso regularization to limit the divergence between the general model and the personalized offsets. Simianer et al. (2019) compare different sentence-level adaptation training methods in terms of how well they translate words that occur once in an adaptation sentence pair as well as new words not yet encountered during adaptation. They show that lasso-adaptation (Wuebker et al., 2018) improves on once-seen words while not degrading on words not previously encountered.
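The retrieval step that several of these methods share can be sketched in a few lines. The sketch below is an assumption-laden illustration: it uses Jaccard overlap of whitespace tokens as the similarity measure, which is a stand-in and not the measure used in the cited papers, and the names `jaccard` and `retrieve_similar` are hypothetical. The subsequent adaptation of the model to the retrieved pairs is not shown.

```python
def jaccard(a, b):
    """Token-set overlap between two whitespace-tokenized sentences."""
    set_a, set_b = set(a.split()), set(b.split())
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def retrieve_similar(source, corpus, k=2):
    """Return the k (source, target) pairs whose source side is most similar.

    corpus: list of (source_sentence, target_sentence) tuples.
    """
    ranked = sorted(
        corpus, key=lambda pair: jaccard(source, pair[0]), reverse=True
    )
    return ranked[:k]


if __name__ == "__main__":
    # Toy parallel corpus, invented for illustration.
    corpus = [
        ("the cat sat", "le chat s'est assis"),
        ("open the file menu", "ouvrir le menu fichier"),
        ("close the file", "fermer le fichier"),
    ]
    for pair in retrieve_similar("open the file", corpus, k=2):
        print(pair)
```

The retrieved pairs would then serve as the small training set for a few adaptation steps, or as extra context, depending on the method.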
Subsampling and Instance Weighting: Inspired by domain adaptation work in statistical machine translation on sub-sampling, Wang et al. (2017) augment the canonical neural translation model with a sentence embedding state that allows distinction between in-domain and out-of-domain sentences. It is computed as the sum of all input word representations, and then used as the initial state of the decoder. This sentence embedding allows them to distinguish between in-domain and out-of-domain sentences, using the centroids of all in-domain and out-of-domain sentence embeddings, respectively. Out-of-domain sentences that are closer to the in-domain centroid are included in the training data. Chen et al. (2017) combine the idea of sub-sampling with sentence weighting. They build an in-domain vs. out-of-domain classifier for sentence pairs in the training data, and then use its prediction score to reduce the learning rate for sentence pairs that are out of domain. Wang et al. (2017) also explore such sentence-level learning rate scaling, and compare it against oversampling of in-domain data, showing similar results. Farajian et al. (2017) show that traditional statistical machine translation outperforms neural machine translation when general-purpose machine translation systems are trained on a collection of data and then tested on niche domains. The adaptation technique allows neural machine translation to catch up.
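The centroid-based selection described above can be sketched as follows. This is one possible reading of the selection rule, not the authors' implementation: sentence embeddings are assumed to be given as lists of floats, Euclidean distance is an assumption, and an out-of-domain sentence is kept when it lies closer to the in-domain centroid than to the out-of-domain centroid. All function names are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def distance(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def select_out_of_domain(in_vecs, out_vecs):
    """Return indices of out-of-domain vectors nearer the in-domain centroid."""
    c_in, c_out = centroid(in_vecs), centroid(out_vecs)
    return [
        i for i, v in enumerate(out_vecs)
        if distance(v, c_in) < distance(v, c_out)
    ]


if __name__ == "__main__":
    # Toy 2-dimensional embeddings, invented for illustration.
    in_domain = [[1.0, 1.0], [1.0, 3.0]]
    out_of_domain = [[1.0, 2.0], [9.0, 9.0], [10.0, 8.0]]
    print(select_out_of_domain(in_domain, out_of_domain))
```

The selected indices identify the out-of-domain sentence pairs that would be added to the in-domain training data.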
Domain Tokens: A multi-domain model may be trained and informed at run-time about the domain of the input sentence. Kobus et al. (2016) apply an idea initially proposed by Sennrich et al. (2016), who augment input sentences with a politeness feature token to control register, to the domain adaptation problem. They add a domain token to each training and test sentence. Tars and Fishel (2018) give results that show domain tokens outperform fine tuning and also explore word-level domain factors.
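Adding a domain token is a preprocessing step, as the following sketch shows. The token format `<domain>` and its position at the start of the source sentence are assumptions made for illustration; the cited papers may place or format the token differently.

```python
def add_domain_token(sentence, domain):
    """Prefix a source sentence with a domain token such as <medical>."""
    return f"<{domain}> {sentence}"


def tag_corpus(pairs, domain):
    """Tag the source side of every (source, target) pair with the domain."""
    return [(add_domain_token(src, domain), tgt) for src, tgt in pairs]


if __name__ == "__main__":
    # Toy sentence pair, invented for illustration.
    medical = [("the patient was discharged", "le patient est sorti")]
    print(tag_corpus(medical, "medical"))
```

The same function is applied at test time, so that the model sees the token format it was trained on.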
Topic Models: If the data contains sentences from multiple domains but the composition is unknown, then automatically detecting different domains (then typically called topics) with methods such as LDA is an option. Zhang et al. (2016) apply such clustering and then compute for each word a topic distribution vector. It is used in addition to the word embedding to inform the encoder and decoder in an otherwise canonical neural translation model. Instead of word-level topic vectors, Chen et al. (2016) encode the given domain membership of each sentence as an additional input vector to the conditioning context of the word prediction layer. Tars and Fishel (2018) use sentence embeddings and k-means clustering to obtain topic clusters.
Noisy Data: Text to be translated by machine translation models may be noisy, either due to misspellings or creative language use, which is common in social media text. Machine translation models may be adapted to such noise to be more robust. Vaibhav et al. (2019) add synthetic training data that contains types of noise similar to what has been seen in a test set of web discussion forum posts. Anastasopoulos et al. (2019) employ corpora from grammatical error correction tasks (sentences with errors from non-native speakers alongside their corrections) to create synthetic input that mirrors the same type of errors. They compare translation quality between clean and noisy input and reduce the gap by adding similar synthetic noisy data to training.
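Creating synthetic noisy input can be sketched with a very simple typo model. The sketch below swaps adjacent characters at random, which is only one hypothetical noise type and is not claimed to match the noise types used in the cited papers; the function names, the probability `prob`, and the fixed `seed` are assumptions for illustration.

```python
import random


def swap_adjacent(word, rng):
    """Swap one random pair of adjacent characters (a simple typo model)."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def add_noise(sentence, prob=0.1, seed=0):
    """Corrupt each whitespace-separated word with probability prob."""
    rng = random.Random(seed)  # seeded so that the noise is reproducible
    return " ".join(
        swap_adjacent(word, rng) if rng.random() < prob else word
        for word in sentence.split()
    )


if __name__ == "__main__":
    clean_source = "please translate this message"
    # The noisy source would be paired with the unchanged clean target.
    print(add_noise(clean_source, prob=0.5, seed=1))
```

Pairing the corrupted source with the original target translation gives additional training examples that expose the model to noisy input.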