Domain Adaptation has been widely studied in traditional statistical machine translation. These techniques have been adapted and new techniques have been applied to neural machine translation models to adapt them to domain or other stylistic aspects.
Adaptation is the main subject of 28 publications.
Topics in NeuralNetworkModelsNeural Language Models | Attention Model | Coverage | Vocabulary | Embeddings | Multilingual Word Embeddings | Monolingual Data | Adaptation | Linguistic Annotation | Multilingual Multimodal Multitask | Alternative Architectures | Analysis And Visualization | Neural Components In Statistical Machine Translation
There is often a domain mismatch between the bulk (or even all) of the training data for a translation and its test data during deployment. There is rich literature in traditional statistical machine translation on this topic.
Fine Tuning:A common approach for neural models is to first train on all available training data, and then run a few iterations on in-domain data only (Luong and Manning, 2015), as already pioneered in neural language model adaption (Ter-Sarkisov et al., 2015). Servan et al. (2016) demonstrate the effectiveness of this adaptation method with small in-domain sets consisting of as little as 500 sentence pairs. (Etchegoyhen et al., 2018) evaluate the quality of such domain-adapted systems using subjective assessment and post-editor productivity measures. Chu et al. (2017) argue that given small amount of in-domain data leads to overfitting and suggest to mix in-domain and out-of-domain data during adaption. Freitag and Al-Onaizan (2016) identify the same problem and suggest to use an ensemble of baseline models and adapted models to avoid overfitting. Peris et al. (2017) consider alternative training methods for the adaptation phase but do not find consistently better results than the traditional gradient descent training. Vilar (2018) leave the general model parameters fixed during fine tuning, and only update an adaption layer in the recurrent states. Michel and Neubig (2018) only update an additional bias term in the output softmax. Dakwale and Monz (2017) regularize the training objective to include a term that penalizes departure from the word predictions of the un-adapted baseline model. Barone et al. (2017) use the L2 norm between baseline parameter values and adapted parameter values as regularizer in the objective function, in addition to drop out techniques. Wees et al. (2017) adopt curriculum training for the adaptation problem. They start with corpus consisting of all data, and the train on smaller and smaller subsets that are increasingly in-domain, as determined by language model cross-entropy.
Sentence-level adaptation:Before translating a sentence, Farajian et al. (2017); Li et al. (2018) propose to fetch a few similar sentences and their translations from a parallel corpus and adapt the neural translation model to this subsampled training set. Similarly, using only monolingual source side data, Chinea-Rios et al. (2017) subsample sentences similar to the sentences in a document to be translated and perform a self-training step. Self-training first translates the source text and then adapts the model to this synthetic parallel corpus. Gu et al. (2018) modify the model architecture to include the retrieved sentence pairs. These sentence pairs are stored in a neural key-value memory and words from these sentence pairs may be either copied over directly or fused with predictions of the baseline neural machine translation model. Zhang et al. (2018) extract phrase pairs from the retrieved sentence pairs, and add a bonus to hypotheses during search, if these contain them.
Subsampling and Instance Weighting:Inspired by domain adaptation work in statistical machine translation on sub-sampling, Wang et al. (2017) augment the canonical neural translation model with a sentence embedding state that allows distinction between in-domain and out-of-domain sentences. It is computed as the sum of all input word representations, and then used as initial state of the decoder. This sentence embedding allows them to distinguish between in-domain and out-of-domain sentences, using the centroids of all in-domain and out-of-domain sentence embeddings, respectively. Out-of-domain sentences that are closer to the in-domain centroid are included in the training data. Chen et al. (2017) combine the idea of sub-sampling with sentence weighting. They build an in-domain vs. out-of-domain classifier for sentence pairs in the training data, and then use its prediction score to reduce the learning rate for sentence pairs that are out of domain. Wang et al. (2017) also explore such sentence-level learning rate scaling, and compare it against oversampling of in-domain data, showing similar results. Farajian et al. (2017) show that traditional statistical machine translation outperforms neural machine translation when training general-purpose machine translation systems on a collection of data, and then tested on niche domains. The adaptation technique allows neural machine translation to catch up.
Domain Tokens:A multi-domain model may be trained and informed at run-time about the domain of the input sentence. Kobus et al. (2016) apply an idea initially proposed by Sennrich et al. (2016) - to augment input sentences for register with a politeness feature token - to the domain adaptation problem. They add a domain token to each training and test sentence. Tars and Fishel (2018) give results that show domain tokens outperform fine tuning and also explore word-level domain factors.
Topic Models:If the data contains sentences from multiple domains but the composition is unknown, then automatically detecting different domains (then typically called topics) with methods such as LDA is an option. Zhang et al. (2016) apply such clustering and then compute for each word a topic distribution vector. It is used in addition to the word embedding to inform the encoder and decoder in a otherwise canonical neural translation model. Instead of word-level topic vectors, Chen et al. (2016) encode the given domain membership of each sentence as an additional input vector to the conditioning context of word prediction layer. Tars and Fishel (2018) use sentence embeddings and k-means clustering to obtain topic clusters.