Parallel corpora may contain misaligned or otherwise noisy sentence pairs whose removal may help.
Corpus Cleaning is the main subject of 37 publications. 21 are discussed here.
Statistical machine translation models are generally assumed to be fairly robust to noisy data, such as data that includes misalignments. This is less true for neural machine translation models (Khayrallah and Koehn, 2018).
Data cleaning has been shown to help (Vogel, 2003). Often, for instance in the case of news reports that are rewritten for a different audience during translation, documents are not very parallel, so the task of sentence alignment becomes more of a task of sentence extraction (Fung and Cheung, 2004; Fung and Cheung, 2004b). Especially when only small amounts of training data are available, it has proven crucial for good performance to exploit all of the data, whether by augmenting phrase translation tables to include all words or by breaking up sentences that are too long (Mermer et al., 2007).
There is a robust body of work on filtering out noise in parallel data. For example: Taghipour et al. (2011) use an outlier detection algorithm to filter a parallel corpus; Xu and Koehn (2017) generate synthetic noisy data (inadequate and non-fluent translations) and use this data to train a classifier to identify good sentence pairs from a noisy corpus; and Cui et al. (2013) use a graph-based random walk algorithm and extract phrase pair scores to weight the phrase translation probabilities to bias towards more trustworthy ones.
Most of this work was done in the context of statistical machine translation, but more recent work (Carpuat et al., 2017) targets neural models. That work focuses on identifying semantic differences in translation pairs using cross-lingual textual entailment and additional length-based features, and demonstrates that removing such sentences improves neural machine translation performance.
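To make the idea of a length-based feature concrete, the following is a minimal sketch of a token-length-ratio filter. It is not the method of Carpuat et al. (2017), which also relies on cross-lingual textual entailment; the function names, the whitespace tokenization, and the `max_ratio=2.0` threshold are all assumptions chosen for illustration.

```python
def length_ratio(src: str, tgt: str) -> float:
    """Ratio of the longer to the shorter side in whitespace tokens.

    Returns a value >= 1.0, or infinity if either side is empty.
    """
    n_src, n_tgt = len(src.split()), len(tgt.split())
    if min(n_src, n_tgt) == 0:
        return float("inf")
    return max(n_src, n_tgt) / min(n_src, n_tgt)


def filter_by_length_ratio(pairs, max_ratio=2.0):
    """Keep sentence pairs whose length ratio does not exceed max_ratio.

    max_ratio=2.0 is an illustrative threshold, not a recommended setting.
    """
    return [(s, t) for s, t in pairs if length_ratio(s, t) <= max_ratio]


pairs = [
    ("the cat sleeps", "die Katze schläft"),
    ("hello", "dies ist ein sehr langer und nicht passender Satz"),
    ("", "leer"),
]
kept = filter_by_length_ratio(pairs)
```

In practice such a ratio would more likely serve as one feature among several in a classifier than as a hard cutoff, since plausible length ratios vary by language pair.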
As Rarrick et al. (2011) point out, one problem of parallel corpora extracted from the web is translations that have been created by machine translation. Venugopal et al. (2011) propose a method to watermark the output of machine translation systems to aid this distinction. Antonova and Misyurev (2011) report that rule-based machine translation output can be detected due to certain word choices, and statistical machine translation output due to lack of reordering.
In 2016, a shared task on sentence pair filtering was organized (Barbu et al., 2016), albeit in the context of cleaning translation memories, which tend to be cleaner than web-crawled data. In 2018, a shared task explored filtering techniques for neural machine translation (Koehn et al., 2018).
Belinkov and Bisk (2018) investigate noise in neural machine translation, but they focus on creating systems that can translate the kinds of orthographic errors (typos, misspellings, etc.) that humans can comprehend. In contrast, we address noisy training data and focus on types of noise occurring in web-crawled corpora.
There is a rich literature on data selection, which aims at sub-sampling parallel data relevant for a task-specific machine translation system (Axelrod et al., 2011). Wees et al. (2017) find that the existing data selection methods developed for statistical machine translation are less effective for neural machine translation. This differs from our goal of handling noise, since those methods tend to discard perfectly fine sentence pairs (say, about cooking recipes) that are just not relevant for the targeted domain (say, software manuals). Our work is focused on noise that is harmful for all domains.
Since we begin with a clean parallel corpus and add potentially noisy data to it, this work can be seen as a type of data augmentation. Sennrich et al. (2016) incorporate monolingual corpora into NMT by first translating them using an NMT system trained in the opposite direction. While such a corpus has the potential to be noisy, the method is very effective. Currey et al. (2017) create additional parallel corpora by copying monolingual corpora in the target language into the source, and find it improves over back-translation for some language pairs. Fadaee et al. (2017) improve NMT performance in low-resource settings by altering existing sentences to create training data that includes rare words in different contexts.
Other work has also considered copying in NMT. Currey et al. (2017) add copied data and back-translated data to a clean parallel corpus. They report improvements on English-Romanian when adding as much back-translated and copied data as they have parallel data (1:1:1 ratio). For English-Turkish and English-German, they add twice as much back-translated and copied data as parallel data (1:2:2 ratio), and report improvements on English-Turkish but not on English-German. However, their English-German systems trained with the copied corpus did not perform worse than baseline systems.
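The mixing ratios described above can be sketched as follows. This is only an illustration of the bookkeeping, not the authors' implementation: `make_copied_corpus` and `mix_corpora` are hypothetical helper names, and the back-translated pairs are assumed to have been produced elsewhere by a reverse-direction system.

```python
def make_copied_corpus(target_mono):
    """Turn target-language monolingual sentences into (source, target)
    pairs in which the source side is a verbatim copy of the target."""
    return [(t, t) for t in target_mono]


def mix_corpora(parallel, back_translated, copied, ratio=(1, 1, 1)):
    """Concatenate parallel, back-translated, and copied data.

    ratio is (parallel : back-translated : copied); the amounts of the
    two synthetic corpora are sized relative to the parallel corpus.
    """
    p, b, c = ratio
    n_bt = len(parallel) * b // p
    n_cp = len(parallel) * c // p
    if n_bt > len(back_translated) or n_cp > len(copied):
        raise ValueError("not enough synthetic data for the requested ratio")
    return parallel + back_translated[:n_bt] + copied[:n_cp]


# Toy data (hypothetical): two real pairs, two back-translated pairs,
# and two monolingual target sentences to copy.
parallel = [("good morning", "guten Morgen"), ("thank you", "danke")]
back_translated = [("bt source 0", "Zielsatz 0"), ("bt source 1", "Zielsatz 1")]
copied = make_copied_corpus(["Noch ein Satz", "Ein weiterer Satz"])

mixed = mix_corpora(parallel, back_translated, copied, ratio=(1, 1, 1))
```

Passing `ratio=(1, 2, 2)` would correspond to the English-Turkish and English-German setting, provided twice as much synthetic data is available.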
Ott et al. (2018) found that while copied training sentences represent less than 2.0% of their training data (WMT 14 English-German and English-French), copies are over-represented in the output of beam search. Using a subset of training data from WMT 17, they replace a subset of the true translations with a copy of the input. They analyze varying amounts of copied noise, and a variety of beam sizes. Larger beams are more affected by this kind of noise; however, for all beam sizes performance degrades completely with 50% copied sentences.
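A minimal sketch of the two operations involved, measuring the share of copied pairs in a corpus and replacing a fraction of true translations with a copy of the input, might look as follows. The function names, the exact-match definition of a copy, and the fixed seed are assumptions for illustration; the original experiments may define and sample copies differently.

```python
import random


def copy_rate(pairs):
    """Fraction of sentence pairs whose target is identical to the source
    (after stripping surrounding whitespace)."""
    if not pairs:
        return 0.0
    copies = sum(1 for s, t in pairs if s.strip() == t.strip())
    return copies / len(pairs)


def inject_copy_noise(pairs, fraction, seed=0):
    """Replace the target side of a random `fraction` of the pairs with a
    copy of the source side, leaving the corpus size unchanged."""
    rng = random.Random(seed)
    n_noisy = round(fraction * len(pairs))
    noisy_idx = set(rng.sample(range(len(pairs)), n_noisy))
    return [(s, s) if i in noisy_idx else (s, t)
            for i, (s, t) in enumerate(pairs)]


# Toy corpus (hypothetical) with no copies to begin with.
clean = [(f"source sentence {i}", f"Zielsatz {i}") for i in range(100)]
noisy = inject_copy_noise(clean, fraction=0.5)
```

Here `copy_rate(clean)` is 0.0 and `copy_rate(noisy)` is 0.5, the noise level at which the text above reports that performance degrades completely.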
- Poncelas et al. (2018)
- Pinnis (2018)
- Barbu (2017)
- Guo et al. (2018)
- Schwenk (2018)
- Xu and Koehn (2017)
- Enarvi and Kurimo (2013)
- Barbu (2015)
- Sabet et al. (2016)
- Axelrod et al. (2015)
- Cui et al. (2013)
- Shah and Specia (2014)
- Rousseau (2013)
- Arase and Zhou (2013)
- Aharoni et al. (2014)
- Simard (2014)
- Taghipour et al. (2011)
- Formiga and Fonollosa (2012)
- Lui and Baldwin (2012)
- Jehl et al. (2012)