Corpus Cleaning

Parallel corpora may contain misaligned or otherwise noisy sentence pairs whose removal may help.

Corpus Cleaning is the main subject of 37 publications. 21 are discussed here.

Publications

Statistical machine translation models are generally assumed to be fairly robust to noisy data, such as data that includes misalignments. This is less true for neural machine translation models (Khayrallah and Koehn, 2018). Data cleaning has been shown to help (Vogel, 2003). Often, for instance when news reports are rewritten for a different audience during translation, documents are not very parallel, so the task of sentence alignment becomes more one of sentence extraction (Fung and Cheung, 2004; Fung and Cheung, 2004b). For good performance it has proven crucial, especially when only small amounts of training data are available, to exploit all of the data, be it by augmenting phrase translation tables to include all words or by breaking up sentences that are too long (Mermer et al., 2007).

There is a robust body of work on filtering out noise in parallel data. For example: Taghipour et al. (2011) use an outlier detection algorithm to filter a parallel corpus; Xu and Koehn (2017) generate synthetic noisy data (inadequate and non-fluent translations) and use this data to train a classifier to identify good sentence pairs from a noisy corpus; and Cui et al. (2013) use a graph-based random walk algorithm and extract phrase pair scores to weight the phrase translation probabilities to bias towards more trustworthy ones.
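Classifier-based filtering of the kind described above needs negative examples. The following is a minimal sketch of one simple way to synthesize them from clean pairs; it is an illustration under stated assumptions and not necessarily the procedure used by Xu and Koehn (2017). The function name `make_synthetic_negatives` and the two noise recipes (mismatched targets for inadequate pairs, shuffled target words for non-fluent pairs) are hypothetical choices made here.

```python
import random


def make_synthetic_negatives(pairs, seed=0):
    """Build two lists of noisy (source, target) pairs from clean pairs.

    inadequate: each source is paired with the target of a different pair.
    non_fluent: the target words of each pair are randomly reordered
                (a very short target may by chance keep its order).
    Both recipes are illustrative assumptions, not a published method.
    """
    rng = random.Random(seed)
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two pairs to create mismatches")

    # Rotate targets by a random non-zero offset so that no source
    # keeps the target it was originally aligned with.
    offset = rng.randrange(1, n)
    inadequate = [(pairs[i][0], pairs[(i + offset) % n][1]) for i in range(n)]

    non_fluent = []
    for src, tgt in pairs:
        words = tgt.split()
        rng.shuffle(words)
        non_fluent.append((src, " ".join(words)))

    return inadequate, non_fluent
```

The clean pairs would serve as positive examples and the two returned lists as negatives for whatever classifier is trained on top; the choice of features and classifier is left open here.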

Most of this work was done in the context of statistical machine translation, but more recent work (Carpuat et al., 2017) targets neural models. That work focuses on identifying semantic differences in translation pairs using cross-lingual textual entailment and additional length-based features, and demonstrates that removing such sentences improves neural machine translation performance.
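A length-based feature can be as simple as the ratio between source and target token counts. The sketch below is not the feature set of Carpuat et al. (2017); it only illustrates the general idea, assuming whitespace tokenization and a hypothetical threshold `max_ratio=2.0` that would need tuning per language pair.

```python
def length_ratio(src, tgt):
    """Ratio of the longer to the shorter token count (always >= 1.0).

    Returns infinity if either side is empty, so empty pairs are rejected.
    """
    s, t = len(src.split()), len(tgt.split())
    if s == 0 or t == 0:
        return float("inf")
    return max(s, t) / min(s, t)


def filter_by_length_ratio(pairs, max_ratio=2.0):
    """Keep only pairs whose length ratio does not exceed max_ratio.

    The default of 2.0 is an assumption for illustration only.
    """
    return [(s, t) for s, t in pairs if length_ratio(s, t) <= max_ratio]
```

In practice such a ratio would likely be one feature among several rather than a hard filter on its own.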

As Rarrick et al. (2011) point out, one problem of parallel corpora extracted from the web is translations that have been created by machine translation. Venugopal et al. (2011) propose a method to watermark the output of machine translation systems to aid this distinction. Antonova and Misyurev (2011) report that rule-based machine translation output can be detected due to certain word choices, and statistical machine translation output due to lack of reordering.

In 2016, a shared task on sentence pair filtering was organized (Barbu et al., 2016), albeit in the context of cleaning translation memories, which tend to be cleaner than web-crawled data. In 2018, a shared task explored filtering techniques for neural machine translation (Koehn et al., 2018).

Belinkov and Bisk (2018) investigate noise in neural machine translation, but they focus on creating systems that can translate the kinds of orthographic errors (typos, misspellings, etc.) that humans can comprehend. In contrast, we address noisy training data and focus on types of noise occurring in web-crawled corpora.

There is a rich literature on data selection which aims at sub-sampling parallel data relevant for a task-specific machine translation system (Axelrod et al., 2011). Wees et al. (2017) find that the existing data selection methods developed for statistical machine translation are less effective for neural machine translation. This is different from our goals of handling noise since those methods tend to discard perfectly fine sentence pairs (say, about cooking recipes) that are just not relevant for the targeted domain (say, software manuals). Our work is focused on noise that is harmful for all domains.
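A common scoring idea in this line of work is the cross-entropy difference between an in-domain and a general-domain language model: sentences that the in-domain model finds much more predictable than the general model are ranked first. The sketch below is a deliberately small, monolingual illustration using add-one smoothed unigram models; the class `UnigramLM` and the function `select_in_domain` are hypothetical names, and real systems would use stronger language models and typically score both sides of each sentence pair.

```python
import math
from collections import Counter


class UnigramLM:
    """Add-one smoothed unigram language model (a tiny stand-in for a real LM)."""

    def __init__(self, sentences, vocab):
        self.counts = Counter(w for s in sentences for w in s.split())
        self.total = sum(self.counts.values())
        self.vocab_size = len(vocab) + 1  # +1 reserves mass for unseen words

    def cross_entropy(self, sentence):
        """Per-word cross-entropy of the sentence in bits."""
        words = sentence.split()
        if not words:
            return 0.0
        log_prob = sum(
            math.log2((self.counts[w] + 1) / (self.total + self.vocab_size))
            for w in words
        )
        return -log_prob / len(words)


def select_in_domain(pool, in_domain, general, keep):
    """Return the `keep` pool sentences with the lowest cross-entropy difference."""
    vocab = {w for s in in_domain + general + pool for w in s.split()}
    lm_in = UnigramLM(in_domain, vocab)
    lm_gen = UnigramLM(general, vocab)
    # Lower score means closer to the in-domain data than to the general data.
    ranked = sorted(
        pool, key=lambda s: lm_in.cross_entropy(s) - lm_gen.cross_entropy(s)
    )
    return ranked[:keep]
```

Note that this ranks by relevance, not by cleanliness: a perfectly well-formed sentence pair from another domain scores poorly, which is exactly the distinction from noise filtering drawn above.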

Since we begin with a clean parallel corpus and add potentially noisy data to it, this work can be seen as a type of data augmentation. Sennrich et al. (2016) incorporate monolingual corpora into NMT by first translating them using an NMT system trained in the opposite direction. While such a corpus has the potential to be noisy, the method is very effective. Currey et al. (2017) create additional parallel corpora by copying monolingual corpora in the target language into the source, and find it improves over back-translation for some language pairs. Fadaee et al. (2017) improve NMT performance in low-resource settings by altering existing sentences to create training data that includes rare words in different contexts.

Copy Noise:

Other work has also considered copying in NMT. Currey et al. (2017) add copied data and back-translated data to a clean parallel corpus. They report improvements on English-Romanian when adding as much back-translated and copied data as they have parallel data (1:1:1 ratio). For English-Turkish and English-German, they add twice as much back-translated and copied data as parallel data (1:2:2 ratio), and report improvements on English-Turkish but not on English-German. However, their English-German systems trained with the copied corpus did not perform worse than baseline systems.

Ott et al. (2018) found that while copied training sentences represent less than 2.0% of their training data (WMT 14 English-German and English-French), copies are over-represented in the output of beam search. Using a subset of training data from WMT 17, they replace a subset of the true translations with a copy of the input. They analyze varying amounts of copied noise and a variety of beam sizes. Larger beams are more affected by this kind of noise; however, for all beam sizes performance degrades completely with 50% copied sentences.
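The kind of controlled experiment described here, where a fraction of true translations is replaced by a copy of the input, can be sketched in a few lines. This is an illustration only and not the authors' code; the function names `inject_copy_noise` and `copy_rate`, the random sampling scheme, and the exact-match definition of a copy are assumptions made for this example.

```python
import random


def inject_copy_noise(pairs, fraction, seed=0):
    """Replace the target of a random `fraction` of pairs with a copy of the source."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    rng = random.Random(seed)
    k = round(fraction * len(pairs))
    # Choose k distinct pair indices whose targets will be overwritten.
    noisy_idx = set(rng.sample(range(len(pairs)), k))
    return [(s, s) if i in noisy_idx else (s, t) for i, (s, t) in enumerate(pairs)]


def copy_rate(pairs):
    """Fraction of pairs whose target is identical to the source (ignoring outer whitespace)."""
    if not pairs:
        return 0.0
    return sum(1 for s, t in pairs if s.strip() == t.strip()) / len(pairs)
```

The same `copy_rate` measure could be applied to a training corpus or to system output paired with its input, which is one way to check whether copies are over-represented after decoding.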

Benchmarks

Discussion

New Publications

Alberto Poncelas and Gideon Maillette de Buy Wenniger and Andy Way (2018): Data Selection with Feature Decay Algorithms Using an Approximated Target Side, Proceedings of the International Workshop on Spoken Language Translation (IWSLT)
@inproceedings{iwslt18-Selection-Poncelas,
author = {Alberto Poncelas and Gideon Maillette de Buy Wenniger and Andy Way},
title = {Data Selection with Feature Decay Algorithms Using an Approximated Target Side},
booktitle = {Proceedings of the International Workshop on Spoken Language Translation (IWSLT)},
year = 2018
}
Poncelas et al. (2018)
Pinnis, Marcis (2018): Tilde's Parallel Corpus Filtering Methods for WMT 2018, Proceedings of the Third Conference on Machine Translation: Shared Task Papers
@inproceedings{W18-6486,
author = {Pinnis, Marcis},
title = {Tilde{'}s Parallel Corpus Filtering Methods for WMT 2018},
booktitle = {Proceedings of the Third Conference on Machine Translation: Shared Task Papers},
month = {oct},
address = {Belgium, Brussels},
publisher = {Association for Computational Linguistics},
url = {https://www.aclweb.org/anthology/W18-6486},
pages = {939--945},
year = 2018
}
Pinnis (2018)
Barbu, Eduard (2017): Ensembles of Classifiers for Cleaning Web Parallel Corpora and Translation Memories, Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017
@inproceedings{barbu-2017-ensembles,
author = {Barbu, Eduard},
title = {Ensembles of Classifiers for Cleaning Web Parallel Corpora and Translation Memories},
booktitle = {Proceedings of the International Conference Recent Advances in Natural Language Processing, {RANLP} 2017},
month = {sep},
address = {Varna, Bulgaria},
publisher = {INCOMA Ltd.},
url = {https://doi.org/10.26615/978-954-452-049-6_011},
doi = {10.26615/978-954-452-049-6_011},
pages = {71--77},
year = 2017
}
Barbu (2017)
Guo, Mandy and Shen, Qinlan and Yang, Yinfei and Ge, Heming and Cer, Daniel and Hernandez Abrego, Gustavo and Stevens, Keith and Constant, Noah and Sung, Yun-hsuan and Strope, Brian and Kurzweil, Ray (2018): Effective Parallel Corpus Mining using Bilingual Sentence Embeddings, Proceedings of the Third Conference on Machine Translation: Research Papers
@inproceedings{W18-6317,
author = {Guo, Mandy and Shen, Qinlan and Yang, Yinfei and Ge, Heming and Cer, Daniel and Hernandez Abrego, Gustavo and Stevens, Keith and Constant, Noah and Sung, Yun-hsuan and Strope, Brian and Kurzweil, Ray},
title = {Effective Parallel Corpus Mining using Bilingual Sentence Embeddings},
booktitle = {Proceedings of the Third Conference on Machine Translation: Research Papers},
month = {oct},
address = {Belgium, Brussels},
publisher = {Association for Computational Linguistics},
url = {https://www.aclweb.org/anthology/W18-6317},
pages = {165--176},
year = 2018
}
Guo et al. (2018)
Schwenk, Holger (2018): Filtering and Mining Parallel Data in a Joint Multilingual Space, Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
@InProceedings{P18-2037,
author = {Schwenk, Holger},
title = {Filtering and Mining Parallel Data in a Joint Multilingual Space},
booktitle = {Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)},
publisher = {Association for Computational Linguistics},
pages = {228--234},
location = {Melbourne, Australia},
url = {http://aclweb.org/anthology/P18-2037},
year = 2018
}
Schwenk (2018)
Xu, Hainan and Koehn, Philipp (2017): Zipporah: a Fast and Scalable Data Cleaning System for Noisy Web-Crawled Parallel Corpora, Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing
@InProceedings{D17-1318,
author = {Xu, Hainan and Koehn, Philipp},
title = {Zipporah: a Fast and Scalable Data Cleaning System for Noisy Web-Crawled Parallel Corpora},
booktitle = {Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing},
publisher = {Association for Computational Linguistics},
pages = {2935--2940},
location = {Copenhagen, Denmark},
url = {http://aclweb.org/anthology/D17-1318},
year = 2017
}
Xu and Koehn (2017)
Seppo Enarvi and Mikko Kurimo (2013): Studies on training text selection for conversational Finnish language modeling, Proceedings of the International Workshop on Spoken Language Translation (IWSLT)
@inproceedings{Enarvi:iwslt:2013,
author = {Seppo Enarvi and Mikko Kurimo},
title = {Studies on training text selection for conversational {Finnish} language modeling},
url = {http://www.mt-archive.info/10/IWSLT-2013-Enarvi.pdf},
booktitle = {Proceedings of the International Workshop on Spoken Language Translation (IWSLT)},
year = 2013
}
Enarvi and Kurimo (2013)
Barbu, Eduard (2015): Spotting false translation segments in translation memories, Proceedings of the Workshop Natural Language Processing for Translation Memories
@InProceedings{barbu:2015:NLP4TM,
author = {Barbu, Eduard},
title = {Spotting false translation segments in translation memories},
booktitle = {Proceedings of the Workshop Natural Language Processing for Translation Memories},
month = {September},
address = {Hissar, Bulgaria},
publisher = {Association for Computational Linguistics},
pages = {9--16},
url = {http://www.aclweb.org/anthology/W15-5202},
year = 2015
}
Barbu (2015)
Jalili Sabet, Masoud and Negri, Matteo and Turchi, Marco and C. de Souza, José G. and Federico, Marcello (2016): TMop: a Tool for Unsupervised Translation Memory Cleaning, Proceedings of ACL-2016 System Demonstrations
@InProceedings{jalilisabet-EtAl:2016:P16-4,
author = {Jalili Sabet, Masoud and Negri, Matteo and Turchi, Marco and C. de Souza, Jos\'{e} G. and Federico, Marcello},
title = {TMop: a Tool for Unsupervised Translation Memory Cleaning},
booktitle = {Proceedings of ACL-2016 System Demonstrations},
month = {August},
address = {Berlin, Germany},
publisher = {Association for Computational Linguistics},
pages = {49--54},
url = {http://anthology.aclweb.org/P/P16/P16-4009},
year = 2016
}
Sabet et al. (2016)
Axelrod, Amittai and Resnik, Philip and He, Xiaodong and Ostendorf, Mari (2015): Data Selection With Fewer Words, Proceedings of the Tenth Workshop on Statistical Machine Translation
@InProceedings{axelrod-EtAl:2015:WMT,
author = {Axelrod, Amittai and Resnik, Philip and He, Xiaodong and Ostendorf, Mari},
title = {Data Selection With Fewer Words},
booktitle = {Proceedings of the Tenth Workshop on Statistical Machine Translation},
month = {September},
address = {Lisbon, Portugal},
publisher = {Association for Computational Linguistics},
pages = {58--65},
url = {http://aclweb.org/anthology/W15-3003},
year = 2015
}
Axelrod et al. (2015)
Cui, Lei and Zhang, Dongdong and Liu, Shujie and Li, Mu and Zhou, Ming (2013): Bilingual Data Cleaning for SMT using Graph-based Random Walk, Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
@InProceedings{cui-EtAl:2013:Short,
author = {Cui, Lei and Zhang, Dongdong and Liu, Shujie and Li, Mu and Zhou, Ming},
title = {Bilingual Data Cleaning for {SMT} using Graph-based Random Walk},
booktitle = {Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)},
month = {August},
address = {Sofia, Bulgaria},
publisher = {Association for Computational Linguistics},
pages = {340--345},
url = {http://www.aclweb.org/anthology/P13-2061},
year = 2013
}
Cui et al. (2013)
Kashif Shah and Lucia Specia (2014): Quality estimation for translation selection, Proceedings of 17th Annual conference of the European Association for Machine Translation
@inproceedings{eamt-2014-Shah,
author = {Kashif Shah and Lucia Specia},
title = {Quality estimation for translation selection},
booktitle = {Proceedings of 17th Annual conference of the European Association for Machine Translation},
pages = {109--116},
url = {http://www.mt-archive.info/10/EAMT-2014-Shah.pdf},
location = {Dubrovnik, Croatia},
year = 2014
}
Shah and Specia (2014)
Anthony Rousseau (2013): XenC: An Open-Source Tool for Data Selection in Natural Language Processing, The Prague Bulletin of Mathematical Linguistics
@article{pbml-100-rousseau,
author = {Anthony Rousseau},
title = {XenC: An Open-Source Tool for Data Selection in Natural Language Processing},
url = {http://ufal.mff.cuni.cz/pbml/100/art-rousseau.pdf},
pages = {73--82},
journal = {The Prague Bulletin of Mathematical Linguistics},
volume = {100},
year = 2013
}
Rousseau (2013)
Arase, Yuki and Zhou, Ming (2013): Machine Translation Detection from Monolingual Web-Text, Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
@InProceedings{arase-zhou:2013:ACL2013,
author = {Arase, Yuki and Zhou, Ming},
title = {Machine Translation Detection from Monolingual Web-Text},
booktitle = {Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
month = {August},
address = {Sofia, Bulgaria},
publisher = {Association for Computational Linguistics},
pages = {1597--1607},
url = {http://www.aclweb.org/anthology/P13-1157},
year = 2013
}
Arase and Zhou (2013)
Aharoni, Roee and Koppel, Moshe and Goldberg, Yoav (2014): Automatic Detection of Machine Translated Text and Translation Quality Estimation, Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
@InProceedings{aharoni-koppel-goldberg:2014:P14-2,
author = {Aharoni, Roee and Koppel, Moshe and Goldberg, Yoav},
title = {Automatic Detection of Machine Translated Text and Translation Quality Estimation},
booktitle = {Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)},
month = {June},
address = {Baltimore, Maryland},
publisher = {Association for Computational Linguistics},
pages = {289--295},
url = {http://www.aclweb.org/anthology/P14-2048},
year = 2014
}
Aharoni et al. (2014)
Michel Simard (2014): Clean data for training statistical MT: the case of MT contamination, Proceedings of the Eleventh Conference of the Association for Machine Translation in the Americas (AMTA)
@inproceedings{AMTA-2014-Simard,
author = {Michel Simard},
title = {Clean data for training statistical MT: the case of {MT} contamination},
pages = {69--82},
url = {http://www.mt-archive.info/10/AMTA-2014-Simard.pdf},
volume = {1},
booktitle = {Proceedings of the Eleventh Conference of the Association for Machine Translation in the Americas (AMTA)},
location = {Vancouver, BC, Canada},
year = 2014
}
Simard (2014)
Kaveh Taghipour and Shahram Khadivi and Jia Xu (2011): Parallel Corpus Refinement as an Outlier Detection Algorithm, Proceedings of the 13th Machine Translation Summit (MT Summit XIII)
@inproceedings{MTS-2011-Taghipour,
author = {Kaveh Taghipour and Shahram Khadivi and Jia Xu},
title = {Parallel Corpus Refinement as an Outlier Detection Algorithm},
url = {http://www.mt-archive.info/MTS-2011-Taghipour.pdf},
pages = {414--421},
booktitle = {Proceedings of the 13th Machine Translation Summit (MT Summit XIII)},
publisher = {International Association for Machine Translation},
location = {Xiamen, China},
year = 2011
}
Taghipour et al. (2011)
Formiga, Lluís and Fonollosa, José A. R. (2012): Dealing with Input Noise in Statistical Machine Translation, Proceedings of COLING 2012: Posters
@InProceedings{formiga-fonollosa:2012:POSTERS,
author = {Formiga, Llu{\'i}s and Fonollosa, Jos{\'e} A. R.},
title = {Dealing with Input Noise in Statistical Machine Translation},
booktitle = {Proceedings of COLING 2012: Posters},
month = {December},
address = {Mumbai, India},
publisher = {The COLING 2012 Organizing Committee},
pages = {319--328},
url = {http://www.aclweb.org/anthology/C12-2032},
year = 2012
}
Formiga and Fonollosa (2012)
Lui, Marco and Baldwin, Timothy (2012): langid.py: An Off-the-shelf Language Identification Tool, Proceedings of the ACL 2012 System Demonstrations
@InProceedings{lui-baldwin:2012:Demo,
author = {Lui, Marco and Baldwin, Timothy},
title = {langid.py: An Off-the-shelf Language Identification Tool},
booktitle = {Proceedings of the ACL 2012 System Demonstrations},
month = {July},
address = {Jeju Island, Korea},
publisher = {Association for Computational Linguistics},
pages = {25--30},
url = {http://www.aclweb.org/anthology/P12-3005},
year = 2012
}
Lui and Baldwin (2012)
Jehl, Laura and Hieber, Felix and Riezler, Stefan (2012): Twitter Translation using Translation-Based Cross-Lingual Retrieval, Proceedings of the Seventh Workshop on Statistical Machine Translation
@InProceedings{jehl-hieber-riezler:2012:WMT,
author = {Jehl, Laura and Hieber, Felix and Riezler, Stefan},
title = {Twitter Translation using Translation-Based Cross-Lingual Retrieval},
booktitle = {Proceedings of the Seventh Workshop on Statistical Machine Translation},
month = {June},
address = {Montreal, Canada},
publisher = {Association for Computational Linguistics},
pages = {163--174},
url = {http://www.aclweb.org/anthology/W12-3121},
year = 2012
}
Jehl et al. (2012)

MT Research Survey Wiki

A Comprehensive Survey of Neural and Statistical Machine Translation Research Publications
