This file contains the data used in "On the Impact of Various Types of Noise on Neural Machine Translation" by Huda Khayrallah and Philipp Koehn.
All data has been tokenized using the following scripts from Moses (https://github.com/moses-smt/mosesdecoder). You only need to download Moses; you do not need to install it to use the scripts:
moses-smt/mosesdecoder/tokenizer/normalize-punctuation.perl $lang | moses-smt/mosesdecoder/tokenizer/tokenizer.perl -a -l $lang
with $lang in {de,en}.
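To tokenize additional text the same way (for example, your own test set), the pipeline above could be wrapped as in the following sketch. The function name and the TOKENIZER_DIR variable are assumptions for illustration and are not part of this release; TOKENIZER_DIR should point at the directory in your Moses download that holds the two scripts, which must be executable.

```shell
# Hypothetical helper, not part of the release.
# TOKENIZER_DIR is an assumed variable: the directory in your Moses download
# that contains normalize-punctuation.perl and tokenizer.perl.
# $1 is the language code (de or en). Reads stdin, writes stdout.
tokenize_like_corpus() {
    "$TOKENIZER_DIR/normalize-punctuation.perl" "$1" \
        | "$TOKENIZER_DIR/tokenizer.perl" -a -l "$1"
}

# Usage (file names are placeholders):
#   TOKENIZER_DIR=/path/to/tokenizer
#   tokenize_like_corpus de < mytest.de > mytest.tok.de
```

The function only chains the two commands given above, so the options passed to each script are exactly those stated in this file.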
For our baseline we use Europarl, News Commentary and the Rapid EU Press Release parallel corpus, all from the WMT 2017 shared task.
To create the data sets used in Table 9, concatenate baseline.tok.$lang with the noisy file of the desired type and amount, $noise_type.$amount.tok.$lang, where
$noise_type is in {misaligned_sent, misordered_words_src, misordered_words_trg, wrong_lang_fr_src, wrong_lang_fr_trg, untranslated_en_src, untranslated_de_trg, short_max2, short_max5, raw_paracrawl}
and $amount is in {05, 10, 20, 50, 100}.
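As a sketch of the concatenation step, the helper below builds one training set for a chosen noise type and amount. The function name and the output file name train.$noise_type.$amount.tok.$lang are assumptions for illustration; this release does not prescribe either. It assumes the released files are in the current directory.

```shell
# Hypothetical helper: concatenate the baseline with one noisy file,
# once per language. Run it in the directory that holds the released files.
# $1 = noise type, $2 = amount (for example: misaligned_sent 10)
make_noisy_set() {
    for lang in de en; do
        # Baseline first, then the noisy data. Using the same order for both
        # languages keeps the two output files line-aligned.
        # The output name train.* is an assumption.
        cat "baseline.tok.$lang" "$1.$2.tok.$lang" \
            > "train.$1.$2.tok.$lang" || return 1
    done
}

# Usage:
#   make_noisy_set misaligned_sent 10
```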
If you use this data, please cite our paper and the original data sources as follows:
@inproceedings{khayrallah-koehn-2018-impact,
    title = "On the Impact of Various Types of Noise on Neural Machine Translation",
    author = "Khayrallah, Huda and Koehn, Philipp",
    booktitle = "Proceedings of the 2nd Workshop on Neural Machine Translation and Generation",
    month = jul,
    year = "2018",
    address = "Melbourne, Australia",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/W18-2709",
    doi = "10.18653/v1/W18-2709",
    pages = "74--83"
}
Europarl (http://www.statmt.org/europarl/):
@InProceedings{Koehn:2005:MTS,
    url = {http://mt-archive.info/MTS-2005-Koehn.pdf},
    googlescholar = {6985235632472432229},
    author = {Philipp Koehn},
    title = {Europarl: A Parallel Corpus for Statistical Machine Translation},
    booktitle = {Proceedings of the Tenth Machine Translation Summit (MT Summit X)},
    month = {September},
    year = {2005},
    address = {Phuket, Thailand},
}
News Commentary (http://www.casmacat.eu/corpus/news-commentary.html)
WMT shared task (http://statmt.org/wmt17/):
@inproceedings{bojar-etal-2017-findings,
    title = "Findings of the 2017 Conference on Machine Translation ({WMT}17)",
    author = "Bojar, Ond{\v{r}}ej and Chatterjee, Rajen and Federmann, Christian and Graham, Yvette and Haddow, Barry and Huang, Shujian and Huck, Matthias and Koehn, Philipp and Liu, Qun and Logacheva, Varvara and Monz, Christof and Negri, Matteo and Post, Matt and Rubino, Raphael and Specia, Lucia and Turchi, Marco",
    booktitle = "Proceedings of the Second Conference on Machine Translation",
    month = sep,
    year = "2017",
    address = "Copenhagen, Denmark",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/W17-4717",
    doi = "10.18653/v1/W17-4717",
    pages = "169--214",
}