Data

The key data resources for statistical machine translation are sentence-aligned parallel corpora. Low-level data preparation steps include splitting sentences into words (tokenization or segmentation), spelling correction, and truecasing (restoring the appropriate lowercase/uppercase form of each word).
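To make the tokenization and truecasing steps concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the method of any particular toolkit: the regular-expression rule suits only whitespace-delimited languages, the truecaser only recases the sentence-initial token using casing counts from non-initial positions, and the function names (`tokenize`, `train_truecaser`, `truecase`) and the toy corpus are hypothetical.

```python
import re
from collections import Counter


def tokenize(sentence):
    """Split a sentence into word and punctuation tokens (crude regex rule)."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def train_truecaser(tokenized_sentences):
    """Learn the most frequent casing of each word.

    Sentence-initial tokens are skipped because their capitalization
    is forced by position and says little about the word itself.
    """
    counts = {}
    for tokens in tokenized_sentences:
        for token in tokens[1:]:
            counts.setdefault(token.lower(), Counter())[token] += 1
    return {low: forms.most_common(1)[0][0] for low, forms in counts.items()}


def truecase(tokens, model):
    """Replace the sentence-initial token with its most frequent casing."""
    if not tokens:
        return tokens
    first = model.get(tokens[0].lower(), tokens[0])
    return [first] + tokens[1:]


# Toy corpus (hypothetical data, for illustration only).
corpus = [
    "The house is small.",
    "We saw the house in Paris.",
    "Paris is large.",
]
tokenized = [tokenize(s) for s in corpus]
model = train_truecaser(tokenized)
for tokens in tokenized:
    print(" ".join(truecase(tokens, model)))
```

In this sketch a word that was never seen in a non-initial position keeps its original casing, which is a conservative default; production pipelines typically use larger training corpora and language-specific tokenization rules.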

Data and its 11 sub-topics are the main subject of 423 publications.

Topics in Data

Publications

Benchmarks

Guzmán, Francisco and Chen, Peng-Jen and Ott, Myle and Pino, Juan and Lample, Guillaume and Koehn, Philipp and Chaudhary, Vishrav and Ranzato, Marc'Aurelio (2019): Two New Evaluation Datasets for Low-Resource Machine Translation: Nepali-English and Sinhala-English, arXiv preprint arXiv:1902.01382
@article{flores-2019,
author = {Guzm\'{a}n, Francisco and Chen, Peng-Jen and Ott, Myle and Pino, Juan and Lample, Guillaume and Koehn, Philipp and Chaudhary, Vishrav and Ranzato, Marc'Aurelio},
title = {Two New Evaluation Datasets for Low-Resource Machine Translation: Nepali-English and Sinhala-English},
journal = {arXiv preprint arXiv:1902.01382},
year = 2019
}
Guzmán et al. (2019)

Discussion

New Publications

Ellie Pavlick and Matt Post and Ann Irvine and Dmitry Kachaev and Chris Callison-Burch (2014): The Language Demographics of Amazon Mechanical Turk, Transactions of the Association for Computational Linguistics (TACL)
@article{tacl14-Pavlick,
author = {Ellie Pavlick and Matt Post and Ann Irvine and Dmitry Kachaev and Chris Callison-Burch},
title = {The Language Demographics of Amazon Mechanical Turk},
volume = {2},
pages = {79--92},
url = {http://www.aclweb.org/anthology/Q/Q14/Q14-1007.pdf},
journal = {Transactions of the Association for Computational Linguistics (TACL)},
year = 2014
}
Pavlick et al. (2014)
Burak Aydın and Arzucan Özgür (2014): Expanding machine translation training data with an out-of-domain corpus using language modeling based vocabulary saturation, Proceedings of the Eleventh Conference of the Association for Machine Translation in the Americas (AMTA)
@inproceedings{AMTA-2014-Aydin,
author = {Burak Ayd{\i}n and Arzucan {\"O}zg{\"u}r},
title = {Expanding machine translation training data with an out-of-domain corpus using language modeling based vocabulary saturation},
pages = {180--192},
url = {http://www.mt-archive.info/10/AMTA-2014-Aydin.pdf},
volume = {1},
booktitle = {Proceedings of the Eleventh Conference of the Association for Machine Translation in the Americas (AMTA)},
location = {Vancouver, BC, Canada},
year = 2014
}
Aydın and Özgür (2014)
Lewis, William and Eetemadi, Sauleh (2013): Dramatically Reducing Training Data Size Through Vocabulary Saturation, Proceedings of the Eighth Workshop on Statistical Machine Translation
@InProceedings{lewis-eetemadi:2013:WMT,
author = {Lewis, William and Eetemadi, Sauleh},
title = {Dramatically Reducing Training Data Size Through Vocabulary Saturation},
booktitle = {Proceedings of the Eighth Workshop on Statistical Machine Translation},
month = {August},
address = {Sofia, Bulgaria},
publisher = {Association for Computational Linguistics},
pages = {281--291},
url = {http://www.aclweb.org/anthology/W13-2235},
year = 2013
}
Lewis and Eetemadi (2013)
David Kurokawa and Cyril Goutte and Pierre Isabelle (2009): Automatic Detection of Translated Text and its Impact on Machine Translation, Proceedings of the Twelfth Machine Translation Summit (MT Summit XII)
@inproceedings{MTS09:Kurokawa,
author = {David Kurokawa and Cyril Goutte and Pierre Isabelle},
title = {Automatic Detection of Translated Text and its Impact on Machine Translation},
url = {http://www.mt-archive.info/MTS-2009-Kurokawa.pdf},
googlescholar = {6286367840082517865},
booktitle = {Proceedings of the Twelfth Machine Translation Summit (MT Summit XII)},
publisher = {International Association for Machine Translation},
location = {Ottawa, Ontario, Canada},
year = 2009
}
Kurokawa et al. (2009)
Qibo Zhu and Diana Inkpen and Ash Asudeh (2007): Automatic extraction of translations from web-based bilingual materials, Machine Translation
@article{MTJ:2007:Zhu,
author = {Qibo Zhu and Diana Inkpen and Ash Asudeh},
title = {Automatic extraction of translations from web-based bilingual materials},
url = {http://ccc.inaoep.mx/~villasen/bib/Automatic%20extraction%20of%20translations%20from%20web-based.pdf},
googlescholar = {17340055012285203908},
pages = {139--163},
journal = {Machine Translation},
volume = {21},
number = {3},
month = {September},
year = 2007
}
Zhu et al. (2007)

MT Research Survey Wiki

A Comprehensive Survey of Neural and Statistical Machine Translation Research Publications
