Corpus Cleaning

Parallel corpora may contain misaligned or otherwise noisy sentence pairs; removing them may improve translation quality.

Corpus Cleaning is the main subject of 19 publications.


Statistical machine translation models are generally assumed to be fairly robust to noisy data, such as data that includes misalignments. However, data cleaning has been shown to help (Vogel, 2003). Documents are often not very parallel, for instance when news reports are rewritten for a different audience during translation, so sentence alignment becomes more a task of sentence extraction (Fung and Cheung, 2004; Fung and Cheung, 2004b). For good performance, especially when only small amounts of training data are available, it has proven crucial to exploit all of the data, whether by augmenting phrase translation tables to include all words or by breaking up sentences that are too long (Mermer et al., 2007).
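
To make the idea of removing noisy sentence pairs concrete, the following is a minimal sketch of a length-based filter. It is not the method of any of the publications cited above; the function name `clean_corpus` and the thresholds `min_len`, `max_len`, and `max_ratio` are illustrative assumptions. It drops pairs with an empty side, pairs that are too long, and pairs whose token counts differ so much that a misalignment is likely.

```python
from typing import Iterable, Iterator, Tuple


def clean_corpus(
    pairs: Iterable[Tuple[str, str]],
    min_len: int = 1,        # assumed threshold: minimum tokens per side
    max_len: int = 80,       # assumed threshold: maximum tokens per side
    max_ratio: float = 3.0,  # assumed threshold: maximum length ratio
) -> Iterator[Tuple[str, str]]:
    """Yield only the sentence pairs that pass simple length checks."""
    for src, tgt in pairs:
        n_src = len(src.split())
        n_tgt = len(tgt.split())
        # Drop pairs where either side is empty or too short.
        if n_src < min_len or n_tgt < min_len:
            continue
        # Drop pairs where either side is too long.
        if n_src > max_len or n_tgt > max_len:
            continue
        # Drop pairs whose lengths differ too much (likely misaligned).
        # max(1, ...) guards against division by zero if min_len is 0.
        if max(n_src, n_tgt) / max(1, min(n_src, n_tgt)) > max_ratio:
            continue
        yield src.strip(), tgt.strip()


pairs = [
    ("das ist gut", "this is good"),
    ("", "empty source"),
    ("ja", "yes indeed this is a very long translation of one word"),
]
kept = list(clean_corpus(pairs))
```

Note that a hard `max_len` cutoff discards data outright. When training data is scarce, the text above suggests that breaking up overly long sentences may be preferable to dropping them.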



New Publications

  • Xu and Koehn (2017)
  • Enarvi and Kurimo (2013)
  • Barbu (2015)
  • Sabet et al. (2016)
  • Axelrod et al. (2015)
  • Cui et al. (2013)
  • Shah and Specia (2014)
  • Rousseau (2013)
  • Arase and Zhou (2013)
  • Aharoni et al. (2014)
  • Simard (2014)
  • Taghipour et al. (2011)
  • Formiga and Fonollosa (2012)
  • Lui and Baldwin (2012)
  • Jehl et al. (2012)