Comparable Corpora

A comparable corpus is a pair of corpora in two different languages that come from the same domain.

Comparable Corpora is the main subject of 29 publications.


Parallel sentences may also be mined from comparable corpora, such as news stories written on the same topic in different languages. Munteanu and Marcu (2002) use suffix trees, and in later work log-likelihood ratios (Munteanu et al., 2004; Munteanu and Marcu, 2005), to detect parallel sentences.
Abdul-Rauf and Schwenk (2009); Rauf and Schwenk (2009); Rauf and Schwenk (2011) translate one side of the comparable corpus into the other language, use information retrieval methods to find matching sentences and use the TER metric to measure their similarity. Ștefănescu et al. (2012) report improvements with a more complex sentence similarity measure.
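The translate, retrieve and score pipeline can be illustrated with a minimal sketch. It is not the implementation from the cited papers: the function names are hypothetical, the machine translation step is assumed to have been done already, simple word overlap stands in for a real information retrieval engine, and the similarity score is a word-level edit distance divided by the candidate length, which approximates TER but omits its block-shift operation. The 0.5 threshold is an arbitrary illustration value.

```python
def edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def approx_ter(hyp, ref):
    """Edits per reference word; lower means more similar.

    Simplification: real TER also allows block shifts, which are omitted here.
    """
    hyp_tokens, ref_tokens = hyp.lower().split(), ref.lower().split()
    if not ref_tokens:
        return 0.0 if not hyp_tokens else float("inf")
    return edit_distance(hyp_tokens, ref_tokens) / len(ref_tokens)


def retrieve(query, candidates, top_k=3):
    """Rank candidates by word overlap with the query (stand-in for an IR engine)."""
    query_words = set(query.lower().split())
    ranked = sorted(candidates,
                    key=lambda c: len(query_words & set(c.lower().split())),
                    reverse=True)
    return ranked[:top_k]


def mine_pairs(translated_source, target_sentences, max_score=0.5):
    """Return (source, target, score) triples whose score is at most max_score.

    translated_source: list of (source_sentence, machine_translation) tuples.
    """
    pairs = []
    for source, translation in translated_source:
        best = None
        for candidate in retrieve(translation, target_sentences):
            score = approx_ter(translation, candidate)
            if best is None or score < best[1]:
                best = (candidate, score)
        if best is not None and best[1] <= max_score:
            pairs.append((source, best[0], best[1]))
    return pairs


if __name__ == "__main__":
    translated = [("das haus ist klein", "the house is small"),
                  ("ich mag katzen", "i like cats")]
    targets = ["the house is very small",
               "stock markets fell sharply",
               "the weather is nice"]
    for source, target, score in mine_pairs(translated, targets):
        print(source, "|", target, "|", round(score, 2))
```

In this toy input, only the first source sentence finds a target sentence below the threshold; the second has no close counterpart and is discarded, which is the typical case in a comparable corpus.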
Instead of full sentences, parallel sentence fragments may be extracted from comparable corpora (Munteanu and Marcu, 2006). Methods have been proposed to extract matching phrases (Tanaka, 2002) or web pages (Smith, 2002) from such large collections. Quirk et al. (2007) propose a generative model for the same task.
Hewavitharana and Vogel (2011) extract phrase pairs from comparable corpora, using a classifier approach.



Related Topics

The transition from parallel corpora, through noisy corpora that require cleaning, all the way to comparable corpora is fluid. A special topic is the extraction of bilingual dictionaries from comparable corpora. A comparable corpus is always a pair of monolingual corpora. The target-side monolingual corpus may be used for training language models, and the source-side monolingual corpus may be used for some domain adaptation methods.

New Publications

  • Barrón-Cedeño et al. (2015)
  • Hazem and Morin (2016)
  • Zhang et al. (2016)
  • Liu et al. (2016)
  • Wołk and Marasek (2015)
  • Wołk and Wołk (2015)
  • Wołk and Marasek (2014)
  • Wołk and Marasek (2014)
  • Dou et al. (2015)
  • Nuhn et al. (2015)
  • Dong et al. (2015)
  • Chu et al. (2013)
  • Fu et al. (2013)
  • McCrae and Cimiano (2013)
  • Lapshinova-Koltunski (2013)
  • Preiss (2012)
  • Badia et al. (2005)