Collecting Parallel Corpora

The web is the main source of parallel corpora today, and mining it requires a number of processing steps. Other data resources have also been explored.

Parallel Corpora is the main subject of 81 publications.


Resnik (1999) describes a method to automatically find parallel documents on the web. Fukushima et al. (2006) use a dictionary to detect parallel documents, while Li and Liu (2008) use a number of criteria such as similarity of the URL and page content. Acquiring parallel corpora, however, typically requires some manual involvement (Koehn, 2002; Martin et al., 2003; Koehn, 2005), including the matching of documents (Utiyama and Isahara, 2003). A large collection of corpora is maintained at the OPUS web site (Tiedemann, 2012).
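As an illustration of the URL-similarity criterion mentioned above, the following is a minimal sketch that pairs pages whose URLs differ only in a language marker. It is not the method of any of the cited papers: the function names `strip_language` and `candidate_pairs`, the marker list, and the example URLs are all assumptions made for this example.

```python
import re
from collections import defaultdict

# Hypothetical language markers; a real crawler would need a much longer list.
LANG_MARKERS = ("en", "de", "fr")


def strip_language(url):
    """Return (language, url with the marker replaced by '*'), or (None, url)."""
    for lang in LANG_MARKERS:
        # Match the marker only as a whole component, e.g. /en/ or lang=en.
        pattern = re.compile(r"(?<=[/=._-])" + lang + r"(?=[/&._-]|$)")
        if pattern.search(url):
            return lang, pattern.sub("*", url, count=1)
    return None, url


def candidate_pairs(urls):
    """Group URLs by their language-neutral form and pair up the languages."""
    groups = defaultdict(dict)
    for url in urls:
        lang, key = strip_language(url)
        if lang is not None:
            groups[key][lang] = url
    pairs = []
    for by_lang in groups.values():
        langs = sorted(by_lang)
        for i, first in enumerate(langs):
            for second in langs[i + 1:]:
                pairs.append((by_lang[first], by_lang[second]))
    return pairs


urls = [
    "http://example.org/en/about.html",
    "http://example.org/de/about.html",
    "http://example.org/en/contact.html",
]
pairs = candidate_pairs(urls)
```

URL matching of this kind only proposes candidates. The page content still has to be compared before a pair is accepted as parallel.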
Uchiyama and Isahara (2007) report on efforts to build a Japanese-English patent corpus, and Macken et al. (2007) on a broad-based Dutch-English corpus. Täger (2011) describes the creation of the European patent corpus. Cettolo et al. (2012) explain the creation of a multilingual parallel corpus of subtitles from the TED Talks website. Kaalep and Veskis (2007) discuss the pitfalls encountered during the construction of parallel corpora. Bojar et al. (2010) collected a 200-million-word Czech-English corpus from various sources, which was later linguistically annotated (Bojar et al., 2012).
Uszkoreit et al. (2010) address the problem of document alignment by translating all documents into English and then applying information retrieval methods to find matching documents.
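The general idea of matching documents with information retrieval methods, once everything is in one language, can be sketched with TF-IDF vectors and cosine similarity. This is a simplified illustration and not the system of the cited work. It assumes the source documents have already been machine-translated into English, and the function names and example texts are made up for this example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each tokenized document to a dict of smoothed TF-IDF weights."""
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def align(translated_source, target):
    """For each translated source document, return the index of the best target."""
    docs = [d.split() for d in translated_source + target]
    vectors = tfidf_vectors(docs)
    src_vecs = vectors[: len(translated_source)]
    tgt_vecs = vectors[len(translated_source):]
    return [
        max(range(len(tgt_vecs)), key=lambda j: cosine(s, tgt_vecs[j]))
        for s in src_vecs
    ]


# Assumed to be the English machine translations of the foreign-side documents.
translated_source = [
    "the parliament adopted the budget",
    "heavy rain flooded the city",
]
target = [
    "rain flooded the whole city",
    "parliament adopted a new budget",
]
alignment = align(translated_source, target)
```

Comparing every source document against every target document is quadratic in the collection size, so a web-scale system would need an index or some other way to restrict the candidates.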
With the increasing use of machine translation on the web, distinguishing between human-translated and machine-translated text becomes a challenge. Venugopal et al. (2011) propose a method to watermark the output of machine translation systems to aid this distinction. Antonova and Misyurev (2011) report that rule-based machine translation output can be detected by certain word choices, and statistical machine translation output by its lack of reordering. Rarrick et al. (2011) train a classifier to learn the distinction and show that removing machine-translated data leads to better translation quality.
Parallel corpora may also be built by dedicated manual translation efforts (Germann, 2001). It may be useful to focus on the most relevant new sentences, using methods such as active learning (Majithia et al., 2005). Crowd-sourcing with inexperienced translators (Zaidan and Callison-Burch, 2011) may be used to reduce cost. Post et al. (2012) follow this approach to create parallel corpora for six Indian languages.
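One simple way to focus manual translation on the most relevant new sentences is to greedily pick the sentences that add the most unseen words. The sketch below shows that heuristic only. It is not necessarily the selection criterion used in the cited work, and the function name, the whitespace tokenization, and the example data are assumptions.

```python
def select_for_translation(candidates, known_vocab, budget):
    """Greedily pick up to `budget` sentences that add the most unseen words."""
    known = set(known_vocab)
    remaining = list(candidates)
    chosen = []
    while remaining and len(chosen) < budget:
        # Score each sentence by the number of word types not yet covered.
        best = max(remaining, key=lambda s: len(set(s.split()) - known))
        if not set(best.split()) - known:
            break  # nothing left would add new vocabulary
        chosen.append(best)
        known |= set(best.split())
        remaining.remove(best)
    return chosen


candidates = ["the cat sat", "the dog barked loudly", "the cat"]
selected = select_for_translation(candidates, known_vocab={"the"}, budget=2)
```

Re-scoring after every pick keeps near-duplicate sentences from being selected together, because the second one no longer contributes new words.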
Translation memories may also be a useful training resource (Langlais and Simard, 2002).
Other methods focus on fishing the web for the translation of particular terms (Nagata et al., 2001) or phrases (Cao and Li, 2002). Related is the targeted crawling for in-domain parallel corpora (Pecina et al., 2011).
It is not clear whether it matters in which translation direction the parallel corpus was constructed, or whether both sides were translated from a third language. Halteren (2008) shows that it is possible to reliably detect the source language in English texts from the European Parliament proceedings, so the original source language does have some effect.



New Publications

  • Deng and Xue (2014)
  • Hieber et al. (2013)
  • Du et al. (2015)
  • Resnik and Smith (2003)
  • Shi et al. (2006)
  • Llitjós (2006)
  • Germann (2016)
  • Gomes and Lopes (2016)
  • Jakubina and Langlais (2016)
  • Dara and Lin (2016)
  • Esplà-Gomis et al. (2016)
  • Le et al. (2016)
  • Medveď et al. (2016)
  • Azpeitia and Etchegoyhen (2016)
  • Papavassiliou et al. (2016)
  • Lohar et al. (2016)
  • Mahata et al. (2016)
  • Shchukin et al. (2016)
  • Buck and Koehn (2016)
  • Buck and Koehn (2016)
  • Ling et al. (2016)
  • Barrón-Cedeño et al. (2015)
  • Sabet et al. (2016)
  • Ma and Liberman (1999)
  • Zariņa et al. (2015)
  • Guzman et al. (2013)
  • Toral et al. (2014)
  • Haddow et al. (2013)
  • Ling et al. (2014)
  • Eck et al. (2014)
  • Ling et al. (2013)
  • Smith et al. (2013)
  • Bond and Wang (2014)
  • Papavassiliou et al. (2013)
  • Eisele (2005)
  • Arranz et al. (2011)
  • Lu et al. (2011)
  • Gascó et al. (2012)
  • Ishisaka et al. (2009)
  • Rafalovitch and Dale (2009)
  • Utiyama et al. (2009)
  • Zhu et al. (2009)
  • Hong et al. (2010)
  • Han et al. (2009)
  • Xu and Tan (1999)
  • Esplà-Gomis (2009)
  • Ambati and Vogel (2010)
  • Hu et al. (2011)
  • Krstovski and Smith (2011)
  • Cartoni et al. (2011)
  • Gahbiche-Braham et al. (2011)
  • Patry and Langlais (2011)
  • Fry (2005)