Learning Bilingual Dictionaries from Comparable Corpora

While parallel corpora are the best training data for word-based models, probabilistic bilingual dictionaries may also be trained from comparable corpora, which are more abundant.

Dictionaries From Comparable Corpora is the main subject of 103 publications.

Topics in WordBasedModels

IBM Models | Symmetrization | Word Alignment | Dictionaries From Comparable Corpora | Lexical Choice | MT Without Word Alignment

Publications

It is also possible to extract terminology from non-parallel comparable corpora. If only monolingual corpora are available, the first task is to find words that are translations of each other. Often a seed lexicon is needed, which may consist of identically spelled words or cognates (Koehn and Knight, 2002), although some attempts do without such a seed and rely purely on co-occurrence statistics (Rapp, 1995) or on generative models based on canonical correlation analysis (Haghighi et al., 2008). Several criteria may be used to find matching words: co-occurrence vectors based on mutual information (Fung, 1997; Kaji, 2004) or tf/idf (Fung and Yee, 1998; Chiao and Zweigenbaum, 2002), co-occurrence vectors that take ambiguous words into account (Tanaka and Iwasaki, 1996) or that are reduced using latent semantic analysis (Kim et al., 2002) or Fisher kernels (Gaussier et al., 2004), heterogeneity of the word context (Fung, 1995), distributional similarity (Rapp, 1999), semantic relationship vectors (Diab and Finch, 2000), spelling similarity (Schulz et al., 2004), automatically constructed thesauri (Babych et al., 2007), and syntactic templates for the word context (Gamallo Otero, 2007). Other resources such as Wikipedia or WordNet may help with the construction of dictionaries (Ramírez et al., 2008). These methods may also be applied to collocations, not just single words (Lü and Zhou, 2004; Daille and Morin, 2008).
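
The co-occurrence-vector idea with a seed lexicon can be illustrated with a minimal Python sketch. It is not the method of any particular paper cited above: the toy corpora, the window size, the raw-count weighting, and all function names (`context_vectors`, `project`, `rank_translations`) are assumptions made for illustration. Each word gets a vector of neighbouring-word counts, the source vector is mapped into the target vocabulary through the seed lexicon, and target candidates are ranked by cosine similarity.

```python
import math
from collections import Counter, defaultdict

def context_vectors(sentences, window=2):
    """For each word, count the words that occur within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        for i, word in enumerate(sentence):
            lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][sentence[j]] += 1
    return vectors

def project(vector, seed):
    """Map a source-side context vector into target words via the seed lexicon.
    Context words that are not in the seed lexicon are dropped."""
    projected = Counter()
    for ctx_word, count in vector.items():
        if ctx_word in seed:
            projected[seed[ctx_word]] += count
    return projected

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_translations(word, src_vecs, tgt_vecs, seed):
    """Rank all target words as translation candidates for a source word."""
    projected = project(src_vecs[word], seed)
    seed_targets = set(seed.values())
    scored = []
    for candidate, vec in tgt_vecs.items():
        # Compare only on dimensions that the seed lexicon can reach.
        restricted = Counter({k: v for k, v in vec.items() if k in seed_targets})
        scored.append((cosine(projected, restricted), candidate))
    return sorted(scored, reverse=True)

# Toy, non-parallel-style data (hypothetical; real corpora would be far larger).
src_corpus = [["der", "hund", "bellt", "laut"],
              ["die", "katze", "miaut", "leise"],
              ["der", "hund", "jagt", "die", "katze"]]
tgt_corpus = [["the", "dog", "barks", "loudly"],
              ["the", "cat", "meows", "quietly"],
              ["the", "dog", "chases", "the", "cat"]]
seed_lexicon = {"bellt": "barks", "miaut": "meows", "laut": "loudly",
                "leise": "quietly", "jagt": "chases"}

src_vecs = context_vectors(src_corpus)
tgt_vecs = context_vectors(tgt_corpus)
ranking = rank_translations("hund", src_vecs, tgt_vecs, seed_lexicon)
print(ranking[:3])
```

In practice the raw counts would usually be replaced by an association weight such as mutual information or tf/idf, as in the work cited above, and words already covered by the seed lexicon would be excluded from the candidate list.
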
If monolingual corpora and bilingual dictionaries are available, the task is to find word senses or sets of mutually translated words across multiple languages. Kikui (1998) uses context vectors to disambiguate words and later adds an initial clustering step (Kikui, 1999). Koehn and Knight (2000) use a language model to learn translation probabilities for ambiguous words using the EM algorithm. Sammer and Soderland (2007) use pointwise mutual information to match the contexts in which words occur, in order to separate out the different senses of a word. Li and Li (2004) apply the Yarowsky algorithm (Yarowsky, 1994) to bootstrap bilingual word translation models for ambiguous words. Bridge languages have been shown to be useful (Mann and Yarowsky, 2001; Schafer and Yarowsky, 2002). Otero (2005) extends the use of the Dice coefficient with local context to better deal with polysemous words. Comparable corpora also allow the construction of translation models between a formal language and a dialect (Hwa et al., 2006).
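
Two of the association measures named above, the Dice coefficient and pointwise mutual information, can be computed from sentence-level co-occurrence counts. The following Python sketch shows only the basic formulas, not the extensions described in the cited papers; the toy sentences and the function names (`cooccurrence_counts`, `dice`, `pmi`) are assumptions made for illustration. With f(x) the number of sentences containing x, f(x, y) the number containing both, and N the number of sentences, it uses Dice = 2 f(x, y) / (f(x) + f(y)) and PMI = log2(N f(x, y) / (f(x) f(y))).

```python
import math
from collections import Counter
from itertools import combinations

def cooccurrence_counts(sentences):
    """Count in how many sentences each word, and each unordered word pair, occurs."""
    word_freq = Counter()
    pair_freq = Counter()
    for sentence in sentences:
        types = set(sentence)  # count each word at most once per sentence
        word_freq.update(types)
        pair_freq.update(frozenset(pair) for pair in combinations(sorted(types), 2))
    return word_freq, pair_freq, len(sentences)

def dice(x, y, word_freq, pair_freq):
    """Dice coefficient: 2 * f(x, y) / (f(x) + f(y))."""
    denom = word_freq[x] + word_freq[y]
    return 2.0 * pair_freq[frozenset((x, y))] / denom if denom else 0.0

def pmi(x, y, word_freq, pair_freq, n):
    """Pointwise mutual information in bits; -inf if the pair never co-occurs."""
    joint = pair_freq[frozenset((x, y))]
    if joint == 0:
        return float("-inf")
    return math.log2(joint * n / (word_freq[x] * word_freq[y]))

# Toy sentences (hypothetical) in which "bank" appears in two different senses.
sentences = [["bank", "river", "water"],
             ["bank", "money", "loan"],
             ["bank", "money", "account"],
             ["river", "water", "fish"]]

word_freq, pair_freq, n = cooccurrence_counts(sentences)
for context_word in ("money", "river"):
    print(context_word,
          dice("bank", context_word, word_freq, pair_freq),
          pmi("bank", context_word, word_freq, pair_freq, n))
```

Context words that associate strongly with each other but not across groups (here "money" versus "river") are the kind of evidence that can be used to separate the senses of an ambiguous word such as "bank".
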

Benchmarks

Discussion

Related Topics

Various other approaches to using comparable corpora have been explored.

New Publications

  • Wijaya et al. (2017)
  • Zhang et al. (2017)
  • Nakashole and Flauger (2017)
  • Pourdamghani and Knight (2017)
  • Arnaud et al. (2017)
  • Hauer et al. (2017)
  • Kim et al. (2017)
  • Irvine and Callison-Burch (2013)
  • Nuhn and Ney (2014)
  • Wang and Sitbon (2014)
  • Pal et al. (2014)
  • Rapp and Sharoff (2014)
  • Irvine and Callison-Burch (2014)
  • Dou et al. (2014)
  • Dou and Knight (2013)
  • Irvine (2013)
  • Irvine and Callison-Burch (2014)
  • Irvine et al. (2013)
  • Irvine and Callison-Burch (2013)
  • Aker et al. (2013)
  • Nuhn and Ney (2013)
  • Nuhn et al. (2013)
  • Ravi (2013)
  • Kontonatsios et al. (2014)
  • Kontonatsios et al. (2014)
  • Saluja et al. (2014)
  • Apidianaki et al. (2013)
  • Bouamor et al. (2013)
  • Hazem and Morin (2013)
  • Bouamor et al. (2013)
  • Rivera et al. (2013)
  • Delpech (2011)
  • Delpech et al. (2012)
  • Gupta et al. (2013)
  • Afli et al. (2012)
  • Niehues and Waibel (2011)
  • Morin and Daille (2012)
  • Qian et al. (2012)
  • Su and Babych (2012)
  • Riesa and Marcu (2012)
  • Ture and Lin (2012)
  • Chang et al. (2012)
  • Pinnis et al. (2012)
  • Hall and Klein (2011)
  • Bourdaillet and Langlais (2012)
  • Do et al. (2009)
  • Yu and Tsujii (2009)
  • Wu et al. (2008)
  • Wu et al. (2009)
  • Tillmann (2009)
  • Cettolo et al. (2010)
  • Huang et al. (2010)
  • Lee et al. (2010)
  • Smith et al. (2010)
  • Belz and Kow (2011)
  • Pekar et al. (2006)
  • Suzuki and Kumano (2005)