Training of Phrase Based Models
The fundamental data structure in phrase-based models is a table of phrase pairs with associated scores, which may come from probability distributions. Most commonly, this table is acquired from a word-aligned parallel corpus.
Phrase Based Model Training is the main subject of 31 publications. 22 are discussed here.
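To make the phrase table concrete, the following is a minimal sketch of one common way to attach scores to phrase pairs: relative-frequency estimation over the extracted pairs. The function name `score_phrase_table` and the nested-dictionary layout are illustrative assumptions, not taken from any of the publications listed here.

```python
from collections import Counter, defaultdict


def score_phrase_table(extracted_pairs):
    """Estimate p(target | source) by relative frequency.

    extracted_pairs: iterable of (source_phrase, target_phrase) tuples,
    one per extraction event (hypothetical input format).
    """
    extracted_pairs = list(extracted_pairs)  # allow generators; we iterate twice
    pair_counts = Counter(extracted_pairs)
    source_counts = Counter(source for source, _ in extracted_pairs)

    table = defaultdict(dict)
    for (source, target), count in pair_counts.items():
        # Relative frequency: count(source, target) / count(source)
        table[source][target] = count / source_counts[source]
    return dict(table)


pairs = [
    ("das Haus", "the house"),
    ("das Haus", "the house"),
    ("das Haus", "the home"),
    ("klein", "small"),
]
table = score_phrase_table(pairs)
```

Here `table["das Haus"]` maps each candidate translation to its estimated probability. Practical systems typically store several scores per phrase pair, such as both translation directions and lexical weights, which this sketch omits.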
Several methods to extract phrases from a parallel corpus have been proposed. Most make use of word alignments (Tillmann, 2003; Zhang et al., 2003; Zhao and Vogel, 2005; Zhang and Vogel, 2005; Setiawan et al., 2005). One may restrict extraction of phrase pairs to the smallest phrases that cover the sentence (Mariño et al., 2005). Lambert and Banchs (2005) compare this restrictive method with the method described in this book and propose some refinements.
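As an illustration of extraction from word alignments, here is a minimal sketch of the widely used consistency criterion: a phrase pair is kept only if no word inside it is aligned to a word outside it. This is a simplified version under stated assumptions and does not reproduce any specific cited method. It only produces the tightest target span for each source span and does not extend spans over unaligned words. The function name `extract_phrase_pairs` and the `max_len` parameter are hypothetical.

```python
def extract_phrase_pairs(src, tgt, alignment, max_len=4):
    """Return phrase pairs consistent with a word alignment.

    src, tgt: lists of tokens.
    alignment: set of (i, j) pairs linking src[i] to tgt[j].
    max_len: maximum phrase length in tokens (assumed limit).
    """
    pairs = set()
    for s_start in range(len(src)):
        for s_end in range(s_start, min(s_start + max_len, len(src))):
            # Target positions linked to any word in the source span.
            linked = [j for (i, j) in alignment if s_start <= i <= s_end]
            if not linked:
                continue  # skip spans with no alignment points
            t_start, t_end = min(linked), max(linked)
            if t_end - t_start + 1 > max_len:
                continue
            # Consistency check: no target word in the span may be
            # aligned to a source word outside the source span.
            violates = any(
                t_start <= j <= t_end and not (s_start <= i <= s_end)
                for (i, j) in alignment
            )
            if violates:
                continue
            pairs.add((" ".join(src[s_start:s_end + 1]),
                       " ".join(tgt[t_start:t_end + 1])))
    return pairs


pairs = extract_phrase_pairs(
    ["a", "b", "c"], ["x", "y"], {(0, 0), (2, 0), (1, 1)}
)
```

In this toy example both "a" and "c" are aligned to "x", so neither can be extracted alone with "x". Only the pairs ("b", "y") and ("a b c", "x y") satisfy the criterion.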
Phrase alignment may be carried out directly from sentence-aligned corpora using a probabilistic model (Shin et al., 1996), pattern mining methods (Yamamoto et al., 2003), or matrix factorization (Goutte et al., 2004). IBM Model 1 probabilities may be used to separate the words aligned to each phrase from the words outside it (Vogel, 2005), a method also used for splitting long sentences (Xu et al., 2005). Zhao et al. (2004) use a measure based on the tf-idf score from information retrieval to score phrase translations. Additional feature scores may also be used during the parameter tuning of the decoder to determine which phrase pairs should be discarded (Deng et al., 2008). Kim and Vogel (2007) use an iterative method that adds extracted phrases to the parallel corpus to bootstrap better alignments and extract better phrases. Turchi et al. (2008) give an overall analysis of the learning problem for phrase-based machine translation.
Word alignment probabilities may guide decisions on phrase extraction and phrase scoring. Venugopal et al. (2008) use posterior probabilities and Tomeh et al. (2011) use discriminatively trained word alignment confidence scores.
Existing bilingual dictionaries may simply be added to the training data as additional parallel data. This may, however, miss the right context in which these words occur. Okuma et al. (2007) propose to insert new entries into the phrase table by adapting existing entries that contain a word very similar to the dictionary word, replacing that word with the dictionary word.
A stream of new training data may be added continuously to an existing translation model, which requires fast incremental updating using a variant of the online EM algorithm (Ortiz-Martínez et al., 2010) and dynamic suffix arrays (Levenberg et al., 2010).
With the increasing size of available parallel corpora and translation models, efficient use of working memory becomes an issue, motivating the development of parallel infrastructures for training such as Google's MapReduce (Dyer et al., 2008).
- He et al. (2013)
- Mansour and Ney (2014)
- Tomeh et al. (2014)
- Flanagan (2014)
- Wuebker and Ney (2013)
- Srivastava and Way (2009)
- Gupta et al. (2011)
- Chen et al. (2009)
- Guzman et al. (2009)