Training of Phrase Based Models

The fundamental data structure in phrase based models is a table of phrase pairs with associated scores which may come from probability distributions. Most commonly, this table is acquired from a word aligned parallel corpus.
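As a rough illustration of such a table, the sketch below scores phrase pairs by relative frequency in both translation directions. It assumes the phrase pairs have already been extracted; the function name `build_phrase_table`, the score names, and the toy data are hypothetical and chosen only for this example.

```python
from collections import Counter, defaultdict


def build_phrase_table(phrase_pairs):
    """Score phrase pairs by relative frequency in both directions."""
    pair_counts = Counter(phrase_pairs)
    src_counts = Counter(src for src, _ in phrase_pairs)
    tgt_counts = Counter(tgt for _, tgt in phrase_pairs)
    table = defaultdict(dict)
    for (src, tgt), n in pair_counts.items():
        table[src][tgt] = {
            # count(src, tgt) / count(src)
            "p_tgt_given_src": n / src_counts[src],
            # count(src, tgt) / count(tgt)
            "p_src_given_tgt": n / tgt_counts[tgt],
        }
    return dict(table)


# Hypothetical extracted phrase pairs (toy data, not from any corpus).
phrase_pairs = [
    ("das haus", "the house"),
    ("das haus", "the house"),
    ("das haus", "the building"),
    ("haus", "house"),
]
table = build_phrase_table(phrase_pairs)
```

Relative frequency is only one possible scoring function; real systems typically store several scores per phrase pair.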

Phrase Based Model Training is the main subject of 31 publications. 22 are discussed here.

Publications

Several methods to extract phrases from a parallel corpus have been proposed. Most make use of word alignments (Tillmann, 2003; Zhang et al., 2003; Zhao and Vogel, 2005; Zhang and Vogel, 2005; Setiawan et al., 2005). One may restrict extraction of phrase pairs to the smallest phrases that cover the sentence (Mariño et al., 2005). Lambert and Banchs (2005) compare this restrictive method with the method described in this book and propose some refinements.
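To make alignment-based extraction concrete, here is a minimal sketch of the widely used consistency heuristic: a phrase pair is kept if no word inside it is aligned to a word outside it. It is a simplified illustration, not the exact procedure of any paper cited above; it does not extend phrases over unaligned words, and the function name `extract_phrases`, the `max_len` parameter, and the toy sentence pair are assumptions made for this example.

```python
def extract_phrases(src, tgt, alignment, max_len=4):
    """Return phrase pairs consistent with a word alignment.

    alignment is a set of (source index, target index) links.
    """
    pairs = set()
    for s_start in range(len(src)):
        for s_end in range(s_start, min(s_start + max_len, len(src))):
            # Target positions linked to any word in the source span.
            linked = [j for (i, j) in alignment if s_start <= i <= s_end]
            if not linked:
                continue
            t_start, t_end = min(linked), max(linked)
            if t_end - t_start + 1 > max_len:
                continue
            # Consistency check: every link that touches the target span
            # must originate inside the source span.
            consistent = all(
                s_start <= i <= s_end
                for (i, j) in alignment
                if t_start <= j <= t_end
            )
            if consistent:
                pairs.add((" ".join(src[s_start:s_end + 1]),
                           " ".join(tgt[t_start:t_end + 1])))
    return pairs


# Toy sentence pair with a reordering between "casa verde" and "green house".
src = ["la", "casa", "verde"]
tgt = ["the", "green", "house"]
alignment = {(0, 0), (1, 2), (2, 1)}
pairs = extract_phrases(src, tgt, alignment)
```

In this example the span "la casa" yields no phrase pair, because its target span would include "green", which is aligned to "verde" outside the source span.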

Phrase alignment may be carried out directly from sentence-aligned corpora using a probabilistic model (Shin et al., 1996), pattern mining methods (Yamamoto et al., 2003), or matrix factorization (Goutte et al., 2004). IBM Model 1 probabilities may be used to separate words aligned to each phrase from words outside it (Vogel, 2005), a method also used for splitting long sentences (Xu et al., 2005). Zhao et al. (2004) use a measure based on the tf-idf score from information retrieval to score phrase translations. Additional feature scores may also be used during the parameter tuning of the decoder to determine which phrase pairs should be discarded (Deng et al., 2008). Kim and Vogel (2007) use an iterative method that adds extracted phrases to the parallel corpus to bootstrap better alignments and extract better phrases. Turchi et al. (2008) give an overall analysis of the learning problem for phrase-based machine translation.

Word alignment probabilities may guide decisions on phrase extraction and phrase scoring. Venugopal et al. (2008) use posterior probabilities, and Tomeh et al. (2011) use discriminatively trained word alignment confidence scores.

Existing bilingual dictionaries may simply be added to the training data as additional parallel data. This may, however, miss the right context in which these words occur. Okuma et al. (2007) propose to insert new entries into the phrase table by taking existing entries that contain a word very similar to the dictionary word and replacing that word with the dictionary word.

A stream of new training data may be added continuously to an existing translation model. This requires fast incremental updating, for which a variant of the online EM algorithm (Ortiz-Martínez et al., 2010) and dynamic suffix arrays (Levenberg et al., 2010) have been used.
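The basic idea of incremental updating can be sketched with plain counts: new phrase pairs update stored counts, and probabilities are computed on demand so that they always reflect the latest data. This is deliberately not online EM and not a suffix array; it is a much simpler count-based stand-in, and the class name `IncrementalPhraseTable` and its methods are hypothetical.

```python
from collections import Counter


class IncrementalPhraseTable:
    """Count-based phrase table that can absorb new data at any time."""

    def __init__(self):
        self.pair_counts = Counter()
        self.src_counts = Counter()

    def add(self, phrase_pairs):
        """Update counts with phrase pairs from newly arrived data."""
        for src, tgt in phrase_pairs:
            self.pair_counts[(src, tgt)] += 1
            self.src_counts[src] += 1

    def prob(self, src, tgt):
        """Relative frequency p(tgt | src), computed from current counts."""
        total = self.src_counts[src]
        return self.pair_counts[(src, tgt)] / total if total else 0.0


stream_table = IncrementalPhraseTable()
stream_table.add([("haus", "house")])
p_before = stream_table.prob("haus", "house")
stream_table.add([("haus", "home")])  # a later batch changes the estimate
p_after = stream_table.prob("haus", "house")
```

Because only counts are stored, adding a batch costs time proportional to the batch size and no retraining over older data is needed.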

With the increasing size of available parallel corpora and translation models, efficient use of working memory becomes an issue, motivating the development of parallel infrastructures for training such as Google's MapReduce (Dyer et al., 2008).
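The counting step of training maps naturally onto the MapReduce pattern. The sketch below simulates that pattern in a single Python process: each corpus shard is mapped to (phrase pair, 1) records, records are grouped by key, and a reducer sums them. It is not the implementation of Dyer et al. (2008) and uses no actual MapReduce framework; the function names and the toy shards are assumptions for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(shard):
    """Emit (phrase pair, 1) for every phrase pair in one corpus shard."""
    return [(pair, 1) for pair in shard]


def shuffle(emitted):
    """Group emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in emitted:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values of each key to obtain phrase pair counts."""
    return {key: sum(values) for key, values in groups.items()}


# Hypothetical shards, each holding phrase pairs extracted from part of a corpus.
shards = [
    [("haus", "house"), ("das haus", "the house")],
    [("haus", "house"), ("haus", "home")],
]
emitted = chain.from_iterable(map_phase(shard) for shard in shards)
counts = reduce_phase(shuffle(emitted))
```

In a real deployment the map calls would run on different machines, so no single worker needs to hold the whole corpus in memory.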

Benchmarks

Discussion

New Publications

He, Hua and Lin, Jimmy and Lopez, Adam (2013): Massively Parallel Suffix Array Queries and On-Demand Phrase Extraction for Statistical Machine Translation Using GPUs, Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
@InProceedings{he-lin-lopez:2013:NAACL-HLT,
author = {He, Hua and Lin, Jimmy and Lopez, Adam},
title = {Massively Parallel Suffix Array Queries and On-Demand Phrase Extraction for Statistical Machine Translation Using GPUs},
booktitle = {Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
month = {June},
address = {Atlanta, Georgia},
publisher = {Association for Computational Linguistics},
pages = {325--334},
url = {http://www.aclweb.org/anthology/N13-1033},
year = 2013
}
He et al. (2013)
Saab Mansour and Hermann Ney (2014): Translation model based weighting for phrase extraction, Proceedings of 17th Annual conference of the European Association for Machine Translation
@inproceedings{eamt-2014-Mansour,
author = {Saab Mansour and Hermann Ney},
title = {Translation model based weighting for phrase extraction},
booktitle = {Proceedings of 17th Annual conference of the European Association for Machine Translation},
pages = {35--43},
url = {http://www.mt-archive.info/10/EAMT-2014-Mansour.pdf},
location = {Dubrovnik, Croatia},
year = 2014
}
Mansour and Ney (2014)
Nadi Tomeh and Alexandre Allauzen and François Yvon (2014): Maximum-entropy word alignment and posterior-based phrase extraction for machine translation, Machine Translation
@article{MTJ:2014:Tomeh,
author = {Nadi Tomeh and Alexandre Allauzen and Fran{\c{c}}ois Yvon},
title = {Maximum-entropy word alignment and posterior-based phrase extraction for machine translation},
pages = {19--56},
journal = {Machine Translation},
volume = {28},
number = {1},
month = {March},
year = 2014
}
Tomeh et al. (2014)
Kevin Flanagan (2014): Bilingual phrase-to-phrase alignment for arbitrarily-small datasets, Proceedings of the Eleventh Conference of the Association for Machine Translation in the Americas (AMTA)
@inproceedings{AMTA-2014--Flanagan,
author = {Kevin Flanagan},
title = {Bilingual phrase-to-phrase alignment for arbitrarily-small datasets},
pages = {83--95},
url = {http://www.mt-archive.info/10/AMTA-2014--Flanagan.pdf},
volume = {1},
booktitle = {Proceedings of the Eleventh Conference of the Association for Machine Translation in the Americas (AMTA)},
location = {Vancouver, BC, Canada},
year = 2014
}
Flanagan (2014)
Wuebker, Joern and Ney, Hermann (2013): Length-Incremental Phrase Training for SMT, Proceedings of the Eighth Workshop on Statistical Machine Translation
@InProceedings{wuebker-ney:2013:WMT,
author = {Wuebker, Joern and Ney, Hermann},
title = {Length-Incremental Phrase Training for {SMT}},
booktitle = {Proceedings of the Eighth Workshop on Statistical Machine Translation},
month = {August},
address = {Sofia, Bulgaria},
publisher = {Association for Computational Linguistics},
pages = {309--319},
url = {http://www.aclweb.org/anthology/W13-2238},
year = 2013
}
Wuebker and Ney (2013)
Ankit Srivastava and Andy Way (2009): Using Percolated Dependencies for Phrase Extraction in SMT, Proceedings of the Twelfth Machine Translation Summit (MT Summit XII) mentioned in Phrase Based Model Training and Parallel Treebanks
@inproceedings{MTS09:Srivastava,
author = {Ankit Srivastava and Andy Way},
title = {Using Percolated Dependencies for Phrase Extraction in {SMT}},
url = {http://doras.dcu.ie/15152/1/SrivastavaWay\_mts\_09.pdf},
googlescholar = {11366049404327383407},
booktitle = {Proceedings of the Twelfth Machine Translation Summit (MT Summit XII)},
publisher = {International Association for Machine Translation},
location = {Ottawa, Ontario, Canada},
year = 2009
}
Srivastava and Way (2009)
Mridul Gupta and Sanjika Hewavitharana and Stephan Vogel (2011): Extending a probabilistic phrase alignment approach for SMT, Proceedings of the seventh International Workshop on Spoken Language Translation (IWSLT)
@inproceedings{iwslt11:Gupta,
author = {Mridul Gupta and Sanjika Hewavitharana and Stephan Vogel},
title = {Extending a probabilistic phrase alignment approach for {SMT}},
url = {http://www.mt-archive.info/IWSLT-2011-Gupta.pdf},
pages = {175--182},
editor = {Marcello Federico and Mei-Yuh Hwang and Margit R{\"o}dder and Sebastian St{\"u}ker},
booktitle = {Proceedings of the seventh International Workshop on Spoken Language Translation (IWSLT)},
location = {San Francisco, USA},
year = 2011
}
Gupta et al. (2011)
Boxing Chen and George Foster and Roland Kuhn (2009): Phrase Translation Model Enhanced with Association based Features, Proceedings of the Twelfth Machine Translation Summit (MT Summit XII)
@inproceedings{MTS09:Chen,
author = {Boxing Chen and George Foster and Roland Kuhn},
title = {Phrase Translation Model Enhanced with Association based Features},
url = {http://www.mt-archive.info/MTS-2009-Chen.pdf},
googlescholar = {16816990626962163720},
booktitle = {Proceedings of the Twelfth Machine Translation Summit (MT Summit XII)},
publisher = {International Association for Machine Translation},
location = {Ottawa, Ontario, Canada},
year = 2009
}
Chen et al. (2009)
Francisco Guzman and Qin Gao and Stephan Vogel (2009): Reassessment of the Role of Phrase Extraction in PBSMT, Proceedings of the Twelfth Machine Translation Summit (MT Summit XII)
@inproceedings{MTS09:Guzman,
author = {Francisco Guzman and Qin Gao and Stephan Vogel},
title = {Reassessment of the Role of Phrase Extraction in {PBSMT}},
url = {http://www.mt-archive.info/MTS-2009-Guzman.pdf},
googlescholar = {3178600509478785340},
booktitle = {Proceedings of the Twelfth Machine Translation Summit (MT Summit XII)},
publisher = {International Association for Machine Translation},
location = {Ottawa, Ontario, Canada},
year = 2009
}
Guzman et al. (2009)

MT Research Survey Wiki

A Comprehensive Survey of Neural and Statistical Machine Translation Research Publications
