Statistical Significance

Do differences in evaluation scores on a given test set indicate real quality differences between the underlying machine translation systems? We would like to compute the statistical significance of these differences.

Statistical Significance is the main subject of 7 publications. 3 are discussed here.

Publications

For the commonly used BLEU score, there is no analytical method to determine statistical significance, so we need to rely on methods such as bootstrap resampling (Koehn, 2004). For further comments on this technique and an alternative, see work by Riezler and Maxwell (2005). Estrella et al. (2007) examine the minimum size of the test set for a reliable comparison of different machine translation systems.

Koehn, Philipp (2004): Statistical Significance Tests for Machine Translation Evaluation, Proceedings of EMNLP 2004

Riezler, Stefan and Maxwell, John T. (2005): On Some Pitfalls in Automatic Evaluation and Significance Testing for MT, Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization

Paula Estrella and Olivier Hamon and Andrei Popescu-Belis (2007): How Much Data is Needed for Reliable MT Evaluation? Using Bootstrapping to Study Human and Automatic Metrics, Proceedings of the MT Summit XI
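
The bootstrap resampling idea mentioned above can be illustrated with a minimal sketch of a paired bootstrap test. This is an illustration under stated assumptions, not the exact procedure of any cited paper: the function names `paired_bootstrap` and `unigram_precision` are hypothetical, and the toy unigram-precision metric is only a stand-in for BLEU so that the example stays self-contained. The key point it shows is that the corpus-level score is recomputed on each resampled test set, drawn with replacement and using the same sentence indices for both systems.

```python
import random
from collections import Counter
from typing import Callable, Sequence

# A corpus-level metric: takes hypotheses and references, returns one score.
Metric = Callable[[Sequence[str], Sequence[str]], float]


def unigram_precision(hyps: Sequence[str], refs: Sequence[str]) -> float:
    """Toy corpus-level metric (assumption: a stand-in for BLEU)."""
    matched = 0
    total = 0
    for hyp, ref in zip(hyps, refs):
        ref_counts = Counter(ref.split())
        for tok in hyp.split():
            total += 1
            if ref_counts[tok] > 0:
                # Clip matches so a reference token is used only once.
                ref_counts[tok] -= 1
                matched += 1
    return matched / total if total else 0.0


def paired_bootstrap(
    hyps_a: Sequence[str],
    hyps_b: Sequence[str],
    refs: Sequence[str],
    metric: Metric,
    n_samples: int = 1000,
    seed: int = 0,
) -> float:
    """Return the fraction of resampled test sets on which A beats B."""
    rng = random.Random(seed)
    n = len(refs)
    wins_a = 0
    for _ in range(n_samples):
        # Draw n sentence indices with replacement; reuse them for both
        # systems so the comparison is paired.
        idx = [rng.randrange(n) for _ in range(n)]
        sample_refs = [refs[i] for i in idx]
        score_a = metric([hyps_a[i] for i in idx], sample_refs)
        score_b = metric([hyps_b[i] for i in idx], sample_refs)
        if score_a > score_b:
            wins_a += 1
    return wins_a / n_samples
```

Under this reading, a returned fraction of 0.95 or more is commonly interpreted as system A being better than system B at roughly the 95% confidence level. In practice the toy metric would be replaced by a real BLEU implementation.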

Benchmarks

Discussion

New Publications

Graham, Yvette and Mathur, Nitika and Baldwin, Timothy (2014): Randomized Significance Tests in Machine Translation, Proceedings of the Ninth Workshop on Statistical Machine Translation
@InProceedings{graham-mathur-baldwin:2014:W14-33,
author = {Graham, Yvette and Mathur, Nitika and Baldwin, Timothy},
title = {Randomized Significance Tests in Machine Translation},
booktitle = {Proceedings of the Ninth Workshop on Statistical Machine Translation},
month = {June},
address = {Baltimore, Maryland, USA},
publisher = {Association for Computational Linguistics},
pages = {266--274},
url = {http://www.aclweb.org/anthology/W14-3333},
year = 2014
}
Graham et al. (2014)
Graham, Yvette and Baldwin, Timothy (2014): Testing for Significance of Increased Correlation with Human Judgment, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)
@InProceedings{graham-baldwin:2014:EMNLP2014,
author = {Graham, Yvette and Baldwin, Timothy},
title = {Testing for Significance of Increased Correlation with Human Judgment},
booktitle = {Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
month = {October},
address = {Doha, Qatar},
publisher = {Association for Computational Linguistics},
pages = {172--176},
url = {http://www.aclweb.org/anthology/D14-1020},
year = 2014
}
Graham and Baldwin (2014)
Bradley Efron and Robert J. Tibshirani (1993): An Introduction to the Bootstrap
@Book{BootstrapResampling,
author = {Bradley Efron and Robert J. Tibshirani},
title = {An Introduction to the Bootstrap},
publisher = {Chapman and Hall},
year = 1993
}
Efron and Tibshirani (1993)
Ying Zhang and Stephan Vogel (2010): Significance tests of automatic machine translation evaluation metrics, Machine Translation
@article{MTJ:2010:Zhang,
author = {Ying Zhang and Stephan Vogel},
title = {Significance tests of automatic machine translation evaluation metrics},
pages = {51--65},
journal = {Machine Translation},
volume = {24},
number = {1},
month = {March},
year = 2010
}
Zhang and Vogel (2010)
