Statistical Significance

Do differences in evaluation scores on a given test set indicate real quality differences between the underlying machine translation systems? We would like to compute the statistical significance of these differences.

Statistical Significance is the main subject of 7 publications.

Publications

For the commonly used BLEU score, there is no analytical method to determine statistical significance, so we need to rely on methods such as bootstrap resampling (Koehn, 2004). For further comments on this technique and an alternative, see work by Riezler and Maxwell (2005).
Estrella et al. (2007) examine the minimum size of the test set for a reliable comparison of different machine translation systems.
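
To make the bootstrap resampling idea mentioned above concrete, here is a minimal sketch of a paired bootstrap comparison between two systems. It is an illustration under stated assumptions, not the exact procedure of any cited paper: the names `paired_bootstrap` and `mean_score` are hypothetical, the toy scores are invented, and the metric is a simple average of per-sentence scores standing in for a corpus-level metric such as BLEU (which would instead be recomputed from per-sentence n-gram statistics on each resample).

```python
import random
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def paired_bootstrap(
    stats_a: Sequence[T],
    stats_b: Sequence[T],
    metric: Callable[[Sequence[T]], float],
    n_samples: int = 1000,
    seed: int = 0,
) -> float:
    """Return the fraction of resampled test sets on which system A
    scores strictly higher than system B."""
    if len(stats_a) != len(stats_b):
        raise ValueError("both systems must be scored on the same sentences")
    if not stats_a:
        raise ValueError("test set is empty")

    rng = random.Random(seed)
    n = len(stats_a)
    wins_a = 0
    for _ in range(n_samples):
        # Draw n sentence indices with replacement. The same indices are
        # used for both systems, which is what makes the test paired.
        idx = [rng.randrange(n) for _ in range(n)]
        score_a = metric([stats_a[i] for i in idx])
        score_b = metric([stats_b[i] for i in idx])
        if score_a > score_b:
            wins_a += 1
    return wins_a / n_samples


def mean_score(scores: Sequence[float]) -> float:
    # Assumption: a stand-in metric (average sentence-level score), not BLEU.
    return sum(scores) / len(scores)


# Invented per-sentence scores for two hypothetical systems.
scores_a = [0.42, 0.55, 0.31, 0.60, 0.48, 0.52, 0.39, 0.58]
scores_b = [0.40, 0.50, 0.33, 0.51, 0.45, 0.47, 0.38, 0.49]

win_rate = paired_bootstrap(scores_a, scores_b, mean_score)
print(f"System A beats system B on {win_rate:.1%} of resampled test sets")
```

A win rate at or above 0.95 is often read as system A being better at the 95% confidence level. With a test set as small as this toy example, such a conclusion would not be reliable, which is the concern about test set size raised above.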

New Publications

  • Graham et al. (2014)
  • Graham and Baldwin (2014)
  • Efron and Tibshirani (1993)
  • Zhang and Vogel (2010)
