If we define the quality of a metric by its correlation with human judgment, then it is possible to train metrics to optimize this correlation. Trained metrics are the main subject of 25 publications; 9 are discussed here.
Albrecht and Hwa (2007) argue for the general advantages of learning evaluation metrics from a large number of features, although Sun et al. (2008) point out that carefully designed features may be more important.
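This family of feature-based approaches can be illustrated with a minimal sketch: extract simple surface features from a hypothesis–reference pair, fit a linear model to human judgment scores, and measure Pearson correlation. The features, toy data, and model below are illustrative assumptions, not taken from any of the cited papers.

```python
# Minimal sketch of a feature-based trained metric: a linear model
# mapping surface features of a (hypothesis, reference) pair to human
# judgment scores, evaluated by Pearson correlation.
import numpy as np

def features(hyp, ref):
    h, r = hyp.split(), ref.split()
    overlap = len(set(h) & set(r)) / max(len(set(r)), 1)   # unigram recall
    len_ratio = min(len(h), len(r)) / max(len(h), len(r), 1)  # length match
    return [1.0, overlap, len_ratio]  # bias term + two features

# Toy training data: (hypothesis, reference, human score in [0, 1]).
train = [
    ("the cat sat on the mat", "the cat sat on the mat", 1.0),
    ("cat on mat", "the cat sat on the mat", 0.6),
    ("a dog barked loudly", "the cat sat on the mat", 0.1),
    ("the cat is on the mat", "the cat sat on the mat", 0.8),
]
X = np.array([features(h, r) for h, r, _ in train])
y = np.array([s for _, _, s in train])
w, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares fit

def trained_metric(hyp, ref):
    return float(np.array(features(hyp, ref)) @ w)

# Correlation of the learned metric with the human scores.
pred = np.array([trained_metric(h, r) for h, r, _ in train])
pearson = np.corrcoef(pred, y)[0, 1]
```

In practice the feature set is far richer (lexical, syntactic, semantic), and correlation is measured on held-out judgments rather than the training data.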
Jones and Rusk (2000) propose a method that automatically learns to distinguish human translations from machine translations. Since in practice the purpose of evaluation is to distinguish good translations from bad ones, it may be beneficial to view evaluation as a ranking task (Ye et al., 2007; Duh, 2008). Lin and Och (2004) propose a metric for the evaluation of evaluation metrics, which does not require human judgment data for correlation; it is based on the rank given to a reference translation among machine translations.
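This idea can be sketched as follows: a candidate metric scores the human reference alongside the machine translations for each source sentence, and the metric itself is judged by the average rank it assigns to the reference (lower is better). The toy metrics and data below are illustrative assumptions, not the actual formulation of the cited paper.

```python
# Minimal sketch of meta-evaluation by reference rank: a better metric
# should rank the human reference higher among machine translations.

def unigram_recall(hyp, ref):
    h, r = set(hyp.split()), set(ref.split())
    return len(h & r) / max(len(r), 1)

def length_only(hyp, ref):  # a deliberately weak baseline metric
    return -abs(len(hyp.split()) - 5)

# Each item: (reference, machine translations for the same source).
data = [
    ("the cat sat on the mat",
     ["cat sat mat", "a dog barked loudly here", "the the the cat"]),
    ("she reads a book",
     ["she book", "he sings a song today", "reads book she"]),
]

def avg_reference_rank(metric, data):
    ranks = []
    for ref, mt_outputs in data:
        # Score the reference together with the MT outputs and find
        # its position in the metric's ranking (1 = best possible).
        scored = sorted([ref] + mt_outputs,
                        key=lambda h: metric(h, ref), reverse=True)
        ranks.append(scored.index(ref) + 1)
    return sum(ranks) / len(ranks)
```

On this toy data, `unigram_recall` always ranks the reference first, while `length_only` does not, so the rank-based meta-evaluation prefers the former without consulting any human judgments.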
Multiple metrics may be combined uniformly (Giménez and Màrquez, 2008), or by adding metrics greedily until no improvement is seen (Giménez and Màrquez, 2008). Denkowski and Lavie (2010) add a number of parameters to the n-gram based METEOR metric and extend it into a trainable metric.
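The greedy combination scheme can be sketched as follows: starting from an empty set, repeatedly add the component metric that most improves the Pearson correlation of the uniform combination with human judgments, and stop when no addition helps. The component metrics, scores, and human judgments below are illustrative assumptions.

```python
# Minimal sketch of greedy metric combination against human judgments.
import numpy as np

human = np.array([1.0, 0.7, 0.3, 0.1])   # human scores for 4 hypotheses
components = {                            # per-hypothesis metric scores
    "recall": np.array([1.0, 0.8, 0.3, 0.0]),
    "length": np.array([1.0, 0.5, 0.9, 0.4]),
    "noise":  np.array([0.2, 0.9, 0.1, 0.8]),
}

def pearson(a, b):
    return float(np.corrcoef(a, b)[0, 1])

def greedy_combine(components, human):
    chosen, best = [], -1.0
    while True:
        gains = {}
        for name in components:
            if name in chosen:
                continue
            # Uniform average of the already-chosen metrics plus the candidate.
            combo = np.mean([components[m] for m in chosen + [name]], axis=0)
            gains[name] = pearson(combo, human)
        if not gains:
            break
        name, corr = max(gains.items(), key=lambda kv: kv[1])
        if corr <= best:                  # no improvement: stop
            break
        chosen.append(name)
        best = corr
    return chosen, best

chosen, corr = greedy_combine(components, human)
```

Here the uninformative `noise` component is never added, since it only lowers the combination's correlation.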
Further publications on trained metrics:
- Stanojević and Sima'an (2017)
- Gupta et al. (2015)
- Guzmán et al. (2014)
- Gonzàlez et al. (2014)
- Stanojevic and Sima'an (2014)
- Stanojević and Sima'an (2014)
- Specia and Shah (2014)
- Han et al. (2013)
- Wang and Manning (2012)
- Fishel et al. (2012)
- Wong and Kit (2010)
- Dahlmeier et al. (2011)
- Song and Cohn (2011)
- Albrecht and Hwa (2008)
- Lita et al. (2005)
- Albrecht and Hwa (2007)