Shared Task: Metrics

Metrics Task Important Dates

System outputs ready to download: July 26, 2021 (previously July 15, 2021)
Submission deadline for metrics task: August 09, 2021 (AoE) (previously July 22, 2021)
Paper submission deadline to WMT: Aug 5, 2021
WMT Notification of acceptance: Sept 5, 2021
WMT Camera-ready deadline: Sept 15, 2021
Conference: Nov 10-11, 2021

System outputs are now available to download (see below for the link and submission details).

Update, 27 July 12:15 pm UTC: en-de Challenge set source, ref and system outputs updated.

Update, 30 July 8:45 am UTC: additional system outputs added to newstest2021 en-de, en-ru and zh-en.

If you intend to submit to this year's metrics task, please add yourself to this shared spreadsheet as soon as possible.

Metrics Task Overview

This shared task will examine automatic evaluation metrics for machine translation. We will provide you with MT system outputs along with the source text and human reference translations. We are looking for automatic metric scores for translations at both the system level and the segment level. We will calculate the system-level and segment-level correlations of your scores with human judgements.

We invite submissions of reference-free metrics in addition to reference-based metrics.

Goals

The goals of the shared metrics task are:

Changes this year

Recent work demonstrated that WMT DA has low correlation with expert-based human evaluations for WMT2020 English to German and Chinese to English. These findings call into question conclusions drawn on the basis of WMT human evaluation for high-quality MT output. Furthermore, the same paper showed that automatic metrics based on pre-trained embeddings already outperform WMT human ratings on both language pairs. As a consequence, we will integrate the following changes into this year's evaluation campaign:

Task Description

We will provide you with the source sentences, the outputs of machine translation systems, and reference translations.

1. Official results: Correlation with MQM scores on in-domain (news) and out-of-domain data at the sentence and system level on the language pairs:

The inputs will include a selection of MT system submissions to the WMT21 news translation task, online systems, human translations and development systems. We will use Pearson correlation for system-level evaluation and Kendall's Tau for segment-level evaluation (both statistics are sketched after item 3 below).

2. Challenge sets: Accuracy on selecting the better translation on the above language pairs, when comparing high quality translations with MT system outputs that are deliberately corrupted in ways that can be challenging for current automatic metrics.

3. Secondary Evaluation: Correlation with official WMT Direct Assessment (DA) scores at the sentence and system level on the language pairs:

The inputs will include all MT system submissions to the WMT21 news translation task, online systems, and human translations if available. We will use Pearson correlation for system-level evaluation and a Kendall's Tau-like evaluation on 'relative ranking' judgements implied from DA for segment-level evaluation (see the sketch below).
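To make these statistics concrete, below is a minimal sketch (not the official scoring script) of the three computations using scipy. All scores are made up, and the construction of relative-ranking pairs from DA (a minimum score difference of 25 points, with metric ties counted as discordant) follows past WMT metrics task practice and should be treated as an assumption here.

  # Minimal sketch of the correlation statistics; all numbers are made up.
  from scipy.stats import pearsonr, kendalltau

  # System-level: Pearson correlation between metric and human system scores.
  metric_sys = [0.71, 0.65, 0.80]   # hypothetical metric scores, one per MT system
  human_sys = [68.2, 61.5, 75.0]    # hypothetical human scores (e.g. MQM or DA)
  print("Pearson r:", pearsonr(metric_sys, human_sys)[0])

  # Segment-level (official, MQM): Kendall's Tau between metric and human segment scores.
  metric_seg = [0.20, 0.55, 0.90, 0.10]
  human_seg = [55.0, 70.0, 92.0, 40.0]
  print("Kendall tau:", kendalltau(metric_seg, human_seg)[0])

  # Segment-level (secondary, DA): Kendall's Tau-like score over relative-ranking
  # pairs implied from DA. Assumption: a pair of translations of the same source
  # segment forms a judgement when the DA scores differ by at least 25 points,
  # and a metric tie counts as discordant (past WMT practice, not verified here).
  def tau_like(pairs):
      # pairs: (metric score of the DA-better translation, metric score of the DA-worse one)
      concordant = sum(1 for better, worse in pairs if better > worse)
      discordant = len(pairs) - concordant
      return (concordant - discordant) / (concordant + discordant)

  print("Tau-like:", tau_like([(0.9, 0.4), (0.3, 0.5), (0.7, 0.7)]))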

Paper Describing Your Metric

You are invited to submit a short paper (4 to 6 pages) to WMT describing your automatic evaluation metric. Information on how to submit is available here. Shared task submission description papers are non-archival, and you are not required to submit a paper if you do not want to. If you do not submit a paper, we ask that you provide an appropriate reference describing your metric that we can cite in the overview paper.

Training Data

You may want to use some of the following data to tune or train your metric:

MQM (Multidimensional Quality Metrics) Framework Development/Training Data

WMT20 en-de, zh-en: https://github.com/google/wmt-mqm-human-evaluation

The MQM dataset contains segment scores, as well as annotations on the category of error and error severity. There are two different file formats for MQM:
DA (Direct Assessment) Development/Training Data

For system-level, see the results from the previous years:

For segment-level, the following datasets are available:

Each DA dataset (WMT15/WMT16) contains:

RR (Relative Ranking) from Past Years

Although RR is no longer the manual evaluation employed in the metrics task, human judgements from previous years' data sets may still prove useful:

For this year's submission, you can use any past year's data to tune your metric's free parameters (if it has any). Additionally, you can use any past data as a test set to compare the performance of your metric against the published results of past years' metrics task participants.

Last year's data contains all of the systems' translations, the source documents, the human reference translations, and the human judgements of translation quality.

 

Test Sets (Evaluation Data)

WMT21 metrics task test sets are now available.

Update Aug 30, 2021: (1) added ref-C and ref-D for newstest2021 EnDe, (2) added ref-B for TED ZhEn

There are three subsets of outputs that we need you to evaluate:

newstest2021 and florestest2021
These contain source sentences translated as part of the WMT News translation task.
tedtalks
These are additional sets of sentences translated by WMT21 translation systems in the TED talks domain.
challengeset
These are synthetic outputs generated specifically to challenge automatic metrics.

These test sets are available in plain-text format. We also provide metadata that specifies the document ID of each line in the corresponding source, reference, and system-output files.
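As an illustration of the line-aligned layout, here is a minimal reading sketch; the file names are hypothetical placeholders, not the actual paths in the shared folder.

  from pathlib import Path

  def read_lines(path):
      # One segment per line, UTF-8 plain text.
      return Path(path).read_text(encoding="utf-8").splitlines()

  # Hypothetical file names; substitute the actual paths from the shared data folder.
  src = read_lines("newstest2021.en-de.src.en")
  ref = read_lines("newstest2021.en-de.ref.ref-A.de")
  hyp = read_lines("newstest2021.en-de.hyp.sys1.de")

  # The files are line-aligned: line i of each file belongs to the same segment,
  # and the metadata gives the document ID for each line.
  assert len(src) == len(ref) == len(hyp)
  for seg_num, (s, r, h) in enumerate(zip(src, ref, hyp), start=1):
      pass  # score segment seg_num with your metric here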

Submission Format

Your software should produce scores for the translations at the system level or the segment level (or preferably both). This year, we no longer include document-level evaluation.

There are sample metrics (with random scores) available in the validation folder in the
shared data folder.

Output file format for system-level rankings

The output files for system-level rankings should be called YOURMETRIC.sys.score.gz and formatted in the following way:

  <METRIC NAME>   <LANG-PAIR>   <TEST SET>   <REFERENCE>   <SYSTEM-ID>   <SYSTEM LEVEL SCORE>
  

The output files for segment-level scores should be called YOURMETRIC.seg.score.gz and formatted in the following way:

  <METRIC NAME>   <LANG-PAIR>   <TEST SET>   <REFERENCE>   <SYSTEM-ID>   <SEGMENT NUMBER>   <SEGMENT SCORE>
  

Each field should be delimited by a single tab character.

Where:

METRIC NAME is the name of your metric,
LANG-PAIR is the language pair (e.g. en-de),
TEST SET is the name of the test set (e.g. newstest2021),
REFERENCE is the reference used to compute the score (e.g. ref-A), or src for source-based metrics,
SYSTEM-ID is the identifier of the system output (or human translation) being scored,
SEGMENT NUMBER is the line number of the segment within the test set, and
SYSTEM LEVEL SCORE / SEGMENT SCORE is your metric's score.
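Below is a minimal sketch of producing both files in this tab-delimited, gzip-compressed format. The metric name, system IDs and scores are made up, and taking the system-level score as the mean of the segment scores is just one common choice, not a requirement of the task.

  import gzip

  metric = "MYMETRIC"   # hypothetical metric name
  langpair, testset, reference = "en-de", "newstest2021", "ref-A"
  seg_scores = {"sys1": [0.71, 0.64, 0.90], "sys2": [0.55, 0.62, 0.47]}  # made-up scores

  with gzip.open(f"{metric}.seg.score.gz", "wt", encoding="utf-8") as seg_out, \
       gzip.open(f"{metric}.sys.score.gz", "wt", encoding="utf-8") as sys_out:
      for system, scores in seg_scores.items():
          for seg_num, score in enumerate(scores, start=1):   # segment numbering assumed 1-based
              seg_out.write("\t".join([metric, langpair, testset, reference,
                                       system, str(seg_num), f"{score:.6f}"]) + "\n")
          sys_score = sum(scores) / len(scores)   # assumption: mean of segment scores
          sys_out.write("\t".join([metric, langpair, testset, reference,
                                   system, f"{sys_score:.6f}"]) + "\n")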

Additional References

The newstest2021 test set contains additional independent references for the language pairs cs-en, de-en, en-de, en-ru, en-zh, ru-en and zh-en. We would like scores for every MT system for each set of references. Note that we do not collect metric scores using multiple references this year.

Evaluating Human Translations

For language pairs with two references, these reference translations are included in the system-outputs folder to be evaluated alongside the MT systems.

Source-based metrics

Source-based metrics should score all human translations in addition to the MT systems. Here is a toy example of a language pair with N systems and 2 references. The metric.sys.score file would contain:

 
  src-metric    langpair    newstest2021    src    sys1      score  
  ...
  src-metric    langpair    newstest2021    src    sysN      score   
  src-metric    langpair    newstest2021    src    ref-A    score 
  src-metric    langpair    newstest2021    src    ref-B    score  
  
Reference-based metrics

For language pairs with two references available, you will need to score the MT systems against each set of references individually, and then score each reference against the other. The metric.sys.score file would contain:

 
#score the MT systems with ref-A
  ref-metric    en-zh    newstest2021    ref-A    sys1    score  
  ...
  ref-metric    en-zh    newstest2021    ref-A    sysN    score  

#score the MT systems with ref-B
  ref-metric    en-zh    newstest2021    ref-B    sys1    score
  ...
  ref-metric    en-zh    newstest2021    ref-B    sysN    score  
 
#score ref-B against ref-A  
  ref-metric    en-zh    newstest2021    ref-A    ref-B   score 

#score ref-A against ref-B   
  ref-metric    en-zh    newstest2021    ref-B    ref-A   score 
 
  

The metric.seg.score file would, likewise, contain the segment scores for systems and human translations included in the metric.sys.score file.
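For clarity, the full set of scoring runs for a two-reference language pair can be enumerated as in the sketch below; the system and reference IDs are placeholders.

  systems = ["sys1", "sys2", "sysN"]   # hypothetical system IDs
  references = ["ref-A", "ref-B"]

  runs = []
  # Score every MT system against each reference individually.
  for ref in references:
      for system in systems:
          runs.append((ref, system))
  # Then score each reference against the other one.
  runs.append(("ref-A", "ref-B"))   # ref-B scored with ref-A as the reference
  runs.append(("ref-B", "ref-A"))   # ref-A scored with ref-B as the reference

  for ref, candidate in runs:
      pass  # compute metric(candidate, reference=ref) and write one sys/seg score line per run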

How to submit

Before you submit, please run your score files through the validation script, which is now available along with the data in the shared folder.

Note that the English to German data was updated on 27 July at around 12:15 pm UTC, and additional system outputs were added to newstest2021 en-de, en-ru and zh-en on 30 July at 8:45 am UTC.

Please add yourself to this shared spreadsheet as soon as possible so we can keep track of your submissions. Submissions should be sent to wmt.metrics@gmail.com with the subject "WMT Metrics submission". You may submit multiple metrics, but you must indicate the primary metric in the email. If submitting more than one metric, please share a folder containing all of your metrics, for example on Google Drive or Dropbox.

Before August 6 (AoE), please send us an email with:

Metrics Task Organizers

Markus Freitag (Google Research)
Ricardo Rei (Unbabel)
Nitika Mathur (University of Melbourne)
Chi-kiu (Jackie) Lo (NRC Canada)
George Foster (Google Research)
Craig Stewart (Unbabel)
Alon Lavie (Unbabel)
Ondřej Bojar (Charles University)