Metrics Task - ACL 2020 Fifth Conference on Machine Translation (WMT20)

Shared Task: Metrics

Metrics Task Important Dates

System outputs ready to download	~~July 22nd~~ Aug 23rd, 2020 (updated Sept 18th, 2020)
Start of manual evaluation period	TBD 2020
Paper submission deadline	Aug 15th, 2020 (indeed earlier than the final submission of your scores)
Submission deadline for metrics task	~~Sept 15th~~ Sept 30th, 2020 (AoE)
End of manual evaluation	TBD 2020
Notification of acceptance	Sept 19th, 2020
Camera-ready deadline	Oct 10th, 2020
Conference	Nov 19th—20th, 2020

Metrics Task Overview

This shared task will examine automatic evaluation metrics for machine translation. We will provide you with all of the translations produced in the translation task along with the human reference translations. We are looking for automatic metric scores for translations at the system level, document level, and segment level. For some languages (English to/from German and Czech, and for English to Chinese), segments are paragraphs that can contain multiple sentences. Note that online news text typically has short paragraphs (generally the average for each reference/source is less than 2 sentences). We will calculate the system-level, document-level, and segment(sentence/paragraph)-level correlations of your scores with WMT20 human judgements once the manual evaluation has been completed.

Goals

The goals of the shared metrics task are:

To achieve the strongest correlation with human judgement of translation quality;
To illustrate the suitability of an automatic evaluation metric as a surrogate for human evaluation;
To address problems associated with comparison with a single reference translation;
To move automatic evaluation beyond system-level ranking to finer-grained sentence-level ranking.
NEW in 2020! To move automatic evaluation beyond sentence-level to including context.
NEW in 2020! To analyse the influence of references when evaluating MT systems.
NEW in 2020! To analyse how well metrics evaluate human translations.

Task Description

We will provide you with the source sentences, output of machine translation systems and reference translations for the following language pairs:

English with Chinese, Czech, German, Inuktitut, Khmer, Japanese, Pashto, Polish, Russian and Tamil (both directions)
German with French (both directions)

Additionally, we will run the "QE as a metric" task, where you need to provide the same outputs as standard metrics participants (see below) but you must not make use of the references.

Multiple References

There will be multiple references available for five language pairs. In addition, we have paraphrased references for English-to-German (newstestP2020) (thanks to Marcus Freitag from Google for providing this). We ask that you submit scores for MT systems against each reference individually. If your metric can handle multiple references, please submit an additional set of scores using all available references. This will enable additional analysis on the role of the reference translation.

Evaluating Human Translations

Finally, we want to see how well metrics evaluate human translations. For metrics that compare against the source, we ask you to score the reference translations in addition to MT system outputs. If your metric needs references, then we can only evaluate reference translations when additional references are available. For these language-pairs, we ask that you score each reference against every other available reference.

Sacrebleu-BLEU, when scoring English->German translations using the newstestB2020 reference, ranks 9 MT systems above the newstest2020 translations, and all 14 MT systems above the paraphrased reference translations (which makes sense, as the paraphrased reference is designed not to have word/n-gram overlap with the original reference). When using the newstestP2020 reference, the two WMT references are both ranked above all MT systems. How would your metric score the human reference translations?

More details in the how to submit section at the end of the page.

Assessment

We will assess automatic evaluation metrics in the following ways:

System-level correlation: We will use the absolute Pearson correlation coefficient to measure the correlation of the automatic metric scores with the official human scores as computed in the translation task. Direct Assessment will be the official human evaluation; see last year's results for further details.
Document-level correlation (all language pairs): This year, we are trialling document-level evaluation. We will use the Pearson correlation of your scores with human judgements of translation quality. (We might fall back to Kendall's tau on "relative ranking" implied from direct assessments, as with segment-level evaluation, depending on the data available.)
Segment-level correlation (sentence-level/paragraph-level): For "Direct Assessment" judgements, we will use the Pearson correlation of your scores with human judgements of translation quality. (A fallback to Kendall's tau on "relative ranking" implied from direct assessments may be necessary for some language pairs, as done in 2018 and 2019.)

Paragraph-level:
English to/from German
English to/from Czech
English to Chinese

Sentence level:
English to/from Inuktitut, Khmer, Japanese, Pashto, Polish, Russian, and Tamil
German to/from French
Chinese to English
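
As a rough illustration of the two statistics named above, here is a minimal Python sketch. It is not the official evaluation script. The score lists are made-up values, the function names are ours, and the Kendall's tau-like formula (concordant minus discordant pairs, divided by their sum, with metric ties counted as discordant) is our assumption about how the "relative ranking" fallback is computed.

```python
import math
from typing import Sequence, Tuple


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


def kendall_tau_like(pairs: Sequence[Tuple[float, float]]) -> float:
    """Agreement of a metric with relative-ranking pairs.

    Each pair holds the metric scores (better, worse) of two translations of
    the same source, ordered so that the human judgement preferred the first.
    Assumption: a tie in metric scores counts as a discordant pair.
    """
    if not pairs:
        raise ValueError("need at least one pair")
    concordant = sum(1 for better, worse in pairs if better > worse)
    discordant = len(pairs) - concordant
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical system-level scores: one entry per MT system.
metric_scores = [0.31, 0.28, 0.35, 0.22]
human_scores = [0.10, -0.05, 0.21, -0.30]
print(abs(pearson(metric_scores, human_scores)))

# Hypothetical segment-level pairs: three concordant, one discordant.
print(kendall_tau_like([(0.7, 0.4), (0.2, 0.5), (0.9, 0.1), (0.6, 0.3)]))  # 0.5
```

Taking the absolute value of the Pearson coefficient means that a metric where lower scores are better (an error rate, for example) is not penalised for the sign of its correlation.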

Other Requirements

If you participate in the metrics task, we ask you to commit about 8 hours of time to do the manual evaluation. You are also invited to submit a paper describing your metric. We would also like a paragraph describing your metric to include in the metrics task results paper.

Manual Evaluation

The evaluation will be done with an online tool; details will be posted here.

Paper Describing Your Metric

You are invited to submit a short paper (4 to 6 pages) describing your automatic evaluation metric. You are not required to submit a paper if you do not want to. If you don't, we ask that you give an appropriate reference describing your metric that we can cite in the overview paper.
Note that shared task submission description papers are non-archival.

Training Data

You may want to use some of the following data to tune or train your metric.

DA (Direct Assessment) Development/Training Data

For system-level, see the results from the previous years:

For segment-level, the following datasets are available:

WMT19: http://www.statmt.org/wmt19/results.html
WMT18: http://www.statmt.org/wmt18/results.html
WMT17: http://www.statmt.org/wmt17/results.html
DAseg-wmt-newstest2016.tar.gz: 7 language pairs (sampled from newstest2016, tr-en fi-en cs-en ro-en ru-en en-ru de-en; always 560 sentence pairs)
DAseg-wmt-newstest2015.tar.gz: 5 language pairs (sampled from newstest2015, en-ru de-en ru-en fi-en cs-en; always 500 sentence pairs)

Each dataset contains:

the source sentence
MT output (blind, no identification of the actual system that produced it)
the reference translation
human score (a real number between -Inf and +Inf)

RR (Relative Ranking) from Past Years

Although RR is no longer the manual evaluation employed in the metrics task, human judgments from previous years' data sets may still prove useful:

WMT16: http://www.statmt.org/wmt16/results.html
WMT15: http://www.statmt.org/wmt15/results.html
WMT14: http://www.statmt.org/wmt14/results.html
WMT13: http://www.statmt.org/wmt13/results.html
WMT12: http://www.statmt.org/wmt12/results.html
WMT11: http://www.statmt.org/wmt11/results.html
WMT10: http://www.statmt.org/wmt10/results.html
WMT09: http://www.statmt.org/wmt09/results.html
WMT08: http://www.statmt.org/wmt08/results.html

You can use any past year's data to tune your metric's free parameters, if it has any, for this year's submission. Additionally, you can use any past data as a test set to compare the performance of your metric against published results from past years' metric participants.

Last year's data contains all of the systems' translations, the source documents, the human reference translations, and the human judgments of translation quality.

Test Sets (Evaluation Data)

WMT20 metrics task test sets are now available; apologies for the delay.

There are two subsets of outputs that we would like you to evaluate:

newstest2020: This is the very basis of the metrics task, with source sentences translated as part of the WMT News translation task.
testsuites2020: These are the additional sets of sentences translated by WMT20 translation systems to allow detailed inspection of the systems' (linguistic) properties. No manual evaluations will be collected for these translations, but your automatic scoring will help the testsuite authors to interpret the performance of MT systems on their testsuite. We would like you to score these. We have filtered out testsuite segments where the reference is not available, so the number of segments is much smaller than last year.

These testsets are available both in sgm and txt format. The txt format comes with an additional set of files that specifies the document id of each line in the corresponding source/ref/systemoutput.
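
The following sketch shows one way to pair each line of a plain text file with its IDs from a details file. It assumes that the details files are tab-separated with the fields line number, test set, language pair, document ID and segment ID (the layout given under Submission Format). The function names and sample rows are hypothetical.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class SegmentInfo(NamedTuple):
    testset: str
    lang_pair: str
    document_id: str
    segment_id: str


def parse_details(lines: Iterable[str]) -> Dict[int, SegmentInfo]:
    """Map each 1-based line number to its test set, language pair and IDs."""
    details = {}
    for raw in lines:
        if not raw.strip():
            continue  # skip blank lines
        fields = raw.rstrip("\n").split("\t")
        if len(fields) != 5:
            raise ValueError(f"expected 5 tab-separated fields: {raw!r}")
        line_number, testset, lang_pair, document_id, segment_id = fields
        details[int(line_number)] = SegmentInfo(
            testset, lang_pair, document_id, segment_id
        )
    return details


def attach_ids(
    outputs: List[str], details: Dict[int, SegmentInfo]
) -> List[Tuple[SegmentInfo, str]]:
    """Pair every output line with its IDs; line numbers start from 1."""
    return [(details[i], text) for i, text in enumerate(outputs, start=1)]


# Hypothetical details rows and system outputs for two segments of one document.
sample_details = [
    "1\tnewstest2020\tcs-en\tnovinky.cz.121062\t1\n",
    "2\tnewstest2020\tcs-en\tnovinky.cz.121062\t2\n",
]
sample_outputs = ["First translated segment.", "Second translated segment."]
for info, text in attach_ids(sample_outputs, parse_details(sample_details)):
    print(info.document_id, info.segment_id, text)
```

If the real files use a different delimiter, only the `split("\t")` call needs to change.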

You can download the following files from the google drive folder (newstest2020 and testsuites2020, both plain text and sgm)

Combined dataset: wmt20metricsdata.tar.gz (88.8MB; includes all folders described below)

Individual datasets:

newstest2020txt.tar.gz (27.9MB; newstest2020 excluding testsuites, plain text)
newstest2020sgm.tar.gz (29.7MB; newstest2020 excluding testsuites, SGML format)
testsuites2020txt.tar.gz (16.1MB; only testsuites, plain text)
testsuites2020sgm.tar.gz (15MB; only testsuites, SGML format)

Please contact Nitika if there are any issues with the data.

Here are bash scripts that you may want to run around your scorer to process everything

Submission Format

Note: updated on 27th August

Your software should produce scores for the translations either at the system level or the segment level (or preferably both). This year, we are also trialling document-level evaluation.

Output file format for system-level rankings

Since we assume that your metrics are mostly simple arithmetic averages of segment-level scores, your system-level outputs serve primarily as a sanity check that we get exactly the same averages.
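
For metrics of this kind, the averaging can be sketched as follows. This is only an illustration. The row layout (language pair, test set, reference set, system ID, segment score) and the function name are our assumptions, and the scores are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (lang_pair, testset, reference, system_id) identifies one system-level score.
SystemKey = Tuple[str, str, str, str]


def average_segment_scores(
    seg_rows: Iterable[Tuple[str, str, str, str, float]]
) -> Dict[SystemKey, float]:
    """Average segment-level scores into one score per system and reference set."""
    totals: Dict[SystemKey, float] = defaultdict(float)
    counts: Dict[SystemKey, int] = defaultdict(int)
    for lang_pair, testset, reference, system_id, score in seg_rows:
        key = (lang_pair, testset, reference, system_id)
        totals[key] += score
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Hypothetical segment-level scores for two systems.
rows = [
    ("de-en", "newstest2020", "newstest2020", "sys1.0", 0.25),
    ("de-en", "newstest2020", "newstest2020", "sys1.0", 0.75),
    ("de-en", "newstest2020", "newstest2020", "sys2.0", 0.40),
]
for key, score in average_segment_scores(rows).items():
    print("\t".join(key), score)
```

If your system-level score is not a plain average (a corpus-level statistic, for example), the sanity check does not apply and your own system-level scores are what matters.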

Note: we do not require the fields ENSEMBLE and AVAILABLE this year, as we ask you to enter these details of your metric in the shared spreadsheet. Please ensure that you have filled it.

The output files for system-level rankings should be called YOURMETRIC.sys.score.gz and formatted in the following way:

  <METRIC NAME>   <LANG-PAIR>   <TESTSET>   <REFERENCE>   <SYSTEM-ID>   <SYSTEM LEVEL SCORE>

The output files for document-level scores should be called YOURMETRIC.doc.score.gz and formatted in the following way:

  <METRIC NAME>   <LANG-PAIR>   <TESTSET>   <REFERENCE>   <SYSTEM-ID>   <DOCUMENT-ID>   <DOCUMENT SCORE>

The output files for segment-level (both sentence and paragraph level) scores should be called YOURMETRIC.seg.score.gz and formatted in the following way:

  <METRIC NAME>   <LANG-PAIR>   <TESTSET>   <REFERENCE>   <SYSTEM-ID>   <DOCUMENT-ID>   <SEGMENT-ID>   <SEGMENT SCORE>

Each field should be delimited by a single tab character.

Where:

METRIC NAME is the name of your automatic evaluation metric.
LANG-PAIR is the language pair using two letter abbreviations for the languages (de-en for German-English, for example).
TESTSET is the ID of the test set (newstest2020 or testsuites2020).
REFERENCE is the ID of the reference set (newstest2020, newstestB2020, newstestP2020, newstestM2020 or testsuites2020).

SYSTEM-ID is the ID of the system being scored (given by the part of the filename for the plain text file; uedin-syntax.3866, for example).
DOCUMENT-ID is the ID of the document that the current segment belongs to (found in details/langpair.txt if using plaintext files; novinky.cz.121062, for example).
SEGMENT-ID is the ID of the segment being scored (found in details/langpair.txt if using plaintext files).
LINE NUMBER is the line number, starting from 1, of the plain text input files.

SYSTEM LEVEL SCORE is the score your metric predicts for the particular system.
DOCUMENT SCORE is the score your metric predicts for the particular document.
SEGMENT SCORE is the score your metric predicts for the particular segment.
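
A minimal sketch of writing a segment-level file in this layout is shown below. It uses only Python's standard gzip module. The metric name and scores are placeholders, and the helper name is ours.

```python
import gzip
import os
import tempfile
from typing import Iterable, Sequence


def write_score_file(path: str, rows: Iterable[Sequence[object]]) -> None:
    """Write rows as tab-delimited lines into a gzip-compressed text file."""
    with gzip.open(path, "wt", encoding="utf-8") as out:
        for row in rows:
            fields = [str(field) for field in row]
            # A stray tab or newline inside a field would corrupt the columns.
            if any("\t" in field or "\n" in field for field in fields):
                raise ValueError(f"field contains a tab or newline: {row!r}")
            out.write("\t".join(fields) + "\n")


# Columns: METRIC NAME, LANG-PAIR, TESTSET, REFERENCE, SYSTEM-ID,
# DOCUMENT-ID, SEGMENT-ID, SEGMENT SCORE (the scores here are placeholders).
rows = [
    ("mymetric", "cs-en", "newstest2020", "newstest2020",
     "uedin-syntax.3866", "novinky.cz.121062", 1, 0.4217),
    ("mymetric", "cs-en", "newstest2020", "newstest2020",
     "uedin-syntax.3866", "novinky.cz.121062", 2, 0.3885),
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "mymetric.seg.score.gz")
    write_score_file(path, rows)
    with gzip.open(path, "rt", encoding="utf-8") as check:
        print(check.read(), end="")
```

The same helper can write the document-level and system-level files by passing rows with fewer columns.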

If using plaintext files, you will find an additional folder called 'details' under the newstest2020/txt and testsuites2020/txt folders. This contains a file for each language pair, which has the corresponding doc and seg IDs of each line in the source/reference/system-output. The file format of a details file is:

  <LINE NUMBER>   <TESTSET>  <LANG-PAIR>   <DOCUMENT-ID>   <SEGMENT-ID>

Additional References

The newstest2020 testset contains additional independent references for language-pairs de-en, en-de, en-zh, ru-en, zh-en. We also have paraphrased references for en-de. We would like scores for every MT system for each set of references (using the values newstest2020, newstestB2020 or newstestP2020 for the reference set column of submitted files). Finally, we would also like a set of scores when evaluating against ALL references for the language-pair, denoted with the refset newstestM2020.

Evaluating Human Translations

This year, we would also like to see how metrics evaluate human translations for the News task, so we have included human translations in the system-outputs folder, with the same filename format as system translations. For example, newstest2020-deen-ref.de.txt is renamed to newstest2020.de-en.Human-A.0.txt, newstestB2020-deen-ref.de.txt to newstest2020.de-en.Human-B.0.txt and newstestP2020-deen-ref.de.txt to newstest2020.de-en.Human-P.0.txt.
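
If you need to recover the test set, language pair and system ID from such filenames, something along these lines may help. The pattern is inferred from the filenames quoted above and is an assumption rather than a specification. The helper names are ours, and the uedin-syntax filename is a made-up example.

```python
import re
from typing import NamedTuple


class OutputFile(NamedTuple):
    testset: str
    lang_pair: str
    system_id: str


# Assumed layout: TESTSET.SRC-TGT.SYSTEM-ID.txt, where the system ID may
# itself contain dots (Human-A.0, for example).
_FILENAME = re.compile(
    r"^(?P<testset>[^.]+)\.(?P<lang_pair>[a-z]{2}-[a-z]{2})\.(?P<system_id>.+)\.txt$"
)


def parse_output_filename(name: str) -> OutputFile:
    """Split a system-output filename into test set, language pair, system ID."""
    match = _FILENAME.match(name)
    if match is None:
        raise ValueError(f"unexpected filename: {name!r}")
    return OutputFile(**match.groupdict())


def is_human(system_id: str) -> bool:
    """Human translations are stored under system IDs starting with 'Human-'."""
    return system_id.startswith("Human-")


for name in ("newstest2020.de-en.Human-A.0.txt",
             "newstest2020.de-en.uedin-syntax.3866.txt"):
    parsed = parse_output_filename(name)
    print(parsed.system_id, is_human(parsed.system_id))
```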

Source-based metrics

Source-based metrics should score all human translations in addition to the MT systems. Here is a toy example of a langpair where we have N systems and 3 references. The metric.sys.score file would contain:

 
  src-metric    langpair    newstest2020    newstest2020    sys1.0      score  
  ...
  src-metric    langpair    newstest2020    newstest2020    sysN.0      score   
  src-metric    langpair    newstest2020    newstest2020    Human-A.0   score 
  src-metric    langpair    newstest2020    newstest2020    Human-B.0   score  
  src-metric    langpair    newstest2020    newstest2020    Human-P.0   score

Note that source-based metrics do not need to score testsuites. For reference-based metrics, scoring testsuites is optional (but encouraged).

Reference-based metrics

For language pairs with only a single reference, reference-based metrics need to score only the MT systems.

For de-en, en-zh, ru-en, and zh-en, we have two references available (newstest2020 and newstestB2020). You will need to score the MT systems evaluated against both sets of references individually, and then combined. Then score each reference against the other. The metric.sys.score file would contain:

 
#score the MT systems with newstest2020
  ref-metric    en-zh    newstest2020    newstest2020    sys1.0    score  
  ...
  ref-metric    en-zh    newstest2020    newstest2020    sysN.0    score  

#score the MT systems with newstestB2020
  ref-metric    en-zh    newstest2020    newstestB2020    sys1.0    score
  ..
  ref-metric    en-zh    newstest2020    newstestB2020    sysN.0    score  

#score the MT systems with multiple references  (newstest2020 and newstestB2020)
  ref-metric    en-zh    newstest2020    newstestM2020    sys1.0    score 
  ..
  ref-metric    en-zh    newstest2020    newstestM2020    sysN.0    score

#score Human-B.0 system [newstestB2020 reference]  against newstest2020  
  ref-metric    en-zh    newstest2020    newstest2020    Human-B.0   score 

#score Human-A.0 system [newstest2020 reference] against newstestB2020   
  ref-metric    en-zh    newstest2020    newstestB2020    Human-A.0   score 

#score testsuites translations of the MT systems against testsuites2020 reference  (testsuites have only a single reference, so no scoring human translations here)
  ref-metric    en-zh    testsuites2020    testsuites2020    sys1.0    score 
  ..
  ref-metric    en-zh    testsuites2020    testsuites2020    sysN.0    score

We have three references for en-de. The metric.sys.score file would contain:

 
#insert MT system scores here against each reference individually  (newstest2020, newstestB2020 and newstestP2020) and combined (newstestM2020)
 
#score Human-B.0 and Human-P.0 systems against newstest2020  reference
  ref-metric    en-de    newstest2020    newstest2020    Human-B.0   score 
  ref-metric    en-de    newstest2020    newstest2020    Human-P.0   score   

#score Human-A.0 and Human-P.0 systems against newstestB2020 reference  
  ref-metric    en-de    newstest2020    newstestB2020    Human-A.0   score 
  ref-metric    en-de    newstest2020    newstestB2020    Human-P.0   score 

#score Human-A.0 and Human-B.0 systems against newstestP2020 reference  
  ref-metric    en-de    newstest2020    newstestP2020    Human-A.0   score 
  ref-metric    en-de    newstest2020    newstestP2020    Human-B.0   score 

#score Human-A.0  system against combined newstestB2020 and newstestP2020 reference  
  ref-metric    en-de    newstest2020    newstestM2020    Human-A.0   score 

#score Human-B.0  system against combined newstest2020 and newstestP2020 reference  
  ref-metric    en-de    newstest2020    newstestM2020    Human-B.0   score 

#score Human-P.0  system against combined newstest2020 and newstestB2020 reference  
  ref-metric    en-de    newstest2020    newstestM2020    Human-P.0   score 
  
#insert testsuite scores of MT systems here

The metric.doc.score and metric.seg.score files would, likewise, contain the document and segment scores for systems and human translations included in the metric.sys.score file.
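
To double-check that no combination is missing, the newstest2020 rows required from a reference-based metric can be enumerated with a short script. This is a sketch of our reading of the examples above, not a replacement for the validation script. The function name and the mapping from reference set to Human system ID are assumptions based on those examples, and testsuite rows are left out.

```python
from typing import Dict, List, Tuple


def reference_based_rows(
    testset: str,
    mt_systems: List[str],
    references: Dict[str, str],
    combined_id: str = "newstestM2020",
) -> List[Tuple[str, str, str]]:
    """List the (testset, reference set, system) combinations to score.

    `references` maps a reference-set ID to the system ID under which the
    same human translation appears in the system-outputs folder.
    """
    rows: List[Tuple[str, str, str]] = []
    # MT systems against every reference individually.
    for ref_id in references:
        rows.extend((testset, ref_id, system) for system in mt_systems)
    if len(references) > 1:
        # MT systems against all references combined.
        rows.extend((testset, combined_id, system) for system in mt_systems)
        # Each human translation against every other single reference.
        for ref_id in references:
            rows.extend(
                (testset, ref_id, human)
                for other, human in references.items()
                if other != ref_id
            )
    if len(references) > 2:
        # Each human translation against the remaining references combined.
        rows.extend((testset, combined_id, human) for human in references.values())
    return rows


en_de_refs = {
    "newstest2020": "Human-A.0",
    "newstestB2020": "Human-B.0",
    "newstestP2020": "Human-P.0",
}
for row in reference_based_rows("newstest2020", ["sys1.0", "sys2.0"], en_de_refs):
    print("\t".join(row))
```

With two MT systems this prints 17 rows for en-de (three references) and would give 8 rows for a pair with two references.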

How to submit

Before you submit, please run your scores files through a validation script, found here. The folder contains sample submissions for source-based and reference-based metrics (with random scores), and the script compares your metric with the sample.

Submissions should be sent as an e-mail to wmt-metrics-submissions@googlegroups.com with the subject "WMT Metrics submission".

If the document-level score of your metric is an average of segment-level scores, you do not have to include metric.doc.score.

As a sanity check, please add yourself to this shared spreadsheet.

In case the above e-mail doesn't work for you (Google seems to prevent non-member postings even though we have set it to allow them), please contact us directly.

Metrics Task Organizers

Nitika Mathur (University of Melbourne)
Qingsong Ma (Tencent Inc.)
Johnny Wei (University of Southern California)
Ondřej Bojar (Charles University)