| Event | Date |
|---|---|
| System outputs ready to download | |
| Start of manual evaluation period | TBD 2020 |
| Paper submission deadline | Aug 15th, 2020 (note: earlier than the final submission of your scores) |
| Submission deadline for metrics task | |
| End of manual evaluation | TBD 2020 |
| Notification of acceptance | Sept 19th, 2020 |
| Camera-ready deadline | Oct 10th, 2020 |
| Conference | Nov 19th-20th, 2020 |
This shared task will examine automatic evaluation metrics for machine translation. We will provide you with all of the translations produced in the translation task, along with the human reference translations. We are looking for automatic metric scores for translations at the system level, document level, and segment level. For some languages (English to/from German and Czech, and English to Chinese), segments are paragraphs that can contain multiple sentences. Note that online news text typically has short paragraphs (on average, fewer than two sentences per reference/source). We will calculate the system-level, document-level, and segment-level (sentence/paragraph) correlations of your scores with the WMT20 human judgements once the manual evaluation has been completed.
The goals of the shared metrics task are:
We will provide you with the source sentences, the output of machine translation systems, and reference translations for the following language pairs:
Additionally, we will run the "QE as a metric" task, in which you must provide the same outputs as standard metrics participants (see below) but without making use of the references.
There will be multiple references available for five language pairs. In addition, we have paraphrased references for English-to-German (newstest2020P) (thanks to Markus Freitag from Google for providing this). We ask that you submit scores for MT systems against each reference individually. If your metric can handle multiple references, please submit an additional set of scores using all available references. This will enable additional analysis of the role of the reference translation.
Finally, we want to see how well metrics evaluate human translations. For metrics that compare against the source, we ask you to score the reference translations in addition to MT system outputs. If your metric needs references, then we can only evaluate reference translations when additional references are available. For these language-pairs, we ask that you score each reference against every other available reference.
Sacrebleu-BLEU, when scoring English->German translations using the newstestB2020 reference, ranks 9 MT systems above the newstest2020 translations, and all 14 MT systems above the paraphrased reference translations (which makes sense, as the paraphrased reference is designed not to have word/n-gram overlap with the original reference). When using the newstestP2020 reference, the two WMT references are both ranked above all MT systems. How would your metric score the human reference translations? More details in the "how to submit" section at the end of the page.
We will assess automatic evaluation metrics in the following ways:
System-level correlation: We will use absolute Pearson correlation coefficient to measure the correlation of the automatic metric scores with official human scores as computed in the translation task. Direct Assessment will be the official human evaluation, see last year's results for further details.
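As an illustration, the absolute Pearson correlation between metric scores and human scores can be computed as below. This is a minimal sketch with made-up numbers, not the official evaluation script:

```python
import numpy as np

def system_level_correlation(metric_scores, human_scores):
    """Absolute Pearson correlation between metric and human scores.

    Both arguments are sequences of per-system scores, in the same
    system order.
    """
    r = np.corrcoef(metric_scores, human_scores)[0, 1]
    return abs(r)

# Toy example: three MT systems scored by a metric and by human DA.
print(system_level_correlation([0.30, 0.55, 0.70], [62.1, 70.4, 75.8]))
```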
Document-level correlation (all language pairs): This year, we are trialling document-level evaluation. We will use the Pearson correlation of your scores with human judgements of translation quality. (We might fall back to Kendall's tau on "relative ranking" implied from direct assessments, as with segment-level evaluation, depending on the data available.)
Segment-level correlation (sentence-level/paragraph-level):
"Direct Assessment" will use the Pearson correlation of your scores with human judgements of translation quality. (Fallback to Kendall's tau on "relative ranking" implied from direct assessments may be necessary for some language pairs, as done in 2018 and 2019.)
English to/from German
English to/from Czech
English to Chinese
English to/from Inuktitut, Khmer, Japanese, Pashto, Polish, Russian, and Tamil
German to/from French
Chinese to English
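The "relative ranking" fallback mentioned above can be sketched as follows. This assumes, as in recent years, that a pair of translations counts as a better/worse judgement only when their DA scores differ by at least some threshold (25 points here); the exact threshold and tie handling are up to the organisers:

```python
def darr_kendall_tau(metric_scores, da_scores, threshold=25.0):
    """Kendall's tau-like statistic over relative rankings implied by DA.

    metric_scores, da_scores: per-segment scores for the same translations.
    A pair (i, j) is counted only if the DA scores differ by >= threshold.
    """
    concordant = discordant = 0
    n = len(da_scores)
    for i in range(n):
        for j in range(i + 1, n):
            if abs(da_scores[i] - da_scores[j]) < threshold:
                continue  # no reliable human preference for this pair
            human_prefers_i = da_scores[i] > da_scores[j]
            metric_prefers_i = metric_scores[i] > metric_scores[j]
            if human_prefers_i == metric_prefers_i:
                concordant += 1
            else:
                discordant += 1
    total = concordant + discordant
    return (concordant - discordant) / total if total else 0.0
```

A metric that agrees with every implied human preference scores 1.0; one that contradicts every preference scores -1.0.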
If you participate in the metrics task, we ask you to commit about 8 hours of your time to the manual evaluation. You are also invited to submit a paper describing your metric. In addition, we would like a paragraph describing your metric to include in the metrics task results paper.
The evaluation will be done with an online tool, details will be posted here.
You are invited to submit a short paper (4 to 6 pages) describing your
automatic evaluation metric. Submitting a paper is optional; if you do not
submit one, we ask that you provide an appropriate reference describing your
metric that we can cite in the overview paper.
Note that shared task submission description papers are non-archival.
You may want to use some of the following data to tune or train your metric.
For system-level, see the results from the previous years:
For segment-level, the following datasets are available:
Each dataset contains:
Although relative ranking (RR) is no longer the manual evaluation employed in the metrics task, human judgments from previous years' data sets may still prove useful:
You can use any past year's data to tune your metric's free parameters, if it has any, for this year's submission. Additionally, you can use any past data as a test set to compare your metric's performance against the published results of past years' metrics task participants.
Last year's data contains all of the systems' translations, the source documents, the human reference translations, and the human judgments of translation quality.
The WMT20 metrics task test sets are now available; apologies for the delay.
There are two subsets of outputs that we would like you to evaluate:
These test sets are available in both sgm and txt formats. The txt format comes with an additional set of files that specify the document ID of each line in the corresponding source/reference/system output.
You can download the following files from the Google Drive folder (newstest2020 and testsuites2020, in both plain text and sgm):
Combined dataset: wmt20metricsdata.tar.gz (88.8MB; includes all folders described below)
Here are bash scripts that you may want to wrap around your scorer to process everything.
Note: updated on 27th August
The output of your software should produce scores for the translations at the system level or the segment level (or, preferably, both). This year, we are also trialling document-level evaluation.
Since we assume that your metrics' system-level scores are mostly simple arithmetic averages of segment-level scores, your system-level outputs serve primarily as a sanity check that we obtain exactly the same averages.
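If your system-level score is indeed the plain average of segment-level scores, the check amounts to the following sketch (system names and scores here are made up):

```python
def system_score_from_segments(segment_scores):
    """System-level score as the arithmetic mean of segment-level scores."""
    return sum(segment_scores) / len(segment_scores)

# Hypothetical per-segment scores for two systems.
seg = {"sys1": [0.2, 0.4, 0.6], "sys2": [0.5, 0.5, 0.8]}
sys_scores = {name: system_score_from_segments(s) for name, s in seg.items()}
```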
Note: we do not require the metric-description fields this year, as we ask you to enter these details of your metric in the shared spreadsheet. Please ensure that you have filled it in.
The output files for system-level rankings should be called
YOURMETRIC.sys.score.gz and formatted in the following way:
<METRIC NAME> <LANG-PAIR> <TEST SET> <SYSTEM-ID> <SYSTEM LEVEL SCORE>
The output files for document-level scores should be called
YOURMETRIC.doc.score.gz and formatted in the following way:
<METRIC NAME> <LANG-PAIR> <TESTSET> <REFERENCE> <SYSTEM-ID> <DOCUMENT-ID> <DOCUMENT SCORE>
The output files for segment-level (both sentence and paragraph level) scores should be called
YOURMETRIC.seg.score.gz and formatted in the following way:
<METRIC NAME> <LANG-PAIR> <TESTSET> <REFERENCE> <SYSTEM-ID> <DOCUMENT-ID> <SEGMENT-ID> <SEGMENT SCORE>
Each field should be delimited by a single tab character.
METRIC NAME is the name of your automatic evaluation metric.
LANG-PAIR is the language pair, using two-letter abbreviations for the languages (de-en for German-English, for example).
TEST SET is the ID of the test set (newstest2020 or testsuites2020).
REFERENCE is the ID of the reference set (newstest2020, newstestB2020, newstestP2020, or newstestM2020).
SYSTEM-ID is the ID of the system being scored (given by the corresponding part of the plain text filename).
DOCUMENT-ID is the ID of the document that the current segment belongs to (found in details/langpair.txt if using plain text files).
SEGMENT-ID is the ID of the segment being scored (found in details/langpair.txt if using plain text files).
LINE NUMBER is the line number, starting from 1, of the plain text input files.
SYSTEM SCORE is the score your metric predicts for the particular system.
DOCUMENT SCORE is the score your metric predicts for the particular document.
SEGMENT SCORE is the score your metric predicts for the particular segment.
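Putting the format together, a segment-level submission file can be written as gzipped, tab-delimited lines like this. This is a sketch only; the metric name, system ID, and score are made up:

```python
import gzip

def write_seg_scores(path, rows):
    """Write segment-level rows as gzipped, tab-delimited text.

    Each row: (metric, lang_pair, test_set, reference, system_id,
               document_id, segment_id, score).
    """
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for row in rows:
            f.write("\t".join(str(field) for field in row) + "\n")

rows = [
    ("MYMETRIC", "de-en", "newstest2020", "newstest2020",
     "hypothetical-sys.0", "doc1", "1", 0.73),
]
write_seg_scores("MYMETRIC.seg.score.gz", rows)
```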
If using plain text files, you will find an additional folder called 'details' under the newstest2020/txt and testsuites2020/txt folders. This contains a file for each language pair, which gives the corresponding document and segment IDs of each line in the source/reference/system output. The file format of a details file is:
<LINE NUMBER> <TESTSET> <LANG-PAIR> <DOCUMENT-ID> <SEGMENT-ID>
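Since each score row needs a DOCUMENT-ID and SEGMENT-ID for every line of plain text, a small parser for the details files can be sketched as follows (column order as given above; the example values are illustrative):

```python
def parse_details(lines):
    """Map line number -> (testset, lang_pair, document_id, segment_id).

    Each input line holds LINE-NUMBER, TESTSET, LANG-PAIR, DOCUMENT-ID
    and SEGMENT-ID, separated by single tab characters.
    """
    mapping = {}
    for line in lines:
        line_no, testset, lang_pair, doc_id, seg_id = (
            line.rstrip("\n").split("\t")
        )
        mapping[int(line_no)] = (testset, lang_pair, doc_id, seg_id)
    return mapping

example = ["1\tnewstest2020\tde-en\tdoc-abc\t1"]
print(parse_details(example)[1])  # → ('newstest2020', 'de-en', 'doc-abc', '1')
```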
The newstest2020 testset contains additional independent references for language-pairs de-en, en-de, en-zh, ru-en, zh-en. We also have paraphrased references for en-de. We would like scores for every MT system for each set of references (using the values newstest2020, newstestB2020 or newstestP2020 for the reference set column of submitted files). Finally, we would also like a set of scores when evaluating against ALL references for the language-pair, denoted with the refset newstestM2020.
This year, we would also like to see how metrics evaluate human translations for the News task, so we have included human translations in the system-outputs folder, with the same filename format as system translations. For example, newstest2020-deen-ref.de.txt is renamed to newstest2020.de-en.Human-A.0.txt, newstestB2020-deen-ref.de.txt to newstest2020.de-en.Human-B.0.txt, and newstestP2020-deen-ref.de.txt to newstest2020.de-en.Human-P.0.txt.
Source-based metrics should score all human translations in addition to the MT systems. Here is a toy example of a langpair where we have N systems and 3 references. The metric.sys.score file would contain:
src-metric langpair newstest2020 newstest2020 sys1.0 score
...
src-metric langpair newstest2020 newstest2020 sysN.0 score
src-metric langpair newstest2020 newstest2020 Human-A.0 score
src-metric langpair newstest2020 newstest2020 Human-B.0 score
src-metric langpair newstest2020 newstest2020 Human-P.0 score

Note: source-based metrics do not need to score test suites. For reference-based metrics, scoring test suites is optional (but encouraged).
For language pairs with only a single reference, reference-based metrics need to score only the MT systems.
For de-en, en-zh, ru-en, and zh-en, we have two references available (newstest2020 and newstestB2020). You will need to score the MT systems against each set of references individually, and then against both combined. Then score each reference against the other. The metric.sys.score file would contain:
#score the MT systems with newstest2020
ref-metric en-zh newstest2020 newstest2020 sys1.0 score
...
ref-metric en-zh newstest2020 newstest2020 sysN.0 score
#score the MT systems with newstestB2020
ref-metric en-zh newstest2020 newstestB2020 sys1.0 score
...
ref-metric en-zh newstest2020 newstestB2020 sysN.0 score
#score the MT systems with multiple references (newstest2020 and newstestB2020)
ref-metric en-zh newstest2020 newstestM2020 sys1.0 score
...
ref-metric en-zh newstest2020 newstestM2020 sysN.0 score
#score Human-B.0 system [newstestB2020 reference] against newstest2020
ref-metric en-zh newstest2020 newstest2020 Human-B.0 score
#score Human-A.0 system [newstest2020 reference] against newstestB2020
ref-metric en-zh newstest2020 newstestB2020 Human-A.0 score
#score test-suite translations of the MT systems against the testsuites2020 reference (test suites have only a single reference, so no scoring of human translations here)
ref-metric en-zh testsuites2020 testsuites2020 sys1.0 score
...
ref-metric en-zh testsuites2020 testsuites2020 sysN.0 score

We have three references for en-de. The metric.sys.score file would contain:
#insert MT system scores here, against each reference individually (newstest2020, newstestB2020 and newstestP2020) and combined (newstestM2020)
#score Human-B.0 and Human-P.0 systems against the newstest2020 reference
ref-metric en-de newstest2020 newstest2020 Human-B.0 score
ref-metric en-de newstest2020 newstest2020 Human-P.0 score
#score Human-A.0 and Human-P.0 systems against the newstestB2020 reference
ref-metric en-de newstest2020 newstestB2020 Human-A.0 score
ref-metric en-de newstest2020 newstestB2020 Human-P.0 score
#score Human-A.0 and Human-B.0 systems against the newstestP2020 reference
ref-metric en-de newstest2020 newstestP2020 Human-A.0 score
ref-metric en-de newstest2020 newstestP2020 Human-B.0 score
#score Human-A.0 system against the combined newstestB2020 and newstestP2020 references
ref-metric en-de newstest2020 newstestM2020 Human-A.0 score
#score Human-B.0 system against the combined newstest2020 and newstestP2020 references
ref-metric en-de newstest2020 newstestM2020 Human-B.0 score
#score Human-P.0 system against the combined newstest2020 and newstestB2020 references
ref-metric en-de newstest2020 newstestM2020 Human-P.0 score
#insert test-suite scores of MT systems here
The metric.doc.score and metric.seg.score files would, likewise, contain the document and segment scores for systems and human translations included in the metric.sys.score file.
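One way to enumerate the required human-translation rows is to score each human translation against every other reference individually, and against the remaining references combined. A sketch for the en-de case, using the reference-set labels described above (metric name and score column omitted or stubbed):

```python
# Each human translation and the reference set it came from.
refs = {
    "Human-A.0": "newstest2020",
    "Human-B.0": "newstestB2020",
    "Human-P.0": "newstestP2020",
}

rows = []
for human, own_ref in refs.items():
    # Score this human translation against each other reference individually...
    for other, ref in refs.items():
        if other != human:
            rows.append(("ref-metric", "en-de", "newstest2020", ref, human))
    # ...and against the remaining references combined (the newstestM2020 label).
    rows.append(("ref-metric", "en-de", "newstest2020", "newstestM2020", human))
```

This yields three rows per human translation (two individual references plus the combined set), and no human translation is ever scored against its own reference.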
Before you submit, please run your score files through the validation script, found here. The folder contains sample submissions for source-based and reference-based metrics (with random scores), and the script compares your submission against the samples.
Submissions should be sent as an e-mail to email@example.com with the subject "WMT Metrics submission".
If the document-level score of your metric is an average of segment-level scores, you do not have to include metric.doc.score.
As a sanity check, please enter yourself to this shared spreadsheet.
In case the above e-mail address doesn't work for you (Google seems to prevent postings from non-members despite our settings), please contact us directly.