Shared Task: Metrics

Metrics Task Important Dates

System outputs ready to download: June 3rd, 2017 (postponed from May 14th, 2017)
Start of manual evaluation period: June 16th, 2017 (postponed from May 15th, 2017)
End of manual evaluation (provisional): June 23rd, 2017 (postponed from June 4th, 2017)
Paper submission deadline: June 9th, 2017, extended to June 17th, 2017 (AoE)
Submission deadline for metrics task: June 15th, 2017, extended to June 21st, 2017 (AoE, indeed later than the paper deadline)
Notification of acceptance: June 30th, 2017
Camera-ready deadline: July 14th, 2017
Conference in Copenhagen: September 7-8, 2017

Metrics Task Overview

This shared task will examine automatic evaluation metrics for machine translation. We will provide you with all of the translations produced in the translation task along with the human reference translations. You will return your automatic metric scores for translations at the system-level and/or at the sentence-level. We will calculate the system-level and sentence-level correlations of your scores with WMT17 human judgements once the manual evaluation has been completed.
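To make the evaluation concrete, the following is a minimal sketch of the system-level part of it: the Pearson correlation between a metric's system-level scores and human direct-assessment scores. The system names and score values are made up for illustration; they are not WMT17 data.

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative (invented) scores for three systems on one language pair.
metric = {"sysA": 0.412, "sysB": 0.388, "sysC": 0.455}  # your metric's scores
human  = {"sysA": 67.2, "sysB": 61.5, "sysC": 70.1}     # human DA scores

systems = sorted(metric)
r = pearson([metric[s] for s in systems], [human[s] for s in systems])
print(f"system-level Pearson r = {r:.3f}")
```

A metric whose system ranking agrees closely with the human judgements will have r near 1.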


The goals of the shared metrics task are:

Changes This Year

Each submission to this year's metrics task should include:

As trialed in WMT16, the system-level evaluation will optionally include evaluation of metrics against large sets of 10,000 hybrid MT systems.

We will also include a medical-domain, sentence-level evaluation of metrics against HUME manual judgments, which are based on UCCA.

Task Description

We will provide you with the output of machine translation systems and reference translations for language pairs involving English and the following languages:

You will compute scores for each of the outputs at the system-level and/or the sentence-level. If your automatic metric does not produce sentence-level scores, you can participate in just the system-level ranking. If your automatic metric uses linguistic annotation and supports only some language pairs, you are free to assign scores only where you can.
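As a hedged illustration of the two granularities: one common pattern (not required by the task) is to compute sentence-level scores and average them to obtain a system-level score. The `score_segment` metric below is a toy stand-in (unigram precision against the reference), not a real submission metric.

```python
def score_segment(hypothesis: str, reference: str) -> float:
    """Toy stand-in metric: fraction of hypothesis words found in the reference."""
    hyp, ref = hypothesis.split(), set(reference.split())
    return sum(w in ref for w in hyp) / max(len(hyp), 1)

def score_system(hypotheses, references):
    """Sentence-level scores plus their average as a system-level score."""
    segs = [score_segment(h, r) for h, r in zip(hypotheses, references)]
    return segs, sum(segs) / len(segs)

segs, sys_score = score_system(
    ["the cat sat on mat", "hello word"],
    ["the cat sat on the mat", "hello there world"],
)
print(segs, sys_score)  # → [1.0, 0.5] 0.75
```

A metric that only works at the system level would simply skip the per-segment output and report one number per system.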

We will assess automatic evaluation metrics in the following ways:

Summary of Tracks

The following table summarizes the planned evaluation methods and text domains of each evaluation track.

Track   | Text Domain                                                              | Level         | Golden Truth Source
DAsys   | news, from the WMT17 news task                                           | system-level  | direct assessment
DAseg   | news, from the WMT17 news task                                           | segment-level | direct assessment
HUMEseg | mix of (consumer) medical from HimL and news (WARNING: WMT16 news task)  | segment-level | correctness of translation of all semantic nodes
HUMEsys | mix of (consumer) medical from HimL and news (WARNING: WMT16 news task)  | system-level  | aggregate correctness of translation of all semantic nodes

Other Requirements

If you participate in the metrics task, we ask you to commit about 8 hours of time to do the manual evaluation. The evaluation will be done with an online tool.

You are invited to submit a short paper (4 to 6 pages) describing your automatic evaluation metric. Submitting a paper is optional; if you do not submit one, we ask that you give an appropriate reference describing your metric that we can cite in the overview paper.


Test Sets (Evaluation Data)

WMT17 metrics task test sets are ready. Since we are trying to establish better confidence intervals for the system-level evaluation, there are more than 10,000 system outputs per language pair and test set, so the package is quite big.

We have changed the format of the hybrid systems' inputs; see the file wmt17-metrics-task/hybrids/hybrid-instructions in the package for a description. We plan to provide a wrapper for the TXT format to run your metric on the hybrid systems.

If possible, please submit results for all systems, including the hybrids. If you know you won't have the resources to run the hybrids, you can use the smaller package:

Note that the actual sets of sentences differ across test sets (which is natural), but they also differ across language pairs. So always use the triple {test set name, source language, target language} to identify the test set source, reference, and a system output.

There are two references for the English-to-Finnish newstest: and . You are free to use both; if you use only one, please pick the former variant.

Training Data

You may want to use some of the following data to tune or train your metric.

DA (Direct Assessment) Development/Training Data

For system-level, see last year's results:

  • WMT16:

For segment-level, there are two past development sets available:

  • DAseg-wmt-newstest2016.tar.gz: 7 language pairs sampled from newstest2016 (tr-en, fi-en, cs-en, ro-en, ru-en, en-ru, de-en; 560 sentence pairs each)
  • DAseg-wmt-newstest2015.tar.gz: 5 language pairs sampled from newstest2015 (en-ru, de-en, ru-en, fi-en, cs-en; 500 sentence pairs each)

Each dataset contains:


For HUMEseg training data, see last year's metrics task results:

  • WMT16: the package called "Metrics Task data and results", with these files:

For HUMEseg, golden truth segment-level scores are constructed from manual annotations indicating whether each node in the semantic tree of the source sentence was translated correctly. The underlying semantic representation is UCCA.

In contrast to the previous year, there will be a handful of system outputs per segment (a different set of systems for each language pair).

RR (Relative Ranking) from Past Years

Although RR is no longer the manual evaluation employed in the metrics task, human judgments from previous years' data sets may still prove useful:

If your metric has free parameters, you can use any past year's data to tune them for this year's submission. Additionally, you can use any past data as a test set to compare your metric's performance against the published results of past metrics task participants.

Last year's data contains all of the systems' translations, the source documents, the human reference translations, and the human judgments of translation quality.

Submission Format

Your software should produce scores for the translations at the system level, the segment level, or (preferably) both.

If you have a single setup for all domains and evaluation tracks, simply report the test set name (newstest2017 or himltest) with your scores, as described below. We will evaluate your outputs in all applicable tracks.

If your setups differ based on the provided training data or domain knowledge, please include the evaluation track name as part of the test set name. Valid track names are DAsys, DAseg and HUMEseg; see above.

Output file format for system-level rankings

The output files for system-level rankings should be called YOURMETRIC.sys.score.gz and formatted in the following way:

Each field should be delimited by a single tab character.

Timestamps should be in Epoch seconds, i.e. as produced by the "date +%s" command (Linux) or equivalent. We will use the two timestamps to work out the rough total duration, in seconds, that your metric took to produce scores for the system-level submissions. To avoid inconsistencies across submissions, please take the timestamps at the very beginning and end of processing the raw data, i.e. before any preprocessing such as tokenization (of both MT output and reference translations), so that preprocessing is consistently included in the durations for all metrics.
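A hedged sketch of producing the gzipped, tab-delimited system-level file with the required timestamps follows. The exact field layout is whatever the task description above specifies; the columns and values used here are illustrative assumptions, not the official format.

```python
import gzip
import time

start_ts = int(time.time())  # equivalent of `date +%s`, taken before any preprocessing

# ... run your metric on the raw data here; rows below are invented examples ...
rows = [
    ("YOURMETRIC", "cs-en", "newstest2017", "online-A", "0.512"),
    ("YOURMETRIC", "cs-en", "newstest2017", "online-B", "0.498"),
]

end_ts = int(time.time())  # taken after all scoring is finished

# Write one line per system, fields separated by a single tab character.
with gzip.open("YOURMETRIC.sys.score.gz", "wt", encoding="utf-8") as f:
    for row in rows:
        f.write("\t".join(row) + "\n")

print(f"scored in roughly {end_ts - start_ts} seconds")
```

The two timestamps bracket the whole run, so the organizers can recover an approximate wall-clock duration for the metric.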

Output file format for segment-level rankings

The output files for segment-level rankings should be called YOURMETRIC.seg.score.gz and formatted in the following way:

Each field should be delimited by a single tab character.

Note: the fields ENSEMBLE and AVAILABLE should carry the same value on every line of the submission file for a given metric. Including them in this format involves some redundancy, but it avoids adding extra files to the submission requirements.
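Since ENSEMBLE and AVAILABLE must be constant across the whole file, a quick self-check before submitting can catch accidental inconsistencies. In this sketch the column positions of those two fields are an assumption (adjust them to the actual format), and the demo file contents are invented.

```python
import gzip

def constant_columns(path, col_indices):
    """Return True if each of the given tab-separated columns holds a single
    value across every line of the gzipped file."""
    seen = {c: set() for c in col_indices}
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            for c in col_indices:
                seen[c].add(fields[c])
    return all(len(values) == 1 for values in seen.values())

# Tiny illustrative file; the last two columns play the role of
# ENSEMBLE and AVAILABLE here.
with gzip.open("demo.seg.score.gz", "wt", encoding="utf-8") as f:
    f.write("YOURMETRIC\tcs-en\tnewstest2017\t1\t0.71\tno\tyes\n")
    f.write("YOURMETRIC\tcs-en\tnewstest2017\t2\t0.64\tno\tyes\n")

print(constant_columns("demo.seg.score.gz", [5, 6]))  # → True
```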

How to submit

Submissions should be sent as an e-mail to

In case the above e-mail address doesn't work for you (Google seems to prevent non-member postings despite our settings), please contact us directly.

Metrics Task Organizers

Ondřej Bojar (Charles University in Prague)
Yvette Graham (Dublin City University)
Amir Kamran (University of Amsterdam, ILLC)


Supported by the European Commission under the QT21 project (grant number 645452)