WMT16 Metrics Task

Metrics Task Important Dates

System outputs ready to download May 1, 2016
Start of manual evaluation period May 2, 2016
Paper submission deadline May 15, 2016 (originally May 8)
Submission deadline for metrics task May 22, 2016
End of manual evaluation May 22, 2016
Notification of acceptance June 5, 2016
Camera-ready deadline June 22, 2016
Conference in Berlin August 11-12, 2016

Metrics Task Overview

This shared task will examine automatic evaluation metrics for machine translation. We will provide you with all of the translations produced in the translation task along with the reference human translations. You will return your automatic metric scores for each of the translations at the system-level and/or at the sentence-level. We will calculate the system-level and sentence-level correlations of your rankings with WMT16 human judgements once the manual evaluation has been completed.
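As a rough illustration of the system-level comparison described above, the sketch below computes a Pearson correlation between one metric's scores and human scores for the same systems. The choice of Pearson is an assumption (this overview does not name the coefficient), and all system names and numbers are invented for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equally long score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Invented scores for three hypothetical systems (not real WMT16 data).
metric_scores = {"sysA": 0.31, "sysB": 0.27, "sysC": 0.22}
human_scores = {"sysA": 0.40, "sysB": 0.10, "sysC": -0.35}

systems = sorted(metric_scores)
r = pearson([metric_scores[s] for s in systems],
            [human_scores[s] for s in systems])
print(f"system-level correlation: {r:.3f}")
```

A metric whose scores rise and fall with the human scores gets a value close to 1; a metric that orders systems the opposite way gets a value close to -1.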


The goals of the shared metrics task are:

Changes This Year

Metrics Task goes crazy this year. The good news is that if you do not aim at bleeding edge performance, you will be affected minimally:

File formats are unchanged (see below); the only difference is that the TEST SET should include the track name.

If you do want to provide bleeding-edge results, you may want to know a bit more about the composition of the test sets, system sets, ways of evaluation and the training data we provide.

In short, we are adding "tracks" to cover:

The madness is fully summarized in a live Google sheet.

You can easily identify the track by the test set label (e.g. “RRsegNews+”) and based on that, you may want to use a variant of your metric adapted for the task, e.g. tuned on a different development set. Training data are listed below.

Remember to describe the exact setup of your metric used for each of the tracks in your metric paper.

Task Description

We will provide you with the output of machine translation systems and reference translations (2 references for Finnish, 1 for others) for several language pairs involving English and the following languages: Basque, Bulgarian, Czech, Dutch, Finnish, German, Polish, Portuguese, Romanian, Russian, Spanish, and Turkish. You will compute scores for each of the outputs at the system-level and/or the sentence-level. If your automatic metric does not produce sentence-level scores, you can participate in just the system-level ranking. If your automatic metric uses linguistic annotation and supports only some language pairs, you are free to assign scores only where you can.

We will measure the goodness of automatic evaluation metrics in the following ways:

Summary of Tracks

The following table summarizes the planned evaluation methods and text domains of each evaluation track.

Track     | Text Domain                   | Level         | Golden Truth Source
RRsysNews | news, from WMT16 news task    | system-level  | relative ranking
RRsysIT   | IT, from WMT16 IT task        | system-level  | relative ranking
DAsysNews | news, from WMT16 news task    | system-level  | direct assessment
RRsegNews | news, from WMT16 news task    | segment-level | relative ranking
DAsegNews | news, from WMT16 news task    | segment-level | direct assessment
HUMEseg   | (consumer) medical, from HimL | segment-level | correctness of translation of all semantic nodes

Other Requirements

If you participate in the metrics task, we ask you to commit about 8 hours of time to do the manual evaluation. The evaluation will be done with an online tool.

You are invited to submit a short paper (4 to 6 pages) describing your automatic evaluation metric. You are not required to submit a paper if you do not want to. If you don't, we ask that you give an appropriate reference describing your metric that we can cite in the overview paper.


Test Sets (Evaluation Data)

WMT16 metrics task test sets are ready. Since we are trying to establish better confidence intervals for system-level evaluation, we have more than 10k system outputs per language pair and test set, so the packages are quite big.

See the Google sheet if you want to take part in only some of the languages or tracks and do not want to download more than needed.

Note that the actual sets of sentences differ across test sets (that's natural) but they also differ across language pairs. So always use the triple {test set name, source language, target language} to identify the test set source, reference and a system output.

There are two references for English-to-Finnish newstest: newstest2016-enfi-ref.fi and newstest2016-enfi-ref.fiB. You are free to use both; if you use only one, please pick the former.

Packages per Language Pair

To take part in a particular language pair (seg-level or sys-level), download the package for the language pair (as we are adding them):

This loop downloads all the packages (10 GB):

    for lp in cs-en de-en en-bg en-cs en-de en-es en-eu en-fi en-nl en-pl en-pt en-ro en-ru en-tr fi-en ro-en ru-en tr-en; do
      wget http://ufallab.ms.mff.cuni.cz/~bojar/wmt16-metrics-task-data/wmt16-metrics-inputs-for-$lp.tar.bz2
    done

Each package contains everything you need for its language pair.

Each package contains one or more test sets (their source, e.g. newstest2016-csen-src.cs, reference newstest2016-csen-ref.en) and system outputs for each of the test sets (e.g. newstest2016.online-B.0.cs-en). Along with the normal MT systems, there are 10k hybrid systems for the newstest2016 stored in the directories H0 through H9 and/or 10k hybrid systems for the ittest2016 stored in the directories I0 through I9.

The filename of each system follows the pattern TESTSET.SYSTEMNAME.SYSTEMID.SRC-TGT, including the hybrids which differ only in their IDs. All filenames across the whole metrics task are unique, but do not put more than 10k files in a directory.

For system-level evaluation, you need to score all systems, including the hybrid ones. For segment-level evaluation, you need to score only the normal systems and you can ignore the [HI]* directories.
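The naming pattern and the scoring rule above can be sketched as follows. This is only an illustration: it assumes that the test set name contains no dots (the system name may), that hybrid systems are recognised purely by a parent directory named H0 to H9 or I0 to I9, and the hybrid path in the example is hypothetical. All function names are made up for this sketch.

```python
import re
from pathlib import PurePosixPath
from typing import NamedTuple


class SystemFile(NamedTuple):
    test_set: str
    system_name: str
    system_id: str
    src: str
    tgt: str


# Directories H0..H9 and I0..I9 hold the hybrid systems.
HYBRID_DIR = re.compile(r"^[HI][0-9]$")


def parse_system_filename(filename):
    """Split TESTSET.SYSTEMNAME.SYSTEMID.SRC-TGT into its parts."""
    test_set, rest = filename.split(".", 1)            # assumes no dot in TESTSET
    name_and_id, lang_pair = rest.rsplit(".", 1)
    system_name, system_id = name_and_id.rsplit(".", 1)
    src, tgt = lang_pair.split("-", 1)
    return SystemFile(test_set, system_name, system_id, src, tgt)


def is_hybrid(path):
    """True if any parent directory of the file is a hybrid directory."""
    parents = PurePosixPath(path).parts[:-1]
    return any(HYBRID_DIR.match(part) for part in parents)


def systems_to_score(paths, level):
    """Keep all systems for 'sys'; drop the hybrids for 'seg'."""
    if level == "sys":
        return list(paths)
    if level == "seg":
        return [p for p in paths if not is_hybrid(p)]
    raise ValueError("level must be 'sys' or 'seg'")


paths = [
    "newstest2016.online-B.0.cs-en",
    "H0/newstest2016.online-B.1234.cs-en",  # hypothetical hybrid file
]
print(parse_system_filename(PurePosixPath(paths[0]).name))
print(systems_to_score(paths, "seg"))
```

Splitting from both ends keeps the parser working even if a system name itself contains a dot.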

Package for Segment-Level Metrics Only

If you want to participate only in segment-level metrics, we do not need the 10k extra systems, so the package is smaller and includes all languages:

Training Data

You may want to use some of the following datasets to tune or train your metric.

RR (Relative Ranking) from Past Years

The system outputs and human judgments from the previous workshops are available for download from the following links:

You can use them to tune your metric's free parameters if it has any. If you want to report results in your paper, you can use this data to compare the performance of your metric against the published results from past years.

Last year's data contains all of the systems' translations, the source documents, the human reference translations and the human judgments of translation quality.

There are no separate training data for RRsysNews and RRsysIT. (Put differently, you have to use the news-based RR data for RRsysIT as well.)

DA (Direct Assessment) Training Data

For the segment level, we provide a development set of 500 sentences for each of Czech, German, Finnish and Russian, translated into English (the translations were sampled at random from the outputs of all systems participating in the WMT15 translation task). The dataset contains:

The package is available here:

There are some direct assessment judgements for system-level English<->Spanish, but this language pair is not among the tested pairs this year. Contact Yvette Graham if you are interested in this dataset.


There are no training data for the HUMEseg track.

To give you at least some background, the golden truth segment-level scores are constructed from manual annotations indicating if each node in the semantic tree of the source sentence was translated correctly. The underlying semantic representation is UCCA.

There is only one system output per segment.

Submission Format

Your software should produce scores for the translations at the system level, at the segment level, or preferably both.

If you have a single setup for all domains and evaluation tracks, simply report the test set name (newstest2016, ittest2016 or himltest) with your scores as usual and as described below. We will evaluate your outputs in all applicable tracks.

If your setups differ based on the provided training data or domain knowledge, please include the evaluation track name as part of the test set name. Valid track names are: RRsysNews, RRsysIT, DAsysNews, RRsegNews, DAsegNews and HUMEseg; see above.
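If you keep track-specific variants of your metric, a small lookup like the one below may help decide which variant applies to a given test set and level. The track names and levels come from the summary of tracks; the pairing of tracks with the test set names newstest2016, ittest2016 and himltest is an assumption based on the text domains listed there. How exactly the track name is combined with the test set name is not shown here, because this page does not spell it out.

```python
# Track -> (test set, level). The test set pairing is an assumption
# derived from the text domain of each track.
TRACKS = {
    "RRsysNews": ("newstest2016", "sys"),
    "RRsysIT": ("ittest2016", "sys"),
    "DAsysNews": ("newstest2016", "sys"),
    "RRsegNews": ("newstest2016", "seg"),
    "DAsegNews": ("newstest2016", "seg"),
    "HUMEseg": ("himltest", "seg"),
}


def applicable_tracks(test_set, level):
    """List the tracks that evaluate this test set at this level."""
    if level not in ("sys", "seg"):
        raise ValueError("level must be 'sys' or 'seg'")
    return [
        track
        for track, (track_test_set, track_level) in TRACKS.items()
        if track_test_set == test_set and track_level == level
    ]


def check_track(name):
    """Raise if the name is not one of the valid track names."""
    if name not in TRACKS:
        raise ValueError(f"unknown track: {name!r}")
    return name


print(applicable_tracks("newstest2016", "seg"))
```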

Output file format for system-level rankings

The output files for system-level rankings should be called YOURMETRIC.sys.score.gz and formatted in the following way:

Where: Each field should be delimited by a single tab character.

Output file format for segment-level rankings

The output files for segment-level rankings should be called YOURMETRIC.seg.score.gz and formatted in the following way:

Where: Each field should be delimited by a single tab character.
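A minimal sketch of writing such a gzipped, tab-delimited score file. The list of fields is not reproduced on this page, so the row in the example is a hypothetical placeholder; substitute the fields and order required by the task. The function names are made up for this sketch.

```python
import gzip
import os
import tempfile


def write_scores(path, rows):
    """Write one line per row, fields joined by a single tab, gzipped."""
    with gzip.open(path, "wt", encoding="utf-8", newline="\n") as out:
        for row in rows:
            fields = [str(field) for field in row]
            if any("\t" in f or "\n" in f for f in fields):
                raise ValueError(f"field contains a tab or newline: {fields!r}")
            out.write("\t".join(fields) + "\n")


def read_scores(path):
    """Read the file back as a list of field lists (handy for checking)."""
    with gzip.open(path, "rt", encoding="utf-8") as inp:
        return [line.rstrip("\n").split("\t") for line in inp]


# Placeholder row: the field order here is an assumption, not the
# official format.
demo_rows = [("YOURMETRIC", "cs-en", "newstest2016", "online-B.0", 0.4321)]

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "YOURMETRIC.sys.score.gz")
    write_scores(demo_path, demo_rows)
    print(read_scores(demo_path))
```

Rejecting fields that contain a tab or newline keeps each line at the expected number of columns.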

How to submit

Submissions should be sent as an e-mail to wmt-metrics-submissions@googlegroups.com.

If the above e-mail address doesn't work for you (Google seems to block postings from non-members even though we configured the group to allow them), please contact us directly.

Metrics Task Organizers

Miloš Stanojević (University of Amsterdam, ILLC)
Amir Kamran (University of Amsterdam, ILLC)
Yvette Graham (Dublin City University)
Ondřej Bojar (Charles University in Prague)


Supported by the European Commission under the QT21 project (grant number 645452)