Shared Task: Quality Estimation

June 7 - 8, 2012
Montreal, Quebec, Canada


This shared task will examine automatic methods for estimating machine translation output quality at run-time. Quality estimation is a topic of increasing interest in MT. It aims to provide a quality indicator for unseen translated sentences at various levels of granularity. In this shared task, we will focus on sentence-level estimation. Unlike MT evaluation, quality estimation does not rely on reference translations and is generally addressed using machine learning techniques to predict quality scores. Some interesting uses of sentence-level quality estimation are the following:

Efforts in the area are scattered across several groups and, as a consequence, comparing different systems is difficult: there are no well-established baselines, datasets, or standard evaluation metrics. In this shared task we will provide a first common ground for the development and comparison of quality estimation systems: training and test sets, along with evaluation metrics and a baseline system.


The goals of the shared quality estimation task are:

Task Description

This is the first time quality estimation is addressed as a shared task. This year we will provide datasets for a single language pair, text domain and MT system: English-Spanish news texts produced by a phrase-based SMT system (Moses) trained on the Europarl and News Commentary corpora as provided by WMT. As training data, we will provide translations manually annotated for quality in terms of post-editing effort (1-5 scores), together with their source sentences, reference translations, and post-edited translations. Additional training data can be used, as deemed appropriate. As test data, we will provide source and MT-translated sentences only, but the evaluation will be performed against the manual annotations of those translations (obtained in the same fashion as for the training data). Besides the datasets, we will provide a system to extract baseline quality estimation features, as well as resources that can be used to extract additional features (language model, GIZA++ tables, etc.).

The manual annotation for both training and test sets was performed by professional translators as a measure of post-editing effort according to the following scoring scheme:

1 -- The MT output is incomprehensible, with little or no information transferred accurately. It cannot be edited and needs to be translated from scratch.
2 -- About 50-70% of the MT output needs to be edited. It requires a significant editing effort in order to reach publishable level.
3 -- About 25-50% of the MT output needs to be edited. It contains different errors and mistranslations that need to be corrected.
4 -- About 10-25% of the MT output needs to be edited. It is generally clear and intelligible.
5 -- The MT output is perfectly clear and intelligible. It is not necessarily a perfect translation, but requires little to no editing.

Each translation was annotated by 3 different annotators and the average of the 3 annotations is used as the final score (a real number between 1 and 5).
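
The averaging step can be illustrated with a minimal Python sketch. The function name final_score and the range checks are assumptions made for illustration; they are not part of the official resources.

```python
from statistics import mean


def final_score(annotations):
    """Average three 1-5 annotator scores into one real-valued score.

    Hypothetical helper: the name and the validation are illustrative.
    """
    if len(annotations) != 3:
        raise ValueError("expected exactly 3 annotations")
    for score in annotations:
        if not 1 <= score <= 5:
            raise ValueError(f"score out of range: {score}")
    return mean(annotations)


# Three annotators give 3, 4 and 4: the final score is 11/3.
print(final_score([3, 4, 4]))
```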

We propose two variations of the task:

While rankings can sometimes be generated directly from the sentence-level quality scores (modulo ties), participants can choose to submit to either one or both variations of the task. Please note that the evaluation script will not attempt to explicitly derive rankings from the scores.
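
Since the evaluation script does not derive rankings, participants who want a ranking from their scores have to compute it themselves. The sketch below is one possible way to do so, under two assumptions that the task description does not confirm: rank 1 goes to the highest (best) score, and ties are broken by original segment order.

```python
def ranks_from_scores(scores):
    """Return 1-based ranks, with rank 1 for the highest score.

    Assumption: ties are broken by original position (Python's sort is
    stable), which is only one of several possible conventions.
    """
    # Indices of segments, ordered from best to worst score.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


# Segments 2 and 4 tie on 4.67; the earlier one gets the better rank.
print(ranks_from_scores([3.0, 4.67, 1.33, 4.67]))  # [3, 1, 4, 2]
```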


The data and baseline system can be downloaded from GitHub.

Submission Format

The source, translation, and reference sentences will be distributed as plain text files with one segment per line. Your software should produce segment-level scores for the translations, formatted in the following way:


Where: Each field should be delimited by a single tab character.
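
As a minimal sketch of the tab-delimiting rule only, the hypothetical helper below joins the fields of one output line with single tab characters and rejects any field that itself contains a tab or newline. The field values in the usage line are placeholders and do not represent the official field layout.

```python
def tab_join(fields):
    """Join the fields of one output line with single tab characters.

    Hypothetical helper; it knows nothing about which fields the
    official format requires.
    """
    values = [str(field) for field in fields]
    for value in values:
        if "\t" in value or "\n" in value:
            raise ValueError(f"field contains a tab or newline: {value!r}")
    return "\t".join(values)


# Placeholder values, for illustration only.
print(tab_join(["ABC_BestAlg2012", 1, 3.5]))
```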

Submission Requirements

Each participating team may submit at most 2 separate submissions (covering either or both variations of the task), sent via email to the organizers (Lucia Specia and Radu Soricut). Please use the "METHOD NAME" field in the submission format to indicate the name of the team and a descriptor for the method. For instance, a submission from team ABC using method "BestAlg2012" should have the "METHOD NAME" field in the submission as "ABC_BestAlg2012". To ease the processing of the large number of entries expected, the official scoring script (available with the official distribution of resources) will enforce this format for the "METHOD NAME" field as: <TEAMNAME>_<DESCRIPTION>. Please make sure the official script parses your field without complaining before you send your official submission(s).
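
A quick local check of the "METHOD NAME" field might look like the sketch below. The regular expression is an assumption (a non-empty team name without underscores or whitespace, one underscore, then a non-empty description without whitespace); the official scoring script may apply different rules and remains the authoritative check.

```python
import re

# Assumed pattern for <TEAMNAME>_<DESCRIPTION>; not taken from the
# official scoring script.
_METHOD_NAME = re.compile(r"[^_\s]+_\S+")


def is_valid_method_name(name):
    """Return True if name looks like <TEAMNAME>_<DESCRIPTION>."""
    return _METHOD_NAME.fullmatch(name) is not None


print(is_valid_method_name("ABC_BestAlg2012"))  # True
print(is_valid_method_name("BestAlg2012"))      # False
```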


Release of training data and baseline feature extractor: January 16, 2012 (on GitHub)
Release of test set: February 29, 2012 (on GitHub)
Submission deadline for quality estimation task: March 7, 2012 (11:59pm PST)
Paper submission deadline: April 6, 2012 (11:59pm PST)


Lucia Specia (University of Sheffield)
Radu Soricut (SDL Language Weaver)

Other Requirements

You are invited to submit a short paper (4 to 6 pages) describing your quality estimation method. Submitting a paper is optional. If you do not submit one, we ask that you provide an appropriate reference describing your method that we can cite in the overview paper.

In addition to this short paper, we are planning to invite participants to submit an extended version of their papers to a special issue of the MT journal on Quality Estimation.

We encourage individuals who are submitting research papers to submit entries in the shared task using the training resources provided by this workshop (in addition to potential entries that may use other training resources), so that their experiments can be repeated by others using these publicly available resources.


For questions, comments, etc. please send email to Lucia Specia and Radu Soricut.

Supported by:
SDL Language Weaver
University of Sheffield
EuroMatrixPlus project, funded by the European Commission under Framework Programme 7