This shared task will examine automatic methods for estimating the quality of machine translation output at run-time. Quality estimation is a topic of increasing interest in MT. It aims to provide a quality indicator for unseen translated sentences at various granularity levels. In this shared task, we will focus on sentence-level estimation. Unlike MT evaluation, quality estimation does not rely on reference translations and is generally addressed using machine learning techniques that predict quality scores. Some interesting uses of sentence-level quality estimation are the following:
The goals of the shared quality estimation task are:
This is the first time quality estimation is addressed as a shared task. This year we will provide datasets for a single language pair, text domain and MT system: English-Spanish news texts produced by a phrase-based SMT system (Moses) trained on the Europarl and News Commentary corpora provided by WMT. As training data, we will provide translations manually annotated for quality in terms of post-editing effort (1-5 scores), together with their source sentences, reference translations, and post-edited translations. Additional training data can be used, as deemed appropriate. As test data, we will provide source and MT-translated sentences only, but the evaluation will be performed against the manual annotations of those translations (obtained in the same fashion as for the training data). Besides the datasets, we will provide a system to extract baseline quality estimation features, as well as resources that can be used to extract additional features (language model, Giza++ tables, etc.).
The manual annotation for both training and test sets was performed by professional translators as a measure of post-editing effort according to the following scoring scheme:
|1||--||The MT output is incomprehensible, with little or no information transferred accurately. It cannot be edited and needs to be translated from scratch.|
|2||--||About 50-70% of the MT output needs to be edited. It requires a significant editing effort in order to reach publishable level.|
|3||--||About 25-50% of the MT output needs to be edited. It contains different errors and mistranslations that need to be corrected.|
|4||--||About 10-25% of the MT output needs to be edited. It is generally clear and intelligible.|
|5||--||The MT output is perfectly clear and intelligible. It is not necessarily a perfect translation, but requires little to no editing.|
Each translation was annotated by 3 different annotators and the average of the 3 annotations is used as the final score (a real number between 1 and 5).
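The averaging step described above can be sketched in a few lines of Python. This is only an illustration: the function name `final_score` and the range check are assumptions made for the example and are not part of the official resources.

```python
from statistics import mean


def final_score(annotations):
    """Average the 1-5 post-editing effort scores given to one translation.

    Hypothetical helper for illustration; not part of the official tools.
    """
    if not annotations:
        raise ValueError("at least one annotation is required")
    if any(not 1 <= a <= 5 for a in annotations):
        raise ValueError("each annotation must lie between 1 and 5")
    return mean(annotations)


# Three annotators scored the same translation 3, 4 and 4.
print(final_score([3, 4, 4]))
```

Because the final score is a mean of three integers, it is a real number between 1 and 5 rather than one of the five discrete labels.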
We propose two variations of the task:
While rankings can sometimes be generated directly from the sentence-level quality scores (modulo ties), participants can choose to submit to either one or both variations of the task. Please note that the evaluation script will not attempt to explicitly derive rankings from the scores.
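Since the evaluation script will not derive rankings from scores, participants who want to submit to the ranking variation from score predictions need to do so themselves. The sketch below shows one possible way, under two assumptions that are not stated in the text above: rank 1 goes to the segment with the highest predicted score, and ties are broken by segment order. The function name `ranks_from_scores` is hypothetical.

```python
def ranks_from_scores(scores):
    """Return a 1-based rank per segment, with rank 1 for the highest score.

    Assumptions for illustration: higher score means better translation, and
    ties are broken by original segment order (an arbitrary choice).
    """
    # sorted() is stable, so tied segments keep their original relative order.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


# Segments 2 and 4 are tied; the earlier one receives the better rank.
print(ranks_from_scores([3.7, 4.3, 1.0, 4.3]))  # prints [3, 1, 4, 2]
```

The tie-breaking rule is where such a derivation becomes arbitrary, which is why the two variations can legitimately receive different submissions.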
The data and baseline system can be downloaded from GitHub.
The source, translation and reference sentences will be distributed as plain text files with one segment per line. Your software should produce segment-level scores for the translations, formatted in the following way:
<METHOD NAME> <SEGMENT NUMBER> <SEGMENT SCORE> <SEGMENT RANK>
Where:
METHOD NAME is the name of your quality estimation method.
SEGMENT NUMBER is the line number of the plain text translation file you are scoring/ranking.
SEGMENT SCORE is the score for the particular segment; assign all 0's to it if you are only submitting ranking results.
SEGMENT RANK is the ranking of the particular segment; assign all 0's to it if you are only submitting scores.
Use the "METHOD NAME" field in the submission format to indicate the name of the team and a descriptor for the method. For instance, a submission from team ABC using method "BestAlg2012" should have the "METHOD NAME" field in the submission as "ABC_BestAlg2012". To ease the processing of a large estimated number of entries, the official scoring script (available with the official distribution of resources) will enforce this format for the "METHOD NAME" field as <TEAMNAME>_<DESCRIPTION>. Please make sure the official script parses your field without complaining before you submit your official submission(s).
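As a rough illustration of the submission format, the Python sketch below builds the output lines for a scoring-only or ranking-only submission. Several details are assumptions rather than facts from the task description: the tab separator, the 1-based segment numbering, and the regular expression used to approximate <TEAMNAME>_<DESCRIPTION>. The function name `format_submission` is hypothetical, and the official scoring script remains the authority on what is accepted.

```python
import re

# Assumed reading of <TEAMNAME>_<DESCRIPTION>: a team name without whitespace
# or underscores, one underscore, then a non-empty descriptor without
# whitespace. The official script may apply different rules.
METHOD_NAME = re.compile(r"^[^\s_]+_\S+$")


def format_submission(method_name, scores=None, ranks=None, sep="\t"):
    """Build one output line per segment; a missing field is filled with 0."""
    if not METHOD_NAME.match(method_name):
        raise ValueError("method name should look like TEAMNAME_DESCRIPTION")
    if scores is None and ranks is None:
        raise ValueError("provide scores, ranks, or both")
    n = len(scores if scores is not None else ranks)
    scores = scores if scores is not None else [0] * n
    ranks = ranks if ranks is not None else [0] * n
    if len(scores) != len(ranks):
        raise ValueError("scores and ranks must cover the same segments")
    # Segment numbers are assumed to be 1-based line numbers.
    return [
        sep.join([method_name, str(number), str(score), str(rank)])
        for number, (score, rank) in enumerate(zip(scores, ranks), start=1)
    ]


for line in format_submission("ABC_BestAlg2012", scores=[3.5, 4.0]):
    print(line)
```

Running the official script on the resulting file before submitting is still necessary, since this sketch only approximates its checks.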
|Release of training data + baseline feature extractor||January 16, 2012 (on github)|
|Release of test set||February 29, 2012 (on github)|
|Submission deadline for quality estimation task||March 7, 2012 (11:59pm PST)|
|Paper submission deadline||April 6, 2012 (11:59pm PST)|
You are invited to submit a short paper (4 to 6 pages) describing your quality estimation method. Submitting a paper is optional. If you do not submit one, we ask that you give an appropriate reference describing your method that we can cite in the overview paper.
In addition to this short paper, we are planning to invite participants to submit an extended version of their papers to a special issue of the MT journal on Quality Estimation.
We encourage individuals who are submitting research papers to submit entries in the shared task using the training resources provided by this workshop (in addition to potential entries that may use other training resources), so that their experiments can be repeated by others using these publicly available resources.
|SDL Language Weaver|
|University of Sheffield|
funded by the European Framework Programme 7