EMNLP 2015 TENTH WORKSHOP
ON STATISTICAL MACHINE TRANSLATION

Shared Task: Quality Estimation

17-18 September 2015
Lisbon, Portugal


This shared task will build on its previous three editions to further examine automatic methods for estimating the quality of machine translation output at run-time, without relying on reference translations. We once again consider word-level and sentence-level estimation, and this year we introduce a new task: document-level estimation. The sentence- and word-level tasks explore a much larger dataset than in previous years, and the quality annotations for this dataset have been produced from crowdsourced post-editions rather than post-editions by professional translators. Altogether, our tasks have the following goals:

This year's shared task provides new training and test datasets for all tasks, but allows participants to reuse data and resources from previous years, or any external resource deemed relevant. An online system was used to produce the translations for the sentence- and word-level tasks, and multiple MT systems were used to produce the translations for the document-level task. Therefore, resources used to build the actual MT systems (or any internal MT features) cannot be made available.



Task 1: Sentence-level QE

Results here, gold-standard labels here

This task consists in scoring (and ranking) sentences according to the percentage of edits that need to be fixed (HTER). It is similar to task 1.2 in WMT14, with HTER used as the quality score, i.e. the minimum edit distance between the machine translation and its manually post-edited version, in [0,1]. The data is the same as that used for the WMT15 APE task. Translations were produced by a single online SMT system, which needs to be treated as a black box since we do not have access to the actual system. Each of the training and test translations was post-edited by a crowdsourced translator, and HTER labels were computed using TER (default settings: tokenised, case insensitive, exact matching only, but with scores capped to 1).
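
For illustration, below is a minimal sketch of how a capped HTER-style score can be computed from a plain word-level edit distance; the official labels are produced with the TER tool, which additionally handles block shifts, so this approximation will not reproduce them exactly.

    # Minimal HTER-style sketch: word-level edit distance between the MT output
    # and its post-edition, normalised by the post-edition length, capped to 1.
    # The official scores come from the TER tool (which also allows shifts);
    # this is only an approximation for illustration.
    def hter(mt, post_edit):
        h, r = mt.lower().split(), post_edit.lower().split()
        # standard dynamic-programming edit distance (insertions, deletions, substitutions)
        d = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
        for i in range(len(h) + 1):
            d[i][0] = i
        for j in range(len(r) + 1):
            d[0][j] = j
        for i in range(1, len(h) + 1):
            for j in range(1, len(r) + 1):
                cost = 0 if h[i - 1] == r[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
        return min(1.0, d[len(h)][len(r)] / max(len(r), 1))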

As training and development data, we provide English-Spanish datasets with 11,271 and 1,000 source sentences respectively, their machine translations, their post-editions (translations) and HTER scores. Download development data (and baseline features). Download training data (and baseline features).

As test data, we provide a new set of 1,817 English-Spanish translations produced by the same SMT system used for the training data. Download test data (and baseline features).

The same 17 features used in WMT12-14 are used for the baseline system. This system uses SVM regression with an RBF kernel, with a grid search algorithm for the optimisation of relevant parameters. QuEst is used to build the prediction models and this script is used to evaluate them. For significance tests, we use the bootstrap resampling method with this code.
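
The official baseline models are built with QuEst; the sketch below shows a roughly equivalent setup with scikit-learn, assuming the baseline features and HTER scores have already been loaded (the placeholder arrays and the hyperparameter grid are assumptions, not the organisers' exact settings).

    import numpy as np
    from sklearn.svm import SVR
    from sklearn.model_selection import GridSearchCV

    # Placeholder data: in practice X holds the 17 baseline features per sentence
    # and y the corresponding HTER scores parsed from the released files.
    X_train, y_train = np.random.rand(200, 17), np.random.rand(200)
    X_test = np.random.rand(50, 17)

    param_grid = {  # illustrative values only; the official grid may differ
        "C": [1, 10, 100],
        "gamma": [0.01, 0.1, 1.0],
        "epsilon": [0.05, 0.1, 0.2],
    }
    search = GridSearchCV(SVR(kernel="rbf"), param_grid,
                          scoring="neg_mean_absolute_error", cv=5)
    search.fit(X_train, y_train)
    hter_predictions = search.best_estimator_.predict(X_test)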

As in previous years, two variants of the results can be submitted:

Evaluation is performed against the true label and/or HTER ranking using the same metrics as in previous years:



Task 2: Word-level QE

Results here, gold-standard labels here

The goal of this task is to evaluate the extent to which we can detect word-level errors in machine translation output by annotating translation errors at the sub-sentence level. Often, the overall quality of a translated segment is significantly lowered by specific errors in a small number of words or phrases. Various types of errors can be found in translations, but for this task we consider all error types together, creating a binary distinction between 'GOOD' and 'BAD' tokens.

The data for this task is the same as provided in Task 1, with English-Spanish machine translations produced by the same online SMT system. All segments have been automatically annotated for errors with binary word-level labels by using the alignments provided by the TER tool (settings: tokenised, case insensitive, exact matching only, disabling shifts by using the `-d 0` option) between machine translations and their post-edited versions. The edit operations considered as errors are: replacements, insertions and deletions. Shifts (word order errors) were not annotated as such (but rather as deletions+insertions) to avoid introducing noise in the annotation.
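
The official word-level labels are derived from TER alignments as described above; the sketch below only illustrates the idea, using a longest-common-subsequence alignment (Python's difflib) in place of the TER tool, so its output will not match the released annotation exactly.

    import difflib

    # Illustrative labelling: MT tokens that survive unchanged in the post-edition
    # (under an LCS alignment) are tagged 'OK'; tokens that would have to be
    # replaced or removed are tagged 'BAD'. The official labels come from TER
    # alignments with shifts disabled (-d 0).
    def word_labels(mt_tokens, pe_tokens):
        labels = ["BAD"] * len(mt_tokens)
        matcher = difflib.SequenceMatcher(a=mt_tokens, b=pe_tokens)
        for block in matcher.get_matching_blocks():
            for i in range(block.a, block.a + block.size):
                labels[i] = "OK"
        return labels

    print(word_labels("la casa verde es".split(), "la casa verde".split()))
    # ['OK', 'OK', 'OK', 'BAD']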

As training and development data, we provide the tokenised translation outputs with each token annotated with a good or bad label. Download development data (and baseline features). Download training data (and baseline features).

As test data, we provide tokens from an additional 1,817 English-Spanish sentences, produced in the same way. Download test data (and baseline features).

Submissions are evaluated in terms of classification performance (precision, recall, F1) against the original labels. The main evaluation metric is the average F1 for the 'BAD' class. Evaluation script. We also provide an alternative evaluation script that takes as input labels in the exact same format as the labels distributed for the training and dev sets, i.e. one line per sentence, one tag per word, whitespace separated, with tags in the set {'OK', 'BAD'}. For significance tests, we use the approximate randomisation method with this code.
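
For reference, a minimal sketch of the main metric computed with scikit-learn from files in the distributed tag format (one line per sentence, whitespace-separated tags); the file names are hypothetical and the official evaluation scripts should be used for the actual results.

    from sklearn.metrics import f1_score

    def read_tags(path):
        # one line per sentence, one whitespace-separated tag per word ('OK' or 'BAD')
        with open(path) as f:
            return [tag for line in f for tag in line.split()]

    gold = read_tags("dev.tags")          # hypothetical file names
    pred = read_tags("predictions.tags")
    print("F1-BAD:", f1_score(gold, pred, pos_label="BAD", average="binary"))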

As the baseline system for this task, we use the baseline features provided above to train a binary classifier using a standard logistic regression algorithm (available, for example, in the scikit-learn toolkit).
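
A minimal sketch of such a baseline with scikit-learn is given below; the placeholder arrays stand in for the per-token baseline features and labels parsed from the released files, and the feature dimensionality and class weighting are assumptions.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Placeholder data: in practice each row holds the baseline features for one
    # target token and each label is 'OK' or 'BAD' from the training annotation.
    X_train = np.random.rand(1000, 25)            # feature dimensionality is illustrative
    y_train = np.random.choice(["OK", "BAD"], 1000)
    X_test = np.random.rand(200, 25)

    # class_weight='balanced' is one simple way to counter the skew towards 'OK'
    # tokens; the official baseline settings may differ.
    clf = LogisticRegression(max_iter=1000, class_weight="balanced")
    clf.fit(X_train, y_train)
    predicted_tags = clf.predict(X_test)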



Task 3: Document-level QE

Results here, gold-standard labels here

This task consists of predicting the quality of units larger than sentences. For practical reasons, in this first edition we use paragraphs rather than entire documents. As the application, we consider a scenario in which the reader needs to process the translation of an entire text, as opposed to individual sentences, and has no knowledge of the source language. The quality label is computed against references using METEOR (settings: exact match, not tokenised, case insensitive, capped to 1; computed with the Asiya toolkit). Participants are encouraged to devise and explore document-wide features.

For the training of prediction models, we provide a new dataset consisting of source paragraphs and their machine translations (for English-German or German-English), all in the news domain. The source paragraphs were extracted from the WMT13 test sets and translated by MT systems that participated in the translation shared task:

As test data, we provide a new set of translations produced by the same SMT systems used for the training data:

Two variants of the results can be submitted:

For each language pair, evaluation is performed against the true METEOR label and/or ranking using the same metrics as in previous years for sentence-level:

QuEst's 17 baseline features for paragraph-level are used for the baseline system. As for sentence-level, the baseline system is trained using SVM regression with an RBF kernel, with a grid search algorithm for the optimisation of relevant parameters. We use the same evaluation script as for sentence-level. For significance tests, we use the bootstrap resampling method with this code.
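
A minimal sketch of the paired bootstrap resampling idea behind the significance tests, here comparing the MAE of two systems against the gold labels (the number of resamples and the choice of MAE are assumptions; the released code should be used for the official tests).

    import numpy as np

    def paired_bootstrap(gold, sys_a, sys_b, n_resamples=1000, seed=0):
        # returns the fraction of resamples in which system A has a lower MAE than B
        gold, sys_a, sys_b = map(np.asarray, (gold, sys_a, sys_b))
        rng = np.random.default_rng(seed)
        wins = 0
        for _ in range(n_resamples):
            idx = rng.integers(0, len(gold), len(gold))   # resample segments with replacement
            mae_a = np.mean(np.abs(sys_a[idx] - gold[idx]))
            mae_b = np.mean(np.abs(sys_b[idx] - gold[idx]))
            wins += int(mae_a < mae_b)
        return wins / n_resamples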



Additional resources

We suggest the following resources, which can be used as additional training data (note the differences in language pairs, text domains and/or MT systems):

These are the resources we have used to extract the baseline features in Tasks 1 and 3:

English

  • English source training corpus
  • English language model
  • English language model of POS tags
  • English n-gram counts
  • English truecase model
Spanish

  • Spanish source training corpus
  • Spanish language model
  • Spanish language model of POS tags
  • Spanish n-gram counts
  • Spanish truecase model
German

  • German source training corpus
  • German language model
  • German language model of POS tags
  • German n-gram counts
  • German truecase model
Giza tables

  • English-Spanish Lexical translation table src-tgt
  • English-German Lexical translation table src-tgt
  • Spanish-English Lexical translation table src-tgt
  • German-English Lexical translation table src-tgt


    Submission Format

    Tasks 1 and 3: Sentence- and paragraph-level

    The output of your system for a given subtask should produce scores for the translations at the segment level of the relevant task (sentence or paragraph), formatted in the following way:

    <METHOD NAME> <SEGMENT NUMBER> <SEGMENT SCORE> <SEGMENT RANK>

    Each field should be delimited by a single tab character.
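
    For example, a (hypothetical) line for the fifth segment, with a predicted score of 0.2345 and rank 100, might look like:

    SHEF_1_SVM	5	0.2345	100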

    Task 2: Word-level QE

    The output of your system should produce scores for the translations at the word level, formatted in the following way:

    <METHOD NAME> <SEGMENT NUMBER> <WORD INDEX> <WORD> <BINARY SCORE> 

    Each field should be delimited by a single tab character.
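
    For example, a (hypothetical) line for one word of the fifth segment might look like:

    SHEF_2_SVM	5	2	casa	BAD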

    Submission Requirements

    Each participating team can submit at most 2 systems for each of the language pairs of each subtask. These should be sent via email to Lucia Specia lspecia@gmail.com. Please use the following pattern to name your files:

    INSTITUTION-NAME_TASK-NAME_METHOD-NAME, where:

    INSTITUTION-NAME is an acronym/short name for your institution, e.g. SHEF

    TASK-NAME is one of the following: 1, 2, 3.

    METHOD-NAME is an identifier for your method in case you have multiple methods for the same task, e.g. 2_J48, 2_SVM

    For instance, a submission from team SHEF for task 2 using method "SVM" could be named SHEF_2_SVM.

    You are invited to submit a short paper (4 to 6 pages) to WMT describing your QE method(s). Submitting a paper is not required; if you choose not to, we ask you to provide an appropriate reference describing your method(s) that we can cite in the WMT overview paper.

    Important dates

    Release of training data: February 15, 2015
    Release of test data: May 4, 2015
    QE metrics results submission deadline: June 2, 2015
    Paper submission deadline: June 28, 2015
    Notification of acceptance: July 21, 2015
    Camera-ready deadline: August 11, 2015

    Organisers

    Chris Hokamp (Dublin City University)
    Carolina Scarton (University of Sheffield)
    Lucia Specia (University of Sheffield)
    Varvara Logacheva (University of Sheffield)

    Contact

    For questions or comments, email Lucia Specia lspecia@gmail.com.

    Supported by the European Commission under the projects with
    grant numbers 317471 and 645452.