Shared Task: Quality Estimation

This shared task will build on its previous four editions to further examine automatic methods for estimating the quality of machine translation output at run-time, without relying on reference translations. We include word-level (and a variant at phrase-level), sentence-level and document-level estimation. The sentence, phrase and word-level tasks will explore a large dataset produced from post-editions by professional translators (as opposed to crowdsourced translations as in the previous year). For the first time, the data will be domain-specific (IT domain). The document-level task will use, for the first time, entire documents, which have been human annotated for quality indirectly in two ways: through reading comprehension tests and through a two-stage post-editing exercise. Our tasks have the following goals:

This year's shared task provides new training and test datasets for all tasks, and allows participants to explore any additional data and resources deemed relevant. An in-house MT system was used to produce translations for the sentence and word-level tasks, and multiple MT systems were used to produce translations for the document-level task. Therefore, MT system-dependent information will be made available where possible.



Task 1: Sentence-level QE

Results here, gold-standard labels here.

New: Download additional data for training, development and test: independent reference translations and post-editing time.

This task consists in scoring (and ranking) sentences according to post-editing effort. Multiple labels will be made available, including the percentage of edits needed to fix the translation (HTER), post-editing time, and keystrokes. Prediction according to each label will be evaluated independently, and any of these outputs (or their combination) can be used to produce a ranking of translations. The data consists of 15,000 segments in the IT domain, translated by an in-house phrase-based SMT system and post-edited by professional translators. The PET tool was used to collect these various types of information during post-editing. HTER labels are computed using TER (default settings: tokenised, case insensitive, exact matching only, but with scores capped to 1).

As training and development data, we provide English-German datasets with 12,000 and 1,000 source sentences respectively, their machine translations, their post-editions (translations), and HTER as post-editing effort scores (other scores, such as post-editing time, can be provided on request). Download development and training data. The data is publicly available, but since it has been provided by our industry partners it is subject to specific terms and conditions. However, these have no practical implications on the use of this data for research purposes. Download the baseline features for training and development sets. Note that some HTER scores in the distributed training and development sets exceed 100. We recommend capping scores at 100 when training models, as the test set scores will also be capped at 100.

As test data, we provide a new set of 2,000 English-German translations produced by the same SMT system used for the training data. Download test data and the baseline features. Scores will be capped at 100.

The usual 17 features used in WMT12-15 are used for the baseline system. This system uses SVM regression with an RBF kernel and a grid search algorithm for the optimisation of relevant parameters. QuEst++ is used to build the prediction models.
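For illustration only, the sketch below reproduces this kind of setup (SVR with an RBF kernel, hyperparameters chosen by grid search) with scikit-learn rather than QuEst++; the file names and the assumption of 17 whitespace-separated feature values per line are hypothetical, not part of the official release.

    # Unofficial sketch of the sentence-level baseline: SVR with an RBF kernel
    # and grid search over its hyperparameters. File names and the 17-column
    # feature layout are assumptions, not the official QuEst++ pipeline.
    import numpy as np
    from sklearn.svm import SVR
    from sklearn.model_selection import GridSearchCV

    X_train = np.loadtxt("train.features")                 # e.g. 12,000 x 17
    y_train = np.minimum(np.loadtxt("train.hter"), 100.0)  # cap labels at 100
    X_test = np.loadtxt("test.features")

    grid = GridSearchCV(
        SVR(kernel="rbf"),
        param_grid={"C": [1, 10, 100],
                    "gamma": [0.001, 0.01, 0.1],
                    "epsilon": [0.1, 0.2]},
        scoring="neg_mean_absolute_error",
        cv=5,
    )
    grid.fit(X_train, y_train)
    np.savetxt("predictions.txt", grid.predict(X_test), fmt="%.4f")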

Evaluation is performed against the true label and/or ranking using the following metrics:
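The concrete metric list is not reproduced here. Purely as an unofficial illustration, scoring for this kind of task is commonly assessed with Pearson's correlation and mean absolute error, and ranking with Spearman's correlation; the sketch below assumes those metrics and hypothetical file names.

    # Unofficial illustration of typical sentence-level QE evaluation:
    # Pearson's r and MAE for scoring, Spearman's rho for ranking. Check the
    # official evaluation scripts for the exact metric set; file names here
    # are hypothetical.
    import numpy as np
    from scipy.stats import pearsonr, spearmanr

    gold = np.loadtxt("test.hter")
    pred = np.loadtxt("predictions.txt")

    print("Pearson r: %.4f" % pearsonr(gold, pred)[0])
    print("MAE:       %.4f" % np.mean(np.abs(gold - pred)))
    print("Spearman:  %.4f" % spearmanr(gold, pred)[0])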



Task 2: Word and phrase-level QE

Results here, gold-standard labels here.

The goal of this task is to study the prediction of word and phrase-level errors in MT output. For practical reasons, we frame the problem as the binary task of distinguishing between 'OK' and 'BAD' tokens. The data for this task is the same as provided in Task 1, with English-German machine translations.

For the word-level variant, as in previous years, all segments are automatically annotated for errors with binary word-level labels by using the alignments provided by the TER tool (settings: tokenised, case insensitive, exact matching only, disabling shifts by using the `-d 0` option) between machine translations and their post-edited versions. Shifts (word order errors) were not annotated as such (but rather as deletions + insertions) to avoid introducing noise in the annotation.
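The official labels come from the TER tool with the settings above. Purely as an illustration of the idea, the sketch below labels MT tokens 'OK' when they can be aligned monotonically to the post-edited sentence and 'BAD' otherwise; this is a rough approximation, not the TER alignment.

    # Rough illustration of OK/BAD word labelling: MT tokens that align
    # monotonically to the post-edited sentence are 'OK', the rest 'BAD'.
    # The official labels are produced with the TER tool (tokenised,
    # case-insensitive, exact matching, shifts disabled); this LCS-based
    # approximation only conveys the idea.
    from difflib import SequenceMatcher

    def label_mt_tokens(mt_tokens, pe_tokens):
        labels = ["BAD"] * len(mt_tokens)
        matcher = SequenceMatcher(a=[t.lower() for t in mt_tokens],
                                  b=[t.lower() for t in pe_tokens])
        for block in matcher.get_matching_blocks():
            for i in range(block.a, block.a + block.size):
                labels[i] = "OK"
        return labels

    mt = "the house blue is small".split()
    pe = "the blue house is small".split()
    print(label_mt_tokens(mt, pe))  # ['OK', 'OK', 'BAD', 'OK', 'OK']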

As training and development data, we provide the tokenised translation outputs with tokens annotated with 'OK' or 'BAD' labels. Download development and training data. Please download baseline features for training and development sets from here (updated 4/4/16). The data is publicly available but since it has been provided by our industry partners it is subject to specific terms and conditions. However, these have no practical implications on the use of this data for research purposes.

As test data, we provide tokens from an additional 2,000 English-German sentences, produced in the same way. Download test data (and baseline features).

Submissions are evaluated in terms of classification performance via the multiplication of the F1-scores for the 'OK' and 'BAD' classes against the original labels. The F1-score for the 'BAD' class, which was used as the primary metric in previous years, is biased towards 'pessimistic' labellings: it favours systems that tend to label more words as 'BAD'. In contrast, the multiplication of F1-OK and F1-BAD has two components which penalise different labellings and balance each other. 'Unfair' labellings (where either F1-OK or F1-BAD is close to zero) will have a score close to zero, and the overall score is never greater than any of its components. We will also report the F1-BAD score. Evaluation script. We compute significance levels using the approximate randomisation method with this script.
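For reference, a small sketch of how this metric can be computed, assuming flat lists of gold and predicted tags; the evaluation script linked above remains the authoritative implementation.

    # F1-mult for word-level QE: F1('OK') * F1('BAD'), plus F1-BAD on its own.
    # Assumes flattened gold and predicted tag lists.
    from sklearn.metrics import f1_score

    def f1_scores(gold, pred):
        f1_ok = f1_score(gold, pred, pos_label="OK")
        f1_bad = f1_score(gold, pred, pos_label="BAD")
        return f1_ok * f1_bad, f1_bad

    gold = ["OK", "OK", "BAD", "OK", "BAD", "OK"]
    pred = ["OK", "BAD", "BAD", "OK", "OK", "OK"]
    f1_mult, f1_bad = f1_scores(gold, pred)
    print("F1-mult: %.4f  F1-BAD: %.4f" % (f1_mult, f1_bad))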

As an extension of the word-level task, we introduce a new task: phrase-level prediction. For this task, given a "phrase" (segmentation as given by the SMT decoder), participants are asked to label it as 'OK' or 'BAD'. Errors made by MT engines are interdependent and one incorrectly chosen word can cause more errors, especially in its local context. Phrases as produced by SMT decoders can be seen as a representation of this local context and in this task we ask participants to consider them as atomic units, using phrase-specific information to improve upon the results of the word-level task.

The data to be used is exactly the same as for Task 1 and the word-level task. The labelling was adapted from the word-level labelling by assigning the 'BAD' tag to any phrase that contains at least one 'BAD' word.

As training and development data, we provide the tokenised translation outputs with phrase segmentation for both source and machine-translated sentences. We also provide target-source phrase alignments and phrase-level labels in separate files. Download development and training data. Please download baseline features for training and development sets from here (updated 4/4/16). The data is publicly available but since it has been provided by our industry partners it is subject to specific terms and conditions. However, these have no practical implications on the use of this data for research purposes.

As test data, we provide tokens from an additional 2,000 English-German sentences, produced in the same way. Download test data (and baseline features).

Submissions to the phrase-level task are evaluated in terms of the multiplication of word-level F1-OK and word-level F1-BAD. We will use the test set labelled at the word level, but its labels will be converted to agree with phrase boundaries: if a phrase has at least one 'BAD' word, all of its labels are replaced with 'BAD'.

For example, the sequence

OK OK || BAD OK OK || OK || BAD OK || OK OK

will be converted to:

OK OK || BAD BAD BAD || OK || BAD BAD || OK OK
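This conversion can be done mechanically; a small sketch operating on the '||'-delimited representation used above:

    # Propagate 'BAD' over phrase boundaries: if any word inside a phrase is
    # 'BAD', every word in that phrase becomes 'BAD'. Uses the same
    # '||'-delimited representation as the example above.
    def propagate_bad(sequence):
        phrases = [p.split() for p in sequence.split("||")]
        converted = []
        for phrase in phrases:
            tags = ["BAD"] * len(phrase) if "BAD" in phrase else phrase
            converted.append(" ".join(tags))
        return " || ".join(converted)

    print(propagate_bad("OK OK || BAD OK OK || OK || BAD OK || OK OK"))
    # -> OK OK || BAD BAD BAD || OK || BAD BAD || OK OK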

As the baseline system for the word- and phrase-level tasks, we will use the baseline features provided above to train a CRF model with the CRFSuite tool.
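As a rough, unofficial illustration of that kind of model, the sketch below uses the python-crfsuite bindings to CRFSuite; the hand-crafted features and the single training sentence are placeholders for the distributed baseline feature files.

    # Sketch of a CRF tagger for OK/BAD labels via python-crfsuite (bindings
    # to CRFSuite). The toy features and single training sentence stand in
    # for the baseline feature files distributed above.
    import pycrfsuite

    def token_features(tokens, i):
        return {
            "token": tokens[i].lower(),
            "prev": tokens[i - 1].lower() if i > 0 else "<s>",
            "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
        }

    sentence = "das Haus ist blau".split()
    labels = ["OK", "OK", "OK", "BAD"]

    trainer = pycrfsuite.Trainer(verbose=False)
    trainer.append([token_features(sentence, i) for i in range(len(sentence))],
                   labels)
    trainer.set_params({"c1": 0.1, "c2": 0.01, "max_iterations": 100})
    trainer.train("word_level_baseline.crfsuite")

    tagger = pycrfsuite.Tagger()
    tagger.open("word_level_baseline.crfsuite")
    print(tagger.tag([token_features(sentence, i) for i in range(len(sentence))]))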



Task 3: Document-level QE

Results here, gold-standard labels here.

This task consists in predicting the quality of units larger than sentences. In contrast to WMT14, this year we consider entire documents instead of paragraphs. The data was extracted from the WMT 2008-2013 English-Spanish translation shared task datasets. The machine translation for each source document was randomly picked from the set of all systems that participated in the task.

The quality labels were computed based on human annotation: an adaptation of HTER obtained from a two-stage post-editing approach similar to the one described in (Scarton et al., 2014). These labels attempt to capture the quality of different documents translated by various MT systems by isolating quality issues that can only be fixed when the entire document is available from other types of errors, which can be fixed based on the sentence alone. In the first stage, sentences are post-edited in isolation and in random order. In the second stage, these post-edited sentences are reorganised into their original document order and further edited (by the same post-editor), now with the document as context. The difference in the percentage of edits between the first and second stages is then used to weight the final HTER quality score. The goal is to penalise documents that needed more editing in the second stage. Post-editing was done by professional translators.
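To make the two quantities concrete, the sketch below computes an edit rate for each stage with a plain word-level edit distance (HTER proper is computed with TER, which also handles shifts). No particular weighting of the two stages is implied; the exact weighting scheme is not reproduced here, and the sentences are made up.

    # Illustrates the two quantities behind the document-level label: the edit
    # rate of stage 1 (MT -> PE1, sentences in isolation) and of stage 2
    # (PE1 -> PE2, with document context). A plain word-level edit distance
    # stands in for TER, and no particular weighting of the stages is implied.
    def edit_rate(hyp, ref):
        hyp, ref = hyp.split(), ref.split()
        d = [[i + j if i * j == 0 else 0 for j in range(len(ref) + 1)]
             for i in range(len(hyp) + 1)]
        for i in range(1, len(hyp) + 1):
            for j in range(1, len(ref) + 1):
                d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                              d[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1]))
        return d[-1][-1] / max(len(ref), 1)

    mt  = "the contract of maintenance is expired"
    pe1 = "the maintenance contract is expired"   # stage 1: sentence in isolation
    pe2 = "the maintenance contract has expired"  # stage 2: with document context

    stage1, stage2 = edit_rate(mt, pe1), edit_rate(pe1, pe2)
    print("stage 1: %.2f  stage 2: %.2f  difference: %.2f"
          % (stage1, stage2, stage2 - stage1))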

For the training of prediction models, we provide a new dataset consisting of source documents and their machine translations (English-Spanish), all in the news domain, extracted from the test sets of WMT 2008-2013 and the MT systems that participated in the translation shared tasks:

As test data, we provide a new set of translations for English-Spanish documents, produced in the same way as the training data. Download test data. Download the 17 baseline features.

Evaluation is performed against the true quality label and/or ranking using the following metrics:

The 17 QuEst++ baseline features for document-level QE will be used for the baseline system. As with the sentence-level task, the baseline system is trained using SVM regression with an RBF kernel and a grid search algorithm for the optimisation of relevant parameters.



Additional resources

These are the resources we have used to extract the baseline features in Task 1, which can also be useful for Task 2. If you require other resources/info from the MT system, let us know:

English

  • English language model
  • English n-gram counts

German

  • German language model
  • German n-gram counts

Giza tables

  • English-German (and v.v.) lexical translation table

Task 3 uses multiple MT systems on WMT data, so the usual news translation task data resources can be used.

We also suggest the following interesting resources that can be used as additional data for training (note the differences in language pairs and/or text domains and/or MT systems):



Submission Format

Tasks 1 and 3: Sentence- and document-level

The output of your system for a given subtask should contain scores for the translations at the segment level of the relevant task (sentence or document), formatted in the following way:

<METHOD NAME> <SEGMENT NUMBER> <SEGMENT SCORE> <SEGMENT RANK>

where each field is delimited by a single tab character.
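For example, a line of a (purely hypothetical) Task 1 submission could look like:

SHEF_SVM	42	35.27	102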

Task 2: Word-level QE

The output of your system should contain scores for the translations at the word level, formatted in the following way:

<METHOD NAME> <SEGMENT NUMBER> <WORD INDEX> <WORD> <BINARY SCORE>

where each field is delimited by a single tab character.
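For example, a line of a (purely hypothetical) Task 2 submission could look like:

SHEF_CRF	42	3	Haus	BAD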

Submission Requirements

Each participating team can submit at most 2 systems for each of the language pairs of each subtask. These should be sent via email to Lucia Specia (lspecia@gmail.com). Please use the following pattern to name your files:

INSTITUTION-NAME_TASK-NAME_METHOD-NAME, where:

INSTITUTION-NAME is an acronym/short name for your institution, e.g. SHEF

TASK-NAME is one of the following: 1, 2, 3.

METHOD-NAME is an identifier for your method in case you have multiple methods for the same task, e.g. 2_J48, 2_SVM

For instance, a submission from team SHEF for task 2 using method "SVM" could be named SHEF_2_SVM.

You are invited to submit a short paper (4 to 6 pages) to WMT describing your QE method(s). You are not required to submit a paper if you do not want to. In that case, we ask you to give an appropriate reference describing your method(s) that we can cite in the WMT overview paper.

Important dates

Release of training data: January 30, 2016
Release of test data: April 10, 2016
QE metrics results submission deadline: May 6, 2016
Paper submission deadline: May 15, 2016
Notification of acceptance: June 5, 2016
Camera-ready deadline: June 22, 2016

Organisers

Varvara Logacheva (University of Sheffield)
Carolina Scarton (University of Sheffield)
Lucia Specia (University of Sheffield)

Contact

For questions or comments, email Lucia Specia: lspecia@gmail.com.

Supported by the European Commission under the projects with grant numbers 317471 and 645452.