Quality Estimation Task - EMNLP Fifth Conference on Machine Translation

Shared Task: Quality Estimation

**UPDATE** -- Official results and submissions are available.

Important dates

Release of training and dev data	March 19, 2020
Release of test data	May 5, 2020
QE result submission deadline	July 15, 2020
Paper submission deadline	August 15, 2020
Notification of acceptance	September 29, 2020
Camera-ready deadline	October 10, 2020
Conference	November 19-20, 2020

Overview

This shared task will build on its previous editions to further examine automatic methods for estimating the quality of neural machine translation output at run-time, without relying on reference translations. As in previous years, we cover estimation at various levels. Important elements introduced this year include: a new task where sentences are annotated with Direct Assessment (DA) scores instead of labels based on post-editing; a new multilingual sentence-level dataset mainly from Wikipedia articles, where the source articles can be retrieved for document-wide context; the availability of NMT models to explore system-internal information for the task.

In addition to generally advancing the state of the art at all prediction levels for modern neural MT, our specific goals are:

to create a new set of public benchmarks for tasks in quality estimation,
to investigate models for predicting DA scores and their relationship with models trained for predicting post-editing effort,
to study the feasibility of multilingual (or even language independent) approaches to QE,
to study the influence of source-language document-level context for the task of QE, and
to analyse the applicability of NMT model information for QE.

The datasets and models released are publicly available. Participants are also allowed to explore any additional data and resources deemed relevant. Below are the three QE tasks addressing these goals.

Task 1: Sentence-Level Direct Assessment

This task will use Wikipedia data for 6 language pairs: high-resource English--German (En-De) and English--Chinese (En-Zh), medium-resource Romanian--English (Ro-En) and Estonian--English (Et-En), and low-resource Sinhalese--English (Si-En) and Nepalese--English (Ne-En). It also includes a dataset with a combination of Wikipedia articles and Reddit articles for Russian--English (Ru-En). The datasets were collected by translating sentences sampled from source-language articles using state-of-the-art NMT models built with the fairseq toolkit, and were annotated with Direct Assessment (DA) scores by professional translators. Each sentence was annotated following the FLORES setup, a form of DA in which at least three professional translators rate each sentence from 0 to 100 according to the perceived translation quality. DA scores are standardised using the z-score by rater. Participating systems are required to score sentences according to z-standardised DA scores.
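The per-rater standardisation described above can be sketched as follows. This is a minimal illustration rather than the organisers' pipeline: the function name, the input layout (rater, segment, raw score) and the use of the population standard deviation are all assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev


def z_mean_by_segment(ratings):
    """Standardise raw DA scores per rater, then average per segment.

    ratings: iterable of (rater_id, segment_id, raw_score) tuples,
             with raw_score on the 0-100 scale.
    Returns a dict mapping segment_id to the mean z-score over raters.
    """
    ratings = list(ratings)

    # Collect every raw score given by each rater.
    scores_by_rater = defaultdict(list)
    for rater, _, score in ratings:
        scores_by_rater[rater].append(score)

    # Mean and (population) standard deviation per rater. Assumption:
    # the source does not say which standard deviation is used.
    stats = {r: (mean(s), pstdev(s)) for r, s in scores_by_rater.items()}

    z_by_segment = defaultdict(list)
    for rater, segment, score in ratings:
        mu, sigma = stats[rater]
        # A rater with constant scores has sigma == 0; map those to 0.0
        # (an assumption made here to avoid division by zero).
        z = (score - mu) / sigma if sigma > 0 else 0.0
        z_by_segment[segment].append(z)

    return {seg: mean(zs) for seg, zs in z_by_segment.items()}
```

Standardising by rater removes differences in how strictly individual translators use the 0-100 scale before their judgements are averaged.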

Data: Download the training and development data consisting of the following Wikipedia datasets, all with 7K sentences for training, 1K sentences for development, including info from the NMT model used to generate the translations: model score for the sentence and log probabilities for words, as well as the title of the Wikipedia article where the source sentence came from:

English-German
English-Chinese
Romanian-English
Estonian-English
Nepalese-English
Sinhala-English

You can download here the NMT models used to generate the translations.

We also provide a dataset with a combination of Russian Reddit forums (75%) and Russian WikiQuotes (25%), also with 7K sentences for training, 1K sentences for development and the same meta-information as above:

Russian-English: training and development data, and NMT model. Here are details on the training data used to build this model.

Baseline: The baseline system is a neural predictor-estimator approach implemented in OpenKiwi (Kepler et al., 2019), where the predictor model will be trained on the parallel data used to train the NMT model (see data below). To foster improvements over this baseline, we are providing the trained predictor models for all language pairs (they can be used for both Task 1 and Task 2):

Test data: Download the test data consisting of 1K source and machine translated sentences, including info from the NMT model used to generate the translations: model score for the sentence and log probabilities for words, as well as the title of the Wikipedia article where the source sentence came from. Download the corresponding Russian-English test data.

Evaluation: Sentence-level submissions will be evaluated in terms of Pearson's correlation between the DA predictions and human DA (z-standardised mean DA score, i.e. z_mean). These are the official evaluation scripts. The evaluation will focus on multilingual systems, i.e. systems that are able to provide predictions for all languages in the Wikipedia domain. Therefore, the average Pearson correlation across all these languages will be used to rank QE systems. We will also evaluate QE systems on a per-language basis for those interested in particular languages.
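As a rough illustration of the ranking criterion, the sketch below computes Pearson's correlation per language pair and then averages it. It is not the official evaluation script: the function names and the dictionary layout (language pair mapped to a list of scores) are assumptions made for the example.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation between two equally long score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


def average_pearson(predicted, gold):
    """predicted/gold: dicts mapping a language pair to a score list.

    Returns (mean correlation over language pairs, per-pair correlations).
    """
    per_pair = {lp: pearson(predicted[lp], gold[lp]) for lp in gold}
    return sum(per_pair.values()) / len(per_pair), per_pair
```

A system that omits a language pair raises a KeyError here, which mirrors the requirement that multilingual systems provide predictions for all languages.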

Task 2: Word and Sentence-Level Post-editing Effort

This task evaluates the application of QE for post-editing purposes. It consists of predicting:

Word-level tags. This is done both on source side (to detect which words caused errors) and target side (to detect mistranslated or missing words).

Target. Each token is tagged as either OK or BAD. Additionally, each gap between two words is tagged as BAD if one or more missing words should have been there, and OK otherwise. Note that the number of tags for each target sentence is 2*N+1, where N is the number of tokens in the sentence.
Source. Tokens are tagged as OK if they were correctly translated, and BAD otherwise. Gaps are not tagged.

Sentence-level HTER scores. HTER (Human-targeted Translation Edit Rate) is the ratio between the number of edits (insertions/deletions/replacements) needed and the reference translation length.
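The ratio above can be approximated with a word-level edit distance between the MT output and its post-edited version. This is only a sketch under stated assumptions: it counts insertions, deletions and replacements but no shifts, it splits tokens on whitespace, and it is case sensitive. Scores produced by the TER tool may therefore differ, and the function names are hypothetical.

```python
def word_edit_distance(hyp_tokens, ref_tokens):
    """Levenshtein distance over tokens (insert/delete/replace cost 1)."""
    prev = list(range(len(ref_tokens) + 1))
    for i, hyp_tok in enumerate(hyp_tokens, start=1):
        curr = [i]
        for j, ref_tok in enumerate(ref_tokens, start=1):
            curr.append(min(
                prev[j] + 1,                         # delete hyp_tok
                curr[j - 1] + 1,                     # insert ref_tok
                prev[j - 1] + (hyp_tok != ref_tok),  # match or replace
            ))
        prev = curr
    return prev[-1]


def approx_hter(mt, post_edit):
    """Edits needed divided by the length of the post-edited sentence."""
    hyp, ref = mt.split(), post_edit.split()
    if not ref:
        raise ValueError("post-edited sentence must not be empty")
    return word_edit_distance(hyp, ref) / len(ref)
```

Because shifts are not modelled, a reordered phrase is counted as deletions plus insertions, so this approximation can only overestimate the shift-aware score.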

Data: The data is a subset of the data used in Task 1 for two of the languages, consisting of the same 7K sentences for training and 1K sentences for development.

Test data (blind):

Data preparation: Word-level labels have been obtained by using the alignments provided by the TER tool (settings: tokenised, case insensitive, exact matching only, disabling shifts by using the `-d 0` option) between machine translations and their post-edited versions. Shifts (word order errors) were not annotated as such (but rather as deletions + insertions) to avoid introducing noise in the annotation.

HTER values are obtained deterministically from word-level tags. However, when computing HTER, we allow shifts in TER.

Evaluation: For sentence-level QE, submissions are evaluated in terms of the Pearson's correlation metric for the sentence-level HTER prediction. For word-level QE, they will be evaluated in terms of MCC (Matthews correlation coefficient).
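For reference, MCC over OK/BAD tags can be computed from the confusion counts as sketched below. This is an illustration, not the official script: the function name is hypothetical, BAD is treated as the positive class, and returning 0.0 when the denominator is zero is a common convention assumed here.

```python
import math


def mcc(gold_tags, predicted_tags, positive="BAD"):
    """Matthews correlation coefficient for two OK/BAD tag sequences."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("tag sequences must have the same length")

    tp = fp = tn = fn = 0
    for gold, pred in zip(gold_tags, predicted_tags):
        if pred == positive and gold == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif gold == positive:
            fn += 1
        else:
            tn += 1

    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: MCC is 0.0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Unlike accuracy, MCC is not inflated by predicting the majority tag for every token, which matters because OK tags are usually far more frequent than BAD tags.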

These are the official evaluation scripts.

Task 3: Document-Level QE

The goal of this task is to predict document-level quality scores as well as fine-grained annotations.

Each document has a product title and its description, and is annotated for translation errors according to the MQM framework. Each error annotation has:

Word span(s). Errors may consist of one or more words, not necessarily contiguous.
Severity. An error can be minor (if it doesn't lead to a loss of meaning and it doesn't confuse or mislead the user), major (if it changes the meaning) or critical (if it changes the meaning and carries any type of implication, or could be seen as offensive).
Type. A label specifying the error type, such as wrong word order, missing words, agreement, etc. They may provide additional information, but systems don't need to predict them.

Additionally, there are document-level scores (called MQM scores). They were generated from the error annotations using the method in this paper (footnote 6).

Systems may return not only predicted MQM scores, but also (optionally) fine-grained predicted annotations along with a severity level. They are encouraged to do so.

Data: The data is derived from the Amazon Product Reviews dataset and contains 1,448/200 English-French training/development documents.

English-French

Test data (blind):

English-French

Note: The training data for this year is the combination of the training data used in 2018 with the test sets of 2018 and 2019. The development set is the same as in 2018.

Baseline: The baseline system will be a system derived from the word-level baseline from Task 2 augmented with severity labels, which are then converted to phrases and used to compute MQM scores.

Evaluation: Submissions will be evaluated as in Task 1, in terms of Pearson's correlation between the true and predicted MQM document-level scores. Additionally, the predicted annotations will be evaluated in terms of their F1 scores with respect to the gold annotations.

The official evaluation scripts are available.

Submission Format

Tasks 1 and 2 (sentence-level)

The output of your system for the sentence-level subtask should be a single file with the predicted score for each sentence, formatted as:

<LANGUAGE PAIR> <METHOD NAME> <SEGMENT NUMBER> <SEGMENT SCORE>

Where:

LANGUAGE PAIR is the ID (e.g. en-de) of the language pair of the plain text translation file you are scoring.
METHOD NAME is the name of your quality estimation method.
SEGMENT NUMBER is the line number of the plain text translation file you are scoring.
SEGMENT SCORE is the predicted (DA/HTER) score for the particular segment.

Each field should be delimited by a single tab character.
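A submission file in this format could be produced along the following lines. The function name, the six-decimal score formatting and the `first_index` parameter are assumptions: the text above does not say whether segment numbers start at 0 or 1, so this should be checked against the official evaluation scripts.

```python
def format_sentence_predictions(language_pair, method_name, scores,
                                first_index=0):
    """Return the submission text: one tab-separated line per segment.

    first_index is the number given to the first segment (assumption:
    0 by default; verify against the official evaluation scripts).
    """
    lines = []
    for segment_number, score in enumerate(scores, start=first_index):
        fields = [language_pair, method_name,
                  str(segment_number), f"{score:.6f}"]
        lines.append("\t".join(fields))
    return "\n".join(lines) + "\n"


# Example: write predictions for two segments to a file.
if __name__ == "__main__":
    text = format_sentence_predictions("en-de", "MY-QE-SYSTEM", [0.5, -1.25])
    with open("predictions.txt", "w", encoding="utf-8") as handle:
        handle.write(text)
```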

Task 2 (word-level)

The output for the word-level subtask can be up to two separate files: one with MT labels (for words and gaps) and another one with source words. You can submit for either of these subtasks or both of them, independently. The output format should be the same as in the .tags and .source_tags files in the training data; i.e., the .tags file should be formatted as:

GAP_1 WORD_1 GAP_2 WORD_2 ... GAP_n WORD_n GAP_n+1

and the .source_tags file should be:

WORD_1 WORD_2 ... WORD_n

Where each WORD_i and GAP_i is either OK or BAD. Tags must be delimited by whitespace or a tab character.

For MT labels, each sentence will therefore correspond to 2n+1 tags (where n is the number of words in the sentence), alternating gaps and words, in order. For source labels, each sentence will correspond to n tags.

For example, consider the following MT sentence and its post-edited version. The wrong words in the MT output are übergeordnete, von and selbst:

anschließend wird in jeder Methode die übergeordnete Superclass-Version von selbst aufgerufen .
anschließend wird in jeder Methode die Superclass-Version dieser Methode aufgerufen .

For this translation, output tags should be as follows. Tags alternate between gaps and words, starting and ending with a gap, so tags in odd positions refer to gaps and tags in even positions refer to words. In this example, all BAD words can be either removed or replaced by other words in the post-edited text; since no insertions are necessary, all gaps are tagged as OK.

OK OK OK OK OK OK OK OK OK OK OK OK OK BAD OK OK OK BAD OK BAD OK OK OK OK OK
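The interleaving of gap and word tags can be checked with a short script such as the one below. It is a sketch with hypothetical function names; it assumes whitespace-separated tags and whitespace-tokenised MT output.

```python
def split_mt_tags(tag_line):
    """Split one line of MT tags into (gap_tags, word_tags)."""
    tags = tag_line.split()
    if len(tags) % 2 == 0:
        raise ValueError("expected an odd number of tags (2n+1)")
    if any(tag not in ("OK", "BAD") for tag in tags):
        raise ValueError("tags must be OK or BAD")
    # Gaps sit at positions 1, 3, 5, ... and words at 2, 4, 6, ...
    return tags[0::2], tags[1::2]


def bad_words(mt_sentence, tag_line):
    """Return the MT tokens whose word tag is BAD."""
    tokens = mt_sentence.split()
    _, word_tags = split_mt_tags(tag_line)
    if len(tokens) != len(word_tags):
        raise ValueError("number of word tags must equal number of tokens")
    return [tok for tok, tag in zip(tokens, word_tags) if tag == "BAD"]


mt = ("anschließend wird in jeder Methode die übergeordnete "
      "Superclass-Version von selbst aufgerufen .")
tags = ("OK OK OK OK OK OK OK OK OK OK OK OK OK BAD OK OK OK BAD "
        "OK BAD OK OK OK OK OK")
print(bad_words(mt, tags))
```

With the example sentence, the 12 MT tokens yield 25 tags, 13 of which are gap tags.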

Task 3 (document-level MQM)

The output of your system for the MQM subtask should produce scores for the translations at the document level. Since documents are organized in different directories, you also need to identify which document a score is assigned to. Each output line should be formatted in the following way:

<METHOD NAME> <DOCUMENT ID> <DOCUMENT SCORE>

Where:

METHOD NAME is an identifier of the system used; it is not important for the evaluation result.
DOCUMENT ID is the identifier of the translation you are scoring; it is the name of the corresponding directory.
DOCUMENT SCORE is the predicted MQM score for the document.

Lines can be in any order. Each field should be delimited by a single tab character.

Example of the document-level format:

BERT	B00014ZYNS	00.000
BERT	B0001NECEG	11.111
BERT	B0002GRV72	22.222

The example shows that the documents under "B00014ZYNS", "B0001NECEG" and "B0002GRV72" received predicted quality scores of 00.000, 11.111 and 22.222, respectively.

For the fine-grained annotation subtask, systems will have to predict which text spans contain translation errors, as well as classify them as minor, major or critical. Two or more spans can be part of the same error annotation (for example, in agreement errors in which a noun and an adjective are not adjacent).

The system output format is similar to the annotations.tsv files in the training data, but should include the document id. Each line in the output refers to a single error annotation (containing one or more spans) and should be formatted like this:

<METHOD NAME> <DOCUMENT ID> <LINES> <SPAN START POSITIONS> <SPAN LENGTHS> <SEVERITY>

Where:

METHOD NAME is an identifier of the system used; it is not important for the evaluation result.
DOCUMENT ID is the containing folder, as in the MQM subtask.
LINES is a list of lines containing the error spans, starting from 0 and separated by white space.
SPAN START POSITIONS is a list of the character offsets in which the spans begin, separated by white space, and also starting from 0. The number of start positions should be the same as in lines.
SPAN LENGTHS is a list of the lengths, in number of characters, of the error spans. The number of lengths must be the same as the start positions. Spans should not overlap.
SEVERITY is either minor, major or critical.

Note that while the training data includes the error category (such as missing words or word order), this field is not necessary in the system output.

For example, consider the following translation and its error spans. Notice that two of the spans (carabine and équipé on the third line) are part of the same annotation (they indicate an agreement error), while another span includes only a white space character (it indicates a missing word).

Kit de Remington AirMaster 77 avec portée
La carabine à air Crosman Remington AirMaster 77 multi-pompe pneumatique lance des boulettes à une énorme 725 les pieds par seconde (755 fps avec BBs), donc vous pouvez frapper efficacement vos cibles à distance.
La carabine à air comprimé est équipé d’un guidon fibre optique et complètement réglable cran de mire qui le rendent facile à acquérir votre cible.

Considering this document is under the directory B0002IL6WQ, these error spans can be described with the following output:

BERT	B0002IL6WQ	0	35	6	minor
BERT	B0002IL6WQ	1	110	3	major
BERT	B0002IL6WQ	2	49	1	major
BERT	B0002IL6WQ	2 2	3 31	8 6	minor
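To make the offsets concrete, the sketch below parses an annotation line and recovers the text of each span from the document. The function names are hypothetical, and the sketch assumes offsets count Unicode characters, which is consistent with the example above.

```python
SEVERITIES = {"minor", "major", "critical"}


def parse_annotation(line):
    """Parse one tab-separated annotation line.

    Returns (document_id, [(line, start, length), ...], severity).
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError("expected 6 tab-separated fields")
    _method, doc_id, lines, starts, lengths, severity = fields
    line_ids = [int(x) for x in lines.split()]
    start_ids = [int(x) for x in starts.split()]
    span_lengths = [int(x) for x in lengths.split()]
    if not (len(line_ids) == len(start_ids) == len(span_lengths)):
        raise ValueError("lines, starts and lengths must align")
    if severity not in SEVERITIES:
        raise ValueError("unknown severity: " + severity)
    return doc_id, list(zip(line_ids, start_ids, span_lengths)), severity


def span_texts(document_lines, spans):
    """Return the text covered by each (line, start, length) span."""
    return [document_lines[ln][start:start + length]
            for ln, start, length in spans]


document = [
    "Kit de Remington AirMaster 77 avec portée",
    "La carabine à air Crosman Remington AirMaster 77 multi-pompe "
    "pneumatique lance des boulettes à une énorme 725 les pieds par "
    "seconde (755 fps avec BBs), donc vous pouvez frapper efficacement "
    "vos cibles à distance.",
    "La carabine à air comprimé est équipé d’un guidon fibre optique et "
    "complètement réglable cran de mire qui le rendent facile à acquérir "
    "votre cible.",
]

_, spans, severity = parse_annotation(
    "BERT\tB0002IL6WQ\t2 2\t3 31\t8 6\tminor")
print(severity, span_texts(document, spans))
```

Running the same check on a system's own output is a quick way to catch off-by-one errors in line numbers or character offsets before submitting.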

Additional Resources

These are the parallel data used to train the NMT models for tasks 1 and 2:

Useful Software

Here is some open-source software for QE that might be useful for participants:

Submission Requirements

Each participating team can submit at most 30 systems for each of the language pairs of each subtask, except for the multilingual track of task 1 (5 systems max). These should be submitted to a CODALAB page for each subtask:

Please check that your system output on the dev data is correctly read by the official evaluation scripts.

Organisers

Lucia Specia (Imperial College London, University of Sheffield, Facebook)
Marina Fomicheva (University of Sheffield)
Frédéric Blain (University of Sheffield, University of Wolverhampton)
Paco Guzmán (Facebook)
Vishrav Chaudhary (Facebook)
Erick Fonseca (Instituto de Telecomunicações)
André Martins (Instituto de Telecomunicações, Unbabel)

Contact

For questions or comments on Task 1, email lspecia@gmail.com.
For questions or comments on Tasks 2 and 3, email erickrfonseca@gmail.com.