Shared Task: Quality Estimation



**UPDATE** -- Official results available.

Important dates

Release of training and dev data: April 10th, 2021
Release of test data: early June, 2021
Test predictions deadline: July 26th, 2021, end of day AOE (extended from July 20th, 2021)

Overview

This shared task focuses on automatic methods for estimating the quality of neural machine translation output at run-time, without relying on reference translations. It covers estimation at sentence and word levels. The main new elements introduced this year are: (i) a zero-shot sentence-level prediction task to encourage language-independent and unsupervised approaches; (ii) a task on predicting catastrophic (critical) translation errors, i.e. errors that make the translation convey a completely different meaning, which could lead to negative effects such as safety risks. In addition, we release new test sets for 2020's Tasks 1 and 2, and an extended version of the Wikipedia post-editing training data, going from 2 to 7 language pairs. Finally, for all tasks, participants will be asked to provide information on their model size (disk space without compression and number of parameters) with their submission and will be able to rank systems based on it.
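As an illustration of the two model-size figures requested (assuming a PyTorch model saved as a single uncompressed checkpoint file; this is not part of any official tooling), they could be computed as follows:

```python
# Illustration only: disk footprint in bytes (uncompressed) and total number
# of parameters, assuming a PyTorch model and a single checkpoint file.
import os
import torch

def model_size(model: torch.nn.Module, checkpoint_path: str):
    """Return (disk footprint in bytes, total number of parameters)."""
    n_params = sum(p.numel() for p in model.parameters())
    disk_bytes = os.path.getsize(checkpoint_path)  # uncompressed size on disk
    return disk_bytes, n_params
```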

In addition to generally advancing the state of the art in quality estimation, our specific goals are:

For all tasks, the datasets and NMT models that generated the translations are publicly available.

Participants are also allowed to explore any additional data and resources deemed relevant. Below are the three QE tasks addressing these goals.



Task 1: Sentence-Level Direct Assessment

This task offers the same training data as the WMT2020 Task 1: Wikipedia data for 6 language pairs that includes high-resource English--German (En-De) and English--Chinese (En-Zh), medium-resource Romanian--English (Ro-En) and Estonian--English (Et-En), and low-resource Sinhalese--English (Si-En) and Nepalese--English (Ne-En), as well as a dataset with a combination of Wikipedia articles and Reddit articles for Russian-English (Ru-En). The datasets were collected by translating sentences sampled from source language articles using state-of-the-art Transformer NMT models and annotated with a variant of Direct Assessment (DA) scores by professional translators. Each sentence was annotated following the FLORES setup, which presents a form of DA, where at least three professional translators rate each sentence from 0-100 according to the perceived translation quality. DA scores are standardised using the z-score by rater. Participating systems are required to score sentences according to z-standardised DA scores.
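For reference, here is a minimal sketch of the per-rater z-standardisation described above (the released data already contains the resulting z_mean labels; the column names used here are hypothetical):

```python
# Minimal sketch of per-rater z-standardisation of DA scores, then averaging
# into one z_mean per segment. Not the official annotation pipeline.
import pandas as pd

def z_mean(ratings: pd.DataFrame) -> pd.Series:
    """Standardise raw 0-100 DA scores within each rater, then average per segment."""
    by_rater = ratings.groupby("rater")["da_score"]
    z = (ratings["da_score"] - by_rater.transform("mean")) / by_rater.transform("std")
    return ratings.assign(z=z).groupby("segment_id")["z"].mean()

# Example with three raters scoring two segments:
ratings = pd.DataFrame({
    "segment_id": [1, 1, 1, 2, 2, 2],
    "rater":      ["a", "b", "c", "a", "b", "c"],
    "da_score":   [80, 75, 90, 40, 55, 60],
})
print(z_mean(ratings))  # one z_mean value per segment
```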

New: We provide new blind test sets of 1K sentence pairs for all languages, as well as test sets for 4 new language pairs for which no training data will be given:

Training, dev and test data: Download the training, development, test20 and test21 data, consisting of the Wikipedia/Reddit datasets listed below, each with 7K sentences for training, 1K sentences for development and 1K sentences for the 2020 test set. The data includes information from the NMT model used to generate the translations (the model score for each sentence and log probabilities for words), as well as the title of the Wikipedia article the source sentence came from:

You can download the NMT models used to generate the Wikipedia translations, as well as the Ru-En Wikipedia/Reddit NMT model. Here are details on the training data used to build the Ru-En model. The zero-shot translations were produced by a multilingual Transformer NMT model.

Test data: We provide 1K new test sentence pairs for each of the language pairs above, as well as information from the NMT model used to generate the translations: the model score for each sentence and log probabilities for words. In addition, we will provide test data for 4 new language pairs for zero-shot prediction.

Download the test21 data.

Baseline: The baseline system is a neural predictor-estimator approach implemented in OpenKiwi, similar to the one used here. For the predictor/feature generation part, the baseline model uses a multilingual pre-trained encoder, namely XLM-Roberta (xlm-roberta-base model from huggingface). The baseline model is finetuned on DA scores for Task 1.
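The snippet below is not the OpenKiwi baseline itself, only a minimal sketch of the same idea, assuming the huggingface transformers API: a sentence-pair regressor on top of xlm-roberta-base that would be fine-tuned to predict the DA score.

```python
# NOT the OpenKiwi predictor-estimator baseline -- only a minimal sketch of the
# same idea: encode the source and its machine translation jointly with
# xlm-roberta-base and regress a single sentence-level score.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=1  # single regression head for the DA score
)

batch = tokenizer("Das ist ein Test.", "This is a test.",
                  truncation=True, return_tensors="pt")

# Fine-tuning would minimise e.g. MSE between this output and the gold z_mean;
# here we only show the forward pass of an untrained head.
with torch.no_grad():
    score = model(**batch).logits.squeeze(-1)
print(score.item())
```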

Evaluation: Sentence-level submissions will be evaluated in terms of Pearson's correlation between the predicted DA scores and human DA (the z-standardised mean DA score, i.e. z_mean). These are the official evaluation scripts. The evaluation will focus on multilingual systems, i.e. systems that are able to provide predictions for all languages, including the zero-shot ones. Therefore, the average Pearson correlation across all these languages will be used to rank QE systems. We will also evaluate QE systems on a per-language basis for those interested in particular languages, and in the zero-shot scenario.
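As an informal sketch of how the multilingual ranking score is computed (the official scripts linked above are authoritative):

```python
# Informal sketch of the ranking metric: Pearson's r per language pair,
# averaged over all language pairs (including the zero-shot ones).
from scipy.stats import pearsonr

def multilingual_score(predictions: dict, gold: dict) -> float:
    """predictions/gold map a language pair (e.g. "en-de") to a list of sentence scores."""
    per_lp = {lp: pearsonr(predictions[lp], gold[lp])[0] for lp in gold}
    return sum(per_lp.values()) / len(per_lp)
```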



Task 2: Word and Sentence-Level Post-editing Effort

This task evaluates the application of QE for post-editing purposes. It consists of predicting sentence-level post-editing effort (HTER scores) and word-level tags (for MT words, gaps in the MT output, and source words).

Training, dev and test data: The data this year is the same as that used in Task 1, but with labels derived from post-editing. Download the training, development, test20 data. Word-level labels have been obtained by using the alignments provided by the TER tool (settings: tokenised, case insensitive, exact matching only, disabling shifts by using the `-d 0` option) between machine translations and their post-edited versions. Shifts (word order errors) were not annotated as such (but rather as deletions + insertions) to avoid introducing noise in the annotation. HTER values are obtained deterministically from word-level tags. However, when computing HTER, we allow shifts in TER. Please note that we replaced the 2020 training, dev and test sets as there were some issues with the annotation. Make sure to download the new version from the repository.
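As a rough illustration of how the sentence-level HTER relates to the word-level tags (this is not the official labelling pipeline; the released labels should be used as-is):

```python
# Rough illustration only: treating each BAD tag on an MT token as a
# substitution/deletion and each BAD gap tag as an insertion, HTER is the
# edit count normalised by the post-edited length.
def hter_from_tags(mt_tags, gap_tags, pe_length):
    """mt_tags/gap_tags are lists of "OK"/"BAD"; pe_length is the post-edit token count."""
    edits = sum(t == "BAD" for t in mt_tags) + sum(t == "BAD" for t in gap_tags)
    return min(edits / max(pe_length, 1), 1.0)  # conventionally capped at 1

# e.g. two bad MT words and one missing word against a 10-token post-edit:
print(hter_from_tags(["OK", "BAD", "BAD"], ["OK", "BAD", "OK", "OK"], 10))  # 0.3
```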

Test data: We provide 1K new test sentence pairs for each of the language pairs above, as well as for the 4 new language pairs for zero-shot prediction.


Download the test21 data.

Baseline: The baseline system is the same as for Task 1, except that here it is fine-tuned jointly on HTER scores and word-level tags.

Evaluation: For sentence-level QE, submissions are evaluated in terms of the Pearson's correlation metric for the sentence-level HTER prediction. For word-level QE, they will be evaluated in terms of MCC (Matthews correlation coefficient).

These are the official evaluation scripts.
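For a quick sanity check on the dev data (not a replacement for the official scripts), word-level MCC can be computed roughly as follows:

```python
# Quick sanity check only: flatten the per-sentence OK/BAD tags,
# map BAD to 1 and compute MCC.
from sklearn.metrics import matthews_corrcoef

def word_level_mcc(gold_tags, pred_tags):
    """gold_tags/pred_tags: lists of per-sentence lists of "OK"/"BAD" tags."""
    def flatten(tag_seqs):
        return [int(tag == "BAD") for sent in tag_seqs for tag in sent]
    return matthews_corrcoef(flatten(gold_tags), flatten(pred_tags))
```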



Task 3: Critical Error Detection

Data for this task is now available in the mlqe-pe repository, including training, dev and test set.

The goal of this task is to predict sentence-level binary scores indicating whether or not a translation contains (at least one) critical error. Translations with such errors are defined as translations that deviate in meaning as compared to the source sentence in such a way that they are misleading and may carry health, safety, legal, reputation, religious or financial implications. Meaning deviations from the source sentence can happen in three ways:

We focus on a set of critical error categories. See examples of translations with critical errors for each category. For this task, we are not expecting the errors to be categorised or their spans to be identified in the sentence, but rather a binary prediction: 1 (the translation contains at least one critical error in the above categories) or 0 (it does not). In either case, the translation may also contain other types of errors, critical or not.

Training, development and test data: The data consists of Wikipedia comments in English extracted from two sources: the Jigsaw Toxic Comment Classification Challenge and the Wikipedia Comments Corpus, with translations generated by the ML50 multilingual translation model by FAIR. It contains instances in the following languages:

Test data: Approximately 1K sentence pairs for each language pair are provided.

Baseline: The baseline system is a MonoTransQuest model similar to the one used here, with default hyperparameter values and XLM-Roberta (namely xlm-roberta-base) as the pre-trained representation, fine-tuned on the provided labels as a binary classifier. We thank Genze Jiang for helping with the baselines!

Evaluation: Submissions will be evaluated in terms of standard classification metrics, with MCC as the main metric. These are the official evaluation scripts.


Submission Information

For CODALAB submissions, click:

Submission Format

Tasks 1, 2 and 3 (sentence-level)

The output of your system for the sentence-level subtask should be a single file with the first two lines indicating model size and the remaining lines containing the predicted scores, one per line for each sentence, formatted as:

Line 1:
<DISK FOOTPRINT (in bytes, without compression)>
Line 2:
<NUMBER OF PARAMETERS>
Lines 3 onwards, one line per test sentence:
<LANGUAGE PAIR> <METHOD NAME> <SEGMENT NUMBER> <SEGMENT SCORE> 

Where:

Each field should be delimited by a single tab character.
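A minimal sketch of producing such a file (function and variable names are hypothetical; please validate your output with the official evaluation scripts on the dev data):

```python
# Minimal sketch of writing a sentence-level submission file in the required
# tab-separated format: two model-size header lines, then one scored segment per line.
import os

def write_sentence_submission(path, checkpoint_path, n_params, lang_pair, method, scores):
    with open(path, "w", encoding="utf-8") as out:
        out.write(f"{os.path.getsize(checkpoint_path)}\n")  # disk footprint in bytes, uncompressed
        out.write(f"{n_params}\n")
        for seg_id, score in enumerate(scores):
            out.write(f"{lang_pair}\t{method}\t{seg_id}\t{score:.6f}\n")

# e.g.:
# write_sentence_submission("predictions.txt", "model.bin", 278000000,
#                           "en-de", "MY_QE_SYSTEM", [0.12, -0.43, 1.05])
```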

Task 2 (word-level)

We request up to three separate files, one for each type of label: MT words, MT gaps and source words. You can submit any of these, or all of them, independently. The output of your system for each type of label should be word-level tags formatted in the following way:

Line 1:

<DISK FOOTPRINT (in bytes, without compression)>

Line 2:

<NUMBER OF PARAMETERS>

Lines 3 onwards, one line per word:

<LANGUAGE PAIR> <METHOD NAME> <TYPE> <SEGMENT NUMBER> <WORD INDEX> <WORD> <BINARY SCORE>

Where:

Each field should be delimited by a single tab character.
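A corresponding sketch for one of the word-level files (again, the names are hypothetical):

```python
# Sketch of writing one word-level file, e.g. the MT-word tags: the same two
# header lines, then one line per word with type, segment number, word index,
# word and binary score, all tab-separated.
def write_word_submission(path, footprint, n_params, lang_pair, method, tag_type, tagged):
    """tagged: one list per segment, each a list of (word, binary_score) pairs."""
    with open(path, "w", encoding="utf-8") as out:
        out.write(f"{footprint}\n{n_params}\n")
        for seg_id, sentence in enumerate(tagged):
            for word_idx, (word, score) in enumerate(sentence):
                out.write(f"{lang_pair}\t{method}\t{tag_type}\t{seg_id}\t"
                          f"{word_idx}\t{word}\t{score}\n")
```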


Additional Resources

These are the parallel data used to train the NMT models for tasks 1 and 2:

Useful Software

Here are some open source software for QE that might be useful for participants:


Submission Requirements

Each participating team can submit at most 30 systems for each of the language pairs of each subtask, except for the multilingual track of tasks 1 & 2 (10 systems max). These should be submitted to a CODALAB page for each subtask.

Please check that your system output on the dev data is correctly read by the official evaluation scripts.


Organisers

Lucia Specia (Imperial College London, University of Sheffield)
Marina Fomicheva (University of Sheffield)
Zhenhao Li (Imperial College London)
Frédéric Blain (University of Wolverhampton)
Paco Guzmán (Facebook)
Vishrav Chaudhary (Facebook)
Chryssa Zerva (Instituto de Telecomunicações)
André Martins (Instituto de Telecomunicações, Unbabel)

Contact

For questions or comments on Tasks 1 and 3, email lspecia@gmail.com.
For questions or comments on Task 2, email erickrfonseca@gmail.com.
For questions or comments on Codalab, please use the forum available for each task.

Supported by the European Commission under the project Bergamot.