Quality Estimation Task - ACL 2019 Fourth Conference on Machine Translation

Shared Task: Quality Estimation

Update: Results are available for tasks 1 and 2. Participant submissions for all tasks are also available.

Important dates

Release of training data	February 21, 2019
Release of test data	April 29, 2019
QE result submission deadline	May 10, 2019
Paper submission deadline	May 17, 2019
Notification of acceptance	June 7, 2019
Camera-ready deadline	June 17, 2019
Conference in Florence	August 1-2, 2019

Overview

This shared task will build on its previous editions to further examine automatic methods for estimating the quality of machine translation output at run-time, without relying on reference translations. We cover estimation at various levels, whereas tasks that rely on annotations or post-edits use data that was created by professional translators; the datasets are domain-specific (Amazon reviews and IT domain). This year all datasets are based on neural MT output.

In addition to generally advancing the state of the art at all prediction levels, our specific goals as addressed in tasks 1-3 are:

to study the performance of quality estimation approaches on the output of modern neural MT systems, at word-level, sentence-level, and document-level
to create another set of public benchmarks for tasks in quality estimation
to study the predictability of incorrect and missing words in the MT output and corresponding source words that lead to errors in the MT output
to study new approaches to document-level quality estimation, via prediction of a global quality score as well as character-level annotations with severity judgements. This will be done using a corpus manually annotated with a fine-grained error taxonomy, from which document-level scores are derived
to assess the feasibility of fully automatic reference-free word-level error analysis of MT output
to directly compare quality estimation approaches to reference-based MT metrics in terms of correlation with human judgements of translation quality

The datasets provided for this shared task are created using proprietary MT engines and are distributed freely. Participants are allowed to explore any additional data and resources deemed relevant.

Below are the three QE tasks, addressing these goals: sentence-level and word-level QE (task 1), document-level QE (task 2) and QE as a metric (task 3).

Data

Task	Language Pair	Data
1	EN-DE	train/dev, blind test, test
1	EN-RU	train/dev, blind test, test
2	EN-FR	train/dev, 2018 test, blind test, test

Participant submissions

Task 1: Word and Sentence-Level QE

Description: the aim of this task is to test the application of QE for post-editing purposes. The participating systems are expected to predict the sentence-level HTER score (the percentage of edits needed to fix the translation) and, optionally, word-level edits (insertions/deletions/replacements/etc.).

For sentence-level predictions, participating systems are required to score sentences according to post-editing effort, via the percentage of edits need to be fixed (HTER).

For word-level predictions, we build upon last year's task, framing the problem as the binary task of distinguishing between 'OK' and 'BAD' tokens. Participating systems are required to detect errors for each token in MT output. In addition, we attempt to predict missing words in the translation. We require participants label any sequence of one or more missing token (a gap) with a single 'BAD' label and also indicate 'BAD' tokens in the source sentence that are related to the tokens missing in the translated sentence. This is particularly important to spot adequacy errors in NMT.

Data: the data consists of:

English-German: sentences on the IT domain translated by an in-house encoder-decoder attention-based NMT system (13,442 training and 1,000 development sentences). This dataset as well as its test set are the same as in the last year's QE task. Download training/development and test data.
English-Russian: sentences on the tech domain (Microsoft Office) translated by a state-of-the-art online neural MT system (15,089 training and 1,000 development sentences). Download training/development and test data.

Binary word-level labels have been obtained by using the alignments provided by the TER tool (settings: tokenised, case insensitive, exact matching only, disabling shifts by using the `-d 0` option) between machine translations and their post-edited versions. Shifts (word order errors) were not annotated as such (but rather as deletions + insertions) to avoid introducing noise in the annotation. Missing tokens in the machine translations, as indicated by the TER tool are annotated as follows: after each token in the sentence and at sentence start, a gap tag is placed. This tag will be set to 'BAD' if in that position there should be one or more tokens, and OK otherwise. Note that number of tags for each target sentence is 2*N+1, where N is the number of tokens in the sentence. All tokens in the source sentences are also labeled with either 'OK' or 'BAD'. For this, the alignments between source and post-edited sentences are used. If a token is labeled as 'BAD' in the translation, all tokens aligned to it are labeled as 'BAD' in the source sentence. This is meant to indicate which source tokens lead to errors in the translations.

As training and development data, we provide the tokenised and truecased source and translation outputs with source and target tokens annotated with 'OK' or 'BAD' labels, as well as the source-target alignments, and gaps annotated for the translations.

Baseline: the baseline system is a neural quality estimation system (NuQE) that does not use any additional parallel data. We will use the OpenKiwi (Kepler at al., 2019) implementation.

Evaluation: for sentence-level QE, submissions are evaluated in terms of the Pearson's correlation metric for the sentence-level HTER prediction. For word-level QE, they will be evaluated in terms of word-level classification performance via the multiplication of F1-scores for the 'OK' and 'BAD' classes against the true labels, for two different types of labels, independently:

words and gaps in the MT ('OK' for correct words and genuine gaps, 'BAD' for incorrect words or gaps indicating missing words)
source words ('BAD' for words that lead to errors in the MT, 'OK' for other words)

The official evaluation scripts are available.

Task 2: Document-Level QE

Description: the goal of this task is to predict document-level quality scores as well as fine-grained annotations.

The data is derived from the Amazon Product Reviews dataset. More specifically, a selection of Sports and Outdoors product titles and descriptions in English which has been machine translated into French using a state of the art online neural MT system. The most popular products (those with more reviews) were chosen. This data poses interesting challenges for machine translation: titles and descriptions are often short and not always a complete sentence. The data was annotated by Unbabel for errors at the word level using a fine-grained error taxonomy (MQM).

MQM is composed of three major branches: accuracy (the translation does not accurately reflect the source text), fluency (the translation affects the reading of the text) and style (the translation has stylistic problems, like the use of a wrong register). These branches include more specific issues lower in the hierarchy. Besides the identification of an error and its classification according to this typology (by applying a specific tag), the errors will receive a severity scale that will show the impact of each error on the overall meaning, style, and fluency of the translation. An error can be minor (if it doesn’t lead to a loss of meaning and it doesn’t confuse or mislead the user), major (if it changes the meaning) or critical (if it changes the meaning and carry any type of implication, or could be seen as offensive).

For this task, we concentrate on document-level error annotations, where a document contains the product title and description for a given product. Each error annotation may consist of one or more words, not necessarily contiguous. Errors have a label specifying their type, such as wrong word order, missing words, agreement, etc. These labels may provide additional information, but they don't need to be predicted by the systems. The document-level scores were generated from the error annotations and their severity using the method in this paper (footnote 6). The dataset is the largest ever released collection with manually annotated errors.

NEW: This year, systems may return not only predicted MQM scores, but also (optionally) fine-grained predicted annotations along with a severity level. They are encouraged to do so.

Data: the training and development data contains 1,000/200 English-French training/development documents, with 6,003/1,301 segments with words annotated for errors. Download training and development sets, the test set from last year and this year's test set.

Note: these datasets are the same as last year, but with some corrections in the annotations and different preprocessing. A new fresh test set will be provided.

Baseline: the baseline system will be a system derived from the word-level baseline from Task 1 augmented with severity labels, which are then converted to phrases and used to compute MQM scores.

Evaluation: Submissions will be evaluated as in Task 1, in terms of Pearson's correlation between the true and predicted MQM document-level scores. Additionally, the predicted annotations will be evaluated in terms of their F1 scores with respect to the gold annotations.

The official evaluation scripts are available.

Task 3: QE as a Metric / Metrics without References

Description: The point of this task is to see how well quality estimation models perform as MT metrics (like BLEU or ch rF), except without using reference translations. On one hand the setting of this task is quite similar to regular sentence-level QE: based on the input segment and MT output the model has to predict a score, depicting the quality of the MT output. However, the major differences from Task 1 are:

the model predictions will be evaluated in terms of correlation with human judgements (unlike Task 1, where the objective is the HTER score based on a post-edit)
QE systems will be applied to a variety of MT outputs from different MT systems (unlike Task 1, where all of the test data, as well as train/dev sets, are homogeneous and come from the same NMT system)
the test sets are from the news text domain, since metrics (both with and without references) are evaluated on the submissions to the news translation shared task (unlike in-domain texts in Task 1)

Evaluation: This task is evaluated in the same way as the metrics shared task: based on the news translation shared task test sets and Spearman/Pearson correlation with the direct human assessments of their quality.

Submissions are welcome for any WMT'19 language pairs, with the special highlight being English⇆German and English⇆Russian language pairs. This only means that in the QE findings paper we will pay these pairs special attention, use them for analysis and conclusions, etc. Submissions for all language pairs in this task will make it into the metrics task comparisons

Data: There is a variety of resources that can be helpful for Task 3; some are listed below, and participants are welcome to use any other datasets(with the exception of the reference translations in the test sets):

explicit annotations of the same kind of direct assessments of MT quality from WMT 2016-2018
potentially usable relative rankings from WMT news translation shared tasks until 2016
last years' QE shared tasks with HTER / timing / quality label annotations
unannotated parallel and monolingual corpora from WMT and outside of it

See the metrics task page for access to the data, as well as format information

We also encourage cross-task submissions of systems from Task 1, applying the HTER-predicting systems to test data from Task 3, in order to see how well these correlate with each other.

The baseline for this task will consist of using state-of-the-art NMT baseline systems to score the MT output with a log-probability. Or perhaps using LASER cross-lingual sentence embeddings. Or both.

Submission Format

Task 1 (sentence-level)

The output of your system for the sentence-level subtask should be a single file with the HTER score for each sentence (each line in the .mt file), formatted in the same way as the .hter files provided with the training data:

<SEGMENT_1 SCORE>
<SEGMENT_2 SCORE>
...
<SEGMENT_n SCORE>

Task 1 (word-level)

The output for the word-level subtask can be up to two separate files: one with MT labels (for words and gaps) and another one with source words. You can submit for either of these subtasks or both of them, independently. The output format should be the same as in the .tags and .source_tags files in the training data; i.e., the .tags file should be formatted as:

GAP_1 WORD_1 GAP_2 WORD_2 ... GAP_n WORD_n GAP_n+1

and the .source_tags file should be:

WORD_1 WORD_2 ... WORD_n

Where each WORD_i and GAP_i is either OK or BAD. Tags must be delimited by whitespace or a tab character.

For MT labels, each sentence will therefore correspond to 2n+1 tags (where n is the number of words in the sentence), alterating gaps and words, in order. For source labels, each sentence will correspond to n tags.

For example, consider the following MT document and its post edited version. The wrong words are highlighted:

anschließend wird in jeder Methode die übergeordnete Superclass-Version von selbst aufgerufen .
anschließend wird in jeder Methode die Superclass-Version dieser Methode aufgerufen .

For this translation, output tags should be as follows. To make them easier to distinguish, tags referring to gaps are highlighted in yellow, while those referring to words are in blue. In this example, all BAD words can be either removed or replaced by other words in the post edited text; since no insertions are necessary, all gaps are tagged as OK.

OK
OK
OK
OK
OK
OK
OK
OK
OK
OK
OK
OK
OK
BAD
OK
OK
OK
BAD
OK
BAD
OK
OK
OK
OK
OK

Task 2 (document-level MQM)

The output of your system for the MQM subtask should produce scores for the translations at the document level. Since documents are organized in different directories, you also need to identify which document a score is assigned to. Each output line should formatted in the following way:

<DOCUMENT ID> <DOCUMENT SCORE>

Where:

DOCUMENT ID is the identifier of the translation you are scoring; it is the name of the corresponding directory.
DOCUMENT SCORE is the predicted MQM score for the document.

Lines can be in any order. Each field should be delimited by a single tab character.

Example of the document-level format:

B00014ZYNS	00.000
B0001NECEG	11.111
B0002GRV72	22.222

The example shows that documents under "B00014ZYNS", "B0001NECEG", "B0002GRV72", have got predicted quality scores of 00.000, 11.111 and 22.222, respectively.

Task 2 (document-level fine-grained annotations)

For the fine-grained annotation subtask, systems will have to predict which text spans contain translation errors, as well as classify them as minor, major or critical. Two or more spans can be part of the same error annotation (for example, in agreement errors in which a noun and an adjective are not adjacent).

The system output format is similar to the annotations.tsv files in the training data, but should include the document id. Each line in the output refers to a single error annotation (containing one or more spans) and should be formatted like this:

<DOCUMENT ID> <LINES> <SPAN START POSITIONS> <SPAN LENGTHS> <SEVERITY>

Where:

DOCUMENT ID is the containing folder, as in the MQM subtask.
LINES is a list of lines containing the error spans, starting from 0 and separated by white space.
SPAN START POSITIONS is a list of the character offsets in which the spans begin, separated by white space, and also starting from 0. The number of start positions should be the same as in lines.
SPAN LENGTHS is a list of the lengths, in number of characters, of the error spans. The number of lengths must be the same as the start positions. Spans should not overlap.
SEVERITY is either minor, major or critical.

Note that while the training data includes the error category (such as missing words or word order), this field is not necessary in the system output.

For example, consider the following translation and the highlighted error spans. Notice that the two spans in green are part of the same annotation (they indicate an agreement error); while the blue span includes only a white space character (it indicates a missing word).

Kit de Remington AirMaster 77 avec portée
La carabine à air Crosman Remington AirMaster 77 multi-pompe pneumatique lance des boulettes à une énorme 725 les pieds par seconde (755 fps avec BBs), donc vous pouvez frapper efficacement vos cibles à distance.
La carabine à air comprimé est équipé d’un guidon fibre optique et complètement réglable cran de mire qui le rendent facile à acquérir votre cible.

Considering this document is under the directory B0002IL6WQ, these error spans can be described with the following output:

B0002IL6WQ	0	35	6	minor
B0002IL6WQ	1	110	3	major
B0002IL6WQ	2	49	1	major
B0002IL6WQ	2 2	3 31	8 6	minor

Task 3 (QE as a metric)

Together with the metrics task; to be posted soon.

Additional Resources

These are the resources we have used to train aligners for tasks 1 and 2:

The Automatic Post-Editing task page also has links to more English-Russian and English-German data.

Useful Software

Here are some open source software for QE that might be useful for participants:

Submission Requirements

Each participating team can submit at most 2 systems for each of the language pairs of each subtask (systems producing alternative scores, e.g. post-editing time can be submitted as additional runs). These should be sent via email to andre.t.martins@gmail.com. Please use the following pattern to name your files:

INSTITUTION-NAME_TASK-NAME_LANGUAGE-PAIR_METHOD-NAME.EXTENSION, where:

INSTITUTION-NAME is an acronym/short name for your institution, e.g. SHEF

TASK-NAME is one of 1, 2 or 3.

LANGUAGE-PAIR is the abbreviation of the language pair used: EN-DE, EN-RU or EN-FR.

METHOD-NAME is an identifier for your method in case you have multiple methods for the same task, e.g. 2_J48, 2_SVM

EXTENSION is .hter for sentence-level QE, .tags and .source_tags for word-level QE (as in the training data), .tsv for fine-grained document-level and .mqm for document-level MQM scores. Leave empty for task 3.

For instance, submissions from team SHEF for task 2 (English-French) using method "SVM" could be named SHEF_2_EN-FR_SVM.tsv and SHEF_2_EN-FR_SVM.mqm.

You are invited to submit a short paper (4 to 6 pages) to WMT describing your QE method(s). You are not required to submit a paper if you do not want to. In that case, we ask you to give an appropriate reference describing your method(s) that we can cite in the WMT overview paper.

Please check that your system output on the dev data is correctly read by the official evaluation scripts.

Organisers

Mark Fishel (University of Tartu)
Erick Fonseca (Instituto de Telecomunicações)
André Martins (Instituto de Telecomunicações and Unbabel)
Lisa Yankovskaya (University of Tartu)
Christian Federmann (Microsoft)

Contact

For questions or comments, email fishel@ut.ee.