ACL 2014 NINTH WORKSHOP
ON STATISTICAL MACHINE TRANSLATION

Shared Task: Quality Estimation

26-27 June 2014
Baltimore, USA


This shared task will examine automatic methods for estimating the quality of machine translation output at run-time, without relying on reference translations. In this third edition of the shared task, we will once again consider word-level and sentence-level estimation. However, this year we will focus on settings for quality prediction that are MT system-independent (e.g. Bicici, 2013) and rely on a limited number of training instances. More specifically, our tasks have the following goals:

The WMT12-13 Quality Estimation shared tasks provided a set of baseline features, datasets, evaluation metrics, and oracle results. Building on the last two years' experience, this year's shared task will reuse some of these resources, but provide additional training and test sets for four language pairs (English-Spanish, English-German, German-English, Spanish-English) and use different quality labels at the word level (specific types of errors) and at the sentence level. These new datasets have been collected using professional translators as part of the QTLaunchPad project.

Note that this year, for some of the subtasks, translations are produced in various ways: by RBMT, SMT, and hybrid MT systems, as well as by humans. In the datasets provided, no indication is given of how these various translations were generated.

Another important difference with respect to previous years is that this year participants will only be able to use black-box features, since internal features of the MT systems will not be provided.

Please note that any additional training data (from WMT12 or other sources) can be used for all tasks.



Task 1: Sentence-level QE

Task 1.1 Scoring and ranking for perceived post-editing effort

Results here, gold-standard labels for all languages here

This task is similar to the one in WMT12, with [1-3] scores for "perceived" post-editing effort used as quality labels, where:

  • 1 = a perfect translation, requiring little to no post-editing;
  • 2 = a near miss: a translation with a small number of errors that can easily be fixed;
  • 3 = a very low quality translation that cannot easily be fixed.

The datasets were labelled in a "triage" phase aimed at selecting translations of type "2" (near miss) that could be annotated for errors at the word-level using the MQM metric (see Task 2, below) for systematic translation quality analysis.

For the training of prediction models, we provide a new dataset consisting of source sentences and their human translations, as well as two to three versions of machine translations (by an SMT system, an RBMT system and, for English-Spanish/German only, a hybrid system), all in the news domain, extracted from test sets of various WMT years and MT systems that participated in the translation shared task:

As test data, for each language pair and MT system (or human translation) we provide a new set of translations produced by the same MT systems (and humans) as those used for the training data.

Additionally, for those interested, we provide some out-of-domain test data. These translations were annotated in the same way as above, each dataset by one LSP (one professional translator). However, they were generated using the LSP's own source data (a different domain from news) and their own MT system (different from the three used for the official datasets). The results on these datasets will not be considered for the official ranking of the participating QE systems, but you are welcome to report on them in your paper. The true scores for these are also provided with the tars below:

Task 1.2 Scoring and ranking for percentage of edits needed (HTER)

Results here, gold-standard labels for all languages here

This task is similar to task 1.1 in WMT13, where we use HTER as the quality score, i.e. the minimum edit distance between the machine translation and its manually post-edited version, in [0,1]. Translations are given by one MT system only. Each of the training and test translations was post-edited by a professional translator, and HTER labels were computed using TERp (default settings: tokenised, case insensitive, etc., but capped to 1).
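As a rough illustration of how such a capped label relates to an edit count and the length of the post-edited reference (this is not the official TERp computation; the function below is ours and assumes the raw edit count is already available from a TER tool):

    def capped_hter(num_edits, postedit_num_tokens):
        # HTER = TER between the MT output and its post-edited version:
        # number of edits divided by the number of tokens in the post-edit,
        # capped to 1 as in the official labels.
        if postedit_num_tokens == 0:
            return 0.0
        return min(1.0, num_edits / float(postedit_num_tokens))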

As training data, we provide a subset of the dataset from sub-task 1.1 above, but for English-Spanish only, and with a single translation per source sentence (the one produced by the SMT system):

As test data, we provide a new set of translations produced by the same SMT system used for the training data.

Task 1.3 Scoring and ranking for post-editing time

Results here, gold-standard labels for all languages here

This task is the same as Task 1.3 in WMT13: participating systems are required to produce for each sentence its expected post-editing time, a real-valued estimate of the time (in milliseconds) it takes a translator to post-edit the translation. Training and test sets similar to those from sub-task 1.2 above (subject to filtering of outliers) will be provided, with the difference that the labels are now the number of milliseconds that were necessary to post-edit each translation. Each of the training and test translations was post-edited by a professional translator using a web-based tool that collects post-editing time on a per-sentence basis.

Training data:

Test data:


For each of the subtasks under Task 1, as in previous years, two variants of the results can be submitted:

  • Scoring: an absolute quality score for each sentence translation, according to the type of prediction (1-3 label, HTER, or post-editing time).
  • Ranking: a ranking of the sentence translations from best to worst.

For each language pair, evaluation will be performed against the true label and/or HTER ranking using the same metrics as in previous years:

  • Scoring: Mean Absolute Error (MAE) as the primary metric, and Root Mean Squared Error (RMSE) as a secondary metric.
  • Ranking: DeltaAvg as the primary metric, and Spearman's rank correlation as a secondary metric.
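For illustration only (not the official evaluation script), the scoring metrics and Spearman's correlation can be computed as follows; DeltaAvg is omitted from this sketch:

    import numpy as np
    from scipy.stats import spearmanr

    def mae(gold, pred):
        # Mean Absolute Error between true and predicted scores
        return float(np.mean(np.abs(np.asarray(gold) - np.asarray(pred))))

    def rmse(gold, pred):
        # Root Mean Squared Error between true and predicted scores
        return float(np.sqrt(np.mean((np.asarray(gold) - np.asarray(pred)) ** 2)))

    # Spearman's rank correlation for the ranking variant
    rho = spearmanr([1, 2, 3, 4], [1, 3, 2, 4]).correlation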

For all these subtasks, the same 17 features used in WMT12-13 will be considered for the baseline systems. These systems will use SVM regression with an RBF kernel, as well as a grid-search algorithm to optimise the relevant parameters. QuEst will be used to build the prediction models. For all subtasks we will use the same evaluation script.
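A rough sketch of such a baseline, using scikit-learn in place of QuEst's learning module (the file names and the parameter grid below are illustrative, not the official settings):

    import numpy as np
    from sklearn.svm import SVR
    from sklearn.model_selection import GridSearchCV

    # 17 baseline feature values per sentence, one sentence per row
    X_train = np.loadtxt("train.features")
    y_train = np.loadtxt("train.labels")    # e.g. 1-3 label, HTER or time
    X_test = np.loadtxt("test.features")

    # SVM regression with an RBF kernel; grid search over the relevant parameters
    grid = GridSearchCV(
        SVR(kernel="rbf"),
        param_grid={"C": [1, 10, 100],
                    "gamma": [0.01, 0.1, 1],
                    "epsilon": [0.1, 0.2, 0.4]},
        scoring="neg_mean_absolute_error",
        cv=5,
    )
    grid.fit(X_train, y_train)
    predictions = grid.predict(X_test)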



Task 2: Word-level QE

Results here, gold-standard labels for all languages here

The data for this task is based on a subset of the same datasets provided in Task 1.1, for all language pairs, human and machine translations: those translations labelled 2 (near misses), plus additional data provided by industry (either in the news domain or in other domains, such as technical documentation, produced using their own MT systems, and also pre-labelled as 2s). All segments have been annotated with word-level labels by professional translators using the core categories of the MQM metric as the error typology:

Core MQM

Participating systems will be required to produce for each token a label in one or more of the following settings:

  • Binary classification: good (OK) versus bad, where bad covers all error types;
  • Level 1 classification: good (OK), accuracy error, or fluency error;
  • Multi-class classification: good (OK) or one of the fine-grained MQM issue types listed above.

As training data, we provide tokenised translation output for all language pairs, human and machine translations, with each token annotated with one of the issue types listed above or with good. The annotation was performed manually by professional translators as part of the QTLaunchPad project. For the coarser variants, labels will be grouped in two ways: accuracy versus fluency, and good versus all other types of errors (see the sketch below).
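A minimal sketch of this grouping; the fine-grained issue names below are illustrative MQM core categories, and the exact label strings (including the good/OK label) are those used in the released data:

    # Map fine-grained MQM issues to the level-1 branches (illustrative names)
    LEVEL1 = {
        "Terminology": "Accuracy", "Mistranslation": "Accuracy",
        "Omission": "Accuracy", "Addition": "Accuracy", "Untranslated": "Accuracy",
        "Style/register": "Fluency", "Spelling": "Fluency",
        "Typography": "Fluency", "Grammar": "Fluency", "Unintelligible": "Fluency",
    }

    def coarsen(label):
        # Returns the (multi-class, level-1, binary) variants of one token label
        if label == "OK":
            return label, "OK", "OK"
        return label, LEVEL1.get(label, "Fluency"), "BAD"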

As test data, we provide additional data points for all language pairs, human and machine translations:

Submissions for each language pair will be evaluated in terms of classification performance (precision, recall, F1) against the original labels in the three variants (binary, level 1 and multi-class). The main evaluation metric will be the average F1 over all but the "OK" class. For the non-binary variants, the average will be weighted by the frequency of each class in the test data. Evaluation script.
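A sketch of this metric, using scikit-learn as a stand-in for the official evaluation script:

    from sklearn.metrics import f1_score

    def error_f1(gold, pred, ok_label="OK"):
        # Average F1 over all classes except "OK", weighted by how often
        # each error class occurs in the gold-standard test data.
        error_classes = sorted(set(gold) - {ok_label})
        return f1_score(gold, pred, labels=error_classes, average="weighted")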



Additional resources

We suggest the following interesting resources that can be used as additional data for training (notice the difference in language pairs and/or text domains and/or MT systems):

These are the resources we have used to extract the baseline features in tasks 1.1, 1.2, 1.3:

English

  • English source training corpus
  • English language model
  • English language model of POS tags
  • English n-gram counts
  • English truecase model
Spanish

  • Spanish source training corpus
  • Spanish language model
  • Spanish language model of POS tags
  • Spanish n-gram counts
  • Spanish truecase model
German

  • German source training corpus
  • German language model
  • German language model of POS tags
  • German n-gram counts
  • German truecase model
Giza tables

  • English-Spanish Lexical translation table src-tgt
  • English-German Lexical translation table src-tgt
  • Spanish-English Lexical translation table src-tgt
  • German-English Lexical translation table src-tgt
Submission Format

Task 1 Scoring and ranking for post-editing effort

The output of your system for a given subtask should produce scores for the translations at the segment level, formatted in the following way:

    <METHOD NAME> <SEGMENT NUMBER> <SEGMENT SCORE> <SEGMENT RANK>

Where:

  • METHOD NAME is the name of your quality estimation method;
  • SEGMENT NUMBER is the line number of the translation in the test set;
  • SEGMENT SCORE is the predicted score for the segment (for the scoring variant);
  • SEGMENT RANK is the predicted rank of the segment (for the ranking variant).

Each field should be delimited by a single tab character.
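For example, a Task 1 submission file could be written as follows (the method name, scores and ranks are made up, and the output file name follows the naming pattern described under Submission Requirements below):

    rows = [("SHEF_SVM", 1, 0.42, 3),
            ("SHEF_SVM", 2, 0.17, 1)]
    with open("SHEF_1-2_SVM", "w") as out:
        for method, segment, score, rank in rows:
            out.write("\t".join(str(f) for f in (method, segment, score, rank)) + "\n")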

Task 2: Word-level QE

The output of your system should produce labels for the translations at the word level, formatted in the following way:

    <METHOD NAME> <SEGMENT NUMBER> <WORD INDEX> <WORD> <DETAILED SCORE> <LEVEL 1 SCORE> <BINARY SCORE> 

Where:

  • METHOD NAME is the name of your quality estimation method;
  • SEGMENT NUMBER is the line number of the translation in the test set;
  • WORD INDEX is the position of the word in the tokenised segment;
  • WORD is the token itself;
  • DETAILED SCORE is the multi-class label (a specific issue type, or OK);
  • LEVEL 1 SCORE is the coarser label (accuracy, fluency, or OK);
  • BINARY SCORE is the binary label (bad or OK).

Each field should be delimited by a single tab character.
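For example, a single Task 2 output line could be produced as follows (the method name, token and labels are made up):

    fields = ("SHEF_CRF", 1, 4, "casa", "Mistranslation", "Accuracy", "BAD")
    line = "\t".join(str(f) for f in fields)
    # -> "SHEF_CRF\t1\t4\tcasa\tMistranslation\tAccuracy\tBAD"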

Submission Requirements

Each participating team can submit at most 2 systems for each language pair of each subtask. These should be sent via email to Lucia Specia lspecia@gmail.com. Please use the following pattern to name your files:

INSTITUTION-NAME_TASK-NAME_METHOD-NAME, where:

INSTITUTION-NAME is an acronym/short name for your institution, e.g. SHEF

TASK-NAME is one of the following: 1-1, 1-2, 1-3, 2.

METHOD-NAME is an identifier for your method in case you have multiple methods for the same task, e.g. J48, SVM

For instance, a submission from team SHEF for task 2 using method "SVM" could be named SHEF_2_SVM.

You are invited to submit a short paper (4 to 6 pages) to WMT describing your QE method(s). You are not required to submit a paper if you do not want to. In that case, we ask you to give an appropriate reference describing your method(s) that we can cite in the WMT overview paper.

Important dates

Release of training data: January 22, 2014
Release of test data: March 7, 2014
QE metrics results submission deadline: April 1, 2014
Paper submission deadline: April 1, 2014
Notification of acceptance: April 21, 2014
Camera-ready deadline: April 28, 2014

Organisers

Christian Buck (University of Edinburgh)
Radu Soricut (Google)
Lucia Specia (University of Sheffield)

Contact

For questions, comments, etc. email Lucia Specia lspecia@gmail.com.

Supported by the European Commission under the projects (grant numbers 296347 and 287576).