ACL 2013 EIGHTH WORKSHOP
ON STATISTICAL MACHINE TRANSLATION

Shared Task: Quality Estimation

8-9 August, 2013
Sofia, Bulgaria

This shared task will examine automatic methods for estimating machine translation output quality at run-time. Quality estimation aims at providing a quality indicator for unseen translated sentences without relying on reference translations. In this second edition of the shared task, we will consider both word-level and sentence-level estimation.

Sentence-level quality estimation can be used, for example, to decide whether a translation is good enough to be published as is or needs to be post-edited by a human, and to select the best translation among alternatives produced by different MT systems.

Word-level quality estimation can be used, for example, to highlight the specific words in a translation that need to be revised.

Last year, a first shared task on sentence-level estimation was organised as part of WMT12. It provided a set of baseline features, datasets, evaluation metrics, and oracle results, and attracted a large number of participants. Building on last year's experience, this year's shared task will reuse some of these resources, but will provide additional training and test sets, use different annotation schemes, and propose a few variants of the task for word- and sentence-level quality estimation.

Goals

The shared quality estimation task aims to identify effective quality indicators (features) and machine learning techniques for quality estimation, and to establish the current state of the art in the field.

Task 1: Sentence-level QE

Task 1.1 Scoring and ranking for post-editing effort

This task is similar to the one in WMT12, but with one important difference in the scoring variant: based on feedback received last year, instead of the [1-5] post-editing effort scores, we will use HTER as our quality score, i.e. the minimum edit distance between the machine translation and its manually post-edited version, normalised to [0,1]. Two variants of the results can be submitted:

Scoring: an absolute HTER score for each sentence translation, reflecting the predicted post-editing effort.
Ranking: a ranking of the sentence translations in the test set from best to worst.

For the training of models, we provide the WMT12 dataset: 2,254 English-Spanish news sentences produced by a phrase-based SMT system (Moses) trained on the Europarl and News Commentary corpora as provided by WMT, along with their source sentences, reference translations, post-edited translations, and HTER scores. We used TERp (default settings: tokenised, case insensitive, etc., but capped to 1) to compute the HTER scores. Likert scores are also provided in case participants prefer to use them for the ranking variant.

NOTE: Participants are free to use as training data other post-edited material as well ("open" submission). However, for submitting to Task 1.1, we require at least one submission per participant using only the official 2,254 training set ("restricted" submission).

As test data, we provide a new set of translations produced by the same MT system used for the training data. Evaluation will be performed against the HTER scores and/or rankings of these translations using the same metrics as in WMT12: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), Spearman's rank correlation, and DeltaAvg.
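For reference, the sketch below shows how the scoring-variant metrics could be computed in Python (NumPy and SciPy), assuming gold and predicted HTER scores are available as plain lists. This is only an illustration, not the official evaluation script; DeltaAvg is computed by the organisers' script and is not reproduced here.

import numpy as np
from scipy.stats import spearmanr

def scoring_metrics(gold, pred):
    """Compute MAE, RMSE and Spearman's rank correlation between
    gold and predicted HTER scores (illustrative sketch only)."""
    gold = np.asarray(gold, dtype=float)
    pred = np.asarray(pred, dtype=float)
    mae = np.mean(np.abs(gold - pred))
    rmse = np.sqrt(np.mean((gold - pred) ** 2))
    rho, _ = spearmanr(gold, pred)
    return mae, rmse, rho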

Task 1.2 System selection (new)

Participants will be required to rank up to five alternative translations of the same source sentence produced by multiple MT systems. We will use essentially the same data provided to participants of WMT's evaluation metrics task -- where MT evaluation metrics are assessed according to how well they correlate with human rankings. However, reference translations will not be allowed in this task. We provide the source sentences, their alternative machine translations, and human rankings for the training portion of the data.

Evaluation will be performed against human rankings of pairs of alternative translations. Kendall's tau correlation will be computed for each language pair, and the overall metric is a weighted average of these correlations.
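As a rough illustration (not the official evaluation script, whose tie handling may differ), pairwise Kendall's tau can be computed as the number of concordant pairs minus discordant pairs, divided by the total number of non-tied pairs. The sketch below assumes the human and predicted ranks for the alternative translations of one source sentence are given as parallel lists, with lower rank meaning better.

from itertools import combinations

def pairwise_kendall_tau(human_ranks, system_ranks):
    """Kendall's tau over pairs of alternative translations of one source:
    (concordant - discordant) / (concordant + discordant).
    Pairs tied in the human ranking are ignored."""
    concordant = discordant = 0
    for i, j in combinations(range(len(human_ranks)), 2):
        h = human_ranks[i] - human_ranks[j]
        s = system_ranks[i] - system_ranks[j]
        if h == 0:
            continue  # tie in the human ranking: skip the pair
        if s == 0 or (h > 0) != (s > 0):
            discordant += 1  # disagreement, or a predicted tie where humans see a difference
        else:
            concordant += 1
    total = concordant + discordant
    return (concordant - discordant) / total if total else 0.0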

Task 1.3 Predicting post-editing time (new)

Participating systems will be required to produce, for each sentence, the predicted time (in seconds) needed to post-edit its machine translation.

For training, we provide a new dataset: English-Spanish news sentences produced by a phrase-based SMT system (Moses), along with their source sentences, post-edited translations, and the time (in seconds) spent post-editing each segment. The data was collected from five translators (with few overlapping annotations). For each segment we provide an ID that identifies the translator who post-edited it (for those interested in training translator-specific models).

As test data, we provide additional source sentences and translations produced with the same SMT system, and IDs of the translators who will post-edit each of these translations (same post-editors as in the training data).

Submissions will be evaluated in terms of Mean Absolute Error (MAE) against the time spent by the same translators post-editing these sentences.

For Tasks 1.1-1.3, we also provide a system and resources to extract QE features (language model, Giza++ tables, etc.), where these are available. We also provide the machine learning algorithm that will be used as the baseline: SVM regression with an RBF kernel, along with a grid search algorithm for the optimisation of the relevant parameters. The same 17 features used in WMT12 will be used for the baseline systems.
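For illustration only, a minimal sketch of a baseline along these lines using scikit-learn is shown below. The file names, feature loading and parameter grid are placeholders, and the official baseline implementation and settings may differ.

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Placeholder paths: 17 baseline features per segment and one quality score per line.
X_train = np.loadtxt("train.features")   # shape: (n_segments, 17)
y_train = np.loadtxt("train.scores")     # e.g. HTER or post-editing time
X_test = np.loadtxt("test.features")

# SVM regression with an RBF kernel; grid search over the relevant parameters.
grid = GridSearchCV(
    SVR(kernel="rbf"),
    param_grid={"C": [1, 10, 100], "gamma": [0.01, 0.1, 1], "epsilon": [0.1, 0.2]},
    scoring="neg_mean_absolute_error",
    cv=5,
)
grid.fit(X_train, y_train)
predictions = grid.predict(X_test)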

Task 2: Word-level QE (new)

The data for this task is based on the same resources and data as Task 1.3, but with word-level labels. Participating systems will be required to produce a label for each token, in one of the following settings:

Binary classification: good (keep) versus bad (change, i.e. delete or substitute).
Multiclass classification: good, delete, or substitute.

As training data, we provide tokenized MT output with each token annotated with a multiclass (good/delete/substitute) label. The annotation is derived automatically by computing TER (with some tweaks) between the original machine translation and its post-edited version. For the binary variant, the labels are grouped into two classes: good (keep) versus all others (delete or substitute).
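The official labels are produced with TER (plus the tweaks mentioned above). Purely as a rough stand-in to make the idea concrete, the sketch below derives good/substitute/delete labels for the MT tokens from a difflib edit alignment between the translation and its post-edition; it is not the annotation procedure used for the released data.

from difflib import SequenceMatcher

def word_labels(mt_tokens, pe_tokens):
    """Label each MT token as 'good', 'substitute' or 'delete' based on an
    edit alignment against the post-edited tokens (approximation only;
    the official annotation uses TER)."""
    labels = []
    matcher = SequenceMatcher(a=mt_tokens, b=pe_tokens, autojunk=False)
    for op, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if op == "equal":
            labels.extend(["good"] * (a_end - a_start))
        elif op == "replace":
            labels.extend(["substitute"] * (a_end - a_start))
        elif op == "delete":
            labels.extend(["delete"] * (a_end - a_start))
        # 'insert' only adds post-edited tokens, so no MT token gets a label
    return labels

# Usage: word_labels(mt_sentence.split(), post_edited_sentence.split())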

As test data, we provide a tokenized version of the test data used in Task 1.3.

Submissions will be evaluated in terms of classification performance (precision, recall, F1) against the original labels in the two variants (binary and multiclass).
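For illustration, per-class precision, recall and F1 can be computed, for instance, with scikit-learn as sketched below; this is not the official evaluation script, and the label names simply follow the annotation described above.

from sklearn.metrics import precision_recall_fscore_support

# Gold and predicted labels for all tokens, e.g. in the multiclass setting.
gold = ["good", "substitute", "good", "delete"]
pred = ["good", "good", "good", "delete"]

precision, recall, f1, _ = precision_recall_fscore_support(
    gold, pred, labels=["good", "substitute", "delete"], average=None
)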

Download

Data, resources and baseline systems

Submission Format

Task 1.1 Scoring and ranking for post-editing effort

Your system's output should contain segment-level scores for the translations, formatted in the following way:

<METHOD NAME> <SEGMENT NUMBER> <SEGMENT SCORE> <SEGMENT RANK>

Where:
METHOD NAME is the name of your quality estimation method;
SEGMENT NUMBER is the line number of the translation in the test set;
SEGMENT SCORE is the predicted HTER score (for the scoring variant);
SEGMENT RANK is the predicted rank of the segment (for the ranking variant).
Each field should be delimited by a single tab character.
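For example, a hypothetical line for a method named SHEF_SVM, predicting an HTER score of 0.42 and rank 123 for segment 7 (all values made up for illustration), would be:

SHEF_SVM	7	0.42	123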

Task 1.2 System selection

The format of the output file should be the same as that of the test files provided, except that the empty field "rank=" must be filled with a number from 1 to 5 indicating the rank of the translation (ties are allowed).

Task 1.3 Predicting post-editing time

Your system's output should contain segment-level time predictions for the translations, formatted in the following way:

<METHOD NAME> <SEGMENT NUMBER> <SEGMENT TIME>

Where:
METHOD NAME is the name of your quality estimation method;
SEGMENT NUMBER is the line number of the translation in the test set;
SEGMENT TIME is the predicted post-editing time in seconds.
Each field should be delimited by a single tab character.
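For example, a hypothetical line predicting 35.4 seconds of post-editing time for segment 7 (values made up for illustration) would be:

SHEF_SVM	7	35.4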

Task 2: Word-level QE

Your system's output should contain word-level predictions for the translations, with one line per word, formatted in the following way:

<METHOD NAME> <SEGMENT NUMBER> <WORD INDEX> <BINARY SCORE> <MULTI SCORE>

Where:
METHOD NAME is the name of your quality estimation method;
SEGMENT NUMBER is the line number of the translation in the test set;
WORD INDEX is the position of the word within the tokenized segment;
BINARY SCORE is the predicted label in the binary setting (good/bad);
MULTI SCORE is the predicted label in the multiclass setting (good/delete/substitute).
Each field should be delimited by a single tab character.
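For example, a hypothetical line for word 3 of segment 7, predicted as bad in the binary setting and as a substitution in the multiclass setting (the exact label spellings should follow those in the released data), could be:

SHEF_CRF	7	3	bad	substitute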

Submission Requirements

Each participating team may make at most 2 submissions for each of the variants of the task. These should be sent via email to Lucia Specia lspecia@gmail.com. Please use the following pattern to name your files:

INSTITUTION-NAME_TASK-NAME_METHOD-NAME, where:

INSTITUTION-NAME is an acronym/short name for your institution, e.g. SHEF

TASK-NAME is one of the following: 1-1, 1-2, 1-3, 2.

METHOD-NAME is an identifier for your method in case you have multiple methods for the same task, e.g. 2_J48, 2_SVM

For instance, a submission from team SHEF for task 2 using method "SVM" could be named SHEF_2-SVM.

IMPORTANT DATES

Release of training sets and baseline systems: March 6, 2013
Release of test sets: May 17, 2013
Release of updated data sets for Tasks 1.3 and 2: May 30, 2013
Submission deadline for all QE subtasks: June 5, 2013
Paper submission deadline: June 10, 2013

ORGANIZERS

Christian Buck (University of Edinburgh)
Radu Soricut (Google)
Lucia Specia (University of Sheffield)

Other Requirements

You are invited to submit a short paper (4 to 6 pages) describing your QE method(s). Submitting a paper is not required; if you choose not to, we ask you to provide an appropriate reference describing your method(s) that we can cite in the WMT overview paper.

We encourage researchers who are submitting research papers to also submit entries to the shared task using the training resources provided by this workshop (in addition to potential entries that may use other training resources), so that their experiments can be reproduced by others using these publicly available resources.

CONTACT

For questions, comments, etc. please send email to Lucia Specia lspecia@gmail.com.