This shared task will examine automatic methods for correcting errors produced by an unknown machine translation (MT) system. Since the system itself is a "black-box", automatic post-editing methods have to operate downstream (that is, after MT decoding), exploiting knowledge acquired from previous human post-editions and provided as training material.
Automatic Post-editing (APE) aims at improving MT output in black-box scenarios, in which the MT system is used "as is" and cannot be modified. From the application point of view, APE components would make it possible to:
In this pilot run of the shared task, we will provide you with training (source, target, human post-edition) triples, and you will return automatic post-editions for unseen (source, target) test pairs.
Training and development data (the same used for the Sentence-level Quality Estimation task) respectively consist of 11,272 and 1,000 English-Spanish triples in which:
Sources, targets and human post-editions are provided in separate files.
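Since the three files are parallel (one tokenized segment per line), loading the triples is a simple zip over the files. A minimal sketch, in which the file paths are placeholders you would point at the downloaded data:

```python
def load_triples(src_path, tgt_path, pe_path):
    """Read three parallel plain-text files (source, target, human
    post-edition), one segment per line, and return aligned triples."""
    with open(src_path, encoding="utf-8") as src, \
         open(tgt_path, encoding="utf-8") as tgt, \
         open(pe_path, encoding="utf-8") as pe:
        return list(zip((line.rstrip("\n") for line in src),
                        (line.rstrip("\n") for line in tgt),
                        (line.rstrip("\n") for line in pe)))
```

For the test data, which has no human post-editions, the same idea applies with only the source and target files.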
Download training and development data.
Test data consist of 1,817 tokenized (source, target) pairs with the same characteristics as the source and target sentences provided for training.
Download test data (NEW: the test set is now available).
Any use of additional data for training your system is allowed (e.g. parallel corpora, post-edited corpora).
Systems' performance will be evaluated with respect to their capability to reduce the distance that separates an automatic translation from its human-revised version. Such distance will be measured in terms of human-targeted TER (HTER).
While HTER is normally calculated as the minimum edit distance, in [0,1], between a machine translation and its manually post-edited version, in the APE task it will be used to measure the edit distance between automatic and manual post-editions.
The submitted runs will be ranked based on the average HTER calculated on the test set by using the tercom software.
Each run will be evaluated in two modes, namely: i) case insensitive and ii) case sensitive.
If specified by the participants at submission time (see Submission Requirements), final results for a given run will be released according to only one of the two modes.
Otherwise, final results will be released for both modes, as two separate scores.
In both cases, lower average HTER will correspond to a higher rank.
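As a rough illustration of the scoring, TER can be sketched as word-level edit distance divided by reference length. This is only an approximation: the official tercom tool also allows block shifts, so real scores can be lower than this sketch produces.

```python
def ter(hyp, ref, case_sensitive=True):
    """Simplified TER: word-level edit distance over reference length.
    Unlike tercom, block shifts are ignored, so this is only an
    upper-bound approximation of the official score."""
    if not case_sensitive:
        hyp, ref = hyp.lower(), ref.lower()
    h, r = hyp.split(), ref.split()
    # Standard word-level Levenshtein dynamic program.
    d = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        d[i][0] = i
    for j in range(len(r) + 1):
        d[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if h[i - 1] == r[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(h)][len(r)] / max(len(r), 1)

def average_hter(ape_outputs, human_post_edits, case_sensitive=True):
    """Average HTER of automatic post-editions against the human
    post-editions; submissions are ranked by this value (lower is better)."""
    scores = [ter(h, r, case_sensitive)
              for h, r in zip(ape_outputs, human_post_edits)]
    return sum(scores) / len(scores)
```

Under this scheme the baseline described below is simply `average_hter(mt_targets, human_post_edits)`, i.e. a system that leaves every target unchanged.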
The evaluation scripts available for download allow participants to compute HTER scores in both modalities.
Download the evaluation script.
The HTER calculated between the raw MT output and human post-editions in the test set will be used as baseline (i.e. the baseline is a system that leaves all the test instances unmodified).
Your system should produce automatic post-editions of the target sentences in the test set, formatted as follows:
<METHOD NAME> <SEGMENT NUMBER> <APE SEGMENT>
Where:
METHOD NAME is the name of your automatic post-editing method.
SEGMENT NUMBER is the line number of the plain-text target file you are post-editing.
APE SEGMENT is the automatic post-edition for that particular segment.
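A minimal sketch of writing a run in this format, assuming single-space field separation as shown in the template above (the exact delimiter is not restated here):

```python
def write_submission(method_name, ape_segments, path):
    """Write one '<METHOD NAME> <SEGMENT NUMBER> <APE SEGMENT>' line per
    target segment. Segment numbers are the 1-based line numbers of the
    plain-text target file being post-edited."""
    with open(path, "w", encoding="utf-8") as out:
        for number, segment in enumerate(ape_segments, start=1):
            out.write(f"{method_name} {number} {segment}\n")
```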
Each participating team can submit at most 3 systems, but they have to explicitly indicate which of them represents their primary submission. If none of the runs is marked as primary, the latest submission received will be used as the primary one.
Submissions should be sent via email to email@example.com. Please use the following pattern to name your files:
INSTITUTION-NAME is an acronym/short name for your institution, e.g. "UniXY"
METHOD-NAME is an identifier for your method, e.g. "pt_1_pruned"
SUBTYPE indicates whether the submission is primary or contrastive, with the two alternative values:
EVALTYPE indicates whether the submission should be evaluated with only one of the two alternative modes or in both ways:
For instance, the name "UniXY_pt_1_pruned_PRIMARY_BOTH" could be used to indicate the primary submission from team UniXY, based on method "pt_1_pruned", to be evaluated both in case insensitive and case sensitive mode.
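Building the file name is just joining the four fields with underscores, as in the example above. Note that PRIMARY and BOTH are taken from that example; the full sets of legal SUBTYPE and EVALTYPE values are those listed in the submission requirements.

```python
def submission_filename(institution, method, subtype, evaltype):
    """Join the four naming fields with underscores, e.g.
    'UniXY_pt_1_pruned_PRIMARY_BOTH'."""
    return "_".join([institution, method, subtype, evaltype])
```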
You are also invited to submit a short paper (4 to 6 pages) to WMT describing your APE method(s). Submitting a paper is not compulsory; if you choose not to, we ask you to provide an appropriate reference describing your method(s) that we can cite in the WMT overview paper.
Release of training data: January 31, 2015
Test set distributed: April 27, 2015
Submission deadline: May 15, 2015
Paper submission deadline: June 28, 2015
Notification of acceptance: July 21, 2015
Camera-ready deadline: August 11, 2015
All the APE task data are kindly provided by Unbabel.
Please send your questions, comments, etc. to firstname.lastname@example.org.
To stay up to date on this year's edition of the APE pilot task, you can also join the wmt-ape group.
Supported by the European Commission under the QT21 project (grant number 645452).