Automatic Post-Editing Task - EMNLP 2017 Second Conference on Machine Translation

Shared Task: Automatic Post-Editing

OVERVIEW

The third round of the APE shared task follows the success of the previous two rounds organised in 2015 and 2016. The aim is to examine automatic methods for correcting errors produced by an unknown machine translation (MT) system. This has to be done by exploiting knowledge acquired from human post-edits, which are provided as training material.

Goals

The aim of this task is to improve MT output in black-box scenarios, in which the MT system is used "as is" and cannot be modified. From the application point of view, APE components would make it possible to:

Cope with systematic errors of an MT system whose decoding process is not accessible
Provide professional translators with improved MT output quality to reduce (human) post-editing effort
Adapt the output of a general-purpose system to the lexicon/style requested in a specific application domain

Task Description

As in the last round, the English-German task focuses on the Information Technology (IT) domain. The novelty this year is the addition of a second language direction, German-English, which covers the Medical domain. In both cases, the source sentences have been translated into the target language by an MT system unknown to the participants and then manually post-edited by professional translators.

At the training stage, the collected human post-edits have to be used to learn correction rules for the APE systems. At the test stage, they will be used for system evaluation with automatic metrics (TER and BLEU).

DIFFERENCES FROM THE SECOND ROUND (WMT 2016)

Compared to the second round, the main differences are:

Additional language direction (German-English);
Additional domain (Medical);
Larger data set.

Data

Training, development and test data consist of English-German and German-English triplets (source, target, and post-edit) belonging to the IT and Medical domains, respectively. All data are already tokenized and are provided by the EU project QT21 (http://www.qt21.eu/).
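As an illustration of the triplet structure, the following minimal sketch groups three line-aligned inputs into (source, target, post-edit) records. It assumes that each part of the data is a plain-text file with one tokenized segment per line and that the three files are aligned line by line; the names `Triplet` and `read_triplets` are hypothetical and not part of the released data.

```python
from typing import Iterable, List, NamedTuple


class Triplet(NamedTuple):
    source: str     # sentence in the source language
    target: str     # raw MT output to be post-edited
    post_edit: str  # human post-edit of the MT output


def read_triplets(src_lines: Iterable[str],
                  mt_lines: Iterable[str],
                  pe_lines: Iterable[str]) -> List[Triplet]:
    """Group three line-aligned inputs into triplets.

    Assumption: line i of each input refers to the same segment.
    """
    src, mt, pe = list(src_lines), list(mt_lines), list(pe_lines)
    if not (len(src) == len(mt) == len(pe)):
        raise ValueError("inputs are not line-aligned")
    return [Triplet(s.rstrip("\n"), t.rstrip("\n"), p.rstrip("\n"))
            for s, t, p in zip(src, mt, pe)]
```

Since the data are already tokenized, `triplet.target.split()` is enough to obtain the tokens of a segment in this sketch.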

For the EN-DE language direction, the development set released in 2016 can be used to tune the systems.

To download the data click on the links in the table below:

Language pair	Domain	2016 data	2017 training data	2017 test data	2017 test post-edits	Additional resource
EN-DE	IT	train, dev, test^*	train	test	test post-edits	artificial training data⁺
DE-EN	Medical		train, dev	test	test post-edits

^*: The 2016 test set will be used as a progressive test set to measure the progress of state-of-the-art systems.

⁺: This training data was created and used in "Log-linear Combinations of Monolingual and Bilingual Neural Machine Translation Models for Automatic Post-Editing".

NOTE:
1) Any use of additional data for training your system is allowed (e.g. parallel corpora, post-edited corpora).
2) Please use the following citation if you use these data sets in your publications.

(TO BE ADDED)

Evaluation

Systems' performance will be evaluated with respect to their capability to reduce the distance that separates an automatic translation from its human-revised version.

This distance will be measured in terms of TER, computed in case-sensitive mode between the automatic post-edits and the human post-edits.

BLEU will also be used as a secondary evaluation metric. To gain further insight into final output quality, a subset of the outputs of the submitted systems will also be manually evaluated, as in 2016.

The submitted runs will be ranked based on the average HTER calculated on the test set by using the tercom software.

The HTER calculated between the raw MT output and human post-editions in the test set will be used as baseline (i.e. the baseline is a system that leaves all the test instances unmodified).
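The comparison against the baseline can be sketched as follows. This is a simplified approximation, not the tercom implementation: it counts only word-level insertions, deletions and substitutions and ignores the shift operations that TER also allows, so its scores can be higher than tercom's. Averaging over sentence-level scores follows the wording above and is an assumption about how the scores are aggregated; the function names are hypothetical.

```python
from typing import List, Sequence


def edit_distance(hyp: Sequence[str], ref: Sequence[str]) -> int:
    """Word-level Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1  # exact match, so case-sensitive
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return prev[-1]


def approx_ter(hypotheses: List[str], references: List[str]) -> float:
    """Average sentence-level edit rate in percent, without shifts."""
    scores = []
    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        scores.append(edit_distance(hyp_tok, ref_tok) / max(len(ref_tok), 1))
    return 100.0 * sum(scores) / len(scores)


# Toy data: the baseline leaves the MT output unmodified.
mt_output = ["the cat sat", "hello world"]
post_edits = ["the cat sat down", "hello world"]
baseline_score = approx_ter(mt_output, post_edits)
```

Under this sketch, an APE system improves on the baseline when `approx_ter(ape_output, post_edits)` is lower than `baseline_score`.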

The evaluation script can be downloaded here

Results for EN-DE on 2017 test set

Systems	TER	BLEU
FBK_EnsembleRerank_Primary	19.6^{^}	70.07^{^}
AMU_multi-transducer-composed_PRIMARY	19.77^{^}	69.5^{^}
AMU.multi-transducer.SECONDARY	19.83^{^}	69.38^{^}
DCU_FRANKENAPE-TUNED_PRIMARY	20.11^{^}	69.19^{^}
DCU_FRANKENAPE-TUNED_CONTRASTIVE	20.25^{^}	69.33^{^}
FBK_SingleModelRerank_Contrastive	20.3^{^}	69.11^{^}
FBK_USAARRerankStatFeat_Contrastive	21.55^{^}	67.28^{^}
USAAR_NMT-OSM_PRIMARY	23.05^{^}	65.01^{^}
LIG_chained_syn_PRIMARY	23.22^{^}	65.12^{^}
JXNU_JXNU_EDITFreq_PRIMARY	23.31^{^}	65.66^{^}
LIG_forced_CONTRASTIVE	23.51^{^}	64.52^{^}
LIG_chained_CONTRASTIVE	23.66^{^}	64.46^{^}
CUNI_char_conv_rnn_beam_PRIMARY	24.03	64.28^{^}
USAAR_OSM_CONTRASTIVE	24.17^{^}	63.55^{^}
Official Baseline (MT)	24.48	62.49
Baseline_2 (Statistical phrase-based APE)	24.69^{^}	62.97^{^}
CUNI_char_conv_rnn_greedy_CONTRASTIVE	25.94^{^}	61.65^{^}

^{^}: indicates that the difference from the official baseline (MT) is statistically significant.

Results for DE-EN on 2017 test set

Systems	TER	BLEU
FBK_EnsembleRerank_Primary	15.29^{^}	79.82^{^}
FBK_SingleModelRerank_Contrastive	15.31^{^}	79.64
LIG_chained_syn_PRIMARY	15.53	79.49
Official Baseline (MT)	15.55	79.54
LIG_forced_CONTRASTIVE	15.62	79.48
LIG_chained_CONTRASTIVE	15.68^{^}	79.35^{^}
Baseline_2 (Statistical phrase-based APE)	15.74^{^}	79.28

Submission Format

Your system should output automatic post-edits of the target sentences in the test set in the following format:

<METHOD NAME>   <SEGMENT NUMBER>   <APE SEGMENT>

Where:

METHOD NAME is the name of your automatic post-editing method.
SEGMENT NUMBER is the line number of the plain text target file you are post-editing.
APE SEGMENT is the automatic post-edition for the particular segment.

Each field should be delimited by a single tab character.
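A minimal sketch of producing this format is given below. It assumes that segment numbers start at 1 and follow the line order of the target file, which the description implies but does not state explicitly; the function name `format_submission` is hypothetical.

```python
from typing import Iterable


def format_submission(method_name: str, ape_segments: Iterable[str]) -> str:
    """Build the submission text: one tab-delimited line per segment.

    Assumption: segment numbers are 1-based line numbers of the target file.
    """
    lines = []
    for number, segment in enumerate(ape_segments, start=1):
        if "\t" in segment or "\n" in segment:
            # A tab or newline inside a segment would break the three-field layout.
            raise ValueError(f"segment {number} contains a tab or newline")
        lines.append(f"{method_name}\t{number}\t{segment}")
    return "\n".join(lines) + "\n"


submission_text = format_submission("pt_1_pruned", ["Hallo Welt", "Guten Tag"])
```

The resulting string can then be written to the submission file as-is.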

Submission Requirements

Each participating team can submit at most 3 systems and must explicitly indicate which one is its primary submission. If no run is marked as primary, the latest submission received will be used as the primary submission.

Submissions should be sent via email to wmt-ape-submission@fbk.eu. Please use the following pattern to name your files:

INSTITUTION-NAME_METHOD-NAME_SUBTYPE, where:

INSTITUTION-NAME is an acronym/short name for your institution, e.g. "UniXY"

METHOD-NAME is an identifier for your method, e.g. "pt_1_pruned"

SUBTYPE indicates whether the submission is primary or contrastive and takes one of two values: PRIMARY or CONTRASTIVE.
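The naming pattern can be sketched as a small helper that joins the three parts with underscores and rejects any other subtype. The function name `submission_filename` is hypothetical, and no file extension is added because the pattern above does not specify one.

```python
ALLOWED_SUBTYPES = ("PRIMARY", "CONTRASTIVE")


def submission_filename(institution: str, method: str, subtype: str) -> str:
    """Return INSTITUTION-NAME_METHOD-NAME_SUBTYPE for a submission."""
    if subtype not in ALLOWED_SUBTYPES:
        raise ValueError(f"subtype must be one of {ALLOWED_SUBTYPES}")
    if not institution or not method:
        raise ValueError("institution and method names must be non-empty")
    return f"{institution}_{method}_{subtype}"


example_name = submission_filename("UniXY", "pt_1_pruned", "PRIMARY")
```

Because a method identifier such as "pt_1_pruned" may itself contain underscores, this sketch only builds names and does not attempt to split them back into their parts.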

You are also invited to submit a short paper (4 to 6 pages) to WMT describing your APE method(s). Submitting a paper is optional. If you do not submit one, we ask you to provide an appropriate reference describing your method(s) that we can cite in the WMT overview paper.

Important dates

Release of training data	February 16, 2017
Release of test data	April 10, 2017
Submission deadline	~~May 6, 2017~~ May 13, 2017
Paper submission deadline	~~June 2, 2017~~ June 9, 2017
Manual evaluation	TBD
Notification of acceptance	June 30, 2017
Camera-ready deadline	July 14, 2017

Organisers

Rajen Chatterjee (Fondazione Bruno Kessler)
Yvette Graham (Dublin City University)
Matteo Negri (Fondazione Bruno Kessler)
Raphael Rubino (Saarland University)
Marco Turchi (Fondazione Bruno Kessler)

Contact

For any information or questions about the task, please send an email to: wmt-ape@fbk.eu.
To stay up to date with this year's edition of the APE task, you can also join the wmt-ape group.

Supported by the European Commission under the QT21 project (grant number 645452)