Shared Task: Cross-lingual Pronoun Prediction



Pronoun translation poses a problem for current state-of-the-art SMT systems as pronoun systems do not map well across languages, e.g., due to differences in gender, number, case, formality, or humanness, and to differences in where pronouns may be used. Translation divergences typically lead to mistakes in SMT, as when translating the English "it" into French ("il", "elle", or "cela"?) or into German ("er", "sie", or "es"?). One way to model pronoun translation is to treat it as a cross-lingual pronoun prediction task.

We propose such a task, which asks participants to predict a target-language pronoun given a source-language pronoun in the context of a sentence. We further provide a lemmatised target-language human-authored translation of the source sentence, and automatic word alignments between the source sentence words and the target-language lemmata. In the translation, the words aligned to a subset of the source-language third-person pronouns are substituted by placeholders. The aim of the task is to predict, for each placeholder, the word that should replace it from a small, closed set of classes, using any type of information that can be extracted from the documents.

The cross-lingual pronoun prediction task will be similar to the task of the same name at DiscoMT 2015.

Participants are invited to submit systems for the English-French and English-German language pairs, for both directions.


The goals of the cross-lingual pronoun prediction task are:


In the cross-lingual pronoun prediction task, you are given a source-language document with a lemmatised and POS-tagged human-authored translation and a set of word alignments between the two languages. In the translation, the lemmatised tokens aligned to the source-language third-person pronouns are substituted by placeholders. Your task is to predict, for each placeholder, the fully inflected word token that should replace it, from a small, closed set of classes. That is, in the case of English-to-German or English-to-French translation, you must provide the fully inflected German or French translation of the English pronoun in the context sketched by the lemmatised and tagged target side. You may use any type of information that you can extract from the documents.

Lemmatised and POS-tagged target-language data is provided in place of fully inflected text. The provision of lemmatised data is intended both to provide a challenging task, and to simulate a scenario that is more closely aligned with working with machine translation system output. POS tags provide additional information which may be useful in the disambiguation of lemmas (e.g. noun vs. verb, etc.) and in the detection of patterns of pronoun use.

The pronoun prediction task will be run for the following sub-tasks:

Details of the source-language pronouns and the prediction classes that exist for each of the above sub-tasks are provided in the following section. The different combinations of source-language pronoun and target-language prediction classes represent some of the different problems that SMT systems face when translating pronouns for a given language pair and translation direction.

The task will be evaluated automatically by matching the predictions against the words found in the reference translation by computing the overall accuracy and precision, recall and F-score for each class. The primary score for the evaluation is the macro-averaged F-score over all classes. Compared to accuracy, the macro-averaged F-score favours systems that consistently perform well on all classes and penalises systems that maximise the performance on frequent classes while sacrificing infrequent ones.
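As a sketch of how a macro-averaged F-score can be computed from per-class precision and recall (our own minimal implementation, not the official scorer):

```python
def macro_f1(per_class_pr):
    """Macro-averaged F-score over a dict {class: (precision, recall)}.
    Classes with P + R == 0 contribute an F1 of 0, so rare classes that a
    system ignores entirely drag the average down."""
    def f1(p, r):
        return 2 * p * r / (p + r) if p + r > 0 else 0.0
    return sum(f1(p, r) for p, r in per_class_pr.values()) / len(per_class_pr)
```

Because every class contributes equally to the average, a system that scores perfectly on one frequent class but predicts nothing for the others is penalised accordingly, which is the behaviour described above.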

The data supplied for the classification task consists of parallel source-target text with word alignments. In the target-language text, a subset of the words aligned to source-language occurrences of a specified set of pronouns have been replaced by placeholders of the form REPLACE_xx, where xx is the index of the source-language word the placeholder is aligned to. Your task is to predict one of the classes listed in the relevant source-target section below, for each occurrence of a placeholder.

The training and development data is supplied in a file format with five tab-separated columns:

A single segment may contain more than one placeholder. In that case, columns 1 and 2 contain multiple space-separated entries in the order of placeholder occurrence. A document segmentation of the data is provided in separate files for each corpus. These files contain one line per segment, but the precise format varies depending on the type of document markup available for the different corpora. In the development and test data, the files have a single column containing the ID of the document the segment is part of.

Here is an example line from one of the training data files:

elles	Elles	They arrive first .	REPLACE_0 arriver|VER en|PRP premier|NUM .|.	0-0 1-1 2-2 2-3 3-4

The test set will be supplied in the same format, but with columns 1 and 2 empty, so that each line starts with two tab characters. Your submission should have the same format as column 1 above, so a correct solution would contain the class label elles in this case. Each line should contain as many space-separated class labels as there are REPLACE tags in the corresponding segment. For each segment not containing any REPLACE tags, an empty line should be emitted. Additional tab-separated columns may be present in the submission, but will be ignored. Note in particular that you are not required to predict the second column. The submitted files should be encoded in UTF-8 (like the data we provide).
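The submission format described above could be produced with a helper along these lines (the function name is our own, hypothetical choice):

```python
def format_submission(segments):
    """segments: one list of predicted class labels per test segment, in
    order. Segments containing no REPLACE tags map to empty lists and
    therefore produce empty lines, as the task requires.
    Write the result with encoding="utf-8"."""
    return "".join(" ".join(labels) + "\n" for labels in segments)

# Segment 1 has one placeholder, segment 2 has none, segment 3 has two.
out = format_submission([["elles"], [], ["il", "ce"]])
```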

The training, development and test datasets have been filtered to remove non-subject position pronouns. The filtering for the test set will be manually checked to ensure that no non-subject position pronouns remain. For more information, please see the section on data filtering below.

The complete test data for the classification task, including reference translations and word alignments, will be released on 6th April 2016. Your submission is due on 22nd April 2016. Detailed submission instructions can be found at the end of this page.


The following sections describe the set of source-language pronouns and target-language classes to be predicted, for each of the four sub-tasks. Please note that the sub-tasks are asymmetric in terms of the source-language pronouns and prediction classes. The selection of the source-language pronouns and their target-language prediction classes for each sub-task is based on the variation that is possible when translating a given source-language pronoun. For example, when translating the English pronoun "it" into French, a decision must be made as to the gender of the French pronoun, with "il" and "elle" both providing valid options. The translation of the English pronouns "he" and "she" into French, however, does not require such a decision. These may simply be mapped 1-to-1, as "il" and "elle" respectively. The translation of "he" and "she" from English into French is therefore not considered an "interesting" problem and as such, these pronouns are excluded from the source-language set for the English->French sub-task. In the opposite translation, the French pronoun "il" may be translated as "it" or "he", and "elle" as "it" or "she". As a decision must be taken as to the appropriate target-language translation of "il" and "elle", these are included in the set of source-language pronouns for the French->English sub-task.

You should *always* predict either a word token or "OTHER". See prediction class lists below for a list of word tokens to predict for each sub-task.


This sub-task will concentrate on the translation of subject position "it" and "they" from English into French. The following prediction classes exist for this sub-task:

ce: The French pronoun "ce" (sometimes with elided vowel as "c'"), as in the expression "c'est" ("it is")
elle: Feminine singular subject pronoun
elles: Feminine plural subject pronoun
il: Masculine singular subject pronoun
ils: Masculine plural subject pronoun
cela: Demonstrative pronouns. Includes "cela", "ça", the misspelling "ca", and the rare elided form "ç'"
on: Indefinite pronoun
OTHER: Some other word, or nothing at all, should be inserted
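For data preparation, the class groupings above can be expressed as a surface-form lookup. The mapping below is our own reading of the list (a hypothetical helper, not an official resource):

```python
# Surface forms grouped into English->French prediction classes as listed above.
SURFACE_TO_CLASS = {
    "ce": "ce", "c'": "ce",
    "elle": "elle",
    "elles": "elles",
    "il": "il",
    "ils": "ils",
    "cela": "cela", "ça": "cela", "ca": "cela", "ç'": "cela",
    "on": "on",
}

def to_class(token):
    """Map a French surface token to its prediction class, defaulting to OTHER."""
    return SURFACE_TO_CLASS.get(token.lower(), "OTHER")
```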


This sub-task will concentrate on the translation of subject position "elle", "elles", "il", and "ils" from French into English. The following prediction classes exist for this sub-task:

he: Masculine singular subject pronoun
she: Feminine singular subject pronoun
it: Non-gendered singular subject pronoun
they: Non-gendered plural subject pronoun
this: Demonstrative pronouns (singular). Includes both "this" and "that"
these: Demonstrative pronouns (plural). Includes both "these" and "those"
there: Existential "there"
OTHER: Some other word, or nothing at all, should be inserted


This sub-task will concentrate on the translation of subject position "it" and "they" from English into German. The following prediction classes exist for this sub-task:

er: Masculine singular subject pronoun
sie: Feminine singular subject pronoun
es: Neuter singular subject pronoun
man: Indefinite pronoun
OTHER: Some other word, or nothing at all, should be inserted


This sub-task will concentrate on the translation of subject position "er", "sie" and "es" from German into English. The following prediction classes exist for this sub-task:

he: Masculine singular subject pronoun
she: Feminine singular subject pronoun
it: Non-gendered singular subject pronoun
they: Non-gendered plural subject pronoun
you: Second person pronoun (with both generic and deictic uses)
this: Demonstrative pronouns (singular). Includes both "this" and "that"
these: Demonstrative pronouns (plural). Includes both "these" and "those"
there: Existential "there"
OTHER: Some other word, or nothing at all, should be inserted


If you are interested in participating in the shared task, we recommend that you sign up to our discussion group (wmt-2016-cross-lingual-pronoun-prediction-shared-task) to make sure you don't miss any important information. Feel free to ask any questions you may have about the shared task!


The task is to predict the translation of subject position pronouns for all sub-tasks. In order to ensure fair and accurate evaluation of system performance, filtering has been applied to the source-language texts of the training, development and test datasets, to select only those pronoun instances that are relevant to each sub-task. For example, in the case of English-to-French translation, which focusses on the translation of the English subject position pronouns "it" and "they", only subject position instances of "it" will be included in the development and test datasets.

Training and development data

Automatic filtering has been applied to the source-language texts of the training and development datasets to remove non-subject position instances of the English pronoun "it" and the German pronouns "sie" and "es".

Test data

The test data files have been sentence-aligned automatically, and the alignments have been checked manually. The same automatic filtering of non-subject position pronouns that was applied to the source-language texts of the development datasets has also been applied to the source-language texts of the test dataset. In addition, manual checks have been made to ensure that no non-subject position pronouns remain after the automatic filtering. We had originally proposed manually checking those word alignments that affect pronouns; however, this will not be carried out. After considerable discussion, we concluded that this task is extremely difficult due to the presence of many borderline cases. We therefore believe that it would be difficult to apply manual corrections or exclusions of incorrectly aligned pronoun instances in a consistent manner.


The training and development datasets can be downloaded from the following locations:

Download alternative 1:

Download alternative 2:

The download folder contains many files. See the list below:

Classification data files: *.data.gz

Filtered classification data files: *

List of prediction classes and their frequencies: *.classes

Document ids (the document to which each sentence belongs): *.doc-ids.gz

Please note that:


Participants are encouraged to use any type of information that can be extracted from the source and target-language text. You may use any tool you like, but committee members have found the following ones useful:

Adventurous people could also try using Sebastian Martschat's CORT, which has an option to perform coreference resolution on unannotated text (using the Stanford tools for preprocessing). [Disclaimer: the organisers have not used CORT themselves for that purpose]

In addition to the tools listed above, classification baselines are provided for each sub-task. See section below.


We provide baseline models for each sub-task. Each baseline relies solely on scores from the relevant target-language language model.

To use the baselines you will need to:

Installing the KenLM Python module: if you have pip installed, run pip install. Alternatively, after downloading KenLM, run python install.

You can get predictions for the baseline model by running, e.g.:

python --fmt=replace --removepos --conf en-fr.yml

These are in the format that the scorer requires, with the predicted class labels in the first column, the predicted words in the second column (which is always ignored by the scorer, so don't worry if your system doesn't predict words), and so on.

If you're interested in just using the marginal probabilities for each filler from the language model, you can also use:

python discomt_baseline --fmt=scores --removepos --conf en-fr.yml

which will give you, for each input line, one line with TEXT in the second column giving the source/target text, and zero or more lines with ITEM 0, ITEM 1, etc., giving a (partial) probability distribution over the fillers for each "REPLACE" position.

Other flags:

Sample results with default options on *filtered* en-fr TEDdev (same data as tst2010):

       ce  :    P =   129/  182 =  70.88%     R =   129/  151 =  85.43%     F1 =  77.48%
     elle  :    P =     6/   27 =  22.22%     R =     6/   25 =  24.00%     F1 =  23.08%
     elles :    P =     0/    0 =   0.00%     R =     0/   15 =   0.00%     F1 =   0.00%
       il  :    P =    38/  138 =  27.54%     R =    38/   57 =  66.67%     F1 =  38.97%
      ils  :    P =     0/    0 =   0.00%     R =     0/  140 =   0.00%     F1 =   0.00%
     cela  :    P =    28/   40 =  70.00%     R =    28/   63 =  44.44%     F1 =  54.37%
       on  :    P =     3/   37 =   8.11%     R =     3/   10 =  30.00%     F1 =  12.77%
     OTHER :    P =    76/  139 =  54.68%     R =    76/  102 =  74.51%     F1 =  63.07%
for a macro-averaged R of 40.63%.


You may also find the resources from the DiscoMT 2015 shared task on English-to-French cross-lingual pronoun prediction useful. These include annotations over the English source-language side of the test set, as well as raw training and development data. The data from the DiscoMT 2015 shared task can be downloaded from LINDAT. Please note that there are differences between the DiscoMT 2015 shared task and this year's task at WMT, namely the introduction of lemmatised + POS-tagged target-language data for this year's task.


The predicted pronoun class labels will be automatically evaluated against the gold standard translations from the test set (see the example for the classification baseline above). The current version of the scorer is available here:

The script contains instructions detailing how it should be used.

The script computes macro-averaged R (recall) scores. This is in contrast to the evaluation script for the pronoun prediction task at DiscoMT 2015, which computed macro-averaged F1. The justification for computing macro-averaged R is that, unlike with macro-averaged F1, each error is counted only once; because F1 averages precision and recall over all prediction classes, the same error is counted twice, once against precision and once again against recall.
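A sketch of macro-averaged recall over gold and predicted label sequences (our own minimal implementation, assuming classes absent from the gold standard contribute a recall of 0):

```python
from collections import Counter

def macro_recall(gold, pred, classes):
    """Mean over classes of (correctly predicted / gold occurrences)."""
    correct = Counter(g for g, p in zip(gold, pred) if g == p)
    total = Counter(gold)
    recalls = [correct[c] / total[c] if total[c] else 0.0 for c in classes]
    return sum(recalls) / len(recalls)
```

Applied to the per-class recalls in the baseline results table above (85.43, 24.00, 0.00, 66.67, 0.00, 44.44, 30.00, 74.51), the mean works out to the reported 40.63%.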


We will provide the input data in the same format as the training data, but with the first two columns empty. Your predictions should be submitted in the format recognised by the official scorer; see above for details. Please e-mail the file with the predictions, labelled with the name of your system, to liane [at] no later than 22nd April 2016 (any time zone). The following website, which you may find useful, shows the current time Anywhere on Earth (AoE).

Please note that each participant may submit up to two systems per task. If you submit more than one system for a given task, please indicate which system is the primary system.


The test data is now available for download from the following locations:

Download alternative 1:

Download alternative 2:


We are pleased to announce the results of the shared task. Please click on the following links for a PDF containing an overview of the results and an archive containing the individual submissions and their scores.

The archive contains a subfolder labelled "_gold" which contains a "solution" file for each language pair. These files may be useful in analysing the performance of your systems.


We would like to invite all groups who participated in the WMT2016 task on cross-lingual pronoun prediction to submit a system description paper detailing the design of their systems. Per the instructions on the WMT website, system description papers should be 4-6 pages in length and should conform to the ACL official style guidelines. These ACL guidelines are contained in the style files which can be downloaded from:

System description papers are subject to review and may be rejected if the quality of the description is insufficient. However, the scores or ranking achieved in the shared task evaluation have no bearing on the acceptance decision. We strongly recommend that you write a system description paper and present your system(s) at the workshop no matter how successful your approach was in the evaluation.

At minimum, a system paper should:

In addition, we strongly recommend that the system paper contains:

System papers do not need to provide a detailed description of the task itself or the data sets provided by the organisers. Instead, they may refer to the shared task overview paper, the bibliographic details of which will be announced prior to the camera-ready deadline. Unlike regular long and short papers, system description papers need not be anonymised.

We aim to release the results of the evaluation prior to the paper submission deadline on 15th May.

System description papers should be submitted electronically via the WMT START system:


Release of training data: 2nd February 2016
Test data released: 6th April 2016
System submission deadline: 22nd April 2016
Paper submission deadline: 15th May 2016
Notification of acceptance: 5th June 2016
Camera-ready deadline: 22nd June 2016


Liane Guillou (University of Edinburgh)
Christian Hardmeier (Uppsala University)
Preslav Nakov (Qatar Computing Research Institute)
Andrei Popescu-Belis (Idiap Research Institute)
Sara Stymne (Uppsala University)
Jörg Tiedemann (University of Helsinki)
Yannick Versley (University of Heidelberg)
Bonnie Webber (University of Edinburgh)


The organisation of this task has received support from the following project: