Shared Task: Multimodal Machine Translation

This shared task is aimed at the generation of image descriptions in a target language. It can be addressed as a translation task, which takes a source-language description and translates it into the target language, where this process can be supported by information from the image (multimodal translation), or as a multisource multimodal translation task, which takes source-language descriptions in multiple languages and translates them into the target language, using the visual information as additional context.

This shared task has the following main goals:

We welcome participation from experienced and new participants. We would also particularly like to encourage participants to consider the unconstrained data setting for both tasks. Participants agree to contribute to the manual evaluation: approximately eight hours of work per system submission.

Important dates

Release of training data: February 12, 2018
Release of test data: June 8, 2018
Results submission deadline: June 15, 2018
Start of manual evaluation: June 20, 2018
End of manual evaluation: July 20, 2018

NEW: Download all pre-processed submissions.

NEW: Please submit your pre-processed WMT18 submissions (see the link above) and any new submissions to CodaLab.


Task 1: Multimodal Machine Translation Task

This task consists of translating English sentences that describe an image into German, French, or Czech, given the English sentence itself and the image that it describes (or features from this image, if participants choose to use them). See Specia et al. (2016) and Elliott et al. (2017) for descriptions of the previous editions of this task at WMT16 and WMT17.

The original data for this task was created by extending the Flickr30K Entities dataset: for each image, one of the English descriptions was selected and manually translated into German, French, and Czech by human translators.

English-German: translations were produced by professional translators, who were given the source segment only (training set) or the source segment and the image (validation and test sets).
English-French: translations were produced via crowd-sourcing; translators had access to the source segment, the image, and an automatic translation created with a standard phrase-based system (a Moses baseline built using the WMT'15 constrained translation task data) as a suggestion to make translation easier. Note that this was not a post-editing task: although translators could copy and paste the suggested translation and edit it, they did not do so in the vast majority of cases.
English-Czech: translations were produced via crowd-sourcing; translators had access to the source segment and the image.

Summary of the datasets:

Dataset          Images   Sentences
Training         29,000      29,000
Validation        1,014       1,014
Test 2016         1,000       1,000
Test 2017         1,000       1,000
Ambiguous COCO      461         461
Test 2018         1,071       1,071

As training and development data, we provide 29,000 and 1,014 tuples, respectively, each containing an English source sentence, its German, French, and Czech human translations, and the corresponding image. We also provide the 2016 and 2017 test sets, which participants can use for validation and internal evaluation. The English-German datasets are the same as in 2016, but note that the human translations in the 2016 validation and test sets have been post-edited (by humans) using the images, to make sure the target descriptions are faithful to those images. In some cases the 2016 source text was ambiguous and the image was used to resolve the ambiguity. The French translations were added in 2017 and the Czech translations in 2018.

As test data, we provide a new test set of 1,071 tuples, each containing an English description and its corresponding image. Gold labels will be translations into German, French, or Czech.
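For concreteness, here is a minimal Python sketch of one way to load the sentence-aligned text data into tuples. The file names (train.en, train.de, and so on) are assumptions, so check the Multi30K repository for the actual paths and archive layout; the same loader covers the Task 1b setting by selecting a different language subset.

# Minimal sketch: load the sentence-aligned Multi30K text data into tuples.
# The file names (train.en, train.de, train.fr, train.cs) are assumptions;
# check the Multi30K repository for the actual paths and archive layout.
from pathlib import Path

def load_parallel(data_dir, split="train", langs=("en", "de", "fr", "cs")):
    """Read one sentence-aligned file per language and zip them into tuples."""
    columns = []
    for lang in langs:
        path = Path(data_dir) / f"{split}.{lang}"
        with open(path, encoding="utf-8") as handle:
            columns.append([line.rstrip("\n") for line in handle])
    # Every file must have the same number of lines (29,000 for train).
    assert len({len(column) for column in columns}) == 1, "misaligned corpora"
    return list(zip(*columns))

# tuples = load_parallel("multi30k/data", "train")
# tuples[0] -> (English, German, French, Czech) for the first image.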

Evaluation will be performed using human Direct Assessment. Submissions and reference translations will be pre-processed by lowercasing, normalising punctuation, and tokenising the sentences. Each language will be evaluated independently. If you participate in the shared task, we ask you to perform a defined amount of evaluation per language pair submitted.
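As an illustration, the following sketch approximates this pre-processing with the sacremoses toolkit; it is an approximation, not the organisers' exact scripts (the pre-processing scripts in the Multi30K repository are authoritative).

# Minimal sketch of the evaluation pre-processing (lowercasing, punctuation
# normalisation, tokenisation) using the sacremoses toolkit. This is an
# approximation, not the organisers' exact scripts.
from sacremoses import MosesPunctNormalizer, MosesTokenizer

def preprocess(sentence, lang="de"):
    normalizer = MosesPunctNormalizer(lang=lang)
    tokenizer = MosesTokenizer(lang=lang)
    normalized = normalizer.normalize(sentence.lower())
    return tokenizer.tokenize(normalized, return_str=True)

# preprocess("Ein Mann fährt ein rotes Auto.") ->
# "ein mann fährt ein rotes auto ."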


The evaluated test sets are En-De Flickr 2018, En-Fr Flickr 2018, and En-Cs Flickr 2018.


Task 1b: Multisource Multimodal Machine Translation Task

This new task consists of translating English sentences that describe an image into Czech, given the English sentence itself, the image that it describes (or features from this image, if participants choose to use them), and parallel sentences in French and German. Participants are free to use any subset of the additional source-language data in their submissions.

Summary of the datasets:

Dataset       Images   Sentences
Training      29,000      29,000
Validation     1,014       1,014
Test 2016      1,000       1,000
Test 2017      1,000       1,000

As training and development data, we provide 29,000 and 1,014 tuples, respectively, each containing English, French, and German source sentences, the Czech human translation, and the corresponding image. We also provide the 2016 validation and test sets, which participants can use for validation and internal evaluation. The English-German datasets are the same as in 2016, but note that the human translations in the 2016 validation and test sets have been post-edited (by humans) using the images, to make sure the target descriptions are faithful to those images. In some cases the 2016 source text was ambiguous and the image was used to resolve the ambiguity. The French translations were added in 2017 and the Czech translations in 2018.

As test data, we provide a test set of 1,000 tuples, each containing English, French, and German descriptions and the corresponding image. Gold labels will be translations into Czech. This test set corresponds to the unseen portion of the Czech Test 2017 data.

Evaluation will be performed using human Direct Assessment. Submissions and reference translations will be pre-processed in the same way as for Task 1 (lowercasing, punctuation normalisation, tokenisation). If you participate in the shared task, we ask you to perform a defined amount of evaluation per language pair submitted.


The evaluated test set is En-Cs Flickr 2017.


Textual and Visual Data

All of the textual data can be downloaded from the Multi30K Github repository. We also provide example data pre-processing scripts; their use is not mandatory.

We also provide ResNet-50 image features, although their use is not mandatory; these can be downloaded here. The raw images for the training and development sets and the 2016 test set can be requested here. Images for the 2017 and 2018 test sets are distributed with the test files.
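If you prefer to extract your own visual features from the raw images, a sketch along the following lines with torchvision is one option. The layer choice (global average pooling, 2048 dimensions) and the image preprocessing here are assumptions, not the organisers' exact extraction pipeline.

# Minimal sketch: extract pooled ResNet-50 features from a raw image with
# torchvision. This is NOT the organisers' extraction pipeline; the layer
# choice (global average pool, 2048-d) and preprocessing are assumptions.
import torch
from PIL import Image
from torchvision import models, transforms

model = models.resnet50(pretrained=True)
model.eval()
# Drop the final classification layer to expose the 2048-d pooled features.
feature_extractor = torch.nn.Sequential(*list(model.children())[:-1])

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def extract(image_path):
    batch = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return feature_extractor(batch).flatten(1)  # shape: (1, 2048)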

We released the new test set on June 8, 2018.

If you use the datasets created for this shared task, please cite the following papers:

@inproceedings{elliott-EtAl:2016:VL16,
 author    = {Desmond Elliott and Stella Frank and Khalil Sima'an and Lucia Specia},
 title     = {{Multi30K: Multilingual English-German Image Descriptions}},
 booktitle = {Proceedings of the 5th Workshop on Vision and Language},
 year      = {2016},
 pages     = {70--74}
}
@inproceedings{ElliottFrankBarraultBougaresSpecia2017,
 author = {Desmond Elliott and Stella Frank and Lo\"{i}c Barrault and Fethi Bougares and Lucia Specia},
 title = {{Findings of the Second Shared Task on Multimodal Machine Translation and Multilingual Image Description}},
 booktitle = {Proceedings of the Second Conference on Machine Translation},
 year = {2017},
 month = {September},
 address = {Copenhagen, Denmark}
}


Additional resources

We suggest the following interesting resources that can be used as additional training data for either or both tasks:

Submissions using these or any other resources external to those provided for the tasks should indicate that their submissions are of the "unconstrained" type.

Submission Requirements

Your system description should be a short report (4 to 6 pages) submitted to WMT describing your method(s). We ask you to provide a summary and/or an appropriate reference describing your method(s) that we can cite in the WMT overview paper.

Each participating team can submit at most 2 systems for each of the task variants for each language pair.

Submissions should be sent via email to Lucia Specia lspecia@gmail.com.

Please use the following pattern to name your files:

INSTITUTION-NAME_TASK-NAME_METHOD-NAME_TYPE, where:

INSTITUTION-NAME is an acronym/short name for your institution, e.g. SHEF

TASK-NAME is one of the following: 1 (translation) or 1b (multisource multimodal).

METHOD-NAME is an identifier for your method in case you have multiple methods for the same task, e.g. 1b_MultimodalTranslation, 1b_Moses

TYPE is either C or U, where C indicates "constrained", i.e. using only the resources provided by the task organisers, and U indicates "unconstrained".

For instance, a constrained submission from team SHEF for Task 1b using method "Moses" could be named SHEF_1b_Moses_C.

If you are submitting a system for Task 1, please include the language in the TASK tag, e.g. 1_FLICKR_DE, 1_FLICKR_FR, etc.

Submission Format

For a given task, the output of your system should be a target-language description for each image, formatted in the following way:

<METHOD NAME> <IMAGE ID> <DESCRIPTION> <TASK> <TYPE>

Each field should be delimited by a single tab character.
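A minimal sketch of a writer for this format follows; the method name, image IDs, and output text in the usage comment are made up for illustration.

# Minimal sketch: write one tab-delimited submission line per image,
# in the field order METHOD NAME, IMAGE ID, DESCRIPTION, TASK, TYPE.
def write_submission(path, method, task, type_, outputs):
    """outputs: iterable of (image_id, description) pairs."""
    with open(path, "w", encoding="utf-8") as handle:
        for image_id, description in outputs:
            fields = (method, str(image_id), description, task, type_)
            handle.write("\t".join(fields) + "\n")

# write_submission("SHEF_1_FLICKR_DE_Moses_C.txt", "Moses", "1_FLICKR_DE",
#                  "C", [(1, "ein mann fährt ein rotes auto .")])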

Organisers

Lucia Specia (University of Sheffield)
Stella Frank (University of Amsterdam)
Loïc Barrault (University of Le Mans)
Fethi Bougares (University of Le Mans)
Desmond Elliott (University of Edinburgh)

Contact

For questions or comments, please use the wmt-tasks mailing list.

License

The data is licensed under Creative Commons: Attribution-NonCommercial-ShareAlike 4.0 International.

Supported by the following European Commission projects: MultiMT and M2CR.