Shared Task: Multimodal Machine Translation

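This is a new shared task aimed at the generation of image descriptions in a target language, given an image and one or more descriptions in a different (source) language. The task can be addressed from two different perspectives: as a translation task (Task 1: Multimodal Machine Translation) or as an image description task (Task 2: Crosslingual Image Description Generation).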

We welcome participants focusing on either or both of these task variants. They will differ mainly in the training data (see below) and in the way the target language descriptions are evaluated: against one or more translations of the corresponding source description (translation variant) or against one or more descriptions of the same image in the target language, created independently from the corresponding source description (image description variant).

This task has the following main goals:

We will provide new training and test datasets for both variants of the task and also allow participants to use external data and resources (constrained vs unconstrained submissions).

The data to be used for both tasks is an extended version of the Flickr30K dataset. The original dataset contains 31,783 images from Flickr on various topics and five crowdsourced English descriptions per image, totalling 158,915 English descriptions. This dataset was extended in different ways for each of the subtasks, as discussed below.

The code for the main baseline system for both tasks is available here (due to several requests). It follows the approach described in (Elliott et al. 2015), in particular the MLM➝LM model. A secondary baseline for both tasks will be a Moses phrase-based statistical machine translation system trained using only the textual training data provided, following the pipeline described here.


Task 1: Training, Validation, and Test sentences, and the splits.

Task 2: Training, Validation, and Test sentences, and the splits.

Image features will be provided to participants, but their use is not mandatory. In particular, we will release features extracted from the VGG-19 CNN described in (Simonyan and Zisserman, 2015), taken from the FC7 (relu7) and CONV5_4 layers using Caffe RC2.

If you use the dataset created for this shared task, please cite the following paper: Multi30K: Multilingual English-German Image Descriptions.

 author    = {{Elliott}, D. and {Frank}, S. and {Sima'an}, K. and {Specia}, L.},
 title     = {Multi30K: Multilingual English-German Image Descriptions},
 booktitle = {Proceedings of the 5th Workshop on Vision and Language},
 year      = {2016},
 pages     = {70--74},


The results are also available for both tasks in the following paper: A Shared Task on Multimodal Machine Translation and Crosslingual Image Description.

Stella Frank gave a presentation about the shared task submissions and results at the conference.

You can also download the submissions to the shared task.

Task 1: Multimodal Machine Translation

This task consists in translating English sentences that describe an image into German, given the English sentence itself and the image that it describes (or features from this image, if participants choose to use them). For this task, the Flickr30K Entities dataset was extended in the following way: for each image, one of the English descriptions was selected and manually translated into German by a professional translator. We will provide most of the resulting parallel data and corresponding images for training, while smaller portions will be used for development and test.

As training and development data, we provide 29,000 and 1,014 triples respectively, each containing an English source sentence, its German human translation and corresponding image.

As test data, we provide a new set of 1,000 tuples containing an English description and its corresponding image.

Evaluation will be performed against the German human translation on the test set using standard MT evaluation metrics, with METEOR as the primary metric. Scores will be computed on lowercased text with punctuation, in both detokenised (primary) and tokenised versions. We will normalise punctuation in both reference translations and system submissions using this script. (Here are some additional notes on how we did the evaluation.) We may also include manual evaluation.

Task 2: Crosslingual Image Description Generation

This task consists in generating a German sentence that describes an image, given the image itself and one or more descriptions in English. For this task, the Flickr30K Entities dataset was extended in the following way: for each image, five German descriptions were crowdsourced independently from their English versions, and independently from each other. Any English-German pair of descriptions for a given image could be considered a comparable translation pair. We will provide most of the images and associated descriptions for training, while smaller portions will be used for development and test.

As training and development data, we provide 29,000 and 1,014 images respectively, each with 5 descriptions in English and 5 descriptions in German, i.e., 30,014 tuples containing an image and 10 descriptions, 5 in each language.
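As an illustration of the tuple structure described above, here is a minimal Python sketch of one Task 2 example. The class name `Task2Example`, the `image_id` field and the `comparable_pairs` helper are hypothetical and not part of the released data; the sketch only assumes what the text states, namely five English and five German descriptions per image, where any English-German pairing can be treated as a comparable pair.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Stated in the task description: 5 descriptions per language per image.
DESCRIPTIONS_PER_LANGUAGE = 5


@dataclass(frozen=True)
class Task2Example:
    """One image with its independently collected descriptions.

    Hypothetical container for illustration; the released files may be
    organised differently.
    """

    image_id: str        # hypothetical identifier, e.g. an image file name
    english: List[str]   # 5 English descriptions
    german: List[str]    # 5 German descriptions

    def __post_init__(self) -> None:
        # Reject tuples that do not have exactly 5 descriptions per language.
        for name, descs in (("English", self.english), ("German", self.german)):
            if len(descs) != DESCRIPTIONS_PER_LANGUAGE:
                raise ValueError(
                    f"{self.image_id}: expected {DESCRIPTIONS_PER_LANGUAGE} "
                    f"{name} descriptions, got {len(descs)}"
                )

    def comparable_pairs(self) -> List[Tuple[str, str]]:
        """Return every English-German pairing for this image (5 x 5 pairs)."""
        return [(en, de) for en in self.english for de in self.german]
```

Treating all 25 pairings as training pairs is only one possible use of the data; since the German descriptions were written independently of the English ones, the pairs are comparable rather than parallel.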

As test data, we provide a new set of approximately 1,000 tuples containing an image and 5 English descriptions.

Evaluation will be performed against five German descriptions collected as reference on the test set, with lowercased text and without punctuation, using METEOR. We may also include manual evaluation.
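To make the text preprocessing concrete, the following Python sketch lowercases a sentence and optionally strips punctuation before scoring. It is an approximation under stated assumptions, not the organisers' evaluation script: the function name `normalise_for_scoring` is hypothetical, and punctuation is taken to mean the ASCII characters in `string.punctuation`, which are deleted rather than replaced by spaces.

```python
import string

# Assumption: "punctuation" means ASCII punctuation only.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalise_for_scoring(sentence: str, keep_punctuation: bool = False) -> str:
    """Lowercase a sentence and, by default, delete ASCII punctuation.

    keep_punctuation=False approximates the Task 2 setting (lowercased,
    no punctuation); keep_punctuation=True approximates the lowercased,
    with-punctuation setting described for Task 1.
    """
    text = sentence.lower()
    if not keep_punctuation:
        text = text.translate(_PUNCT_TABLE)
    # Collapse any whitespace runs left behind by removed characters.
    return " ".join(text.split())
```

The same function would be applied to system output and to each of the five references before passing them to a METEOR implementation. Note that deleting punctuation joins hyphenated words ("T-Shirt" becomes "tshirt"); whether the official evaluation behaves this way is not stated here.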

Additional resources

We suggest the following interesting resources that can be used as additional training data for either or both tasks:

Submissions using these or any other resources external to those provided for the tasks should indicate that their submissions are of the "unconstrained" type.

Submission Format

For a given task, the output of your system should contain a target language description for each image, formatted in the following way:


Where: Each field should be delimited by a single tab character.
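As a sketch of the tab-delimiting requirement, the Python helper below joins the fields of one output line with single tab characters. The field list itself is not reproduced in this text, so the function takes an arbitrary sequence of fields; the name `format_submission_line` and the example values in the usage note are hypothetical.

```python
from typing import Sequence


def format_submission_line(fields: Sequence[str]) -> str:
    """Join fields with exactly one tab between consecutive fields.

    Tabs, newlines and repeated spaces inside a field are collapsed to a
    single space so that they cannot be mistaken for field delimiters.
    """
    cleaned = [" ".join(str(field).split()) for field in fields]
    return "\t".join(cleaned)
```

For example, `format_submission_line(["SHEF_2_Moses_C", "1234.jpg", "ein hund läuft"])` yields one line with three fields and two tab characters; the order and number of fields should follow the official format rather than this illustration.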

Submission Requirements

Each participating team can submit at most 2 systems for each of the task variants (so up to 4 submissions). These should be sent via email to Lucia Specia. Please use the following pattern to name your files:


INSTITUTION-NAME is an acronym/short name for your institution, e.g. SHEF

TASK-NAME is one of the following: 1 (translation), 2 (description), 3 (both).

METHOD-NAME is an identifier for your method in case you have multiple methods for the same task, e.g. 2_NeuralTranslation, 2_Moses

TYPE is either C or U, where C indicates "constrained", i.e. using only the resources provided by the task organisers, and U indicates "unconstrained".

For instance, a constrained submission from team SHEF for task 2 using method "Moses" could be named SHEF_2_Moses_C.
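The naming convention can be sketched as a small Python helper. The underscore-joined order (institution, task, method, type) is inferred from the SHEF_2_Moses_C example above and is an assumption; the function name `submission_name` is hypothetical.

```python
VALID_TASKS = {"1", "2", "3"}   # 1 (translation), 2 (description), 3 (both)
VALID_TYPES = {"C", "U"}        # C = constrained, U = unconstrained


def submission_name(institution: str, task, method: str, sub_type: str) -> str:
    """Build a submission file name such as SHEF_2_Moses_C.

    Assumes the four parts are joined by underscores in the order shown
    in the example from the task description.
    """
    task = str(task)
    if task not in VALID_TASKS:
        raise ValueError(f"task must be one of {sorted(VALID_TASKS)}, got {task!r}")
    if sub_type not in VALID_TYPES:
        raise ValueError(f"type must be 'C' or 'U', got {sub_type!r}")
    for label, value in (("institution", institution), ("method", method)):
        # Empty names or names containing whitespace would make the file name ambiguous.
        if not value or any(ch.isspace() for ch in value):
            raise ValueError(f"{label} must be non-empty and contain no whitespace")
    return f"{institution}_{task}_{method}_{sub_type}"
```

Calling `submission_name("SHEF", 2, "Moses", "C")` reproduces the example name given above.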

You are invited to submit a short paper (4 to 6 pages) to WMT describing your method(s). You are not required to submit a paper if you do not want to. In that case, we ask you to provide a summary and/or an appropriate reference describing your method(s) that we can cite in the WMT overview paper.

Important dates

Release of training data January 30, 2016
Release of test data April 10, 2016
Results submission deadline May 4, 2016
Paper submission deadline May 15, 2016
Notification of acceptance June 5, 2016
Camera-ready deadline June 22, 2016


Lucia Specia (University of Sheffield)
Desmond Elliott (University of Amsterdam)
Stella Frank (University of Amsterdam)
Khalil Sima'an (University of Amsterdam)


For questions or comments, email Lucia Specia


The data is licensed under Creative Commons: Attribution-NonCommercial-ShareAlike 4.0 International.