Multimodal Translation Task - ACL 2016 First Conference on Machine Translation

Shared Task: Multimodal Machine Translation

This is a new shared task aimed at the generation of image descriptions in a target language, given an image and one or more descriptions in a different (source) language. The task can be addressed from two different perspectives:

as a translation task, which will take a source language description and translate it into the target language, where this process can be supported by information from the image (multimodal translation), and
as a description generation task, which will take an image and generate a description for it in the target language, where this process can be supported by the source language description (crosslingual image description generation).

We welcome participants focusing on either or both of these task variants. They will differ mainly in the training data (see below) and in the way the target language descriptions are evaluated: against one or more translations of the corresponding source description (translation variant) or against one or more descriptions of the same image in the target language, created independently from the corresponding source description (image description variant). This task has the following main goals:

To push existing work on the integration of computer vision and language processing.
To push existing work on multimodal language processing towards multilingual multimodal language processing.
To investigate the effectiveness of information from images in machine translation.
To investigate the effectiveness of crosslingual textual information in image description generation.

We will provide new training and test datasets for both variants of the task and also allow participants to use external data and resources (constrained vs unconstrained submissions). The data to be used for both tasks is an extended version of the Flickr30K dataset. The original dataset contains 31,783 images from Flickr on various topics and five crowdsourced English descriptions per image, totalling 158,915 English descriptions. This dataset was extended in different ways for each of the subtasks, as discussed below.

The code for the main baseline system for both tasks is available here, following the approach described in (Elliott et al. 2015), in particular, the MLM➝LM model (due to several requests). A secondary baseline for both tasks will be a Moses phrase-based statistical machine translation system trained using only the textual training data provided, following the pipeline described here.

Datasets

Task 1: Training, Validation, and Test sentences, and the splits.

Task 2: Training and Validation, and Test sentences, and the splits.

Image features will be provided to participants, but their use is not mandatory. In particular, we will release features extracted from the FC_7 (relu7) and CONV_5,4 (conv5_4) layers of the VGG-19 CNN described in (Simonyan and Zisserman, 2015), using Caffe RC2.

We used the matlab_features_reference code in NeuralTalk.
The FC_7 features were extracted from the layer labelled 'relu7', as defined in the deploy_features.prototxt in NeuralTalk.
The CONV_5,4 training, development, and test features were extracted from the layer labelled 'conv5_4', following correspondence with Kelvin Xu. (See the README for more details.)
For those who want to extract other image features, the original images can be downloaded from the Flickr30K dataset.

If you use the dataset created for this shared task, please cite the following paper: Multi30K: Multilingual English-German Image Descriptions.

@inproceedings{elliott-EtAl:2016:VL16,
 author    = {{Elliott}, D. and {Frank}, S. and {Sima'an}, K. and {Specia}, L.},
 title     = {Multi30K: Multilingual English-German Image Descriptions},
 booktitle = {Proceedings of the 5th Workshop on Vision and Language},
 pages     = {70--74},
 year      = {2016}
}

Results

The results are also available for both tasks in the following paper: A Shared Task on Multimodal Machine Translation and Crosslingual Image Description.

Stella Frank gave a presentation about the shared task submissions and results at the conference.

You can also download the submissions to the shared task.

Task 1: Multimodal Machine Translation

This task consists in translating English sentences that describe an image into German, given the English sentence itself and the image that it describes (or features from this image, if participants choose to use them). For this task, the Flickr30K Entities dataset was extended in the following way: for each image, one of the English descriptions was selected and manually translated into German by a professional translator. We will provide most of the resulting parallel data and corresponding images for training, while smaller portions will be used for development and test.

As training and development data, we provide 29,000 and 1,014 triples respectively, each containing an English source sentence, its German human translation and corresponding image.

As test data, we provide a new set of 1,000 tuples containing an English description and its corresponding image.

Evaluation will be performed against the German human translation on the test set using standard MT evaluation metrics, with METEOR as the primary metric. Scores will be computed on lowercased text with punctuation, in both detokenised (primary) and tokenised versions. We will normalise punctuation in both reference translations and system submissions using this script. (Here are some additional notes on how we did the evaluation.) We may also include manual evaluation.

Task 2: Crosslingual Image Description Generation

This task consists in generating a German sentence that describes an image, given the image itself and one or more descriptions in English. For this task, the Flickr30K Entities dataset was extended in the following way: for each image, five German descriptions were crowdsourced independently from their English versions, and independently from each other. Any English-German pair of descriptions for a given image could be considered a comparable translation pair. We will provide most of the images and associated descriptions for training, while smaller portions will be used for development and test.

As training and development data, we provide 29,000 and 1,014 images respectively, each with 5 descriptions in English and 5 descriptions in German, i.e., 30,014 tuples containing an image and 10 descriptions, 5 in each language.

As test data, we provide a new set of approximately 1,000 tuples containing an image and 5 English descriptions.

Evaluation will be performed against five German descriptions collected as reference on the test set, with lowercased text and without punctuation, using METEOR. We may also include manual evaluation.
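As an illustration only, the two text conditions mentioned for scoring (lowercased with punctuation for Task 1, lowercased without punctuation for Task 2) can be sketched as below. This is not the organisers' evaluation script and may differ from it in detail; the function name normalise_for_scoring and the example sentence are made up for this sketch, and treating every Unicode punctuation character as removable is an assumption.

```python
import unicodedata


def normalise_for_scoring(text, strip_punctuation):
    """Lowercase text and optionally drop punctuation (illustrative only)."""
    text = text.lower()
    if strip_punctuation:
        # Assumption: "punctuation" means any character in a Unicode P* category,
        # which also covers German quotation marks.
        text = "".join(
            ch for ch in text if not unicodedata.category(ch).startswith("P")
        )
    # Collapse whitespace left behind by removed characters.
    return " ".join(text.split())


sentence = "Ein Mann, der läuft."  # made-up example sentence
task1_style = normalise_for_scoring(sentence, strip_punctuation=False)
task2_style = normalise_for_scoring(sentence, strip_punctuation=True)
```

Note that this simple rule also deletes hyphens inside words, so a real scoring pipeline might prefer a tokeniser-aware approach.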

Additional resources

We suggest the following interesting resources that can be used as additional training data for either or both tasks:

WMT16 News translation task data for both bilingual (English-German) and monolingual (English or German) data.
Flickr30K Entities dataset: an extension of the Flickr30K dataset which contains additional layers of annotation such as 244K coreference chains in the English descriptions and 276K manually annotated bounding boxes for entities in the images.
Additional image description datasets for source (English) side models, such as the Microsoft COCO Dataset, among others. See this survey for a complete list.

Submissions using these or any other resources external to those provided for the tasks should indicate that their submissions are of the "unconstrained" type.

Submission Format

The output of your system for a given task should contain one target language description for each image, formatted in the following way:

<METHOD NAME> <IMAGE ID> <DESCRIPTION> <TASK> <TYPE>

Where:

METHOD NAME is the name of your method.
IMAGE ID is the identifier of the test image.
DESCRIPTION is the output generated by your system (either a translation or an independently generated description).
TASK is one of the following flags: 1 (for translation task), 2 (for image description task), 3 (for both). The choice here will indicate how your descriptions will be evaluated. Option 3 means they will be evaluated both as a translation task and as an image description task.
TYPE is either C or U, where C indicates "constrained", i.e. using only the resources provided by the task organisers, and U indicates "unconstrained".

Each field should be delimited by a single tab character.
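A minimal sketch of how one such line could be assembled is shown below. The helper name format_submission_line, the image identifier, and the German sentence are hypothetical placeholders, and the validation rules simply mirror the field descriptions above rather than any official checker.

```python
def format_submission_line(method_name, image_id, description, task, sub_type):
    """Build one tab-delimited submission line from the five fields."""
    if task not in {"1", "2", "3"}:
        raise ValueError("TASK must be 1, 2 or 3")
    if sub_type not in {"C", "U"}:
        raise ValueError("TYPE must be C or U")
    fields = [method_name, image_id, description, task, sub_type]
    for field in fields:
        # A tab or newline inside a field would break the five-column layout.
        if "\t" in field or "\n" in field:
            raise ValueError("fields must not contain tabs or newlines")
    return "\t".join(fields)


# Hypothetical values, for illustration only.
line = format_submission_line(
    "Moses", "example_image_0001", "ein mann läuft am strand .", "1", "C"
)
```

Writing one such line per test image to a plain-text file gives a submission in the expected layout.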

Submission Requirements

Each participating team can submit at most 2 systems for each of the task variants (so up to 4 submissions). These should be sent via email to Lucia Specia lspecia@gmail.com. Please use the following pattern to name your files:

INSTITUTION-NAME_TASK-NAME_METHOD-NAME_TYPE, where:

INSTITUTION-NAME is an acronym/short name for your institution, e.g. SHEF

TASK-NAME is one of the following: 1 (translation), 2 (description), 3 (both).

METHOD-NAME is an identifier for your method in case you have multiple methods for the same task, e.g. 2_NeuralTranslation, 2_Moses

TYPE is either C or U, where C indicates "constrained", i.e. using only the resources provided by the task organisers, and U indicates "unconstrained".

For instance, a constrained submission from team SHEF for task 2 using method "Moses" could be named SHEF_2_Moses_C.
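The naming pattern can be sketched in a few lines; the function name submission_filename is hypothetical, and the checks only restate the constraints listed above.

```python
def submission_filename(institution, task, method, sub_type):
    """Join the four parts as INSTITUTION-NAME_TASK-NAME_METHOD-NAME_TYPE."""
    if task not in {"1", "2", "3"}:
        raise ValueError("TASK-NAME must be 1, 2 or 3")
    if sub_type not in {"C", "U"}:
        raise ValueError("TYPE must be C or U")
    return "_".join([institution, task, method, sub_type])


# Reproduces the example given in the text above.
name = submission_filename("SHEF", "2", "Moses", "C")
```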

You are invited to submit a short paper (4 to 6 pages) to WMT describing your method(s). You are not required to submit a paper if you do not want to. In that case, we ask you to provide a summary and/or an appropriate reference describing your method(s) that we can cite in the WMT overview paper.

Important dates

Release of training data	January 30, 2016
Release of test data	April 10, 2016
Results submission deadline	May 4, 2016
Paper submission deadline	May 15, 2016
Notification of acceptance	June 5, 2016
Camera-ready deadline	June 22, 2016

Organisers

Lucia Specia (University of Sheffield)
Desmond Elliott (University of Amsterdam)
Stella Frank (University of Amsterdam)
Khalil Sima'an (University of Amsterdam)

Contact

For questions or comments, email Lucia Specia lspecia@gmail.com.

License

The data is licensed under Creative Commons: Attribution-NonCommercial-ShareAlike 4.0 International.