Invitation-only Tunable Metrics Task - EMNLP 2011 Sixth Workshop on Statistical Machine Translation

EMNLP 2011 SIXTH WORKSHOP
ON STATISTICAL MACHINE TRANSLATION

Invitation-only Task: Tunable Metrics

July 30 - 31, 2011
Edinburgh, UK

This page describes the invitation-only shared task at WMT11, which focuses on using evaluation metrics to tune the parameters of a statistical machine translation system. The tunable evaluation metric task is invitation-only because this is the first time we have run it, and there are likely to be problems that need to be ironed out. If you would like to be invited to participate in this pilot task, and do not mind being a guinea pig, you may email Chris Callison-Burch (ccb@cs.jhu.edu) to request an invitation.

Goals

The goal of this task is to get researchers who develop automatic evaluation metrics for MT to work on the problem of using their metric to optimize the parameters of MT systems. Our previous workshops have demonstrated that a number of metrics perform better than Bleu in terms of having stronger correlation with human judgments about the output of an MT system. However, most MT system developers still optimize the parameters of their systems to Bleu. Here we aim to investigate whether better metrics will result in better quality output when a system is optimized to them.

The shared challenge for metric developers will be the following: given a fixed system and development set, return a weight vector and a corresponding set of test set translations for the system trained to your metric. We will then perform a human evaluation of the different outputs of the common system, and see if there is a perceptible difference in its quality when it has been optimized to different metrics.
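To make the role of the weight vector concrete: a decoder of this kind typically scores each candidate translation as a weighted sum of its feature values, so changing the weights changes which candidate comes out on top. The sketch below is illustrative only. The class name, the two-feature setup, and the feature values are assumptions made for the example and are not taken from Joshua or Z-MERT.

```java
// Illustrative sketch: how a weight vector selects among candidate translations.
// Everything here (class name, feature values) is hypothetical.
class LinearModelRerank {

    // Returns the index of the candidate whose weighted feature sum is highest.
    static int argmaxIndex(double[] weights, double[][] features) {
        int best = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < features.length; i++) {
            double score = 0.0;
            for (int k = 0; k < weights.length; k++) {
                score += weights[k] * features[i][k];
            }
            if (score > bestScore) {
                bestScore = score;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Two candidates, two made-up log-scale features each.
        double[][] features = { { -2.0, -1.0 }, { -1.0, -3.0 } };
        System.out.println(argmaxIndex(new double[] { 1.0, 0.2 }, features));
        System.out.println(argmaxIndex(new double[] { 0.2, 1.0 }, features));
    }
}
```

Tuning to a metric amounts to searching for the weights under which the selected candidates score best on the development set according to that metric.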

What we provide

To make the comparison as fair as possible, we fix the following elements and require that all metrics developers use the same:

Decoder - the Joshua decoder will be used in this pilot
Translation model - an Urdu to English translation model with syntax-based SCFG rules
Language model - a large 5-gram language model trained on the English Gigaword
Development set - a dev set with 4 reference English translations to be used to optimize system parameters
A test set - a test set consisting of 883 Urdu sentences (no references provided)
Decoder configuration file - a joshua.config file that ensures the search parameters are the same
Optimization routine - we provide the Z-MERT implementation of Minimum Error Rate Training.

Your responsibilities will be to:

Incorporate your metric into Z-MERT by subclassing the EvaluationMetric.java abstract class
Run Z-MERT on the dev set with the provided decoder/models
Provide the weight vector that you get back, and use those settings to decode a test set

Getting started

To do the task, we recommend that you use a 64-bit Linux machine with at least 25GB of RAM.

Sign up for the mailing list at http://groups.google.com/group/WMT11-tunable-metrics-task
Download the tunable-metrics-task.tar file [1.8G]
Untar the tarball: tar xf tunable-metrics-task.tar; cd tunable-metrics-task
Compile the Joshua decoder (this is already done if you're on a 64-bit Unix machine): cd joshua-r1779/; ant - you can find more detailed instructions for Joshua here
Test the decoder: nohup ./mert/decoder_command & - it should take about 10-20 minutes to load the model and then you should see output in mert/dev.output.nbest
Start implementing your own metric in joshua-r1779/src/joshua/zmert/EvaluationMetric.java

How to incorporate your metric into Z-MERT

Z-MERT is Omar Zaidan's MERT implementation. It is distributed with the Joshua decoder, and it is designed to be modular with respect to the objective function (i.e. the automatic evaluation metric used to score the MT output). To show how easy it is to incorporate a new metric into Z-MERT, Omar made a 20-minute video that walks you through the process.

Part 1 of the video:

Part 2 of the video:

You can find additional instructional materials linked from the Z-MERT web page.
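
As a rough illustration of what a tunable metric tends to look like, the sketch below expresses a toy metric (clipped unigram precision) as per-sentence sufficient statistics that are summed over the corpus before scoring. This is a minimal sketch under stated assumptions: the abstract class SketchMetric is a hypothetical stand-in written for this example, not Z-MERT's EvaluationMetric, and the real class has its own method names and signatures, which you should take from the source file and the materials above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a MERT-style metric interface (NOT the Z-MERT class).
abstract class SketchMetric {
    // Per-sentence counts that can be added up across the corpus.
    abstract int[] suffStats(String candidate, String[] references);

    // Turns (possibly aggregated) counts into a single score.
    abstract double score(int[] stats);

    // Sums the per-sentence statistics, then scores the totals once.
    double corpusScore(String[] candidates, String[][] references) {
        int[] total = null;
        for (int i = 0; i < candidates.length; i++) {
            int[] s = suffStats(candidates[i], references[i]);
            if (total == null) total = new int[s.length];
            for (int k = 0; k < s.length; k++) total[k] += s[k];
        }
        return total == null ? 0.0 : score(total);
    }
}

// Toy metric: unigram precision, with counts clipped to the best reference count.
class ClippedUnigramPrecision extends SketchMetric {
    static Map<String, Integer> counts(String sentence) {
        Map<String, Integer> c = new HashMap<>();
        for (String tok : sentence.trim().split("\\s+")) {
            if (!tok.isEmpty()) c.merge(tok, 1, Integer::sum);
        }
        return c;
    }

    @Override
    int[] suffStats(String candidate, String[] references) {
        // Maximum count of each word over all references for this sentence.
        Map<String, Integer> maxRef = new HashMap<>();
        for (String ref : references) {
            for (Map.Entry<String, Integer> e : counts(ref).entrySet()) {
                maxRef.merge(e.getKey(), e.getValue(), Integer::max);
            }
        }
        int matched = 0;
        int total = 0;
        for (Map.Entry<String, Integer> e : counts(candidate).entrySet()) {
            total += e.getValue();
            matched += Math.min(e.getValue(), maxRef.getOrDefault(e.getKey(), 0));
        }
        return new int[] { matched, total };
    }

    @Override
    double score(int[] stats) {
        return stats[1] == 0 ? 0.0 : (double) stats[0] / stats[1];
    }

    public static void main(String[] args) {
        SketchMetric m = new ClippedUnigramPrecision();
        String[] cands = { "the the cat", "a dog" };
        String[][] refs = { { "the cat sat", "a cat" }, { "a dog" } };
        System.out.println(m.corpusScore(cands, refs));
    }
}
```

Keeping the statistics additive means a candidate's counts only need to be computed once, however many times the optimizer re-scores different selections from the n-best lists.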

Submission Format

Return the decoder's n-best output for the devtest set in the tarball (do not re-case or detokenize), along with the joshua.config file that you used to produce it. The config file will include the weight vector that your metric produced as a result of MERT.
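
Before submitting, it may help to sanity-check the n-best file by pulling out the top hypothesis for each sentence. The sketch below assumes each line has fields separated by " ||| ", with the sentence id first and the translation second, and that the entries for a sentence are contiguous and listed best-first. That layout is an assumption for this example; check it against the mert/dev.output.nbest file your decoder run produces. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: extracts the first-listed hypothesis per sentence id.
// Assumes " ||| "-separated fields: id, translation, then anything else.
class NBestTopHypotheses {

    static List<String> topHypotheses(List<String> nbestLines) {
        List<String> best = new ArrayList<>();
        String lastId = null;
        for (String line : nbestLines) {
            String[] fields = line.split(" \\|\\|\\| ");
            if (fields.length < 2) continue; // skip blank or malformed lines
            String id = fields[0].trim();
            if (!id.equals(lastId)) {
                // First entry seen for this sentence id.
                best.add(fields[1].trim());
                lastId = id;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up example lines, not real decoder output.
        List<String> lines = Arrays.asList(
            "0 ||| the cat sat ||| -1.0 -2.0 ||| -3.0",
            "0 ||| a cat sat ||| -1.5 -2.0 ||| -3.5",
            "1 ||| hello world ||| -0.5 -0.5 ||| -1.0");
        for (String hyp : topHypotheses(lines)) {
            System.out.println(hyp);
        }
    }
}
```

The number of hypotheses returned should match the number of source sentences that were decoded.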

We will perform a human evaluation of the outputs that were produced by optimizing to different metrics. We will report whether the metrics produce perceptibly different outputs, and if so, whether one is better than the others. We will provide a Bleu-optimized baseline as a point of comparison.

Acknowledged Limitations and Anticipated Changes Next Year

This is a pilot, so we intentionally limited its scope in order to iron out the details on a restricted task. There are several possible changes for next year:

More language pairs / translations into languages other than English. This year we focus on Urdu-English because the language pair requires a lot of reordering, and our syntactic model has more parameters to optimize than the standard Hiero/phrase-based models.
Provide some human judgments about the model's output, so that people can experiment with regression models. If you'd like to do this yourself this year, you can collect judgments on Mechanical Turk using Omar's MAISE tool.
Include a single reference track along with the multiple reference track. Some metrics may be better at dealing with the (more common) case of there being only a single reference translation available for every source sentence.
Allow for experimentation with the MIRA optimization routine instead of MERT. MIRA can scale to a greater number of features, but requires that metrics be decomposable.

Other Requirements

As with the other shared tasks, if you participate in this task, we ask you to commit about 8 hours of time to do the manual evaluation. The evaluation will be done through Amazon Mechanical Turk.

You are invited to submit a short paper (4 to 6 pages) describing your automatic evaluation metric. You are not required to submit a paper if you do not want to. If you don't, we ask that you give an appropriate reference describing your metric that we can cite in the overview paper.

IMPORTANT DATES

Materials released for download	February 21, 2011
Translations and config files due (email to ccb@cs.jhu.edu)	April 14, 2011
Paper submission deadline	May 19, 2011

supported by the EuroMatrixPlus project
P7-IST-231720-STP
funded by the European Commission
under Framework Programme 7
