This page describes the invitation-only shared task at WMT11, which focuses on using evaluation metrics to tune the parameters of a statistical machine translation system. The tunable evaluation metric task is invitation-only because this is the first time we have run it, and there are likely to be problems that need to be ironed out. If you would like to be invited to participate in this pilot task, and do not mind being a guinea pig, you may email Chris Callison-Burch (email@example.com) to request an invitation.
The goal of this task is to get researchers who develop automatic evaluation metrics for MT to work on the problem of using their metric to optimize the parameters of MT systems. Our previous workshops have demonstrated that a number of metrics perform better than Bleu in terms of having stronger correlation with human judgments about the output of an MT system. However, most MT system developers still optimize the parameters of their systems to Bleu. Here we aim to investigate whether better metrics will result in better quality output when a system is optimized to them.
The shared challenge for metric developers will be the following: given a fixed system and development set, return a weight vector and a corresponding set of test set translations for the system trained to your metric. We will then perform a human evaluation of the different outputs of the common system, and see if there is a perceptible difference in its quality when it has been optimized to different metrics.
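To make the role of the weight vector concrete, here is a minimal sketch of how a linear model picks one translation from an n-best list. It is an illustration only, not the task's decoder: the feature values, the two-feature layout, and the function names `model_score` and `best_translation` are all hypothetical. The point is that changing the weights (which is what tuning to a metric does) changes which candidate the system outputs.

```python
from typing import List, Sequence, Tuple

# Hypothetical candidate representation: (translation, feature values).
Candidate = Tuple[str, Sequence[float]]


def model_score(weights: Sequence[float], features: Sequence[float]) -> float:
    """Linear model score: dot product of the weight vector and the features."""
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    return sum(w * f for w, f in zip(weights, features))


def best_translation(weights: Sequence[float], nbest: List[Candidate]) -> str:
    """Return the candidate translation with the highest model score."""
    return max(nbest, key=lambda cand: model_score(weights, cand[1]))[0]


# Toy n-best list for one source sentence (made-up feature values).
nbest = [
    ("the house is small", [-2.0, -4.0]),
    ("the home is little", [-1.0, -7.0]),
]

print(best_translation([1.0, 1.0], nbest))  # the house is small
print(best_translation([1.0, 0.1], nbest))  # the home is little
```

Two weight vectors applied to the same candidates select different outputs, which is why systems tuned to different metrics can differ in translation quality even though the underlying model is fixed.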
To make the comparison as fair as possible, we fix the following elements and require that all metric developers use the same:
Your responsibilities will be to:
To do the task, we recommend that you use a 64-bit Linux machine with at least 25GB of RAM.
tar xf tunable-metrics-task.tar; cd tunable-metrics-task
cd joshua-r1779/; ant; - you can find more detailed instructions for Joshua here
nohup ./mert/decoder_command & - it should take about 10-20 minutes to load the model, and then you should see output in mert/dev.output.nbest
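Once the decoder has written its n-best output, a quick sanity check can be useful. The sketch below assumes that each line has four fields separated by `|||` (sentence id, translation, feature values, total score); that layout is an assumption, not something stated on this page, so check it against your own mert/dev.output.nbest before relying on it. The names `Hypothesis`, `parse_nbest_line`, and `group_by_sentence` are hypothetical helpers.

```python
from typing import Dict, Iterable, List, NamedTuple


class Hypothesis(NamedTuple):
    sent_id: int
    text: str
    features: List[float]
    total: float


def parse_nbest_line(line: str) -> Hypothesis:
    """Parse one line, ASSUMING 'id ||| text ||| features ||| total'."""
    fields = [field.strip() for field in line.split("|||")]
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
    return Hypothesis(
        sent_id=int(fields[0]),
        text=fields[1],
        features=[float(x) for x in fields[2].split()],
        total=float(fields[3]),
    )


def group_by_sentence(lines: Iterable[str]) -> Dict[int, List[Hypothesis]]:
    """Group hypotheses by sentence id, skipping blank lines."""
    groups: Dict[int, List[Hypothesis]] = {}
    for line in lines:
        if line.strip():
            hyp = parse_nbest_line(line)
            groups.setdefault(hyp.sent_id, []).append(hyp)
    return groups


# Made-up sample lines in the assumed format.
sample = [
    "0 ||| the house is small ||| -2.0 -4.0 ||| -6.0",
    "0 ||| the home is little ||| -1.0 -7.0 ||| -8.0",
    "1 ||| a dog ||| -0.5 -1.5 ||| -2.0",
]
groups = group_by_sentence(sample)
for sent_id, hyps in sorted(groups.items()):
    print(sent_id, len(hyps), hyps[0].text)
```

To run it on real output, replace `sample` with the lines of the file, for example `open("mert/dev.output.nbest", encoding="utf-8")`.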
Z-MERT is Omar Zaidan's MERT implementation. It is distributed with the Joshua decoder, and it is designed to be modular with respect to the objective function (i.e. the automatic evaluation metric used to score the MT output). To show how easy it is to incorporate a new metric into Z-MERT, Omar made a 20-minute video that walks you through how to incorporate a new evaluation metric.
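As a rough illustration of what "modular with respect to the objective function" can mean, the sketch below shows a toy metric split into per-sentence sufficient statistics and a corpus-level score computed from their sum. This is a Python sketch of the general idea only. Z-MERT itself is written in Java, and the class and method names here (`UnigramPrecision`, `suff_stats`, `score`, `corpus_score`) are hypothetical, not Z-MERT's actual interface; the video and the Z-MERT web page are the authoritative guides.

```python
from collections import Counter
from typing import List, Sequence


class UnigramPrecision:
    """Toy metric: clipped unigram precision over a whole corpus."""

    def suff_stats(self, candidate: str, reference: str) -> List[int]:
        """Per-sentence statistics: [matched unigrams, candidate length]."""
        cand = Counter(candidate.split())
        ref = Counter(reference.split())
        matched = sum(min(count, ref[word]) for word, count in cand.items())
        return [matched, sum(cand.values())]

    def score(self, stats: Sequence[int]) -> float:
        """Corpus-level score from summed statistics."""
        matched, total = stats
        return matched / total if total else 0.0


def corpus_score(metric, candidates: Sequence[str], references: Sequence[str]) -> float:
    """Sum per-sentence statistics, then score once at the corpus level."""
    totals = [0, 0]
    for cand, ref in zip(candidates, references):
        totals = [t + s for t, s in zip(totals, metric.suff_stats(cand, ref))]
    return metric.score(totals)


cands = ["the cat sat", "a dog"]
refs = ["the cat sat down", "the dog"]
print(corpus_score(UnigramPrecision(), cands, refs))  # 0.8
```

Separating the statistics from the final score means the optimizer can swap one candidate for another and rescore the corpus by adjusting sums rather than recomputing everything, which is one reason a tuner might ask a metric for this decomposition.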
Part 1 of the video:
Part 2 of the video:
You can find additional instructional materials linked from the Z-MERT web page.
Return the decoder's n-best output for the devtest set in the tarball (do not re-case or detokenize), along with the joshua.config file that you used to produce it. The config file will include the weight vector that your metric produced as a result of MERT.
We will perform a human evaluation of the outputs that were produced by optimizing to different metrics. We will report whether the metrics produce perceptibly different outputs, and if so, whether one is better than the others. We will provide a Bleu-optimized baseline as a point of comparison.
This is a pilot, so we have intentionally restricted the task in order to iron out the details. There are several possible changes for next year:
As with the other shared tasks, if you participate in this task, we ask you to commit about 8 hours of time to the manual evaluation. The evaluation will be done through Amazon Mechanical Turk.
You are invited to submit a short paper (4 to 6 pages) describing your automatic evaluation metric. You are not required to submit a paper if you do not want to. If you do not submit one, we ask that you give an appropriate reference describing your metric that we can cite in the overview paper.
|Materials released for download||February 21, 2011|
|Translations and config files due (email to firstname.lastname@example.org)||April 14, 2011|
|Paper submission deadline||May 19, 2011|
supported by the EuroMatrixPlus project
funded by the European Commission
under Framework Programme 7