Workshop Shared Task: Statistical Machine Translation

NAACL 2006 WORKSHOP
ON STATISTICAL MACHINE TRANSLATION

Shared Task: Exploiting Parallel Texts for Statistical Machine Translation

June 8 and 9, 2006, in conjunction with NAACL 2006 in New York City

The shared task of the workshop is to build a probabilistic phrase translation table for phrase-based statistical machine translation (SMT). Systems are evaluated by translation quality on an unseen test set. We provide a parallel corpus as training data (with word alignment), a baseline statistical machine translation system, and additional resources. Participants may augment this system or use their own system.

Goals

The goals of staging this shared task are:

get reference performance numbers in a large-scale translation task for European languages
pose special challenges with word order (German-English) and translating from English into foreign languages
offer interested parties a (relatively) smooth start with hands-on experience in state-of-the-art statistical machine translation methods
create publicly available data for machine translation and machine translation evaluation

We hope that both beginners and established research groups will participate in this task.

Task Description

We provide training data for three European language pairs, and a common framework (including a language model and a baseline system). The task is to improve methods to build a phrase translation table (e.g. by better word alignment, phrase extraction, phrase scoring), augment the system otherwise (e.g. by preprocessing), or build entirely new translation systems.
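As an illustration of the phrase scoring step, the following is a minimal sketch of relative-frequency estimation, a common way to assign probabilities to extracted phrase pairs. It is not the scoring used by the provided baseline system; the function name `score_phrase_pairs` and the toy phrase pairs are assumptions made for this example.

```python
from collections import Counter


def score_phrase_pairs(phrase_pairs):
    """Estimate p(target | source) by relative frequency.

    phrase_pairs: iterable of (source_phrase, target_phrase) tuples,
    one entry per extracted occurrence (duplicates are meaningful).
    Returns a dict mapping (source_phrase, target_phrase) to a probability.
    """
    phrase_pairs = list(phrase_pairs)  # allow two passes over the data
    pair_counts = Counter(phrase_pairs)
    source_counts = Counter(source for source, _ in phrase_pairs)
    return {
        (source, target): count / source_counts[source]
        for (source, target), count in pair_counts.items()
    }


# Toy input (hypothetical): three occurrences of "das haus", one of "klein"
extracted = [
    ("das haus", "the house"),
    ("das haus", "the house"),
    ("das haus", "the building"),
    ("klein", "small"),
]
table = score_phrase_pairs(extracted)
for (source, target), prob in sorted(table.items()):
    print(f"{source} ||| {target} ||| {prob:.3f}")
```

Relative frequency is only one possible scoring function; smoothing or lexical weighting are the kind of changes the task invites.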

Each participant's system is used to translate a test set of unseen sentences in the source language. Translation quality is measured by the BLEU score, which measures overlap with a reference translation, and by manual evaluation. Participants agree to contribute about eight hours of work to the manual evaluation.

To have a common framework that allows for comparable results, and also to lower the barrier to entry, we provide

a fixed training set
a fixed language model
a fixed baseline system

Optionally, you may use

a provided word alignment

Most current methods for training phrase translation tables build on a word alignment (i.e., the mapping of each word in the source sentence to words in the target sentence). Since word alignment is by itself a difficult task, we provide word alignments. These word alignments are acquired by automatic methods and hence contain errors. You may get better performance by coming up with your own word alignment.
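To make the role of the word alignment concrete, here is a minimal sketch of how phrase pairs consistent with an alignment might be extracted. It assumes the alignment is written as space-separated `i-j` index pairs (source position, target position, zero-based); this format, the function names, and the toy sentence pair are assumptions for illustration and may not match the provided alignment files. The sketch also omits the usual extension of phrases over unaligned target words.

```python
def parse_alignment(line):
    """Parse 'i-j' tokens into a set of (source_index, target_index) pairs."""
    return {tuple(int(x) for x in token.split("-")) for token in line.split()}


def extract_phrase_pairs(source, target, alignment, max_len=3):
    """Collect phrase pairs whose words are aligned only to each other."""
    pairs = set()
    for s_start in range(len(source)):
        for s_end in range(s_start, min(s_start + max_len, len(source))):
            # Target positions linked to any word in the source span
            linked = [j for (i, j) in alignment if s_start <= i <= s_end]
            if not linked:
                continue
            t_start, t_end = min(linked), max(linked)
            if t_end - t_start + 1 > max_len:
                continue
            # Consistency check: no target word inside the span may be
            # aligned to a source word outside the source span
            if any(
                t_start <= j <= t_end and not (s_start <= i <= s_end)
                for (i, j) in alignment
            ):
                continue
            pairs.add(
                (
                    " ".join(source[s_start : s_end + 1]),
                    " ".join(target[t_start : t_end + 1]),
                )
            )
    return pairs


source = "das haus ist klein".split()
target = "the house is small".split()
alignment = parse_alignment("0-0 1-1 2-2 3-3")
phrase_pairs = extract_phrase_pairs(source, target, alignment, max_len=2)
for pair in sorted(phrase_pairs):
    print(pair)
```

An alignment error changes which spans pass the consistency check, which is one reason a better word alignment can yield a better phrase table.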

We also strongly encourage you to participate if you use

your own training corpus
your own sentence alignment
your own language model
your own decoder

Your submission report should highlight in which ways your own methods and data differ from the standard task. We may break down submitted results in different tracks, based on what resources were used.

Provided Data

The provided data is taken from the Europarl corpus, which is freely available. Please click on the links below to download the data. If you prepare training data from the Europarl corpus directly, please do not take data from Q4/2000 (October-December), since it is reserved for development and test data.

French-English and English-French: training (fr, en), word alignment (fr-en)
Spanish-English and English-Spanish: training (es, en), word alignment (es-en)
German-English and English-German: training (de, en), word alignment (de-en)
English Language Model (lowercased training data)
French Language Model (lowercased training data)
Spanish Language Model (lowercased training data)
German Language Model (lowercased training data)

Note that the training data is not lowercased. This may be useful for tagging and parsing tools. However, the phrase translation tables and language model use lowercased text. Since the provided development test set and final test set are mixed-cased, they have to be lowercased before translating.
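The lowercasing step can be sketched as follows. This is a minimal example using in-memory streams; the function name `lowercase_stream` and the sample sentences are assumptions for illustration.

```python
import io


def lowercase_stream(source, target):
    """Copy text from source to target, lowercasing each line."""
    for line in source:
        target.write(line.lower())


# In-memory stand-ins for a mixed-case test file and its lowercased copy
src = io.StringIO("Das Haus ist klein .\nÉcoutez , Monsieur le Président .\n")
out = io.StringIO()
lowercase_stream(src, out)
print(out.getvalue(), end="")
```

With real files, the same function could be called on handles from `open(...)`. The file names and the character encoding to pass to `open` are not specified here and would need to match the downloaded data.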

Development Data

To tune your system during development, we provide a development set of 2000 sentences.

This data is identical to the 2005 development test data.

Development Test Data

To test your system during development, we provide a development test set of 2000 sentences.

This data is identical to the 2005 test data.

Test Data

To test your system, translate the following 3064 sentences and send the output by email to pkoehn@inf.ed.ac.uk.

English (to be translated to French, Spanish and German)
French (to be translated to English)
Spanish (to be translated to English)
German (to be translated to English)

Evaluation

Evaluation will be done both automatically and by human judgment.

Automatic Scoring: We will use the BLEU score; a reference implementation is multi-bleu.perl.
Manual Scoring: We will collect judgments about adequacy and fluency from human annotators. If you participate in the evaluation, we ask you to commit about 8 hours of time to the manual evaluation. The evaluation will be done with an online tool, which you can try out here (using the WPT'05 submissions).
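For readers unfamiliar with BLEU, the following is a simplified sketch of a corpus-level score with a single reference per sentence: clipped n-gram precisions up to 4-grams, combined by a geometric mean and multiplied by a brevity penalty. It is not multi-bleu.perl and may give different numbers; whitespace tokenization and the function names are assumptions of this example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU with one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            totals[n - 1] += sum(h_counts.values())
            # Clip each n-gram count by its count in the reference
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Penalize output that is shorter than the reference
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


reference = ["the house is small ."]
print(corpus_bleu(["the house is small ."], reference))
print(corpus_bleu(["the house is small"], reference))
```

Counts are accumulated over the whole test set before precisions are computed, so the score is not an average of per-sentence scores.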

Dates

March 20: Test data released (available on this web site)
March 31: Results submissions (by email to pkoehn@inf.ed.ac.uk)
April 7: Short paper submissions (4 pages)

Organizers

Philipp Koehn (University of Edinburgh)
Christof Monz (University of London)
