Workshop Shared Task: Machine Translation Evaluation

ACL 2008 THIRD WORKSHOP
ON STATISTICAL MACHINE TRANSLATION

Shared Task: Automatic Evaluation of Machine Translation

June 19, in conjunction with ACL 2008 in Columbus, Ohio

The shared evaluation task of the workshop will examine automatic evaluation metrics for machine translation. We will provide all of the translations produced in the shared translation task, as well as the reference translations. You will return rankings for each of the translations at the system level and/or at the sentence level. We will calculate the correlation of your rankings with the human evaluation once it is completed.
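The task description does not say which correlation statistic will be used. As one illustration only, the sketch below computes Spearman's rank correlation between a metric's system-level scores and human scores. The choice of Spearman's rho, the function names, and the example numbers are all assumptions made here, not part of the task definition.

```python
import math


def ranks(scores):
    """Return 1-based ranks (highest score gets rank 1); ties share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when all values are equal")
    return cov / math.sqrt(var_x * var_y)


def spearman(metric_scores, human_scores):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    return pearson(ranks(metric_scores), ranks(human_scores))


# Made-up scores for three hypothetical systems, for illustration only.
metric = [0.31, 0.27, 0.22]
human = [0.64, 0.58, 0.41]
rho = spearman(metric, human)
```

A value near 1 means the metric orders the systems the same way the human judges did; a value near -1 means the order is reversed.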

Goals

The goals of the shared evaluation task are:

To achieve the strongest correlation with human judgments of translation quality
To illustrate the suitability of an automatic evaluation metric as a surrogate for human evaluations
To address the problems associated with comparing against a single reference translation
To move automatic evaluation beyond system-level ranking to finer-grained sentence-level ranking

Submission Format

Once we receive the system outputs from the shared translation task, we will post all of the system translations, along with the source documents and reference translations, for you to evaluate with your metric. The translations will be available in two formats:

the NIST MT Evaluation Workshop's XML format, which looks like this:

<tstset setid="wmt08-de-en-nc-test" srclang="German" trglang="English"> 
<DOC docid="Speigel-doc1" sysid="UMD_de_en_primary">
<seg id="1"> TRANSLATED ENGLISH TEXT </seg> 
<seg id="2"> TRANSLATED ENGLISH TEXT </seg> 
...
</DOC> 
<DOC docid="Speigel-doc2"  sysid="UMD_de_en_primary"> 
<seg id="13"> TRANSLATED ENGLISH TEXT </seg> 
<seg id="14"> TRANSLATED ENGLISH TEXT </seg> 
...
</DOC> 
</tstset>

plain text files with one translation per line.

You can use either of these as input to your software. Your software should produce scores for the translations at the system level, at the segment level, or preferably both.
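As a rough illustration of reading the XML format, the sketch below uses Python's standard xml.etree.ElementTree module to pull out the test set ID, system ID, document ID, segment ID, and text of each segment. The function name read_segments is hypothetical, and the sketch assumes the file is well-formed XML; files with unescaped characters such as a bare & would need cleaning first.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the example above, with placeholder text.
SAMPLE = """
<tstset setid="wmt08-de-en-nc-test" srclang="German" trglang="English">
<DOC docid="Speigel-doc1" sysid="UMD_de_en_primary">
<seg id="1"> TRANSLATED ENGLISH TEXT </seg>
<seg id="2"> TRANSLATED ENGLISH TEXT </seg>
</DOC>
<DOC docid="Speigel-doc2" sysid="UMD_de_en_primary">
<seg id="13"> TRANSLATED ENGLISH TEXT </seg>
<seg id="14"> TRANSLATED ENGLISH TEXT </seg>
</DOC>
</tstset>
"""


def read_segments(xml_text):
    """Return (setid, sysid, docid, segid, text) tuples, one per seg element."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    # iter() also yields the root itself, so this works whether or not
    # the tstset element is wrapped in an outer element.
    for tstset in root.iter("tstset"):
        setid = tstset.get("setid")
        for doc in tstset.iter("DOC"):
            for seg in doc.iter("seg"):
                text = (seg.text or "").strip()
                rows.append((setid, doc.get("sysid"), doc.get("docid"),
                             seg.get("id"), text))
    return rows


rows = read_segments(SAMPLE)
```

Each tuple carries the identifiers needed later for the output files, so the same rows can feed both system-level and segment-level scoring.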

Output file format for system-level rankings

The output files for system-level rankings should be formatted in the following way:

<TEST SET>   <SYSTEM>   <SYSTEM LEVEL SCORE>

Where:

TEST SET is the ID of the test set (given by the setid attribute of the tstset tag in the XML file, or by the directory structure in the plain text files).
SYSTEM is the ID of the system being scored (given by the sysid attribute in the XML document, or as part of the filename for the plain text file).
SYSTEM LEVEL SCORE is the overall system level score.

Each field should be delimited by a single tab character.
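A minimal sketch of producing the system-level file is shown below. The function name, the four-decimal score formatting, and the score value are assumptions for illustration; the format description above only fixes the three fields and the tab delimiter.

```python
def format_system_scores(scores):
    """Build tab-delimited system-level lines.

    scores maps (test_set, system) pairs to a numeric system-level score.
    """
    lines = []
    for (test_set, system), score in sorted(scores.items()):
        fields = [test_set, system, "%.4f" % score]
        # A tab inside a field would corrupt the tab-delimited layout.
        if any("\t" in field for field in fields):
            raise ValueError("fields must not contain tab characters")
        lines.append("\t".join(fields))
    return "\n".join(lines) + "\n"


# The score here is a made-up number, not a real result.
example = format_system_scores({
    ("wmt08-de-en-nc-test", "UMD_de_en_primary"): 0.25,
})
```

The returned string can be written directly to a file opened in text mode.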

Output file format for segment-level rankings

The output files for segment-level rankings should be formatted in the following way:

<TEST SET>   <SYSTEM>   <DOCUMENT ID>   <SEGMENT ID>   <SEGMENT SCORE>

Where:

TEST SET is the ID of the test set.
SYSTEM is the ID of the system being scored.
DOCUMENT ID is the document ID (given by the docid attribute of the DOC tag in the XML document, or identical to the test set ID if you're using the plain text input files).
SEGMENT ID is the segment number of each segment (given by the id attribute of the seg tag in the XML file, or by the line number, counting from one, in the plain text input files).
SEGMENT SCORE is the score for the particular segment.

Each field should be delimited by a single tab character.
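For the plain text input, the rules above mean the document ID repeats the test set ID and the segment ID is the line number counted from one. The sketch below applies those two rules; the function names, the four-decimal formatting, and the toy scoring function (a token count, which is not a real evaluation metric) are placeholders introduced here.

```python
def segment_rows_from_plain_text(test_set, system, lines, score_segment):
    """Build tab-delimited segment-level lines from plain text translations.

    lines is an iterable with one translation per item.
    score_segment is a function taking the segment text and returning a number.
    """
    rows = []
    for line_number, line in enumerate(lines, start=1):
        score = score_segment(line.rstrip("\n"))
        # With plain text input, the document ID is identical to the test set ID.
        fields = [test_set, system, test_set, str(line_number), "%.4f" % score]
        rows.append("\t".join(fields))
    return rows


def toy_score(text):
    """Placeholder scorer: counts whitespace-separated tokens."""
    return float(len(text.split()))


segment_rows = segment_rows_from_plain_text(
    "wmt08-de-en-nc-test", "UMD_de_en_primary",
    ["a b c\n", "d e\n"], toy_score)
```

In practice, score_segment would be replaced by your own metric, which will usually also need the matching reference translation for each line.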

The output file formats are identical to the ones that will be used in the NIST workshop on evaluation metrics for machine translation, which is going to be held at AMTA this year.

Development Data

System-level and sentence-level development data is available for all of the language pairs featured in last year's workshop. The development data was compiled from the sentence-level rankings of last year's manual evaluation process. You are welcome to create customized development data from the raw data of last year's human evaluation.

evaluation-dev-data.tar.gz (11M zipped, 33M uncompressed)
README
raw data from WMT07 evaluation

Dates

March 29: System translations released (tar file here: wmt08-eval.tar.gz)
April 4: Deadline for short paper submissions (4 pages)
Extended April 9: Deadline for submitting rankings (by email to ccb@cs.jhu.edu)

supported by the EuroMatrix project, P6-IST-5-034291-STP
funded by the European Commission under Framework Programme 6
