Shared Task: Machine Translation Robustness

This is a translation task of the WMT workshop focusing on the robustness of machine translation to noisy input text. The language pairs are English-French and English-Japanese, in both directions.

GOALS

Non-standard, noisy text of the kind found on social media and elsewhere on the internet is ubiquitous, yet existing machine translation systems struggle to handle the idiosyncrasies of this type of input. The goal of this shared task is to provide a testbed for improving MT models' robustness to orthographic variations, grammatical errors, and other linguistic phenomena common in noisy, user-generated content, via better modelling, adaptation techniques, or the use of monolingual training data.

Specifically, the shared task aims to bring improvements on the following challenges:

IMPORTANT DATES

Release of training/dev data: January 21, 2019
Test data released: April 12, 2019
Translation submission deadline: April 29, 2019 (23:59 UTC-12)
System description paper submission deadline: May 17, 2019
End of evaluation: July 2, 2019

TASK DESCRIPTION

We provide training and dev data from the same domain distribution (Reddit comments) for all language pairs. In addition, we also provide pointers to more data sources focusing on the following two aspects:

Utilizing out-of-domain data

You are highly encouraged to submit systems trained on large parallel corpora whose distribution differs from that of the test domain. We provide pointers to past WMT training corpora.

Utilizing monolingual data

You are highly encouraged to develop novel solutions to utilize monolingual corpora (both in-domain and out-of-domain) to improve translation quality.

You can focus on either or both aspects for your submission.

Constrained submission is highly encouraged (see definition below):

You are also welcome to use text-normalization tools to preprocess train/dev/test data. If you do so, please state which normalization tools you used, and make sure that they are open source and freely available.

We also encourage participation purely focused on the text normalization aspect. If you are interested, please contact us and we will provide a pretrained baseline MT system to generate translations.

You may participate in either or both language pairs.

DATA

TRAINING DATA

In-domain data:
Out-of-domain data:

DEVELOPMENT DATA

In-domain data:
Out-of-domain data:

NEW: TEST DATA

You can download the blind test sets.

The zip archive contains 4 files:

    en-fr.blind.tsv
    en-ja.blind.tsv
    fr-en.blind.tsv
    ja-en.blind.tsv

Each file is tab separated, with 3 columns:

  1. The first column is a unique number identifying each sentence.
  2. The second column is a number identifying the comment. Some sentences come from the same Reddit comment, and sentences are ordered as they were found in each comment. If you wish, you may use this information to leverage context from sentences that come from the same comment.
  3. The third and last column contains the source sentence.

NEW: the test sets with reference translations are now available: MTNT2019.tar.gz. The format is the same as the blind test sets with one additional column for the translation.
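
The blind file layout described above can be read with a few lines of Python. This is only a minimal sketch: the function names (`parse_blind_lines`, `group_by_comment`, `load_blind_file`) are hypothetical, and UTF-8 encoding is an assumption rather than something stated on this page. The lines are split on tabs directly, instead of going through a CSV parser, so that quotation marks in noisy text are left untouched.

```python
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, str, str]  # (sentence_id, comment_id, source_sentence)


def parse_blind_lines(lines: Iterable[str]) -> List[Row]:
    """Split each blind-test line into its three tab-separated fields."""
    rows: List[Row] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines, if any
        # Split on the first two tabs only, so the sentence text stays whole.
        sentence_id, comment_id, source = line.split("\t", 2)
        rows.append((sentence_id, comment_id, source))
    return rows


def group_by_comment(rows: Iterable[Row]) -> Dict[str, List[str]]:
    """Map each comment id to its sentences, keeping the file order."""
    groups: Dict[str, List[str]] = {}
    for _, comment_id, source in rows:
        groups.setdefault(comment_id, []).append(source)
    return groups


def load_blind_file(path: str) -> List[Row]:
    """Read one blind test file; UTF-8 is assumed here, not documented."""
    with open(path, encoding="utf-8") as handle:
        return parse_blind_lines(handle)
```

For example, `group_by_comment(load_blind_file("en-fr.blind.tsv"))` would give, for each comment, the list of its sentences in their original order, which is one way to expose comment-level context to a system. For the files with reference translations, the split would need to allow for the additional fourth column.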

DOWNLOAD

TEST SET SUBMISSION

Translation output should be submitted as real case, detokenized, and in SGML format.
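
Before any conversion, it may help to run a quick sanity check on the plain-text output. The sketch below assumes that the output is expected to contain exactly one line per source sentence, in the same order as the source file; this page does not state that explicitly, so treat it as an assumption. The function name `check_output` is hypothetical, and flagging empty lines is only a heuristic.

```python
from typing import List


def check_output(source_sentences: List[str], output_lines: List[str]) -> List[str]:
    """Return a list of problems found; an empty list means none were found."""
    problems: List[str] = []
    # Assumption: one output line per source sentence.
    if len(output_lines) != len(source_sentences):
        problems.append(
            f"line count mismatch: {len(output_lines)} outputs "
            f"for {len(source_sentences)} source sentences"
        )
    # Heuristic: an empty translation is usually a sign of a pipeline error.
    for number, line in enumerate(output_lines, start=1):
        if not line.strip():
            problems.append(f"line {number} is empty")
    return problems
```

An empty return value only means that these two checks passed; it says nothing about casing, detokenization, or translation quality.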

For English-Japanese, your raw text output must first be segmented with KyTea (version 0.4.7 recommended):

kytea -model /path/to/kytea/share/kytea/model.bin -out tok YOUR_OUTPUT > YOUR_OUTPUT_TOK

To convert plain text output into the proper format, download the SGML versions of the source files and the script wrap-xml.perl. With that at hand, you can convert your output with

wrap-xml.perl LANG SOURCE_SGM < YOUR_OUTPUT > YOUR_OUTPUT_SGM
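where LANG is the target language code, SOURCE_SGM is the SGML version of the source file, YOUR_OUTPUT is your plain-text output, and YOUR_OUTPUT_SGM is the resulting SGML file.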

Please upload this file to the website by following the steps below:

  1. Go to the website matrix.statmt.org.
  2. Create an account under the menu item Account -> Create Account.
  3. Go to Account -> upload/edit content, and follow the link "Submit a system run"

If you are submitting contrastive runs, please submit your primary system first and mark it clearly as the primary submission. For system description paper submission, please follow the instructions in PAPER SUBMISSION INFORMATION.

EVALUATION

Evaluation will be done both automatically and by human judgement. Constrained and unconstrained systems will be evaluated and compared separately.

ORGANIZERS

Questions or comments can be posted at wmt-tasks@googlegroups.com.