Shared Task: Unsupervised MT and Very Low Resource Supervised MT

There is no machine translation available for most of the ~7000 languages spoken on the planet Earth. This is because very limited or no parallel corpora are available. Research on unsupervised and very low resource machine translation is important for alleviating this problem. Unsupervised machine translation requires only monolingual data, while very low resource supervised machine translation uses very limited parallel data.

At WMT 2018 and WMT 2019, the first and second shared tasks on Unsupervised Machine Translation (UMT) were held as part of the news translation track. In 2018, the language pairs were Turkish-English, Estonian-English and German-English. In 2019, we also tested "simulated" unsupervised systems for German to Czech unsupervised translation (where no German/Czech parallel data was allowed).

We now propose a third edition of the UMT task, which targets a more realistic scenario: German to Upper Sorbian (and Upper Sorbian to German) translation. Upper Sorbian is a minority language of Germany that belongs to the Slavic language family (it is related to Lower Sorbian, Czech and Polish, for example). As far as we know, the data we provide here covers most of the digital data that is available.

As we were very recently able to obtain a very small amount of parallel data for this language pair, we also offer a very low resource supervised translation task.

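The tasks are Unsupervised MT and Very Low Resource Supervised MT, each for German to Upper Sorbian and Upper Sorbian to German translation.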

BACKGROUND

We will work on two tasks for the very low resource language Upper Sorbian (a Slavic minority language spoken in the Eastern part of Germany).

Working with the Sorbian Institute, we initially prepared an unsupervised MT task. We expect this task to become a standard benchmark for unsupervised MT development. This task particularly relies on having high quality Upper Sorbian monolingual data, which we obtained from the Sorbian Institute and the Witaj Sprachzentrum. We also offer data obtained through web crawling. In addition, the Sorbian Institute has provided medium quality data which has not been quality checked.

The Witaj Sprachzentrum (Witaj Language Center) recently provided a small training corpus of German/Upper Sorbian parallel data, which will be used in the Very Low Resource Supervised Machine Translation task. We expect this task to become a standard task for very low resource scenarios, and note that the results are important as they will directly inform efforts to create state-of-the-art machine translation systems for use by the Upper Sorbian community.

The Witaj Sprachzentrum provides development and "development test" (not blind test) sets for German to Upper Sorbian and Upper Sorbian to German translation. The development set is used to tune parameters, while the devtest set is used to measure progress using automatic metrics.

The Witaj Sprachzentrum provides the blind test sets which will be released for the evaluation on June 16.

Also, in the near future, CIS (LMU Munich), the Sorbian Institute and the Witaj Sprachzentrum hope to organize a task for Unsupervised Lower Sorbian translation!

BASIC TASKS

EVALUATION

At present, we plan to use automatic metrics for the evaluation of this task. We believe that manual evaluation is less necessary for unsupervised and very low resource MT development, because automatic metrics have worked well at this (relatively low) translation quality level in the past. We may reconsider this decision.

IMPORTANT DATES

Release of training/dev/test data: March 10, 2020
Blind test data released: July 16, 2020 (originally June 16, 2020)
Translation submission deadline: July 22, 2020, 17:00 UTC (originally June 22, 2020)
System description paper submission deadline: August 15, 2020

DATA

TRAINING DATA

We release training data for the two scenarios, Unsupervised and Very Low Resource Supervised.

Unsupervised

We allow the use of all German data released for WMT, except that the German side of the small parallel German/Upper Sorbian training corpus may not be used. All Upper Sorbian data we release may be used. No other language data may be used (no parallel, no other monolingual data sets for any language except those listed here as usable for Unsupervised).

Very Low Resource Supervised

We allow the use of all German and Upper Sorbian data released for WMT, including the 60,000-sentence parallel German/Upper Sorbian training corpus. Other WMT 2020 data for other languages may be used. Upper Sorbian is a Slavic language which is related to Czech, so the German/Czech parallel data below may be of particular interest for building multilingual systems. Thank you to the Opus project for the German/Czech parallel data.

DEVELOPMENT DATA

We provide development and test sets for German to Upper Sorbian and Upper Sorbian to German. These are usable for both Unsupervised and Very Low Resource. The dev set should be used for parameter tuning, and the test set for system evaluation during development. Please do not use either set as a parallel training corpus. Thanks to Jindřich Libovický for creating the splits we are using here and for work on the training data.

DOWNLOAD

Monolingual Upper Sorbian Data

sorbian_institute_monolingual.hsb.gz Upper Sorbian monolingual data provided by the Sorbian Institute (contains a high quality corpus and some medium quality data which are mixed together).

witaj_monolingual.hsb.gz Upper Sorbian monolingual data provided by the Witaj Sprachzentrum (high quality).

web_monolingual.hsb.gz Upper Sorbian monolingual data scraped from the web by CIS, LMU (thanks to Alina Fastowski). Use with caution: this data is probably noisy and might erroneously contain some data from related languages.
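Since the three monolingual files come from different sources and may overlap, it can help to read them as one stream and drop blank lines and exact duplicates. The following is a minimal sketch, not part of the official task tooling. It assumes that each file is UTF-8 text with one sentence per line (the page does not state the file layout), and the function name `iter_sentences` is hypothetical.

```python
import gzip
from typing import Iterable, Iterator


def iter_sentences(paths: Iterable[str]) -> Iterator[str]:
    """Yield unique, non-empty lines from gzipped text files, in order.

    Assumption (not confirmed by the task page): UTF-8 encoding and
    one sentence per line.
    """
    seen = set()
    for path in paths:
        # "rt" opens the gzip stream in text mode so lines come back as str.
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            for line in handle:
                sentence = line.strip()
                if sentence and sentence not in seen:
                    seen.add(sentence)
                    yield sentence
```

Listing the higher quality files first (for example witaj_monolingual.hsb.gz before web_monolingual.hsb.gz) means that, when a sentence occurs in several files, it is yielded at the position of the more trusted source. Exact-match deduplication keeps every sentence in memory, which is a design choice that only suits corpora of modest size.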

Monolingual German Data

See the news translation task web page for monolingual German data. All monolingual German sets are allowed in both scenarios.

Upper Sorbian side of parallel training corpus.

train.hsb-de.hsb.gz (note that this file is usable for both the Unsupervised and Very Low Resource Supervised scenarios; in the Unsupervised scenario, it is used as a small high quality monolingual corpus)

German side of parallel training corpus.

train.hsb-de.de.gz (note that this file is NOT usable for Unsupervised!)
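For the Very Low Resource Supervised scenario, the two sides of the training corpus need to be paired up. The sketch below shows one way to do this, under the assumption (not stated on this page) that the two files are UTF-8 text and line-aligned, so that line n of one file translates line n of the other. The function name `read_parallel` is hypothetical.

```python
import gzip
from itertools import zip_longest
from typing import List, Tuple


def read_parallel(hsb_path: str, de_path: str) -> List[Tuple[str, str]]:
    """Return (Upper Sorbian, German) sentence pairs from two gzipped files.

    Assumption: the files are line-aligned. A ValueError is raised if one
    file ends before the other, since that would break the alignment.
    """
    pairs = []
    with gzip.open(hsb_path, "rt", encoding="utf-8") as hsb, \
            gzip.open(de_path, "rt", encoding="utf-8") as de:
        # zip_longest pads the shorter file with None, which exposes a
        # length mismatch instead of silently truncating like zip().
        for number, (hsb_line, de_line) in enumerate(zip_longest(hsb, de), start=1):
            if hsb_line is None or de_line is None:
                raise ValueError(f"files differ in length at line {number}")
            pairs.append((hsb_line.rstrip("\n"), de_line.rstrip("\n")))
    return pairs
```

A typical call would be `read_parallel("train.hsb-de.hsb.gz", "train.hsb-de.de.gz")`. Failing loudly on a length mismatch is deliberate: a truncated download would otherwise go unnoticed.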

Dev and Test Sets

devtest.tar.gz (please use dev to tune system parameters, and test to measure progress. These files are allowed in both tracks)

German/Czech Parallel Data

This data is not allowed for Unsupervised. For Very Low Resource MT we allow all German/Czech parallel corpora obtainable from the Opus project. The de-cs corpora we particularly recommend using are: Europarl v8 and JW300 v1. These two corpora may be somewhat similar to the de-hsb parallel training and test data (but maybe not, take this with a grain of salt).
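The usage rules stated for each resource in this section can be summarised as a small lookup table, which may be convenient for guarding a data pipeline against accidentally using disallowed data. This is only an illustrative sketch of the rules as written on this page. The scenario labels, the function names, and the two keys `wmt_monolingual_german` and `opus_de-cs_parallel` are hypothetical placeholders, not real file names.

```python
UNSUPERVISED = "unsupervised"
SUPERVISED = "very_low_resource_supervised"
BOTH = frozenset({UNSUPERVISED, SUPERVISED})

# Resource -> scenarios in which the task description allows it.
ALLOWED = {
    "sorbian_institute_monolingual.hsb.gz": BOTH,
    "witaj_monolingual.hsb.gz": BOTH,
    "web_monolingual.hsb.gz": BOTH,
    "train.hsb-de.hsb.gz": BOTH,  # monolingual use only in Unsupervised
    "train.hsb-de.de.gz": frozenset({SUPERVISED}),
    "devtest.tar.gz": BOTH,  # tuning and progress measurement only
    # Placeholder keys (not real file names):
    "wmt_monolingual_german": BOTH,
    "opus_de-cs_parallel": frozenset({SUPERVISED}),
}


def is_allowed(resource: str, scenario: str) -> bool:
    """Return True if the resource may be used in the given scenario."""
    if resource not in ALLOWED:
        # Unknown data is rejected outright rather than silently allowed.
        raise KeyError(f"unknown resource: {resource}")
    return scenario in ALLOWED[resource]


def allowed_resources(scenario: str) -> list:
    """Return the sorted names of all resources usable in a scenario."""
    return sorted(name for name, scenarios in ALLOWED.items() if scenario in scenarios)
```

For example, `is_allowed("train.hsb-de.de.gz", UNSUPERVISED)` evaluates to `False`, matching the note that the German side of the parallel corpus is not usable for Unsupervised.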

BLIND TEST SET SUBMISSION

The blindtest source is now available for both directions: blindtest_updated.tar.gz, updated to additionally contain the two sgm files.

Translation output should be submitted as real case, detokenized, and in SGML format.

We will use the Matrix for submission (thanks to Barry Haddow). Please use wrap-xml.perl to create the sgm files.

Deadline is: July 22, 2020 (17:00 UTC)

ORGANIZERS

Questions or comments can be posted for discussion at wmt-tasks@googlegroups.com.

Organizational issues can be directed to Alexander Fraser.

ACKNOWLEDGMENTS

This work has received funding from the European Research Council (ERC) under grant agreement No. 640550.