WMT21 Shared Task: Unsupervised MT and Very Low Resource Supervised MT

Update, July 23: The submission deadline is over. Please identify the primary submission in OCELoT.

There is no machine translation available for most of the ~7000 languages spoken on Earth. In the unsupervised and very low resource translation task, we collaborate with local communities to provide resources and develop MT systems for local minority languages. Like last year, the task includes translation of Upper Sorbian, a minority Slavic language spoken in Germany. This year, we added translation of the closely related Lower Sorbian and of Chuvash, a minority Turkic language spoken in southern Russia.

This year, the tasks are:

Unsupervised Machine Translation: German to Lower Sorbian. Lower Sorbian to German.
Very Low Resource Supervised Machine Translation: German to Upper Sorbian. Upper Sorbian to German.
Low Resource Supervised Machine Translation: Russian to Chuvash. Chuvash to Russian.

TRANSLATION OF LOWER AND UPPER SORBIAN

Lower and Upper Sorbian are Slavic minority languages spoken in the Eastern part of Germany with 7k and 30k native speakers respectively.

The data for this task was provided by the Sorbian Institute (monolingual data) and the Witaj Sprachzentrum (Witaj Language Center) (both parallel and monolingual data).

The development and test data for Upper Sorbian are the same as last year.

As far as we know, there is no parallel data for Lower Sorbian except for the development and test data provided for this task. Unlike last year, there is no unsupervised task for Upper Sorbian.

We allow the use of all German, Czech and Polish data released for WMT. All Upper Sorbian data (both monolingual and parallel) that we release may be used. In addition, all parallel data with German on one side (German-Czech and German-Polish might be particularly useful) from the WMT news tasks or available in the OPUS project may be used. No other language data may be used.

TRANSLATION OF CHUVASH

Chuvash is a Turkic language spoken as a minority language in the Volga Region in Russia. There is a larger amount of training data available for Chuvash, but the language is rather isolated in the Turkic language family, so unlike Sorbian, it cannot benefit that much from the existence of closely related languages.

All Chuvash data (parallel and monolingual) that we release may be used. Additional data that may be used: the Chuvash-Russian part of the JW300 corpus and all Russian data released for WMT. In addition, the Kazakh-Russian corpus and monolingual Kazakh data from WMT19 are allowed. Chuvash is covered by multilingual BERT, which may also be used.

EVALUATION

We plan to use automatic metrics for the evaluation of this task. We believe that manual evaluation is less necessary for unsupervised MT and very low resource MT development, because automatic metrics have worked well at this (relatively low) level of translation quality in the past.

IMPORTANT DATES

~~Release of training/dev/test data~~	~~May 6, 2021~~
~~Blind test data released~~	~~July 15, 2021~~
~~Translation submission deadline~~	~~July 23, 2021 (17:00 UTC)~~
System description paper submission deadline	August 5, 2021
Camera ready	September 15, 2021

DATA

Parallel Lower Sorbian Data

There are no parallel training data for Lower Sorbian.

Development and development test data: devtest.dsb-de.tgz. Development and development test data may not be used for training.

Monolingual Lower Sorbian Data

The only allowed Lower Sorbian data for training is monolingual: mono.dsb.gz

Upper Sorbian parallel data

Training data from 2020: train.hsb-de.hsb.gz, train.hsb-de.de.gz
Additional data for 2021: train2021.hsb-de.hsb.gz, train2021.hsb-de.de.gz
Dev and test data (the same as 2020): devtest.tar.gz. Please use dev to tune system parameters, and test to measure progress. The dev data are sampled from the same distribution as the training data.

Please do not use the blind test data from last year.
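The parallel training data come as two files, one per language. A minimal sketch of reading such a pair is given below. It assumes that both files are gzip-compressed UTF-8 text with one sentence per line and that line i in one file is the translation of line i in the other; this layout is an assumption and is not stated on this page. The function name read_parallel is hypothetical.

```python
import gzip
from itertools import zip_longest


def read_parallel(src_path, tgt_path):
    """Yield (source, target) sentence pairs from two gzip text files.

    Assumption: the files are line-aligned, one sentence per line, UTF-8.
    """
    with gzip.open(src_path, "rt", encoding="utf-8") as src, \
         gzip.open(tgt_path, "rt", encoding="utf-8") as tgt:
        # zip_longest pads the shorter file with None, so a length
        # mismatch is reported instead of being silently truncated.
        for line_no, (s, t) in enumerate(zip_longest(src, tgt), start=1):
            if s is None or t is None:
                raise ValueError(f"files differ in length at line {line_no}")
            yield s.rstrip("\n"), t.rstrip("\n")
```

For example, list(read_parallel("train.hsb-de.hsb.gz", "train.hsb-de.de.gz")) would load the 2020 training pairs into memory, provided the files have been downloaded to the working directory. Raising an error on a length mismatch is a deliberate choice: misaligned files are a common source of silent training-data corruption.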

Monolingual Upper Sorbian Data

sorbian_institute_monolingual.hsb.gz Upper Sorbian monolingual data provided by the Sorbian Institute (contains a high-quality corpus and some medium quality data which are mixed together).
sorbian_institute_monolingual.hsb.gz Upper Sorbian monolingual data provided by the Witaj Sprachzentrum (high quality).
web_monolingual.hsb.gz Upper Sorbian monolingual data scraped from the web by CIS, LMU (thanks to Alina Fastowski). Use with caution, probably noisy, might erroneously contain some data from related languages.

Chuvash parallel data

Training data: train.chv-ru.chv.gz, train.chv-ru.ru.gz and a Chuvash-Russian dictionary
Dev and test data: devtest.chv-ru.tgz

The dev and test data are from the same distribution as the training data, but unlike the training data, they were manually filtered.

Monolingual Chuvash data

monocorpus_chv.zip contains monolingual Chuvash data from various sources (Wikipedia, web crawl, fiction).

Note that several characters used in Chuvash occur with two different encodings in UTF-8 text. Please use this script to normalize any Cyrillic-script data other than the parallel and monolingual Chuvash data linked from this page.
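To illustrate the kind of normalization meant here, the following is a minimal sketch and not the organizers' script. It assumes that the duplicate encodings arise from Latin look-alike letters (Ă, Ĕ, Ç, Ÿ and their lowercase forms) being used in place of the Cyrillic letters Ӑ, Ӗ, Ҫ, Ӳ, and from base letters followed by combining diacritics; the mapping table and the function name normalize_chuvash are assumptions made for this example.

```python
import unicodedata

# Assumed mapping from Latin look-alikes to the Cyrillic Chuvash letters.
LATIN_TO_CYRILLIC = str.maketrans({
    "\u0102": "\u04D0",  # LATIN A WITH BREVE      -> CYRILLIC A WITH BREVE
    "\u0103": "\u04D1",  # latin a with breve      -> cyrillic a with breve
    "\u0114": "\u04D6",  # LATIN E WITH BREVE      -> CYRILLIC IE WITH BREVE
    "\u0115": "\u04D7",  # latin e with breve      -> cyrillic ie with breve
    "\u00C7": "\u04AA",  # LATIN C WITH CEDILLA    -> CYRILLIC ES WITH DESCENDER
    "\u00E7": "\u04AB",  # latin c with cedilla    -> cyrillic es with descender
    "\u0178": "\u04F2",  # LATIN Y WITH DIAERESIS  -> CYRILLIC U WITH DOUBLE ACUTE
    "\u00FF": "\u04F3",  # latin y with diaeresis  -> cyrillic u with double acute
})


def normalize_chuvash(text):
    """Map look-alike encodings of Chuvash letters to one Cyrillic form."""
    # NFC first composes "base letter + combining mark" into one code point,
    # then the table replaces the remaining Latin look-alikes.
    composed = unicodedata.normalize("NFC", text)
    return composed.translate(LATIN_TO_CYRILLIC)
```

This sketch should only be applied to text known to be Chuvash: on text in other languages it would wrongly rewrite legitimate Latin letters such as the French or Turkish ç.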

Monolingual German, Czech, Polish, Russian and Kazakh data

See the news translation task web page (also from previous years) for monolingual data. All monolingual data for the listed languages that are or were available for WMT tasks are allowed.

BLIND TEST SET SUBMISSION

The blind test sets:

German-Upper Sorbian: blind_test_2021.de-hsb.de
Upper Sorbian-German: blind_test_2021.hsb-de.hsb
German-Lower Sorbian: blind_test.de-dsb.de
Lower Sorbian-German: blind_test.dsb-de.dsb
Russian-Chuvash: blind_test.ru-chv.ru
Chuvash-Russian: blind_test.chv-ru.chv

Use OCELoT to submit the translated test sets. Teams must be verified after registration: after registering your team, please email Jindřich Libovický (surname at cis.lmu.de) to verify your registration. Note that this is a different instance of OCELoT than the one used for the News task, so if you already registered your team for the News task, you need to register again. The same rules as for the News task apply. (Many thanks to Tom Kocmi and Christian Federmann for making OCELoT work.)

The deadline is: July 23, 2021 (17:00 UTC)

ORGANIZERS

Alexander Fraser - CIS, LMU Munich
Jindřich Libovický - CIS, LMU Munich
Hauke Bartels - Sorbian Institute
Olaf Langner - Witaj Sprachzentrum
Marcin Szczepanski - Sorbian Institute
Alexander Antonov - Chuvash Language Laboratory

Questions or comments can be posted for discussion at wmt-tasks@googlegroups.com.

Organizational issues can be directed to Jindřich Libovický and Alexander Fraser.

ACKNOWLEDGMENTS

This work has received funding from the European Research Council (ERC) under grant agreement No. 640550. This work was also supported by DFG (grant FR 2829/4-1).