No machine translation is available for most of the roughly 7,000 languages spoken on Earth, because few or no parallel corpora exist for them. Research on unsupervised and very low resource machine translation is important for alleviating this problem. Unsupervised machine translation requires only monolingual data, while very low resource supervised machine translation uses very limited parallel data.
At WMT 2018 and WMT 2019, the first and second shared tasks on Unsupervised Machine Translation (UMT) were held as part of the news translation track. In 2018, the language pairs were Turkish-English, Estonian-English and German-English. In 2019, we also tested "simulated" unsupervised systems for German to Czech unsupervised translation (where no German/Czech parallel data was allowed).
We now propose a third edition on UMT, which aims at a more realistic scenario, German to Upper Sorbian (and Upper Sorbian to German) translation. Upper Sorbian is a minority language of Germany that is in the Slavic language family (e.g., related to Lower Sorbian, Czech and Polish), and we provide here most of the digital data that is available, as far as we know.
As we were very recently able to obtain a very small amount of parallel data for this language pair, we also offer a very low resource supervised translation task.
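The tasks are Unsupervised Machine Translation and Very Low Resource Supervised Machine Translation, each for German to Upper Sorbian and Upper Sorbian to German.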
We will work on two tasks for the very low resource language Upper Sorbian (a Slavic minority language spoken in the Eastern part of Germany).
Working with the Sorbian Institute, we initially prepared an unsupervised MT task. We expect this task to become a standard benchmark for unsupervised MT development. This task particularly relies on having high quality Upper Sorbian monolingual data, which we obtained from the Sorbian Institute and the Witaj Sprachzentrum. We also offer data obtained through web crawling. In addition, the Sorbian Institute has provided medium quality data which has not been quality checked.
The Witaj Sprachzentrum (Witaj Language Center) recently provided a small training corpus of German/Upper Sorbian parallel data, which will be used in the Very Low Resource Supervised Machine Translation task. We expect this task to become a standard task for very low resource scenarios, and note that the results are important as they will directly inform efforts to create state-of-the-art machine translation systems for use by the Upper Sorbian community.
The Witaj Sprachzentrum provides development and "development test" (not blind test) sets for German to Upper Sorbian and Upper Sorbian to German translation. The development set is used to tune parameters, while the devtest set is used to measure progress using automatic metrics.
The Witaj Sprachzentrum also provides the blind test sets, which will be released for the evaluation on June 16.
Also, in the near future, CIS (LMU Munich), the Sorbian Institute and the Witaj Sprachzentrum hope to organize a task for Unsupervised Lower Sorbian translation!
|Release of training/dev/test data||March 10, 2020|
|Blind test data released||June 16, 2020|
|Translation submission deadline||July 22, 2020 (17:00 UTC)|
|System description paper submission deadline||August 15, 2020|
We release training data for the two scenarios, Unsupervised and Very Low Resource Supervised.
Unsupervised
We allow the use of all German data released for WMT, except that the German side of the small parallel German/Upper Sorbian training corpus may not be used. All Upper Sorbian data we release may be used. No other language data may be used (no parallel, no other monolingual data sets for any language except those listed here as usable for Unsupervised).
Very Low Resource Supervised
We allow the use of all German and Upper Sorbian data released for WMT, including the 60,000-sentence parallel German/Upper Sorbian training corpus. Other WMT 2020 data for other languages may be used. Upper Sorbian is a Slavic language which is related to Czech, so the German/Czech parallel data below may be of particular interest for building multilingual systems. Thank you to the Opus project for the German/Czech parallel data.
Monolingual Upper Sorbian Data
sorbian_institute_monolingual.hsb.gz Upper Sorbian monolingual data provided by the Sorbian Institute (contains a high quality corpus and some medium quality data which are mixed together).
witaj_monolingual.hsb.gz Upper Sorbian monolingual data provided by the Witaj Sprachzentrum (high quality).
web_monolingual.hsb.gz Upper Sorbian monolingual data scraped from the web by CIS, LMU (thanks to Alina Fastowski). Use with caution: this data is probably noisy and might erroneously contain some text from related languages.
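Since the three monolingual files differ in quality and may overlap, participants may want a light cleaning pass before training. The following is only a minimal sketch, assuming the files are UTF-8 encoded with one sentence per line (an assumption, not something stated here); the function name `read_unique_lines` is hypothetical. It drops empty lines and exact duplicates and does not attempt language identification.

```python
import gzip


def read_unique_lines(paths):
    """Yield each non-empty line once, in first-seen order, across gzip files.

    Assumes UTF-8 text with one sentence per line (an assumption about the
    released files, not a documented guarantee).
    """
    seen = set()
    for path in paths:
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            for line in handle:
                sentence = line.strip()
                # Skip blank lines and sentences already emitted.
                if sentence and sentence not in seen:
                    seen.add(sentence)
                    yield sentence
```

Listing the higher quality files first (for example `witaj_monolingual.hsb.gz` before `web_monolingual.hsb.gz`) means that, for a duplicated sentence, the copy from the cleaner source is the one that is kept.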
Monolingual German Data
See the news translation task web page for monolingual German data. All monolingual German sets are allowed in both scenarios.
Upper Sorbian side of parallel training corpus.
train.hsb-de.hsb.gz (note that this file is usable for both the Unsupervised and Very Low Resource Supervised scenarios. In the Unsupervised scenario, it is used as a small high quality monolingual corpus)
German side of parallel training corpus.
train.hsb-de.de.gz (note that this file is NOT usable for Unsupervised!)
Dev and Test Sets
devtest.tar.gz (please use dev to tune system parameters, and test to measure progress. These files are allowed in both tracks)
German/Czech Parallel Data
This data is not allowed for Unsupervised. For Very Low Resource MT we allow all German/Czech parallel corpora obtainable from the Opus project. The de-cs corpora we particularly recommend using are: Europarl v8 and JW300 v1. These two corpora may be somewhat similar to the de-hsb parallel training and test data (but maybe not, take this with a grain of salt).
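The per-scenario rules for the files released on this page can be summarized in a small lookup table, which may help guard a data pipeline against accidentally using a disallowed file. This is a sketch only: the scenario labels, the `ALLOWED` table and `is_allowed` are hypothetical names, and the table covers only the files named on this page (not the WMT German monolingual data or the Opus German/Czech corpora).

```python
# Scenario labels are illustrative; the permissions follow the rules stated above.
UNSUPERVISED = "unsupervised"
LOW_RESOURCE = "very_low_resource_supervised"
BOTH = frozenset({UNSUPERVISED, LOW_RESOURCE})

ALLOWED = {
    "sorbian_institute_monolingual.hsb.gz": BOTH,
    "witaj_monolingual.hsb.gz": BOTH,
    "web_monolingual.hsb.gz": BOTH,
    # Upper Sorbian side: treated as monolingual data in the Unsupervised scenario.
    "train.hsb-de.hsb.gz": BOTH,
    # German side of the parallel corpus: NOT usable for Unsupervised.
    "train.hsb-de.de.gz": frozenset({LOW_RESOURCE}),
    "devtest.tar.gz": BOTH,
}


def is_allowed(filename, scenario):
    """Return True if a file released on this page may be used in the scenario.

    Raises KeyError for files not in the table and ValueError for unknown
    scenarios, rather than guessing.
    """
    if scenario not in BOTH:
        raise ValueError(f"unknown scenario: {scenario}")
    return scenario in ALLOWED[filename]
```

For example, `is_allowed("train.hsb-de.de.gz", UNSUPERVISED)` is false, while the same file is permitted for `LOW_RESOURCE`.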
The blindtest source is now available for both directions: blindtest_updated.tar.gz, updated to additionally contain the two sgm files.
Translation output should be submitted as real case, detokenized, and in SGML format.
We will use the Matrix for submission: matrix (thanks Barry Haddow). Please use wrap-xml.perl to create the sgm files.
Deadline is: July 22, 2020 (17:00 UTC)
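Before wrapping output into SGML, it can be worth checking that the plain-text translation file has exactly one line per source segment, since a mismatch would misalign the submission. The sketch below assumes the source sgm files mark segments with `<seg ...>...</seg>` elements, as in earlier WMT test sets; the function names are hypothetical and this check is not part of the official tooling.

```python
import re

# Matches one <seg ...>...</seg> element; assumes segments are not nested.
SEG_PATTERN = re.compile(r"<seg\b[^>]*>.*?</seg>", re.DOTALL)


def count_segments(sgm_text):
    """Count <seg> elements in the text of a source sgm file."""
    return len(SEG_PATTERN.findall(sgm_text))


def check_line_count(sgm_text, hypothesis_text):
    """Compare the number of source segments with the number of output lines.

    Returns (matches, expected, actual). A single trailing newline in the
    hypothesis file is not counted as an extra line.
    """
    lines = hypothesis_text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()
    expected = count_segments(sgm_text)
    return expected == len(lines), expected, len(lines)
```

If the counts differ, a common cause is a system that emitted an empty output or split one segment over several lines; fixing that before wrapping avoids a misaligned submission.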
Questions or comments can be posted for discussion at email@example.com.
Organizational issues can be directed to Alexander Fraser.