The recurring translation task of the WMT workshops focuses on news text and (mostly) European language pairs. For this year the language pairs are:
The goals of the shared translation task are:
|Release of training data for shared tasks (by)||January 31, 2019|
|Test suite source texts must reach us||March 24, 2019|
|Test data released||April 8, 2019|
|Translation submission deadline||April 16, 2019 (10am UK)|
|Translated test suites shipped back to test suites authors||April 26, 2019|
|Start of manual evaluation||April 29, 2019|
|End of manual evaluation||May 27, 2019|
We provide training data for all language pairs, and a common framework. The task is to improve current methods. We encourage broad participation: if you feel that your method is interesting but not state-of-the-art, please participate anyway in order to disseminate it and measure progress. Participants will use their systems to translate a test set of unseen sentences in the source language. Translation quality is measured by manual evaluation and by various automatic evaluation metrics. Participants agree to contribute about eight hours of work to the manual evaluation per system submission.
You may participate in any or all of the eight language pairs. For all language pairs we will test translation in both directions. To have a common framework that allows for comparable results, and also to lower the barrier to entry, we provide a common training set, and a pre-processed version (TBC). You are not limited to this training set, and you are not limited to the training set provided for your target language pair. This means that multilingual systems are allowed, and classed as constrained as long as they use only data released for WMT19 (or older WMT Hindi-English and Turkish-English corpora, as listed below).
If you use additional training data (not provided by the WMT19 organisers) or existing translation systems, you must flag that your system uses additional data. We will distinguish system submissions that used the provided training data (constrained) from submissions that used significant additional data resources. Note that basic linguistic tools such as taggers, parsers, or morphological analyzers are allowed in the constrained condition.
Your submission report should highlight in which ways your own methods and data differ from the standard task. You should make it clear which tools you used, and which training sets you used.
The following two aspects of the task are new for 2019: Unsupervised learning and Document-level MT.
For 2019, we also have an unsupervised subtrack: German to Czech translation, using only monolingual German and Czech training data, plus last year's parallel dev and test sets for bootstrapping. The training data should come from the constrained monolingual sets of WMT news translation data.
No German-Czech parallel data is provided, and the participants cannot use any monolingual or parallel data for other languages and language pairs (thus zero-shot, transfer-learning and pivoting-based systems will be treated as part of the general news translation track).
In 2019, we are particularly interested in approaches which consider the whole document. We invite submissions of such approaches for English to German and Czech, and for Chinese to English. We will perform document-level human evaluation for these pairs.
For English to German, we will be releasing as much of the training data as possible with document boundaries intact.
For English to Czech, CzEng 1.7 (unchanged from last year) already offers cross-sentential context for most of its "domains". No complete documents are available, but all sentences in a "block" (i.e. those with the same "-bNUM-" number in the ID, e.g. subtitlesM-b15-00train-f000001-s*) formed a consecutive sequence in the original text. Some blocks are very short (just 1 sentence), and a block is always limited to 13 or 15 sentences. No context information is available for the domains "techdoc", "navajo" and "tweets". The best context-aware domains are "news", "eu", "subtitles*" (well, subtitles) and "fiction".
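To illustrate how the block structure described above could be recovered, here is a minimal Python sketch that groups sentence IDs by their "-bNUM-" part. The ID layout (domain, then "-bNUM-", then further fields, ending in "-sNUM") is inferred only from the single example ID above, and keying on the pair (domain, block number) is an assumption; the function names are hypothetical and not part of any CzEng tooling.

```python
import re
from collections import defaultdict

# Assumed ID shape, inferred from the example "subtitlesM-b15-00train-f000001-s*":
#   <domain>-b<NUM>-<other fields>-s<NUM>
ID_RE = re.compile(r"^(?P<domain>[^-]+)-b(?P<block>\d+)-.*-s(?P<sent>\d+)$")


def parse_id(sentence_id):
    """Return (domain, block number, sentence number), or None if the ID does not fit."""
    match = ID_RE.match(sentence_id)
    if match is None:
        return None
    return match.group("domain"), int(match.group("block")), int(match.group("sent"))


def group_by_block(sentence_ids):
    """Map (domain, block number) to the IDs of that block, ordered by sentence number."""
    blocks = defaultdict(list)
    for sentence_id in sentence_ids:
        parsed = parse_id(sentence_id)
        if parsed is None:
            continue  # skip IDs that do not follow the assumed layout
        domain, block, sent = parsed
        blocks[(domain, block)].append((sent, sentence_id))
    return {key: [sid for _, sid in sorted(items)] for key, items in blocks.items()}
```

Sentences that end up in the same list can then be treated as consecutive context for one another; domains without context information would simply yield blocks that carry no usable ordering.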
At no additional burden on the News Translation Task participants (aside from having to translate much larger input data), we will again collectively provide a deeper analysis of various qualities of the translations. See the corresponding section of Findings 2018 for inspiration.
See the WMT19 Test Suites Google Document for more details.
Authors of additional test suites will be invited to report on their evaluation method and its results in a separate paper.
The data released for the WMT19 news translation task can be freely used for research purposes; we just ask that you cite the WMT19 shared task overview paper and respect any additional citation requirements on the individual data sets. For other uses of the data, you should consult the original owners of the data sets.
We aim to use publicly available sources of data wherever possible. Our main sources of training data are the Europarl corpus, the UN corpus, the news-commentary corpus and the ParaCrawl corpus. We also release a monolingual News Crawl corpus. Other language-specific corpora will be made available.
We have added suitable additional training data to some of the language pairs. You may also use the following monolingual corpora released by the LDC:
Note that the released data is not tokenized and includes sentences of any length (including empty sentences). All data is in Unicode (UTF-8) format. The following Moses tools allow the processing of the training data into tokenized format:
To evaluate your system during development, we suggest using the 2018 test set. The data is provided in raw text format and in an SGML format that suits the NIST scoring tool. We also release other dev and test sets from previous years. For the new language pairs, we release dev sets in January, prepared in the same way as the test sets.
The 2019 test sets will be created from a sample of online newspapers from September-November 2018. For the established languages (i.e. English to/from Chinese, Czech, German, Finnish and Russian) the English-X and X-English test sets will be distinct, and will consist only of documents created originally in the source language. For the new languages (i.e. English to/from Gujarati, Kazakh and Lithuanian) the test sets include 50% English-X translation and 50% X-English translation. In previous recent tasks, all the test data was created using the latter method.
We have released development data for the tasks that are new this year. It is created in the same way as the test set and included in the development tarball.
The news-test2011 set has three additional Czech translations that you may want to use. You can download them from Charles University.
|Europarl v9||✓||✓||✓||✓||✓ *||New: Re-extracted to include document boundaries. *: europarl-v7 for fr-de|
|✓||✓||✓||✓||✓||✓||New version for 2019 (except en-ru). Please use the bicleaner filtered version.|
|✓||✓||✓||✓ *||Same as last year. *: new for fr-de|
|News Commentary v14||✓||✓||✓||✓||✓||✓||Updated, and now with document boundaries. NB: For the kk-en task, we include part of this data in the dev set, and have created -wmt19 versions of the corpora, which have the dev set removed.|
|CzEng 1.7||✓||Register and download CzEng 1.7. (cross-sentential context available for some domains)|
|✓||✓||✓||✓||✓||✓||✓||✓||New release for 2019|
|✓||✓||Register and download|
|✓||✓||This is part of the Tilde Model Corpus|
|✓||New: A recrawled version of the Rapid corpus, with document boundaries intact. Also prepared by Tilde.|
The only gu-en corpus listed above is Wikititles. In addition, we propose the following datasets, and we specifically encourage unconstrained submissions (i.e. bring your own data).
In addition to the wikititles and news-commentary above, we provide:
|News crawl||✓||✓||✓||✓||✓||✓||✓||✓||✓||✓||Updated: Large corpora of crawled news, collected since 2007. Versions up to 2017 are as before, except they are re-filtered and re-shuffled. For de and en, document-split versions are available.|
|News discussions||✓||✓||Updated: Corpora crawled from comment sections of online newspapers. Available in English and French.|
|Europarl||✓||✓||✓||✓||✓||✓ *||Monolingual version of European parliament crawl. Superset of the parallel version. *: europarl-v7 for fr|
|News Commentary||✓||✓||✓||✓||✓||✓||✓||Updated: Monolingual text from news-commentary crawl. Superset of parallel version. Use v14. NB: For the kk-en task, we include part of this data in the dev set, and have created -wmt19 versions of the corpora, which have the dev set removed.|
|Common Crawl||✓||✓||✓||✓||✓||✓||✓||✓||✓||✓||Deduplicated with development and evaluation sentences removed. English was updated 31 January 2016 to remove bad UTF-8. Downloads can be verified with SHA512 checksums. More English is available for unconstrained participants.|
|Wiki dumps||✓||✓||✓||New: Monolingual text from Wikipedia, extracted using WikiExtractor.|
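The Common Crawl row above notes that downloads can be verified with SHA512 checksums. As a minimal sketch of such a check, the following Python uses only the standard library; the function names are hypothetical, and the expected hex digest is assumed to be whatever is published alongside the download.

```python
import hashlib


def sha512_of_file(path, chunk_size=1 << 20):
    """Compute the SHA512 hex digest of a file, reading it in chunks to bound memory use."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_hex):
    """Compare a file's digest with a published one, ignoring case and surrounding whitespace."""
    return sha512_of_file(path) == expected_hex.strip().lower()
```

Reading in chunks matters here because the corpora are large; loading a whole file into memory just to hash it would be wasteful.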
To submit your results, please first convert them into the SGML format required by the NIST BLEU scorer, and then upload them to the website matrix.statmt.org.
For Chinese output, you should submit unsegmented text, since our primary measure is human evaluation. For automatic scoring (in the matrix) we use BLEU4 computed on characters, scoring with v1.3 of the NIST scorer only. A script to convert a Chinese SGM file to characters can be found here.
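To make the idea of character-level scoring concrete, here is a minimal Python sketch that puts spaces around each Chinese character so that a word-based scorer treats every character as a token. It is not the conversion script referred to above: it works on plain text rather than SGM files, and the choice of Unicode ranges (CJK Unified Ideographs plus Extension A) is an assumption.

```python
import re

# Assumption: a "character" is any code point in the CJK Unified Ideographs
# block or in Extension A; other scripts, digits and punctuation are left as-is.
CJK_RE = re.compile(r"([\u3400-\u4dbf\u4e00-\u9fff])")


def split_chinese_chars(text):
    """Surround each CJK ideograph with spaces, then normalise whitespace."""
    spaced = CJK_RE.sub(r" \1 ", text)
    return " ".join(spaced.split())
```

With this sketch, a run of Latin letters or digits stays together as one token, while each ideograph becomes a token of its own.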
Each submitted file has to be in a format that is used by standard scoring scripts such as NIST BLEU or TER.
This format is similar to the one used in the source test set files that were released, except for:
<tstset trglang="en" setid="newstest2019" srclang="any">, with trglang set to either
ru. Important: srclang is always
The script wrap-xml.perl makes it very easy to convert an output file in one-segment-per-line format into the required SGML file:
wrap-xml.perl LANGUAGE SRC_SGML_FILE SYSTEM_NAME < IN > OUT
wrap-xml.perl en newstest2019-src.de.sgm Google < decoder-output > decoder-output.sgm
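For readers who want to see what such a conversion involves, the following Python sketch wraps one-segment-per-line output using the source SGML file as a template. It is an illustration under stated assumptions, not a port of wrap-xml.perl: it assumes each <seg>...</seg> element sits on a single line, that the source file uses <srcset> and <doc> tags on their own lines, and that the system name contains no double quotes. The function name wrap_sgml is hypothetical.

```python
import re
from xml.sax.saxutils import escape

SEG_RE = re.compile(r"(<seg[^>]*>).*?(</seg>)", re.IGNORECASE)


def wrap_sgml(src_lines, translations, language, system_name):
    """Return SGML lines in which each source segment is replaced by a translation."""
    pending = [t.rstrip("\n") for t in translations]
    position = 0
    out = []
    for raw in src_lines:
        line = raw.rstrip("\n")
        if re.match(r"\s*<srcset", line, re.IGNORECASE):
            # The submission uses <tstset trglang="..."> instead of <srcset>.
            line = re.sub(r"<srcset", lambda m: '<tstset trglang="%s"' % language,
                          line, count=1, flags=re.IGNORECASE)
        elif re.match(r"\s*</srcset", line, re.IGNORECASE):
            line = re.sub(r"</srcset", "</tstset", line, count=1, flags=re.IGNORECASE)
        elif re.match(r"\s*<doc", line, re.IGNORECASE):
            # Drop any existing sysid and record the submitting system instead.
            line = re.sub(r'\s*sysid="[^"]*"', "", line)
            line = re.sub(r"<doc", lambda m: '<doc sysid="%s"' % system_name,
                          line, count=1, flags=re.IGNORECASE)
        elif "<seg" in line:
            if position >= len(pending):
                raise ValueError("fewer translations than source segments")
            text = escape(pending[position])  # escape &, < and > in the output
            position += 1
            line = SEG_RE.sub(lambda m: m.group(1) + text + m.group(2), line, count=1)
        out.append(line)
    if position != len(pending):
        raise ValueError("more translations than source segments")
    return out
```

The length checks are deliberate: a mismatch between the number of output lines and the number of source segments usually means the decoder dropped or split a line, and a silently misaligned submission would be scored incorrectly.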
Upload happens in three easy steps:
You can use the matrix to list all your systems, and edit the metadata. This is important since after the test week ends, you need to decide which are your primary systems (that get included in the human evaluation, and the overview paper) and to ensure that you are happy with the system naming.
To access your system list, log in and select Account -> my current systems. You should see a list of all your systems, along with their metadata, and an edit button. Some instructions are included on this screen.
Evaluation will be done both automatically and by human judgement.