The recurring translation task of the WMT workshops focuses on news text and (mostly) European language pairs. For this year the language pairs are:
The goals of the shared translation task are:
|Release of training data for shared tasks (by)||31 January, 2019|
|Test suite source texts must reach us||March 24, 2019|
|Test data released||April 8, 2019|
|Translation submission deadline||April 16, 2019 (10am UK)|
|Translated test suites shipped back to test suites authors||April 26, 2019|
|Start of manual evaluation||April 29, 2019|
|End of manual evaluation||May 27, 2019|
We provide training data for all language pairs, and a common framework. The task is to improve current methods. We encourage broad participation -- if you feel that your method is interesting but not state-of-the-art, then please participate in order to disseminate it and measure progress. Participants will use their systems to translate a test set of unseen sentences in the source language. Translation quality is measured by manual evaluation and various automatic evaluation metrics. Participants agree to contribute about eight hours of work per system submission to the manual evaluation.
You may participate in any or all of the eight language pairs. For all language pairs we will test translation in both directions. To have a common framework that allows for comparable results, and also to lower the barrier to entry, we provide a common training set, and a pre-processed version (TBC). You are not limited to this training set, and you are not limited to the training set provided for your target language pair. This means that multilingual systems are allowed, and classed as constrained as long as they use only data released for WMT19 (or older WMT Hindi-English and Turkish-English corpora, as listed below).
If you use additional training data (not provided by the WMT19 organisers) or existing translation systems, you must flag that your system uses additional data. We will distinguish system submissions that used the provided training data (constrained) from submissions that used significant additional data resources. Note that basic linguistic tools such as taggers, parsers, or morphological analyzers are allowed in the constrained condition.
Your submission report should highlight in which ways your own methods and data differ from the standard task. You should make it clear which tools you used, and which training sets you used.
The following two aspects of the task are new for 2019: Unsupervised learning and Document-level MT.
For 2019, we also have an unsupervised subtrack: German to Czech translation, using monolingual German and Czech training data only, as well as last year's parallel dev and test sets for bootstrapping. The training data should come from the constrained monolingual sets of WMT news translation data.
No German-Czech parallel data is provided, and the participants cannot use any monolingual or parallel data for other languages and language pairs (thus zero-shot, transfer-learning and pivoting-based systems will be treated as part of the general news translation track).
In 2019, we are particularly interested in approaches which consider the whole document. We invite submissions of such approaches for English to German and Czech, and for Chinese to English. We will perform document-level human evaluation for these pairs.
For English to German, we will be releasing as much of the training data as possible with document boundaries intact.
For English to Czech, CzEng 1.7 (unchanged from last year) already offers cross-sentential context for most of its "domains". No complete documents are available, but all sentences in a "block" (i.e. those with the same "-bNUM-" number in the ID, e.g. subtitlesM-b15-00train-f000001-s*) form a consecutive sequence in the original text. Sometimes a block is very short (just one sentence), and it is never longer than 15 sentences. No context information is available for the domains "techdoc", "navajo" and "tweets". The domains with the best context information are "news", "eu", "subtitles*" (i.e. the various subtitles domains) and "fiction".
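For systems that want to exploit this context, the block structure can be recovered directly from the sentence IDs. A minimal sketch (the ID pattern here is assumed from the single example above, not from an official specification):

```python
import re
from collections import defaultdict

# Hypothetical sketch: group CzEng sentence IDs into context blocks by
# their "-bNUM-" field. The ID layout is assumed from the example
# "subtitlesM-b15-00train-f000001-s*" given in the task description.
BLOCK_RE = re.compile(r"^(?P<domain>[A-Za-z]+)-b(?P<block>\d+)-")

def group_by_block(sentence_ids):
    """Map (domain, block number) -> list of sentence IDs in that block."""
    blocks = defaultdict(list)
    for sid in sentence_ids:
        m = BLOCK_RE.match(sid)
        if m:
            blocks[(m.group("domain"), int(m.group("block")))].append(sid)
    return dict(blocks)
```

Sentences sharing a key form one consecutive context window; blocks with a single member offer no usable context.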
At no additional burden to News Translation Task participants (aside from having to translate much larger input data), we will again collectively provide a deeper analysis of various qualities of the translations. See the corresponding section of the Findings of WMT 2018 for inspiration.
See WMT19 Test Suites Google Document for more details.
Authors of additional test suites will be invited to report on their evaluation method and its results in a separate paper.
The data released for the WMT19 news translation task can be freely used for research purposes; we ask only that you cite the WMT19 shared task overview paper and respect any additional citation requirements on the individual data sets. For other uses of the data, you should consult the original owners of the data sets.
We aim to use publicly available sources of data wherever possible. Our main sources of training data are the Europarl corpus, the UN corpus, the news-commentary corpus and the ParaCrawl corpus. We also release a monolingual News Crawl corpus. Other language-specific corpora will be made available.
We have added suitable additional training data to some of the language pairs. You may also use the following monolingual corpora released by the LDC:
Note that the released data is not tokenized and includes sentences of any length (including empty sentences). All data is in Unicode (UTF-8) format. The following Moses tools allow the processing of the training data into tokenized format:
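Since the released data is untokenized and may contain empty or very long sentences, a common first step before running the Moses tools is to filter the parallel data. A minimal illustrative sketch (the threshold and helper name are our own, not part of the released tools):

```python
# Illustrative pre-filtering step for the released (untokenized) parallel
# data: drop empty lines and pairs with extreme lengths before passing the
# text to the Moses tokenizer. The max_len threshold is an assumption.
def clean_pairs(src_lines, tgt_lines, max_len=100):
    kept = []
    for s, t in zip(src_lines, tgt_lines):
        s, t = s.strip(), t.strip()
        if not s or not t:
            continue  # empty on either side: discard the pair
        if len(s.split()) > max_len or len(t.split()) > max_len:
            continue  # implausibly long sentence: discard the pair
        kept.append((s, t))
    return kept
```

Both sides of a pair are always kept or dropped together, so the corpus stays sentence-aligned.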
To evaluate your system during development, we suggest using the 2018 test set. The data is provided in raw text format and in an SGML format that suits the NIST scoring tool. We also release other dev and test sets from previous years. For the new language pairs, we release dev sets in January, prepared in the same way as the test sets.
The 2019 test sets will be created from a sample of online newspapers from September-November 2018. For the established languages (i.e. English to/from Chinese, Czech, German, Finnish and Russian), the English-X and X-English test sets will be distinct, each consisting only of documents originally created in the source language. For the new languages (i.e. English to/from Gujarati, Kazakh and Lithuanian), the test sets will consist of 50% English-X translation and 50% X-English translation. In recent previous tasks, all the test data was created using the latter method.
We have released development data for the tasks that are new this year. It is created in the same way as the test set and included in the development tarball.
The news-test2011 set has three additional Czech translations that you may want to use. You can download them from Charles University.
|Europarl v9||✓||✓||✓||✓||✓ *||New: Re-extracted to include document boundaries. *: europarl-v7 for fr-de|
|ParaCrawl v3||✓||✓||✓||✓||✓||✓||New version for 2019 (except en-ru). Please use the bicleaner filtered version.|
|Common Crawl||✓||✓||✓||✓ *||Same as last year. *: new for fr-de|
|News Commentary v14||✓||✓||✓||✓||✓||✓||Updated, and now with document boundaries. NB For the kk-en task, we include part of this data in the dev set, and have created -wmt19 versions of the corpora, which have the dev set removed.|
|CzEng 1.7||✓||Register and download CzEng 1.7. (cross-sentential context available for some domains)|
|Wiki Titles||✓||✓||✓||✓||✓||✓||✓||✓||New release for 2019|
|United Nations Parallel Corpus||✓||✓||Register and download|
|✓||✓||This is part of the Tilde Model Corpus|
|✓||New: A recrawled version of the Rapid corpus, with document boundaries intact. Also prepared by Tilde.|
The only gu-en corpus listed above is Wiki Titles. In addition, we propose the following datasets, and specifically encourage unconstrained submissions (i.e. bring your own data).
In addition to the wikititles and news-commentary above, we provide:
|News crawl||✓||✓||✓||✓||✓||✓||✓||✓||✓||✓||Updated: Large corpora of crawled news, collected since 2007. Versions up to 2017 are as before, except that they are re-filtered and re-shuffled. For de and en, document-split versions are available.|
|News discussions||✓||Updated: Corpora crawled from comment sections of online newspapers. Available in English and French.|
|Europarl||✓||✓||✓||✓||✓||✓ *||Monolingual version of European parliament crawl. Superset of the parallel version. *: europarl-v7 for fr|
|News Commentary||✓||✓||✓||✓||✓||✓||✓||Updated: Monolingual text from the news-commentary crawl. Superset of the parallel version. Use v14. NB For the kk-en task, we include part of this data in the dev set, and have created -wmt19 versions of the corpora, which have the dev set removed.|
|Common Crawl||✓||✓||✓||✓||✓||✓||✓||✓||✓||✓||Deduplicated with development and evaluation sentences removed. English was updated 31 January 2016 to remove bad UTF-8. Downloads can be verified with SHA512 checksums. More English is available for unconstrained participants.|
|Wiki dumps||✓||✓||✓||New: Monolingual text from Wikipedia, extracted using WikiExtractor.|
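Since the Common Crawl downloads above can be verified with SHA512 checksums, it is worth checking each file's digest before training. A sketch using Python's standard hashlib (the file path and expected digest are placeholders you substitute with the published values):

```python
import hashlib

# Sketch: verify a downloaded corpus file against its published SHA512
# checksum. Reads in chunks so arbitrarily large files fit in memory.
def sha512_of(path, chunk_size=1 << 20):
    h = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(path, expected_hex):
    """True iff the file's SHA512 digest matches the published checksum."""
    return sha512_of(path) == expected_hex.lower()
```

The same check can of course be done with the `sha512sum` command-line tool; this is just the in-process equivalent.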
To submit your results, please first convert them into SGML format as required by the NIST BLEU scorer, and then upload them to the website matrix.statmt.org.
For Chinese output, you should submit unsegmented text, since our primary measure is human evaluation. For automatic scoring (in the matrix) we use BLEU4 computed on characters, scoring with v1.3 of the NIST scorer only. A script to convert a Chinese SGM file to characters can be found here.
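The linked script is the reference; for illustration only, the core idea of character-level scoring is to put a space between every character of the untokenized output. A simplified sketch (not the official conversion script, which also handles the SGML markup):

```python
# Illustrative sketch of character-level preparation for Chinese BLEU:
# split a line into space-separated characters, collapsing any existing
# whitespace. This is a simplification of the official conversion script.
def to_characters(line):
    return " ".join(ch for ch in line if not ch.isspace())

print(to_characters("机器翻译很有趣"))  # -> "机 器 翻 译 很 有 趣"
```

BLEU4 over these space-separated characters is then the character-level score reported in the matrix.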
Each submitted file has to be in a format that is used by standard scoring scripts such as NIST BLEU or TER.
This format is similar to the one used in the source test set files that were released, except for:
<tstset trglang="en" setid="newstest2019" srclang="any">, with trglang set to the code of the target language (e.g. en or ru). Important: srclang is always "any".
The script wrap-xml.perl makes the conversion of an output file in one-segment-per-line format into the required SGML file very easy:
```
wrap-xml.perl LANGUAGE SRC_SGML_FILE SYSTEM_NAME < IN > OUT
wrap-xml.perl en newstest2019-src.de.sgm Google < decoder-output > decoder-output.sgm
```
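If you prefer Python, the central step of wrap-xml.perl, substituting each <seg> body with the next line of system output, can be sketched as follows (the surrounding srcset/tstset tag rewriting is omitted, and this is not the official script):

```python
import re

# Minimal sketch of the seg-replacement step performed by wrap-xml.perl:
# each <seg> body in the source SGML is replaced with the corresponding
# line of decoder output (one segment per line, in order). The rewriting
# of the outer srcset/tstset tags is left out for brevity.
def fill_segments(sgml_lines, hyp_lines):
    hyps = iter(hyp_lines)
    filled = []
    for line in sgml_lines:
        filled.append(re.sub(
            r"(<seg\b[^>]*>).*?(</seg>)",
            lambda m: m.group(1) + next(hyps) + m.group(2),
            line))
    return filled
```

Using a replacement function rather than a replacement string keeps any backslashes in the hypotheses from being interpreted as escape sequences.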
Upload happens in three easy steps:
You can use the matrix to list all your systems and edit their metadata. This is important: after the test week ends, you need to decide which are your primary systems (those included in the human evaluation and the overview paper) and to make sure you are happy with the system naming.
To access your system list, log in and select Account -> my current systems. You should see a list of all your systems, along with their metadata, and an edit button. Some instructions are included on this screen.
Evaluation will be done both automatically and by human judgment.