Translation Task - ACL 2019 Fourth Conference on Machine Translation

Shared Task: Machine Translation of News

The recurring translation task of the WMT workshops focuses on news text and (mostly) European language pairs. For this year the language pairs are:

Chinese-English
Czech-English (this year English-to-Czech only)
Finnish-English
German-English
Gujarati-English
Kazakh-English
Lithuanian-English
Russian-English
German-Czech (only into Czech, only unsupervised MT / without parallel data)
NEW French-German (topic: EU elections)

We provide parallel corpora for all languages as training data, and additional resources for download.

GOALS

The goals of the shared translation task are:

To investigate the applicability of current MT techniques when translating into languages other than English
To examine special challenges in translating between European languages, including word order differences and morphology
To investigate the translation of low-resource, morphologically rich languages
To create publicly available corpora for machine translation and machine translation evaluation
To generate up-to-date performance numbers in order to provide a basis of comparison in future research
To offer newcomers a smooth start with hands-on experience in state-of-the-art statistical machine translation methods
To investigate the usefulness of multilingual and third language resources
To compare unsupervised MT in a controlled environment
To assess the effectiveness of document-level approaches

We hope that both beginners and established research groups will participate in this task.

IMPORTANT DATES

Release of training data for shared tasks (by)	31 January, 2019
Test suite source texts must reach us	March 24, 2019
Test data released	April 8, 2019
Translation submission deadline	April 16, 2019 (10am UK)
Translated test suites shipped back to test suites authors	April 26, 2019
Start of manual evaluation	April 29, 2019
End of manual evaluation	May 27, 2019

TASK DESCRIPTION

We provide training data for all language pairs, and a common framework. The task is to improve current methods. We encourage broad participation -- if you feel that your method is interesting but not state-of-the-art, then please participate in order to disseminate it and measure progress. Participants will use their systems to translate a test set of unseen sentences in the source language. The translation quality is measured by a manual evaluation and various automatic evaluation metrics. Participants agree to contribute about eight hours of work to the manual evaluation per system submission.

You may participate in any or all of the language pairs listed above. We will test translation in both directions, except where the list above restricts a pair to a single direction. To have a common framework that allows for comparable results, and also to lower the barrier to entry, we provide a common training set, and a pre-processed version (TBC). You are not limited to this training set, and you are not limited to the training set provided for your target language pair. This means that multilingual systems are allowed, and classed as constrained as long as they use only data released for WMT19 (or older WMT Hindi-English and Turkish-English corpora, as listed below).

If you use additional training data (not provided by the WMT19 organisers) or existing translation systems, you must flag that your system uses additional data. We will distinguish system submissions that used the provided training data (constrained) from submissions that used significant additional data resources. Note that basic linguistic tools such as taggers, parsers, or morphological analyzers are allowed in the constrained condition.

Your submission report should highlight in which ways your own methods and data differ from the standard task. You should make it clear which tools you used, and which training sets you used.

The following two aspects of the task are new for 2019: Unsupervised learning and Document-level MT.

Unsupervised learning

For 2019, we also have an unsupervised subtrack: German to Czech translations, using monolingual German and Czech training data only, as well as last year's parallel dev and test sets for bootstrapping. The training data should come from the constrained monolingual sets of WMT news translation data.

No German-Czech parallel data is provided, and the participants cannot use any monolingual or parallel data for other languages and language pairs (thus zero-shot, transfer-learning and pivoting-based systems will be treated as part of the general news translation track).

Document-level MT

In 2019, we are particularly interested in approaches which consider the whole document. We invite submissions of such approaches for English to German and Czech, and for Chinese to English. We will perform document-level human evaluation for these pairs.

For English to German, we will be releasing as much of the training data as possible with document boundaries intact.

For English to Czech, CzEng 1.7 (unchanged from last year) does already offer cross-sentential context for most of its "domains". No complete documents are available but all sentences in a "block" (i.e. those with the same "-bNUM-" number in the ID, e.g. subtitlesM-b15-00train-f000001-s*) formed a consecutive sequence in the original text. Sometimes the block is very short (just 1 sentence), and it is always limited to 13 or 15 sentences. No context information is available for the domains "techdoc", "navajo" and "tweets". The best context-aware domains are "news", "eu", "subtitles*" (well, subtitles) and "fiction".
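
The block structure described above can be recovered from the sentence IDs alone. The following is a minimal sketch, assuming that every ID follows the pattern of the example above (a domain name, then "-bNUM-"); the function names and the "-s1"/"-s2" suffixes used for illustration are hypothetical and not part of any CzEng tooling.

```python
import re
from collections import defaultdict

# Matches the "-bNUM-" block marker described above; the part before it
# is taken to be the domain name (an assumption based on the example ID).
BLOCK_RE = re.compile(r"^(?P<domain>.+?)-b(?P<block>\d+)-")


def block_key(sentence_id):
    """Return (domain, block number) for an ID, or None if no block marker is found."""
    match = BLOCK_RE.match(sentence_id)
    if match is None:
        return None
    return match.group("domain"), int(match.group("block"))


def group_by_block(records):
    """Group (sentence_id, text) pairs by block, preserving input order within each block."""
    blocks = defaultdict(list)
    for sentence_id, text in records:
        key = block_key(sentence_id)
        if key is not None:
            blocks[key].append(text)
    return dict(blocks)
```

Sentences that share a key can then be passed to a context-aware system together; IDs without a block marker are skipped in this sketch.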

Additional Test Suites Linked to News Translation Task

At no additional burden on the News Translation Task participants (aside from having to translate much larger input data), we will again collectively provide a deeper analysis of various qualities of the translations. See the corresponding section of Findings 2018 for inspiration.

See WMT19 Test Suites Google Document for more details.

System developers may want to learn in advance what their systems will be tested on.
Everyone is welcome to contribute additional test suites.

Authors of additional test suites will be invited to report on their evaluation method and its results in a separate paper.

DATA

LICENSING OF DATA

The data released for the WMT19 news translation task can be freely used for research purposes; we just ask that you cite the WMT19 shared task overview paper, and respect any additional citation requirements on the individual data sets. For other uses of the data, you should consult with the original owners of the data sets.

TRAINING DATA

We aim to use publicly available sources of data wherever possible. Our main sources of training data are the Europarl corpus, the UN corpus, the news-commentary corpus and the ParaCrawl corpus. We also release a monolingual News Crawl corpus. Other language-specific corpora will be made available.

We have added suitable additional training data to some of the language pairs.

You may also use the following monolingual corpora released by the LDC:

LDC2011T07 English Gigaword Fifth Edition
LDC2009T13 English Gigaword Fourth Edition
LDC2007T07 English Gigaword Third Edition
LDC2009T27 Chinese Gigaword Fourth Edition

Note that the released data is not tokenized and includes sentences of any length (including empty sentences). All data is in Unicode (UTF-8) format. The following Moses tools allow the processing of the training data into tokenized format:

Tokenizer tokenizer.perl
Detokenizer detokenizer.perl
Lowercaser lowercase.perl
SGML Wrapper wrap-xml.perl

These tools are available in the Moses git repository.
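
Because the released data may contain empty sentences, it can help to drop them before tokenization while keeping the two sides of a parallel corpus aligned. This is a minimal sketch, assuming the corpus is available as two line-aligned sequences of strings; the function name is hypothetical and this step is not part of the Moses tools listed above.

```python
def drop_empty_pairs(source_lines, target_lines):
    """Yield (source, target) pairs in which both sides contain non-whitespace text.

    Assumes the two inputs are line-aligned: line i of the source
    corresponds to line i of the target.
    """
    for src, tgt in zip(source_lines, target_lines):
        src, tgt = src.strip(), tgt.strip()
        if src and tgt:
            yield src, tgt
```

Filtering both sides in a single pass avoids the misalignment that would result from removing empty lines from each file independently.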

DEVELOPMENT DATA

To evaluate your system during development, we suggest using the 2018 test set. The data is provided in raw text format and in an SGML format that suits the NIST scoring tool. We also release other dev and test sets from previous years. For the new language pairs, we release dev sets in January, prepared in the same way as the test sets.

Year	CS-EN	DE-EN	FI-EN	RU-EN	ZH-EN	FR-DE
2008	✓	✓				✓
2009	✓	✓				✓
2010	✓	✓				✓
2011	✓	✓				✓
2012	✓	✓		✓		✓
2013	✓	✓		✓		✓
2014	✓	✓		✓		✓
2015	✓	✓	✓	✓
2016	✓	✓	✓	✓
2017	✓	✓	✓	✓	✓
2018	✓	✓	✓	✓	✓

The 2019 test sets will be created from a sample of online newspapers from September-November 2018. For the established languages (i.e. English to/from Chinese, Czech, German, Finnish and Russian) the English-X and X-English test sets will be distinct, and only consist of documents created originally in the source language. For the new languages (i.e. English to/from Gujarati, Kazakh and Lithuanian) the test sets include 50% English-X translation, and 50% X-English translation. In previous recent tasks, all the test data was created using the latter method.

We have released development data for the tasks that are new this year. It is created in the same way as the test set and included in the development tarball.

The news-test2011 set has three additional Czech translations that you may want to use. You can download them from Charles University.

DOWNLOAD

Parallel data:

File	CS-EN	DE-EN	FI-EN	GU-EN	KK-EN	LT-EN	RU-EN	ZH-EN	FR-DE	Notes
Europarl v9	✓	✓	✓			✓			✓ *	New: Re-extracted to include document boundaries. *: europarl-v7 for fr-de
ParaCrawl v3	✓	✓	✓			✓	✓		✓	New version for 2019 (except en-ru). Please use the bicleaner filtered version.
Common Crawl corpus	✓	✓					✓		✓ *	Same as last year. *: new for fr-de
News Commentary v14	✓	✓			✓		✓	✓	✓	Updated, and now with document boundaries. NB For the kk-en task, we include part of this data in the dev set, and have created -wmt19 versions of the corpora, which have the dev set removed.
CzEng 1.7	✓									Register and download CzEng 1.7. (cross-sentential context available for some domains)
Yandex Corpus							✓			ru-en
Wiki Titles v1	✓	✓	✓	✓	✓	✓	✓	✓		New release for 2019
UN Parallel Corpus V1.0							✓	✓		Register and download
Rapid corpus of EU press releases			✓			✓				This is part of the Tilde Model Corpus
Document-split Rapid corpus		✓								New A recrawled version of the Rapid corpus, with document boundaries intact. Also prepared by Tilde.
CWMT Corpus								✓

Additional training data for Gujarati-English

The only gu-en corpus listed above is Wiki Titles. In addition, we propose the following data sets, and we specifically encourage unconstrained submissions (i.e. bring your own data).

The Bible Corpus, extracted from data available here
A localisation corpus extracted from OPUS, consisting mainly of open-source software localisation data.
The Emille Corpus, available from ELRA, free for academic use. The Emille corpus is not actually parallel, but does contain some parallel text.
Some small corpora which seem to be only available to Indian citizens
Crowd-sourced bilingual dictionaries collected by Ellie Pavlick and collaborators
The HindEnCorp created by CUNI, or the larger one created by IIT Bombay for the Workshop on Asian Language Translation shared task. If pivoting through Hindi is feasible, then these would be useful.
A parallel corpus extracted from Wikipedia and contributed by Alexander Molchanov of PROMT.
A crawled corpus produced for this task. It is very noisy, but contains some parallel data. A cleaned version is also available, cleaned using language detection and simple length heuristics. We recommend that you either use the cleaned version, or apply your own cleaning to the raw version.
You can use the Wikipedia en and gu dumps as a comparable corpus.
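
If you choose to clean the raw crawled corpus yourself, a simple length heuristic might look like the sketch below. It is not the procedure used to produce the cleaned version: the function name, the whitespace-based token count, and the thresholds max_tokens and max_ratio are all assumptions chosen for illustration, and language detection is not covered.

```python
def keep_pair(src, tgt, max_tokens=100, max_ratio=3.0):
    """Return True if a sentence pair passes simple length checks.

    Lengths are counted in whitespace-separated tokens. A pair is rejected
    if either side is empty, if either side exceeds max_tokens, or if the
    longer side is more than max_ratio times the length of the shorter one.
    """
    src_len = len(src.split())
    tgt_len = len(tgt.split())
    if src_len == 0 or tgt_len == 0:
        return False
    if src_len > max_tokens or tgt_len > max_tokens:
        return False
    ratio = max(src_len, tgt_len) / min(src_len, tgt_len)
    return ratio <= max_ratio
```

Suitable thresholds depend on the language pair and on how the text is segmented, so these defaults should be tuned on a sample of the data.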

Additional training data for Kazakh-English

In addition to the Wiki Titles and News Commentary corpora above, we provide:

An English-Kazakh crawled corpus of about 100k sentences, prepared by Bagdat Myrzakhmetov of Nazarbayev University. The corpus is distributed as a tsv file with the original URLs included, as well as an alignment score.
A crawled Russian-Kazakh corpus of about 5M sentences, also prepared by Bagdat Myrzakhmetov.
An additional English-Kazakh crawled corpus of about 500k sentences, also prepared by Bagdat Myrzakhmetov. NB: this was not part of the task training data.
We created a -wmt19 version of the news-commentary corpus, which has the dev set removed.
You may also use any of the data previously released for the English-Turkish task.

Monolingual training data:

Corpus	CS	DE	EN	FI	GU	KK	LT	RU	ZH	FR	Notes
News crawl	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	Updated Large corpora of crawled news, collected since 2007. Versions up to 2017 are as before, except they are re-filtered and re-shuffled. For de and en, document-split versions are available.
News discussions			✓							✓	Updated Corpora crawled from comment sections of online newspapers. Available in English and French.
Europarl	✓	✓	✓	✓			✓			✓ *	Monolingual version of European parliament crawl. Superset of the parallel version. *: europarl-v7 for fr
News Commentary	✓	✓	✓			✓		✓	✓	✓	Updated Monolingual text from news-commentary crawl. Superset of parallel version. Use v14. NB For the kk-en task, we include part of this data in the dev set, and have created -wmt19 versions of the corpora, which have the dev set removed.
Common Crawl	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	Deduplicated with development and evaluation sentences removed. English was updated 31 January 2016 to remove bad UTF-8. Downloads can be verified with SHA512 checksums. More English is available for unconstrained participants.
Wiki dumps					✓	✓	✓				New Monolingual text from Wikipedia, extracted using WikiExtractor.
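
The Common Crawl notes above mention SHA512 checksums. One way to verify a download is sketched below, assuming you have the expected hexadecimal digest as a string; the function names are hypothetical, and the layout of the published checksum files is not assumed here.

```python
import hashlib


def sha512_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-512 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_hex):
    """Compare a file's digest with an expected hex string, ignoring case and surrounding whitespace."""
    return sha512_of_file(path) == expected_hex.strip().lower()
```

Reading in chunks keeps memory use constant, which matters for the larger corpus files.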

Development sets
Test sets
Test sets (including additional test suites)

PREPROCESSED DATA

We will provide preprocessed versions of all training and development data (by mid-February). These are preprocessed with standard Moses tools and ready for use in MT training. This preprocessed data is distributed with the intention that it will be useful as a standard data set for future research. The preprocessed data can be obtained here.

TEST SET SUBMISSION

To submit your results, please first convert them into the SGML format required by the NIST BLEU scorer, and then upload the resulting file to the website matrix.statmt.org.

For Chinese output, you should submit unsegmented text, since our primary measure is human evaluation. For automatic scoring (in the matrix) we use BLEU4 computed on characters, scoring with v1.3 of the NIST scorer only. A script to convert a Chinese SGM file to characters can be found here.
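
To illustrate what scoring on characters involves, the sketch below separates each Chinese character with spaces so that a word-level metric effectively operates on characters. It is not the conversion script referred to above: the function names are hypothetical, only the basic CJK Unified Ideographs range is treated as Chinese, and punctuation and non-Chinese text are left attached as they appear, which the actual script may handle differently.

```python
def is_cjk(char):
    """True for characters in the basic CJK Unified Ideographs block (a simplification)."""
    return "\u4e00" <= char <= "\u9fff"


def to_characters(text):
    """Put spaces around every CJK ideograph; runs of other characters stay intact."""
    pieces = []
    for char in text:
        if is_cjk(char):
            pieces.append(" " + char + " ")
        else:
            pieces.append(char)
    # Collapse the extra whitespace introduced above into single spaces.
    return " ".join("".join(pieces).split())
```

For actual matrix scoring, use the linked script so that your numbers match those computed by the organisers.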

SGML Format

Each submitted file has to be in a format that is used by standard scoring scripts such as NIST BLEU or TER.

This format is similar to the one used in the source test set files that were released, except for:

First line is <tstset trglang="en" setid="newstest2019" srclang="any">, with trglang set to either en, de, fr, es, cs or ru. Important: srclang is always any.
Each document tag also has to include the system name, e.g. sysid="uedin".
Closing tag (last line) is </tstset>.

The script wrap-xml.perl makes the conversion of an output file in one-segment-per-line format into the required SGML file very easy:

Format: wrap-xml.perl LANGUAGE SRC_SGML_FILE SYSTEM_NAME < IN > OUT
Example: wrap-xml.perl en newstest2019-src.de.sgm Google < decoder-output > decoder-output.sgm
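
To show the overall shape of the resulting file, here is a rough sketch that wraps a list of segments into a single document. It is not a replacement for wrap-xml.perl, which takes the document structure from the source SGML file: the function name, the single hypothetical docid, and the escaping of special characters are assumptions made for illustration.

```python
from xml.sax.saxutils import escape, quoteattr


def wrap_segments(segments, trglang, sysid, setid="newstest2019", docid="example-doc"):
    """Wrap one-segment-per-line output into a single-document tstset.

    srclang is always "any", as required above. The docid default is a
    placeholder; a real submission must reuse the IDs from the source file.
    """
    lines = [
        '<tstset trglang=%s setid=%s srclang="any">'
        % (quoteattr(trglang), quoteattr(setid)),
        "<doc sysid=%s docid=%s>" % (quoteattr(sysid), quoteattr(docid)),
    ]
    for index, segment in enumerate(segments, start=1):
        # Escape &, < and > so that the markup stays well-formed (an assumption).
        lines.append('<seg id="%d">%s</seg>' % (index, escape(segment.strip())))
    lines.append("</doc>")
    lines.append("</tstset>")
    return "\n".join(lines) + "\n"
```

For real submissions, prefer wrap-xml.perl so that document and segment IDs line up with the released source file.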

Upload to Website

Upload happens in three easy steps:

Go to the website matrix.statmt.org.
Create an account under the menu item Account -> Create Account.
Go to Account -> upload/edit content, and follow the link "Submit a system run"
- select as test set "newstest2019" and the language pair you are submitting
- select "create new system"
- click "continue"
- on the next page, upload your file and add some description

You can use the matrix to list all your systems, and edit the metadata. This is important since after the test week ends, you need to decide which are your primary systems (that get included in the human evaluation, and the overview paper) and to ensure that you are happy with the system naming.

To access your system list, log in and select Account -> my current systems. You should see a list of all your systems, along with their metadata, and an edit button. Some instructions are included on this screen.

EVALUATION

Evaluation will be done both automatically and by human judgement.

Manual Scoring: We will collect subjective judgments about translation quality from human annotators. If you participate in the shared task, we ask you to perform a defined amount of evaluation per language pair submitted. The amount of manual evaluation will be approximately 8 hours.
As in previous years, we expect the translated submissions to be in recased, detokenized, XML format, just as in most other translation campaigns (NIST, TC-Star).

ACKNOWLEDGEMENTS

This task would not have been possible without the sponsorship of test sets from Microsoft, Yandex, Tilde, LinguaCustodia, the University of Helsinki, Charles University Prague, Le Mans University and funding from the European Union's Horizon 2020 research and innovation programme under grant agreement 825299 (GOURMET) and from the EU CHIST-ERA M2CR project.