Shared Task: Parallel Corpus Filtering for Low-Resource Conditions

Following the WMT18 shared task on parallel corpus filtering, we now pose the problem under more challenging low-resource conditions. Instead of German-English, this year there are two language pairs: Nepali-English and Sinhala-English.

Otherwise, the shared task follows the same set-up. Given a noisy parallel corpus (crawled from the web), participants develop methods to filter it to a smaller size of high-quality sentence pairs.

DETAILS

Specifically, we provide a very noisy 40.6 million-word (English token count) Nepali-English corpus and a 59.6 million-word Sinhala-English corpus crawled from the web as part of the Paracrawl project. We ask participants to provide scores for each sentence pair in each of the noisy parallel sets. The scores will be used to subsample sentence pairs that amount to 5 million and 1 million English words. The quality of the resulting subsets is determined by the quality of a statistical machine translation system (Moses, phrase-based) and a neural machine translation system (FAIRseq) trained on this data. The quality of each machine translation system is measured by BLEU score (sacrebleu) on a held-out test set of Wikipedia translations for Sinhala-English and Nepali-English.

We also provide links to training data for the two language pairs. This existing data comes from a variety of sources and is of mixed quality and relevance. We provide a script to fetch and compose the training data.

Note that the task addresses the challenge of data quality and not domain-relatedness of the data for a particular use case. While we provide a development and development test set that are also drawn from Wikipedia articles, these may be very different from the final official test set in terms of topics.

The provided raw parallel corpora are the outcome of a processing pipeline that aimed for high recall at the cost of precision, so they are very noisy. They exhibit noise of all kinds (wrong language in source and target, sentence pairs that are not translations of each other, bad language, incomplete or bad translations, etc.).
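
As a rough illustration of how one of these noise types (wrong language on either side) might be detected, the sketch below checks which Unicode script the letters of a sentence belong to. It is only one simple heuristic and not part of the task definition or the baseline; the names script_ratio and plausible_pair and the 0.5 threshold are assumptions made for this example.

```python
# Unicode block ranges (inclusive) for the scripts involved in this task.
DEVANAGARI = (0x0900, 0x097F)  # Nepali
SINHALA = (0x0D80, 0x0DFF)     # Sinhala
BASIC_LATIN = (0x0041, 0x007A)  # English letters A-Z and a-z


def script_ratio(text: str, lo: int, hi: int) -> float:
    """Return the fraction of alphabetic characters with code points in [lo, hi]."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(lo <= ord(ch) <= hi for ch in letters) / len(letters)


def plausible_pair(foreign: str, english: str, script, threshold: float = 0.5) -> bool:
    """Hypothetical check: both sides are mostly written in the expected script."""
    return (script_ratio(foreign, *script) >= threshold
            and script_ratio(english, *BASIC_LATIN) >= threshold)
```

A pair whose foreign side is actually English, for example, fails this check. It does not catch pairs that are in the right languages but are not translations of each other.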

IMPORTANT DATES

Release of raw parallel data              February 8, 2019
Submission deadline for subsampled sets   May 10, 2019
System descriptions due                   May 17, 2019
Announcement of results                   June 3, 2019
Paper notification                        June 7, 2019
Camera-ready for system descriptions      June 17, 2019

REGISTRATION

It is not necessary to register for the shared task before the submission deadline, but it is highly recommended to subscribe to the general WMT 2019 mailing list to get information about revisions and clarifications of the shared task definition.

RAW CORPUS DOWNLOAD

The raw corpora consist of 40.6 million words (English token count) for Nepali-English and 59.6 million words for Sinhala-English, crawled using the Paracrawl pipeline.

Download raw corpus (327 MB)

UPDATE: Download improved version of Nepali corpus (165 MB).

The provided tarball contains the Nepali-English and Sinhala-English corpora in Moses format, i.e., one sentence per line, with corresponding lines in the English and foreign files.
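
For illustration, line-aligned files of this kind can be read in parallel as sketched below. The function name read_parallel is an assumption made for this example and is not part of any provided tooling.

```python
from itertools import zip_longest
from typing import Iterator, Tuple


def read_parallel(foreign_path: str, english_path: str) -> Iterator[Tuple[str, str]]:
    """Yield (foreign, english) sentence pairs from two line-aligned files."""
    with open(foreign_path, encoding="utf-8") as f_src, \
         open(english_path, encoding="utf-8") as f_en:
        pairs = zip_longest(f_src, f_en)
        for line_no, (src, en) in enumerate(pairs, start=1):
            # zip_longest pads the shorter file with None, which exposes
            # files that are not aligned line by line.
            if src is None or en is None:
                raise ValueError(f"files differ in length at line {line_no}")
            yield src.rstrip("\n"), en.rstrip("\n")
```

Raising an error on a length mismatch is a deliberate choice here: silently truncating to the shorter file would hide a broken alignment.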

PARALLEL AND MONOLINGUAL TRAINING DATA

We are providing links to the permissible third-party sources of data to be used in the competition in the table below. You can use the script to obtain the clean data. Use of this data may be subject to terms and conditions specified by the third-party source.

Nepali

Corpus                     Sentence pairs   English words   Source Files                             Comment
Bible (two translations)   61,645           1,507,905       English.xml English-WEB.xml Nepali.xml   The extraction script can be found here
Global Voices              2,892            75,197          Global Voices (all)                      Contains many languages. Only use En-Ne
Penn Tree Bank             4,199            88,758          NepaliTaggedCorpus.zip                   Corpus needs realigning. Apply patch found here
GNOME / KDE / Ubuntu       494,994          2,018,631       GNOME KDE4 Ubuntu
Nepali Dictionary          9,916            25,058          dictionaries.tar.gz                      Link contains all languages

Sinhala

Corpus                 Sentence pairs   English words   Source Files           Comment
Open Subtitles         601,164          3,594,769       OPUS-OpenSubtitles18
GNOME / KDE / Ubuntu   45,617           150,513         GNOME KDE4 Ubuntu

Monolingual Data

Here we provide the allowed Wikipedia monolingual data, which has been filtered so that it does not contain any of the documents from the dev/devtest/test sets. You may also use monolingual data from CommonCrawl or monolingual English data from the WMT 2019 News Translation shared task.
Corpus                          Sentences     Words           Source files
Filtered Sinhala Wikipedia      155,946       4,695,602       wikipedia.si_filtered.gz
Filtered Nepali Wikipedia       92,296        2,804,439       wikipedia.ne_filtered.gz
Filtered English Wikipedia      67,796,935    1,985,175,324   wikipedia.en_filtered.gz
Filtered Sinhala Common Crawl   5,178,491     110,270,445     commoncrawl.deduped.si.xz
Filtered Nepali Common Crawl    3,562,373     102,988,609     commoncrawl.deduped.ne.xz
Filtered English Common Crawl   380,409,891   8,894,266,960   commoncrawl.deduped.en.xz

Multi-lingual Data

Some participants might find it useful to use additional parallel data from related languages (e.g., Hindi). Here we point to additional resources that can be used for this task.
Corpus                               Sentences    Words           Source Files
Parallel IITB Hindi-English Corpus   1,492,827    20,667,240      parallel.tgz
Monolingual IITB Hindi Corpus        67,796,935   1,985,175,324   monolingual.hi.tgz

SUBMISSIONS

To participate in the shared task, you have to submit a file with quality scores, one per line, corresponding to the sentence pairs. The scores do not have to be meaningful, except that higher scores indicate better quality.
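
A minimal sketch of producing such a score file is shown below. The function name write_scores and the explicit line-count check are assumptions made for this example; the only requirement taken from the task description is one score per line, in corpus order, with higher meaning better.

```python
from typing import Iterable


def write_scores(scores: Iterable[float], out_path: str, expected_lines: int) -> None:
    """Write one score per line and verify the count matches the corpus size."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for score in scores:
            # Line i of the output scores sentence pair i of the raw corpus.
            out.write(f"{float(score)!r}\n")
            count += 1
    if count != expected_lines:
        raise ValueError(
            f"wrote {count} scores for {expected_lines} sentence pairs"
        )
```

Checking the number of lines against the corpus size before uploading guards against a score file that is silently shifted relative to the sentence pairs.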

Upload the file to the Google Drive folder. Please clearly indicate your affiliation in the file name and send an email to phi@jhu.edu to announce your submission.

DEVELOPMENT ENVIRONMENT

Evaluation of the quality scores will be done by subsampling corpora of 5 million and 1 million English words based on these scores, training statistical and neural machine translation systems on these corpora, and evaluating translation quality on blind test sets using the BLEU score (sacrebleu).

For development purposes, we release configuration files and scripts that mirror the official testing procedure with a development test set.

Download development pack

The development pack consists of

Subsampling the corpus

Given your file with sentence-level quality scores, the script subselect.perl allows you to subsample sets with 5 million and 1 million English tokens.

The syntax to use the script is:

subselect.perl FILE_SCORE FILE_F FILE_E OUT

This will typically look something like this for Sinhala-English:

subselect.perl my-score-file.txt clean-eval-wmt19-raw.si clean-eval-wmt19-raw.en out
resulting in files with roughly the following properties:
% wc out.5000000*
   279503   5000052  25967107 out.5000000.en
   279503   3456614  41708480 out.5000000.si
 
For Nepali-English the stats are:
% wc out.5000000*
   248765   5000018  31748929 out.5000000.en
   248765   3327811  48824341 out.5000000.ne
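
The idea behind score-based subsampling can be sketched as follows. This is an illustrative approximation under stated assumptions, not the actual logic of subselect.perl: the function name subsample, the whitespace token count, and the stopping rule (stop once the English token budget is reached) are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple


def subsample(scores: Sequence[float],
              foreign: Sequence[str],
              english: Sequence[str],
              budget: int) -> List[Tuple[str, str]]:
    """Pick the highest-scoring pairs until the English token budget is reached."""
    if not (len(scores) == len(foreign) == len(english)):
        raise ValueError("scores and both corpus sides must have the same length")
    # Sort indices by descending score; sorted() is stable, so ties keep
    # their original corpus order.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    selected: List[Tuple[str, str]] = []
    tokens = 0
    for i in order:
        if tokens >= budget:
            break
        selected.append((foreign[i], english[i]))
        tokens += len(english[i].split())  # whitespace-separated tokens
    return selected
```

With this stopping rule the selected set can exceed the budget by at most one sentence, so the English side ends up slightly above the target size.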
 

Building a Moses system

Training of a Moses system is done with experiment.perl. For detailed documentation on how to build machine translation systems with this script, please refer to the relevant Moses web page.

You will have to change the following configurations at the top of the ems-config.ne (or ems-config.si) configuration file, but everything else may stay the same.

These settings are full path names:

With these changes, training a system is done via
$MOSES/scripts/ems/experiment.perl -config ems-config.ne -exec &> OUT &
and the resulting BLEU score is in the file evaluation/report.1.

Building a FAIRseq system

To build a FAIRseq baseline you can follow the instructions in the FLoRes MT Benchmark. There, you'll find an end-to-end script that will download, tokenize, build vocabularies and train the baseline system reported in this paper. To train and test a system on subsampled data, first preprocess the data with
$DEV_TOOLS/nmt/prepare.sh LANGUAGE DIR SUBSET_STEM FLORES
Then, train the system by executing the following command from the DIR directory specified above.
$DEV_TOOLS/nmt/train.sh LANGUAGE
After training, you can test performance on the development test set with
$DEV_TOOLS/nmt/translate.sh LANGUAGE

BASELINE RESULTS

We trained a Zipporah model on the provided clean data and obtained the following BLEU scores (case-insensitive).

Language   1 million       5 million
           SMT     NMT     SMT     NMT
Sinhala    4.16    4.65    4.77    3.74
Nepali     3.40    5.23    4.22    1.85

TEST SETS AND RESULTS

Results will be made available on June 3, 2019. The official results will be published in an overview paper at the WMT 2019 Conference on Machine Translation.

FREQUENTLY ASKED QUESTIONS

What data resources and tools can be used?

Any standard linguistic tools (POS taggers, parsers, etc.) may be used. However, no additional parallel or monolingual data is allowed; only the data referred to above may be used.

Should sentences be scored in isolation?

It is not required to score each sentence pair independently of the others. You may consider scoring that takes data redundancy into account, i.e., scoring the second occurrence of a very similar sentence pair lower.
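
One possible way to take redundancy into account is sketched below. It is only an illustration: the normalization (lowercasing, mapping digit runs to 0, collapsing whitespace), the multiplicative penalty of 0.5, and the function names are assumptions made for this example, and the penalty only lowers scores if they are non-negative.

```python
import re
from typing import Dict, List, Sequence, Tuple


def normalize(text: str) -> str:
    """Lowercase, map digit runs to 0, and collapse whitespace."""
    text = re.sub(r"\d+", "0", text.lower())
    return " ".join(text.split())


def penalize_duplicates(pairs: Sequence[Tuple[str, str]],
                        scores: Sequence[float],
                        penalty: float = 0.5) -> List[float]:
    """Scale a score by `penalty` once for each earlier near-identical pair.

    Assumes non-negative scores, so that scaling makes a score lower.
    """
    seen: Dict[Tuple[str, str], int] = {}
    adjusted = []
    for (src, en), score in zip(pairs, scores):
        key = (normalize(src), normalize(en))
        n_prev = seen.get(key, 0)  # earlier occurrences of this pair
        adjusted.append(score * penalty ** n_prev)
        seen[key] = n_prev + 1
    return adjusted
```

For example, two pairs that differ only in a page number or in letter case are treated as the same pair, and the later one receives half its original score.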

ORGANIZERS

Philipp Koehn, Johns Hopkins University
Francisco (Paco) Guzmán, Facebook
Vishrav Chaudhary, Facebook
Juan Pino, Facebook

ACKNOWLEDGEMENTS

This shared task is partially supported by Facebook, Paracrawl, and IARPA MATERIAL.