Shared Task: Parallel Corpus Filtering and Alignment for Low-Resource Conditions

This is the third instance of a shared task on assessing the quality of sentence pairs in a parallel corpus.

This year, we pose two different language pairs, Khmer-English and Pashto-English. In addition to the task of computing quality scores for the purpose of filtering, we also allow for the re-alignment of sentence pairs from document pairs.

DETAILS

We provide a very noisy 58.3 million-word (English token count) Khmer-English corpus and an 11.6 million-word Pashto-English corpus. These corpora were partly crawled from the web as part of the Paracrawl project and partly extracted from the CommonCrawl data set. We ask participants to provide a score for each sentence pair in each of the noisy parallel sets. The scores will be used to subsample sentence pairs that amount to 5 million English words. The quality of the resulting subsets is determined by the quality of a neural machine translation system (fairseq) trained on this data. The quality of the machine translation system is measured by its BLEU score (sacrebleu) on a held-out test set of Wikipedia translations for Khmer-English and Pashto-English.

We also provide clean parallel and monolingual training data for the two language pairs. This existing data comes from a variety of sources and is of mixed quality and relevance.

Note that the task addresses the challenge of data quality and not domain-relatedness of the data for a particular use case. While we provide a development and development test set that are also drawn from Wikipedia articles, these may be very different from the final official test set in terms of topics.

The provided raw parallel corpora are the outcome of a processing pipeline that aimed for high recall at the cost of precision, so they are very noisy. They exhibit noise of all kinds (wrong language on the source or target side, sentence pairs that are not translations of each other, bad language, incomplete or bad translations, etc.).
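To make these noise categories concrete, here is a small illustrative sketch of heuristics that catch some of them. This is not part of the official tooling, and all thresholds are arbitrary assumptions:

```python
# Sketch of simple heuristic noise filters; thresholds are illustrative
# assumptions, not part of the official task tooling.

def looks_reasonable(src: str, tgt: str) -> bool:
    """Reject obviously bad sentence pairs with cheap heuristics."""
    # Reject empty or identical pairs (copies are not translations).
    if not src.strip() or not tgt.strip() or src.strip() == tgt.strip():
        return False
    # Reject extreme length ratios (likely misalignments).
    ratio = (len(src) + 1) / (len(tgt) + 1)
    if ratio > 4 or ratio < 0.25:
        return False
    # The Khmer side should contain Khmer-script characters (U+1780-U+17FF);
    # a "Khmer" sentence that has none is probably in the wrong language.
    if not any('\u1780' <= ch <= '\u17ff' for ch in tgt):
        return False
    return True

print(looks_reasonable("Hello world", "សួស្តី"))       # plausible pair: True
print(looks_reasonable("Hello world", "Hello world"))  # copy: False
```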

This year, we also provide the document pairs from which the sentence pairs were extracted (using Hunalign and LASER). You may align sentences yourself from these document pairs, thus producing your own set of sentence pairs. If you opt to do this, you have to submit all aligned sentence pairs and their quality scores.

IMPORTANT DATES

Release of raw parallel data             March 28, 2020
Submission deadline for subsampled sets  August 1, 2020
System descriptions due                  August 15, 2020
Announcement of results                  August 29, 2020
Paper notification                       September 29, 2020
Camera-ready for system descriptions     October 10, 2020

REGISTRATION

It is not necessary to register for the shared task before the submission deadline, but we highly recommend subscribing to the general WMT 2020 mailing list to receive information about revisions and clarifications to the shared task.

RAW CORPUS DOWNLOAD

The raw parallel sentence-aligned corpora consist of 58.3 million words (English token count, Khmer-English) and 11.6 million words (Pashto-English). There are 391,250 document pairs for Khmer-English and 45,312 document pairs for Pashto-English. We also provide LASER similarity scores for the sentence-aligned corpora; LASER is a sentence-embedding method that worked well in last year's task.

Sentence-aligned corpora

Language Pair    Sentence Pairs  English Tokens  Corpus                       Baseline LASER Scores
Khmer-English    4,169,574       58,347,212      wmt20-sent.en-km.xz (201MB)  wmt20-sent.en-km.laser-score.xz (16MB)
Pashto-English   1,022,883       11,551,009      wmt20-sent.en-ps.xz (45MB)   wmt20-sent.en-ps.laser-score.xz (3MB)

The format of the parallel corpora is one sentence pair per line, with the English sentence and the Khmer/Pashto sentence separated by a TAB character.
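A minimal sketch of parsing this format (the sample lines are synthetic; with the real corpus you would iterate over the decompressed file instead):

```python
# Parse the one-pair-per-line, TAB-separated format described above.
# Uses an in-memory synthetic sample for illustration.
sample = "This is English.\tKhmer text here\nSecond sentence.\tMore Khmer\n"

pairs = []
for line in sample.splitlines():
    english, foreign = line.split("\t")  # exactly two fields per line
    pairs.append((english, foreign))

print(len(pairs))   # 2
print(pairs[0][0])  # This is English.
```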

Document Pairs

Language Pair    Document Pairs  Corpus                       Missing Sentence Pairs
Khmer-English    391,250         wmt20-docs.en-km.xz (578MB)  wmt20-sent-missing-in-docs.en-km.xz (2.8MB)
Pashto-English   45,312          wmt20-docs.en-ps.xz (88MB)   wmt20-sent-missing-in-docs.en-ps.xz (3.0MB)

Unfortunately, the corresponding document pairs are missing for some sentence pairs included in the sentence-aligned set. So, if you run your own document alignment, add the sentence pairs from the files wmt20-sent-missing-in-docs.en-km.xz and wmt20-sent-missing-in-docs.en-ps.xz to the sentence pairs that you extract yourself from the document pairs.

The format of the document pairs is one document pair per line, with four fields separated by a TAB character: the URLs of the two documents and their base64-encoded texts.

The text is encoded in base64. It can be decoded into Unicode text with the Unix command base64 -d < IN > OUT. The resulting Unicode text contains line breaks, but participants may apply additional sentence splitting.
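As a sketch, decoding one document-pair line in Python might look as follows. The sample line is synthetic, and the field layout (two URLs followed by two base64-encoded texts) is an assumption for illustration:

```python
import base64

# Synthetic document-pair line: four TAB-separated fields, with the
# document texts base64-encoded. The field layout is an assumption
# made for this illustration.
doc_text = "Line one.\nLine two.\n"
encoded = base64.b64encode(doc_text.encode("utf-8")).decode("ascii")
line = "\t".join(["http://example.com/en", "http://example.com/km",
                  encoded, encoded])

fields = line.split("\t")
decoded = base64.b64decode(fields[2]).decode("utf-8")
print(decoded.splitlines())  # ['Line one.', 'Line two.']
```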

CLEAN PARALLEL AND MONOLINGUAL TRAINING DATA

Parallel Data

We allow the use of the parallel data in the table below.

Khmer-English: km-parallel.tgz (18MB)

Name          Sentence Pairs  English Tokens  Comment
GNOME         56              233             from OPUS, open source software localization
GlobalVoices  793             14,294          from OPUS, citizen journalism
KDE4          120,087         767,919         from OPUS, open source software localization
Tatoeba       748             3,491           from OPUS, crowd-sourced phrases
Ubuntu        6,987           27,413          from OPUS, open source software localization
Bible         54,222          1,176,418       alignment of 2 English with 4 Khmer Bibles
JW300         107,156         1,827,348       originally from OPUS, but sentence alignment re-done with Vecalign; religious texts

Pashto-English: ps-parallel.tgz (2.1MB)

Name          Sentence Pairs  English Tokens  Comment
GNOME         95,312          277,188         from OPUS, open source software localization
KDE4          3,377           8,881           from OPUS, open source software localization
Tatoeba       31              239             from OPUS, crowd-sourced phrases
Ubuntu        9,645           26,626          from OPUS, open source software localization
Bible         13,432          298,522         alignment of an English with a Pashto Bible
TED Talks     664             11,157          created for this task, crawled from the TED web site, sentence alignment with Vecalign
Wikimedia     737             37,566          from OPUS, Wikipedia translations from the Wikimedia Foundation

The corpora are broken up by type, and come in Moses format (two files, aligned sentences at the same line number).
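Reading Moses format amounts to iterating over the two files in lockstep. A minimal sketch, using temporary files with hypothetical content:

```python
import os
import tempfile

def read_moses(src_path, tgt_path):
    """Yield sentence pairs from two line-aligned (Moses-format) files."""
    with open(src_path, encoding="utf-8") as src, \
         open(tgt_path, encoding="utf-8") as tgt:
        for s, t in zip(src, tgt):
            yield s.rstrip("\n"), t.rstrip("\n")

# Tiny self-contained demo; file names and contents are hypothetical.
tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, "dev.en"), "w", encoding="utf-8") as f:
    f.write("Hello.\nGoodbye.\n")
with open(os.path.join(tmp, "dev.ps"), "w", encoding="utf-8") as f:
    f.write("PS-1\nPS-2\n")

pairs = list(read_moses(os.path.join(tmp, "dev.en"),
                        os.path.join(tmp, "dev.ps")))
print(pairs)  # [('Hello.', 'PS-1'), ('Goodbye.', 'PS-2')]
```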

Monolingual Data

You may also use the following monolingual data from CommonCrawl.

Language  Corpus       Sentences      File
English   CommonCrawl  1,806,450,728  cc60_with_url_v2.en_XX_filtered.xz (72 GB)
          Wikipedia    67,796,935     wikipedia.en.lid_filtered.test_filtered.xz (3.2 GB)
Pashto    CommonCrawl  6,558,180      cc60_with_url_v2.ps_AF_filtered.xz (277 MB)
          Wikipedia    76,557         wikipedia.ps.lid_filtered.test_filtered.xz (5.9 MB)
Khmer     CommonCrawl  13,832,947     cc60_with_url_v2.km_XX_filtered.xz (614 MB)
          Wikipedia    132,666        wikipedia.km.lid_filtered.test_filtered.xz (12 MB)

Acknowledgements

Bibles were provided by Arya McCarthy and David Yarowsky.

DEVELOPMENT ENVIRONMENT

Evaluation of the quality scores will be done by subsampling 5 million-word corpora based on these scores, training a neural machine translation system on each of these subsets, and evaluating translation quality on blind test sets using the BLEU score (sacrebleu).

For development purposes, we release configuration files and scripts that mirror the official testing procedure with a development test set.

Download development tools

The development tools consist of the subsampling script (subselect.perl), configuration files and scripts for training a fairseq system from scratch (train-from-scratch) and for fine-tuning a pre-trained MBART model (train-mbart), and the development test set.

In the following code examples, we assume that you have downloaded and extracted the development tools, and then set the environment variable DEV_TOOLS to that directory, e.g.,
wget http://data.statmt.org/wmt20/filtering-task/dev-tools.tgz
tar xzf dev-tools.tgz
export DEV_TOOLS=`pwd`/dev-tools

Subsampling the corpus

Given your file with sentence-level quality scores, the script subselect.perl allows you to subsample sets with 5 million English tokens.

The syntax to use the script is:

subselect.perl FILE_SCORE FILE_F FILE_E OUT

This will typically look something like this for Pashto-English:

subselect.perl my-score-file.txt wmt20-sent.en-ps.ps wmt20-sent.en-ps.en subsample
resulting in files with roughly the following properties
% wc subsample.5000000*
  225725  4979904 31063226 subsample.5000000.en
  225725   550988 44420879 subsample.5000000.ps
 
To try this on the provided LASER scores (this should result in the file sizes above), execute the following commands.
wget http://data.statmt.org/wmt20/filtering-task/wmt20-sent.en-ps.laser-score.xz
xz -d wmt20-sent.en-ps.laser-score.xz
wget http://data.statmt.org/wmt20/filtering-task/ps-km/wmt20-sent.en-ps.xz
xzcat wmt20-sent.en-ps.xz | cut -f 1 > wmt20-sent.en-ps.en
xzcat wmt20-sent.en-ps.xz | cut -f 2 > wmt20-sent.en-ps.ps
$DEV_TOOLS/subselect.perl wmt20-sent.en-ps.laser-score wmt20-sent.en-ps.ps wmt20-sent.en-ps.en subsample
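The selection that subselect.perl performs is, roughly, a greedy cut by score: sort pairs by score in descending order and keep pairs until the English side reaches 5 million tokens. A simplified Python sketch of this logic (not the official implementation; its tokenization and tie-breaking may differ):

```python
def subselect(scores, english, foreign, budget=5_000_000):
    """Greedy selection: best-scored pairs first, up to a token budget."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    selected, tokens = [], 0
    for i in order:
        n = len(english[i].split())  # whitespace token count (assumption)
        if tokens + n > budget:
            break
        selected.append((english[i], foreign[i]))
        tokens += n
    return selected

# Toy example with a budget of 7 English tokens.
pairs = subselect([0.9, 0.2, 0.7],
                  ["a b c", "d e", "f g h i"],
                  ["x", "y", "z"],
                  budget=7)
print(len(pairs))  # 2  (the two best-scored pairs fit within the budget)
```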

Building a fairseq system

To install fairseq you can follow the instructions in the fairseq github repository. Once this is done, set the environment variable FAIRSEQ to the directory into which you cloned the github repository, e.g.,
git clone https://github.com/pytorch/fairseq.git
cd fairseq
export FAIRSEQ=`pwd`
To train and test a system on subsampled data, first preprocess the data with
$DEV_TOOLS/train-from-scratch/prepare.sh LANGUAGE SYSTEM_DIR SUBSET_STEM
where LANGUAGE is the foreign language code (km or ps), SYSTEM_DIR is the directory in which the system will be built, and SUBSET_STEM is the file-name stem of the subsampled corpus (e.g., subsample.5000000). Then, train the system by executing the following commands in the SYSTEM_DIR directory specified above.
cd $SYSTEM_DIR
bash train.sh
After training, you can test performance on the development test set with
cd $SYSTEM_DIR
bash translate.sh
Here is the sequence of commands for the example corpus:
$DEV_TOOLS/train-from-scratch/prepare.sh ps example-system subsample.5000000
cd example-system 
bash train.sh
bash translate.sh

Fine-tuning an MBART model

As an alternative evaluation method for the filtered parallel corpus, we provide a pre-trained model that needs to be fine-tuned with the filtered parallel corpus. The pre-training was done on the monolingual data using a method called MBART. For more details on this pre-training, please consult the arxiv paper.

The evaluation via fine-tuning is faster and yields higher BLEU scores. To carry this out with the provided development tools (which include the Khmer and Pashto pre-trained MBART models), simply use the corresponding scripts in the directory train-mbart instead of train-from-scratch, e.g.,

$DEV_TOOLS/train-mbart/prepare-mbart.sh ps example-mbart-system subsample.5000000
cd example-mbart-system 
bash train-mbart.sh
bash translate-mbart.sh

Baseline results

With the provided LASER-based scores, you should obtain the following BLEU scores on the development test set.

Language  Training from Scratch  MBART Fine-Tuning
Khmer     7.1                    10.4
Pashto    9.6                    12.2

We noticed that different GPU hardware yields somewhat different scores (±1 BLEU point) on these sets, and there is also some variance across random seeds. While you may observe different numbers, all final scoring will be done on identical hardware for all participants to ensure a fair assessment.

SUBMISSIONS

To participate in the shared task, you have two choices: submit quality scores for the sentence pairs that we provide, or re-align sentences from the document pairs yourself and submit your own set of sentence pairs together with their quality scores.

Upload the file to the Google Drive folder. Please indicate your affiliation clearly in the file name, and send an email to phi@jhu.edu to announce your submission.

TEST SETS AND RESULTS

Preliminary results will be announced on July 29, 2020. The official results will be published in an overview paper at the WMT 2020 Conference for Machine Translation.

FREQUENTLY ASKED QUESTIONS

What data resources and tools can be used?

Any standard linguistic tools (POS taggers, parsers, etc.) may be used, including tools with pre-trained models (BERT, LASER, etc.). However, no additional parallel or monolingual data is allowed: only the data referred to above may be used.

Should sentences be scored in isolation?

You are not required to score each sentence pair independently of the others. You may use scoring methods that take data redundancy into account, e.g., by giving a lower score to the second occurrence of a very similar sentence pair.
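One simple way to do this (an illustrative sketch, not a prescribed method) is to discount repeated occurrences of a normalized sentence pair:

```python
from collections import defaultdict

def redundancy_scores(pairs, base_scores, penalty=0.5):
    """Discount repeat occurrences of a (normalized) sentence pair.

    The normalization (lowercasing, whitespace collapsing) and the
    multiplicative penalty are illustrative assumptions.
    """
    seen = defaultdict(int)
    out = []
    for (src, tgt), score in zip(pairs, base_scores):
        key = (" ".join(src.lower().split()),
               " ".join(tgt.lower().split()))
        out.append(score * (penalty ** seen[key]))  # 2nd copy: half, etc.
        seen[key] += 1
    return out

print(redundancy_scores([("Hi", "x"), ("hi", "x"), ("Bye", "y")],
                        [1.0, 1.0, 1.0]))  # [1.0, 0.5, 1.0]
```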

ORGANIZERS

Philipp Koehn, Johns Hopkins University
Francisco (Paco) Guzmán, Facebook
Vishrav Chaudhary, Facebook
Ahmed Kishky, Facebook
Naman Goyal, Facebook
Peng-Jen Chen, Facebook

ACKNOWLEDGEMENTS

This shared task is partially supported by Facebook and Paracrawl.