Similar Language Translation Task - EMNLP 2020 Fifth Conference on Machine Translation

Shared Task: Similar Language Translation

UPDATES

August 8, 2020 - All results available.

July 24, 2020 - Hindi - Marathi, Marathi - Hindi results available.

July 9, 2020 - Submission instructions available. Please see below.

June 3, 2020 - The task dates have been updated. Please check the new dates below.

TASK DESCRIPTION

Within the MT and NLP communities, English is by far the most resource-rich language. MT systems are most often trained to translate texts from and to English, or they use English as a pivot language to translate between resource-poorer languages. The interest in English is reflected, for example, in the WMT translation tasks (e.g. News, Biomedical), which have always included language pairs in which texts are translated to and/or from English. With the widespread use of MT technology, there is more and more interest in training systems to translate between languages other than English. One sign of this is the need to translate directly between pairs of similar languages. The main challenge here is how to exploit the similarity between the languages to compensate for the small amount of available parallel data and still produce accurate output.

Given the interest of the community in this topic we organize, for the second time at WMT, the shared task on "Similar Language Translation" to evaluate the performance of state-of-the-art translation systems on translating between pairs of languages from the same language family. This year we provide participants with training and testing data in five language pairs from three language families listed below. Evaluation will be carried out using automatic evaluation metrics and human evaluation.

In the previous edition of this task in WMT 2019, we included three language pairs: Spanish - Portuguese, Czech - Polish, and Hindi - Nepali. Check the 2019 task website and the WMT 2019 report for more information.

Language Pairs

This year we have five pairs of similar languages from three different language families: Indo-Aryan, Romance, and South-Slavic. Translations will be evaluated in both directions (e.g. from Spanish to Catalan and from Catalan to Spanish).

Indo-Aryan Languages
- Hindi - Marathi
Romance Languages
- Spanish - Catalan
- Spanish - Portuguese
South-Slavic Languages
- Slovene - Croatian
- Slovene - Serbian

Utilizing parallel data

No additional parallel data is allowed for training. Constrained submissions only.

Utilizing monolingual data

You are encouraged to develop novel solutions to utilize monolingual corpora to improve translation quality.

DATA

The training data is available here. Last update July 7 2020.

To participate and receive the password, please fill out the registration form.

SUBMISSIONS

The test data is available at the same repository as the training data (here) and it can be accessed using the same password sent via e-mail. You are allowed to submit 1 PRIMARY and up to 2 CONTRASTIVE systems for each language pair/translation direction.

You should submit your results by July 15 2020 (anywhere in the world) in a zip file to wmt.similarlanguagetranslation(at)gmail.com. Your zip file should contain your submission files and a brief description of your approach(es) as follows:

1) A txt file for each of your submissions with one instance per line IN THE SAME ORDER as the test set. You should name your file(s) as follows:
TEAMNAME_SOURCELANGUAGECODE_TARGETLANGUAGECODE_PRIMARYORCONTRASTIVE.txt

If, for example, Team X participated in Spanish - Catalan and submits a primary AND a contrastive submission, Team X will be sending us the following files:
TEAMX_ES_CA_PRIMARY.txt
TEAMX_ES_CA_CONTRASTIVE.txt

2) A single txt file containing one or two paragraph(s) describing your system(s). Please make this as complete as possible, as we will be using this information in the shared task report, but also concise so that it focuses on the most important information about your approach (max. 250 words).
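
The packaging steps above can be sketched in Python. This is only an illustrative sketch, not an official tool: the helper names (submission_filename, package_submission), the description file name description.txt, and the stripping of spaces from the team name are assumptions that are not part of the task instructions. It builds file names following the pattern in 1), checks that each output file has as many lines as the test set, checks the 250-word limit from 2), and writes the zip file.

```python
import re
import zipfile

LABELS = {"PRIMARY", "CONTRASTIVE"}
MAX_DESCRIPTION_WORDS = 250  # limit stated in item 2) above


def submission_filename(team, src, tgt, label):
    """Build TEAMNAME_SOURCELANGUAGECODE_TARGETLANGUAGECODE_PRIMARYORCONTRASTIVE.txt."""
    label = label.upper()
    if label not in LABELS:
        raise ValueError(f"label must be one of {sorted(LABELS)}, got {label!r}")
    # Assumption: non-alphanumeric characters are dropped, so "Team X" -> "TEAMX".
    team = re.sub(r"[^A-Za-z0-9]", "", team).upper()
    return f"{team}_{src.upper()}_{tgt.upper()}_{label}.txt"


def count_lines(path):
    """Count lines in a UTF-8 text file (one translated instance per line)."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def word_count(text):
    """Count whitespace-separated words."""
    return len(text.split())


def package_submission(zip_path, outputs, description, n_test_lines):
    """Validate the files, then write them and the description into one zip.

    outputs maps archive file names to local paths of system output files.
    n_test_lines is the number of lines in the corresponding test set.
    """
    if word_count(description) > MAX_DESCRIPTION_WORDS:
        raise ValueError("description exceeds 250 words")
    # Validate everything first so that no partial archive is written.
    for name, path in outputs.items():
        n_lines = count_lines(path)
        if n_lines != n_test_lines:
            raise ValueError(f"{name}: {n_lines} lines, expected {n_test_lines}")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, path in outputs.items():
            archive.write(path, arcname=name)
        # "description.txt" is a hypothetical name; the task does not prescribe one.
        archive.writestr("description.txt", description)
```

For the Team X example above, submission_filename("Team X", "es", "ca", "primary") gives TEAMX_ES_CA_PRIMARY.txt. The naming pattern does not say how to tell two contrastive runs for the same direction apart, so the sketch leaves that open.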

EVALUATION

The evaluation will be carried out automatically using BLEU (Papineni et al., 2002), TER (Snover et al., 2006), and RIBES (Isozaki et al., 2010).

Here you can find the results of the Hindi - Marathi and Marathi - Hindi tracks ranked by BLEU score.

Here you can find the results of the Catalan - Spanish, Spanish - Catalan, Spanish - Portuguese, and Portuguese - Spanish tracks ranked by BLEU score.

Here you can find the results of the Croatian - Slovene, Slovene - Croatian, Slovene - Serbian, and Serbian - Slovene tracks ranked by BLEU score.

Paper Submission

Your system paper submission should be prepared according to the WMT instructions and uploaded to START before August 24, 2020.

IMPORTANT DATES

Release of training/dev data	April 15, 2020
Test data released	~~June 8, 2020~~ July 8, 2020
Submission deadline	~~June 15, 2020~~ July 15, 2020
System description paper deadline	~~July 15, 2020~~ August 15, 2020 is the deadline for metadata on START (title, authors, abstract, etc.). Your PDF can be sent by August 18 at 5pm EST to wmt.similarlanguagetranslation@gmail.com.
Notifications	~~August 17, 2020~~ September 29, 2020
Camera-ready	~~August 31, 2020~~ October 10, 2020
Conference	~~November 11-12, 2020~~ November 19-20, 2020

ORGANIZERS

Marta Costa-jussà, Universitat Politècnica de Catalunya
Magdalena Biesialska, Universitat Politècnica de Catalunya
Santanu Pal, Wipro AI Lab
Nikola Ljubešić, Jožef Stefan Institute and University of Zagreb
Marcos Zampieri, Rochester Institute of Technology

CONTACT

martaruizcostajussa(at)gmail.com

ACKNOWLEDGEMENT

The organizers would like to thank Ciklopea and Bisnode for the Croatian, Serbian, and Slovene data. We further thank Pangeanic for the Catalan, Portuguese, and Spanish data.