Similar Language Translation Task - EMNLP 2021 Sixth Conference on Machine Translation

Shared Task: Similar Language Translation

UPDATES

July 28, 2021 - Results released.

June 7, 2021 - French - Manding languages data available.

May 17, 2021 - Dravidian and Romance languages data available.

May 6, 2021 - Website released!

TASK DESCRIPTION

Within the MT and NLP communities, English is by far the most resource-rich language. MT systems are most often trained to translate texts to and from English, or they use English as a pivot language to translate between lower-resource languages. This focus on English is reflected, for example, in the WMT translation tasks (e.g. News, Biomedical), which have always included language pairs in which texts are translated to and/or from English. With the widespread use of MT technology, there is growing interest in training systems to translate between languages other than English. One sign of this is the need to translate directly between pairs of similar languages. The main challenge is how to exploit the similarity between the languages to compensate for the small amount of available parallel data and still produce accurate output.

Given the community's interest in this topic, we are organizing, for the third time at WMT, the shared task on "Similar Language Translation" to evaluate the performance of state-of-the-art translation systems on translating between pairs of languages from the same language family. This year we provide participants with training and testing data in five language pairs from the three language families listed below. Evaluation will be carried out using automatic evaluation metrics and human evaluation.

In the previous two editions of this task, at WMT 2019 and WMT 2020, we included language pairs such as Spanish - Portuguese, Spanish - Catalan, Czech - Polish, Hindi - Nepali, and Hindi - Marathi. For details, see the 2019 website, the 2020 website, the WMT 2019 report, or the WMT 2020 report.

Language Pairs

This year we have multiple pairs of similar languages from three language families.

Dravidian languages: Tamil - Telugu
Romance languages: Catalan, Spanish, Portuguese, and Romanian.
French to two similar low-resource Manding languages: Bambara and Maninka.

Utilizing parallel data

No additional parallel data is allowed for training. Constrained submissions only.

Utilizing monolingual data

You are encouraged to develop novel solutions to utilize monolingual corpora to improve translation quality.

DATA

The training data for Dravidian and Romance languages is available here. This link is password protected.

To participate please register using this form. The dataset password is displayed when you complete the form.

SUBMISSIONS

The test data is available at the same repository as the training data here and it can be accessed using the same password sent via e-mail. You are allowed to submit 1 PRIMARY and up to 2 CONTRASTIVE systems for each language pair/translation direction.

You should submit your results by July 19, 2021 (anywhere in the world) in a zip file to wmt.similarlanguagetranslation(at)gmail.com. Your zip file should contain your submission files and a brief description of your approach(es) as follows:

1) A txt file for each of your submissions with one instance per line IN THE SAME ORDER as the test set. You should name your file(s) as follows:
TEAMNAME_SOURCELANGUAGECODE_TARGETLANGUAGECODE_PRIMARYORCONTRASTIVE.txt

If, for example, Team X participated in Spanish - Catalan and submits a primary AND a contrastive submission, Team X will be sending us the following files:
TEAMX_ES_CA_PRIMARY.txt
TEAMX_ES_CA_CONTRASTIVE.txt

2) A single txt file containing one or two paragraph(s) describing your system(s). Please make this as complete as possible, as we will be using this information in the shared task report, but also concise so that it focuses on the most important information about your approach (max. 250 words).
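The file requirements above can be checked before sending a submission. The following is a minimal sketch, not an official validation tool: the function names are hypothetical, and the pattern assumes alphanumeric team names and two- or three-letter language codes, which the task description does not state explicitly.

```python
import re

# Pattern for TEAMNAME_SOURCELANGUAGECODE_TARGETLANGUAGECODE_PRIMARYORCONTRASTIVE.txt
# Assumption: alphanumeric team name, two- or three-letter upper-case language codes.
NAME_PATTERN = re.compile(
    r"^[A-Za-z0-9]+_[A-Z]{2,3}_[A-Z]{2,3}_(PRIMARY|CONTRASTIVE)\.txt$"
)


def submission_filename(team, src, tgt, kind):
    """Build a submission file name following the naming scheme above."""
    name = f"{team}_{src.upper()}_{tgt.upper()}_{kind.upper()}.txt"
    if not NAME_PATTERN.match(name):
        raise ValueError(f"unexpected submission file name: {name}")
    return name


def same_line_count(output_lines, test_lines):
    """Check that there is exactly one output line per test-set line."""
    return len(output_lines) == len(test_lines)


def within_word_limit(description, limit=250):
    """Check the system description against the 250-word limit."""
    return len(description.split()) <= limit


if __name__ == "__main__":
    print(submission_filename("TEAMX", "es", "ca", "primary"))
    print(submission_filename("TEAMX", "es", "ca", "contrastive"))
```

For the Team X example, this produces the two file names listed above. The line-count check only confirms that the output has as many lines as the test set; keeping the lines in the same order remains the participant's responsibility.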

EVALUATION

The evaluation was carried out automatically using BLEU (Papineni et al., 2002), TER (Snover et al., 2006), and RIBES (Isozaki et al., 2010).
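To illustrate what BLEU measures, here is a minimal sketch of corpus-level BLEU with a single reference per segment. It is not the scoring script used for this task: whitespace tokenization, the absence of smoothing, and the function names are all assumptions made for illustration, so its scores will not necessarily match the official results.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0-100), one reference per segment, no smoothing."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        # Assumption: plain whitespace tokenization.
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # Clipped matches: each hypothesis n-gram counts at most as
            # often as it occurs in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    # Geometric mean of the modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty for output shorter than the reference.
    if hyp_len > ref_len:
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


if __name__ == "__main__":
    refs = ["the cat sat on the mat"]
    print(corpus_bleu(["the cat sat on the mat"], refs))
    print(corpus_bleu(["the cat sat on the mat today"], refs))
```

An output identical to the reference receives the maximum score, while any extra, missing, or reordered words lower the n-gram precisions and therefore the score.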

Here you can find the results of the Bambara - French and French - Bambara tracks ranked by BLEU score.

Here you can find the results of the Tamil - Telugu and Telugu - Tamil tracks ranked by BLEU score.

Here you can find the results of the Catalan - Spanish, Spanish - Catalan, Spanish - Portuguese, and Portuguese - Spanish tracks ranked by BLEU score.

Paper Submission

Your system paper submission should be prepared according to the WMT instructions and uploaded to START before August 5, 2021.

IMPORTANT DATES

Release of training/dev data	May 17, 2021
Test data release	July 12, 2021
Submission deadline	July 19, 2021
System description paper deadline	August 5, 2021
Camera-ready	September 15, 2021
Conference	November 10-11, 2021

ORGANIZERS

Farhad Akhbardeh, Rochester Institute of Technology
Marta Costa-jussà, Universitat Politècnica de Catalunya
Magdalena Biesialska, Universitat Politècnica de Catalunya
Christopher Homan, Rochester Institute of Technology
Santanu Pal, Wipro AI Lab
Allahsera Tapo, Rochester Institute of Technology
Valentin Vydrin, Institut National des Langues et Civilisations Orientales (INALCO)
Marcos Zampieri, Rochester Institute of Technology

CONTACT

martaruizcostajussa(at)gmail.com

ACKNOWLEDGEMENT

We would like to thank Pangeanic for the Spanish, Catalan, Portuguese, and Romanian data and the Directorate-General for Language Policy at the Ministry of Culture, Government of Catalonia for the Catalan translations.

We further thank the AI Journal - Funding Opportunities for Promoting AI Research for supporting the French - Maninka data collection.