June 7, 2021 - French - Manding languages data available.
May 17, 2021 - Dravidian and Romance languages data available.
May 6, 2021 - Website released!
Within the MT and NLP communities, English is by far the most resource-rich language. MT systems are most often trained to translate texts to and from English, or they use English as a pivot language to translate between lower-resource languages. The interest in English is reflected, for example, in the WMT translation tasks (e.g. News, Biomedical), which have always included language pairs in which texts are translated to and/or from English. With the widespread use of MT technology, there is growing interest in training systems to translate between languages other than English. One sign of this is the need to translate directly between pairs of similar languages. The main challenge is to exploit the similarity between the languages to produce accurate output despite the small amount of available parallel data.
Given the community's interest in this topic, we are organizing, for the third time at WMT, the shared task on "Similar Language Translation" to evaluate the performance of state-of-the-art translation systems on translating between pairs of languages from the same language family. This year we provide participants with training and testing data in five language pairs from three language families listed below. Evaluation will be carried out using automatic evaluation metrics and human evaluation.
In the previous two editions of this task, at WMT 2019 and WMT 2020, we included language pairs such as Spanish - Portuguese, Spanish - Catalan, Czech - Polish, Hindi - Nepali, and Hindi - Marathi. For details, see the 2019 website, the 2020 website, the WMT 2019 report, and the WMT 2020 report.
The training data for Dravidian and Romance languages is available here. This link is password protected.
To participate please register using this form. The dataset password is displayed when you complete the form.
The test data is available at the same repository as the training data here and it can be accessed using the same password sent via e-mail. You are allowed to submit 1 PRIMARY and up to 2 CONTRASTIVE systems for each language pair/translation direction.
You should submit your results by July 19, 2021 (anywhere in the world) in a zip file to wmt.similarlanguagetranslation(at)gmail.com. Your zip file should contain your submission files and a brief description of your approach(es) as follows:
1) A txt file for each of your submissions with one instance per line IN THE SAME ORDER as the test set. You should name your file(s) as follows:
If, for example, Team X participated in Spanish - Catalan and submits a primary AND a contrastive submission, Team X will be sending us the following files:
2) A single txt file containing one or two paragraph(s) describing your system(s). Please make this as complete as possible, as we will be using this information in the shared task report, but also concise so that it focuses on the most important information about your approach (max. 250 words).
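Before sending the zip file, it may help to check the two requirements above locally. The following is a minimal sketch, not an official validation tool: it assumes UTF-8 files, and the function names (`check_submission`, `build_zip`) and the file names in the usage note are hypothetical placeholders, not the naming scheme required by the task. It checks that each output file has as many lines as the test source and that the description stays within 250 words, then packs the files into a zip archive.

```python
import zipfile
from pathlib import Path


def count_lines(path):
    """Return the number of lines in a UTF-8 text file."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def check_submission(test_source, outputs, description, max_words=250):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    expected = count_lines(test_source)
    for output in outputs:
        # One instance per line: line counts must match the test set.
        found = count_lines(output)
        if found != expected:
            problems.append(f"{Path(output).name}: {found} lines, expected {expected}")
    # The system description is limited to 250 words.
    words = len(Path(description).read_text(encoding="utf-8").split())
    if words > max_words:
        problems.append(f"{Path(description).name}: {words} words, limit is {max_words}")
    return problems


def build_zip(zip_path, files):
    """Write the given files into a flat zip archive (no directories)."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for file in files:
            archive.write(file, arcname=Path(file).name)
```

For example, `check_submission("test.src.txt", ["primary.txt"], "description.txt")` returns an empty list when both checks pass, after which `build_zip("submission.zip", ["primary.txt", "description.txt"])` creates the archive. Matching line counts do not guarantee that the order is preserved, so the order of the output should still be verified in your decoding pipeline.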
The evaluation will be carried out automatically using BLEU (Papineni et al., 2002), TER (Snover et al., 2006), and RIBES (Isozaki et al., 2010).
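To get a rough idea of how BLEU behaves on your development output, the metric can be sketched in a few lines. This is a simplified illustration and not the scorer used for the official results: it assumes a single reference per line, whitespace tokenization, case-sensitive matching, and no smoothing, none of which is specified on this page. The function name `corpus_bleu` is our own. Scores from an established implementation will differ depending on its tokenization and settings.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale (single reference, no smoothing)."""
    if len(hypotheses) != len(references):
        raise ValueError("hypotheses and references must have the same number of lines")
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp_line, ref_line in zip(hypotheses, references):
        hyp, ref = hyp_line.split(), ref_line.split()
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
            # Clipped counts: a hypothesis n-gram is credited at most as
            # often as it occurs in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    # Without smoothing, any n-gram order with no match gives a score of 0.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    # Geometric mean of the n-gram precisions, computed in log space.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # The brevity penalty applies when the output is not longer than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
    return 100.0 * brevity_penalty * math.exp(log_precision)
```

Both arguments are lists of sentences in the same order, for example the lines of your output file and of the reference file. Because the counts are pooled over the whole corpus before the precisions are computed, the result is not the average of sentence-level scores.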
Your system paper submission should be prepared according to the WMT instructions and uploaded to START before August 5, 2021.
|Release of training/dev data|May 17, 2021|
|Test data release|July 12, 2021|
|Submission deadline|July 19, 2021|
|System description paper deadline|August 5, 2021|
|Camera-ready|September 15, 2021|
|Conference|November 10-11, 2021|
We further thank the AI Journal - Funding Opportunities for Promoting AI Research for supporting the French - Maninka data collection.