Shared Task: Translation Suggestion



Overview

This shared task focuses on automatic methods for translation suggestion (TS), which automatically provide alternative translations for the incorrect spans of machine-translated (MT) sentences. Translation suggestion is an important tool for computer-aided translation and has proven effective in improving the efficiency of post-editing (PE). There are two main pitfalls in conventional work in this area:

This shared task provides: (i) two sub-tasks for TS, both designed based on our real-world applications; (ii) human-labeled golden corpora for four translation directions; (iii) strong baselines and our code base.

Our specific goals are:

For all tasks, the datasets and NMT models that generate the translations are publicly available.

Participants are also allowed to use publicly available pre-trained models and to explore any corpus (monolingual or bilingual) provided by the WMT22 general translation task, but these resources should be disclosed in their system descriptions.

Important dates

Release of training and dev data April 25th, 2022
Release of test data June 29th, 2022
Submission deadline July 8th, 2022
System descriptions deadline September 1st, 2022
Paper notification October 6th, 2022
Camera-ready deadline October 15th, 2022

Note: System description papers should follow the WMT paper submission policy; please see the paper submission information section on the WMT homepage for details. All deadlines are 11:59 PM UTC+8.


Task 1: Naive Translation Suggestion

This task offers human-labeled golden data for four translation directions: Chinese-English (Zh-En), English-Chinese (En-Zh), English-German (En-De), and German-English (De-En). The datasets were collected by translating sampled source sentences with a SOTA Transformer NMT model and then having professional translators annotate the outputs. A detailed description of the data collection can be found in the WeTS setup. Each sample includes the source sentence, the MT sentence, the incorrect span of the MT sentence, and the top-1 suggestion.
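To make the sample layout concrete, here is a minimal sketch of one Task 1 sample. The field names, the `(start, end)` character-offset representation of the span, and the example texts are all assumptions for illustration, not the official release format.

```python
from typing import NamedTuple, Tuple

class TSSample(NamedTuple):
    source: str             # source sentence
    mt: str                 # machine translation output
    span: Tuple[int, int]   # assumed (start, end) character offsets of the incorrect span in `mt`
    suggestion: str         # top-1 human suggestion for the span

def incorrect_span_text(sample: TSSample) -> str:
    """Return the text of the incorrect span inside the MT sentence."""
    start, end = sample.span
    return sample.mt[start:end]

sample = TSSample(
    source="Das ist ein Beispiel.",
    mt="This is a sample.",
    span=(10, 16),           # marks "sample" as the incorrect span
    suggestion="example",    # the translator's replacement
)
print(incorrect_span_text(sample))  # -> "sample"
```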

Training and dev data: Download the training and development data.

Test data: Participants are expected to submit their results on the test set. You can download the test data here. The table below summarizes the corpus statistics:

Direction   Train   Dev    Test
En-De       12000   2000   1000
De-En       10000   2000   1000
En-Zh       15000   2700   1000
Zh-En       15000   2700   1000

Baselines: The baseline system is a conventional Transformer model implemented with the fairseq toolkit. Its encoder input is the concatenation of the source and MT sentences, where the incorrect span of the MT sentence is replaced with a special placeholder token.
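The baseline input construction described above can be sketched as follows. The placeholder token `<mask>` and the `</s>` separator are assumptions; the actual tokens depend on the baseline's vocabulary.

```python
# Assumed special tokens -- check the released code base for the real ones.
MASK = "<mask>"
SEP = " </s> "

def build_baseline_input(source: str, mt: str, span: tuple) -> str:
    """Concatenate source and MT sentence, masking the incorrect span."""
    start, end = span
    masked_mt = mt[:start] + MASK + mt[end:]
    return source + SEP + masked_mt

print(build_baseline_input(
    "Das ist ein Beispiel.",
    "This is a sample.",
    (10, 16),
))  # -> "Das ist ein Beispiel. </s> This is a <mask>."
```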

Evaluation: Each submission will be evaluated in terms of the document-level BLEU score of the top-1 suggestions against the reference sentences, using the official evaluation tool sacreBLEU. For Chinese, BLEU is computed on characters with the default tokenizer for Chinese; for English and German, BLEU is computed on case-sensitive words with the default 13a tokenizer.

Task 2: Translation Suggestion with Hints

Compared to Task 1, the difference is that we also provide the model with hints, which can help it produce more accurate suggestions. For this task, each sample includes the source sentence, the MT sentence, the incorrect span of the MT sentence, the hints for the top-1 suggestion, and the top-1 suggestion itself. The hints are generated automatically following the WeTS setup. Note: the hints used here differ somewhat from those used in WeTS: we take only the first k characters of the suggestion as the hint, where k is randomly sampled.
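A hedged sketch of how such prefix hints could be derived: take the first k characters of the gold suggestion, with k sampled at random. The exact sampling scheme of the official data is not specified here; this uses a uniform draw over strict prefixes for illustration.

```python
import random

def make_hint(suggestion: str, rng: random.Random) -> str:
    """Return a random strict prefix of the suggestion as the hint (assumed scheme)."""
    k = rng.randint(1, max(1, len(suggestion) - 1))
    return suggestion[:k]

rng = random.Random(0)
hint = make_hint("example", rng)
print(hint)  # some prefix such as "e", "ex", "exa", ...
```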

Training and dev data: Download the training and development data.

Test data: Participants are expected to submit their results on the test set. You can download the test data here.

Baselines: The baseline system is a conventional Transformer model implemented with the fairseq toolkit. Its encoder input is the concatenation of the source sentence, the MT sentence, and the hint, where the incorrect span of the MT sentence is replaced with a special placeholder token.
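The Task 2 input extends the Task 1 concatenation with the hint. As before, the `<mask>` placeholder and `</s>` separator are assumptions for illustration:

```python
MASK = "<mask>"   # assumed placeholder token
SEP = " </s> "    # assumed segment separator

def build_hinted_input(source: str, mt: str, span: tuple, hint: str) -> str:
    """Concatenate source, masked MT sentence, and hint into one encoder input."""
    start, end = span
    masked_mt = mt[:start] + MASK + mt[end:]
    return SEP.join([source, masked_mt, hint])

print(build_hinted_input("Das ist ein Beispiel.", "This is a sample.", (10, 16), "ex"))
# -> "Das ist ein Beispiel. </s> This is a <mask>. </s> ex"
```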

Evaluation: Each submission will be evaluated in terms of the document-level BLEU score of the top-1 suggestions against the reference sentences, using the official evaluation tool sacreBLEU. For this task we only provide corpora for the English-Chinese and Chinese-English directions. For Chinese, BLEU is computed on characters with the default tokenizer for Chinese; for English, BLEU is computed on case-sensitive words with the default 13a tokenizer.

Attention: All training, dev and test sets are restricted to the corpora provided on this website. If helpful, you can download the NMT models that were used to generate the MT sentences of our corpora.

Additional Resources

The following parallel data may be useful to participants.

Submission Requirements

Each participating team can submit at most 15 systems for each translation direction of each subtask. Participants can submit their results and view their scores on the website. Before submitting, participants are required to sign up by dropping an email to the organizers that includes the following information: user name, password, team name, organization, and email address.


Organizers

Contact

Feel free to contact us with any questions by dropping an email to Zhen Yang.