Shared Task: Multilingual Low-Resource Translation for Indo-European Languages



HUMAN EVALUATION

Metric. We perform sentence-level evaluation with document context. Each sentence is rated on a 1–5 Likert-like scale following the direct assessment (DA) protocol; we report both raw scores and per-annotator standardized z-scores. For the Romance family, source-based DA also allows the evaluation of selected terms. We select 60 terms (mostly named entities, dates, and locations) and annotate each as well translated, mistranslated, or not translated, with the final label decided by majority voting among the annotators.
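As a minimal sketch of how such scores might be aggregated, the snippet below standardizes each annotator's raw DA scores by that annotator's own mean and standard deviation before averaging, and resolves term labels by majority vote. The exact normalization scheme used in the task is an assumption here (per-annotator standardization is the common DA convention); the function and variable names are illustrative, not from the task itself.

```python
from collections import Counter
from statistics import mean, stdev

def z_normalize(scores_by_annotator):
    """Convert raw 1-5 DA scores to z-scores per annotator
    (subtract the annotator's mean, divide by their std. dev.),
    then average the z-scores each sentence received.

    scores_by_annotator: {annotator_id: {sentence_id: raw_score}}
    returns: {sentence_id: mean z-score}
    """
    per_sentence = {}
    for annotator, scores in scores_by_annotator.items():
        mu = mean(scores.values())
        sigma = stdev(scores.values())  # needs >= 2 scores per annotator
        for sent_id, raw in scores.items():
            per_sentence.setdefault(sent_id, []).append((raw - mu) / sigma)
    return {sent_id: mean(zs) for sent_id, zs in per_sentence.items()}

def majority_label(labels):
    """Majority vote over term annotations,
    e.g. 'well', 'mis', 'no' (well translated / mistranslated / not translated)."""
    return Counter(labels).most_common(1)[0][0]
```

Standardizing per annotator before averaging compensates for annotators who are systematically harsher or more lenient, which is why the z-score and raw-score rankings in the tables below can differ.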

ROMANCE FAMILY (Wikipedia)

                              ca2it               ca2oc
System                        z-score   raw       z-score   raw
HUMAN                          0.8±0.4  4.8±0.6    0.8±0.7  4.0±1.0
CUNI-Primary                   0.5±0.7  4.4±0.9    0.5±0.8  3.6±1.1
M2M-100 (baseline)             0.4±0.7  4.2±1.0   -0.7±0.8  2.0±1.0
TenTrans-Primary               0.0±0.8  3.8±1.1    0.3±0.8  3.4±1.2
BSC-Primary                   -0.1±0.8  3.7±1.1    0.3±0.9  3.4±1.2
UBCNLP-Primary                -0.5±1.0  3.1±1.3    0.0±0.9  3.0±1.2
mT5-devFinetuned (baseline)   -1.2±0.9  2.3±1.2   -1.0±0.7  1.7±0.9


Term translation:
                              ca2it             ca2oc
System                        well mis  no   Σ  well mis  no   Σ
HUMAN (reference)              53   0    3  56   40   0    2  42
CUNI-Primary                   39   3    5  47   30   7    1  38
M2M-100 (baseline)             33   2    6  41   26   9    0  35
TenTrans-Primary               37   0    9  46   32   4    1  37
BSC-Primary                    27   7    5  39   33   4    0  37
UBCNLP-Primary                 29  16    1  46   19   1    0  20
mT5-devFinetuned (baseline)    20  17   10  47   25  11    4  40

NORTH-GERMANIC FAMILY (Europeana)

                              nb2sv               is2sv
System                        z-score   raw       z-score   raw
M2M-100 (baseline)             0.7±0.6  4.2±0.8    0.1±1.0  2.0±1.1
EdinSaar-Primary               0.2±0.7  3.6±1.1   -0.1±0.8  1.9±1.0
UBCNLP-Primary                 0.2±0.8  3.5±1.2   -0.4±1.0  1.6±1.1
mT5-devFinetuned (baseline)   -1.2±0.7  1.5±1.1    0.4±1.1  2.4±1.2