Shared Task: Multilingual Low-Resource Translation for Indo-European Languages



HUMAN EVALUATION

Metric. We perform sentence-level evaluation with document context. Each sentence is rated on a 1–5 Likert-like scale following the direct assessment (DA) protocol; we report both raw scores and per-annotator standardized z-scores. For the Romance family, source-based DA also allows the evaluation of selected terms. We select 60 terms (mostly named entities, dates, and locations) and annotate each as well translated, mistranslated, or not translated, with the final label decided by majority voting among the annotators.
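As a minimal sketch of how such scores might be aggregated, the snippet below standardizes each annotator's raw DA scores by that annotator's own mean and standard deviation before averaging, and resolves term labels by majority vote. The exact normalization scheme used in the task is an assumption here (per-annotator standardization is the common DA convention); the function and variable names are illustrative, not from the task itself.

```python
from collections import Counter
from statistics import mean, stdev

def z_normalize(scores_by_annotator):
    """Convert raw 1-5 DA scores to z-scores per annotator
    (subtract the annotator's mean, divide by their std. dev.),
    then average the z-scores each sentence received.

    scores_by_annotator: {annotator_id: {sentence_id: raw_score}}
    returns: {sentence_id: mean z-score}
    """
    per_sentence = {}
    for annotator, scores in scores_by_annotator.items():
        mu = mean(scores.values())
        sigma = stdev(scores.values())  # needs >= 2 scores per annotator
        for sent_id, raw in scores.items():
            per_sentence.setdefault(sent_id, []).append((raw - mu) / sigma)
    return {sent_id: mean(zs) for sent_id, zs in per_sentence.items()}

def majority_label(labels):
    """Majority vote over term annotations,
    e.g. 'well', 'mis', 'no' (well translated / mistranslated / not translated)."""
    return Counter(labels).most_common(1)[0][0]
```

Standardizing per annotator before averaging compensates for annotators who are systematically harsher or more lenient, which is why the z-score and raw-score rankings in the tables below can differ.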

ROMANCE FAMILY (Wikipedia)

                              ca2it               ca2oc
System                        z-score   raw       z-score   raw
HUMAN                          0.8±0.4  4.8±0.6    0.8±0.7  4.0±1.0
CUNI-Primary                   0.5±0.7  4.4±0.9    0.5±0.8  3.6±1.1
M2M-100 (baseline)             0.4±0.7  4.2±1.0   -0.7±0.8  2.0±1.0
TenTrans-Primary               0.0±0.8  3.8±1.1    0.3±0.8  3.4±1.2
BSC-Primary                   -0.1±0.8  3.7±1.1    0.3±0.9  3.4±1.2
UBCNLP-Primary                -0.5±1.0  3.1±1.3    0.0±0.9  3.0±1.2
mT5-devFinetuned (baseline)   -1.2±0.9  2.3±1.2   -1.0±0.7  1.7±0.9


Term translation:
                              ca2it             ca2oc
System                        well mis  no   Σ  well mis  no   Σ
HUMAN (reference)              53   0    3  56   40   0    2  42
CUNI-Primary                   39   3    5  47   30   7    1  38
M2M-100 (baseline)             33   2    6  41   26   9    0  35
TenTrans-Primary               37   0    9  46   32   4    1  37
BSC-Primary                    27   7    5  39   33   4    0  37
UBCNLP-Primary                 29  16    1  46   19   1    0  20
mT5-devFinetuned (baseline)    20  17   10  47   25  11    4  40

NORTH-GERMANIC FAMILY (Europeana)

                              nb2sv               is2sv
System                        z-score   raw       z-score   raw
M2M-100 (baseline)             0.7±0.6  4.2±0.8    0.1±1.0  2.0±1.1
EdinSaar-Primary               0.2±0.7  3.6±1.1   -0.1±0.8  1.9±1.0
UBCNLP-Primary                 0.2±0.8  3.5±1.2   -0.4±1.0  1.6±1.1
mT5-devFinetuned (baseline)   -1.2±0.7  1.5±1.1    0.4±1.1  2.4±1.2