Moses: statistical machine translation system

Tuning

Overview

During decoding, Moses scores translation hypotheses using a linear model. In the traditional approach, the features of the model are the probabilities from the language models, phrase/rule tables, and reordering models, plus word, phrase and rule counts. Recent versions of Moses support the augmentation of these core features with sparse features, which may be much more numerous.
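
Concretely, the model score of a translation hypothesis is a weighted sum of feature values. As a sketch in generic notation (not Moses' internal variable names):

 score(e, f) = \sum_i \lambda_i \, h_i(e, f)

where the h_i(e, f) are the feature functions (language model log-probability, phrase/rule table log-probabilities, reordering scores, word/phrase/rule counts, and any sparse features) and the \lambda_i are the weights that tuning estimates.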

Tuning refers to the process of finding the optimal weights for this linear model, where optimal weights are those which maximise translation performance on a small set of parallel sentences (the tuning set). Translation performance is usually measured with Bleu, but the tuning algorithms all support (at least in principle) the use of other performance measures. Currently (July 2013) only the MERT implementation supports metrics other than Bleu - it has support for TER, PER, CDER and others, as well as for interpolations of metrics. The interest in sparse features has led to the development of new tuning algorithms, and Moses contains implementations of some of these.

For an extensive survey of tuning methods in MT, see Neubig and Watanabe (2016).

There are essentially two classes of tuning algorithms used in statistical MT: batch and online. Examples of each of these classes of algorithms are listed in the following sections.

Batch tuning algorithms

Here the whole tuning set is decoded, usually generating an n-best list or a lattice, then the model weights are updated based on this decoder output. The tuning set is then re-decoded with the new weights, the optimisation repeated, and this iterative process continues until some convergence criterion is satisfied. All the batch algorithms in Moses are controlled by the inaccurately named mert-moses.pl, which runs the 'outer loop' (i.e. the repeated decodes). Running this script with no arguments displays usage information.

MERT

Minimum error rate training (MERT) was introduced by Och (2003). For details on the Moses implementation, see Bertoldi et al, (2009). This line-search based method is probably still the most widely used tuning algorithm, and the default option in Moses. It does not support the use of more than about 20-30 features, so for sparse features you should use one of the other algorithms.
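
A minimal MERT run (a sketch; the file paths and environment variables are placeholders for your own tuning set, model and installation, following the fuller batch-MIRA example later on this page) looks like:

 $MOSES_SCRIPTS/training/mert-moses.pl work/dev.fr work/dev.en \
    $MOSES_BIN/moses work/model/moses.ini --mertdir $MOSES_BIN \
    --rootdir $MOSES_SCRIPTS --decoder-flags '-threads 8 -v 0'

MERT is the default optimiser, so no extra algorithm flag is needed.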

Lattice MERT

A variant of MERT which uses lattices instead of n-best lists. This was implemented by Kārlis Goba and Christian Buck at the Fourth Machine Translation Marathon in January 2010. It is based on the work of Macherey et al. (2008) and is available here.

PRO

Pairwise ranked optimization (Hopkins and May, 2011) works by learning a weight set that ranks translation hypotheses in the same order as the metric (e.g. Bleu). Passing the argument --pairwise-ranked to mert-moses.pl enables PRO.
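
For example (a sketch; the rest of the command line is the same as for a standard MERT run):

 mert-moses.pl [...] --pairwise-ranked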

Batch MIRA

Also known as k-best MIRA (Cherry and Foster, 2012), this is a version of MIRA (a margin-based classification algorithm) which works within a batch tuning framework. To use batch MIRA, you need to pass the --batch-mira argument to mert-moses.pl. See below for more detail.

Online tuning algorithms

These methods require much tighter integration with the decoder. Each sentence in the tuning set is decoded in turn, and the weights are updated based on the result before the next sentence is decoded. The algorithm may iterate through the tuning set multiple times.

MIRA

The MIRA tuning algorithm (Chiang, 2012; Hasler et al., 2011) was inspired by the passive-aggressive algorithms of Koby Crammer and their application to structured prediction by Ryan McDonald. After decoding each sentence, MIRA updates the weights only if the model's ranking of the output against a (pseudo-)reference translation disagrees with the metric's ranking.
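
For intuition, the update has the flavour of Crammer's passive-aggressive (PA-I) rule; this is a generic sketch, not the exact update used in the Moses implementation, which differs in details such as how the competing hypotheses are selected:

 \Delta h = h(e^{+}) - h(e^{-}), \qquad
 \tau = \min\Big( C,\ \frac{\max\big(0,\ \ell - w \cdot \Delta h\big)}{\lVert \Delta h \rVert^{2}} \Big), \qquad
 w \leftarrow w + \tau \, \Delta h

Here e^{+} is the (pseudo-)reference translation, e^{-} the model's current best output, h(\cdot) the feature vector, \ell the metric loss between the two, and C a regularization constant. If the model already separates e^{+} from e^{-} by a margin of at least \ell, then \tau = 0 and no update is made, which matches the behaviour described above.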

Metrics

By default, tuning optimizes the BLEU score of the translations of the specified tuning set. You can also use other metrics, and even combinations of metrics.

For instance,

 mert-moses.pl [...] --mertargs="--sctype TER,BLEU --scconfig weights:0.6+0.4"

optimizes based on both the TER score and the BLEU score with a balance of 60% TER and 40% BLEU.

The following metrics are supported:

  • BLEU - the popular bilingual evaluation understudy (Papineni et al., 2001)
  • BLEUDOC
  • TER - edit distance with moves (Snover et al., 2006)
  • PER - position-independent word error rate (number of matching words)
  • WER - word error rate (cannot deal with moves)
  • CDER - word error rate with block movement (Leusch et al., 2006)
  • METEOR - recall oriented metric with stem / synonym matching (Lavie et al., 2007)

Tuning in Practice

Multiple references

To specify multiple references to mert-moses.pl, name each reference file with a common prefix followed by a number (e.g. dev.en0, dev.en1, ...). Pass the prefix as the reference argument, and make sure that no file named exactly the prefix exists.
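
For example (a sketch with hypothetical file names): with references work/dev.en0, work/dev.en1 and work/dev.en2, and no file named work/dev.en, the call is the usual one:

 $MOSES_SCRIPTS/training/mert-moses.pl work/dev.fr work/dev.en \
    $MOSES_BIN/moses work/model/moses.ini --mertdir $MOSES_BIN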

ZMERT Tuning

Kamil Kos created contrib/zmert-moses.pl, a replacement for mert-moses.pl for those who wish to use the Java-based ZMERT. The zmert-moses.pl script supports most of the mert-moses.pl parameters, so the transition to ZMERT should be relatively easy. For more details on supported parameters, run zmert-moses.pl --help.

ZMERT supports multiple metrics (see the ZMERT homepage), for instance SemPOS, which is based on the tectogrammatical layer (see TectoMT).

The ZMERT JAR, version 1.41, needs to be downloaded from Omar Zaidan's website. If you would like to add a new metric, modify the zmert/zmert.jar file in the following way (the steps are also sketched as a shell session after the list):

  1. extract zmert.jar content by typing jar xf zmert.jar
  2. modify the files (probably a copy of NewMetric.java.template)
  3. recompile java files by javac *.java
  4. create a new version of zmert.jar by typing jar cvfM zmert.jar *.java* *.class
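
The same steps as a shell session (a sketch; MyMetric.java stands for whatever metric source file you add):

 jar xf zmert.jar
 cp NewMetric.java.template MyMetric.java    # edit this copy to implement the new metric
 javac *.java
 jar cvfM zmert.jar *.java* *.class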

k-best batch MIRA Tuning

This is hope-fear MIRA built as a drop-in replacement for MERT; it conducts online training using aggregated k-best lists as an approximation to the decoder's true search space. This allows it to handle large feature sets, and it often outperforms MERT once feature counts get above 10.

You can tune using this system by adding --batch-mira to your mert-moses.pl command. This replaces the normal call to the mert executable with a call to kbmira.

I recommend also adding the flag --return-best-dev to mert-moses.pl. This will copy the moses.ini file corresponding to the highest-scoring development run (as determined by the evaluator executable using BLEU on run*.out) into the final moses.ini. This can make a fairly big difference for MIRA's test-time accuracy.

You can also pass through options to kbmira by adding --batch-mira-args 'whatever' to mert-moses.pl. Useful kbmira options include:

  • -J n : changes the number of inner MIRA loops to n passes over the data. Increasing this value to 100 or 300 can be good for working with small development sets. The default, 60, is ideal for development sets with more than 1000 sentences.
  • -C n : changes MIRA's C-value to n. This controls regularization. The default, 0.01, works well for most situations, but if it looks like MIRA is over-fitting or not converging, decreasing C to 0.001 or 0.0001 can sometimes help.
  • --streaming : stream k-best lists from disk rather than load them into memory. This results in very slow training, but may be necessary in low-memory environments or with very large development sets.

Run kbmira --help for a full list of options.

So, a complete call might look like this:

 $MOSES_SCRIPTS/training/mert-moses.pl work/dev.fr work/dev.en \
    $MOSES_BIN/moses work/model/moses.ini --mertdir $MOSES_BIN \
    --rootdir $MOSES_SCRIPTS --batch-mira --return-best-dev \
    --batch-mira-args '-J 300' --decoder-flags '-threads 8 -v 0'

Please give it a try. If it's not working as advertised, send Colin Cherry an e-mail.

For more information on batch MIRA, check out the paper:

Colin Cherry and George Foster: "Batch Tuning Strategies for Statistical Machine Translation", NAACL, June 2012.

Anticipating some questions:

[Q: Does it only handle BLEU?] [A: Yes, for now. There's nothing stopping people from implementing other metrics, so long as a reasonable sentence-level version of the metric can be worked out. Note that you generally need to retune kbmira's C-value for different metrics. I'd also change --return-best-dev to use the new metric as well.]

[Q: Have you tested this on a cluster?] [A: No, I don't have access to a Sun Grid cluster - I would love it if someone would test that scenario for me. But it works just fine using multi-threaded decoding. Since training happens in a batch, decoding is embarrassingly parallel.]

Tree-to-string and tree-to-tree tuning

When tuning with tree input, make sure you set the inputtype argument when calling the mert script:

   mert-moses.pl --inputtype 3 ...