Moses statistical machine translation system

Tuning

The training script train-factored-model.perl produces a configuration file moses.ini which has default weights of questionable quality. That's why we need to obtain better weights by optimizing translation performance on a development set.

This is done with the tuning script mert-moses-new.pl. This new version of the minimum error rate training script is based on new C++ software. Details about the new implementation are given in Bertoldi, Haddow, Fouet, "Improved Minimum Error Rate Training in Moses", In Proc. of 3rd MT Marathon, Prague, Czech Republic. The new mert implementation is standalone open-source software. The only interaction between Moses and the new software is through the script mert-moses-new.pl itself.

This new implementation of mert stores feature scores and error statistics in separate files (possibly in a binary format) for each n-best list (at each iteration), and uses (some of) these files to optimize weights. At the moment, weight optimization can be based on either BLEU or PER.

Most features of the old code mert-moses.pl are maintained, and some new ones are added.

The script is run as follows:

 mert-moses-new.pl input-text references decoder-executable decoder.ini

Parameters:

  • input-text and references are the development set, on which translation performance is optimized. The tuning script tries to find translations for input-text that best resemble the reference translations in references. The script also works with multiple reference files; these have to be called [references]0, [references]1, [references]2, etc.
  • decoder-executable is the location of the decoder binary to be used
  • decoder.ini is the location of the configuration file to be used
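
The naming convention for multiple reference files can be sketched as follows. This is only an illustration of the convention described above; the function name reference_files and the prefix dev.ref are hypothetical and not part of the tuning script.

```python
def reference_files(prefix, count):
    """List the reference file names expected for `count` references.

    With a single reference the file is named by the prefix itself; with
    several, the names are the prefix followed by 0, 1, 2, ...
    """
    if count < 1:
        raise ValueError("at least one reference is required")
    if count == 1:
        return [prefix]
    return [f"{prefix}{i}" for i in range(count)]


# Hypothetical prefix: three references would be looked up as
# dev.ref0, dev.ref1 and dev.ref2.
print(reference_files("dev.ref", 3))
```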

Options:

  • --working-dir=STRING (default mert-dir) directory that contains all files generated during the tuning process. Upon conclusion, it will contain a new moses.ini with better weights
  • --nbest=NUM (default 100) size of n-best list to be generated at each run of the decoder
  • --jobs=NUM if the script is run on a cluster, this specifies how many jobs to submit (default: serial execution, does not use qsub)
  • --queue-flags=STRING additional switches to pass to the parallelizer, eg. '-qsub-prefix logname'
  • --decoder-flags=STRING additional parameters for the decoder
  • --lambdas=STRING default values and ranges for lambdas, a complex string such as 'd:1,0.5-1.5 lm:1,0.5-1.5 ...' (see below)
  • --average use the average (not the default, closest) reference length as effective reference length for BLEU score computation
  • --closest use the closest (default) reference length as effective reference length for BLEU score computation
  • --shortest use the shortest (not the default, closest) reference length as effective reference length for BLEU score computation
  • --nocase perform a case-insensitive evaluation between hypos and refs (default is false)
  • --activate-features=STRING perform optimization on a specified subset of features (default is the optimization of all features); see below for details and for the correct syntax
  • --continue continue the iterative optimization process from the last finished step (default is false); this is useful to recover an optimization process that did not terminate (for example due to a system failure) without losing the steps that completed successfully (see the note below for more details)
  • --prev-aggregate-nbestlist=INT number of previous steps to consider when loading data (default is -1). -1 means all previous steps, i.e. from iteration 1; 0 means no previous data, i.e. only the current iteration; 1 means one previous step, i.e. the current iteration and the previous one; and so on.
  • --mertdir=STRING path to the new implementation of mert software
  • --mertargs=STRING extra arguments for mert, eg to specify score type (which is BLEU by default)
  • --help gives a full list of options

Note: the optimized final weights are L1-normalized to 1 (i.e. sum_i |w_i| =1).
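
As a minimal sketch of what L1 normalization means here (the function name l1_normalize is an assumption for illustration, not the script's own code):

```python
def l1_normalize(weights):
    """Scale weights so that the sum of their absolute values is 1."""
    total = sum(abs(w) for w in weights)
    if total == 0:
        raise ValueError("cannot normalize an all-zero weight vector")
    return [w / total for w in weights]


# Signs are preserved; only the overall scale changes.
normalized = l1_normalize([0.5, -1.0, 0.5])
print(normalized)
```

Because only the scale changes, the ranking of translations by the weighted model score is unaffected by this normalization.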

Note: the policy for computing the effective reference length in the BLEU score has changed (from revision 2461).
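
The three policies selected by --closest, --average and --shortest can be illustrated for a single sentence as below. This is a sketch under assumptions: the function name is hypothetical, and breaking ties toward the shorter reference in the closest policy is a common BLEU convention that is assumed here rather than taken from the mert code.

```python
def effective_reference_length(hyp_len, ref_lens, policy="closest"):
    """Pick the reference length used for the BLEU brevity penalty."""
    if not ref_lens:
        raise ValueError("at least one reference length is required")
    if policy == "shortest":
        return min(ref_lens)
    if policy == "average":
        return sum(ref_lens) / len(ref_lens)
    if policy == "closest":
        # Assumption: ties are broken in favour of the shorter reference.
        return min(ref_lens, key=lambda r: (abs(r - hyp_len), r))
    raise ValueError(f"unknown policy: {policy}")


# A 10-word hypothesis against references of 8, 11 and 15 words.
for policy in ("closest", "average", "shortest"):
    print(policy, effective_reference_length(10, [8, 11, 15], policy))
```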

Note: the policy for case-sensitive/insensitive evaluation has changed (from revision 2461); now the default is case-sensitive.

Note: the --continue option relies on several files produced in the well-completed previous steps of the optimization process:

  • "finished_step.txt"
  • "runX.features.dat" for X=1,..,T
  • "runX.scores.dat" for X=1,..,T
  • "runT.weights.txt"
  • "runT.mert.log"
  • "runT.names.txt"

where T is the last well-completed step. The file "finished_step.txt" should contain the value T.
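
Before restarting with --continue, it can help to check that these files are present. The following is a small sketch that only encodes the file list given above; the helper names and the working directory path are hypothetical.

```python
from pathlib import Path


def files_needed_to_continue(last_step):
    """File names the --continue option relies on, for last finished step T."""
    names = ["finished_step.txt"]
    for x in range(1, last_step + 1):
        names.append(f"run{x}.features.dat")
        names.append(f"run{x}.scores.dat")
    names += [
        f"run{last_step}.weights.txt",
        f"run{last_step}.mert.log",
        f"run{last_step}.names.txt",
    ]
    return names


def missing_files(working_dir, last_step):
    """Return the required files that do not exist in working_dir."""
    base = Path(working_dir)
    return [n for n in files_needed_to_continue(last_step)
            if not (base / n).exists()]


# Hypothetical working directory; an empty result means nothing is missing.
print(missing_files("mert-dir", 2))
```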

Example: store feature scores and error statistics in binary files and use PER instead of the default BLEU. Quotation marks are required.

 --mertargs "--binary --sctype PER"

Example: use only the n-best lists produced in the last 3 iterations of the mert process (plus the current one):

 --prev-aggregate-nbestlist=3
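
The effect of --prev-aggregate-nbestlist on which iterations are loaded can be sketched as follows. The function is a hypothetical illustration of the rule stated in the option description, not code from the script.

```python
def iterations_to_load(current, prev_aggregate=-1):
    """Iterations whose n-best data is loaded at iteration `current`.

    -1 loads everything from iteration 1; N >= 0 loads the current
    iteration plus the N previous ones (never going below iteration 1).
    """
    if current < 1:
        raise ValueError("iterations are numbered from 1")
    if prev_aggregate == -1:
        first = 1
    elif prev_aggregate >= 0:
        first = max(1, current - prev_aggregate)
    else:
        raise ValueError("prev_aggregate must be -1 or non-negative")
    return list(range(first, current + 1))


# At iteration 5: default, no previous data, and three previous steps.
print(iterations_to_load(5))
print(iterations_to_load(5, 0))
print(iterations_to_load(5, 3))
```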

More on the lambda settings:

If you wish to optimize weights of all models your moses.ini mentions, and you want to use default values and intervals, you do not need to specify --lambdas at all.

--lambdas=STRING specifies the starting values and randomization ranges for the weights in a somewhat obtuse format. Each weight is specified as start,min-max, for instance 0.5,0.25-0.75 for a starting weight of 0.5 (used in the initial decoder run) and randomized values between 0.25 and 0.75 during the parameter search. Weights have to be defined for reordering (d), language model (lm), translation model (tm), generation model (g), and word penalty (w), for instance by d:1,0.5-1.5 for the reordering model. If there are multiple weights per component, these weights are specified in sequence, separated by semicolons (;).

Example: d:1,0.5-1.5 lm:1,0.5-1.5 tm:0.3,0.25-0.75;0.2,0.25-0.75;0.2,0.25-0.75;0.3,0.25-0.75;0,-0.5-0.5 w:0,-0.5-0.5 sets

  • one weight for the distortion model, starting with 1, then randomized from 0.5-1.5
  • one weight for the language model, starting with 1, then randomized from 0.5-1.5
  • five weights for the translation model:
    • the first starting at 0.3, then randomized from 0.25-0.75
    • the second starting at 0.2, then randomized from 0.25-0.75
    • the third starting at 0.2, then randomized from 0.25-0.75
    • the fourth starting at 0.3, then randomized from 0.25-0.75
    • the fifth starting at 0, then randomized from -0.5 to 0.5
  • one weight for the word penalty, starting with 0, then randomized from -0.5 to 0.5
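
To make the format concrete, here is a minimal parser sketch for a --lambdas string. It is written only from the description above (start,min-max items, semicolons between weights, spaces between components); the function name and the exact number syntax accepted are assumptions, and the real script may be more or less permissive.

```python
import re

# A plain decimal number with an optional leading minus sign.
_NUM = r"-?\d+(?:\.\d+)?"
# One weight: start,min-max (min and max may themselves be negative).
_WEIGHT = re.compile(rf"^({_NUM}),({_NUM})-({_NUM})$")


def parse_lambdas(spec):
    """Parse a --lambdas string into {component: [(start, min, max), ...]}."""
    result = {}
    for component in spec.split():
        name, _, body = component.partition(":")
        if not name or not body:
            raise ValueError(f"malformed component: {component!r}")
        weights = []
        for item in body.split(";"):
            match = _WEIGHT.match(item)
            if match is None:
                raise ValueError(f"malformed weight: {item!r}")
            start, low, high = (float(g) for g in match.groups())
            weights.append((start, low, high))
        result[name] = weights
    return result


spec = ("d:1,0.5-1.5 lm:1,0.5-1.5 "
        "tm:0.3,0.25-0.75;0.2,0.25-0.75;0.2,0.25-0.75;0.3,0.25-0.75;0,-0.5-0.5 "
        "w:0,-0.5-0.5")
for name, weights in parse_lambdas(spec).items():
    print(name, weights)
```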

Tuning on a subset of features

Sometimes it is useful to optimize only a subset of the feature weights. mert-moses-new.pl allows this through the parameter --activate-features=list, where list is a comma-separated list of features. Features are identified by name_index, where name is the group name and index is the position of the feature inside the group. The group names are:

 d   distortion model
 lm  language models
 tm  translation models
 w   word penalty
 I   posterior probability for confusion network

and the index starts from 0; the index is mandatory even if only one feature occurs in a group. If no features are specified (--activate-features='') or the parameter --activate-features is not set, mert-moses-new.pl performs optimization over all available features.

For instance, if the option is set as follows

 --activate-features=d_0,d_4,lm_0,tm_3,w_0

only the following features will be optimized: the first (d_0) and the fifth (d_4) distortion model weights, the first language model weight (lm_0), the fourth (tm_3) translation model weight, and the first (w_0) word penalty.
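
The name_index syntax can be sketched with a small parser that also rejects indices outside a group. This is an illustration only: the function name and the feature_counts mapping (how many weights each group has) are assumptions introduced for the example.

```python
def parse_active_features(spec, feature_counts):
    """Turn a --activate-features value into a list of (group, index) pairs.

    feature_counts maps each group name to its number of weights.
    An empty spec selects every feature, as when the option is not set.
    """
    if not spec:
        return [(name, i)
                for name, count in feature_counts.items()
                for i in range(count)]
    active = []
    for item in spec.split(","):
        name, sep, index_text = item.rpartition("_")
        if not sep or not index_text.isdigit():
            raise ValueError(f"index is mandatory: {item!r}")
        index = int(index_text)
        if name not in feature_counts or index >= feature_counts[name]:
            raise ValueError(f"no such feature: {item!r}")
        active.append((name, index))
    return active


# Assumed setup: 7 distortion weights, 1 language model, 5 tm weights, 1 w.
counts = {"d": 7, "lm": 1, "tm": 5, "w": 1}
print(parse_active_features("d_0,d_4,lm_0,tm_3,w_0", counts))
```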

IMPORTANT:

  • The configuration file used for the first iteration takes the feature weights from the --lambdas parameter (or from their defaults), and not from the configuration file passed through the command line. Hence, if you want to assign specific (already optimized) values for NOT-activated features, please set them by means of the --lambdas parameter. For these NOT-activated features, you must also specify their ranges in the --lambdas parameter to maintain the correct syntax although they are not exploited.
  • Note that mert-moses-new.pl fails if a non-existent feature is specified; for instance lm_2 is not allowed if only one or two language models are used.
  • There are some differences with respect to the old version of the script (mert-moses.pl):
    • you can no longer specify a group of features
    • non-activated features are kept fixed at their initial values (specified with --lambdas)

Tuning on a subset of features (Old version)

mert-moses.pl allows this through the parameter --activate=list, where list is a non-empty comma-separated list of features. Features are identified by their names. The feature names can be found in the nbest list, in the file names.txt in the working directory during the minimum error training, and in the moses help (moses --help). Main features are:

 d   distortion model
 lm  language models
 tm  translation models
 w   word penalty
 I   posterior probability for confusion network

The parameter --activate=list activates the optimization of only the listed feature weights. The ratios among the remaining ones are fixed, and are taken from the --lambdas parameter.

If a feature has several scores (eg. tm with 5 scores), optimization can be performed on:

  • all of them (--activate=tm)
  • any of them by specifying its index (--activate=tm_2,tm_3)

The optimization of a subset of feature weights works as follows:

  • the not-active features are combined in a weighted sum to create an extra feature.
  • the active features and this extra feature are optimized (and normalized)
  • the initial not-active weights are multiplied by the optimal weight of the extra feature
  • the full set of weights is normalized to 1.

Note that if parameter --activate is not set, all weights are optimized.

Example:
Features are: d lm tm tm tm tm tm w

Parameters are set as follows:
--lambdas="d:1,0.5-1.5 lm:1,0.5-1.5 tm:0.3,0.25-0.75;0.2,0.25-0.75;0.2,0.25-0.75;0.3,0.25-0.75;0,-0.5-0.5 w:0,-0.5-0.5"
--activate=d,tm_1,tm_5

Features d, the first tm and the fifth tm will be optimized. Features lm, the second, third and fourth tm, and w will NOT be optimized. The ratios among them come from their initial values: 1, 0.2, 0.2, 0.3, and 0, respectively.

If optimal weights are 0.11, 0.54, and 0.09 respectively, and 0.26 for the extra feature, the final set of weights is

 0.11 0.26(=0.26*1) 0.54 0.052(=0.26*0.2) 0.052(=0.26*0.2) 0.078(=0.26*0.3) 0.09 0(=0.26*0)

and after normalization

 0.093 0.220 0.457 0.044 0.044 0.066 0.076 0
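
The last two steps of this example (rescaling the not-active weights by the extra feature's weight, then normalizing) can be reproduced with a short sketch. The function name and the 0-based positions are choices made for this illustration; the optimization itself is not shown, and its results are simply taken from the example above.

```python
def combine_weights(initial, optimized_active, extra_weight):
    """Merge optimized and fixed-ratio weights, then L1-normalize.

    initial: starting weights for all features, in order.
    optimized_active: {position: optimized weight} for active features.
    extra_weight: optimized weight of the extra (combined) feature.
    """
    combined = [
        optimized_active[i] if i in optimized_active else extra_weight * w
        for i, w in enumerate(initial)
    ]
    total = sum(abs(w) for w in combined)
    return [w / total for w in combined]


# Feature order: d lm tm tm tm tm tm w (initial values from --lambdas).
initial = [1, 1, 0.3, 0.2, 0.2, 0.3, 0, 0]
# Active features: d, the first tm and the fifth tm (0-based positions).
optimized = {0: 0.11, 2: 0.54, 6: 0.09}
final = combine_weights(initial, optimized, extra_weight=0.26)
print([round(w, 3) for w in final])
```

Rounded to three decimals, the printed values agree with the normalized weights listed above.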

You can also use the old version of the MERT script, mert-moses.pl, but some features are not available.

Running the script on a cluster

If the script is run as a multi-job process (--jobs) on a Grid Engine cluster, you will run the script on the head node. The script submits all compute-heavy parts as jobs to the cluster. In other words, you do not submit the script itself as a job to the cluster.

Removal of cmert and zmert

Since there are currently three mert implementations in the moses trunk, after a short discussion on the moses developers' list in July 2010, it was decided that zmert and cmert would be removed, leaving "new mert" as the only mert implementation. The plan for this work is as follows (feel free to add comments)

  • Add options to moses for conversion of ini file to feature list (Hieu)
  • Update mert-moses-new.perl to use moses to get its feature list, and check that this works with syntax moses (Barry, assisted by Hieu and Phil W)
  • Clean up mert-moses-new.perl, get rid of pythonpath etc (Barry or Nicola)
  • Delete mert-moses.perl and cmert, and rename mert-moses-new.perl to mert-moses.perl. The command line arguments are the same, except that you need to specify -mertdir (Barry)
  • Remove zmert from moses, retain zmert-moses.perl for now (Barry)
  • Merge zmert-moses.perl and mert-moses.perl (Ondrej + students). This is probably a low priority.
  • Update mert-moses.perl so that it can also train the weights used in lattice MBR (Barry)
  • Merge in the lattice mert code (?)
  • Add new scorers to mert (eg TER), improve documentation of scorer interface and scorers. Make a video. (this could be a small mt marathon project)

Page last modified on July 31, 2010, at 11:41 AM