Moses: statistical machine translation system

Support Tools

Overview

Scripts are in the scripts subdirectory of the source release in the Git repository.

The basic tools are described elsewhere in this documentation.

Converting Pharaoh configuration files to Moses configuration files

Moses is a successor to the Pharaoh decoder, so models that work with Pharaoh can also be used with Moses. The following script makes the necessary changes to the configuration file:

 exodus.perl < pharaoh.ini > moses.ini

Moses decoder in parallel

Since decoding large amounts of text takes a long time, you may want to split up the text into blocks of a few hundred sentences (or less), and distribute the task across a Sun GridEngine cluster. This is supported by the script moses-parallel.pl, which is run as follows:

 moses-parallel.pl -decoder decoder -config cfgfile -i input -jobs N [options]

Use absolute paths for your parameters (decoder, configuration file, models, etc.).

  • decoder is the path to the Moses binary used for decoding
  • cfgfile is the configuration file of the decoder
  • input is the file to translate
  • N is the number of processors you require
  • options are used to override parameters provided in cfgfile
    Among them, override the following two parameters for n-best generation (NOTE: they differ from standard Moses); see the example below:
    • -n-best-file output file for the n-best list
    • -n-best-size size of the n-best list
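
For example, a run split across 16 jobs that also produces a 100-best list might look like this (all paths below are placeholders; use absolute paths as noted above):

 moses-parallel.pl -decoder /home/user/moses/bin/moses \
    -config /home/user/work/moses.ini \
    -i /home/user/work/test.in -jobs 16 \
    -n-best-file /home/user/work/test.nbest -n-best-size 100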

Filtering phrase tables for Moses

Phrase tables easily become very large, but for the translation of a specific set of text only a fraction of the table is needed. You may therefore want to filter the translation table, which is possible with the script:

 filter-model-given-input.pl filter-dir config input-file

This creates a filtered translation table with a new configuration file in the directory filter-dir from the model specified by the configuration file config (typically named moses.ini), given the (tokenized) input from the file input-file.

In the advanced features section, you will find the additional option of binarizing the translation and reordering tables, which allows these models to be kept on disk and queried by the decoder. If you want to both filter and binarize these tables, you can use the script:

 filter-model-given-input.pl filter-dir config input-file -Binarizer binarizer

The additional binarizer option points to the appropriate version of processPhraseTable.
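
For example (directory and file names are placeholders; point -Binarizer at your processPhraseTable binary):

 filter-model-given-input.pl filtered-test /home/user/work/moses.ini \
    test.input.tok -Binarizer /home/user/moses/bin/processPhraseTable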

Reducing and Extending the Number of Factors

Instead of running separate scripts to reduce and to extend the set of factors, the script reduce_combine.pl does both at the same time, and is better suited to our directory structure and factor naming conventions. In the example below, the corpus czeng05.cs is restricted to factors 0 and 2, and the factors pos and lcstem4 are added:

 reduce_combine.pl \
    czeng05.cs \
    0,2 pos lcstem4 \
    > czeng05_restricted_to_0,2_and_with_pos_and_lcstem4_added

Scoring translations with BLEU

A simple BLEU scoring tool is the script multi-bleu.perl:

 multi-bleu.perl reference < mt-output

Reference file and system output have to be sentence-aligned (line X in the reference file corresponds to line X in the system output). If multiple reference translations exist, they have to be stored in separate files named reference0, reference1, reference2, etc. All the texts need to be tokenized.

A popular script to score translations with BLEU is the NIST mteval script. It requires that the text is wrapped in an SGML format. This format is used, for instance, by the NIST evaluation and the WMT Shared Task evaluations. See the latter for more details on using this script.
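
For reference, the SGML wrapping looks roughly like the sketch below. The exact tags and attributes vary between evaluation campaigns, so treat this as an illustration only and consult the respective campaign's instructions:

 <refset setid="test2016" srclang="de" trglang="en">
  <doc sysid="ref" docid="doc1">
   <seg id="1">This is the first reference translation .</seg>
   <seg id="2">This is the second one .</seg>
  </doc>
 </refset>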

Missing and Extra N-Grams

Missing n-grams are those that all reference translations wanted but the MT system did not produce. Extra n-grams are those that the MT system produced but none of the references approved.

 missing_and_extra_ngrams.pl hypothesis reference1 reference2 ...

Making a Full Local Clone of Moses Model + ini File

Assume you have a moses.ini file already and want to run an experiment with it. Some months from now, you might still want to know what exactly the model (incl. all the tables) looked like, but people tend to move files around or just delete them.

To solve this problem, create a blank directory, go in there and run:

 clone_moses_model.pl ../path/to/moses.ini

clone_moses_model.pl will make a copy of the moses.ini file and local symlinks (and if possible also hardlinks, in case someone deletes the original file) to all the tables and language models needed.

It will now be safe to run Moses locally in the fresh directory.
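
A typical workflow might look like this (directory and file names are placeholders):

 mkdir experiment1 && cd experiment1
 clone_moses_model.pl ../model/moses.ini
 moses -f moses.ini < input.txt > output.txt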

Absolutizing Paths in moses.ini

Run:

  absolutize_moses_model.pl  ../path/to/moses.ini > moses.abs.ini

to build an ini file in which all paths to model parts are absolute. (The script also checks that the files exist.)

Printing Statistics about Model Components

The script

 analyse_moses_model.pl moses.ini

prints basic statistics about all components mentioned in the moses.ini. This can be useful to set the order of mapping steps to avoid an explosion of translation options, or just to check that the model components are as big/detailed as we expect.

The sample output below lists information about a model with two translation steps and one generation step. The three language models (one over each of the three factors used) and their n-gram counts (after discounting) are also listed.

 Translation 0 -> 1 (/fullpathto/phrase-table.0-1.gz):
   743193        phrases total
   1.20  phrases per source phrase
 Translation 1 -> 2 (/fullpathto/phrase-table.1-2.gz):
   558046        phrases total
   2.75  phrases per source phrase
 Generation 1,2 -> 0 (/fullpathto/generation.1,2-0.gz):
   1.04  outputs per source token
 Language model over 0 (/fullpathto/lm.1.lm):
   1     2       3   
   49469 245583  27497
 Language model over 1 (/fullpathto/lm.2.lm):
   1     2       3   
   25459 199852  32605
 Language model over 2 (/fullpathto/lm.3.lm):
   1     2       3       4       5       6       7  
   709   20946   39885   45753   27964   12962   7524

Recaser

Often, we train machine translation systems on lowercased data. If we want to present the output to a user, we need to re-case (or re-capitalize) the output. Moses provides a simple tool to recase data, which essentially runs Moses without reordering, using a word-to-word translation model and a cased language model.

The recaser requires a model (i.e., the word mapping model and language model mentioned above), which is trained with the command:

 train-recaser.perl --dir MODEL --corpus CASED [--train-script TRAIN]

The script expects a cased (but tokenized) training corpus in the file CASED, and creates a recasing model in the directory MODEL. KenLM's lmplz is used to train language models by default; pass --lm to change the toolkit.

To recase output from the Moses decoder, you run the command

 recase.perl --in IN --model MODEL/moses.ini --moses MOSES [--lang LANGUAGE] [--headline SGML] > OUT

The input is in file IN, the output in file OUT. You also need to specify a recasing model MODEL. Since headlines are capitalized differently from regular text, you may want to provide an SGML file that contains information about headlines. This file uses the NIST format, and may be identical to source test sets provided by NIST or other evaluation campaigns. A language LANGUAGE may also be specified, but only English (en) is currently supported.
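
For example (all paths are placeholders):

 recase.perl --in output.lowercased.en --model recase-model/moses.ini \
    --moses /home/user/moses/bin/moses > output.recased.en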

By default, EMS trains a truecaser (see below). To use a recaser, you have to make the following changes:

  • Comment out output-truecaser and detruecaser and instead add output-lowercaser and EVALUATION:recaser.
  • Add IGNORE to the [TRUECASING] section, and remove it from the [RECASING] section.
  • Specify in the [RECASING] section which training corpus should be used for the recaser. This is typically the target side of the parallel corpus or a large language model corpus. You can directly link to a corpus already specified in the config file, e.g., tokenized = [LM:europarl:tokenized-corpus] (see the sketch below).
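
For instance, the relevant fragments of the EMS config file might then look like this (the corpus reference is the example from the list above):

 [TRUECASING]
 IGNORE

 [RECASING]
 tokenized = [LM:europarl:tokenized-corpus]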

Truecaser

Instead of lowercasing all training and test data, we may also want to keep words in their natural case, and only change the words at the beginning of the sentence to their most frequent form. This is what we mean by truecasing. Again, this first requires training a truecasing model, which is a list of words and the frequencies of their different forms.

 train-truecaser.perl --model MODEL --corpus CASED

The model is trained from the cased (but tokenized) training corpus CASED and stored in the file MODEL.

Input to the decoder has to be truecased with the command

 truecase.perl --model MODEL < IN > OUT

Output from the decoder has to be restored to regular case. This simply uppercases words at the beginning of sentences:

 detruecase.perl < in > out [--headline SGML]

An SGML file with headline information may be provided, as done with the recaser.
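
Putting the pieces together, a complete truecasing workflow might look like this (file and model names are placeholders):

 train-truecaser.perl --model truecase-model.en --corpus cased.tok.en
 truecase.perl --model truecase-model.en < test.tok.en > test.tc.en
 moses -f moses.ini < test.tc.en > output.tc.en
 detruecase.perl < output.tc.en > output.en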

Searchgraph to DOT

This small tool converts a Moses search graph (-output-search-graph FILE option) to dot format. The dot format can be rendered using the Graphviz tool dot.

 moses ... --output-search-graph temp.graph -s 3
    # we suggest using a very limited stack size, e.g. -s 3
 sg2dot.perl [--organize-to-stacks] < temp.graph > temp.dot
 dot -Tps temp.dot > temp.ps

Using --organize-to-stacks makes nodes in the same stack appear in the same column (this slows down the rendering and is off by default).

Caution: the input must contain the searchgraph of one sentence only.

Threshold Pruning of Phrase Table

The phrase table trained by Moses contains by default all phrase pairs encountered in the parallel training corpus. This often includes 100,000 different translations for the word "the" or the comma ",". These may clog up various processing steps down the road, so it is helpful to prune the phrase table down to the reasonable choices.

Threshold pruning is currently implemented at two different stages: You may filter the entire phrase table file, or use threshold pruning as an additional filtering criterion when filtering the phrase table for a given test set. In either case, phrase pairs are thrown out when their phrase translation probability p(e|f) falls below a specified threshold. A safe number for this threshold may be 0.0001, in the sense that it hardly changes any phrase translation while ridding the table of a lot of junk.

Pruning the full phrase table file

The script scripts/training/threshold-filter.perl operates on any phrase table file:

 cat PHRASE_TABLE | \
  threshold-filter.perl 0.0001 > PHRASE_TABLE.reduced

If the phrase table is zipped, then:

 zcat PHRASE_TABLE.gz | \
  threshold-filter.perl 0.0001 | \
  gzip - > PHRASE_TABLE.reduced.gz

While this often does not remove much of the phrase table (which to a large part contains singleton phrase pairs with p(e|f)=1), it may nevertheless be helpful to also reduce the reordering model. This can be done with a second script:

 cat REORDERING_TABLE | \
  remove-orphan-phrase-pairs-from-reordering-table.perl PHRASE_TABLE \
  > REORDERING_TABLE.pruned

Again, this also works for zipped files:

 zcat REORDERING_TABLE.gz | \
  remove-orphan-phrase-pairs-from-reordering-table.perl PHRASE_TABLE | \
  gzip - > REORDERING_TABLE.pruned.gz

Pruning during test/tuning set filtering

In the typical experimental setup, the phrase table is filtered for a tuning or test set using the script filter-model-given-input.pl described above. During this process, we can also remove low-probability phrase pairs. This can be done simply by adding the switch -MinScore, which takes a specification of the following form:

 filter-model-given-input.pl [...]  \
  -MinScore FIELD1:THRESHOLD1[,FIELD2:THRESHOLD2[,FIELD3:THRESHOLD3]]

where FIELDn is the position of the score (typically 2 for the direct phrase probability p(e|f), or 0 for the indirect phrase probability p(f|e)) and THRESHOLDn is the minimum probability allowed.
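
For example, to discard all phrase pairs whose direct phrase probability p(e|f) falls below the safe threshold of 0.0001 suggested above (directory and file names are placeholders):

 filter-model-given-input.pl filtered-test /home/user/work/moses.ini \
    test.input.tok -MinScore 2:0.0001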
