Scripts are in the
scripts subdirectory in the source release in the Git repository.
The following basic tools are described elsewhere:
Moses is a successor to the Pharaoh decoder, so models that work for Pharaoh can also be used with Moses. The following script makes the necessary changes to the configuration file:
exodus.perl < pharaoh.ini > moses.ini
Since decoding large amounts of text takes a long time, you may want to split the text into blocks of a few hundred sentences (or fewer) and distribute the task across a Sun GridEngine cluster. This is supported by the script
moses-parallel.pl, which is run as follows:
moses-parallel.pl -decoder decoder -config cfgfile -i input -jobs N [options]
Use absolute paths for your parameters (decoder, configuration file, models, etc.).
decoder is the file location of the binary of Moses used for decoding
cfgfile is the configuration file of the decoder
input is the file to translate
N is the number of processors you require
options are used to overwrite parameters provided in
-n-best-file output file for n-best list
-n-best-size size of n-best list
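The splitting step described above can be illustrated with a short Python sketch. It is not the code of moses-parallel.pl: the function name split_into_blocks and the choice of contiguous, near-equal blocks are assumptions made for illustration only.

```python
def split_into_blocks(sentences, jobs):
    """Split a list of sentences into at most `jobs` contiguous blocks.

    Hypothetical helper: the real script may choose block boundaries differently.
    """
    if jobs < 1:
        raise ValueError("jobs must be at least 1")
    # Ceiling division, so that no more than `jobs` blocks are produced.
    block_size = max(1, -(-len(sentences) // jobs))
    return [sentences[i:i + block_size]
            for i in range(0, len(sentences), block_size)]
```

Because the blocks are contiguous, concatenating the translated blocks in order restores the original sentence order.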
Phrase tables easily get too big, but for the translation of a specific set of text only a fraction of the table is needed. So, you may want to filter the translation table, and this is possible with the script:
filter-model-given-input.pl filter-dir config input-file
This creates a filtered translation table with new configuration file in the directory
filter-dir from the model specified with the configuration file
config (typically named
moses.ini), given the (tokenized) input from the file
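The idea behind filtering can be sketched in a few lines of Python. This is an illustration, not the logic of filter-model-given-input.pl itself. It assumes a plain-text phrase table whose fields are separated by `|||` with the source phrase first, and the function names and the maximum phrase length of 7 are hypothetical.

```python
def source_phrases(sentences, max_length=7):
    """Collect every word sequence of up to `max_length` tokens from tokenized input."""
    phrases = set()
    for sentence in sentences:
        tokens = sentence.split()
        for start in range(len(tokens)):
            last = min(start + max_length, len(tokens))
            for end in range(start + 1, last + 1):
                phrases.add(" ".join(tokens[start:end]))
    return phrases


def filter_phrase_table(table_lines, sentences, max_length=7):
    """Keep only the phrase-table lines whose source phrase occurs in the input."""
    needed = source_phrases(sentences, max_length)
    kept = []
    for line in table_lines:
        # Assumed format: "source ||| target ||| scores ..."
        source = " ".join(line.split("|||", 1)[0].split())
        if source in needed:
            kept.append(line)
    return kept
```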
In the advanced feature section, you find the additional option of binarizing translation and reordering table, which allows these models to be kept on disk and queried by the decoder. If you want to both filter and binarize these tables, you can use the script:
filter-model-given-input.pl filter-dir config input-file -Binarizer binarizer
binarizer option points to the appropriate version of
Instead of the two following scripts, this one does both at the same time, and is better suited for our directory structure and factor naming conventions:
reduce_combine.pl \
    czeng05.cs \
    0,2 pos lcstem4 \
    > czeng05_restricted_to_0,2_and_with_pos_and_lcstem4_added
A simple BLEU scoring tool is the script
multi-bleu.perl reference < mt-output
Reference file and system output have to be sentence-aligned (line X in the reference file corresponds to line X in the system output). If multiple reference translations exist, these have to be stored in separate files and named
reference2, etc. All the texts need to be tokenized.
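For readers who want to see what such a score involves, here is a minimal Python sketch of corpus-level BLEU in its standard form: the geometric mean of clipped n-gram precisions multiplied by a brevity penalty. It is not a port of multi-bleu.perl and is not guaranteed to reproduce its output. The function names, and the choice of the reference length closest to the hypothesis length (shorter on ties), are assumptions of this sketch.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hypotheses, reference_sets, max_n=4):
    """Corpus-level BLEU; reference_sets[k][i] is line i of reference file k."""
    matched = [0] * max_n
    total = [0] * max_n
    hyp_length = 0
    ref_length = 0
    for index, hypothesis in enumerate(hypotheses):
        hyp = hypothesis.split()
        refs = [reference[index].split() for reference in reference_sets]
        hyp_length += len(hyp)
        # Assumption: use the reference length closest to the hypothesis length.
        ref_length += min((len(ref) for ref in refs),
                          key=lambda length: (abs(length - len(hyp)), length))
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            max_ref_counts = Counter()
            for ref in refs:
                max_ref_counts |= ngrams(ref, n)  # element-wise maximum
            # Clipping: an n-gram is credited at most as often as in any one reference.
            matched[n - 1] += sum((hyp_counts & max_ref_counts).values())
            total[n - 1] += sum(hyp_counts.values())
    if min(matched) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    if hyp_length >= ref_length:
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - ref_length / hyp_length)
    return brevity_penalty * math.exp(log_precision)
```

The sketch expects tokenized, sentence-aligned input, which is the same requirement stated above for the script.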
A popular script to score translations with BLEU is the NIST mteval script. It requires that the text is wrapped in an SGML format. This format is used for instance by the NIST evaluation and the WMT Shared Task evaluations. See the latter for more details on using this script.
Missing n-grams are those that all reference translations contain but the MT system did not produce. Extra n-grams are those that the MT system produced but none of the references contain.
missing_and_extra_ngrams.pl hypothesis reference1 reference2 ...
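The two definitions translate directly into set operations. The following Python sketch works on a single sentence and ignores how often an n-gram occurs; both simplifications, as well as the function names, are assumptions of this illustration and need not match what missing_and_extra_ngrams.pl does.

```python
def ngram_set(sentence, n):
    """Return the set of n-grams of order n in a tokenized sentence."""
    tokens = sentence.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def missing_and_extra(hypothesis, references, n=1):
    """Return (missing, extra) n-gram sets for one hypothesis and its references."""
    if not references:
        raise ValueError("at least one reference is required")
    hyp = ngram_set(hypothesis, n)
    ref_sets = [ngram_set(reference, n) for reference in references]
    in_all_references = set.intersection(*ref_sets)
    in_any_reference = set.union(*ref_sets)
    missing = in_all_references - hyp  # every reference has it, the output does not
    extra = hyp - in_any_reference     # the output has it, no reference does
    return missing, extra
```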
Assume you have a
moses.ini file already and want to run an experiment with it. Some months from now, you might still want to know exactly what the model (including all the tables) looked like, but people tend to move files around or just delete them.
To solve this problem, create a blank directory, go in there and run:
clone_moses_model.pl will make a copy of the
moses.ini file and local symlinks (and if possible also hardlinks, in case someone deleted the original file) to all the tables and language models needed.
It will now be safe to run moses locally in the fresh directory.
absolutize_moses_model.pl ../path/to/moses.ini > moses.abs.ini
to build an ini file where all paths to model parts are absolute. (Also checks the existence of the files.)
Prints basic statistics about all components mentioned in the moses.ini. This can be useful to set the order of mapping steps to avoid explosion of translation options or just to check that the model components are as big/detailed as we expect.
The sample output lists information about a model with two translation steps and one generation step. The three language models over three factors used and their n-gram counts (after discounting) are listed, too.
Translation 0 -> 1 (/fullpathto/phrase-table.0-1.gz):
        743193  phrases total
        1.20    phrases per source phrase
Translation 1 -> 2 (/fullpathto/phrase-table.1-2.gz):
        558046  phrases total
        2.75    phrases per source phrase
Generation 1,2 -> 0 (/fullpathto/generation.1,2-0.gz):
        1.04    outputs per source token
Language model over 0 (/fullpathto/lm.1.lm):
        1       2       3
        49469   245583  27497
Language model over 1 (/fullpathto/lm.2.lm):
        1       2       3
        25459   199852  32605
Language model over 2 (/fullpathto/lm.3.lm):
        1       2       3       4       5       6       7
        709     20946   39885   45753   27964   12962   7524
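The two phrase-table figures in this output can be computed as in the following Python sketch. It assumes an uncompressed phrase table with `|||`-separated fields and the source phrase in the first field. The function name phrase_table_stats is hypothetical, and the actual script may count differently.

```python
def phrase_table_stats(table_lines):
    """Return (phrases total, phrases per source phrase) for a phrase table."""
    sources = set()
    total = 0
    for line in table_lines:
        # Assumed format: "source ||| target ||| scores ..."
        sources.add(" ".join(line.split("|||", 1)[0].split()))
        total += 1
    per_source = total / len(sources) if sources else 0.0
    return total, per_source
```

A high number of phrases per source phrase for an early mapping step is the kind of signal that suggests reordering the steps to limit the number of translation options.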
Often, we train machine translation systems on lowercased data. If we want to present the output to a user, we need to re-case (or re-capitalize) the output. Moses provides a simple tool to recase data, which essentially runs Moses without reordering, using a word-to-word translation model and a cased language model.
The recaser requires a model (i.e., the word mapping model and language model mentioned above), which is trained with the command:
train-recaser.perl --dir MODEL --corpus CASED [--train-script TRAIN]
The script expects a cased (but tokenized) training corpus in the file
CASED, and creates a recasing model in the directory
MODEL. KenLM's lmplz is used to train language models by default; pass --lm to change the toolkit.
To recase output from the Moses decoder, you run the command
recase.perl --in IN --model MODEL/moses.ini --moses MOSES [--lang LANGUAGE] [--headline SGML] > OUT
The input is in file
IN, the output in file
OUT. You also need to specify a recasing model
MODEL. Since headlines are capitalized differently from regular text, you may want to provide an
SGML file that contains information about headlines. This file uses the NIST format, and may be identical to source test sets provided by NIST or other evaluation campaigns. A language
LANGUAGE may also be specified, but only English (
en) is currently supported.
By default, EMS trains a truecaser (see below). To use a recaser, you have to make the following changes:
detruecaser and add instead
[TRUECASING] section, and remove it from the
[RECASING] section, which training corpus should be used for the recaser. This is typically the target side of the parallel corpus or a large language model corpus. You can directly link to a corpus already specified in the config file, e.g., tokenized = [LM:europarl:tokenized-corpus]
Instead of lowercasing all training and test data, we may also want to keep words in their natural case, and only change the words at the beginning of their sentence to their most frequent form. This is what we mean by truecasing. Again, this requires first the training of a truecasing model, which is a list of words and the frequency of their different forms.
train-truecaser.perl --model MODEL --corpus CASED
The model is trained from the cased (but tokenized) training corpus
CASED and stored in the file
Input to the decoder has to be truecased with the command
truecase.perl --model MODEL < IN > OUT
Output from the decoder has to be restored into regular case. This simply uppercases words at the beginning of sentences:
detruecase.perl < in > out [--headline SGML]
An SGML file with headline information may be provided, as done with the recaser.
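The three truecasing steps can be summarized in a small Python sketch. It illustrates the idea only and is not the logic of train-truecaser.perl, truecase.perl, or detruecase.perl. The function names are hypothetical, the model is kept in memory rather than in a file, and skipping the sentence-initial token during training is a design choice of this sketch.

```python
from collections import Counter, defaultdict


def train_truecaser(cased_sentences):
    """Count the surface forms of each word, keyed by its lowercased form."""
    model = defaultdict(Counter)
    for sentence in cased_sentences:
        # Skip the first token: its capitalization reflects position, not the word.
        for token in sentence.split()[1:]:
            model[token.lower()][token] += 1
    return model


def truecase(sentence, model):
    """Replace the sentence-initial word with its most frequent form, if known."""
    tokens = sentence.split()
    if tokens:
        forms = model.get(tokens[0].lower())
        if forms:
            tokens[0] = forms.most_common(1)[0][0]
    return " ".join(tokens)


def detruecase(sentence):
    """Restore regular case by uppercasing the first character of the sentence."""
    return sentence[:1].upper() + sentence[1:]
```

With a model trained on a small cased corpus, a sentence-initial "The" would be mapped to "the", while a name such as "Peter" would keep its capital letter because that is its most frequent form.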
This small tool converts a Moses search graph (
-output-search-graph FILE option) to dot format. The dot format can be rendered using the graphviz tool dot.
moses ... --output-search-graph temp.graph -s 3
    # we suggest to use a very limited stack size, -s 3
sg2dot.perl [--organize-to-stacks] < temp.graph > temp.dot
dot -Tps temp.dot > temp.ps
--organize-to-stacks makes nodes in the same stack appear in the same column (this slows down the rendering, off by default).
Caution: the input must contain the searchgraph of one sentence only.
The phrase table trained by Moses contains by default all phrase pairs encountered in the parallel training corpus. This often includes 100,000 different translations for the word "the" or the comma ",". These may clog up various processing steps down the road, so it is helpful to prune the phrase table to reasonable choices.
Threshold pruning is currently implemented at two different stages: You may filter the entire phrase table file, or use threshold pruning as an additional filtering criterion when filtering the phrase table for a given test set. In either case, phrase pairs are thrown out when their phrase translation probability p(e|f) falls below a specified threshold. A safe number for this threshold may be 0.0001, in the sense that it hardly changes any phrase translation while ridding the table of a lot of junk.
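The pruning rule itself is simple, as the following Python sketch shows. It is an illustration, not the code of the Moses filtering scripts. It assumes a plain-text phrase table with `|||`-separated fields, the scores in the third field, and p(e|f) at score position 2 (counting from 0). The function name is hypothetical.

```python
def threshold_filter(table_lines, threshold, score_position=2):
    """Keep phrase pairs whose score at `score_position` is at least `threshold`."""
    kept = []
    for line in table_lines:
        fields = line.split("|||")
        # Assumed format: "source ||| target ||| score0 score1 score2 ..."
        scores = [float(score) for score in fields[2].split()]
        if scores[score_position] >= threshold:
            kept.append(line)
    return kept
```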
scripts/training/threshold-filter.perl operates on any phrase table file:
cat PHRASE_TABLE | \
    threshold-filter.perl 0.0001 > PHRASE_TABLE.reduced
If the phrase table is zipped, then:
zcat PHRASE_TABLE.gz | \
    threshold-filter.perl 0.0001 | \
    gzip - > PHRASE_TABLE.reduced.gz
While this often does not remove much of the phrase table (which consists largely of singleton phrase pairs with p(e|f)=1), it may nevertheless be helpful to also reduce the reordering model. This can be done with a second script:
cat REORDERING_TABLE | \
    remove-orphan-phrase-pairs-from-reordering-table.perl PHRASE_TABLE \
    > REORDERING_TABLE.pruned
Again, this also works for zipped files:
zcat REORDERING_TABLE.gz | \
    remove-orphan-phrase-pairs-from-reordering-table.perl PHRASE_TABLE | \
    gzip - > REORDERING_TABLE.pruned.gz
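Conceptually, an entry of the reordering table is an orphan when its phrase pair no longer appears in the pruned phrase table. The Python sketch below shows this check. It assumes that both tables are plain text with `|||`-separated fields and that the first two fields are the source and target phrase. The function names are hypothetical and the sketch is not the script's own code.

```python
def phrase_pair(line):
    """Extract the (source, target) pair from a table line."""
    fields = line.split("|||")
    return " ".join(fields[0].split()), " ".join(fields[1].split())


def remove_orphans(reordering_lines, phrase_table_lines):
    """Drop reordering entries whose phrase pair is absent from the phrase table."""
    surviving = {phrase_pair(line) for line in phrase_table_lines}
    return [line for line in reordering_lines if phrase_pair(line) in surviving]
```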
In the typical experimental setup, the phrase table is filtered for a tuning or test set using the
script. During this process, we can also remove low-probability phrase pairs. This can be done simply by adding the switch
-MinScore, which takes a specification of the following form:
filter-model-given-input.pl [...] \
    -MinScore FIELD1:THRESHOLD1[,FIELD2:THRESHOLD2[,FIELD3:THRESHOLD3]]
FIELDn is the position of the score (typically 2 for the direct phrase probability p(e|f), or 0 for the indirect phrase probability p(f|e)) and
THRESHOLD the minimum probability allowed.
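A specification of this form can be read as a mapping from score positions to thresholds. The Python sketch below parses such a string and applies it, treating each threshold as the lowest score a phrase pair may have in order to be kept, in line with the pruning of low-probability phrase pairs described above. The function names are hypothetical and the sketch does not reproduce the script's own parsing.

```python
def parse_min_score(spec):
    """Parse 'FIELD:THRESHOLD[,FIELD:THRESHOLD...]' into {field: threshold}."""
    thresholds = {}
    for item in spec.split(","):
        field, threshold = item.split(":")
        thresholds[int(field)] = float(threshold)
    return thresholds


def passes(scores, thresholds):
    """True if every listed score reaches its threshold."""
    return all(scores[field] >= minimum
               for field, minimum in thresholds.items())
```

For example, the specification `2:0.0001,0:0.001` would require a score of at least 0.0001 at position 2 and at least 0.001 at position 0.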