Moses Installation and Training Run-Through

NB: This guide is no longer actively maintained. See the installation and baseline documentation.

The purpose of this guide is to offer a step-by-step example of downloading, compiling, and running the Moses decoder and related support tools. We make no claims that all of the steps here will work perfectly on every machine you try them on, or that things will stay the same as the software changes. Please remember that Moses is research software under active development.


PART I - Download and Configure Tools and Data

Support Tools Background

Moses has a number of scripts designed to aid training, and they rely on GIZA++ and mkcls to function. More information on the origins of these tools is available at:

A Google Code project has been set up, and the code is being maintained:

Moses uses SRILM-style language models. SRILM is available from:

(Optional) The IRSTLM tools provide the ability to use quantized and disk memory-mapped language models. It's optional, but we'll be using it in this tutorial:

Support Tools Installation

Before we start building and using the Moses codebase, we have to download and compile all of these tools. See the list of versions to double-check that you are using the same code.

I'll be working under /home/jschroe1/demo in these examples. I assume you've set up some appropriately named directory in your own system. I'm installing these tools under an FC6 distro.

Changes to run the same setup under Mac OS X 10.5 are highlighted. For the Mac I'm running under /Users/josh/demo. The Mac steps aren't that complete after the compilation stage, but should work.

Machine Translation Marathon changes are highlighted. We probably won't have time to train a full model today.

mkdir tools
cd tools

Get The Latest Moses Version

Moses is available via GitHub. From the tools/ directory:

git clone git://github.com/moses-smt/mosesdecoder.git moses

This will copy all of the Moses source code to your local machine.

Compile Moses

Within the Moses folder structure are projects for Eclipse, Xcode, and Visual Studio, though these are not well maintained and may not be up to date. I'll focus on the Linux command-line method, which is the preferred way to compile.

cd moses
./bjam --with-srilm=/home/jschroe1/demo/tools/srilm --with-irstlm=/home/jschroe1/demo/tools/irstlm --with-giza=/home/jschroe1/demo/tools/bin -j2

(The -j2 is optional. ./bjam -jX, where X is the number of simultaneous build tasks, is a speedier option on machines with multiple processors.)

This creates several files we will be using:

Confirm Setup Success

A sample model capable of translating one sentence is available on the Moses website. Download it and translate the sample input file.

cd /home/jschroe1/demo/
mkdir data
cd data
wget http://www.statmt.org/moses/download/sample-models.tgz
curl -O http://www.statmt.org/moses/download/sample-models.tgz
tar -xzvf sample-models.tgz
cd sample-models/phrase-model/
../../../tools/moses/dist/bin/moses -f moses.ini < in > out

The input has "das ist ein kleines haus" listed twice, so the output file (out) should contain "this is a small house" twice.
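That expectation can be checked mechanically. Below is a self-contained sketch: the printf line is a stand-in for the decoder's "out" file, so with a real run you would drop it and point grep at your own out.

```shell
# Stand-in for the decoder output file "out"; skip this line after a real run.
printf 'this is a small house\nthis is a small house\n' > out
# Both input sentences should have produced the same translation.
test "$(grep -c 'this is a small house' out)" -eq 2 && echo "OK: both sentences translated"
```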

At this point, it might be wise for you to experiment with the command-line options of the Moses decoder. A tutorial using this example model is available at http://www.statmt.org/moses/?n=Moses.Tutorial.

Set script environment variables

Most scripts should autodetect their paths but some use an environment variable:

export SCRIPTS_ROOTDIR=/Users/josh/demo/tools/moses/scripts

Additional Scripts

There are a few scripts not included with Moses that are useful for preparing data. These were originally made available as part of the WMT08 Shared Task and Europarl v3 releases; I've consolidated some of them into one set.

cd ../../
wget http://homepages.inf.ed.ac.uk/jschroe1/how-to/scripts.tgz
curl -O http://homepages.inf.ed.ac.uk/jschroe1/how-to/scripts.tgz
tar -xzvf scripts.tgz

We'll also get a NIST scoring tool.

wget ftp://jaguar.ncsl.nist.gov/mt/resources/mteval-v11b.pl
On the Mac, use ftp or a web browser to get the file. curl and I had a fight about it.
chmod +x mteval-v11b.pl

PART II - Build a Model

We'll use the WMT08 News Commentary data set, about 55k sentences. This should be enough for moderate quality while still being doable in a reasonable amount of time on most machines. For this example we'll use FR-EN.

cd ../data
wget http://www.statmt.org/wmt08/training-parallel.tar
curl -O http://www.statmt.org/wmt08/training-parallel.tar
tar -xvf training-parallel.tar --wildcards training/news-commentary08.fr-en.*

If you're low on disk space, remove the full tar.
rm training-parallel.tar 

cd ../

Prepare Data

First we'll set up a working directory where we'll store all the data we prepare.

mkdir work

Build Language Model

Language models are concerned only with n-grams in the data, so sentence length doesn't impact training times as it does in GIZA++. So, we'll lowercase the full 55,030 tokenized sentences to use for language modeling. Many people incorporate extra target language monolingual data into their language models.

mkdir work/lm
tools/scripts/lowercase.perl < work/corpus/news-commentary.tok.en > work/lm/news-commentary.lowercased.en
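If you want a feel for what the lowercasing step does, here is a rough ASCII-only stand-in (the real lowercase.perl handles the full Unicode case mapping, so this tr one-liner is only an illustration, not a replacement):

```shell
# ASCII-only approximation of lowercase.perl; the real script is Unicode-aware.
echo "This Is A Small House" | tr '[:upper:]' '[:lower:]' > lowered.txt
cat lowered.txt
```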

We will use SRILM to build a tri-gram language model.

tools/srilm/bin/i686/ngram-count -order 3 -interpolate -kndiscount -unk -text work/lm/news-commentary.lowercased.en -lm work/lm/news-commentary.lm
tools/srilm/bin/macosx/ngram-count -order 3 -interpolate -kndiscount -unk -text work/lm/news-commentary.lowercased.en -lm work/lm/news-commentary.lm

We can see how many n-grams were created:

head -n 5 work/lm/news-commentary.lm


\data\
ngram 1=36035
ngram 2=411595
ngram 3=118368
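The counts in that ARPA header can be totalled mechanically with awk. The sketch below runs against a stand-in copy of the header; in practice, point it at work/lm/news-commentary.lm instead.

```shell
# Stand-in ARPA header matching the counts shown above;
# use work/lm/news-commentary.lm in practice.
cat > lm-header.txt <<'EOF'
\data\
ngram 1=36035
ngram 2=411595
ngram 3=118368
EOF
# Sum the counts from every "ngram N=count" line.
awk -F= '/^ngram/ {total += $2} END {print total " n-grams total"}' lm-header.txt
```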

Train Phrase Model

Moses' toolkit does a great job of wrapping up calls to mkcls and GIZA++ inside a training script and outputting the phrase and reordering tables needed for decoding. The script that does this is called train-model.perl.

If you want to skip this step, you can use the pre-prepared model and ini files located at /afs/ms/u/m/mtm52/BIG/work/model/moses.ini and /afs/ms/u/m/mtm52/BIG/work/model/moses-bin.ini instead of the local references used in this tutorial. Move on to sanity checking your setup.

We'll run this in the background and nice it since it'll peg the CPU while it runs. It may take up to an hour, so this might be a good time to run through the tutorial page mentioned earlier using the sample-models data.

nohup nice $SCRIPTS_ROOTDIR/training/train-model.perl -scripts-root-dir $SCRIPTS_ROOTDIR -root-dir work -corpus work/corpus/news-commentary.lowercased -f fr -e en -alignment grow-diag-final-and -reordering msd-bidirectional-fe -lm 0:3:/home/jschroe1/demo/work/lm/news-commentary.lm >& work/training.out &
nohup nice $SCRIPTS_ROOTDIR/training/train-model.perl -scripts-root-dir $SCRIPTS_ROOTDIR -root-dir work -corpus work/corpus/news-commentary.lowercased -f fr -e en -alignment grow-diag-final-and -reordering msd-bidirectional-fe -lm 0:3:/Users/josh/demo/work/lm/news-commentary.lm >& work/training.out &

You can run tail -f work/training.out to watch the progress of the training script. The last step will say something like:

(9) create moses.ini @ Tue Jan 27 19:40:46 CET 2009
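A simple way to poll for completion is to grep the log for that final message. The sketch below uses a stand-in log file; with a real run, grep work/training.out instead.

```shell
# Stand-in for work/training.out; the real file is written by train-model.perl.
printf '(9) create moses.ini @ Tue Jan 27 19:40:46 CET 2009\n' > training.out
# Succeeds only once the final training step has been logged.
grep -q 'create moses.ini' training.out && echo "training finished"
```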

Now would be a good time to look at what we've done.

cd work
ls
corpus  giza.en-fr  giza.fr-en  lm  model

We'll look in the model directory. The three files we really care about are moses.ini, phrase-table.gz, and reordering-table.gz.

cd model
ls -l
total 192554
-rw-r--r-- 1 jschroe1 people  5021309 Jan 27 19:23 aligned.grow-diag-final-and
-rw-r--r-- 1 jschroe1 people 27310991 Jan 27 19:24 extract.gz
-rw-r--r-- 1 jschroe1 people 27043024 Jan 27 19:25 extract.inv.gz
-rw-r--r-- 1 jschroe1 people 21069284 Jan 27 19:25 extract.o.gz
-rw-r--r-- 1 jschroe1 people  6061767 Jan 27 19:23 lex.e2f
-rw-r--r-- 1 jschroe1 people  6061767 Jan 27 19:23 lex.f2e
-rw-r--r-- 1 jschroe1 people     1032 Jan 27 19:40 moses.ini
-rw-r--r-- 1 jschroe1 people 67333222 Jan 27 19:40 phrase-table.gz
-rw-r--r-- 1 jschroe1 people 26144298 Jan 27 19:40 reordering-table.gz

Memory-Map LM and Phrase Table (Recommended for large data sets or computers with minimal RAM)

The language model and phrase table can be memory-mapped on disk to minimize the amount of RAM they consume. This isn't really necessary for this size of model, but we'll do it just for the experience.

If Moses segfaults when you try using a larger model than the one in this example, then you should try this step for sure.

More information is available on the Moses' web site at: http://www.statmt.org/moses/?n=Moses.AdvancedFeatures and http://www.statmt.org/moses/?n=FactoredTraining.BuildingLanguageModel.

Performing these steps can lead to heavy disk use during decoding - you're basically using your hard drive as RAM. Proceed at your own risk, especially if you're using a (slow) networked drive.

Sanity Check Trained Model

We haven't tuned yet, but let's just check that the decoder works, and output a lot of logging data with -v 2.

Here's an excerpt of Moses initializing with binary files in place (note the lines reporting binary loading, and recall the IRSTLM TMP issue):

echo "c' est une petite maison ." | TMP=/tmp tools/moses/dist/bin/moses -f work/model/moses-bin.ini
Loading lexical distortion models...
have 1 models
Creating lexical reordering...
weights: 0.300 0.300 0.300 0.300 0.300 0.300 
binary file loaded, default OFF_T: -1
Created lexical orientation reordering
Start loading LanguageModel /home/jschroe1/demo/work/lm/news-commentary.blm.mm : [0.000] seconds
In LanguageModelIRST::Load: nGramOrder = 3
Loading LM file (no MAP)
blmt
loadbin()
mapping 36035 1-grams
mapping 411595 2-grams
mapping 118368 3-grams
done
OOV code is 1468
IRST: m_unknownId=1468
Finished loading LanguageModels : [0.000] seconds
Start loading PhraseTable /amd/nethome/jschroe1/demo/work/model/phrase-table.0-0 : [0.000] seconds
using binary phrase tables for idx 0
reading bin ttable
size of OFF_T 8
binary phrasefile loaded, default OFF_T: -1
Finished loading phrase tables : [1.000] seconds
IO from STDOUT/STDIN

And here's one if you skipped the memory mapping steps:

echo "c' est une petite maison ." | tools/moses/dist/bin/moses -f work/model/moses.ini
Loading lexical distortion models...
have 1 models
Creating lexical reordering...
weights: 0.300 0.300 0.300 0.300 0.300 0.300 
Loading table into memory...done.
Created lexical orientation reordering
Start loading LanguageModel /home/jschroe1/demo/work/lm/news-commentary.lm : [47.000] seconds
/home/jschroe1/demo/work/lm/news-commentary.lm: line 1476: warning: non-zero probability for <unk> in closed-vocabulary LM
Finished loading LanguageModels : [49.000] seconds
Start loading PhraseTable /amd/nethome/jschroe1/demo/work/model/phrase-table.0-0.gz : [49.000] seconds
Finished loading phrase tables : [259.000] seconds
IO from STDOUT/STDIN

Again, while the memory-mapped models' short load times and small memory footprint are nice, decoding will be slower with them due to disk access.


PART III - Prepare Tuning and Test Sets

Prepare Data

We'll use some of the dev and devtest data from WMT08. We'll stick with news-commentary data and use dev2007 and test2007. We only need to look at the input (FR) side of our testing data.


PART IV - Tuning

Note that this step can take many hours, even days, to run on large phrase tables and tuning sets. We'll use the non-memory-mapped versions for decoding speed. The training script controls for large phrase and reordering tables by filtering them to include only data relevant to the tuning set (we'll do this ourselves for the test data later).

nohup nice $SCRIPTS_ROOTDIR/training/mert-moses.pl work/tuning/nc-dev2007.lowercased.fr work/tuning/nc-dev2007.lowercased.en tools/moses/dist/bin/moses work/model/moses.ini --working-dir work/tuning/mert --mertdir /home/jschroe1/demo/tools/moses/mert --rootdir $SCRIPTS_ROOTDIR --decoder-flags "-v 0" >& work/tuning/mert.out &

Since this can take so long, we can instead make a small, 100 sentence tuning set just to see if the tuning process works. This won't generate very good weights, but it will let us confirm that our tools work.

head -n 100 work/tuning/nc-dev2007.lowercased.fr > work/tuning/nc-dev2007.lowercased.100.fr
head -n 100 work/tuning/nc-dev2007.lowercased.en > work/tuning/nc-dev2007.lowercased.100.en
nohup nice $SCRIPTS_ROOTDIR/training/mert-moses.pl work/tuning/nc-dev2007.lowercased.100.fr work/tuning/nc-dev2007.lowercased.100.en tools/moses/dist/bin/moses work/model/moses.ini --working-dir work/tuning/mert --rootdir $SCRIPTS_ROOTDIR --decoder-flags "-v 0" >& work/tuning/mert.out &
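mert-moses.pl assumes the source and reference files are sentence-aligned, so it's worth confirming the two truncated files have matching line counts before launching it. A self-contained sketch follows; substitute the real nc-dev2007.lowercased.100.fr/.en paths for the stand-in files.

```shell
# Stand-ins for the truncated tuning files; use the real
# work/tuning/nc-dev2007.lowercased.100.{fr,en} paths in practice.
seq 1 100 > tune.100.fr
seq 1 100 > tune.100.en
# Source and reference must have the same number of lines.
test "$(wc -l < tune.100.fr)" -eq "$(wc -l < tune.100.en)" && echo "aligned: 100 sentence pairs"
```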

(Note that the scripts rootdir path needs to be absolute).

While this runs, check out the contents of work/tuning/mert. You'll see a set of runs, n-best lists for each, and run*.moses.ini files showing the weights used for each run. You can see the score each run is getting by looking at the last line of each run*.cmert.log file.

cd work/tuning/mert
tail -n 1 run*.cmert.log

==> run1.cmert.log <==
Best point: 0.028996 0.035146 -0.661477 -0.051250 0.001667 0.056762 0.009458 0.005504 -0.006458 0.029992 0.009502 0.012555 0.000000 -0.091232 => 0.282865

==> run2.cmert.log <==
Best point: 0.056874 0.039994 0.046105 -0.075984 0.032895 0.020815 -0.412496 0.018823 -0.019820 0.038267 0.046375 0.011876 -0.012047 -0.167628 => 0.281207

==> run3.cmert.log <==
Best point: 0.041904 0.030602 -0.252096 -0.071206 0.012997 0.516962 0.001084 0.010466 0.001683 0.008451 0.001386 0.007512 -0.014841 -0.028811 => 0.280953

==> run4.cmert.log <==
Best point: 0.088423 0.118561 0.073049 0.060186 0.043942 0.293692 -0.147511 0.037605 0.008851 0.019371 0.015986 0.018539 0.001918 -0.072367 => 0.280063

==> run5.cmert.log <==
Best point: 0.059100 0.049655 0.187688 0.010163 0.054140 0.077241 0.000584 0.101203 0.014712 0.144193 0.219264 -0.005517 -0.047385 -0.029156 => 0.280930
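To track just the scores across runs, you can strip everything before the "=>" marker. Here is an awk sketch run against a single stand-in log line; in practice, point it at work/tuning/mert/run*.cmert.log.

```shell
# Stand-in for one run's log (weights abbreviated); the real files are
# run*.cmert.log under work/tuning/mert.
printf 'Best point: 0.028996 0.035146 => 0.282865\n' > run1.cmert.log
# Print only the score that follows the "=> " marker, tagged with the filename.
awk -F'=> ' '/^Best point/ {print FILENAME ": " $2}' run1.cmert.log
```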

This gives you an idea if the system is improving or not. You can see that in this case it isn't, because we don't have enough data in our system and we haven't let tuning run for enough iterations. Kill mert-moses.pl after a few iterations just to get some weights to use.

If mert were to finish successfully, it would create a file named work/tuning/mert/moses.ini containing all the weights we need. Since we killed mert, copy the best moses.ini config to be the one we'll use. Note that the weights calculated in run1.cmert.log were used to make the config file for run2, so we want run2.moses.ini.

If you want to use the weights from a finished mert run, try /afs/ms/u/m/mtm52/BIG/work/tuning/mert/moses.ini

cp run2.moses.ini moses.ini

Insert weights into configuration file

cd ../../../
tools/scripts/reuse-weights.perl work/tuning/mert/moses.ini < work/model/moses.ini > work/tuning/moses-tuned.ini
tools/scripts/reuse-weights.perl work/tuning/mert/moses.ini < work/model/moses-bin.ini > work/tuning/moses-tuned-bin.ini

PART V - Filtering Test Data

Filtering is another way, like binarizing, to help reduce memory requirements. It makes smaller phrase and reordering tables that contain only entries that will be used for a particular test set. Binarized models don't need to be filtered since they don't take up RAM when used. Moses has a script that does this for us, which we'll apply to the evaluation test set we prepared earlier:

$SCRIPTS_ROOTDIR/training/filter-model-given-input.pl  work/evaluation/filtered.nc-test2007 work/tuning/moses-tuned.ini work/evaluation/nc-test2007.lowercased.fr 

There is also a filter-and-binarize-model-given-input.pl script if your filtered table would still be too large to load into memory.


PART VI - Run Tuned Decoder on Development Test Set

We'll try this a few ways.

All three of these outputs should be identical, but they will take different amounts of time and memory to compute.

If you don't have time to run a full decoding session, you can use an output located at /afs/ms/u/m/mtm52/BIG/work/evaluation/nc-test2007.tuned-filtered.output


PART VII - Evaluation

Train Recaser

Now we'll train a recaser. It uses a statistical model to "translate" between lowercased and cased data.

mkdir work/recaser
$SCRIPTS_ROOTDIR/recaser/train-recaser.perl -train-script $SCRIPTS_ROOTDIR/training/train-model.perl -ngram-count tools/srilm/bin/i686/ngram-count -corpus work/corpus/news-commentary.tok.en -dir /home/jschroe1/demo/work/recaser -scripts-root-dir $SCRIPTS_ROOTDIR

This goes through a whole GIZA and LM training run to go from lowercase sentences to cased sentences. Note that the -dir flag needs to be absolute.

Recase the output

$SCRIPTS_ROOTDIR/recaser/recase.perl -model work/recaser/moses.ini -in work/evaluation/nc-test2007.tuned-filtered.output -moses tools/moses/dist/bin/moses > work/evaluation/nc-test2007.tuned-filtered.output.recased

Detokenize the output

tools/scripts/detokenizer.perl -l en < work/evaluation/nc-test2007.tuned-filtered.output.recased > work/evaluation/nc-test2007.tuned-filtered.output.detokenized

Wrap the output in XML

tools/scripts/wrap-xml.perl data/devtest/nc-test2007-ref.en.sgm en my-system-name < work/evaluation/nc-test2007.tuned-filtered.output.detokenized > work/evaluation/nc-test2007.tuned-filtered.output.sgm

Score with NIST-BLEU

tools/mteval-v11b.pl -s data/devtest/nc-test2007-src.fr.sgm -r data/devtest/nc-test2007-ref.en.sgm -t work/evaluation/nc-test2007.tuned-filtered.output.sgm -c

  Evaluation of any-to-en translation using:
    src set "nc-test2007" (1 docs, 2007 segs)
    ref set "nc-test2007" (1 refs)
    tst set "nc-test2007" (1 systems)

NIST score = 6.9126  BLEU score = 0.2436 for system "my-system-name"
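mteval reports BLEU as a fraction; multiplying by 100 gives the conventionally quoted score:

```shell
# Convert mteval's fractional BLEU to the conventional 0-100 scale.
awk 'BEGIN {printf "%.1f\n", 0.2436 * 100}'
```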

We got a BLEU score of 24.4! Hooray! Best translations ever! Let's all go to the pub!

Appendix A - Versions