Moses: statistical machine translation system

Baseline System

Overview

This guide assumes that you have successfully installed Moses, and would like to see how to use parallel data to build a real phrase-based translation system. The process requires some familiarity with UNIX and, ideally, access to a Linux server. It can be run on a laptop, but could take about a day, and it requires at least 2G of RAM and about 10G of free disk space (these requirements are just educated guesses, so if you have a different experience then please mail support).

If you want to save the effort of typing in all the commands on this page (and see how the pros manage their experiments), then skip straight to the experiment management system instructions below. But I'd recommend that you follow through the process manually, at least once, just to see how it all works.

Installation

The minimum software requirements are:

  • Moses (obviously!)
  • GIZA++, for word-aligning your parallel corpus
  • IRSTLM, SRILM, or KenLM for language model estimation.

IRSTLM and KenLM are LGPL licensed (like Moses) and therefore available for commercial use. The Moses tool-chain defaults to SRILM, but it requires an expensive licence for non-academic use.

For the purposes of this guide, I will assume that you're going to install all the tools and data in your home directory (i.e. ~/), and that you've already downloaded and compiled Moses into ~/mosesdecoder. And you're going to run Moses from there.

Installing GIZA++

GIZA++ is hosted at Google Code, and a mirror of the original documentation can be found here. I recommend that you download the latest version via svn:

 svn checkout http://giza-pp.googlecode.com/svn/trunk/ giza-pp
 cd giza-pp
 make

This should create the binaries ~/giza-pp/GIZA++-v2/GIZA++, ~/giza-pp/GIZA++-v2/snt2cooc.out and ~/giza-pp/mkcls-v2/mkcls. These need to be copied to somewhere that Moses can find them, as follows:

 cd ~/mosesdecoder
 mkdir tools
 cp ~/giza-pp/GIZA++-v2/GIZA++ ~/giza-pp/GIZA++-v2/snt2cooc.out \
   ~/giza-pp/mkcls-v2/mkcls tools

When you come to run the training, you need to tell the training script where GIZA++ was installed, using the -external-bin-dir argument:

 train-model.perl -external-bin-dir $HOME/mosesdecoder/tools

UPDATE: GIZA++ only compiles with gcc. If you're using OSX Mavericks, you'll have to install gcc yourself. I (Hieu) recommend using MGIZA instead.

Installing IRSTLM

IRSTLM is a language modelling toolkit from FBK, and is hosted on sourceforge. Again, you should download the latest version. I used version 5.80.03 for this guide, so assuming you downloaded the tarball into your home directory (and making the obvious changes if you download a later version), the following commands should build and install IRSTLM:

 tar zxvf irstlm-5.80.03.tgz
 cd irstlm-5.80.03
 ./regenerate-makefiles.sh
 ./configure --prefix=$HOME/irstlm
 make install

You should now have several binaries and scripts in ~/irstlm/bin, in particular build-lm.sh.

Corpus Preparation

To train a translation system we need parallel data (text translated into two different languages) which is aligned at the sentence level. Luckily there's plenty of this data freely available, and for this system I'm going to use a small (only 130,000 sentences!) data set released for the 2013 Workshop on Machine Translation. To get the data we want, we have to download the tarball and unpack it (into a corpus directory in our home directory) as follows:

 cd
 mkdir corpus
 cd corpus 
 wget http://www.statmt.org/wmt13/training-parallel-nc-v8.tgz
 tar zxvf training-parallel-nc-v8.tgz

If you look in the ~/corpus/training directory you'll see that there's data from news-commentary (news analysis from Project Syndicate) in various languages. We're going to build a French-English (fr-en) translation system using the news commentary data set, but feel free to use one of the other language pairs if you prefer.

To prepare the data for training the translation system, we have to perform the following steps:

  • tokenisation: This means that spaces have to be inserted between (e.g.) words and punctuation.
  • truecasing: The initial words in each sentence are converted to their most probable casing. This helps reduce data sparsity.
  • cleaning: Long sentences and empty sentences are removed, as they can cause problems with the training pipeline; obviously mis-aligned sentences are also removed.
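The length check in the cleaning step can be sketched with a one-liner (a simplification for illustration only: the real clean-corpus-n.perl script processes both sides of the corpus together, so that dropping a sentence on one side also drops its counterpart on the other):

```shell
# Simplified sketch of the sentence-length filter (not the real script):
# keep only lines with between 1 and 80 whitespace-separated tokens.
printf 'a short sentence\n\n' | awk 'NF >= 1 && NF <= 80'
# prints "a short sentence"; the empty line is dropped (NF is 0)
```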

The tokenisation can be run as follows:

 ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en \
    < ~/corpus/training/news-commentary-v8.fr-en.en    \
    > ~/corpus/news-commentary-v8.fr-en.tok.en
 ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l fr \
    < ~/corpus/training/news-commentary-v8.fr-en.fr    \
    > ~/corpus/news-commentary-v8.fr-en.tok.fr

The truecaser first requires training, in order to extract some statistics about the text:

 ~/mosesdecoder/scripts/recaser/train-truecaser.perl \
     --model ~/corpus/truecase-model.en --corpus     \
     ~/corpus/news-commentary-v8.fr-en.tok.en
 ~/mosesdecoder/scripts/recaser/train-truecaser.perl \
     --model ~/corpus/truecase-model.fr --corpus     \
     ~/corpus/news-commentary-v8.fr-en.tok.fr

Truecasing uses another script from the Moses distribution:

 ~/mosesdecoder/scripts/recaser/truecase.perl \
   --model ~/corpus/truecase-model.en         \
   < ~/corpus/news-commentary-v8.fr-en.tok.en \
   > ~/corpus/news-commentary-v8.fr-en.true.en
 ~/mosesdecoder/scripts/recaser/truecase.perl \
   --model ~/corpus/truecase-model.fr         \
   < ~/corpus/news-commentary-v8.fr-en.tok.fr \
   > ~/corpus/news-commentary-v8.fr-en.true.fr

Finally we clean, limiting sentence length to 80:

 ~/mosesdecoder/scripts/training/clean-corpus-n.perl \
    ~/corpus/news-commentary-v8.fr-en.true fr en \
    ~/corpus/news-commentary-v8.fr-en.clean 1 80

Notice that the last command processes both sides at once.

Language Model Training

The language model (LM) is used to ensure fluent output, so it is built with the target language (i.e. English in this case). The IRSTLM documentation gives a full explanation of the command-line options, but the following will build an appropriate 3-gram language model, removing singletons, smoothing with improved Kneser-Ney, and adding sentence boundary symbols:

 mkdir ~/lm
 cd ~/lm
 ~/irstlm/bin/add-start-end.sh                 \
   < ~/corpus/news-commentary-v8.fr-en.true.en \
   > news-commentary-v8.fr-en.sb.en
 export IRSTLM=$HOME/irstlm; ~/irstlm/bin/build-lm.sh \
   -i news-commentary-v8.fr-en.sb.en                  \
   -t ./tmp  -p -s improved-kneser-ney -o news-commentary-v8.fr-en.lm.en
 ~/irstlm/bin/compile-lm  \
   --text=yes \
   news-commentary-v8.fr-en.lm.en.gz \
   news-commentary-v8.fr-en.arpa.en

Then you should binarise (for faster loading) the *.arpa.en file using KenLM:

 ~/mosesdecoder/bin/build_binary \
   news-commentary-v8.fr-en.arpa.en \
   news-commentary-v8.fr-en.blm.en

(Note that IRSTLM also has a binary format, which Moses supports. See the IRSTLM documentation for more information. For simplicity we only describe one approach here.)

You can check the language model by querying it, e.g.

 $ echo "is this an English sentence ?"                       \
   | ~/mosesdecoder/bin/query news-commentary-v8.fr-en.blm.en
 Loading statistics:
 Name:query      VmPeak:46788 kB VmRSS:30828 kB  RSSMax:0 kB  \
      user:0  sys:0   CPU:0   real:0.012207
 is=35 2 -2.6704 this=287 3 -0.889896    an=295 3 -2.25226    \
     English=7286 1 -5.27842 sentence=4470 2 -2.69906         \
     ?=65 1 -3.32728 </s>=21 2 -0.0308115    Total: -17.1481 OOV: 0

 After queries:
 Name:query      VmPeak:46796 kB VmRSS:30828 kB  RSSMax:0 kB  \   
      user:0  sys:0   CPU:0   real:0.0129395
 Total time including destruction:
 Name:query      VmPeak:46796 kB VmRSS:1532 kB   RSSMax:0 kB  \   
      user:0  sys:0   CPU:0   real:0.0166016

Training the Translation System

Finally we come to the main event - training the translation model. To do this, we run word alignment (using GIZA++), phrase extraction and scoring, and create the lexicalised reordering tables and the Moses configuration file, all with a single command. I recommend that you create an appropriate directory as follows, and then run the training command, catching logs:

 mkdir ~/working
 cd ~/working
 nohup nice ~/mosesdecoder/scripts/training/train-model.perl -root-dir train \
 -corpus ~/corpus/news-commentary-v8.fr-en.clean                             \
 -f fr -e en -alignment grow-diag-final-and -reordering msd-bidirectional-fe \
 -lm 0:3:$HOME/lm/news-commentary-v8.fr-en.blm.en:8                          \
 -external-bin-dir ~/mosesdecoder/tools >& training.out &

If you have a multi-core machine it's worth using the -cores argument to encourage as much parallelisation as possible.
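With that flag added, the training command would look something like this (a sketch: the -cores value of 4 is just an example and should match your machine, and available parallelisation options vary between Moses versions, so check the train-model.perl documentation for your installation):

```shell
nohup nice ~/mosesdecoder/scripts/training/train-model.perl -root-dir train \
 -corpus ~/corpus/news-commentary-v8.fr-en.clean                            \
 -f fr -e en -alignment grow-diag-final-and -reordering msd-bidirectional-fe \
 -lm 0:3:$HOME/lm/news-commentary-v8.fr-en.blm.en:8                         \
 -external-bin-dir ~/mosesdecoder/tools                                     \
 -cores 4 >& training.out &
```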

This took about 1.5 hours using 2 cores on a powerful laptop (Intel i7-2640M, 8GB RAM, SSD). Once it's finished there should be a moses.ini file in the directory ~/working/train/model. You can use the model specified by this ini file to decode (i.e. translate), but there are a couple of problems with it. The first is that it's very slow to load, but we can fix that by binarising the phrase table and reordering table, i.e. compiling them into a format that can be loaded quickly. The second problem is that the weights used by Moses to weight the different models against each other are not optimised - if you look at the moses.ini file you'll see that they're set to default values like 0.2, 0.3 etc. To find better weights we need to tune the translation system, which leads us on to the next step...

Tuning

This is the slowest part of the process, so you might want to line up something to read whilst it's progressing. Tuning requires a small amount of parallel data, separate from the training data, so again we'll download some data kindly provided by WMT. Run the following commands (from your home directory again) to download the data and put it in a sensible place.

 cd ~/corpus
 wget http://www.statmt.org/wmt12/dev.tgz
 tar zxvf dev.tgz

We're going to use news-test2008 for tuning, so we have to tokenise and truecase it first (don't forget to use the correct language pair if you're not building a fr->en system):

 cd ~/corpus
 ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en \
   < dev/news-test2008.en > news-test2008.tok.en
 ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l fr \
   < dev/news-test2008.fr > news-test2008.tok.fr
 ~/mosesdecoder/scripts/recaser/truecase.perl --model truecase-model.en \
   < news-test2008.tok.en > news-test2008.true.en
 ~/mosesdecoder/scripts/recaser/truecase.perl --model truecase-model.fr \
   < news-test2008.tok.fr > news-test2008.true.fr

Now go back to the directory we used for training, and launch the tuning process:

 cd ~/working
 nohup nice ~/mosesdecoder/scripts/training/mert-moses.pl \
  ~/corpus/news-test2008.true.fr ~/corpus/news-test2008.true.en \
  ~/mosesdecoder/bin/moses train/model/moses.ini --mertdir ~/mosesdecoder/bin/ \
  &> mert.out &

If you have several cores at your disposal, then it'll be a lot faster to run Moses multi-threaded. Add --decoder-flags="-threads 4" to the last line above in order to run the decoder with 4 threads. With this setting, tuning took about 4 hours for me.
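With that flag added, the full tuning command would look something like this (the same command as above; a thread count of 4 is just an example and should match the number of cores you have available):

```shell
cd ~/working
nohup nice ~/mosesdecoder/scripts/training/mert-moses.pl \
 ~/corpus/news-test2008.true.fr ~/corpus/news-test2008.true.en \
 ~/mosesdecoder/bin/moses train/model/moses.ini --mertdir ~/mosesdecoder/bin/ \
 --decoder-flags="-threads 4" &> mert.out &
```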

The end result of tuning is an ini file with trained weights, which should be in ~/working/mert-work/moses.ini if you've used the same directory structure as me.

Testing

You can now run Moses with

 ~/mosesdecoder/bin/moses -f ~/working/mert-work/moses.ini

and type in your favourite French sentence to see the results. You'll notice, though, that the decoder takes at least a couple of minutes to start up. In order to make it start quickly, we can binarise the phrase-table and lexicalised reordering models. To do this, create a suitable directory and binarise the models as follows:

 mkdir ~/working/binarised-model
 cd ~/working
 ~/mosesdecoder/bin/processPhraseTable \
   -ttable 0 0 train/model/phrase-table.gz \
   -nscores 5 -out binarised-model/phrase-table
 ~/mosesdecoder/bin/processLexicalTable \
   -in train/model/reordering-table.wbe-msd-bidirectional-fe.gz \
   -out binarised-model/reordering-table

Then make a copy of the ~/working/mert-work/moses.ini in the binarised-model directory and change the phrase and reordering tables to point to the binarised versions, as follows:

  1. Change PhraseDictionaryMemory to PhraseDictionaryBinary
  2. Set the path of the PhraseDictionary feature to point to $HOME/working/binarised-model/phrase-table
  3. Set the path of the LexicalReordering feature to point to $HOME/working/binarised-model/reordering-table
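After those edits, the two feature lines in your copy of moses.ini should look something like the following (the username in the paths is obviously a placeholder; keep all the other parameters exactly as they were in your original file):

```
PhraseDictionaryBinary name=TranslationModel0 table-limit=20 num-features=5 path=/home/username/working/binarised-model/phrase-table input-factor=0 output-factor=0
LexicalReordering name=LexicalReordering0 num-features=6 type=wbe-msd-bidirectional-fe-allff input-factor=0 output-factor=0 path=/home/username/working/binarised-model/reordering-table
```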

Loading and running a translation is pretty fast (for this I supplied the French sentence "faire revenir les militants sur le terrain et convaincre que le vote est utile ."):

 Defined parameters (per moses.ini or switch):
 config: binarised-model/moses.ini 
 distortion-limit: 6 
 feature: UnknownWordPenalty WordPenalty PhraseDictionaryBinary          \
 name=TranslationModel0 table-limit=20 num-features=5                    \
 path=/home/bhaddow/working/binarised-model/phrase-table                 \
 input-factor=0 output-factor=0 
 LexicalReordering name=LexicalReordering0                               \
 num-features=6 type=wbe-msd-bidirectional-fe-allff                      \
 input-factor=0 output-factor=0                                          \
 path=/home/bhaddow/working/binarised-model/reordering-table 
 Distortion KENLM lazyken=0 name=LM0                                     \
 factor=0 path=/home/bhaddow/lm/news-commentary-v8.fr-en.blm.en order=3 
 input-factors: 0 
 mapping: 0 T 0 
 weight: LexicalReordering0= 0.119327 0.0221822 0.0359108                \
 0.107369 0.0448086 0.100852 Distortion0= 0.0682159                      \
 LM0= 0.0794234 WordPenalty0= -0.0314219 TranslationModel0= 0.0477904    \ 
 0.0621766 0.0931993 0.0394201 0.147903 
 /home/bhaddow/mosesdecoder/bin
 line=UnknownWordPenalty
 FeatureFunction: UnknownWordPenalty0 start: 0 end: 0
 line=WordPenalty
 FeatureFunction: WordPenalty0 start: 1 end: 1
 line=PhraseDictionaryBinary name=TranslationModel0 table-limit=20       \
 num-features=5 path=/home/bhaddow/working/binarised-model/phrase-table  \
 input-factor=0 output-factor=0
 FeatureFunction: TranslationModel0 start: 2 end: 6  
 line=LexicalReordering name=LexicalReordering0 num-features=6           \
 type=wbe-msd-bidirectional-fe-allff input-factor=0 output-factor=0      \
 path=/home/bhaddow/working/binarised-model/reordering-table
 FeatureFunction: LexicalReordering0 start: 7 end: 12
 Initializing LexicalReordering..
 line=Distortion
 FeatureFunction: Distortion0 start: 13 end: 13
 line=KENLM lazyken=0 name=LM0 factor=0 \
 path=/home/bhaddow/lm/news-commentary-v8.fr-en.blm.en order=3
 FeatureFunction: LM0 start: 14 end: 14
 binary file loaded, default OFF_T: -1
 IO from STDOUT/STDIN
 Created input-output object : [0.000] seconds
 Translating line 0  in thread id 140592965015296
 Translating: faire revenir les militants sur le terrain et              \
 convaincre que le vote est utile . 
 reading bin ttable
 size of OFF_T 8
 binary phrasefile loaded, default OFF_T: -1
 binary file loaded, default OFF_T: -1
 Line 0: Collecting options took 0.000 seconds
 Line 0: Search took 1.000 seconds
 bring activists on the ground and convince that the vote is useful . 
 BEST TRANSLATION: bring activists on the ground and convince that       \
 the vote is useful . [111111111111111]  [total=-8.127]                  \
 core=(0.000,-13.000,-10.222,-21.472,-4.648,-14.567,6.999,-2.895,0.000,  \
 0.000,-3.230,0.000,0.000,0.000,-76.142)  
 Line 0: Translation took 1.000 seconds total
 Name:moses VmPeak:214408 kB VmRSS:74748 kB                              \
 RSSMax:0 kB user:0.000 sys:0.000 CPU:0.000 real:1.031

The translation ("bring activists on the ground and convince that the vote is useful .") is quite rough, but understandable - bear in mind this is a very small data set for general-domain translation. Also note that your results may differ slightly due to non-determinism in the tuning process.

At this stage, you're probably wondering how good the translation system is. To measure this, we use another parallel data set (the test set), distinct from the ones we've used so far. Let's pick newstest2011, so first we have to tokenise and truecase it as before:

 cd ~/corpus
 ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en \
   < dev/newstest2011.en > newstest2011.tok.en
 ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l fr \
   < dev/newstest2011.fr > newstest2011.tok.fr
 ~/mosesdecoder/scripts/recaser/truecase.perl --model truecase-model.en \
   < newstest2011.tok.en > newstest2011.true.en
 ~/mosesdecoder/scripts/recaser/truecase.perl --model truecase-model.fr \
   < newstest2011.tok.fr > newstest2011.true.fr

The model that we've trained can then be filtered for this test set, meaning that we only retain the entries needed to translate the test set. This will make the translation a lot faster.

 cd ~/working
 ~/mosesdecoder/scripts/training/filter-model-given-input.pl             \
   filtered-newstest2011 mert-work/moses.ini ~/corpus/newstest2011.true.fr \
   -Binarizer ~/mosesdecoder/bin/processPhraseTable

You can test the decoder by first translating the test set (takes a wee while) then running the BLEU script on it:

 nohup nice ~/mosesdecoder/bin/moses            \
   -f ~/working/filtered-newstest2011/moses.ini   \
   < ~/corpus/newstest2011.true.fr                \
   > ~/working/newstest2011.translated.en         \
   2> ~/working/newstest2011.out 
 ~/mosesdecoder/scripts/generic/multi-bleu.perl \
   -lc ~/corpus/newstest2011.true.en              \
   < ~/working/newstest2011.translated.en

This gives me a BLEU score of 23.5 (in comparison, the best result at WMT11 was 30.5, although it should be cautioned that that result uses NIST BLEU, which does its own tokenisation, so there will be 1-2 points' difference in the score anyway).

Experiment Management System (EMS)

If you've been through the effort of typing in all the commands, then by now you're probably wondering if there's an easier way. If you've skipped straight down here without bothering about the manual route then, well, you may have missed out on a useful Moses "rite of passage".

The easier way is, of course, to use the EMS. To use EMS, you'll have to install a few dependencies, as detailed on the EMS page, and then you'll need this config file. Make a directory ~/working/experiments and place the config file in there. If you open it up, you'll see the home-dir variable defined at the top - make the obvious change there. If you set the home directory, and download the train, tune and test data and place it in the locations described above, then this config file should work.

To run EMS from the experiments directory, you can use the command:

 nohup nice ~/mosesdecoder/scripts/ems/experiment.perl -config config -exec &> log &

then sit back and wait for the BLEU score to appear in evaluation/report.1

Page last modified on November 19, 2014, at 09:49 AM