This guide assumes that you have successfully installed Moses, and would like to see how to use parallel data to build a real phrase-based translation system. The process requires some familiarity with UNIX and, ideally, access to a Linux server. It can be run on a laptop, but could take about a day and requires at least 2G of RAM, and about 10G of free disk space (these requirements are just educated guesses, so if you have a different experience then please mail support).
If you want to save the effort of typing in all the commands on this page (and see how the pros manage their experiments), then skip straight to the experiment management system instructions below. But I'd recommend that you follow through the process manually, at least once, just to see how it all works.
The minimum software requirements are:
KenLM is included in Moses and the default in the Moses tool-chain. IRSTLM and KenLM are LGPL licensed (like Moses) and therefore available for commercial use.
For the purposes of this guide, I will assume that you're going to install all the tools and data in your home directory (i.e. ~/), and that you've already downloaded and compiled Moses into ~/mosesdecoder. And you're going to run Moses from there.
git clone https://github.com/moses-smt/giza-pp.git
cd giza-pp
make
This should create the binaries ~/giza-pp/GIZA++-v2/GIZA++, ~/giza-pp/GIZA++-v2/snt2cooc.out and ~/giza-pp/mkcls-v2/mkcls. These need to be copied to somewhere that Moses can find them, as follows:
cd ~/mosesdecoder
mkdir tools
cp ~/giza-pp/GIZA++-v2/GIZA++ ~/giza-pp/GIZA++-v2/snt2cooc.out \
  ~/giza-pp/mkcls-v2/mkcls tools
When you come to run the training, you need to tell the training script where GIZA++ was installed using the -external-bin-dir argument:
train-model.perl -external-bin-dir $HOME/mosesdecoder/tools
UPDATE - GIZA++ only compiles with gcc. If you're using OS X Mavericks, you'll have to install gcc yourself. I (Hieu) recommend using MGIZA instead.
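Before moving on, it can be worth confirming that the three binaries really ended up in the tools directory, since a missing one only shows up later when training fails. The following is a minimal sketch; check_giza_tools is a hypothetical helper name of my own, not something shipped with Moses or GIZA++.

```shell
# Report any GIZA++ binary that is missing from a tools directory.
# check_giza_tools is a hypothetical helper, not part of Moses.
check_giza_tools() {
  local dir=${1:-$HOME/mosesdecoder/tools}
  local missing=0 bin
  for bin in GIZA++ snt2cooc.out mkcls; do
    if [ ! -x "$dir/$bin" ]; then
      echo "missing or not executable: $dir/$bin" >&2
      missing=1
    fi
  done
  # Exit status 0 means all three binaries were found.
  return "$missing"
}
```

Called with no argument it checks ~/mosesdecoder/tools, the location used in this guide; pass a different directory as the first argument if you installed elsewhere.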
To train a translation system we need parallel data (text translated into two different languages) which is aligned at the sentence level. Luckily there's plenty of this data freely available, and for this system I'm going to use a small (only 130,000 sentences!) data set released for the 2013 Workshop on Statistical Machine Translation (WMT13). To get the data we want, we have to download the tarball and unpack it (into a corpus directory in our home directory) as follows:
cd
mkdir corpus
cd corpus
wget http://www.statmt.org/wmt13/training-parallel-nc-v8.tgz
tar zxvf training-parallel-nc-v8.tgz
If you look in the ~/corpus/training directory you'll see that there's data from news-commentary (news analysis from Project Syndicate) in various languages. We're going to build a French-English (fr-en) translation system using the news commentary data set, but feel free to use one of the other language pairs if you prefer.
To prepare the data for training the translation system, we have to perform three steps: tokenisation, truecasing and cleaning.
The tokenisation can be run as follows:
~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en \
  < ~/corpus/training/news-commentary-v8.fr-en.en \
  > ~/corpus/news-commentary-v8.fr-en.tok.en
~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l fr \
  < ~/corpus/training/news-commentary-v8.fr-en.fr \
  > ~/corpus/news-commentary-v8.fr-en.tok.fr
The truecaser first requires training, in order to extract some statistics about the text:
~/mosesdecoder/scripts/recaser/train-truecaser.perl \
  --model ~/corpus/truecase-model.en --corpus \
  ~/corpus/news-commentary-v8.fr-en.tok.en
~/mosesdecoder/scripts/recaser/train-truecaser.perl \
  --model ~/corpus/truecase-model.fr --corpus \
  ~/corpus/news-commentary-v8.fr-en.tok.fr
Truecasing uses another script from the Moses distribution:
~/mosesdecoder/scripts/recaser/truecase.perl \
  --model ~/corpus/truecase-model.en \
  < ~/corpus/news-commentary-v8.fr-en.tok.en \
  > ~/corpus/news-commentary-v8.fr-en.true.en
~/mosesdecoder/scripts/recaser/truecase.perl \
  --model ~/corpus/truecase-model.fr \
  < ~/corpus/news-commentary-v8.fr-en.tok.fr \
  > ~/corpus/news-commentary-v8.fr-en.true.fr
Finally we clean, limiting sentence length to 80:
~/mosesdecoder/scripts/training/clean-corpus-n.perl \
  ~/corpus/news-commentary-v8.fr-en.true fr en \
  ~/corpus/news-commentary-v8.fr-en.clean 1 80
Notice that the last command processes both sides at once.
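Because the training data is aligned line by line, the two cleaned files must still have exactly the same number of lines. A quick sanity check along these lines can catch a broken preparation step early; same_line_count is a hypothetical helper name, and the sketch only compares line counts, it does not verify that the sentences actually correspond.

```shell
# Succeed only if two files have the same number of lines.
# same_line_count is a hypothetical helper, not part of Moses.
same_line_count() {
  local a b
  a=$(wc -l < "$1") || return 2   # status 2: first file unreadable
  b=$(wc -l < "$2") || return 2   # status 2: second file unreadable
  [ "$a" -eq "$b" ]
}
```

With the file names used above, you could run same_line_count ~/corpus/news-commentary-v8.fr-en.clean.fr ~/corpus/news-commentary-v8.fr-en.clean.en and check the exit status before starting training.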
The language model (LM) is used to ensure fluent output, so it is built with the target language (i.e. English in this case). The KenLM documentation gives a full explanation of the command-line options, but the following will build an appropriate 3-gram language model.
mkdir ~/lm
cd ~/lm
~/mosesdecoder/bin/lmplz -o 3 <~/corpus/news-commentary-v8.fr-en.true.en > news-commentary-v8.fr-en.arpa.en
Then you should binarise (for faster loading) the *.arpa.en file using KenLM:
~/mosesdecoder/bin/build_binary \
  news-commentary-v8.fr-en.arpa.en \
  news-commentary-v8.fr-en.blm.en
(Note that you can also use IRSTLM, which also has a binary format that Moses supports. See the IRSTLM documentation for more information. For simplicity we only describe one approach here.)
You can check the language model by querying it, e.g.
$ echo "is this an English sentence ?" \
  | ~/mosesdecoder/bin/query news-commentary-v8.fr-en.blm.en
Loading statistics:
Name:query VmPeak:46788 kB VmRSS:30828 kB RSSMax:0 kB \
  user:0 sys:0 CPU:0 real:0.012207
is=35 2 -2.6704 this=287 3 -0.889896 an=295 3 -2.25226 \
  English=7286 1 -5.27842 sentence=4470 2 -2.69906 \
  ?=65 1 -3.32728 </s>=21 2 -0.0308115 Total: -17.1481 OOV: 0
After queries:
Name:query VmPeak:46796 kB VmRSS:30828 kB RSSMax:0 kB \
  user:0 sys:0 CPU:0 real:0.0129395
Total time including destruction:
Name:query VmPeak:46796 kB VmRSS:1532 kB RSSMax:0 kB \
  user:0 sys:0 CPU:0 real:0.0166016
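To see how the Total figure relates to the rest of the scored line, it helps to add up the per-word values by hand. The sketch below assumes each scored token has the form word=vocab_id followed by the matched n-gram length and a log10 probability, which is my reading of the output above rather than something this guide spells out; sum_logprobs is a hypothetical helper name. Fed the scored line from the example, the sum comes out at the reported Total of -17.1481.

```shell
# Sum the per-word log10 probabilities on a scored line from the query tool.
# Assumed token layout: word=vocab_id ngram_length log10_prob
# sum_logprobs is a hypothetical helper, not part of KenLM or Moses.
sum_logprobs() {
  awk '{
    sum = 0
    for (i = 1; i + 2 <= NF; i++)
      if ($i ~ /=[0-9]+$/)      # a word=vocab_id token
        sum += $(i + 2)         # its log10 probability is two fields on
    printf "%.4f\n", sum
  }'
}
```

The function reads the scored line on standard input, so you can pipe it in with echo or grep it out of saved query output. Words that themselves end in an equals sign plus digits would confuse this simple pattern.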
Finally we come to the main event - training the translation model. To do this, we run word-alignment (using GIZA++), phrase extraction and scoring, create lexicalised reordering tables and create your Moses configuration file, all with a single command. I recommend that you create an appropriate directory as follows, and then run the training command, catching logs:
mkdir ~/working
cd ~/working
nohup nice ~/mosesdecoder/scripts/training/train-model.perl -root-dir train \
  -corpus ~/corpus/news-commentary-v8.fr-en.clean \
  -f fr -e en -alignment grow-diag-final-and -reordering msd-bidirectional-fe \
  -lm 0:3:$HOME/lm/news-commentary-v8.fr-en.blm.en:8 \
  -external-bin-dir ~/mosesdecoder/tools >& training.out &
If you have a multi-core machine it's worth using the -cores argument to encourage as much parallelisation as possible.
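If you don't know how many cores the machine has, you can ask the system rather than guess. This is a small sketch using getconf, which is available on Linux and OS X; detect_cores is a hypothetical helper name, and it falls back to 1 if the query fails.

```shell
# Print the number of online CPU cores, falling back to 1.
# detect_cores is a hypothetical helper, not part of Moses.
detect_cores() {
  local n
  n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || n=1
  case $n in
    ''|*[!0-9]*) n=1 ;;   # empty or non-numeric answer: play safe
  esac
  echo "$n"
}
```

You could then append -cores "$(detect_cores)" to the train-model.perl command, or pass a smaller number if you want to keep the machine usable while training runs.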
This took about 1.5 hours using 2 cores on a powerful laptop (Intel i7-2640M, 8GB RAM, SSD). Once it's finished there should be a moses.ini file in the directory ~/working/train/model. You can use the model specified by this ini file to decode (i.e. translate), but there are a couple of problems with it. The first is that it's very slow to load, but we can fix that by binarising the phrase table and reordering table, i.e. compiling them into a format that can be loaded quickly. The second problem is that the weights used by Moses to weight the different models against each other are not optimised - if you look at the moses.ini file you'll see that they're set to default values like 0.2, 0.3 etc. To find better weights we need to tune the translation system, which leads us on to the next step...
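To look at those default weights without opening the whole file, you can print just the weight section. The sketch below assumes the ini file is divided into sections introduced by bracketed headers and that the weights live under a header spelled [weight]; show_weights is a hypothetical helper name.

```shell
# Print the non-empty lines of the [weight] section of a moses.ini file.
# Assumes bracketed section headers; show_weights is a hypothetical helper.
show_weights() {
  awk '
    /^\[weight\]/ { in_section = 1; next }   # the weight section starts here
    /^\[/         { in_section = 0 }         # any other header ends it
    in_section && NF                         # print non-empty lines inside it
  ' "$1"
}
```

Running show_weights ~/working/train/model/moses.ini now, and again on the tuned ini file after the next step, makes it easy to compare the default weights with the optimised ones.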
This is the slowest part of the process, so you might want to line up something to read whilst it's progressing. Tuning requires a small amount of parallel data, separate from the training data, so again we'll download some data kindly provided by WMT. Run the following commands (from your home directory again) to download the data and put it in a sensible place.
cd ~/corpus
wget http://www.statmt.org/wmt12/dev.tgz
tar zxvf dev.tgz
We're going to use news-test2008 for tuning, so we have to tokenise and truecase it first (don't forget to use the correct language if you're not building a fr->en system):
cd ~/corpus
~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en \
  < dev/news-test2008.en > news-test2008.tok.en
~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l fr \
  < dev/news-test2008.fr > news-test2008.tok.fr
~/mosesdecoder/scripts/recaser/truecase.perl --model truecase-model.en \
  < news-test2008.tok.en > news-test2008.true.en
~/mosesdecoder/scripts/recaser/truecase.perl --model truecase-model.fr \
  < news-test2008.tok.fr > news-test2008.true.fr
Now go back to the directory we used for training, and launch the tuning process:
cd ~/working
nohup nice ~/mosesdecoder/scripts/training/mert-moses.pl \
  ~/corpus/news-test2008.true.fr ~/corpus/news-test2008.true.en \
  ~/mosesdecoder/bin/moses train/model/moses.ini --mertdir ~/mosesdecoder/bin/ \
  &> mert.out &
If you have several cores at your disposal, then it'll be a lot faster to run Moses multi-threaded. Add --decoder-flags="-threads 4" to the last line above in order to run the decoder with 4 threads. With this setting, tuning took about 4 hours for me.
The end result of tuning is an ini file with trained weights, which should be in ~/working/mert-work/moses.ini if you've used the same directory structure as me.
You can now run Moses with
~/mosesdecoder/bin/moses -f ~/working/mert-work/moses.ini
and type in your favourite French sentence to see the results. You'll notice, though, that the decoder takes at least a couple of minutes to start up. In order to make it start quickly, we can binarise the phrase-table and lexicalised reordering models. To do this, create a suitable directory and binarise the models as follows:
mkdir ~/working/binarised-model
cd ~/working
~/mosesdecoder/bin/processPhraseTableMin \
  -in train/model/phrase-table.gz -nscores 4 \
  -out binarised-model/phrase-table
~/mosesdecoder/bin/processLexicalTableMin \
  -in train/model/reordering-table.wbe-msd-bidirectional-fe.gz \
  -out binarised-model/reordering-table
Note: If you get the error ...~/mosesdecoder/bin/processPhraseTableMin: No such file or directory, please make sure to compile Moses with CMPH.
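Then make a copy of ~/working/mert-work/moses.ini in the binarised-model directory and change the phrase and reordering tables to point to the binarised versions, as follows:
Set the path of the PhraseDictionary feature to point to the binarised phrase table in ~/working/binarised-model (the feature appears as PhraseDictionaryCompact in the decoder output shown below).
Set the path of the LexicalReordering feature to point to the binarised reordering table in ~/working/binarised-model.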
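If you'd rather script the edit than do it by hand, something like the following sed call can write the modified copy. Treat it as a sketch under assumptions: it supposes that each feature line in the ini file starts with the feature name and carries a path=... field (as the decoder output further down suggests), and that the tuned file names the phrase table feature PhraseDictionaryMemory, which this guide does not state. use_binarised_tables is a hypothetical helper name.

```shell
# Write a copy of a moses.ini whose tables point at binarised versions.
# Assumptions: feature lines start with the feature name and contain a
# path=... field; the input uses PhraseDictionaryMemory (not confirmed here).
# use_binarised_tables is a hypothetical helper, not part of Moses.
use_binarised_tables() {
  local in_ini=$1 out_ini=$2 phrase_path=$3 reorder_path=$4
  sed -e 's|^PhraseDictionaryMemory |PhraseDictionaryCompact |' \
      -e "/^PhraseDictionaryCompact /s|path=[^ ]*|path=$phrase_path|" \
      -e "/^LexicalReordering /s|path=[^ ]*|path=$reorder_path|" \
      "$in_ini" > "$out_ini"
}
```

The first two arguments are the tuned ini file and the new file to write; the last two are the table paths you want in the new file. Paths containing a | or & character would need escaping. Afterwards, compare the two ini files with diff to check that only the intended lines changed.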
Loading and running a translation is pretty fast (for this I supplied the French sentence "faire revenir les militants sur le terrain et convaincre que le vote est utile ."):
Defined parameters (per moses.ini or switch):
config: binarised-model/moses.ini
distortion-limit: 6
feature: UnknownWordPenalty WordPenalty PhraseDictionaryCompact \
  name=TranslationModel0 table-limit=20 num-features=5 \
  path=/home/bhaddow/working/binarised-model/phrase-table \
  input-factor=0 output-factor=0 LexicalReordering name=LexicalReordering0 \
  num-features=6 type=wbe-msd-bidirectional-fe-allff \
  input-factor=0 output-factor=0 \
  path=/home/bhaddow/working/binarised-model/reordering-table
  Distortion KENLM lazyken=0 name=LM0 \
  factor=0 path=/home/bhaddow/lm/news-commentary-v8.fr-en.blm.en order=3
input-factors: 0
mapping: 0 T 0
weight: LexicalReordering0= 0.119327 0.0221822 0.0359108 \
  0.107369 0.0448086 0.100852 Distortion0= 0.0682159 \
  LM0= 0.0794234 WordPenalty0= -0.0314219 TranslationModel0= 0.0477904 \
  0.0621766 0.0931993 0.0394201 0.147903
/home/bhaddow/mosesdecoder/bin
line=UnknownWordPenalty
FeatureFunction: UnknownWordPenalty0 start: 0 end: 0
line=WordPenalty
FeatureFunction: WordPenalty0 start: 1 end: 1
line=PhraseDictionaryCompact name=TranslationModel0 table-limit=20 \
  num-features=5 path=/home/bhaddow/working/binarised-model/phrase-table \
  input-factor=0 output-factor=0
FeatureFunction: TranslationModel0 start: 2 end: 6
line=LexicalReordering name=LexicalReordering0 num-features=6 \
  type=wbe-msd-bidirectional-fe-allff input-factor=0 output-factor=0 \
  path=/home/bhaddow/working/binarised-model/reordering-table
FeatureFunction: LexicalReordering0 start: 7 end: 12
Initializing LexicalReordering..
line=Distortion
FeatureFunction: Distortion0 start: 13 end: 13
line=KENLM lazyken=0 name=LM0 factor=0 \
  path=/home/bhaddow/lm/news-commentary-v8.fr-en.blm.en order=3
FeatureFunction: LM0 start: 14 end: 14
binary file loaded, default OFF_T: -1
IO from STDOUT/STDIN
Created input-output object : [0.000] seconds
Translating line 0 in thread id 140592965015296
Translating: faire revenir les militants sur le terrain et \
  convaincre que le vote est utile .
reading bin ttable
size of OFF_T 8
binary phrasefile loaded, default OFF_T: -1
binary file loaded, default OFF_T: -1
Line 0: Collecting options took 0.000 seconds
Line 0: Search took 1.000 seconds
bring activists on the ground and convince that the vote is useful .
BEST TRANSLATION: bring activists on the ground and convince that \
  the vote is useful .  [total=-8.127] \
  core=(0.000,-13.000,-10.222,-21.472,-4.648,-14.567,6.999,-2.895,0.000, \
  0.000,-3.230,0.000,0.000,0.000,-76.142)
Line 0: Translation took 1.000 seconds total
Name:moses VmPeak:214408 kB VmRSS:74748 kB \
  RSSMax:0 kB user:0.000 sys:0.000 CPU:0.000 real:1.031
The translation ("bring activists on the ground and convince that the vote is useful .") is quite rough, but understandable - bear in mind this is a very small data set for general-domain translation. Also note that your results may differ slightly due to non-determinism in the tuning process.
At this stage, you're probably wondering how good the translation system is. To measure this, we use another parallel data set (the test set) distinct from the ones we've used so far. Let's pick newstest2011, and so first we have to tokenise and truecase it as before:
cd ~/corpus
~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en \
  < dev/newstest2011.en > newstest2011.tok.en
~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l fr \
  < dev/newstest2011.fr > newstest2011.tok.fr
~/mosesdecoder/scripts/recaser/truecase.perl --model truecase-model.en \
  < newstest2011.tok.en > newstest2011.true.en
~/mosesdecoder/scripts/recaser/truecase.perl --model truecase-model.fr \
  < newstest2011.tok.fr > newstest2011.true.fr
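Since the tuning set and the test set go through exactly the same tokenise-then-truecase steps, you may prefer to wrap them in a small function rather than retype the four commands. This is only a sketch: prep_set and the MOSES and CORPUS variables are names I've made up for illustration, defaulting to the directory layout used in this guide, and the language pair is hard-wired to en and fr.

```shell
# Tokenise and truecase both sides of one data set.
# prep_set, MOSES and CORPUS are hypothetical names; the defaults follow
# the directory layout assumed in this guide.
prep_set() {
  local name=$1 src_dir=$2
  local moses=${MOSES:-$HOME/mosesdecoder}
  local corpus=${CORPUS:-$HOME/corpus}
  local lang
  for lang in en fr; do
    # Step 1: tokenise the raw text for this language.
    "$moses/scripts/tokenizer/tokenizer.perl" -l "$lang" \
      < "$src_dir/$name.$lang" > "$corpus/$name.tok.$lang" || return 1
    # Step 2: truecase it with the model trained earlier.
    "$moses/scripts/recaser/truecase.perl" \
      --model "$corpus/truecase-model.$lang" \
      < "$corpus/$name.tok.$lang" > "$corpus/$name.true.$lang" || return 1
  done
}
```

With this defined, prep_set newstest2011 ~/corpus/dev would produce the same newstest2011.true.en and newstest2011.true.fr files as the commands above.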
The model that we've trained can then be filtered for this test set, meaning that we only retain the entries needed to translate the test set. This will make the translation a lot faster.
cd ~/working
~/mosesdecoder/scripts/training/filter-model-given-input.pl \
  filtered-newstest2011 mert-work/moses.ini ~/corpus/newstest2011.true.fr \
  -Binarizer ~/mosesdecoder/bin/processPhraseTableMin
You can test the decoder by first translating the test set (takes a wee while) then running the BLEU script on it:
nohup nice ~/mosesdecoder/bin/moses \
  -f ~/working/filtered-newstest2011/moses.ini \
  < ~/corpus/newstest2011.true.fr \
  > ~/working/newstest2011.translated.en \
  2> ~/working/newstest2011.out
~/mosesdecoder/scripts/generic/multi-bleu.perl \
  -lc ~/corpus/newstest2011.true.en \
  < ~/working/newstest2011.translated.en
This gives me a BLEU score of 23.5 (in comparison, the best result at WMT11 was 30.5, although it should be cautioned that this uses NIST BLEU, which does its own tokenisation, so there will be 1-2 points difference in the score anyway).
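If you want to record the score in a script (say, to compare several runs), you can pull the number out of the BLEU script's output. The sketch assumes the result line starts with "BLEU = " followed by the score and a comma, which is how I understand multi-bleu.perl to print it; bleu_score is a hypothetical helper name.

```shell
# Extract the overall score from a result line of the assumed form
#   BLEU = <score>, <further details>
# bleu_score is a hypothetical helper, not part of Moses.
bleu_score() {
  sed -n 's/^BLEU = \([0-9.]*\),.*/\1/p'
}
```

Pipe the output of the multi-bleu.perl command above into bleu_score to get just the number; if the line doesn't match the assumed format, nothing is printed.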
If you've been through the effort of typing in all the commands, then by now you're probably wondering if there's an easier way. If you've skipped straight down here without bothering about the manual route then, well, you may have missed out on a useful Moses "rite of passage".
The easier way is, of course, to use the EMS. To use EMS, you'll have to install a few dependencies, as detailed on the EMS page, and then you'll need this config file. Make a directory ~/working/experiments and place the config file in there. If you open it up, you'll see the home directory variable defined at the top - then make the obvious change. If you set the home directory, download the train, tune and test data and place it in the locations described above, then this config file should work.
To run EMS from the experiments directory, you can use the command:
nohup nice ~/mosesdecoder/scripts/ems/experiment.perl -config config -exec &> log &
then sit back and wait for the BLEU score to appear in