Moses supports multi-threaded operation, enabling faster decoding on multi-core machines. The current limitations of multi-threaded Moses are:
Multi-threaded Moses is now built by default. If you omit the -threads argument, then Moses will use a single worker thread, and a thread to read the input stream. Using the argument -threads n specifies a pool of n threads, and -threads all will use all the cores on the machine.
The single most important thing you need to run Moses fast is MEMORY. Lots of MEMORY. (For example, the Edinburgh group have servers with 144GB of RAM.) The rest of this section gives details on how to make training and decoding run fast.
Calculate total file size of the binary phrase tables, binary language models and binary reordering models.
% ll -h phrase-table.0-0.1.1.binphr.*
-rw-r--r-- 1 s0565741 users 157K 2012-06-13 12:41 phrase-table.0-0.1.1.binphr.idx
-rw-r--r-- 1 s0565741 users 5.4M 2012-06-13 12:41 phrase-table.0-0.1.1.binphr.srctree
-rw-r--r-- 1 s0565741 users 282K 2012-06-13 12:41 phrase-table.0-0.1.1.binphr.srcvoc
-rw-r--r-- 1 s0565741 users 1.1G 2012-06-13 12:41 phrase-table.0-0.1.1.binphr.tgtdata
-rw-r--r-- 1 s0565741 users 1.7M 2012-06-13 12:41 phrase-table.0-0.1.1.binphr.tgtvoc
% ll -h reordering-table.1.wbe-msd-bidirectional-fe.binlexr.*
-rw-r--r-- 1 s0565741 users 157K 2012-06-13 13:36 reordering-table.1.wbe-msd-bidirectional-fe.binlexr.idx
-rw-r--r-- 1 s0565741 users 1.1G 2012-06-13 13:36 reordering-table.1.wbe-msd-bidirectional-fe.binlexr.srctree
-rw-r--r-- 1 s0565741 users 1.1G 2012-06-13 13:36 reordering-table.1.wbe-msd-bidirectional-fe.binlexr.tgtdata
-rw-r--r-- 1 s0565741 users 282K 2012-06-13 13:36 reordering-table.1.wbe-msd-bidirectional-fe.binlexr.voc0
-rw-r--r-- 1 s0565741 users 1.7M 2012-06-13 13:36 reordering-table.1.wbe-msd-bidirectional-fe.binlexr.voc1
% ll -h interpolated-binlm.1
-rw-r--r-- 1 s0565741 users 28G 2012-06-15 11:07 interpolated-binlm.1
The total size of these files is approx. 31GB. Therefore, a translation system using these models requires 31GB (+ roughly 500MB) of memory to run fast.
cat phrase-table.0-0.1.1.binphr.* > /dev/null
cat reordering-table.1.wbe-msd-bidirectional-fe.binlexr.* > /dev/null
cat interpolated-binlm.1 > /dev/null
This forces the operating system to cache the binary models in memory, minimizing page faults while the decoder is running. Other memory-intensive processes should not be running on the computer, otherwise the file-system cache may be reduced.
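The two steps above (adding up the model sizes and reading the files once) can be combined in one small helper. This is only a sketch: warm_cache is a hypothetical name, not part of Moses, and the demo uses two throwaway files rather than real models.

```shell
# Hypothetical helper (not part of Moses): read every given file once,
# discard the contents, and print the total number of bytes read.
# Reading the files is what prompts the OS to keep them in its cache.
warm_cache() {
    cat "$@" | wc -c | tr -d ' '
}

# Demo on two small temporary files (5 bytes + 3 bytes).
demo_dir=$(mktemp -d)
printf 'abcde' > "$demo_dir/a.bin"
printf 'xyz' > "$demo_dir/b.bin"
warm_cache "$demo_dir"/*.bin    # prints 8
rm -r "$demo_dir"
```

With real models, the arguments would be the same file patterns as in the cat commands above, and the printed total is the amount of memory the cached models need.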
Moses does a lot of random lookups. If you're running Linux, check that transparent huge pages are enabled. If
cat /sys/kernel/mm/transparent_hugepage/enabled
responds with
[always] madvise never
then transparent huge pages are enabled.
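The check can also be scripted. The sketch below assumes the status line has the bracketed form shown above; thp_active_mode is a hypothetical helper name, and the script only reads the setting, it does not change it.

```shell
# Hypothetical helper: print the active mode from a transparent huge
# pages status line, i.e. the word enclosed in square brackets.
thp_active_mode() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([a-z]*\)\].*/\1/p'
}

thp_file=/sys/kernel/mm/transparent_hugepage/enabled
if [ -r "$thp_file" ]; then
    # Report which of the listed modes is currently selected.
    echo "transparent huge pages: $(thp_active_mode "$(cat "$thp_file")")"
else
    echo "no $thp_file on this system"
fi
```

Anything other than "always" in the output means the setting still needs to be changed.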
On some RedHat/CentOS systems, the status file has a different path and madvise will not appear among the options. If neither file exists, upgrade the kernel to at least 2.6.38 and compile with CONFIG_SPARSEMEM_VMEMMAP. If the file exists, but the square brackets are not around "always", then run
echo always > /sys/kernel/mm/transparent_hugepage/enabled
as root (NB: to use sudo, quote the > character). This setting will not be preserved across reboots, so consider adding it to an init script.
See the manual on binarized and compact phrase tables for a description of how to compact your phrase tables. Everything said above about the standard binary phrase table also holds for the compact versions. The principle is the same: the total size of the binary files determines your memory usage. However, since the combined size of the compact phrase table and the compact reordering model may be up to 10 to 12 times smaller than with the original binary implementations, you will save a corresponding amount of memory. You can also use the --minlexr-memory options to load the tables into memory at Moses start-up instead of using the caching trick described above. This may take some time during warm-up, but may save a lot of time in the long term. If you are concerned about performance, see Junczys-Dowmunt (2012) for a comparison. There is virtually no overhead due to on-the-fly decompression on large-memory systems, and a considerable speed-up on systems with limited memory.
The decoder can run on very little memory, about 200-300MB for phrase-based and 400-500MB for hierarchical decoding (according to Hieu). The decoder can run on an iPhone! And laptops.
However, it will be VERY slow, unless you have very small models or the models are on fast disks such as flash disks.
When word aligning, using MGIZA with multiple threads significantly speeds up word alignment.
To use MGIZA with multiple threads in the Moses training script, add these arguments:
.../train-model.perl -mgiza -mgiza-cpus 8 ....
To enable it in the EMS, add this to the [TRAINING] section
[TRAINING]
training-options = "-mgiza -mgiza-cpus 8"
When running GIZA++ or MGIZA, the first stage involves running a program called snt2cooc. This requires at least 6GB of memory for typical Europarl-size corpora (1.8 million sentences). For users without this amount of memory on their computers, an alternative version is included in MGIZA: snt2cooc.pl.
To use this script, you must copy 2 files to the same place where snt2cooc is run:
Add this argument when running the Moses training script:
.../train-model.perl -snt2cooc snt2cooc.pl
Once word alignment is completed, the phrase table is created from the aligned parallel corpus. There are 2 main ways to speed up this part of the training process.
Firstly, the training corpus and alignment can be split and phrase pairs from each part can be extracted simultaneously. This can be done by simply using the argument
.../train-model.perl -cores 4
Secondly, the Unix sort command is often executed during training. It is essential to optimize this command to make use of the available disk and CPU. For example, recent versions of sort can take the following arguments:
sort -S 10G --batch-size 253 --compress-program gzip --parallel 5
The Moses training script names these arguments
.../train-model.perl -sort-buffer-size 10G -sort-batch-size 253 \
    -sort-compress gzip -sort-parallel 5
You should set these arguments. However, DO NOT just blindly copy the above settings; they must be tuned to the particular computer you are running on. The most important issues are:
* --batch-size arguments have only recently been added to the sort command.
* Take care with -sort-buffer-size. In particular, you should take into account other programs running on the computer. Also, two or three simultaneous sort programs will run (one to sort the extract file, one to sort extract.inv, one to sort extract.o). If there is not enough memory because you've set -sort-buffer-size too high, your entire computer will likely crash.
* The --batch-size argument is OS-dependent. For example, it is 1024 on Linux, 253 on old Mac OSX, 2557 on new OSX.
* --compress-program can occasionally result in the following timeout errors.
gsort: couldn't create process for gzip -d: Operation timed out
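One way to avoid copying settings blindly is to derive them on the machine itself. The sketch below is an illustration under stated assumptions, not part of the Moses scripts: sort_buffer_mb and sort_accepts are hypothetical helpers, the 30000 MB budget is an invented figure for memory you are willing to give to sorting, and passing a size with an M suffix to -sort-buffer-size is assumed to work because the value is handed on to sort.

```shell
# Hypothetical helper: split a memory budget (in MB) evenly across the
# sort processes that may run at the same time (up to three, see above).
sort_buffer_mb() {
    budget_mb=$1
    simultaneous_sorts=${2:-3}
    echo $(( budget_mb / simultaneous_sorts ))
}

# Hypothetical helper: succeed only if the local sort accepts the
# given options (older versions reject the newer ones).
sort_accepts() {
    printf 'b\na\n' | sort "$@" > /dev/null 2>&1
}

# Assumed budget: 30000 MB of memory left over for sorting.
opts="-sort-buffer-size $(sort_buffer_mb 30000)M"
if sort_accepts --parallel 2; then
    opts="$opts -sort-parallel 2"
fi
if sort_accepts --compress-program gzip; then
    opts="$opts -sort-compress gzip"
fi
echo "$opts"
```

The printed string can be appended to the train-model.perl command line; on a machine with an old sort it contains only the buffer size.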
In summary, to maximize speed on a large server with many cores and up-to-date software, add this to your training script:
.../train-model.perl -mgiza -mgiza-cpus 8 -cores 10 \
    -parallel -sort-buffer-size 10G -sort-batch-size 253 \
    -sort-compress gzip -sort-parallel 10
To run on a laptop with limited memory
.../train-model.perl -mgiza -mgiza-cpus 2 -snt2cooc snt2cooc.pl \
    -parallel -sort-batch-size 253 -sort-compress gzip
In the EMS, for large servers, this can be done by adding:
[TRAINING]
script = $moses-script-dir/training/train-model.perl
training-options = "-mgiza -mgiza-cpus 8 -cores 10 \
    -parallel -sort-buffer-size 10G -sort-batch-size 253 \
    -sort-compress gzip -sort-parallel 10"
parallel = yes
For servers with older OSes, and therefore older sort commands:
[TRAINING]
script = $moses-script-dir/training/train-model.perl
training-options = "-mgiza -mgiza-cpus 8 -cores 10 -parallel"
parallel = yes
For laptops with limited memory:
[TRAINING]
script = $moses-script-dir/training/train-model.perl
training-options = "-mgiza -mgiza-cpus 2 -snt2cooc snt2cooc.pl \
    -parallel -sort-batch-size 253 -sort-compress gzip"
parallel = yes
Convert your language model to binary format. This reduces loading time and provides more control.
See the KenLM web site for the time-memory tradeoff presented by the KenLM data structures. Use bin/build_binary (found in the same directory as moses_chart) to convert ARPA files to the binary format. You can preview memory consumption with:
This preview includes only the language model's memory usage, which is in addition to the phrase table etc. For speed, use the default probing data structure.
bin/build_binary file.arpa file.binlm
To save memory, change to the trie data structure
bin/build_binary trie file.arpa file.binlm
To further losslessly compress the trie ("chop" in the benchmarks), use -a 64, which will compress pointers to a depth of up to 64 bits.
bin/build_binary -a 64 trie file.arpa file.binlm
Note that you can also make this parameter smaller which will go faster but use more memory. Quantization will make the trie smaller at the expense of accuracy. You can choose any number of bits from 2 to 25, for example 10:
bin/build_binary -a 64 -q 10 trie file.arpa file.binlm
Note that quantization can be used independently of -a.
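The trie variants above differ only in which optional flags are present. As a rough sketch, a hypothetical helper (trie_build_cmd, not part of KenLM or Moses) can assemble the command line and enforce the 2 to 25 bit range stated above. It only prints the command; it does not run build_binary.

```shell
# Hypothetical helper: print a build_binary command for the trie
# structure. Arguments: ARPA file, output file, optional -a bits,
# optional -q bits (must be between 2 and 25, as stated above).
trie_build_cmd() {
    arpa=$1
    out=$2
    abits=${3:-}
    qbits=${4:-}
    cmd="bin/build_binary"
    if [ -n "$abits" ]; then
        cmd="$cmd -a $abits"
    fi
    if [ -n "$qbits" ]; then
        if [ "$qbits" -lt 2 ] || [ "$qbits" -gt 25 ]; then
            echo "quantization bits must be between 2 and 25" >&2
            return 1
        fi
        cmd="$cmd -q $qbits"
    fi
    echo "$cmd trie $arpa $out"
}

# Pointer compression plus 10-bit quantization.
trie_build_cmd file.arpa file.binlm 64 10
# Quantization alone, without -a.
trie_build_cmd file.arpa file.binlm "" 10
```

The first call reproduces the combined command shown above; the second shows quantization used on its own.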
By default, language models fully load into memory at the beginning. If you are short on memory, you can use on-demand language model loading. The language model must be converted to binary format in advance and should be placed on LOCAL DISK, preferably SSD. For KenLM, you should use the trie data structure, not the probing data structure.
If the LM was binarized using IRSTLM, append .mm to the file name and change the ini file to reflect this. E.g., change
[feature]
IRSTLM .... path=file.lm
to
[feature]
IRSTLM .... path=file.lm.mm
If the LM was binarized using KenLM, add the argument lazyken=true. E.g., change from
[feature]
KENLM ....
to
[feature]
KENLM .... lazyken=true
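For KenLM, the ini change can be applied mechanically. The following is a minimal sketch, assuming each KENLM feature sits on a single line that starts with KENLM; enable_lazyken is a hypothetical helper name, and the arguments on the sample line are placeholders for whatever your own ini file contains.

```shell
# Hypothetical filter: append lazyken=true to every KENLM feature line
# that does not already set lazyken; all other lines pass unchanged.
enable_lazyken() {
    awk '/^KENLM / && !/lazyken=/ { $0 = $0 " lazyken=true" } { print }'
}

# Demo on a two-line sample (placeholder arguments on the KENLM line).
printf '%s\n' '[feature]' 'KENLM factor=0 order=5 path=file.binlm' | enable_lazyken
```

To convert a real file, redirect it through the filter into a new file (for example, enable_lazyken < moses.ini > moses.lazy.ini) and review the result before use.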
Suffix arrays store the entire parallel corpora and word alignment information in memory, instead of the phrase table. The parallel corpora and alignment file are often much smaller than the phrase table. For example, for the Europarl German-English (gzipped files):
de = 94MB
en = 84MB
alignment = 57MB
phrase-based = 2.0GB
hierarchical = 16.0GB
Therefore, it is more memory efficient to store the corpus in memory, rather than the entire phrase-table. This is usually structured as a suffix array to enable fast extraction of translations.
Translations are extracted as needed, usually per input test set, or per input sentence.
Moses supports two different implementations of suffix arrays, one for phrase-based models and one for hierarchical models.
Cube pruning limits the number of hypotheses created for each stack (or chart cell in chart decoding). It is essential for chart decoding (otherwise decoding will take a VERY long time) and an option in phrase-based decoding.
In the phrase-based decoder, add:
[search-algorithm]
1
[cube-pruning-pop-limit]
500
There is a speed-quality tradeoff: a lower pop limit means less work for the decoder, so decoding is faster but translation is less accurate.
TODO: MGIZA with reduced memory
The biggest consumers of memory during decoding are typically the models. Here are some links on how to reduce the size of each.
* use KenLM with trie data structure (Moses.Optimize#ntoc14)
* use on-demand loading (Moses.Optimize#ntoc15)
* use phrase table pruning (Advanced.RuleTables#ntoc5)
* use a compact phrase table (http://www.statmt.org/moses/?n=Advanced.RuleTables#ntoc3)
* filter the translation model given the text you want to translate (Moses.SupportTools#ntoc3)
* techniques similar to those for translation models are possible: pruning (Advanced.RuleTables#ntoc3), compact tables (Advanced.RuleTables#ntoc4), and filtering (Moses.SupportTools#ntoc3).
These options can be added to the bjam command line, trading generality for performance.
You should do a full rebuild with -a when changing the values of most of these options.
Don't use factors? Add
Tailor KenLM's maximum order to only what you need. If your highest-order language model has order 5, add
Turn debug symbols off for speed and a little more memory.
But don't expect support from the mailing list until you rerun with debug symbols on!
Don't care about debug messages?
tcmalloc and see BUILD-INSTRUCTIONS.txt in Moses for installation instructions. bjam will automatically detect tcmalloc's presence and link against it for multi-threaded builds.
Install Boost and zlib static libraries. Then link statically:
This may mean you have to install Boost and
Running single-threaded? Add
Using hierarchical or string-to-tree models, but none with source syntax?
Moses has multiple phrase table implementations. The one that suits you best depends on the model you're using (phrase-based or hierarchical/syntax), and how much memory your server has.
Here is a complete list of the types:
Memory - reads the phrase table into memory. For phrase-based models and chart decoding. Note that this is much faster than the Binary and OnDisk phrase table formats, but it uses a lot of RAM.
Binary - the phrase table is converted into a 'database'. Only the translations which are required are loaded into memory. It therefore requires less memory, but is potentially slower to run. For phrase-based models.
OnDisk - reimplementation of Binary for chart decoding.
SuffixArray - stores the parallel training data and word alignment in memory, instead of the phrase table. Extraction is done on the fly. It also has a feature that lets you add parallel data while the decoder is running ('Dynamic Suffix Array'). For phrase-based models. See Levenberg et al. (2010).
ALSuffixArray - Suffix array for hierarchical models. See Lopez (2008).
FuzzyMatch - Implementation of Koehn and Senellart (2010).
Hiero - like SCFG, but translation rules are in standard Hiero-style format
Compact - for phrase-based model. See Junczys-Dowmunt (2012).