Optimizing Moses

Multi-threaded Moses

Moses supports multi-threaded operation, enabling faster decoding on multi-core machines. The current limitations of multi-threaded Moses are:

 * irstlm is not supported, since it uses a non-threadsafe cache
 * lattice input may not work - this has not been tested
 * increasing the verbosity of Moses will probably cause multi-threaded Moses to crash
 * decoding speed will flatten out after about 16 threads. For more scalable speed with many threads, use Moses2

Multi-threaded Moses is now built by default. If you omit the -threads argument, then Moses will use a single worker thread, and a thread to read the input stream. Using the argument -threads n specifies a pool of n threads, and -threads all will use all the cores on the machine.
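As a minimal sketch of the options above, the following shell function chooses a value for -threads, capping it at 16 because decoding speed flattens out at about that point. The function name pick_threads, the file names moses.ini, input.txt and output.txt, and the use of getconf to count cores are assumptions made for illustration only.

```shell
# Hypothetical helper: choose a value for -threads.
# Caps the count at 16, where decoding speed is reported to flatten out.
pick_threads() {
  cores=$1
  if [ "$cores" -gt 16 ]; then
    echo 16
  else
    echo "$cores"
  fi
}

# Example usage (not run here; assumes moses is on the PATH):
#   n=$(pick_threads "$(getconf _NPROCESSORS_ONLN)")
#   moses -f moses.ini -threads "$n" < input.txt > output.txt
```

On machines with 16 cores or fewer this behaves like -threads all; on larger machines it leaves the remaining cores free for other work.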

How much memory do I need during decoding?

The single most important thing you need to run Moses fast is MEMORY. Lots of MEMORY. (For example, the Edinburgh group have servers with 144GB of RAM.) The rest of this section is just details of how to make the training and decoding run fast.

Calculate the total file size of the binary phrase tables, binary language models, and binary reordering models.

For example,

	% ll -h phrase-table.0-0.1.1.binphr.*
	-rw-r--r-- 1 s0565741 users 157K 2012-06-13 12:41 phrase-table.0-0.1.1.binphr.idx
	-rw-r--r-- 1 s0565741 users 5.4M 2012-06-13 12:41 phrase-table.0-0.1.1.binphr.srctree
	-rw-r--r-- 1 s0565741 users 282K 2012-06-13 12:41 phrase-table.0-0.1.1.binphr.srcvoc
	-rw-r--r-- 1 s0565741 users 1.1G 2012-06-13 12:41 phrase-table.0-0.1.1.binphr.tgtdata
	-rw-r--r-- 1 s0565741 users 1.7M 2012-06-13 12:41 phrase-table.0-0.1.1.binphr.tgtvoc
	% ll -h reordering-table.1.wbe-msd-bidirectional-fe.binlexr.*
	-rw-r--r-- 1 s0565741 users 157K 2012-06-13 13:36 reordering-table.1.wbe-msd-bidirectional-fe.binlexr.idx
	-rw-r--r-- 1 s0565741 users 1.1G 2012-06-13 13:36 reordering-table.1.wbe-msd-bidirectional-fe.binlexr.srctree
	-rw-r--r-- 1 s0565741 users 1.1G 2012-06-13 13:36 reordering-table.1.wbe-msd-bidirectional-fe.binlexr.tgtdata
	-rw-r--r-- 1 s0565741 users 282K 2012-06-13 13:36 reordering-table.1.wbe-msd-bidirectional-fe.binlexr.voc0
	-rw-r--r-- 1 s0565741 users 1.7M 2012-06-13 13:36 reordering-table.1.wbe-msd-bidirectional-fe.binlexr.voc1
	% ll -h interpolated-binlm.1 
	-rw-r--r-- 1 s0565741 users 28G 2012-06-15 11:07 interpolated-binlm.1

The total size of these files is approx. 31GB. Therefore, a translation system using these models requires 31GB (+ roughly 500MB) of memory to run fast.
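The total can also be computed rather than added up by eye. The sketch below sums the sizes of whatever files are passed to it; the function name total_model_bytes is hypothetical, and the file names in the usage comment are simply the ones from the listing above.

```shell
# Hypothetical helper: print the combined size in bytes of the given files.
total_model_bytes() {
  total=0
  for f in "$@"; do
    size=$(wc -c < "$f")      # size of one file in bytes
    total=$((total + size))
  done
  echo "$total"
}

# Example usage with the model files listed above:
#   total_model_bytes phrase-table.0-0.1.1.binphr.* \
#     reordering-table.1.wbe-msd-bidirectional-fe.binlexr.* \
#     interpolated-binlm.1
```

Dividing the result by 1073741824 gives the size in GB, to which the roughly 500MB of overhead should be added.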

I've got this much memory but it's still slow. Why?

Run this:

   cat phrase-table.0-0.1.1.binphr.* > /dev/null
   cat reordering-table.1.wbe-msd-bidirectional-fe.binlexr.* > /dev/null
   cat interpolated-binlm.1 > /dev/null

This forces the operating system to cache the binary models in memory, minimizing page faults while the decoder is running. Other memory-intensive processes should not be running on the computer; otherwise the file-system cache may be reduced.

Use huge pages

Moses does a lot of random lookups. If you're running Linux, check that transparent huge pages are enabled. If

   cat /sys/kernel/mm/transparent_hugepage/enabled

responds with

   [always] madvise never

then transparent huge pages are enabled.

On some RedHat/Centos systems, the file is /sys/kernel/mm/redhat_transparent_hugepage/enabled and madvise will not appear. If neither file exists, upgrade the kernel to at least 2.6.38 and compile with CONFIG_SPARSEMEM_VMEMMAP. If the file exists, but the square brackets are not around "always", then run

   echo always > /sys/kernel/mm/transparent_hugepage/enabled

as root (NB: with sudo, the redirection must be quoted so that the root shell performs it, for example by passing the whole command to sh -c). This setting will not be preserved across reboots, so consider adding it to an init script.
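The check described above can be scripted. This is a sketch only: the function name thp_mode is hypothetical, and it assumes the file content has the bracketed form shown earlier, where the active setting is the word inside the square brackets.

```shell
# Hypothetical helper: print the active setting, i.e. the bracketed word,
# from a line such as "[always] madvise never".
thp_mode() {
  printf '%s\n' "$1" | sed -n 's/.*\[\([a-z]*\)\].*/\1/p'
}

# Example usage on Linux (the path differs on some RedHat/CentOS systems):
#   thp_mode "$(cat /sys/kernel/mm/transparent_hugepage/enabled)"
```

If the function prints anything other than always, transparent huge pages are not fully enabled and the echo command above applies.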

Use the compact phrase and reordering table representations to reduce memory usage by a factor of 10

See the manual on binarized and compact phrase tables for a description of how to compact your phrase tables. Everything said above for the standard binary phrase table also holds for the compact versions. The principle is the same: the total size of the binary files determines your memory usage. Since the combined size of the compact phrase table and the compact reordering model may be up to 10 to 12 times smaller than with the original binary implementations, memory usage shrinks by the same factor. You can also use the --minphr-memory and --minlexr-memory options to load the tables into memory at Moses start-up instead of using the caching trick described above. This may take some time during warm-up, but may save a lot of time in the long term. If you are concerned about performance, see Junczys-Dowmunt (2012) for a comparison. There is virtually no overhead due to on-the-fly decompression on large-memory systems, and a considerable speed-up on systems with limited memory.

How little memory can I get away with during decoding?

The decoder can run on very little memory, about 200-300MB for phrase-based and 400-500MB for hierarchical decoding (according to Hieu). The decoder can run on an iPhone! And laptops.

However, it will be VERY slow, unless you have very small models or the models are on fast disks such as flash disks.

Faster Training

Parallel training

When word aligning, using MGIZA with multiple threads significantly speeds up word alignment.

MGIZA

To use MGIZA with multiple threads in the Moses training script, add these arguments:

   .../train-model.perl -mgiza -mgiza-cpus 8 ....

To enable it in the EMS, add this to the [TRAINING] section

   [TRAINING]
   training-options = "-mgiza -mgiza-cpus 8"

snt2cooc

When running GIZA++ or MGIZA, the first stage involves running a program called

   snt2cooc

This requires approximately 6GB or more of memory for typical Europarl-size corpora (1.8 million sentences). For users without this amount of memory on their computers, an alternative version is included in MGIZA:

   snt2cooc.pl

To use this script, you must copy 2 files to the same place where snt2cooc is run:

   snt2cooc.pl
   snt2coocrmp

Add this argument when running the Moses training script:

   .../train-model.perl -snt2cooc snt2cooc.pl

Parallel Extraction

Once word alignment is completed, the phrase table is created from the aligned parallel corpus. There are 2 main ways to speed up this part of the training process.

Firstly, the training corpus and alignment can be split and phrase pairs from each part can be extracted simultaneously. This can be done by simply using the argument -cores, e.g.,

   .../train-model.perl -cores 4

Secondly, the Unix sort command is often executed during training. It is essential to optimize this command to make use of the available disk and CPU. For example, recent versions of sort can take the following arguments

   sort  -S 10G --batch-size 253 --compress-program gzip --parallel 5

The Moses training script names these arguments

   .../train-model.perl  -sort-buffer-size 10G -sort-batch-size 253 \
     -sort-compress gzip -sort-parallel 5

You should set these arguments. However, DO NOT just blindly copy the above settings; they must be tuned to the particular computer you are running on. The most important issues are:

 * you must make sure the version of sort on your machine supports the arguments you specify, otherwise the script will crash. The --parallel, --compress-program, and --batch-size arguments have only recently been added to the sort command.
 * make sure you have enough memory when setting -sort-buffer-size. In particular, you should take into account other programs running on the computer. Also, two or three sort programs will run simultaneously (one to sort the extract file, one to sort extract.inv, one to sort extract.o). If there is not enough memory because you've set -sort-buffer-size too high, your entire computer will likely crash.
 * the maximum number for the --batch-size argument is OS-dependent. For example, it is 1024 on Linux, 253 on old Mac OSX, 2557 on new OSX.
 * on Mac OSX, using --compress-program can occasionally result in timeout errors such as the following:

     gsort: couldn't create process for gzip -d: Operation timed out
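Whether the local sort accepts a given set of arguments can be checked before training starts. The following is a minimal sketch; the function name sort_supports is hypothetical, and it only tests that sort starts successfully with the given options on empty input, not that the values are well tuned.

```shell
# Hypothetical helper: succeed if the local sort accepts the given options.
# Sorting empty input keeps the check cheap and free of side effects.
sort_supports() {
  sort "$@" < /dev/null > /dev/null 2>&1
}

# Example usage:
#   if sort_supports --parallel 5 --batch-size 253 --compress-program gzip; then
#     echo "sort accepts these options"
#   fi
```

If the check fails, leave the corresponding -sort-* arguments out of the train-model.perl command line.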

Training Summary

In summary, to maximize speed on a large server with many cores and up-to-date software, add this to your training script:

   .../train-model.perl -mgiza -mgiza-cpus 8 -cores 10 \
   -parallel -sort-buffer-size 10G -sort-batch-size 253 \
   -sort-compress gzip -sort-parallel 10

To run on a laptop with limited memory

   .../train-model.perl -mgiza -mgiza-cpus 2 -snt2cooc snt2cooc.pl \
   -parallel -sort-batch-size 253 -sort-compress gzip

In the EMS, for large servers, this can be done by adding:

  [TRAINING]
  script = $moses-script-dir/training/train-model.perl
  training-options = "-mgiza -mgiza-cpus 8 -cores 10 \
    -parallel -sort-buffer-size 10G -sort-batch-size 253 \
    -sort-compress gzip -sort-parallel 10"
  parallel = yes

For servers with older OSes, and therefore older sort commands:

  [TRAINING]
  script = $moses-script-dir/training/train-model.perl
  training-options = "-mgiza -mgiza-cpus 8 -cores 10 -parallel"
  parallel = yes

For laptops with limited memory:

  [TRAINING]
  script = $moses-script-dir/training/train-model.perl
  training-options = "-mgiza -mgiza-cpus 2 -snt2cooc snt2cooc.pl \
    -parallel -sort-batch-size 253 -sort-compress gzip"
  parallel = yes

Language Model

Convert your language model to binary format. This reduces loading time and provides more control.

Building a KenLM binary file

See the KenLM web site for the time-memory tradeoff presented by the KenLM data structures. Use bin/build_binary (found in the same directory as moses and moses_chart) to convert ARPA files to the binary format. You can preview memory consumption with:

  bin/build_binary file.arpa

This preview includes only the language model's memory usage, which is in addition to the phrase table etc. For speed, use the default probing data structure.

  bin/build_binary file.arpa file.binlm

To save memory, change to the trie data structure

  bin/build_binary trie file.arpa file.binlm

To further losslessly compress the trie ("chop" in the benchmarks), use -a 64 which will compress pointers to a depth of up to 64 bits.

  bin/build_binary -a 64 trie file.arpa file.binlm

Note that you can also make this parameter smaller which will go faster but use more memory. Quantization will make the trie smaller at the expense of accuracy. You can choose any number of bits from 2 to 25, for example 10:

  bin/build_binary -a 64 -q 10 trie file.arpa file.binlm

Note that quantization can be used independently of -a.

Loading on-demand

By default, language models fully load into memory at the beginning. If you are short on memory, you can use on-demand language model loading. The language model must be converted to binary format in advance and should be placed on LOCAL DISK, preferably SSD. For KenLM, you should use the trie data structure, not the probing data structure.

If the LM was binarized using IRSTLM, append .mm to the file name and change the ini file to reflect this. E.g., change

  [feature]
  IRSTLM .... path=file.lm

to

  [feature]
  IRSTLM .... path=file.lm.mm

If the LM was binarized using KenLM, add the argument lazyken=true. E.g., change

  [feature]
  KENLM ....

to

  [feature]
  KENLM .... lazyken=true

Suffix array

Suffix arrays store the entire parallel corpus and the word alignment information in memory, instead of the phrase table. The parallel corpus and alignment file are often much smaller than the phrase table. For example, for the Europarl German-English (gzipped files):

   de = 94MB
   en = 84MB
   alignment = 57MB

   phrase-based = 2.0GB
   hierarchical = 16.0GB

Therefore, it is more memory efficient to store the corpus in memory, rather than the entire phrase-table. This is usually structured as a suffix array to enable fast extraction of translations.

Translations are extracted as needed, usually per input test set, or per input sentence.

Moses supports two different implementations of suffix arrays: one for phrase-based models and one for hierarchical models (see AdvancedFeatures#ntoc43).

Cube Pruning

Cube pruning limits the number of hypotheses created for each stack (or chart cell in chart decoding). It is essential for chart decoding (otherwise decoding will take a VERY long time) and an option in phrase-based decoding.

In the phrase-based decoder, add:

  [search-algorithm]
  1
  [cube-pruning-pop-limit]
  500

There is a speed-quality tradeoff: a lower pop limit means less work for the decoder, so decoding is faster but translation is less accurate.
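To find a suitable pop limit, it can help to decode the same test set with several values and compare speed and quality. The sketch below only prints the commands (a dry run). It assumes that the two ini settings above are also accepted as command-line switches of the same name; the function name pop_limit_commands and the file name moses.ini are hypothetical.

```shell
# Hypothetical helper: print one decoder command per pop limit (dry run).
pop_limit_commands() {
  for limit in "$@"; do
    echo "moses -f moses.ini -search-algorithm 1 -cube-pruning-pop-limit $limit"
  done
}

# Print the commands for three candidate pop limits.
pop_limit_commands 100 500 2000
```

Each printed command can then be run with the same input file, timing the run and scoring the output to see where quality stops improving.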

Minimizing memory during training

TODO: MGIZA with reduced memory snt2cooc

Minimizing memory during decoding

The biggest consumers of memory during decoding are typically the models. Here are some links on how to reduce the size of each.

Language model:

 * use KenLM with the trie data structure (Moses.Optimize#ntoc14)
 * use on-demand loading (Moses.Optimize#ntoc15)

Translation model:

 * use phrase table pruning (Advanced.RuleTables#ntoc5)
 * use a compact phrase table (http://www.statmt.org/moses/?n=Advanced.RuleTables#ntoc3)
 * filter the translation model given the text you want to translate (Moses.SupportTools#ntoc3)

Reordering model:

 * techniques similar to those for translation models are possible: pruning (Advanced.RuleTables#ntoc3), compact tables (Advanced.RuleTables#ntoc4), and filtering (Moses.SupportTools#ntoc3).

Compile-time options

These options can be added to the bjam command line, trading generality for performance.

You should do a full rebuild with -a when changing the values of most of these options.

Don't use factors? Add

  --max-factors=1

Tailor KenLM's maximum order to only what you need. If your highest-order language model has order 5, add

  --kenlm-max-order=5

Turn debug symbols off for speed and a little more memory.

  debug-symbols=off

But don't expect support from the mailing list until you rerun with debug symbols on!

Don't care about debug messages?

  --notrace

Download tcmalloc and see BUILD-INSTRUCTIONS.txt in Moses for installation instructions. bjam will automatically detect tcmalloc's presence and link against it for multi-threaded builds.

Install Boost and zlib static libraries. Then link statically:

  --static

This may mean you have to install Boost and zlib yourself.

Running single-threaded? Add threading=single.

Using hierarchical or string-to-tree models, but none with source syntax?

  --unlabelled-source
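Several of these options are typically combined in a single build. The sketch below prints one such command line as a dry run; it assumes a setup without factors and that bjam is invoked as ./bjam from the Moses source directory. The function name bjam_command is hypothetical, and the option set is only an illustration, not a recommended configuration.

```shell
# Hypothetical helper: print a bjam command line using the options above.
# $1 = order of the highest-order language model; -a forces a full rebuild.
bjam_command() {
  echo "./bjam -a --max-factors=1 --kenlm-max-order=$1 --notrace debug-symbols=off"
}

# Print the command for a system whose largest language model has order 5.
bjam_command 5
```

Drop any option that does not match your setup, for example --max-factors=1 if you use factored models.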

Phrase-table types

Moses has multiple phrase table implementations. The one that suits you best depends on the model you're using (phrase-based or hierarchical/syntax), and how much memory your server has.

Here is a complete list of the types:

Memory - this reads the phrase table into memory. For phrase-based models and chart decoding. Note that this is much faster than the Binary and OnDisk phrase table formats, but it uses a lot of RAM.

Binary - the phrase table is converted into a 'database'. Only the translations which are required are loaded into memory, so it requires less memory but is potentially slower to run. For phrase-based models.

OnDisk - reimplementation of Binary for chart decoding.

SuffixArray - stores the parallel training data and word alignment in memory, instead of the phrase table. Extraction is done on the fly. It also has a feature for adding parallel data while the decoder is running ('Dynamic Suffix Array'). For phrase-based models. See Levenberg et al. (2010).

ALSuffixArray - Suffix array for hierarchical models. See Lopez (2008).

FuzzyMatch - Implementation of Koehn and Senellart (2010).

Hiero - like SCFG, but translation rules are in standard Hiero-style format

Compact - for phrase-based models. See Junczys-Dowmunt (2012).
