
Advanced Features of the Decoder

The basic features of the decoder are explained in the Tutorial. Here, we describe some additional features that have been demonstrated to be beneficial in some cases.


Lexicalized Reordering Models
Enhanced orientation detection
Operation Sequence Model (OSM)
Class-based Models
Unsupervised Transliteration Model
Binary Phrase Tables with On-demand Loading
On-Disk Phrase table
Binary Phrase table
Binary Reordering Tables with On-demand Loading
Compact Phrase Table
Compact Lexical Reordering Table
XML Markup
Generating n-Best Lists
Word-to-word alignment
Minimum Bayes Risk Decoding
Lattice MBR and Consensus Decoding
Handling Unknown Words
Output Search Graph
Early Discarding of Hypotheses
Maintaining stack diversity
Cube Pruning
Specifying Reordering Constraints
Multiple Translation Tables and Back-off Models
Pruning the Translation Table
Build Instructions
Usage Instructions
Using the EMS
Pruning the Phrase Table based on Relative Entropy
Build Instructions
Usage Instructions
Multi-threaded Moses
Moses Server
Using Multiple Translation Systems in the Same Server
Continue Partial Translation
Global Lexicon Model
Incremental Training
Initial Training
How to use memory-mapped dynamic suffix array phrase tables in the moses decoder
Preprocess New Data
Prepare New Data
Update and Compute Alignments
Distributed Language Model
Installing and Compiling
Create Table
Shard Model
Create Bloom Filter
Integration with Moses
Suffix Arrays for Hierarchical Models
Using the EMS
Fuzzy Match Rule Table for Hierarchical Models
Translation Model Combination
Linear Interpolation and Instance Weighting
Online Translation Model Combination (Multimodel phrase table type)
Online Computation of Translation Model Features Based on Sufficient Statistics
Alternate Weight Settings
Open Machine Translation Core (OMTC) - A proposed machine translation system standard
Pipeline Creation Language (PCL)
Modified Moore-Lewis Filtering
Constrained Decoding

Lexicalized Reordering Models

The standard model for phrase-based statistical machine translation is conditioned only on movement distance and nothing else. However, some phrases are reordered more frequently than others. A French adjective like extérieur typically gets switched with the preceding noun when translated into English.

Hence, we want to consider a lexicalized reordering model that conditions reordering on the actual phrases. One concern, of course, is the problem of sparse data. A particular phrase pair may occur only a few times in the training data, making it hard to estimate reliable probability distributions from these statistics.

Therefore, in the lexicalized reordering model we present here, we only consider three reordering types: (m) monotone order, (s) switch with previous phrase, or (d) discontinuous. See below for an illustration of these three different types of orientation of a phrase.

To put it more formally, we want to introduce a reordering model po that predicts an orientation type {m,s,d} given the phrase pair currently used in translation:

orientation ∈ {m, s, d}


How can we learn such a probability distribution from the data? Again, we go back to the word alignment that was the basis for our phrase table. When we extract each phrase pair, we can also extract its orientation type in that specific occurrence.

Looking at the word alignment matrix, we note for each extracted phrase pair its corresponding orientation type. The orientation type can be detected by checking for a word alignment point to the top left or to the top right of the extracted phrase pair. An alignment point to the top left signifies that the preceding English word is aligned to the preceding foreign word. An alignment point to the top right indicates that the preceding English word is aligned to the following foreign word. See below for an illustration.

The orientation type is defined as follows:

  • monotone: if a word alignment point to the top left exists, we have evidence for monotone orientation.
  • swap: if a word alignment point to the top right exists, we have evidence for a swap with the previous phrase.
  • discontinuous: if neither a word alignment point to top left nor to the top right exists, we have neither monotone order nor a swap, and hence evidence for discontinuous orientation.
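The detection rules above can be sketched in a few lines of Python (an illustrative sketch, not Moses code; alignment points are assumed to be 0-based (English, foreign) index pairs, and the function name orientation is our own):

```python
def orientation(alignment, e_start, f_start, f_end):
    """Classify the orientation of an extracted phrase pair.

    alignment:      set of (e, f) word-alignment points, 0-based
    e_start:        first English (target) position of the phrase
    f_start, f_end: foreign (source) span of the phrase, inclusive
    """
    if (e_start - 1, f_start - 1) in alignment:
        return "mono"   # preceding English word aligned to preceding foreign word
    if (e_start - 1, f_end + 1) in alignment:
        return "swap"   # preceding English word aligned to following foreign word
    return "disc"       # neither alignment point exists
```

Handling of sentence boundaries (where no preceding word exists) is omitted for brevity.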

We count how often each extracted phrase pair is found with each of the three orientation types. The probability distribution po is then estimated based on these counts using the maximum likelihood principle:

po(orientation|f,e) = count(orientation,e,f) / Σo count(o,e,f)

Given the sparse statistics of the orientation types, we may want to smooth the counts with the unconditioned maximum-likelihood probability distribution with some factor σ:

po(orientation) = Σf Σe count(orientation,e,f) / Σo Σf Σe count(o,e,f)

po(orientation|f,e) = (σ p(orientation) + count(orientation,e,f) ) / ( σ + Σo count(o,e,f) )
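Both estimation steps can be written out as a short Python sketch (illustrative only; estimate_po is our own name, and counts is assumed to be a Counter over (orientation, f, e) triples collected as described above):

```python
from collections import Counter

ORIENTATIONS = ("mono", "swap", "disc")

def estimate_po(counts, sigma=0.5):
    """Return a smoothed estimator for p(orientation | f, e).

    counts: Counter mapping (orientation, f, e) -> count
    sigma:  smoothing factor towards the unconditioned distribution
    """
    total = sum(counts.values())
    # unconditioned maximum-likelihood distribution p(orientation)
    p_marginal = {
        o: sum(c for (oo, f, e), c in counts.items() if oo == o) / total
        for o in ORIENTATIONS
    }
    def po(o, f, e):
        pair_total = sum(counts[(oo, f, e)] for oo in ORIENTATIONS)
        return (sigma * p_marginal[o] + counts[(o, f, e)]) / (sigma + pair_total)
    return po
```

By construction, the smoothed probabilities for a phrase pair still sum to one over the three orientation types.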

There are a number of variations of this lexicalized reordering model based on orientation types:

  • bidirectional: Certain phrases may signal not only whether they themselves are moved out of order, but also whether subsequent phrases are reordered. A lexicalized reordering model for this decision could be learned in addition, using the same method.
  • f and e: Due to sparse data concerns, we may want to condition the probability distribution only on the foreign phrase (f) or the English phrase (e).
  • monotonicity: To further reduce the complexity of the model, we might merge the orientation types swap and discontinuous, leaving a binary decision about the phrase order.

These variations have been shown to be occasionally beneficial for certain training corpus sizes and language pairs. Moses allows the arbitrary combination of these decisions to define the reordering model type (e.g. bidirectional-monotonicity-f). See more on training these models in the training section of this manual.

Enhanced orientation detection

As explained above, statistics about the orientation of each phrase can be collected by looking at the word alignment matrix, in particular by checking the presence of a word at the top left and right corners. This simple approach is capable of detecting a swap with a previous phrase that contains a word exactly aligned on the top right corner, see case (a) in the figure below. However, this approach cannot detect a swap with a phrase that does not contain a word with such an alignment, as in case (b). A variation on the way phrase orientation statistics are collected is the so-called phrase-based orientation model by Tillmann (2004), which uses phrases both at training and decoding time. With the phrase-based orientation model, case (b) is properly detected and counted during training as a swap. A further improvement of this method is the hierarchical orientation model by Galley and Manning (2008), which is able to detect swaps or monotone arrangements between blocks even larger than the length limit imposed on phrases during training, and larger than the phrases actually used during decoding. For instance, it can detect at decoding time the swap of blocks in case (c) shown below.

(Figure from Galley and Manning, 2008)

Empirically, the enhanced orientation methods should be used with language pairs involving significant word re-ordering.

Operation Sequence Model (OSM)

The Operation Sequence Model as described in Durrani et al. (2011) and Durrani et al. (2013) has been integrated into Moses.

What is OSM?

OSM is an N-gram-based translation and reordering model that represents an aligned bilingual corpus as a sequence of operations and learns a Markov model over the resultant sequences. Possible operations are (i) generation of a sequence of source and target words, (ii) insertion of gaps as explicit target positions for reordering operations, and (iii) forward and backward jump operations which do the actual reordering. The probability of a sequence of operations is defined according to an N-gram model, i.e., the probability of an operation depends on the n-1 preceding operations. Let O = o1, ..., oN be a sequence of operations as hypothesized by the translator to generate a word-aligned bilingual sentence pair <F; E; A>; the model is then defined as:

posm(F,E,A) = p(o1,...,oN) = ∏i p(oi|oi-n+1...oi-1)
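The product above is a plain left-to-right n-gram score over operations, as in this sketch (ngram_logprob is a stand-in for a real n-gram model trained over operation sequences; osm_log_prob is our own name):

```python
import math

def osm_log_prob(ops, ngram_logprob, n=5):
    """Score an operation sequence under an n-gram model.

    ops:           list of operation names o1..oN
    ngram_logprob: function (history_tuple, op) -> log p(op | history)
    n:             model order (each op conditions on up to n-1 predecessors)
    """
    logp = 0.0
    for i, op in enumerate(ops):
        history = tuple(ops[max(0, i - n + 1):i])  # truncated history at sentence start
        logp += ngram_logprob(history, op)
    return logp
```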

The OSM model addresses several drawbacks of the phrase-based translation and lexicalized reordering models: i) it considers source and target contextual information across phrasal boundaries and does not make independence assumptions, ii) it is based on minimal translation units and therefore does not have the problem of spurious phrasal segmentation, iii) it considers much richer conditioning than the lexicalized reordering model, which only learns the orientation of a phrase w.r.t. the previous phrase (or block of phrases), ignoring how previous words were translated and reordered. The OSM model conditions translation and reordering decisions on 'n' previous translation and reordering decisions, which can span across phrasal boundaries.

A list of operations is given below:

Generate (X,Y): X and Y are source and target cepts of an MTU (minimal translation unit). This operation causes the words in Y and the first word in X to be added to the target and source strings, respectively, that have been generated so far. Subsequent words in X are added to a queue to be generated later.
Continue Source Cept: The source words added to the queue by the Generate (X,Y) operation are generated by the Continue Source Cept operation. Each Continue Source Cept operation removes one source word from the queue and copies it to the source string.
Generate Source Only (X): The words in X are added at the current position in the source string. This operation is used to generate a source word with no corresponding target word.
Generate Target Only (Y): The words in Y are added at the current position in the target string. This operation is used to generate a target word with no corresponding source word.
Generate Identical: The same word is added at the current position in both the source and target strings. The Generate Identical operation is used during decoding for the translation of unknown words.
Insert Gap: This operation inserts a gap which acts as a placeholder for the skipped words. There can be more than one open gap at a time.
Jump Back (W): This operation lets the translator jump back to an open gap. It takes a parameter W specifying which gap to jump to: W=1 for the gap closest to the right-most source word covered, W=2 for the second closest, and so on.
Jump Forward: This operation makes the translator jump to the right-most source word covered so far. It is performed when the next source word to be generated lies to the right of the right-most covered source word and does not immediately follow it.

The example shown in figure is deterministically converted to the following operation sequence:

Generate Identical -- Generate (hat investiert, invested) -- Insert Gap -- Continue Source Cept -- Jump Back (1) -- Generate (Millionen, million) -- Generate Source Only (von) -- Generate (Dollars, dollars) -- Generate (in, in) -- Generate (die, the) -- Generate (Untersuchungen, research)
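One consequence of this design is that target words are emitted strictly left to right; gaps and jumps reorder positions on the source side only. The sketch below replays the Generate-type operations of the sequence above to recover the target string (a hypothetical encoding of operations; the word produced by Generate Identical is not shown in the text here, so "Samsung" is used as a stand-in):

```python
def target_from_ops(ops):
    """Recover the target string from an OSM operation sequence.

    Only Generate, Generate Identical and Generate Target Only emit
    target words; Insert Gap, Jump Back, Jump Forward, Continue Source
    Cept and Generate Source Only affect the source side only.
    Each op is (name, payload); for GENERATE the payload is
    (source_cept, target_words).
    """
    out = []
    for name, payload in ops:
        if name == "GENERATE":
            out.extend(payload[1].split())
        elif name == "GENERATE_IDENTICAL":
            out.append(payload)          # word copied verbatim (e.g. unknown word)
        elif name == "GENERATE_TARGET_ONLY":
            out.extend(payload.split())
    return " ".join(out)

# The operation sequence from the example above, in this hypothetical encoding:
ops = [
    ("GENERATE_IDENTICAL", "Samsung"),            # stand-in for the copied word
    ("GENERATE", ("hat investiert", "invested")),
    ("INSERT_GAP", None),
    ("CONTINUE_SOURCE_CEPT", None),
    ("JUMP_BACK", 1),
    ("GENERATE", ("Millionen", "million")),
    ("GENERATE_SOURCE_ONLY", "von"),
    ("GENERATE", ("Dollars", "dollars")),
    ("GENERATE", ("in", "in")),
    ("GENERATE", ("die", "the")),
    ("GENERATE", ("Untersuchungen", "research")),
]
```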


To enable the OSM model in the phrase-based decoder, just put the following in the EMS config file:

 operation-sequence-model = "yes"
 operation-sequence-model-order = 5
 operation-sequence-model-settings = ""

Due to data sparsity the lexically driven OSM model may often fall back to very small context sizes. This problem is addressed in Durrani et al. (2014b) by learning operation sequences over generalized representations such as POS/Morph tags/word classes (See Section: Class-based Models). If the data has been augmented with additional factors, then use

 operation-sequence-model-settings = "0-0+1-1"

"0-0" will learn an OSM model over lexical forms and "1-1" will learn an OSM model over the second factor (POS/Morph/Cluster-id etc.). Note that using

 operation-sequence-model-settings = ""

for a factor augmented training data is an error. Use

 operation-sequence-model-settings = "0-0"

if you only intend to train the OSM model over surface forms in such a scenario.

In case you are not using EMS and want to train the OSM model manually, you will need to do two things:

1) Run the following command

 /path-to-moses/scripts/OSM/OSM-Train.perl --corpus-f corpus.fr --corpus-e corpus.en --alignment aligned.grow-diag-final-and --order 5 --out-dir /path-to-experiment/model/OSM --moses-src-dir /path-to-moses/ --srilm-dir /path-to-srilm/bin/i686-m64 --factor 0-0

2) Edit model/moses.ini to add

OpSequenceModel name=OpSequenceModel0 num-features=5 path=/path-to-experiment/model/OSM/operationLM.bin
OpSequenceModel0= 0.08 -0.02 0.02 -0.001 0.03

Class-based Models

Automatically clustering the training data into word classes in order to obtain smoother distributions and better generalizations has been a widely known and applied technique in natural language processing. Using class-based models has been shown to be useful when translating into morphologically rich languages. We use the mkcls utility in GIZA to cluster source and target vocabularies into classes. This is generally run during the alignment process, where the data is divided into 50 classes to estimate IBM Model-4. Durrani et al. (2014b) found using a different number of clusters to be useful for different language pairs. To cluster the data into a higher number of classes (say 1000), use:

 /path-to-GIZA/statmt/bin/mkcls -c1000 -n2 -p/path-to-corpus/ -V/path-to-experiment/training/prepared.stepID/fr.vcb.classes opt

To annotate the data with cluster-ids add the following to the EMS-config file:


 temp-dir = $working-dir/training/factor


 ### script that generates this factor
 factor-script = "/path-to-moses/scripts/training/wrappers/make-factor-brown-cluster-mkcls.perl $working-dir/training/prepared.stepID/$input-extension.vcb.classes"


 ### script that generates this factor
 factor-script = "/path-to-moses/scripts/training/wrappers/make-factor-brown-cluster-mkcls.perl $working-dir/training/prepared.stepID/$output-extension.vcb.classes"


Adding the above will augment the training data with cluster-ids. These can be enabled in different models. For example, to train a joint source-target phrase translation model, add the following to the EMS config file:


 input-factors = word mkcls
 output-factors = word mkcls
 alignment-factors = "word -> word"
 translation-factors = "word+mkcls -> word+mkcls "
 reordering-factors = "word -> word"
 decoding-steps = "t0"

To train a target sequence model over cluster-ids, add the following to the EMS config file:


 raw-corpus = /path-to-raw-monolingual-data/rawData.en
 factors = mkcls
 settings = "-unk"

To train an operation sequence model over cluster-ids, use the following in the EMS config file:


 operation-sequence-model-settings = "1-1"

If you want to train both lexically driven and class-based OSM models, then use:


 operation-sequence-model-settings = "0-0+1-1"

Unsupervised Transliteration Model

Character-based translation models (transliteration models) have been shown to be quite useful in MT for translating OOV words, for disambiguation, and for translating closely related languages. A transliteration module as described in Durrani et al. (2014a) has been integrated into Moses. It is completely unsupervised and language independent. It extracts a transliteration corpus from the parallel data and builds a transliteration model from it, which can then be used to translate OOV words or named entities.

To enable transliteration module add the following to the EMS config file:

 transliteration-module = "yes"

It will extract transliteration corpus from the word-aligned parallel data and learn a character-based model from it.

To use the post-decoding transliteration (Method 2 as described in the paper) add the following lines

 post-decoding-transliteration = "yes"
 language-model-file = /path to language model file/

To use the in-decoding method (Method 3 as described in the paper) add the following lines

 in-decoding-transliteration = "yes"
 transliteration-file = /file containing list of words to be transliterated/

The post-decoding method obtains the list of OOV words automatically by running the decoder. The in-decoding method requires the user to provide the list of words to be transliterated. This gives the freedom to transliterate additional words that might be known to the translation model but should also be transliterated in some scenarios. For example, "Little" is translated into Urdu when used as an adjective and transliterated when it is a name, as in "Stuart Little". You can simply use the OOV list obtained from Method 2 if you don't want to add any other words. Transliterating all the words in the test-set might be helpful when translating between closely related language pairs such as Hindi-Urdu, Thai-Lao etc. See Durrani and Koehn (2014) for a case-study.

Binary Phrase Tables with On-demand Loading

For larger tasks the phrase tables usually become huge, typically too large to fit into memory. Therefore, Moses supports a binary phrase table with on-demand loading, i.e. only the part of the phrase table that is required to translate a sentence is loaded into memory.

There are currently three binary formats to do this:

  • OnDisk phrase-table. Works with SCFG models and phrase-based models.
  • Binary phrase-table. Works with phrase-based models only.
  • Compact phrase-table. Works with phrase-based models only (may be extended in the near future). Small and fast.

On-Disk Phrase table

This phrase-table can be used for both phrase-based models and hierarchical models. (It can be used for full syntax models too, but is likely to be very slow.)

You first need to convert the rule table into a binary prefix format. This is done with the command CreateOnDiskPt:

 CreateOnDiskPt [#source factors] [#target factors] [#scores] [ttable-limit]  \ 
    [index of p(e|f) (usually 2)] [input text pt] [output directory]


  ~/CreateOnDiskPt 1 1 4 100 2 phrase-table.1.gz phrase-table.1.folder

This will create a directory, phrase-table.1.folder, with the following files:


The configuration file moses.ini should also be changed so that the binary file is used instead of the text file. You should change it from:

   PhraseDictionaryMemory path=phrase-table.1.gz ....

to:

   PhraseDictionaryOnDisk path=phrase-table.1.folder ....

Binary Phrase table

NB - Works with phrase-based models only. NB2 - This phrase-table is now included for backward compatibility. It may be deleted in future.

You have to convert the standard ASCII phrase tables into the binary format. Here is an example (standard phrase table phrase-table, with 4 scores):

 cat phrase-table | LC_ALL=C sort | bin/processPhraseTable \
   -ttable 0 0 - -nscores 4 -out phrase-table


  • -ttable int int string -- translation table file, use '-' for stdin
  • -out string -- output file name prefix for binary translation table
  • -nscores int -- number of scores in translation table

If you just want to convert a phrase table, the two integers in the -ttable option do not matter, so use 0's.

Important: If your data is encoded in UTF-8, make sure you set the environment variable LC_ALL=C before sorting. If your phrase table is already sorted, you can skip that step.

The output files will be:


In the Moses configuration file, specify only the file name stem phrase-table as phrase table and set the type to 1, i.e.:

 PhraseDictionaryBinary path=phrase-table ...

Word-to-word alignment

There are two arguments to the decoder that enable it to print out the word alignment information:

  -alignment-output-file [file]

print out the word alignment for the best translation to a file.


print the word alignment information of each entry in the n-best list as an extra column in the n-best file.

Word alignment is included in the phrase-table by default (as of November 2012). To exclude it, add


as an argument to the score program.

When binarizing the phrase-table, the word alignment is also included by default. To turn this behaviour off for the phrase-based binarizer:

  processPhraseTable -no-alignment-info ....


  processPhraseTableMin -no-alignment-info ....

(For the compact phrase-table representation).

There is no way to exclude word alignment information from the chart-based binarization process.

Phrase-based binary format: When word alignment information is stored, the two output files ".srctree" and ".tgtdata" will end with the suffix ".wa".

Note: The argument


has been deleted from the decoder. -print-alignment-info did nothing. -use-alignment-info is now inferred from the arguments


Additionally, the


has been renamed


to reflect what it actually does.

The word alignment MUST be enabled during binarization, otherwise the decoder will

  1. complain
  2. carry on blindly, but not print any word alignment

Binary Reordering Tables with On-demand Loading

The reordering tables may be also converted into a binary format. The command is slightly simpler:

 mosesdecoder/bin/processLexicalTable -in reordering-table -out reordering-table

The file names for input and output are typically the same, since the actual output file names have similar extensions to the phrase table file names.

Compact Phrase Table

A compact phrase table implementation is available that is around 6 to 7 times smaller than the original binary phrase table. It can be used in-memory and for on-demand loading. Like the original phrase table, this can only be used for phrase-based models. If you use this or the compact lexical reordering table below, please cite:

Download the CMPH library and install according to the included instructions. Make sure the installation target directory contains an "include" and a "lib" directory. Next you need to recompile Moses with

  ./bjam --with-cmph=/path/to/cmph

Now, you can convert the standard ASCII phrase tables into the compact format. Phrase tables are required to be sorted as above. For a maximal compression effect, it is advised to generate a phrase table with phrase-internal word alignment (this is the default). If you want to compress a phrase table without alignment information, use -encoding None instead (see advanced options below). It is possible to use the default encoding (PREnc) without alignment information, but it will take much longer. For now, there may be problems compacting phrase tables on 32-bit systems, since virtual memory usage quickly exceeds the 3 GB barrier.

Here is an example (standard phrase table phrase-table, with 4 scores) which produces a single file phrase-table.minphr:

  mosesdecoder/bin/processPhraseTableMin -in phrase-table.gz -out phrase-table -nscores 4 -threads 4

In the Moses config file, specify the WHOLE file name of the phrase table:

 PhraseDictionaryCompact path=phrase-table.minphr ...


  • -in string -- input table file name
  • -out string -- prefix of binary table file
  • -nscores int -- number of score components in phrase table
  • -no-alignment-info -- do not include alignment info in the binary phrase table
  • -threads int -- number of threads used for conversion
  • -T string -- path to custom temporary directory

As for the original phrase table, the option -no-alignment-info omits phrase-internal alignment information in the binary phrase table and should also be used if your input phrase table contains no alignment information. In that case you should also use -encoding None (see below), since the default compression method assumes that alignment information is present.

Since compression is quite processor-heavy, it is advised to use the -threads option to increase speed.

Advanced options: Default settings should be fine for most of your needs, but the size of the phrase table can be tuned to your specific needs with the help of the advanced options.


  • -encoding string -- encoding type: PREnc REnc None (default PREnc)
  • -rankscore int -- score index of P(t|s) (default 2)
  • -maxrank int -- maximum rank for PREnc (default 100)
  • -landmark int -- use landmark phrase every 2^n source phrases (default 10)
  • -fingerprint int -- number of bits used for source phrase fingerprints (default 16)
  • -join-scores -- single set of Huffman codes for score components
  • -quantize int -- maximum number of scores per score component
  • -no-warnings -- suppress warnings about missing alignment data

Encoding methods: There are two encoding types that can be used on top of the standard compression methods, Phrasal Rank-Encoding (PREnc) and word-based Rank Encoding (REnc). PREnc (see Junczys-Dowmunt (MT Marathon 2012) for details) is used by default and requires a phrase table with phrase-internal alignment to reach its full potential. PREnc can also work without explicit alignment information, but encoding is slower and the resulting file will be bigger (though still smaller than without PREnc). The tool will warn you about every line that misses alignment information if you use PREnc or REnc. These warnings can be suppressed with -no-warnings. If you use PREnc with non-standard scores, you should specify which score type is used for sorting with -rankscore int. By default this is P(t|s), which in the standard phrase table is the third score (index 2).

Basically, with PREnc available, there is no reason to use REnc unless you really want to. It requires the lexical translation table lex.f2e to be present in the same directory as the text version of the phrase table. If no alignment information is available, it falls back to None (see Junczys-Dowmunt (EAMT 2012) for details on REnc and None).

None is the fastest encoding method, but produces the biggest files. Concerning translation speed, there is virtually no difference between the encoding methods when the phrase tables are later used with Moses, but smaller files result in less memory usage, especially if the phrase tables are loaded entirely into memory.

Indexing options: The properties of the source phrase index can be modified with the -landmark and -fingerprint options, changing these options can affect file size and translation quality, so do it carefully. Junczys-Dowmunt (TSD 2012) contains a discussion of these values and their effects.

Scores and quantization: You can reduce the file size even more by using score quantization. E.g. with -quantize 1000000, a phrase table is generated that uses at most one million different scores for each score type. Be careful, low values will affect translation quality. By default, each score type is encoded with an own set of Huffman codes, with the -join-scores option only one set is used. If this option is combined with -quantize N, the summed number of different scores for all scores types will not exceed N.
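The idea behind -quantize N can be illustrated with a simple binning sketch (this is not Moses's actual quantizer, just the underlying trade-off: fewer distinct values mean smaller Huffman code tables but less precise scores; quantize is our own name):

```python
import bisect

def quantize(scores, n):
    """Snap scores to at most n distinct representative values.

    Picks n representatives evenly spaced over the sorted unique scores
    and maps every score to the nearest one.
    """
    uniq = sorted(set(scores))
    if len(uniq) <= n:
        return list(scores)          # already few enough distinct values
    if n == 1:
        return [uniq[len(uniq) // 2]] * len(scores)
    step = (len(uniq) - 1) / (n - 1)
    reps = [uniq[round(i * step)] for i in range(n)]
    def snap(x):
        i = bisect.bisect_left(reps, x)
        candidates = reps[max(0, i - 1):i + 1]
        return min(candidates, key=lambda r: abs(r - x))
    return [snap(x) for x in scores]
```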

In-memory loading: You can start Moses with the option -minphr-memory to load the compact phrase table directly into memory at start up. Without this option, on-demand loading is used by default.

Compact Lexical Reordering Table

The compact lexical reordering table produces files about 12 to 15 times smaller than the original Moses binary implementation. As for the compact phrase table you need to install CMPH and link against it. Reordering tables must be sorted in the same way as the phrase tables above. The command below produces a single file reordering-table.minlexr.

  mosesdecoder/bin/processLexicalTableMin -in reordering-table.gz -out reordering-table -threads 4

If you include the prefix in the Moses config file, the compact reordering table will be recognized and loaded automatically. You can start Moses with the option -minlexr-memory to load the compact lexical reordering table directly into memory at start up.

Options: See the compact phrase table above for a description of available common options.

XML Markup

Sometimes we have external knowledge that we want to bring to the decoder. For instance, we might have a better translation system for translating numbers or dates. We would like to plug these translations into the decoder without changing the model.

The -xml-input flag is used to activate this feature. It can have one of five values:

  • exclusive Only the XML-specified translation is used for the input phrase. Any phrases from the phrase table that overlap with that span are ignored.
  • inclusive The XML-specified translation competes with all the phrase table choices for that span.
  • constraint The XML-specified translation competes with phrase table choices that contain the specified translation.
  • ignore The XML-specified translation is ignored completely.
  • pass-through (default) For backwards compatibility, the XML data is fed straight through to the decoder. This will produce erroneous results if the decoder is fed data that contains XML markup.

The decoder has an XML markup scheme that allows the specification of translations for parts of the sentence. In its simplest form, we can tell the decoder what to use to translate certain words or phrases in the sentence:

 % echo 'das ist <np translation="a cute place">ein kleines haus</np>' \
   | moses -xml-input exclusive -f moses.ini
 this is a cute place

 % echo 'das ist ein kleines <n translation="dwelling">haus</n>' \
   | moses -xml-input exclusive -f moses.ini
 this is a little dwelling

The words have to be surrounded by tags, such as <np...> and </np>. The name of the tags can be chosen freely. The target output is specified in the opening tag as the value of a parameter called translation (the name english is also accepted, for historical reasons, as the canonical target language).

We can also provide a probability along with these translation choices. The parameter must be named prob and contain a single float value. If not present, an XML translation option is given a probability of 1.

 % echo 'das ist ein kleines <n translation="dwelling" prob="0.8">haus</n>' \
   | moses -xml-input exclusive -f moses.ini
 this is a little dwelling

This probability isn't very useful without letting the decoder have other phrase table entries "compete" with the XML entry, so we switch to inclusive mode. This allows the decoder to use either translations from the model or the specified XML translation:

 % echo 'das ist ein kleines <n translation="dwelling" prob="0.8">haus</n>' \
   | moses -xml-input inclusive -f moses.ini
 this is a small house

The switch -xml-input inclusive gives the decoder a choice between using the specified translations or its own. This choice, again, is ultimately made by the language model, which takes the sentence context into account.

This doesn't change the output from the non-XML sentence because that prob value is first logged, then split evenly among the number of scores present in the phrase table. Additionally, in the toy model used here, we are dealing with a very dumb language model and phrase table. Setting the probability value to something astronomical forces our option to be chosen.

 % echo 'das ist ein kleines <n translation="dwelling" prob="0.8">haus</n>' \
   | moses -xml-input inclusive -f moses.ini
 this is a little dwelling
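The scoring mechanism just described (log the prob value, then split it evenly among the phrase-table score components) can be sketched as follows (xml_option_scores is a hypothetical name, not part of Moses):

```python
import math

def xml_option_scores(prob, num_scores):
    """Convert an XML 'prob' value into phrase-table feature scores.

    The probability is logged and the log value is split evenly across
    the score components, so the components sum back to log(prob).
    """
    return [math.log(prob) / num_scores] * num_scores
```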

Multiple translations can be specified, separated by two bars (||):

 % echo 'das ist ein kleines <n translation="dwelling||house" prob="0.8||0.2">haus</n>' \
   | moses -xml-input inclusive -f moses.ini
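The '||'-separated attribute values decompose into translation options as in this sketch (illustrative only; Moses performs this parsing internally, and parse_xml_alternatives is our own name):

```python
def parse_xml_alternatives(translation, prob):
    """Split '||'-separated XML translation alternatives.

    translation: e.g. "dwelling||house"
    prob:        e.g. "0.8||0.2"; empty string means default probability 1
    Returns a list of (translation, probability) pairs.
    """
    words = translation.split("||")
    probs = [float(p) for p in prob.split("||")] if prob else [1.0] * len(words)
    if len(probs) != len(words):
        raise ValueError("number of probabilities must match translations")
    return list(zip(words, probs))
```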

The XML-input implementation is NOT currently compatible with factored models or confusion networks.


  • -xml-input ('pass-through' (default), 'inclusive', 'constraint', 'exclusive', 'ignore')

Generating n-Best Lists

The generation of n-best lists (the top n translations found by the search according to the model) is pretty straightforward. You simply have to specify the file where the n-best list will be stored and the size of the n-best list for each sentence.

Example: The command

 % moses -f moses.ini -n-best-list listfile 100 < in

stores the n-best list in the file listfile with up to 100 translations per input sentence.

Here is an example n-best list:

 0 ||| we must discuss on greater vision .  ||| d: 0 -5.56438 0 0 -7.07376 0 0 \
   lm: -36.0974 -13.3428 tm: -39.6927 -47.8438 -15.4766 -20.5003 4.99948 w: -7 ||| -9.2298
 0 ||| we must also discuss on a vision .  ||| d: -10 -2.3455 0 -1.92155 -3.21888 0 -1.51918 \
   lm: -31.5841 -9.96547 tm: -42.3438 -48.4311 -18.913 -20.0086 5.99938 w: -8 ||| -9.26197
 0 ||| it is also discuss a vision .  ||| d: -10 -1.63574 -1.60944 -2.70802 -1.60944 -1.94589 -1.08417 \
   lm: -31.9699 -12.155 tm: -40.4555 -46.8605 -14.3549 -13.2247 4.99948 w: -7 ||| -9.31777

Each line of the n-best list file is made up of (separated by |||):

  • sentence number (in above example 0, the first sentence)
  • output sentence
  • individual component scores (unweighted)
  • weighted overall score

Note that it is possible (and very likely) that the n-best list contains many sentences that look the same on the surface, but have different scores. The most common reason for this is different phrase segmentation (two words may be mapped by a single phrase mapping, or by two individual phrase mappings for each word).
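The field layout above can be parsed mechanically; here is a minimal sketch in Python (the function name is our own):

```python
def parse_nbest_line(line):
    """Split one Moses n-best list line on the ||| separator into:
    sentence number, output sentence, raw component-score string,
    and the weighted overall score."""
    fields = [f.strip() for f in line.split("|||")]
    return int(fields[0]), fields[1], fields[2], float(fields[3])

sent_id, output, scores, total = parse_nbest_line(
    "0 ||| we must discuss on greater vision . ||| d: 0 -5.56438 w: -7 ||| -9.2298")
```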

To produce an n-best list that only contains the first occurrence of an output sentence, add the word distinct after the file and size specification:

 % moses -f moses.ini -n-best-list listfile 100 distinct < in

This creates an n-best list file that contains up to 100 distinct output sentences for each input sentence. Note that potentially a large number of candidate translations has to be examined to find the top 100. To keep memory usage in check, only 20 times the specified number of distinct entries are examined. This factor can be changed with the switch -n-best-factor.


  • -n-best-list FILE SIZE [distinct] --- output an n-best list of size SIZE to file FILE
  • -n-best-factor FACTOR --- exploring at most FACTOR*SIZE candidates for distinct
  • -include-alignment-in-n-best --- output of word-to-word alignments in the n-best list; it requires that w2w alignments are included in the phrase table and that -use-alignment-info is set. (See here for further details).

Word-to-word alignment

If the phrase table (binary or textual) includes word-to-word alignments between source and target phrases (see "Score Phrases" and "Binary Phrase Table"), Moses can report them in the output.

There are four options that control the output of alignment information: -use-alignment-info, -print-alignment-info, -print-alignment-info-in-n-best, and -alignment-output-file.

For instance, by translating the sentence "ich frage" from German into English and activating all parameters, you get in the verbose output:

 BEST TRANSLATION: i ask [11]  [total=-1.429] <<features>> [f2e: 0=0 1=1] [e2f: 0=0 1=1]

The last two fields report the word-to-word alignments from source to target and from target to source, respectively.

In the n-best list you get:

 0 ||| i ask  ||| ...feature_scores.... ||| -1.42906 ||| 0-1=0-1 ||| 0=0 1=1 ||| 0=0 1=1
 0 ||| i am asking  ||| ...feature_scores.... ||| -2.61281 ||| 0-1=0-2 ||| 0=0 1=1,2 ||| 0=0 1=1 2=1
 0 ||| i ask you  ||| ...feature_scores.... ||| -3.1068 ||| 0-1=0-2 ||| 0=0 1=1,2 ||| 0=0 1=1 2=1
 0 ||| i ask this  ||| ...feature_scores.... ||| -3.48919 ||| 0-1=0-2 ||| 0=0 1=1 ||| 0=0 1=1 2=-1

Indexes (starting from 0) are used to refer to words. '2=-1' means that the word of index 2 (i.e. the word "this") is not associated with any word in the other language. For instance, considering the last translation hypothesis "i ask this" of "ich frage", the source-to-target alignment ("0=0 1=1") means that:

 German   -> English
 ich      -> i
 frage    -> ask

and, vice versa, the target-to-source alignment ("0=0 1=1 2=-1") means that:

 English  -> German
 i        -> ich
 ask      -> frage
 this      -> 

Note: in the same translation hypothesis, the field "0-1=0-2" after the global score refers to the phrase-to-phrase alignment and means that "ich frage" is translated as a single phrase into the three-word English phrase "i ask this".
This information is generated if the option -include-alignment-in-n-best is activated.

Important: the phrase table can include different word-to-word alignments for the source-to-target and target-to-source directions, at least in principle. Hence, the two alignments can differ.


  • -use-alignment-info -- to activate this feature (required for binarized ttables, see "Binary Phrase Table").
  • -print-alignment-info -- to output the word-to-word alignments into the verbose output.
  • -print-alignment-info-in-n-best -- to output the word-to-word alignments into the n-best lists.
  • -alignment-output-file outfilename -- to output word-to-word alignments into a separate file in a compact format (one line per sentence).
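The alignment fields shown above (e.g. "0=0 1=1,2 2=-1") can be decoded with a few lines of Python; this is a sketch with a hypothetical helper name:

```python
def parse_w2w_alignment(field):
    """Turn a word-to-word alignment field such as '0=0 1=1,2 2=-1' into
    a dict mapping each word index to the list of aligned indices;
    '-1' (unaligned) becomes an empty list."""
    alignment = {}
    for item in field.split():
        idx, targets = item.split("=")
        alignment[int(idx)] = [] if targets == "-1" else [int(t) for t in targets.split(",")]
    return alignment
```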

Minimum Bayes Risk Decoding

Minimum Bayes Risk (MBR) decoding was proposed by Kumar and Byrne (HLT/NAACL 2004). Roughly speaking, instead of outputting the translation with the highest probability, MBR decoding outputs the translation that is most similar to the most likely translations. This requires a similarity measure to establish similarity. In Moses, this is a smoothed BLEU score.

Using MBR decoding is straightforward: just use the switch -mbr when invoking the decoder.


 % moses -f moses.ini -mbr < in

MBR decoding uses by default the top 200 distinct candidate translations to find the translation with minimum Bayes risk. If you want to change this to some other number, use the switch -mbr-size:

 % moses -f moses.ini -decoder-type 1 -mbr-size 100 < in

MBR decoding requires that the translation scores be converted into probabilities that add up to one. The default is to take the log-scores at face value, but you may get better results by scaling the scores. This can be done with the switch -mbr-scale, for instance:

 % moses -f moses.ini -decoder-type 1 -mbr-scale 0.5 < in


  • -mbr -- use MBR decoding
  • -mbr-size SIZE -- number of translation candidates to consider (default 200)
  • -mbr-scale SCALE -- scaling factor used to adjust the translation scores (default 1.0)

Note: MBR decoding and its variants are currently only implemented for the phrase-based decoder, not the chart decoder.
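The effect of -mbr-scale can be illustrated by exponentiating and normalizing the n-best log-scores (a sketch of the general idea, not the Moses implementation; the function name is ours):

```python
import math

def posteriors(log_scores, scale=1.0):
    """Convert translation log-scores into probabilities that sum to one.
    Larger scale values sharpen the distribution; smaller values flatten it."""
    m = max(log_scores)  # subtract the max for numerical stability
    exps = [math.exp(scale * (s - m)) for s in log_scores]
    z = sum(exps)
    return [e / z for e in exps]

sharp = posteriors([-9.23, -9.26, -9.32], scale=5.0)
flat = posteriors([-9.23, -9.26, -9.32], scale=0.5)
```

With scale well below 1, the candidate probabilities become nearly uniform; with a large scale, the top candidate dominates.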

Lattice MBR and Consensus Decoding

These are extensions to MBR which may run faster or give better results. For more details see Tromble et al. (2008), Kumar et al. (2009) and DeNero et al. (2009). The n-gram posteriors (required for Lattice MBR) and the n-gram expectations (for Consensus decoding) are both calculated using an algorithm described in DeNero et al. (2010). Currently both Lattice MBR and Consensus decoding are implemented as n-best list rerankers; in other words, the hypothesis space is an n-best list (not a lattice).

Here's the list of options which affect both Lattice MBR and Consensus decoding.


  • -lmbr -- use Lattice MBR decoding
  • -con -- use Consensus decoding
  • -mbr-size SIZE -- as for MBR
  • -mbr-scale SCALE -- as for MBR
  • -lmbr-pruning-factor FACTOR -- mean words per node in pruned lattice, as described in Tromble et al (2008) (default 30)

Lattice MBR has several further parameters which are described in the Tromble et al. (2008) paper.


  • -lmbr-p P -- the unigram precision (default 0.8)
  • -lmbr-r R -- the n-gram precision ratio (default 0.6)
  • -lmbr-thetas THETAS -- instead of specifying p and r, Lattice MBR can be configured by specifying all the n-gram weights and the length penalty (5 numbers). This is described fully in the references.
  • -lmbr-map-weight WEIGHT -- the weight given to the MAP hypothesis (default 0)

Since Lattice MBR has so many parameters, a utility to perform a grid search is provided. This is in moses-cmd/src and is called lmbrgrid. A typical usage would be

 % ./lmbrgrid -lmbr-p 0.4,0.6,0.8 -lmbr-r 0.4,0.6,0.8 -mbr-scale 0.1,0.2,0.5,1 -lmbr-pruning-factor   \
      30 -mbr-size 1000 -f moses.ini -i input.txt

In other words, the same Lattice MBR parameters as for Moses are used, but this time a comma separated list can be supplied. Each line in the output takes the following format:

 <sentence-id> ||| <p> <r> <pruning-factor> <scale> ||| <translation>

In the Moses Lattice MBR experiments that have been done to date, Lattice MBR showed small overall improvements on a NIST Arabic data set (+0.4 over MAP, +0.1 over MBR), once the parameters were chosen carefully. Parameters were optimized by grid search on 500 sentences of held-out data, and the following were found to be optimal:

 -lmbr-p 0.8 -lmbr-r 0.8 -mbr-scale 5 -lmbr-pruning-factor 50

Handling Unknown Words

Unknown words are copied verbatim to the output. They are also scored by the language model, and may be placed out of order. Alternatively, you may want to drop unknown words. To do so add the switch -drop-unknown.

When translating between languages that use different writing systems (say, Chinese-English), dropping unknown words results in better BLEU scores. However, it is misleading to a human reader, and it is unclear what the effect on human judgment is.


  • -drop-unknown -- drop unknown words instead of copying them into the output

Output Search Graph

It may be useful for many downstream applications to have a dump of the search graph, for instance to compile a word lattice. On the one hand, you can use the -verbose 3 option, which will give a trace of all generated hypotheses, but this logs many hypotheses that are immediately discarded. If you do not want this, a better option is the switch -output-search-graph FILE, which also provides some additional information.

The generated file contains lines that could be seen as both a dump of the states in the graph and the transitions in the graph. The state graph more closely reflects the hypotheses that are generated in the search. There are three types of hypotheses:

  • The initial empty hypothesis is the only one that is not generated by a phrase translation
 0 hyp=0 stack=0 [...]
  • Regular hypotheses
 0 hyp=17 stack=1 back=0 score=-1.33208 [...] covered=0-0 out=from now on
  • Recombined hypotheses
 0 hyp=5994 stack=2 back=108 score=-1.57388 [...] recombined=13061 [...] covered=2-2 out=be

The relevant information for viewing each line as a state in the search graph is the sentence number (initial 0), the hypothesis id (hyp), the stack where the hypothesis is placed (same as the number of foreign words covered, stack), the back-pointer to the previous hypothesis (back), the score so far (score), the last output phrase (out) and that phrase's foreign coverage (covered). For recombined hypotheses, the superior hypothesis id is also given (recombined).

The search graph output includes additional information that is computed after the fact. While the back-pointer and score (back, score) point to the cheapest path and cost to the beginning of the graph, the generated output also includes the pointer to the cheapest path and score (forward, fscore) to the end of the graph.

One way to view the output of this option is as a record of the search and all (relevant) hypotheses that are generated along the way. But often we want to generate a word lattice, where the states are less relevant and the information lies in the transitions from one state to the next, each transition emitting a phrase at a certain cost. The initial empty hypothesis is irrelevant here, so we need to consider only the other two hypothesis types:

  • Regular hypotheses
 0 hyp=17 [...] back=0 [...] transition=-1.33208 [...] covered=0-0 out=from now on 
  • Recombined hypotheses
 0 [...] back=108 [...] transition=-0.640114 recombined=13061 [...] covered=2-2 out=be

For the word lattice, the relevant information is the cost of the transition (transition), its output (out), possibly the foreign coverage (covered), and the start (back) and endpoint (hyp). Note that the states generated by recombined hypotheses are ignored, since the transition points to the superior hypothesis (recombined).

Here, for completeness, are the full lines for the three examples used above:

 0 hyp=0 stack=0 forward=9 fscore=-107.279
 0 hyp=17 stack=1 back=0 score=-1.33208 transition=-1.33208 \
   forward=517 fscore=-106.484 covered=0-0 out=from now on 
 0 hyp=5994 stack=2 back=108 score=-1.57388 transition=-0.640114 \
   recombined=13061 forward=22455 fscore=-106.807 covered=2-2 out=be
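Each line of the dump is a leading sentence number followed by key=value fields, with a possibly multi-word out= value at the end, so it can be parsed like this (a sketch; the function name is ours):

```python
def parse_graph_line(line):
    """Parse one -output-search-graph line into a dict. 'out' may contain
    spaces, so everything after ' out=' is taken verbatim; the leading
    sentence number is stored under 'sent'."""
    out = None
    if " out=" in line:
        line, out = line.split(" out=", 1)
    tokens = line.split()
    entry = {"sent": int(tokens[0])}
    for token in tokens[1:]:
        key, value = token.split("=")
        entry[key] = value
    if out is not None:
        entry["out"] = out.strip()
    return entry

hyp = parse_graph_line(
    "0 hyp=17 stack=1 back=0 score=-1.33208 transition=-1.33208 "
    "forward=517 fscore=-106.484 covered=0-0 out=from now on")
```

From such entries, lattice transitions are the (back, hyp) pairs with their transition cost and out phrase, as described above.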

When using the switch -output-search-graph-extended (or short: -osgx), a detailed score breakdown is provided for each line. The format is the same as in the n-best list.

What is the difference between the search graph output file generated with this switch and the true search graph?

  • It contains the additional forward costs and forward paths.
  • It also only contains the hypotheses that are part of a fully connected path from the initial empty hypothesis to a final hypothesis that covers the full foreign input sentence.
  • The recombined hypotheses already point to the correct superior hypothesis, while the -verbose 3 log shows the recombinations as they happen (recall that momentarily superior hypotheses may be recombined to even better ones down the road).

Note again that you can get the full search graph with the -verbose 3 option. It is, however, much larger and mostly consists of discarded hypotheses.


  • -output-search-graph FILE -- output the search graph for each sentence in a file
  • -output-search-graph-extended FILE -- output the search graph for each sentence in a file, with detailed feature breakdown

Early Discarding of Hypotheses

During the beam search, many hypotheses are created that are too bad to even be entered on a stack. For many of them, it is clear even before the construction of the hypothesis that it would not be useful. Early discarding of such hypotheses hazards a guess about their viability: the estimate is based on the correct score, except for the actual language model costs, which are very expensive to compute. Hypotheses that, according to this estimate, are worse than the worst hypothesis of the target stack, even given an additional specified threshold as cushion, are not constructed at all. This often speeds up decoding significantly. Try threshold factors between 0.5 and 1.


  • -early-discarding-threshold THRESHOLD -- use early discarding of hypotheses with the specified threshold (default: 0 = not used)

Maintaining stack diversity

The beam search organizes and compares hypotheses based on the number of foreign words they have translated. Since they may have translated different foreign words, we use future score estimates of the remaining translation cost of the sentence.

Instead of comparing such apples and oranges, we could also organize hypotheses by their exact foreign word coverage. The disadvantage is that this would require an exponential number of stacks, though with reordering limits the number of stacks is only exponential with regard to the maximum reordering distance.

Such coverage stacks are implemented in the search, and their maximum size is specified with the switch -stack-diversity (or -sd), which sets the maximum number of hypotheses per coverage stack.

The actual implementation is a hybrid of coverage stacks and foreign word count stacks: the stack diversity is a constraint on which hypotheses are kept on the traditional stack. If the stack diversity limits leave room for additional hypotheses according to the stack size limit (specified by -s, default 200), then the stack is filled up with the best hypotheses, using score so far and the future score estimate.


  • -stack-diversity LIMIT -- keep a specified number of hypotheses for each foreign word coverage (default: 0 = not used)

Cube Pruning

Cube pruning, as described by Huang and Chiang (2007), has been implemented in the Moses decoder. This is in addition to the traditional search algorithm. The code offers developers the opportunity to implement different search algorithms using an extensible framework.

Cube pruning is faster than the traditional search at comparable levels of search errors. To get faster performance than the default Moses setting at roughly the same performance, use the parameter settings:

 -search-algorithm 1 -cube-pruning-pop-limit 2000 -s 2000

This uses cube pruning (-search-algorithm) that adds 2000 hypotheses to each stack (-cube-pruning-pop-limit 2000) and also increases the stack size to 2000 (-s 2000). Note that with cube pruning, the size of the stack has little impact on performance, so it should be set rather high. The speed/quality trade-off is mostly regulated by the cube pruning pop limit, i.e. the number of hypotheses added to each stack.

Stacks are organized by the number of foreign words covered, so hypotheses within a stack may differ in which words are covered. You may also require that a minimum number of hypotheses be added for each word coverage (they may still be pruned out, however). This is done using the switch -cube-pruning-diversity MINIMUM, which sets the minimum. The default is 0.


  • -search-algorithm 1 -- turns on cube pruning
  • -cube-pruning-pop-limit LIMIT -- number of hypotheses added to each stack
  • -cube-pruning-diversity MINIMUM -- minimum number of hypotheses from each coverage pattern

Specifying Reordering Constraints

For various reasons, it may be useful to specify reordering constraints to the decoder, for instance because of punctuation. Consider the sentence:

 I said " This is a good idea . " , and pursued the plan .

The quoted material should be translated as a block, meaning that once we start translating some of the quoted words, we need to finish all of them. We call such a block a zone and allow the specification of such constraints using XML markup.

 I said <zone> " This is a good idea . " </zone> , and pursued the plan .

Another type of constraint is the wall, a hard reordering constraint: all words before a wall have to be translated before any words after it. For instance:

 This is the first part . <wall /> This is the second part .

Walls may be specified within zones, where they act as local walls, i.e. they are only valid within the zone.

 I said <zone> " <wall /> This is a good idea . <wall /> " </zone> , and pursued the plan .

If you add such markup to the input, you need to use the option -xml-input with either exclusive or inclusive (there is no difference between these options in this context).

Specifying reordering constraints around punctuation is often a good idea.
The switch -monotone-at-punctuation introduces walls around the punctuation tokens ,.!?:;".


  • walls and zones have to be specified in the input using the tags <zone>, </zone>, and <wall />.
  • -xml-input -- needs to be exclusive or inclusive
  • -monotone-at-punctuation (-mp) -- adds walls around punctuation ,.!?:;".
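As an illustration of what -monotone-at-punctuation does, the following sketch inserts walls around the listed punctuation tokens in a pre-tokenized input. The decoder handles this internally, so this is only an approximation of the behaviour, and the function name is ours:

```python
def monotone_at_punctuation(sentence):
    """Wrap each punctuation token from ,.!?:;" in <wall /> markers so
    that no reordering crosses it (illustrative approximation of the
    -monotone-at-punctuation switch)."""
    punct = set(',.!?:;"')
    out = []
    for token in sentence.split():
        if token in punct:
            out.extend(["<wall />", token, "<wall />"])
        else:
            out.append(token)
    return " ".join(out)
```

For example, "first part . second part" becomes "first part <wall /> . <wall /> second part".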

Multiple Translation Tables and Back-off Models

Moses allows the use of multiple translation tables, but there are two different ways in which they can be used:

  • both translation tables are used for scoring: This means that every translation option is collected from each table and scored by each table. This implies that each translation option has to be contained in each table: if it is missing in one of the tables, it can not be used.
  • either translation table is used for scoring: Translation options are collected from one table, and additional options are collected from the other tables. If the same translation option (in terms of identical input phrase and output phrase) is found in multiple tables, separate translation options are created for each occurrence, but with different scores.

In any case, each translation table has its own set of weights.

First, you need to specify the translation tables in the section [feature] of the moses.ini configuration file, for instance:

 PhraseDictionaryMemory path=/my-dir/table1 ...
 PhraseDictionaryMemory path=/my-dir/table2 ...

Secondly, you need to set weights for each phrase-table in the section [weight].

Thirdly, you need to specify how the tables are used in the section [mapping]. As mentioned above, there are two choices:

  • scoring with both tables:
 0 T 0
 0 T 1
  • scoring with either table:
 0 T 0
 1 T 1

Note: what we are really doing here is using Moses' capability to use different decoding paths. The number before "T" defines a decoding path, so in the second example two different decoding paths are specified. Decoding paths may also contain additional mapping steps, such as generation steps and translation steps using different factors.

Also note that there is no way to express "use both tables if the phrase pair is in both tables, otherwise use only the table where it can be found". Keep in mind that scoring a phrase pair involves a cost and lowers the chances that the phrase pair is used. To effectively get this behavior, you may create a third table that consists of the intersection of the two phrase tables, and remove the shared phrase pairs from each table.

Backoff Models: You may prefer to use the first table, and the second table only if no translations are found in the first table. In other words, the second table is only a back-off table for words and phrases unknown to the first table. This can be specified with the option decoding-graph-back-off. The option also lets you specify whether the back-off table should only be used for single words (unigrams), for unigrams and bigrams, for everything up to trigrams, up to 4-grams, etc.

For example, if you have two translation tables, and you want to use the second one only for unknown words, you would specify:

 [decoding-graph-back-off]
 0
 1

The 0 indicates that the first table is used for anything (which it always should be), and the 1 indicates that the second table is used for unknown n-grams up to size 1. Replacing it with a 2 would indicate its use for unknown unigrams and bigrams (unknown in the sense that the first table has no translations for it).

Note that this option also works with more complicated mappings than just a single translation table. For instance, the following specifies the use of a simple translation table first, and as a back-off a more complex factored decomposition involving two translation tables and two generation tables:

 0 T 0
 1 T 1
 1 G 0
 1 T 2
 1 G 1


Pruning the Translation Table

The translation table contains all phrase pairs found in the parallel corpus, which includes a lot of noise. To reduce the noise, work by Johnson et al. suggests pruning out unlikely phrase pairs. For more detail, please refer to the paper:

H. Johnson, J. Martin, G. Foster and R. Kuhn. (2007) "Improving Translation Quality by Discarding Most of the Phrasetable". In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 967-975.

Build Instructions

Moses includes a re-implementation of this method in the directory contrib/sigtest-filter. You first need to build it from the source files.

This implementation relies on Joy Zhang's SALM toolkit. The source code can be downloaded from github. Joy's original code is here.

  1. download and extract the SALM source release.
  2. in SALM/Distribution/Linux type: make
  3. enter the directory contrib/sigtest-filter in the main Moses distribution directory
  4. type make SALMDIR=/path/to/SALM

Usage Instructions

Using the SALM/Bin/Linux/Index/IndexSA.O32, create a suffix array index of the source and target sides of your training bitext (SOURCE, TARGET).

 % SALM/Bin/Linux/Index/IndexSA.O32 TARGET
 % SALM/Bin/Linux/Index/IndexSA.O32 SOURCE

Prune the phrase table:

 % cat phrase-table | ./filter-pt -e TARGET -f SOURCE -l FILTER-VALUE > phrase-table.pruned

FILTER-VALUE is the -log prob threshold described in Johnson et al. (2007). It may be either 'a+e', 'a-e', or a positive real value. Run with no options to see more use cases. A good setting is -l a+e -n 30, which also keeps only the top 30 phrase translations for each source phrase, based on p(e|f).

If you filter a hierarchical model, add the switch -h.

Using the EMS

To use this method in experiment.perl, you will have to add two settings in the TRAINING section:

 salm-index = /path/to/project/salm/Bin/Linux/Index/IndexSA.O64
 sigtest-filter = "-l a+e -n 50"

The setting salm-index points to the binary that builds the suffix array, and sigtest-filter contains the options for filtering (excluding -e, -f, -h). EMS automatically detects whether you are filtering a phrase-based or hierarchical model and whether a reordering model is used.

Pruning the Phrase Table based on Relative Entropy

While the pruning method of Johnson et al. (2007) is designed to remove spurious phrase pairs due to noisy data, it is also possible to remove phrase pairs that are redundant, that is, phrase pairs that can be composed from smaller phrase pairs in the model with similar probabilities. For more detail please refer to the following papers:

Ling, W., Graça, J., Trancoso, I., and Black, A. (2012). Entropy-based Pruning for Phrase-based Machine Translation. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 962-971.

Zens, R., Stanton, D., Xu, P. (2012). A Systematic Comparison of Phrase Table Pruning Techniques. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 972-983.

The code from Ling et al. (2012)'s paper is available at contrib/relent-filter.

Update: The code in contrib/relent-filter no longer works with the current version of Moses. To compile it, check out an older version of Moses with this command:

    git checkout RELEASE-0.91

Build Instructions

The binaries for Relative Entropy-based Pruning are built automatically with Moses. However, this implementation also calculates the significance scores (Johnson et al., 2007), using a slightly modified version of the code by Chris Dyer, which is in contrib/relent-filter/sigtest-filter. This must be built using the same procedure:

  1. Download and build SALM available here
  2. Run "make SALMDIR=/path/to/SALM" in "contrib/relent-filter/sigtest-filter" to create the executable filter-pt

Usage Instructions

Checklist of required files (I will use <varname> to refer to these vars):

  1. s_train - source training file
  2. t_train - target training file
  3. moses_ini - path to the Moses configuration file ( after tuning )
  4. pruning_binaries - path to the relent pruning binaries ( should be "bin" if no changes were made )
  5. pruning_scripts - path to the relent pruning scripts ( should be "contrib/relent-filter/scripts" if no changes were made )
  6. sigbin - path to the sigtest filter binaries ( should be "contrib/relent-filter/sigtest-filter" if no changes were made )
  7. output_dir - path to write the output

Build suffix arrays for the source and target parallel training data

 % SALM/Bin/Linux/Index/IndexSA.O32 <s_train>
 % SALM/Bin/Linux/Index/IndexSA.O32 <t_train>

Calculate phrase pair scores by running:

 % perl <pruning_scripts>/ -moses_ini <moses_ini> \
   -training_s <s_train> -training_t <t_train> \
   -prune_bin <pruning_binaries> -prune_scripts <pruning_scripts> \
   -moses_scripts <path_to_moses>/scripts/training/ \
   -workdir <output_dir> -dec_size 10000

This will create the following files in the <output_dir>/scores/ dir:

  1. count.txt - counts of the phrase pairs: N(s,t), N(s,*) and N(*,t)
  2. divergence.txt - negative log of the divergence of the phrase pair
  3. empirical.txt - empirical distribution of the phrase pairs N(s,t)/N(*,*)
  4. rel_ent.txt - relative entropy of the phrase pairs
  5. significance.txt - significance of the phrase pairs

You can use any one of these files for pruning and also combine these scores using the script <pruning_scripts>/

To actually prune a phrase table, run <pruning_scripts>/, which prunes phrase pairs based on the score file that is used, removing the phrase pairs with lower scores first.

For instance, to prune 30% of the phrase table using relative entropy run:

 % perl <pruning_scripts>/ -table <phrase_table_file> \
 -scores <output_dir>/scores/rel_ent.txt -percentage 70 > <pruned_phrase_table_file>

You can also prune by threshold:

 % perl <pruning_scripts>/ -table <phrase_table_file> \
 -scores <output_dir>/scores/rel_ent.txt -threshold 0.1 > <pruned_phrase_table_file>

The same must be done for the reordering table by replacing <phrase_table_file> with <reord_table_file>:

 % perl <pruning_scripts>/ -table <reord_table_file> \
 -scores <output_dir>/scores/rel_ent.txt -percentage 70 > <pruned_reord_table_file>


The script <pruning_scripts>/ requires forced decoding of the whole set of phrase pairs in the phrase table, so unless it is used on a small corpus, it usually requires a large amount of time. Thus, we recommend running multiple instances of <pruning_scripts>/ in parallel to process different parts of the phrase table.

To do this, run:

 % perl <pruning_scripts>/ -moses_ini <moses_ini> \
 -training_s <s_train> -training_t <t_train> \
 -prune_bin <pruning_binaries> -prune_scripts <pruning_scripts> \
 -moses_scripts <path_to_moses>/scripts/training/ \
 -workdir <output_dir> -dec_size 10000 -start 0 -end 100000

The -start and -end options tell the script to only calculate the results for phrase pairs between 0 and 99999.

Thus, an example shell script to process the whole phrase table would be:

 phrases_per_process=100000   # number of phrase pairs per chunk
 size=`wc <phrase_table_file> | gawk '{print $1}'`

 for i in $(seq 0 $phrases_per_process $size); do
   end=`expr $i + $phrases_per_process`
   perl <pruning_scripts>/ -moses_ini <moses_ini> \
   -training_s <s_train> -training_t <t_train> \
   -prune_bin <pruning_binaries> -prune_scripts <pruning_scripts> \
   -moses_scripts <path_to_moses>/scripts/training/ \
   -workdir <output_dir>.$i-$end -dec_size 10000 -start $i -end $end
 done
After all processes finish, simply join the partial score files together in the same order.

Multi-threaded Moses

Moses supports multi-threaded operation, enabling faster decoding on multi-core machines. The current limitations of multi-threaded Moses are:

  1. irstlm is not supported, since it uses a non-threadsafe cache
  2. lattice input may not work - this has not been tested
  3. increasing the verbosity of Moses will probably cause multi-threaded Moses to crash

Multi-threaded Moses is now built by default. If you omit the -threads argument, then Moses will use a single worker thread, and a thread to read the input stream. Using the argument -threads n specifies a pool of n threads, and -threads all will use all the cores on the machine.

Moses Server

The Moses server enables you to run the decoder as a server process, and send it sentences to be translated via XMLRPC. This means that one Moses process can service distributed clients coded in Java, Perl, Python, PHP, or any of the many other languages which have XMLRPC libraries.

To build the Moses server, you need to have XMLRPC-c installed and you need to add the argument --with-xmlrpc-c=<path-xmlrpc-c-config> to the configure arguments. It has been tested with the latest stable version, 1.16.19. You will also need to configure Moses for multi-threaded operation, as described above.

Running make should then build an executable server/mosesserver. This can be launched using the same command-line arguments as moses, with two additional arguments to specify the listening port and log-file (--server-port and --server-log). These default to 8080 and /dev/null respectively.

A sample client (in Perl) is included in the server directory; it requires the SOAP::Lite Perl module to be installed. To access the Moses server, an XMLRPC request should be sent to http://host:port/RPC2, where the parameter is a map containing the keys text and (optionally) align. The value of the first of these parameters is the text to be translated and the second, if present, causes alignment information to be returned to the client. The client will receive a map containing the same two keys, where the value associated with the text key is the translated text, and the align key (if present) maps to a list of maps. The alignment gives the segmentation in target order, with each list element specifying the target start position (tgt-start), source start position (src-start) and source end position (src-end).

Note that although the Moses server needs to be built against multi-threaded Moses, it can be run in single-threaded mode using the --serial option. This enables it to be used with non-threadsafe libraries such as (currently) irstlm.

Using Multiple Translation Systems in the Same Server

Alert: This functionality has been removed as of May 2013. A replacement is Alternate Weight Settings.

The Moses server is now able to load multiple translation systems within the same server, and the client is able to decide which translation system the server should use, on a per-sentence basis. The client does this by passing a system argument in the translate operation.

One possible use-case for this multiple models feature is if you want to build a server that translates both French and German into English, and uses a large English language model. Instead of running two copies of the Moses server, each with a copy of the English language model in memory, you can now run one Moses server instance, with the language model in memory, thus saving on RAM.

To use the multiple models feature, you need to make some changes to the standard Moses configuration file. A sample configuration file can be found here.

The first piece of extra configuration required for a multiple models setup is to specify the available systems, for example

 de D 0 R 0 L 0
 fr D 1 R 1 L 1

This specifies that there are two systems (de and fr), and that the first uses decode path 0, reordering model 0, and language model 0, whilst the second uses the models with id 1. The multiple decode paths are specified with a stanza like

 0 T 0
 1 T 1

which indicates that the 0th decode path uses the 0th translation model, and the 1st decode path uses the 1st translation model. Using a language model specification like

 0 0 5 /disk4/translation-server/models/interpolated-lm
 0 0 5 /disk4/translation-server/models/interpolated-lm

means that the same language model can be used in two different systems with two different weights, but Moses will only load it once. The weights sections of the configuration file must have the correct numbers of weights for each of the models, and there must be a word penalty and linear distortion weight for each translation system. The lexicalised reordering weights (if any) must be specified in the [weight-lr] stanza, with the distortion penalty in the [weight-d] stanza.

Continue Partial Translation

Alert: This functionality has been removed as of May 2013.

This option forces Moses to start generating the translation from a non-empty hypothesis. This can be useful in situations when you have already translated part of the sentence and want a suggestion or an n-best list of continuations.

Use -continue-partial-translation (-cpt) to activate this feature. With -cpt, Moses also accepts a special input format: three fields delimited by the triple bar (|||). The first field is the output produced so far (used for LM scoring). The second field is the coverage vector, a string of "1"s and "0"s of the same length as the input sentence, indicating which input words are already covered by the output so far. The third field is the source sentence.


 % echo "that is ||| 11000 ||| das ist ein kleines haus" | moses -f moses.ini -continue-partial-translation
 that is a small house

 % echo "that house ||| 10001 ||| das ist ein kleines haus" | moses -f moses.ini -continue-partial-translation
 that house is a little

If the input does not fit this pattern, it is treated like normal input with no words translated yet.

This type of input is currently not compatible with factored models or confusion networks. The standard non-lexicalized distortion works more or less as one would expect (note that some input coverage vectors may prohibit translation under low distortion limits). The lexicalized reordering has not been tested.


  • -continue-partial-translation (-cpt) -- activate the feature
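
The input format can be sketched with a small helper that builds the triple-bar line from a prefix and a set of covered word indices (the function name is ours, for illustration only):

```python
def make_cpt_input(prefix, covered_indices, source):
    """Build the -continue-partial-translation input line:
    output-so-far ||| coverage vector ||| source sentence."""
    n = len(source.split())
    coverage = "".join("1" if i in covered_indices else "0" for i in range(n))
    return " ||| ".join([prefix, coverage, source])

# make_cpt_input("that is", {0, 1}, "das ist ein kleines haus")
#   -> "that is ||| 11000 ||| das ist ein kleines haus"
```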

Global Lexicon Model

The global lexicon model predicts the bag of output words from the bag of input words. It does not use an explicit alignment between input and output words, so word choice is also influenced by the input context. For details, please check Mauser et al., (2009).

The model is trained with the script

 scripts/training/train-global-lexicon-model.perl --corpus-stem FILESTEM --lex-dir DIR --f EXT --e EXT

which requires the tokenized parallel corpus, and the lexicon files required for GIZA++.

You will need the MegaM maximum entropy classifier from Hal Daume for training.

Warning: A separate maximum entropy classifier is trained for each target word, which is very time consuming. The training code is in a very experimental state and very inefficient: for instance, training a model on Europarl German-English with 86,700 distinct English words took about 10,000 CPU hours.

The model is stored in a text file.

File format:

 county initiativen 0.34478
 county land 0.92405
 county schaffen 0.23749
 county stehen 0.39572
 county weiteren 0.04581
 county europa -0.47688
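
To illustrate the file format, here is a sketch of reading the table and scoring one output word against a bag of input words. This is a simplified reading of the model (summing pair weights and squashing with a sigmoid); the helper names are ours, and the actual Moses feature implements the maximum entropy formulation of Mauser et al. (2009).

```python
import math
from collections import defaultdict

def load_global_lexicon(path):
    """Read 'output-word input-word weight' triples, as in the file format above."""
    weights = defaultdict(dict)
    with open(path) as f:
        for line in f:
            out_word, in_word, w = line.split()
            weights[out_word][in_word] = float(w)
    return weights

def word_score(weights, out_word, input_bag):
    """Sum the weights of the input words present (as a set, i.e. a bag of
    distinct words) and squash with a sigmoid."""
    s = sum(weights.get(out_word, {}).get(w, 0.0) for w in set(input_bag))
    return 1.0 / (1.0 + math.exp(-s))
```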

Specification in moses.ini:

 GlobalLexicalModel input-factor=0 output-factor=0 path=.../global-lexicon.gz

 GlobalLexicalModel0= 0.1

Incremental Training


Translation models for Moses are typically batch trained. That is, before training you have all the data you wish to use, you compute the alignments using GIZA, and from that produce a phrase table which you can use in the decoder. If some time later you wish to utilize some new training data, you must repeat the process from the start, and for large data sets, that can take quite some time.

Incremental training provides a way of avoiding having to retrain the model from scratch every time you wish to use some new training data. Instead of producing a phrase table with precalculated scores for all translations, the entire source and target corpora are stored in memory as a suffix array along with their alignments, and translation scores are calculated on the fly. Now, when you have new data, you simply update the word alignments, and append the new sentences to the corpora along with their alignments. Moses provides a means of doing this via XML RPC, so you don't even need to restart the decoder to use the new data.
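
The idea of indexing the corpus with a suffix array and looking up phrases at decoding time can be pictured with a toy implementation (a sketch only; Moses' real implementation is memory-mapped C++ and additionally extracts and scores the aligned target phrases):

```python
import bisect

def build_suffix_array(tokens):
    """Start positions of all suffixes of the token sequence, sorted
    lexicographically. O(n^2 log n) here; real implementations do better."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def count_phrase(tokens, sa, phrase):
    """Count occurrences of a phrase via binary search on the suffix array.
    We materialize the length-m prefix of each suffix for simplicity;
    real implementations compare in place."""
    m = len(phrase)
    prefixes = [tokens[i:i + m] for i in sa]
    lo = bisect.bisect_left(prefixes, phrase)
    hi = bisect.bisect_right(prefixes, phrase)
    return hi - lo
```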

Note that at the moment the incremental phrase table code is not thread safe.

Initial Training

This section describes how to initially train and use a model which supports incremental training.

* Add

 training-options = "-final-alignment-model hmm"

to the TRAINING section of your experiment configuration file.


* Modify the moses.ini file found in <experiment-dir>/evaluation/filtered.<evaluation-set>.<run-number> to have a ttable-file entry as follows:

PhraseDictionaryDynSuffixArray source=<path-to-source-corpus> target=<path-to-target-corpus> alignment=<path-to-alignments>

The source and target corpus paths should be to the tokenized, cleaned, and truecased versions found in <experiment-dir>/training/corpus.<run>.<lang>, and the alignment path should be to <experiment-dir>/model/aligned.<run>.grow-diag-final-and.

How to use memory-mapped dynamic suffix array phrase tables in the Moses decoder

(phrase-based decoding only)

1. Compile with the bjam switch --with-mm

2. You need

   - sentences aligned text files
   - the word alignment between these files in symal output format

3. Build binary files

   Let ${L1} be the extension of the language that you are translating from,
   ${L2} the extension of the language that you want to translate into, and
   ${CORPUS} the name of the word-aligned training corpus.

   % zcat ${CORPUS}.${L1}.gz  | mtt-build -i -o /some/path/${CORPUS}.${L1}
   % zcat ${CORPUS}.${L2}.gz  | mtt-build -i -o /some/path/${CORPUS}.${L2}
   % zcat ${CORPUS}.${L1}-${L2}.symal.gz | symal2mam /some/path/${CORPUS}.${L1}-${L2}.mam
   % mmlex-build /some/path/${CORPUS} ${L1} ${L2} -o /some/path/${CORPUS}.${L1}-${L2}.lex -c /some/path/${CORPUS}.${L1}-${L2}.coc

4. Define a line in moses.ini:

   PhraseDictionaryBitextSampling name=PT0 output-factor=0 num-features=9 path=/some/path/${CORPUS} L1=${L1} L2=${L2} pfwd=g pbwd=g smooth=0 sample=1000 workers=1 

You can increase the number of workers for sampling (a bit faster), but you'll lose replicability of the translation output. (The best configuration of phrase table features is still under investigation.)


Preprocess New Data

First, tokenise, clean, and truecase both target and source sentences (in that order) in the same manner as for the original corpus. You can see how this was done by looking at the <experiment-dir>/steps/<run>/CORPUS_{tokenize,clean,truecase}.<run> scripts.

Prepare New Data

The preprocessed data now needs to be prepared for use by GIZA. This involves updating the vocab files for the corpus, converting the sentences into GIZA's snt format, and updating the cooccurrence file.


 $ $INC_GIZA_PP/GIZA++-v2/plain2snt.out <new-source-sentences> <new-target-sentences> \
 -txt1-vocab <previous-source-vocab> -txt2-vocab <previous-target-vocab>
The previous vocabulary files for the original corpus can be found in <experiment-dir>/training/prepared.<run>/{<source-lang>,<target-lang>}.vcb. Running this command with the files containing your new tokenized, cleaned, and truecased source and target as txt1 and txt2 will produce a new vocab file for each language and a couple of .snt files. Any further references to vocabs in commands or config files should reference the new vocabulary files just produced.
Note: if this command fails with the error message plain2snt.cpp:28: int loadVocab(): Assertion `iid1.size()-1 == ID' failed., then change line 15 in plain2snt.cpp to vector<string> iid1(1),iid2(1); and recompile.
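
For reference, the .snt format stores each sentence pair as three lines (an occurrence count, the source word ids, and the target word ids), and each .vcb line maps an integer id to a word and its count. A sketch of the conversion (helper names are ours):

```python
def load_vocab(path):
    """GIZA .vcb lines are: <id> <word> <count>; return word -> id."""
    vocab = {}
    with open(path) as f:
        for line in f:
            wid, word, _count = line.split()
            vocab[word] = int(wid)
    return vocab

def to_snt(src_sentence, tgt_sentence, src_vocab, tgt_vocab):
    """One sentence pair in .snt format: count line, source ids, target ids."""
    src_ids = " ".join(str(src_vocab[w]) for w in src_sentence.split())
    tgt_ids = " ".join(str(tgt_vocab[w]) for w in tgt_sentence.split())
    return "1\n{}\n{}".format(src_ids, tgt_ids)
```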


 $ $INC_GIZA_PP/bin/snt2cooc.out <new-source-vcb> <new-target-vcb> <new-source_target.snt> \
   <previous-source-target.cooc > new.source-target.cooc
 $ $INC_GIZA_PP/bin/snt2cooc.out <new-target-vcb> <new-source-vcb> <new-target_source.snt> \
   <previous-target-source.cooc > new.target-source.cooc
This command is run once in the source-target direction, and once in the target-source direction. The previous cooccurrence files can be found in <experiment-dir>/training/giza.<run>/<target-lang>-<source-lang>.cooc and <experiment-dir>/training/giza-inverse.<run>/<source-lang>-<target-lang>.cooc.

Update and Compute Alignments

GIZA++ can now be run to update and compute the alignments for the new data. This should be run in the source to target, and target to source directions. A sample GIZA++ config file is given below for the source to target direction; for the target to source direction, simply swap mentions of target and source.
 S: <path-to-src-vocab>
 T: <path-to-tgt-vocab>
 C: <path-to-src-to-tgt-snt>
 O: <prefix-of-output-files>
 coocurrencefile: <path-to-src-tgt-cooc-file>
 model1iterations: 1
 model1dumpfrequency: 1
 hmmiterations: 1
 hmmdumpfrequency: 1
 model2iterations: 0
 model3iterations: 0
 model4iterations: 0
 model5iterations: 0
 emAlignmentDependencies: 1
 step_k: 1
 oldTrPrbs: <path-to-original-thmm> 
 oldAlPrbs: <path-to-original-hhmm>

To run GIZA++ with these config files, just issue the command

 GIZA++ <path-to-config-file>

With the alignments updated, we can get the alignments for the new data by running the command:

 giza2bal.pl -d <path-to-updated-tgt-to-src-ahmm> -i <path-to-updated-src-to-tgt-ahmm> \
 | symal -alignment="grow" -diagonal="yes" -final="yes" -both="yes" > new-alignment-file

Update Model
Now that alignments have been computed for the new sentences, you can use them in the decoder. Updating a running Moses instance is done via XML RPC, however to make the changes permanent, you must append the tokenized, cleaned, and truecased source and target sentences to the original corpora, and the new alignments to the alignment file.

Distributed Language Model


In most cases, MT output improves significantly when more data is used to train the Language Model. More data however produces larger models, and it is very easy to produce a model which cannot be held in the main memory of a single machine. To overcome this, the Language Model can be distributed across many machines, allowing more data to be used at the cost of a performance overhead.

Support for Distributed Language Models in Moses is built on top of a bespoke distributed map implementation called DMap. DMap and the Distributed Language Model support are still in beta, and any feedback or bug reports are welcome.

Installing and Compiling

Before compiling Moses with DMap support, you must configure your DMap setup (see below). Once that has been done, run Moses' configure script with your normal options and --with-dmaplm=<path-to-dmap>, then the usual make, make install.


Configuring DMap is, at the moment, a very crude process: one must edit the src/DMap/Config.cpp file by hand and recompile after any change. Since the configuration is compiled in, any programs statically linked to DMap must also be recompiled once it changes. The file src/DMap/Config.cpp provides a good, self-explanatory example configuration.


In this example scenario, we have a Language Model trained on the giga4 corpus which we wish to host across 4 servers using DMap. The model is a 5-gram model, containing roughly 210 million ngrams; the probabilities and backoff weights of ngrams will be uniformly quantised to 5 bit values.


Here is an example Config.cpp for such a set up:
     config->addTableConfig(new TableConfigLossyDoubleHash(
             "giga4",    // name of table
             283845991,  // number of cells (approx 1.23 * number of ngrams)
             64,         // number of chunks (not too important, leave at 64)
             (((uint64_t)1 << 61) - 1),              // universal hashing P parameter
             5789372245 % (((uint64_t)1 << 61) - 1), // universal hashing a parameter
             3987420741 % (((uint64_t)1 << 61) - 1), // universal hashing b parameter
             16,         // num_error_bits (higher -> fewer collisions but more memory)
             10,         // num_value_bits (higher -> more accurate probabilities 
                         // and backoff weights but more memory)
             20));       // num_hashes
     config->addStructConfig(new StructConfigLanguageModelBackoff(
             "giga4",    // struct name
             "giga4",    // lm table name
             5,          // lm order
             5,          // num logprob bits (these fields should add up to the number 
                         // of value bits for the table)
             5));        // num backoff bits
     config->addServerConfig(new ServerConfig("server0.some.domain", 5000));
     config->addServerConfig(new ServerConfig("server1.some.domain", 5000));
     config->addServerConfig(new ServerConfig("server2.some.domain", 5000));
     config->addServerConfig(new ServerConfig("server3.some.domain", 5000));
Note that the shard directory should be on a shared file system all Servers can access.

Create Table

The command:
 create_table giga4
will create the files for the shards.

Shard Model

The model can now be split into chunks using the shard utility:
 shard giga4 /home/user/dmap/

Create Bloom Filter

A Bloom filter is a probabilistic data structure encoding set membership in an extremely space efficient manner. When queried whether a given item is present in the set it encodes, it can produce an error with a calculable probability. This error is one sided: it can produce false positives, but never false negatives. To avoid making slow network requests, DMap keeps a local Bloom filter containing the set of ngrams in the Language Model. Before making a network request for the probability of an ngram, DMap first checks whether the ngram is present in the Bloom filter. If it is not, then we know for certain that the ngram is not in the model, and no network request is needed. If it is present in the filter, the ngram might actually be in the model, or the filter may have produced a false positive.
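
The check described above can be sketched as follows (a toy Bloom filter for illustration, not DMap's implementation):

```python
import hashlib

class BloomFilter:
    """A minimal Bloom filter: k hash functions over an m-bit array.
    False positives are possible; false negatives are not."""

    def __init__(self, num_bits, num_hashes):
        self.m = num_bits
        self.k = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k independent positions by salting a cryptographic hash.
        for i in range(self.k):
            h = hashlib.sha256("{}:{}".format(i, item).encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))
```

Before issuing a network request for an ngram, one would test `ngram in bf` and skip the request when it returns False.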

To create a Bloom filter containing the ngrams of the Language Model, run this command:

 ngrams < /home/user/dmap/ | mkbf 134217728 210000000 /home/user/dmap/

Integration with Moses

The structure within DMap that Moses should use as the Language Model should be put into a file, in this case at /home/user/dmap/giga4.conf:


Note that if for testing or experimentation purposes you would like to have the whole model on the local machine instead of over the network, change the false to true. You must have sufficient memory to host the whole model, but decoding will be significantly faster.

To use this, put the following line in your moses.ini file:

 11 0 0 5 /home/user/dmap/giga4.conf

Suffix Arrays for Hierarchical Models

The phrase-based model uses a suffix array implementation which comes with Moses.

If you want to use suffix arrays for hierarchical models, use Adam Lopez's implementation. The source code for this is currently available in cdec. You have to compile cdec so please follow its instructions.

You also need to install pycdec

    cd python
    python setup.py install

Note: the suffix array code requires Python 2.7 or above. If you have Linux installations which are a few years old, check this first.

Adam Lopez's implementation writes the suffix array to binary files, given the parallel training data and word alignment. The Moses toolkit has a wrapper script which simplifies this process:

    ./scripts/training/wrappers/adam-suffix-array/ \
           [path to cdec/python/pkg] \
           [source corpus] \
           [target corpus] \
           [word alignment] \
           [output suffix array directory] \
           [output glue rules]

WARNING - This requires a lot of memory (approximately 10GB for a parallel corpus of 15 million sentence pairs)

Once the suffix array has been created, run another Moses wrapper script to extract the translation rules required for a particular set of input sentences.

     ./scripts/training/wrappers/adam-suffix-array/ \
           [suffix array directory from previous command] \
           [input sentences] \   
           [output rules directory] \
           [number of jobs]

This command creates one file for each input sentence, containing just the rules required to decode that sentence, e.g.

    # ls filtered.5/
    grammar.0.gz	grammar.3.gz	grammar.7.gz
    grammar.1.gz	grammar.4.gz	grammar.8.gz
    grammar.10.gz	grammar.5.gz	grammar.9.gz ....

Note - these files are gzipped, and the rules are in the Hiero format rather than the Moses format, e.g.

    # zcat filtered.5/grammar.out.0.gz | head -1
    [X] ||| monsieur [X,1] ||| mr [X,1] ||| 0.178069829941 2.04532289505 1.8692317009 0.268405526876 0.160579100251 0.0 0.0 ||| 0-0

To use these rules in the decoder, put this into the ini file

    PhraseDictionaryALSuffixArray name=TranslationModel0 table-limit=20 \
       num-features=7 path=[path-to-filtered-dir] input-factor=0 output-factor=0
    PhraseDictionaryMemory name=TranslationModel1 num-features=1 \
       path=[path-to-glue-grammar] input-factor=0 output-factor=0

Using the EMS

Adam Lopez's suffix array implementation is integrated into the EMS, where all of the above commands are executed for you. Add the following line to your EMS config file:

   suffix-array = [pycdec package path]
   # e.g.
   # suffix-array = /home/github/cdec/python/pkg

and the EMS will use the suffix array instead of the usual Moses rule extraction algorithms.

You can also have multiple extractors running at once

   sa_extractors = 8

WARNING: currently pycdec simply forks itself N times, so this requires N times more memory. Be careful about the interaction with multiple parallel evaluations in the EMS and large suffix arrays.

Fuzzy Match Rule Table for Hierarchical Models

Another method of extracting rules from parallel data is described in (Koehn, Senellart, 2010-1 AMTA) and (Koehn, Senellart, 2010-2 AMTA).

To use this extraction method in the decoder, add this to the moses.ini file:

    PhraseDictionaryFuzzyMatch source=<source/path> target=<target/path> alignment=<alignment/path>

It has not yet been integrated into the EMS.

Note: The translation rules generated by this algorithm are intended for the chart decoder. They cannot be used in the phrase-based decoder.

Translation Model Combination

You can combine several phrase tables by linear interpolation or instance weighting using the scripts in contrib/tmcombine/, or by fill-up using the scripts in contrib/combine-ptables/.

Linear Interpolation and Instance Weighting

Linear interpolation works with any models; for instance weighting, models need to be trained with the option -write-lexical-counts so that all sufficient statistics are available. You can set corpus weights by hand, and instance weighting with uniform weights corresponds to a concatenation of your training corpora (except for differences in word alignment).
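
Linear interpolation of phrase tables can be sketched as follows. This is a toy model treating each table as a map from phrase pairs to a single probability; the actual tmcombine scripts handle all four translation features, alignments, and normalization.

```python
def interpolate(tables, weights):
    """Linearly interpolate phrase-table probabilities:
    p(t|s) = sum_i w_i * p_i(t|s), with p_i = 0 when a pair is absent."""
    combined = {}
    for pair in set().union(*tables):
        combined[pair] = sum(w * t.get(pair, 0.0)
                             for w, t in zip(weights, tables))
    return combined
```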

You can also set weights automatically so that perplexity on a tuning set is minimized. To obtain a tuning set from a parallel tuning corpus, use the Moses training pipeline to automatically extract a list of phrase pairs. The file model/extract.sorted.gz is in the right format.

An example call, which combines test/model1 and test/model2 with instance weighting (-m counts), uses test/extract as the development set for perplexity minimization, and writes the combined phrase table to test/phrase-table_test5:

    python combine_given_tuning_set test/model1 test/model2 \
        -m counts -o test/phrase-table_test5 -r test/extract

More information is available in (Sennrich, 2012 EACL) and contrib/tmcombine/


Fill-up Combination

This combination technique is useful when the relevance of the models is known a priori: typically, when one is trained on in-domain data and the others on out-of-domain data.

Fill-up preserves all the entries and scores coming from the first model, and adds entries from the other models only if new. Moreover, a binary feature is added for each additional table to denote the provenance of an entry. These binary features work as scaling factors that can be tuned directly by MERT along with other models' weights.

Fill-up can be applied to both translation and reordering tables.

Example call, where ptable0 is the in-domain model:

    perl --mode=fillup ptable0 ptable1 ... ptableN > ptable-fillup

More information is available in (Bisazza et al., 2011 IWSLT) and contrib/combine-ptables/

Online Translation Model Combination (Multimodel phrase table type)

In addition to the log-linear combination of translation models, Moses supports further methods to combine multiple translation models into a single virtual model, which is then passed to the decoder. The combination is performed at decoding time.

In the config, add a feature PhraseDictionaryMultiModel, which refers to its components as follows:

 0 T 2 [or whatever the zero-based index of PhraseDictionaryMultiModel is]

 PhraseDictionaryMemory tuneable=false num-features=4 input-factor=0 output-factor=0 path=/path/to/model1/phrase-table.gz table-limit=20
 PhraseDictionaryMemory tuneable=false num-features=4 input-factor=0 output-factor=0 path=/path/to/model2/phrase-table.gz table-limit=20
 PhraseDictionaryMultiModel num-features=4 input-factor=0 output-factor=0 table-limit=20 mode=interpolate lambda=0.2,0.8 components=PhraseDictionaryMemory0,PhraseDictionaryMemory1


 PhraseDictionaryMemory0= 0 0 1 0
 PhraseDictionaryMemory1= 0 0 1 0
 PhraseDictionaryMultiModel0= 0.2 0.2 0.2 0.2

As component models, PhraseDictionaryMemory, PhraseDictionaryBinary and PhraseDictionaryCompact are supported (you may mix them freely). Set the key tuneable=false for all component models; their weights are only used for table-limit pruning, so we recommend 0 0 1 0 (which means p(e|f) is used for pruning).

There are two additional valid options for PhraseDictionaryMultiModel: mode and lambda. The only mode supported so far is interpolate, which linearly interpolates all component models and passes the results to the decoder as if they came from a single model. Results are identical to offline interpolation with -mode interpolate, except for pruning and rounding differences. The weights for each component model can be configured through the key lambda. The number of weights must be one per model, or one per model per feature.

Weights can also be set for each sentence during decoding through mosesserver by passing the parameter lambda. See contrib/server/ for an example. Sentence-level weights override those defined in the config.

With a running Moses server instance, the weights can also be optimized on a tuning set of phrase pairs, using perplexity minimization. This is done with the XMLRPC method optimize and the parameter phrase_pairs, which is an array of phrase pairs, each phrase pair being an array of two strings. For an example, consult contrib/server/ Online optimization depends on the dlib library, and requires Moses to be compiled with the flag --with-dlib=/path/to/dlib. Note that optimization returns a weight vector, but does not affect the running system. To use the optimized weights, either update the moses.ini and restart the server, or pass the optimized weights as a parameter for each sentence.

Online Computation of Translation Model Features Based on Sufficient Statistics

With default phrase tables, only linear interpolation can be performed online. Moses also supports computing translation probabilities and lexical weights online, based on a (weighted) combination of the sufficient statistics from multiple corpora, i.e. phrase and word (pair) frequencies.

As preparation, the training option --write-lexical-counts must be used when training the translation model. Then, use the script scripts/training/ to convert the phrase tables into phrase tables that store phrase (pair) frequencies as their feature values.

  scripts/training/ /path/to/model/phrase-table.gz /path/to/model

The format for the translation tables in the moses.ini is similar to that of the Multimodel type, but uses the feature type PhraseDictionaryMultiModelCounts and additional parameters to specify the component models. Four parameters are required: components, target-table, lex-f2e and lex-e2f. The files required for the first two are created by the conversion script above, the last two during training of the model with --write-lexical-counts. Binarized/compacted tables are also supported (as for PhraseDictionaryMultiModel). Note that for the target count tables, phrase table filtering needs to be disabled (filterable=false).

 0 T 4 [or whatever the zero-based index of PhraseDictionaryMultiModelCounts is]

 PhraseDictionaryMemory tuneable=false num-features=3 input-factor=0 output-factor=0 path=/path/to/model1/count-table.gz table-limit=20
 PhraseDictionaryMemory tuneable=false num-features=3 input-factor=0 output-factor=0 path=/path/to/model2/count-table.gz table-limit=20

 PhraseDictionaryMemory tuneable=false filterable=false num-features=1 input-factor=0 output-factor=0 path=/path/to/model1/count-table-target.gz
 PhraseDictionaryMemory tuneable=false filterable=false num-features=1 input-factor=0 output-factor=0 path=/path/to/model2/count-table-target.gz

 PhraseDictionaryMultiModelCounts num-features=4 input-factor=0 output-factor=0 table-limit=20 mode=instance_weighting lambda=1.0,10.0 components=PhraseDictionaryMemory0,PhraseDictionaryMemory1 target-table=PhraseDictionaryMemory2,PhraseDictionaryMemory3 lex-e2f=/path/to/model1/lex.counts.e2f,/path/to/model2/lex.counts.e2f lex-f2e=/path/to/model1/lex.counts.f2e,/path/to/model2/lex.counts.f2e

 PhraseDictionaryMemory0= 1 0 0
 PhraseDictionaryMemory1= 1 0 0
 PhraseDictionaryMemory2= 1 
 PhraseDictionaryMemory3= 1
 PhraseDictionaryMultiModelCounts0= 0.00402447059454402 0.0685647475075862 0.294089113124688 0.0328320356515851

Setting and optimizing weights is done as for the Multimodel phrase table type, but the supported modes are different. The weights of the component models are only used for table-limit pruning; the weights 1 0 0 (pruning by phrase pair frequency) are recommended.

The following modes are implemented:

  • instance_weighting: weights are applied to the sufficient statistics (i.e. the phrase (pair) frequencies), not to model probabilities. Results are identical to offline optimization with -mode counts, except for pruning and rounding differences.
  • interpolate: both phrase and word translation probabilities (the latter being used to compute lexical weights) are linearly interpolated. This corresponds to offline combination with -mode interpolate and -recompute-lexweights.
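
The difference between the two modes can be illustrated on toy count tables (the function names are ours, and the sketch covers p(t|s) only, ignoring lexical weights):

```python
def p_instance_weighting(counts, weights, src, tgt):
    """Combine sufficient statistics first, then normalize:
    p(t|s) = sum_i w_i * c_i(s,t) / sum_i w_i * c_i(s)."""
    joint = sum(w * c.get((src, tgt), 0) for w, c in zip(weights, counts))
    marg = sum(w * sum(v for (s, _), v in c.items() if s == src)
               for w, c in zip(weights, counts))
    return joint / marg if marg else 0.0

def p_interpolate(counts, weights, src, tgt):
    """Normalize each model separately, then interpolate:
    p(t|s) = sum_i w_i * p_i(t|s) (weights should sum to 1)."""
    total = 0.0
    for w, c in zip(weights, counts):
        marg = sum(v for (s, _), v in c.items() if s == src)
        if marg:
            total += w * c.get((src, tgt), 0) / marg
    return total
```

On the same data the two modes generally disagree: instance weighting lets the corpus with more occurrences of the source phrase dominate, while interpolation gives each model's conditional distribution equal say.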

Alternate Weight Settings

Note: this functionality currently does not work with multi-threaded decoding.

You may want to translate some sentences with different weight settings than others, due to significant differences in genre, text type, or style, or to have separate settings for headlines and questions.

Moses allows you to specify alternate weight settings in the configuration file, e.g.:

 id=strong-lm
 Distortion0= 0.1
 LexicalReordering0= 0.1 0.1 0.1 0.1 0.1 0.1
 LM0= 1
 WordPenalty0= 0
 TranslationModel0= 0.1 0.1 0.1 0.1 0

This example specifies a weight setting with the identifying name strong-lm.

When translating a sentence, the default weight setting is used, unless the use of an alternate weight setting is specified with an XML tag:

 <seg weight-setting="strong-lm">This is a small house .</seg>

This functionality also allows for the selective use of feature functions and decoding graphs (unless decomposed factored models are used, a decoding graph corresponds to a translation table).

Feature functions can be turned off by adding the parameter ignore-ff to the identifier line (names of feature functions, separated by comma), decoding graphs can be ignored with the parameter ignore-decoding-path (number of decoding paths, separated by comma).

Note that with these additional options all the capability of the previously (pre-2013) implemented "Translation Systems" is provided. You can even have one configuration file and one Moses process to translate two different language pairs that share nothing but basic features.

See the example below for a complete configuration file with exactly this setup. In this case, the default weight setting is not useful since it mixes translation models and language models from both language pairs.


 # mapping steps
 0 T 0
 1 T 1


 # feature functions
 PhraseDictionaryBinary name=TranslationModel0 num-features=5 \ 
    path=/path/to/french-english/phrase-table output-factor=0 
 LexicalReordering num-features=6 name=LexicalReordering0 \ 
    type=wbe-msd-bidirectional-fe-allff input-factor=0 output-factor=0
 KENLM name=LM0 order=5 factor=0 path=/path/to/french-english/language-model lazyken=0
 PhraseDictionaryBinary name=TranslationModel1 num-features=5 \ 
    path=/path/to/german-english/phrase-table output-factor=0 
 LexicalReordering num-features=6 name=LexicalReordering1 \ 
    type=wbe-msd-bidirectional-fe-allff input-factor=0 output-factor=0
 KENLM name=LM1 order=5 factor=0 path=/path/to/german-english/language-model lazyken=0

 # core weights - not used 
 Distortion0= 0
 WordPenalty0= 0
 TranslationModel0= 0 0 0 0 0
 LexicalReordering0= 0 0 0 0 0 0
 LM0= 0
 TranslationModel1= 0 0 0 0 0
 LexicalReordering1= 0 0 0 0 0 0
 LM1= 0

 id=fr ignore-ff=LM1,LexicalReordering1 ignore-decoding-path=1
 Distortion0= 0.155
 LexicalReordering0= 0.074 -0.008 0.002 0.050 0.033 0.042
 LM0= 0.152
 WordPenalty0= -0.097
 TranslationModel0= 0.098 0.065 -0.003 0.060 0.156
 id=de ignore-ff=LM0,LexicalReordering0 ignore-decoding-path=0
 LexicalReordering1= 0.013 -0.012 0.053 0.116 0.006 0.080
 Distortion0= 0.171
 LM1= 0.136
 WordPenalty0= 0.060
 TranslationModel1= 0.112 0.160 -0.001 0.067 0.006

With this model, you can translate:

 <seg weight-setting=de>Hier ist ein kleines Haus .</seg>
 <seg weight-setting=fr>C' est une petite maison . </seg>

Open Machine Translation Core (OMTC) - A proposed machine translation system standard

A proposed standard for machine translation APIs has been developed as part of the MosesCore project (European Commission Grant Number 288487 under the 7th Framework Programme). It is called Open Machine Translation Core (OMTC) and defines a service interface for MT systems. This approach allows software engineers to wrap disparate MT back-ends such that they look identical to others, no matter which flavour of MT system is being wrapped. This provides a standard protocol for “talking” to MT back-ends. In applications where many MT back-ends are to be used, OMTC allows for easier integration of these back-ends. Even in applications where one MT back-end is used, OMTC provides highly cohesive, yet loosely coupled, interfaces that should allow the back-end to be replaced by another with little effort.

OMTC standardises the following aspects of an MT system:

  • Resources: A resource is an object that is provided or constructed by a user action for use in an MT system. Examples of resources are: a translation memory, a glossary, an MT engine, or a document. Two base resource types are defined, from which all other resource types are derived: primary and derived resources. Primary resources are resources which are constructed outside of the MT system and are made available to it, e.g., through an upload action. Primary resources are used to define mono- and multi-lingual resources, translation memories and glossaries. Derived resources, on the other hand, are ones which have been constructed by user action inside of the MT system, e.g., an SMT engine.
  • Sessions: A session is the period of time in which a user interacts with the MT system. The session interface hierarchy supports both user identity and anonymity. Mixin interfaces are also defined to integrate with any authentication system.
  • Session Negotiation: This is an optional part of the standard and, if used, shall allow a client and the MT server to come to an agreement about which features, resources (including exchange and document formats), pre-requisites (e.g. payment) and API version support are to be expected from both parties. If no agreement can be reached then the client's session should be torn down, but this is completely application defined.
  • Authorisation: OMTC can integrate with an authorisation system that may be in use in an MT system. It allows users and roles to be mapped into the API.
  • Machine Translation Engines: Machine translation engines are derived resources which are capable of performing machine translation of, possibly, unseen sentences. An engine may be an SMT decoding pipeline, for instance. It is application defined how this part of the API is implemented. Optionally, engine functionality can be mixed in in order to add the following operations: composition, evaluation, parameter updating, querying, (re-)training, testing and updating. Potentially long-running tasks return tickets in order for the application to track these tasks.
  • Translators: Translators, as defined in OMTC, are a derived resource: a conglomeration of at least one of an MT engine, a collection of translation memories, and a collection of glossaries. The translator interface provides methods for translation that return tickets, due to the long-running nature of these tasks.
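To make the resource and ticket concepts above concrete, here is a purely illustrative Python sketch. This is NOT the OMTC API itself (the real interfaces are the Java ones in contrib/omtc); all class and attribute names below are hypothetical.

```python
# Illustrative sketch of OMTC-style resources and tickets (hypothetical names,
# not the actual OMTC interfaces; see contrib/omtc for the Java definitions).
import itertools

_tickets = itertools.count(1)

class PrimaryResource:
    """Constructed outside the MT system, e.g. an uploaded corpus."""
    def __init__(self, name):
        self.name = name

class DerivedResource:
    """Constructed by user action inside the MT system, e.g. a trained engine."""
    def __init__(self, name):
        self.name = name

class Ticket:
    """Handle returned by potentially long-running tasks."""
    def __init__(self):
        self.id = next(_tickets)
        self.result = None

class Translator(DerivedResource):
    """A conglomeration of an MT engine plus optional memories and glossaries."""
    def __init__(self, name, engine, memories=(), glossaries=()):
        super().__init__(name)
        self.engine = engine          # callable: source text -> target text
        self.memories = list(memories)
        self.glossaries = list(glossaries)

    def translate(self, text):
        # In a real system this is long-running; the client polls the ticket.
        ticket = Ticket()
        ticket.result = self.engine(text)
        return ticket
```

The sketch only mirrors the shape of the standard: translators are derived resources, and translation returns a ticket rather than a result directly.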

A reference implementation of OMTC has been constructed in Java v1.7. It is available in the contrib/omtc directory of the mosesdecoder repository as a Git submodule. Please see the contrib/omtc/README for details.

Pipeline Creation Language (PCL)

Building pipelines can be tedious and error-prone. Using Moses scripts to build pipelines can be hampered by the fact that scripts need to be able to parse the output of the previous script. Moving scripts to different positions in the pipeline is tricky and may require a code change! It would be better if the scripts were re-usable without change and users can start to build up a library of computational pieces that can be used in any pipeline in any position.

Since pipelines are widely used in machine translation, and given the problem outlined above, a more convenient and less error-prone way of building pipelines quickly, with re-usable components, would aid construction.

A domain specific language called Pipeline Creation Language (PCL) has been developed as part of the MosesCore project (European Commission Grant Number 288487 under the 7th Framework Programme). PCL enables users to gather components into libraries, or packages, and re-use them in pipelines. Each component defines inputs and outputs which are checked by the PCL compiler to verify components are compatible with each other.

PCL is a general purpose language that can be used to construct non-recurrent software pipelines. In order to adapt your existing programs and scripts for use with PCL, a Python wrapper must be defined for each program. This builds up a library of components which can be combined with others in PCL files. The Python wrapper scripts must implement the following function interface:

  • get_name() - Returns an object representing the name of the component. The __str__() function should be implemented to return a meaningful name.
  • get_inputs() - Returns the inputs of the component. Components should only be defined with one input port. A list of input names must be returned.
  • get_outputs() - Returns the outputs of the component. Components should only be defined with one output port. A list of output names must be returned.
  • get_configuration() - Returns a list of names that represent the static data that shall be used to construct the component.
  • configure(args) - This function is the component designer's chance to preprocess configuration injected at runtime. The args parameter is a dictionary that contains all the configuration provided to the pipeline. This function is to filter out, and optionally preprocess, the configuration used by this component. This function shall return an object containing the configuration necessary to construct this component.
  • initialise(config) - This function is where the component designer defines the component's computation. It receives the output object from the configure() function and must return a function that takes two parameters: an input object and a state object. The input object is a dictionary received from the previous component in the pipeline, and the state object is the configuration for the component. The returned function defines the component's computation.

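As an illustrative sketch, a wrapper implementing this interface might look as follows. The component itself (a "lowercaser"), its name, and its configuration keys are hypothetical; it is not one of the shipped Moses wrappers.

```python
# Hypothetical PCL component wrapper implementing the required interface.
# The component name ("lowercaser") and its behaviour are illustrative only.

class _Name:
    def __str__(self):
        # get_name() must return an object with a meaningful __str__().
        return "lowercaser"

def get_name():
    return _Name()

def get_inputs():
    # One input port carrying a single value named "text".
    return ["text"]

def get_outputs():
    # One output port carrying a single value named "lowered_text".
    return ["lowered_text"]

def get_configuration():
    # Static configuration names used to construct the component.
    return ["lowercaser.enabled"]

def configure(args):
    # Filter the pipeline-wide configuration down to what this component needs.
    return {"enabled": args.get("lowercaser.enabled", True)}

def initialise(config):
    # Return the function that performs the component's computation.
    def compute(inputs, state):
        text = inputs["text"]
        lowered = text.lower() if config["enabled"] else text
        return {"lowered_text": lowered}
    return compute
```

Note how configure() pre-processes the injected configuration and initialise() closes over its result to produce the computation function.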
Once your library of components has been written, the components can be combined using the PCL language. A PCL file defines one component which uses other defined components. For example, the following file defines a component that performs tokenisation for source and target files.

 # Component definition: 2 input ports, 2 output ports
 #                 +---------+
 # src_filename -->+         +--> tokenised_src_filename
 #                 |         |
 # trg_filename -->+         +--> tokenised_trg_filename
 #                 +---------+
 import wrappers.tokenizer.tokenizer as tokeniser

 component src_trg_tokeniser
  inputs (src_filename), (trg_filename)
  outputs (tokenised_src_filename), (tokenised_trg_filename)
  configuration tokeniser.src.language,
    tokeniser.trg.language,
    tokeniser.src.tokenisation_dir,
    tokeniser.trg.tokenisation_dir,
    tokeniser.moses.installation
  declare
    src_tokeniser := new tokeniser with
      tokeniser.src.language -> language,
      tokeniser.src.tokenisation_dir -> tokenisation_dir,
      tokeniser.moses.installation -> moses_installation_dir
    trg_tokeniser := new tokeniser with
      tokeniser.trg.language -> language,
      tokeniser.trg.tokenisation_dir -> tokenisation_dir,
      tokeniser.moses.installation -> moses_installation_dir
  as
    wire (src_filename -> filename),
         (trg_filename -> filename) >>>
    (src_tokeniser *** trg_tokeniser) >>>
    wire (tokenised_filename -> tokenised_src_filename),
         (tokenised_filename -> tokenised_trg_filename)

A PCL file is composed of the following bits:

  • Imports: Optional imports can be specified. Note that all imported components must be given an alias; in this case the component wrappers.tokenizer.tokenizer shall be referenced in this file by the name tokeniser.
  • Component: This starts the component definition and provides the name. The component's name must be the same as the filename. E.g., a component in fred.pcl must be called fred.
  • Inputs: Defines the inputs of the component. The example above defines a component with a two port input. Specifying a comma-separated list of names defines a one port input.
  • Outputs: Defines the outputs of the component. The example above defines a component with a two port output. Specifying a comma-separated list of names defines a one port output.
  • Configuration: Optional configuration for the component. This is static data that shall be used to construct components used in this component.
  • Declarations: Optional declarations of components used in this component. Configuration is used to construct the imported components.
  • Definition: The as portion of the component definition is an expression which defines how the constructed components are to be combined to create the computation required for the component.

The definition of a component can use the following pre-defined components:

  • first - This component takes one expression with a one port input and creates a two port input and output component. The provided component is applied only to the first port of the input.
  • second - This component takes one expression with a one port input and creates a two port input and output component. The provided component is applied only to the second port of the input.
  • split - Split is a component with one input port and two output ports. The value of both outputs is the input, i.e., splitting the input.
  • merge - Merge values from the two port input to a one port output. A comma-separated list of top and bottom keywords subscripted with input names is used to map these values to new names. E.g., merge top[a] -> top_a, bottom[b] -> bottom_b takes the a value of the top input and maps it to the new name top_a, and the b value of the bottom input and maps it to the new name bottom_b.
  • wire - Wires are used to adapt one component's output to another's input. For wires with one input and output port, the wire mapping is a comma-separated mapping, e.g., wire a -> next_a, b -> next_b adapts a one port output component whose outputs are a and b to a one port component whose inputs are next_a and next_b. For wires with two input and output ports, mappings are in comma-separated parentheses, e.g., wire (a -> next_a, b -> next_b), (c -> next_c, d -> next_d). This wire adapts the top input from a to next_a and b to next_b, and the bottom input from c to next_c and d to next_d.
  • if - Conditional execution of a component can be achieved with the if component. This component takes three arguments: a conditional expression, a then component and an else component. If the condition is evaluated to a truthy value the then component is executed, otherwise the else component is executed. See the conditional example in the PCL Git repository for an example of usage.

Combinator operators are used to compose the pipeline. They are:

  • >>> - Composition. This operator composes two components. E.g., a >>> b creates a component in which a is executed before b.
  • *** - Parallel execution. This operator creates a component in which the two components provided are executed in parallel. E.g., a *** b creates a component with two input and output ports.
  • &&& - Parallel execution. The operator creates a component in which two components are executed in parallel from a single input port. E.g., a &&& b creates a component with one input port and two output ports.
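As a rough illustration of the data flow these operators describe, they can be modelled in Python with components as functions from inputs to outputs. This only mirrors the semantics informally (the real implementation is the PCL compiler and runtime), and the helper names are assumptions.

```python
# Informal Python model of the PCL combinators and the wire component.
# Components are functions; two-port values are modelled as 2-tuples.

def compose(a, b):
    # a >>> b : run a, then feed its output to b.
    return lambda x: b(a(x))

def parallel(a, b):
    # a *** b : two ports in, two ports out; a gets the top, b the bottom.
    return lambda pair: (a(pair[0]), b(pair[1]))

def fanout(a, b):
    # a &&& b : one port in, two ports out; both receive the same input.
    return lambda x: (a(x), b(x))

def wire(mapping):
    # wire old -> new, ... : rename values between components.
    return lambda d: {new: d[old] for old, new in mapping.items()}
```

With a toy tokeniser component, the src_trg_tokeniser example above then reads as: wire each filename into the tokeniser's expected input name, run two tokenisers in parallel, and wire the results out.

```python
tok = lambda d: {"tokenised_filename": d["filename"] + ".tok"}
pipeline = compose(
    parallel(wire({"src_filename": "filename"}),
             wire({"trg_filename": "filename"})),
    parallel(tok, tok))
```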

Examples in the PCL Git repository show the usage of these operators and pre-defined components. An example Moses training pipeline is also available in the contrib/arrow-pipelines directory of the mosesdecoder Git repository. Please see contrib/arrow-pipelines/README for details of how to compile and run this pipeline.

For more details of how to use PCL, please see the latest manual at



Placeholders

Placeholders are symbols that replace a word or phrase. For example, numbers ('42.85') can be replaced with the symbol '@num@'. Other words and phrases that can potentially be replaced with placeholders include dates, times, and named entities.

This helps during training, since sparse numbers are replaced with the more frequent placeholder symbol, producing more reliable statistics for the MT models.

The same reasoning applies during decoding: a raw number is often an unknown symbol to the phrase table and language models. Unknown symbols are translated as single words, losing the advantage of phrasal translation. The reordering of unknown symbols can also be unreliable, as we have no statistics for them.

However, two issues arise when using placeholders:

    1. The original word or phrase must still be translated. In the example, '42.85' should be translated; if the language pair is en-fr, it may be translated as '42,85'.
    2. How do we insert this translation into the output if the word has been replaced with the placeholder?

Moses has support for placeholders in training and decoding.


Training

When preparing your data, process the data with the script

   scripts/generic/ph_numbers.perl -c

The script was designed to run after tokenization, that is, instead of tokenizing like this:

   cat [RAW-DATA] | ./scripts/tokenizer/tokenizer.perl -a -l en > TOK-DATA

do this:

   cat [RAW-DATA] | ./scripts/tokenizer/tokenizer.perl -a -l en | scripts/generic/ph_numbers.perl -c > TOK-DATA

Do this for both source and target language, and for both parallel and monolingual data.

The script will replace numbers with the symbol @num@.

NB. This script is currently very simple and language-independent. It could be improved to create better translations.
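For illustration, the core substitution the script performs can be sketched in Python as follows. This is a simplified re-implementation, not the actual Perl script; the regular expression and function name are assumptions.

```python
# Simplified sketch of a ph_numbers-style replacement (not the real script).
import re

# Matches integers and numbers with '.' or ',' decimal/group separators.
NUM = re.compile(r'\b\d+(?:[.,]\d+)*\b')

def replace_numbers(line, corpus_mode=True):
    """With corpus_mode (the -c flag): plain substitution for training data.
    Without it: keep the original number as an XML entry for the decoder."""
    if corpus_mode:
        return NUM.sub('@num@', line)
    return NUM.sub(
        lambda m: '<ne translation="@num@" entity="%s">@num@</ne>' % m.group(0),
        line)
```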

During extraction, add the following to the extract command (phrase-based only for now):

   ./extract --Placeholders @num@ ....

This will discard any extracted translation rules which are inconsistent with the placeholders. That is, every placeholder must be aligned 1-to-1 with a placeholder in the other language.
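The consistency condition can be illustrated with a small Python sketch. This is a simplified check, not the actual extract code; the function name and the alignment representation (a set of source-target index pairs) are assumptions.

```python
# Simplified illustration of the 1-to-1 placeholder consistency check
# (not the actual extract implementation).

def placeholders_consistent(src_tokens, trg_tokens, alignment, ph='@num@'):
    """Return True iff every placeholder token is aligned to exactly one
    token on the other side, and that token is also the placeholder."""
    for i, tok in enumerate(src_tokens):
        if tok == ph:
            links = [j for (s, j) in alignment if s == i]
            if len(links) != 1 or trg_tokens[links[0]] != ph:
                return False
    for j, tok in enumerate(trg_tokens):
        if tok == ph:
            links = [s for (s, t) in alignment if t == j]
            if len(links) != 1 or src_tokens[links[0]] != ph:
                return False
    return True
```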


Decoding

The input sentence must also be processed with the placeholder script to replace numbers with the placeholder symbol. However, don't add the -c argument, so that the original number is retained in the output as an XML entry. For example,

    echo "you owe me $ 100 ." | ./ph_numbers.perl

will output

   you owe me $ <ne translation="@num@" entity="100">@num@</ne> .

Add this to the decoder command when executing the decoder (phrase-based only for now):

   ./moses  -placeholder-factor 1 -xml-input exclusive

The factor must NOT be one which is being used by the source side of the translation model. For vanilla models, only factor 0 is used.

The argument -xml-input can be any permitted value, except 'pass-through'.

The output from the decoder will contain the number, not the placeholder. This is the case in both the best output and the n-best list.


EMS

The above changes can be added to the EMS config file.

For my (Hieu's) experiments, these are the changes I made:

   1. In the [GENERAL] section, change
         input-tokenizer = "$misc-script-dir/normalize-punctuation.perl $input-extension | $moses-script-dir/tokenizer/tokenizer.perl -a -l $input-extension"
      to
         input-tokenizer = "$misc-script-dir/normalize-punctuation.perl $input-extension | $moses-script-dir/tokenizer/tokenizer.perl -a -l $input-extension | $moses-script-dir/generic/ph_numbers.perl -c"

      and change
         output-tokenizer = "$misc-script-dir/normalize-punctuation.perl $output-extension | $moses-script-dir/tokenizer/tokenizer.perl -a -l $output-extension"
      to
         output-tokenizer = "$misc-script-dir/normalize-punctuation.perl $output-extension | $moses-script-dir/tokenizer/tokenizer.perl -a -l $output-extension | $moses-script-dir/generic/ph_numbers.perl -c"

   2. In the [TRAINING] section, add
        extract-settings = "--Placeholders @num@"

   3. In the [TUNING] section, change
         decoder-settings = "-threads 8"
      to
         decoder-settings = "-threads 8 -placeholder-factor 1 -xml-input exclusive"
      And in the [EVALUATION] section, change
         decoder-settings = "-mbr -mp -search-algorithm 1 -cube-pruning-pop-limit 5000 -s 5000 -threads 8"
      to
         decoder-settings = "-mbr -mp -search-algorithm 1 -cube-pruning-pop-limit 5000 -s 5000 -threads 8 -placeholder-factor 1 -xml-input exclusive"

   4. In the [EVALUATION] section, add
        input-tokenizer = "$misc-script-dir/normalize-punctuation.perl $input-extension | $moses-script-dir/tokenizer/tokenizer.perl -a -l $input-extension | $moses-script-dir/generic/ph_numbers.perl"
        output-tokenizer = "$misc-script-dir/normalize-punctuation.perl $output-extension | $moses-script-dir/tokenizer/tokenizer.perl -a -l $output-extension"



Results

This was tested on some experiments trained with Europarl data. It generally did not have a positive effect on the BLEU score, in most cases reducing it slightly.

However, it may still be helpful to users who translate text with lots of numbers or dates etc. Also, the recognizer script could be improved.

         baseline: 24.59
         with placeholder: 24.68
         baseline: 23.00
         with placeholder: 22.84
         baseline: 11.05
         with placeholder: 10.62
         baseline: 15.80
         with placeholder: 15.62

Modified Moore-Lewis Filtering

When you have a lot of out-of-domain data and you do not want to use all of it, then you can filter down that data to the parts that are more similar to the in-domain data. Moses implements a method called modified Moore-Lewis filtering. The method trains in-domain and out-of-domain language models, and removes sentence pairs that receive relatively low scores from the in-domain model. For more details, please refer to the following paper:

Axelrod, Amittai, He, Xiaodong and Gao, Jianfeng: Domain Adaptation via Pseudo In-Domain Data Selection. Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP).

The Moses implementation is integrated into EMS. You have to specify in-domain and out-of-domain data in separate CORPUS sections (you can have more than one of each), and then set in the configuration file which out-of-domain corpora need to be filtered:

 ### filtering some corpora with modified Moore-Lewis
 mml-filter-corpora = giga
 mml-before-wa = "-proportion 0.2"
 #mml-after-wa = "-proportion 0.2"

There are two different points at which the filtering can be done: either before or after word alignment. There may be some benefit in having the out-of-domain data available to improve word alignment, but that may also be computationally too expensive. In the configuration file, you specify the proportion of the out-of-domain data that will be retained: in the example above, 20% will be kept and 80% will be thrown out.
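The selection criterion can be sketched as follows, assuming per-sentence cross-entropies under the in-domain and out-of-domain language models have already been computed (a simplified illustration, not the EMS implementation; the function name is an assumption):

```python
# Simplified sketch of modified Moore-Lewis selection: rank out-of-domain
# sentences by H_in(s) - H_out(s), where H_in and H_out are per-sentence
# cross-entropies under the in-domain and out-of-domain language models.
# Lower difference = more in-domain-like; keep the given proportion.

def mml_filter(sentences, h_in, h_out, proportion=0.2):
    order = sorted(range(len(sentences)), key=lambda i: h_in[i] - h_out[i])
    keep = order[:max(1, int(len(sentences) * proportion))]
    # Return kept sentences in their original corpus order.
    return [sentences[i] for i in sorted(keep)]
```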

Constrained Decoding

To constrain the output of the decoder to just the reference sentences, add this as a feature:

    ConstrainedDecoding path=ref.txt
Page last modified on July 30, 2014, at 01:08 PM