
Obsolete Features

Contents

  • Binary Phrase table
  • Word-to-word alignment
  • Binary Reordering Tables with On-demand Loading
  • Continue Partial Translation
  • Distributed Language Model
  • Using Multiple Translation Systems in the Same Server

Binary Phrase table

Note - You should NOT use this phrase-table anymore. The program that creates it is no longer compiled by Moses, and support for it will be removed from the decoder in the near future.

Note 2 - This works with phrase-based models only.

You have to convert the standard ASCII phrase table into binary format. Here is an example (standard phrase table phrase-table, with 4 scores):

  cat phrase-table | LC_ALL=C sort | bin/processPhraseTable \
   -ttable 0 0 - -nscores 4 -out phrase-table

Options:

  • -ttable int int string -- translation table file, use '-' for stdin
  • -out string -- output file name prefix for binary translation table
  • -nscores int -- number of scores in translation table

If you just want to convert a phrase table, the two integers in the -ttable option do not matter, so use 0's.

Important: If your data is encoded in UTF8, make sure you set the environment variable LC_ALL=C before sorting. If your phrase table is already sorted, you can skip the sorting step.
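
For instance, to do the sort as a separate step beforehand (a minimal sketch; the .sorted file name is just an illustration):

  LC_ALL=C sort phrase-table > phrase-table.sorted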

The output files will be:

 phrase-table.binphr.idx 
 phrase-table.binphr.srctree
 phrase-table.binphr.srcvoc
 phrase-table.binphr.tgtdata
 phrase-table.binphr.tgtvoc

In the Moses configuration file, specify only the file name stem phrase-table as the phrase table and set the type to 1, i.e.:

 [feature]
 PhraseDictionaryBinary path=phrase-table ...
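
For reference, a fuller feature line might look like the following sketch; name, num-features, and the factor settings are standard Moses feature options, but the exact values shown here are illustrative assumptions rather than taken from this page:

 [feature]
 PhraseDictionaryBinary name=TranslationModel0 num-features=4 path=phrase-table input-factor=0 output-factor=0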

Word-to-word alignment

This is on by default, so most of these arguments are not relevant.

There are two arguments to the decoder that enable it to print out word alignment information:

  -alignment-output-file [file]

print out the word alignment for the best translation to a file.

   -print-alignment-info-in-n-best

print the word alignment information of each entry in the n-best list as an extra column in the n-best file.
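
As a rough sketch of how these flags combine on the command line (the file names here are hypothetical):

  moses -f moses.ini -n-best-list nbest.txt 100 \
    -alignment-output-file alignments.txt \
    -print-alignment-info-in-n-best < input.txt > output.txt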

Word alignment is included in the phrase-table by default (as of November 2012). To exclude it, add

   --NoWordAlignment

as an argument to the score program.
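
For example, the phrase scorer from the training pipeline might be invoked along these lines (a sketch only; the file names are hypothetical and follow the usual extract/lexical-table/output convention):

  bin/score extract.sorted lex.f2e phrase-table.half.f2e --NoWordAlignment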

When binarizing the phrase-table, the word alignment is also included by default. To turn this behaviour off for the phrase-based binarizer:

  processPhraseTable -no-alignment-info ....

or, for the compact phrase-table representation:

  processPhraseTableMin -no-alignment-info ....

There is no way to exclude word alignment information from the chart-based binarization process.

Phrase-based binary format

When word alignment information is stored, the two output files ".srctree" and ".tgtdata" will end with the suffix ".wa".

Note: The arguments

   -use-alignment-info
   -print-alignment-info

have been removed from the decoder. -print-alignment-info did nothing. -use-alignment-info is now inferred from the arguments

   -alignment-output-file
   -print-alignment-info-in-n-best

Additionally, the argument

  -include-alignment-in-n-best

has been renamed

  -include-segmentation-in-n-best

to reflect what it actually does.

The word alignment MUST be enabled during binarization; otherwise, the decoder will

  1. complain
  2. carry on blindly, but not print any word alignment

Binary Reordering Tables with On-demand Loading

The reordering tables may also be converted into a binary format. The command is slightly simpler:

 mosesdecoder/bin/processLexicalTable -in reordering-table -out reordering-table

The input and output file names are typically the same, since the actual output files get extensions similar to those of the binary phrase table.
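
The binary table is then referenced by its file name stem in the Moses configuration, along these lines (a sketch; the name, type, and factor values below are illustrative assumptions for a typical bidirectional msd model):

 [feature]
 LexicalReordering name=LexicalReordering0 num-features=6 type=wbe-msd-bidirectional-fe-allff input-factor=0 output-factor=0 path=reordering-table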

Continue Partial Translation

Alert: This functionality has been removed as of May 2013.

This option forces Moses to start generating the translation from a non-empty hypothesis. This can be useful when you have already translated part of the sentence and want to get a suggestion or an n-best list of continuations.

Use -continue-partial-translation (-cpt) to activate this feature. With -cpt, Moses also accepts a special input format: three fields delimited by the triple bar (|||). The first field is the output produced so far (used for LM scoring). The second field is the coverage vector of input words already translated by the output so far, written as a string of "1"s and "0"s of the same length as the number of words in the input sentence. The third field is the source sentence.

Example:

 % echo "that is ||| 11000 ||| das ist ein kleines haus" | moses -f moses.ini -continue-partial-translation
 that is a small house

 % echo "that house ||| 10001 ||| das ist ein kleines haus" | moses -f moses.ini -continue-partial-translation
 that house is a little

If the input does not fit this pattern, it is treated like normal input with no words translated yet.

This type of input is currently not compatible with factored models or confusion networks. The standard non-lexicalized distortion works more or less as one would expect (note that some input coverage vectors may prohibit translation under low distortion limits). The lexicalized reordering has not been tested.

Options

  • -continue-partial-translation (-cpt) -- activate the feature

Distributed Language Model

NB - THIS HAS BEEN REMOVED FROM MOSES (HIEU)

In most cases, MT output improves significantly when more data is used to train the Language Model. More data however produces larger models, and it is very easy to produce a model which cannot be held in the main memory of a single machine. To overcome this, the Language Model can be distributed across many machines, allowing more data to be used at the cost of a performance overhead.

Support for Distributed Language Models in Moses is built on top of a bespoke distributed map implementation called DMap. DMap and support for Distributed Language Models are still in beta, and any feedback or bug reports are welcome.

Installing and Compiling

Before compiling Moses with DMap support, you must configure your DMap setup (see below). Once that has been done, run Moses' configure script with your normal options and --with-dmaplm=<path-to-dmap>, then the usual make, make install.

Configuration

Configuring DMap is, at the moment, a very crude process. One must edit the src/DMap/Config.cpp file by hand and recompile when making any changes. Since the configuration is compiled in, this also means that once it is changed, any programs statically linked to DMap will have to be recompiled too. The file src/DMap/Config.cpp provides a good example configuration which is self-explanatory.

Example

In this example scenario, we have a Language Model trained on the giga4 corpus which we wish to host across 4 servers using DMap. The model is a 5-gram model, containing roughly 210 million ngrams; the probabilities and backoff weights of ngrams will be uniformly quantised to 5-bit values.

Configuration

Here is an example Config.cpp for such a setup:

     config->setShardDirectory("/home/user/dmap");
     config->addTableConfig(new TableConfigLossyDoubleHash(
             "giga4",    // name of table
             283845991,  // number of cells (approx 1.23 * number of ngrams)
             64,         // number of chunks (not too important, leave at 64)
             (((uint64_t)1 << 61) - 1),              // universal hashing P parameter
             5789372245 % (((uint64_t)1 << 61) - 1), // universal hashing a parameter
             3987420741 % (((uint64_t)1 << 61) - 1), // universal hashing b parameter
             "/home/user/dmap/giga4.bf",
             16,         // num_error_bits (higher -> fewer collisions but more memory)
             10,         // num_value_bits (higher -> more accurate probabilities
                         // and backoff weights but more memory)
             20));       // num_hashes (higher -> ...)
     config->addStructConfig(new StructConfigLanguageModelBackoff(
             "giga4",    // struct name
             "giga4",    // lm table name
             5,          // lm order
             5,          // num logprob bits (these fields should add up to the number
                         // of value bits for the table)
             5));        // num backoff bits
     config->addServerConfig(new ServerConfig("server0.some.domain", 5000));
     config->addServerConfig(new ServerConfig("server1.some.domain", 5000));
     config->addServerConfig(new ServerConfig("server2.some.domain", 5000));
     config->addServerConfig(new ServerConfig("server3.some.domain", 5000));
Note that the shard directory should be on a shared file system that all servers can access.

Create Table

The command:

 create_table giga4

will create the files for the shards.

Shard Model

The model can now be split into chunks using the shard utility:

 shard giga4 /home/user/dmap/giga4.arpa

Create Bloom Filter

A Bloom filter is a probabilistic data structure encoding set membership in an extremely space-efficient manner. When querying whether a given item is present in the set they encode, they can produce an error with a calculable probability. This error is one-sided in that they can produce false positives, but never false negatives. To avoid making slow network requests, DMap keeps a local Bloom filter containing the set of ngrams in the Language Model. Before making a network request to get the probability of an ngram, DMap first checks to see if the ngram is present in the Bloom filter. If it is not, then we know for certain the ngram is not present in the model and therefore not worth issuing a network request for. However, if the ngram is present in the filter, it might actually be in the model, or the filter may have produced a false positive.

To create a Bloom filter containing the ngrams of the Language Model, run this command:

 ngrams < /home/user/dmap/giga4.arpa | mkbf 134217728 210000000 /home/user/dmap/giga4.bf

Integration with Moses

The name of the structure within DMap that Moses should use as the Language Model should be put into a file, in this case at /home/user/dmap/giga4.conf:

 giga4
 false

Note that if, for testing or experimentation purposes, you would like to have the whole model on the local machine instead of over the network, change the false to true. You must have sufficient memory to host the whole model, but decoding will be significantly faster.

To use this, put the following line in your moses.ini file:

 11 0 0 5 /home/user/dmap/giga4.conf

Using Multiple Translation Systems in the Same Server

Alert: This functionality has been removed as of May 2013. A replacement is Alternate Weight Settings.

The Moses server is now able to load multiple translation systems within the same server, and the client is able to decide, on a per-sentence basis, which translation system the server should use. The client does this by passing a system argument in the translate operation.

One possible use-case for this multiple models feature is if you want to build a server that translates both French and German into English, and uses a large English language model. Instead of running two copies of the Moses server, each with a copy of the English language model in memory, you can now run one Moses server instance, with the language model in memory, thus saving on RAM.

To use the multiple models feature, you need to make some changes to the standard Moses configuration file. A sample configuration file can be found here.

The first piece of extra configuration required for a multiple models setup is to specify the available systems, for example

 [translation-systems]
 de D 0 R 0 L 0
 fr D 1 R 1 L 1

This specifies that there are two systems (de and fr), and that the first uses decode path 0, reordering model 0, and language model 0, whilst the second uses the models with id 1. The multiple decode paths are specified with a stanza like

 [mapping]
 0 T 0
 1 T 1

which indicates that the 0th decode path uses the 0th translation model, and the 1st decode path uses the 1st translation model. Using a language model specification like

 [lmodel-file]
 0 0 5 /disk4/translation-server/models/interpolated-lm
 0 0 5 /disk4/translation-server/models/interpolated-lm

means that the same language model can be used in two different systems with two different weights, but Moses will only load it once. The weights sections of the configuration file must have the correct numbers of weights for each of the models, and there must be a word penalty and linear distortion weight for each translation system. The lexicalised reordering weights (if any) must be specified in the [weight-lr] stanza, with the distortion penalty in the [weight-d] stanza.
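
As a sketch of how the per-system weight stanzas might be laid out (one value per line, one line per translation system; all numbers are placeholders, and the real counts must match your models):

 [weight-w]
 -1
 -1

 [weight-d]
 0.3
 0.3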
