Note - You should NOT use this phrase-table anymore. The program that creates it is no longer compiled by Moses, and support for it will be removed from the decoder in the near future.
Note 2 - This works with phrase-based models only.
You have to convert the standard ASCII phrase tables into the binary format. Here is an example (standard phrase table phrase-table, with 4 scores):

cat phrase-table | LC_ALL=C sort | bin/processPhraseTable -ttable 0 0 - -nscores 4 -out phrase-table
-ttable int int string -- translation table file; use '-' to read from standard input
-out string -- output file name prefix for binary translation table
-nscores int -- number of scores in translation table
If you just want to convert a phrase table, the two integers in the
-ttable option do not matter, so use 0's.
Important: If your data is encoded in UTF-8, make sure you set the environment variable LC_ALL=C before sorting. If your phrase table is already sorted, you can skip this step.
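To see why the locale matters, here is a small sanity check on a toy phrase table (the entries are made up for illustration); with LC_ALL=C, sort orders lines by raw byte values, so uppercase entries come before lowercase ones:

```shell
# A tiny, deliberately unsorted phrase table (made-up entries).
printf 'zum Beispiel ||| for example ||| 0.5\nZug ||| train ||| 0.8\nzug ||| train ||| 0.2\n' > toy-phrase-table

# Byte-order sort, as processPhraseTable expects.
LC_ALL=C sort toy-phrase-table > toy-phrase-table.sorted

# 'Zug' (uppercase Z, byte 0x5A) now sorts before the lowercase 'z' entries.
head -1 toy-phrase-table.sorted
```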
The output files will be:
phrase-table.binphr.idx
phrase-table.binphr.srctree
phrase-table.binphr.srcvoc
phrase-table.binphr.tgtdata
phrase-table.binphr.tgtvoc
In the Moses configuration file, specify only the file name stem
phrase-table as phrase table and set the type to 1, i.e.:
[feature] PhraseDictionaryBinary path=phrase-table ...
This is on by default, so most of these arguments are not relevant.
There are two arguments to the decoder that enable it to print out the word alignment information:
-alignment-output-file -- print out the word alignment for the best translation to a file.
-print-alignment-info-in-n-best -- print the word alignment information of each entry in the n-best list as an extra column in the n-best file.
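For illustration, the extra alignment column can be cut out of an n-best line with standard tools. The line below is a hypothetical n-best entry (the feature names and field contents are made up for this sketch):

```shell
# Hypothetical n-best entry: id ||| translation ||| features ||| score ||| alignment
line='0 ||| das ist ein haus ||| LM0= -10.2 TM0= -3.4 ||| -13.6 ||| 0-0 1-1 2-2 3-3'

# Split on the " ||| " delimiter and print the alignment column.
printf '%s\n' "$line" | awk -F' \\|\\|\\| ' '{ print $5 }'
```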
Word alignment is included in the phrase-table by default (as of November 2012). To exclude it, add --NoWordAlignment as an argument to the score program.
When binarizing the phrase-table, the word alignment is also included by default. To turn this behaviour off for the phrase-based binarizer:
processPhraseTable -no-alignment-info ....
processPhraseTableMin -no-alignment-info ....
(For the compact phrase-table representation).
There is no way to exclude word alignment information from the chart-based binarization process.
Phrase-based binary format: when word alignment information is stored, the two output files ".srctree" and ".tgtdata" will end with the suffix ".wa".
Note: The arguments -use-alignment-info and -print-alignment-info have been deleted from the decoder; -print-alignment-info did nothing. -use-alignment-info is now inferred from the arguments -alignment-output-file and -print-alignment-info-in-n-best. The argument -include-alignment-in-n-best has been renamed to -print-alignment-info-in-n-best to reflect what it actually does.
The word alignment MUST be enabled during binarization; otherwise, the decoder will not be able to output the alignment information.
The reordering tables may also be converted into a binary format. The command is slightly simpler:
mosesdecoder/bin/processLexicalTable -in reordering-table -out reordering-table
The file names for input and output are typically the same, since the actual output file names have similar extensions to the phrase table file names.
Alert: This functionality has been removed as of May 2013.
This option forces Moses to start generating the translation from a non-empty hypothesis. This can be useful in situations when you have already translated part of the sentence and want a suggestion or an n-best list of continuations.
Use -continue-partial-translation (short form -cpt) to activate this feature. With -continue-partial-translation, Moses also accepts a special format of the input: three parameters delimited by the triple bar (|||). The first parameter is the string of output produced so far (used for LM scoring). The second parameter is the coverage vector of input words that are already translated by the output so far, written as a string of "1"s and "0"s of the same length as the number of words in the input sentence. The third parameter is the source sentence.
% echo "that is ||| 11000 ||| das ist ein kleines haus" | moses -f moses.ini -continue-partial-translation
that is a small house
% echo "that house ||| 10001 ||| das ist ein kleines haus" | moses -f moses.ini -continue-partial-translation
that house is a little
If the input does not fit this pattern, it is treated like normal input with no words translated yet.
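A small sanity check on this input format: the coverage vector must contain exactly one character per source word. The snippet below builds such an input line for the example sentence (the moses call itself is commented out, since it requires a trained model):

```shell
src="das ist ein kleines haus"   # source sentence (5 words)
prefix="that is"                 # output produced so far
coverage="11000"                 # first two source words already covered

# The coverage string must be as long as the source sentence.
nwords=$(printf '%s\n' "$src" | wc -w)
[ "${#coverage}" -eq "$nwords" ] || echo "coverage length mismatch" >&2

printf '%s ||| %s ||| %s\n' "$prefix" "$coverage" "$src"
# ... | moses -f moses.ini -continue-partial-translation
```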
This type of input is currently not compatible with factored models or confusion networks. The standard non-lexicalized distortion works more or less as one would expect (note that some input coverage vectors may prohibit translation under low distortion limits). The lexicalized reordering model has not been tested.
-continue-partial-translation (-cpt) -- activate the feature
NB - THIS HAS BEEN REMOVED FROM MOSES (HIEU)
In most cases, MT output improves significantly when more data is used to train the Language Model. More data however produces larger models, and it is very easy to produce a model which cannot be held in the main memory of a single machine. To overcome this, the Language Model can be distributed across many machines, allowing more data to be used at the cost of a performance overhead.
Support for Distributed Language Models in Moses is built on top of a bespoke distributed map implementation called DMap. DMap and support for Distributed Language Models are still in beta, and any feedback or bug reports are welcome.
Before compiling Moses with DMap support, you must configure your DMap setup (see below). Once that has been done, run Moses' configure script with your normal options plus --with-dmaplm=<path-to-dmap>, then build as usual.
Configuring DMap is, at the moment, a very crude process. One must edit the src/DMap/Config.cpp file by hand and recompile after making any changes. Since the configuration is compiled in, this also means that once it is changed, any programs statically linked to DMap will have to be recompiled too. The file src/DMap/Config.cpp provides a good example configuration which is self-explanatory.
In this example scenario, we have a Language Model trained on the giga4 corpus which we wish to host across 4 servers using DMap. The model is a 5-gram model containing roughly 210 million ngrams; the probabilities and backoff weights of ngrams will be uniformly quantised to 5-bit values. The Config.cpp for such a setup:
config->setShardDirectory("/home/user/dmap");
config->addTableConfig(new TableConfigLossyDoubleHash(
    "giga4",                    // name of table
    283845991,                  // number of cells (approx 1.23 * number of ngrams)
    64,                         // number of chunks (not too important, leave at 64)
    (((uint64_t)1 << 61) - 1),  // universal hashing P parameter
    5789372245 % (((uint64_t)1 << 61) - 1),  // universal hashing a parameter
    3987420741 % (((uint64_t)1 << 61) - 1),  // universal hashing b parameter
    "/home/user/dmap/giga4.bf", // path of the Bloom filter
    16,                         // num_error_bits (higher -> fewer collisions but more memory)
    10,                         // num_value_bits (higher -> more accurate probabilities
                                //                 and backoff weights but more memory)
    20));                       // num_hashes (higher ->
config->addStructConfig(new StructConfigLanguageModelBackoff(
    "giga4",  // struct name
    "giga4",  // lm table name
    5,        // lm order
    5,        // num logprob bits (these fields should add up to the number
              //                   of value bits for the table)
    5));      // num backoff bits
config->addServerConfig(new ServerConfig("server0.some.domain", 5000));
config->addServerConfig(new ServerConfig("server1.some.domain", 5000));
config->addServerConfig(new ServerConfig("server2.some.domain", 5000));
config->addServerConfig(new ServerConfig("server3.some.domain", 5000));
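As a quick check of the sizing comment above (number of cells ≈ 1.23 × number of ngrams), the arithmetic for 210 million ngrams can be sketched as follows; note that the example configuration uses a somewhat larger cell count than this estimate:

```shell
# Approximate hash-table cell count for 210M ngrams with the
# 1.23 oversizing factor quoted in the configuration comments.
awk 'BEGIN { printf "%d\n", 1.23 * 210000000 }'
```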
To shard the language model, run:

shard giga4 /home/user/dmap/giga4.arpa
A Bloom filter is a probabilistic data structure that encodes set membership in an extremely space-efficient manner. When queried for whether a given item is present in the set it encodes, it can produce an error with a calculable probability. This error is one-sided: the filter can produce false positives, but never false negatives. To avoid making slow network requests, DMap keeps a local Bloom filter containing the set of ngrams in the Language Model. Before making a network request to get the probability of an ngram, DMap first checks whether the ngram is present in the Bloom filter. If it is not, then we know for certain that the ngram is not present in the model and is therefore not worth issuing a network request for. However, if the ngram is present in the filter, it might actually be in the model, or the filter may have produced a false positive.
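The false-positive probability is calculable: for a filter of m bits holding n items with k hash functions it is approximately (1 - e^(-kn/m))^k. A sketch with illustrative numbers (not the actual giga4 filter parameters):

```shell
# Bloom filter false-positive rate for m = 10,000,000 bits,
# n = 1,000,000 items, k = 7 hash functions: roughly 0.8%.
awk 'BEGIN { m = 1e7; n = 1e6; k = 7; printf "%.4f\n", (1 - exp(-k * n / m)) ^ k }'
```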
To create a Bloom filter containing the ngrams of the Language Model, run this command:
ngrams < /home/user/dmap/giga4.arpa | mkbf 134217728 210000000 /home/user/dmap/giga4.bf
The name of the structure within DMap that Moses should use as the Language Model should be put into a file, in this case at /home/user/dmap/giga4.conf.
Note that if for testing or experimentation purposes you would like to have the whole model on the local machine instead of over the network, change the false to true. You must have sufficient memory to host the whole model, but decoding will be significantly faster.
To use this, put the following line in your Moses configuration file:
11 0 0 5 /home/user/dmap/giga4.conf
The Moses server is now able to load multiple translation systems within the same server, and the client can decide which translation system the server should use on a per-sentence basis. The client does this by passing a system argument in the translation request.
One possible use-case for this multiple models feature is if you want to build a server that translates both French and German into English, and uses a large English language model. Instead of running two copies of the Moses server, each with a copy of the English language model in memory, you can now run one Moses server instance, with the language model in memory, thus saving on RAM.
To use the multiple models feature, you need to make some changes to the standard Moses configuration file. A sample configuration file can be found here.
The first piece of extra configuration required for a multiple models setup is to specify the available systems, for example
[translation-systems]
de D 0 R 0 L 0
fr D 1 R 1 L 1
This specifies that there are two systems (de and fr), and that the first uses decode path 0, reordering model 0, and language model 0, whilst the second uses the models with id 1. The multiple decode paths are specified with a stanza like
[mapping]
0 T 0
1 T 1
which indicates that the 0th decode path uses the 0th translation model, and the 1st decode path uses the 1st translation model. Using a language model specification like
[lmodel-file]
0 0 5 /disk4/translation-server/models/interpolated-lm
0 0 5 /disk4/translation-server/models/interpolated-lm
means that the same language model can be used in two different systems with two different weights, but Moses will only load it once. The weights sections of the configuration file must have the correct numbers of weights for each of the models, and there must be a word penalty and linear distortion weight for each translation system. The lexicalised reordering weights (if any) must be specified in the
[weight-lr] stanza, with the distortion penalty in the [weight-d] stanza.