The language model should be trained on a corpus that is suitable to the domain. If the translation model is trained on a parallel corpus, then the language model should be trained on the output side of that corpus, although using additional training data is often beneficial.
Our decoder works with the following language models: SRILM, IRSTLM, RandLM, KenLM, and DALM.
To use these language models, they have to be compiled with the proper option:
KenLM is compiled by default. In the Moses configuration file, the type (SRI/IRST/RandLM/KenLM/DALM) of the LM is specified by the feature function class, e.g.:
[feature]
SRILM path=filename.srilm order=3 .....
IRSTLM path=filename.irstlm ...
RANDLM path=filename.irstlm ...
KENLM path=filename.arpa ...
DALM path=filename.dalm ...
The toolkits all come with programs that create a language model file, as required by our decoder. ARPA files are generally exchangeable, so you can estimate with one toolkit and query with a different one.
A language model can be created by calling:
ngram-count -text CORPUS_FILE -lm SRILM_FILE
The command also works on compressed (gz) input and output. A variety of switches can be used; we recommend
Moses can also use language models created with the IRSTLM toolkit (see Federico & Cettolo, (ACL WS-SMT, 2007)). The commands described in the following are supplied with the IRSTLM toolkit that has to be downloaded and compiled separately.
The IRSTLM toolkit handles LM formats that reduce both storage and decoding memory requirements and save time in LM loading. In particular, it provides tools for:
Training a language model from huge amounts of data can be very expensive in both memory and time. The IRSTLM toolkit features algorithms and data structures suitable to estimate, store, and access very large LMs. IRSTLM is open source and available for download.
Typically, LM estimation starts with the collection of n-grams and their frequency counters. Then, smoothing parameters are estimated for each n-gram level; infrequent n-grams are possibly pruned and, finally, an LM file is created containing n-grams with probabilities and back-off weights. This procedure can be very demanding in terms of memory and time if applied to huge corpora. IRSTLM provides a simple way to split LM training into smaller and independent steps, which can be distributed among independent processes.
The procedure relies on a training script that makes little use of computer memory and implements the Witten-Bell smoothing method. (An approximation of the modified Kneser-Ney smoothing method is also available.) First, create a special directory stat under your working directory, where the script will save lots of temporary files; then, simply run the script build-lm.sh as in the example:
build-lm.sh -i "gunzip -c corpus.gz" -n 3 -o train.irstlm.gz -k 10
The script builds a 3-gram LM (option -n) from the specified input command (-i), by splitting the training procedure into 10 steps (-k). The LM will be saved in the output (-o) file train.irstlm.gz with an intermediate ARPA format. This format can be properly managed through the compile-lm command in order to produce a compiled version or a standard ARPA version of the LM.
For a detailed description of the procedure and of other commands available under IRSTLM please refer to the user manual supplied with the package.
You can convert your language model file (created either with the SRILM ngram-count command or with the IRSTLM toolkit) into a compact binary format with the command:
compile-lm language-model.srilm language-model.blm
Moses compiled with the IRSTLM toolkit is able to properly handle that binary format; the setting of moses.ini for that file is:
IRSTLM order=3 factor=0 path=language-model.blm
The binary format allows LMs to be efficiently stored and loaded. The implementation privileges memory saving rather than access time.
Before compiling the language model, you can quantize (see Federico & Bertoldi, (ACL WS-SMT, 2006)) its probabilities and back-off weights with the command:
quantize-lm language-model.srilm language-model.qsrilm
The binary format for this file is then generated by the command:
compile-lm language-model.qsrilm language-model.qblm
The resulting language model requires less memory because all its probabilities and back-off weights are now stored in 1 byte instead of 4. No special setting of the configuration file is required: Moses compiled with the IRSTLM toolkit is able to read the necessary information from the header of the file.
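The storage saving can be illustrated with a small sketch. This is not IRSTLM's quantization code: the equal-population binning, the toy values, and the names build_codebook and encode are all assumptions made for illustration. It only shows how replacing each 4-byte float with a 1-byte index into a 256-entry codebook shrinks the stored values by a factor of four.

```python
import struct

def build_codebook(values, levels=256):
    """Split the sorted values into equal-population bins; each bin's mean is one code."""
    ordered = sorted(values)
    levels = min(levels, len(ordered))
    codebook = []
    for i in range(levels):
        lo = i * len(ordered) // levels
        hi = (i + 1) * len(ordered) // levels
        codebook.append(sum(ordered[lo:hi]) / (hi - lo))
    return codebook

def encode(value, codebook):
    """Index of the nearest codebook entry; fits in one byte when len(codebook) <= 256."""
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - value))

# Toy log-probabilities (hypothetical data, not taken from a real LM).
log_probs = [-0.1 * k for k in range(1, 1001)]
codebook = build_codebook(log_probs)

codes = bytes(encode(v, codebook) for v in log_probs)    # 1 byte per value
floats = struct.pack(f"{len(log_probs)}f", *log_probs)   # 4 bytes per value

# Worst-case difference between an original value and its decoded code.
max_error = max(abs(codebook[c] - v) for c, v in zip(codes, log_probs))
print(len(floats), len(codes), max_error)
```

The codebook itself has to be stored too, but its size is fixed and becomes negligible for a large LM.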
It is possible to avoid loading the LM into main memory by exploiting the memory mapping mechanism. Memory mapping permits the decoding process to directly access the (binary) LM file stored on the hard disk.
Warning: In case of parallel decoding in a cluster of computers, each process will access the same file. The possible large number of reading requests could overload the driver of the hard disk which the LM is stored on, and/or the network. One possible solution to such a problem is to store a copy of the LM on the local disk of each processing node, for example under the /tmp/ directory.
In order to activate the access through the memory mapping, simply add the suffix .mm to the name of the LM file (which must be stored in the binary format) and update the Moses configuration file accordingly.
As an example, let us suppose that the 3-gram LM has been built and stored in binary format in the file language-model.blm.
Rename it to add the .mm suffix:
mv language-model.blm language-model.blm.mm
or create a properly named symbolic link to the original file:
ln -s language-model.blm language-model.blm.mm
Now, the activation of the memory mapping mechanism is obtained simply by updating the Moses configuration file as follows:
IRSTLM order=3 factor=0 path=language-model.blm.mm
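The effect of memory mapping can be sketched independently of IRSTLM. The snippet below is only an illustration of the operating-system mechanism, using Python's standard mmap module on a small stand-in file; the file contents, the offsets, and the helper read_slice are hypothetical and have nothing to do with the real .blm layout.

```python
import mmap
import os
import tempfile

def read_slice(path, offset, length):
    """Read `length` bytes at `offset` through a memory map.

    Only the pages that are actually touched are brought in from disk;
    the file is never read into memory as a whole.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[offset:offset + length]

# Stand-in for a binary LM file (a real one would be far larger).
with tempfile.NamedTemporaryFile(delete=False, suffix=".blm.mm") as tmp:
    tmp.write(b"header" + bytes(1000) + b"ngram-record")
    demo_path = tmp.name

record = read_slice(demo_path, 1006, 12)
os.remove(demo_path)
print(record)
```

Because every access may go to disk, this trades decoding speed for memory, which is why the warning above about shared network storage matters.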
Typically, LMs employed by Moses provide the probability of n-grams of single factors. In addition to the standard way, the IRSTLM toolkit allows Moses to query the LMs in other ways.
In the following description, it is assumed that the target side of training texts contains words which are the concatenation of N>=1 fields separated by the character
Similarly to factored models, where the word is not anymore a simple token but a vector of factors that can represent different levels of annotation, here the word can be the concatenation of different tags for the surface form of a word, e.g.:
Specific LMs for each tag can be queried by Moses simply by adding a fourth parameter in the line of the configuration file devoted to the specification of the LM. The additional parameter is a file containing (at least) the following header:
Possibly, it can also include a one-to-one map which is applied to each component of n-grams before the LM query:
w1 class(w1) w2 class(w2) ... wM class(wM)
The value of <int> determines the processing applied to the n-gram components, which are supposed to be strings made of the fields described above. With a two-digit value, the fields corresponding to the two digits are selected and concatenated using _ as separator. For example, if <int>=21, the LM is queried with n-grams of strings field2_field1. If the map is given, it is applied to the field corresponding to the first digit.
The last case is useful for lexicalization of LMs: if the fields n. 2 and 1 correspond to the POS and lemma of the actual word respectively, the LM is queried with n-grams of POS_lemma.
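The two-digit case can be sketched as follows. This is an illustration of the behaviour described above, not code from IRSTLM or Moses: the function name select_fields is hypothetical, and the input separator "#" used in the example is an assumption, so it is passed as a parameter.

```python
def select_fields(token, code, sep, field_map=None):
    """Build the LM query string for a two-digit <int> value.

    token     -- a target word made of several fields joined by `sep`
    code      -- two-digit value, e.g. 21 selects field 2 then field 1
    field_map -- optional one-to-one map, applied to the field selected
                 by the first digit only
    """
    fields = token.split(sep)
    first, second = divmod(code, 10)
    head = fields[first]
    if field_map is not None:
        head = field_map.get(head, head)
    # The two selected fields are joined with "_" before the LM query.
    return head + "_" + fields[second]

# Assumed layout: field 0 = surface form, field 1 = lemma, field 2 = POS.
token = "houses#house#NNS"
plain = select_fields(token, 21, "#")
mapped = select_fields(token, 21, "#", field_map={"NNS": "NOUN"})
print(plain, mapped)
```

With the fields laid out as assumed here, the first call yields a POS_lemma string, which is the lexicalized form discussed above.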
A particular processing is performed whenever fields are supposed to correspond to microtags, i.e. the per-word projections of chunk labels. The processing aims at collapsing the sequence of microtags defining a chunk to the label of that chunk. The chunk LM is then queried with n-grams of chunk labels, in an asynchronous manner with respect to the sequence of words, as in general chunks consist of more words.
The collapsing operation is automatically activated if the sequence of microtags is:
(TAG TAG+ TAG+ ... TAG+ TAG)
TAG( TAG+ TAG+ ... TAG+ TAG)
Both those sequences are collapsed into a single chunk label (let us say CHNK) as long as (TAG, TAG(, TAG+ and TAG) are all mapped into the same label CHNK. Mapping into different labels, or a different use or position of the characters (, + and ) in the lexicon of tags, prevents the collapsing operation.
Currently (Aug 2008), lexicalized chunk LMs are still under investigation and only non-lexicalized chunk LMs are properly handled; therefore, the range of admitted <int> values for this kind of LMs is -1...9, with the above described meaning.
If you really want to build the largest LMs possible (for example, a 5-gram trained on one hundred billion words), then you should look at RandLM. This takes a very different approach to either the SRILM or the IRSTLM. It represents LMs using a randomized data structure (technically, variants of Bloom filters). This can result in LMs that are ten times smaller than those created using the SRILM (and also smaller than IRSTLM), but at the cost of making decoding about four times slower. RandLM is multithreaded now, so the speed reduction should be less of a problem.
Technical details of randomized language modelling can be found in an ACL paper (see Talbot and Osborne, (ACL 2007)).
RandLM is available at Sourceforge.
After extracting the tarball, go to the directory src and type
For integrating RandLM into Moses, please see above.
The buildlm binary (in randlm/bin) preprocesses and builds randomized language models.
The toolkit provides three ways of building a randomized language model:
The former type of model will be referred to as a CountRandLM while the second will be referred to as a BackoffRandLM. Models built from precomputed ngram-count pairs are also of type "CountRandLM". CountRandLMs use either StupidBackoff or else Witten-Bell smoothing. BackoffRandLM models can use any smoothing scheme that the SRILM implements. Generally, CountRandLMs are smaller than BackoffRandLMs, but use less sophisticated smoothing. When using billions of words of training material there is less of a need for good smoothing and so CountRandLMs become appropriate.
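To make the "less sophisticated smoothing" concrete, here is a minimal sketch of Stupid Backoff scoring over raw counts. It is not RandLM's implementation: the function name, the toy counts, and the back-off factor of 0.4 (the value usually quoted for this scheme) are assumptions for illustration. The scores are relative frequencies with a fixed penalty on backing off, so they are not normalized probabilities.

```python
def stupid_backoff(ngram, counts, total_tokens, alpha=0.4):
    """Score the last word of `ngram` given the preceding words.

    counts       -- dict mapping n-gram tuples to their corpus counts
    total_tokens -- corpus size, used for the unigram relative frequency
    alpha        -- fixed penalty applied each time the history is shortened
    """
    ngram = tuple(ngram)
    if len(ngram) == 1:
        return counts.get(ngram, 0) / total_tokens
    if counts.get(ngram, 0) > 0:
        # Seen n-gram: plain relative frequency, no discounting.
        return counts[ngram] / counts[ngram[:-1]]
    # Unseen n-gram: drop the oldest history word and pay the penalty.
    return alpha * stupid_backoff(ngram[1:], counts, total_tokens, alpha)

# Hypothetical counts from a six-token toy corpus.
counts = {
    ("the",): 2, ("cat",): 1, ("sat",): 1,
    ("the", "cat"): 1, ("cat", "sat"): 1,
    ("the", "cat", "sat"): 1,
}
seen = stupid_backoff(("the", "cat"), counts, total_tokens=6)
backed_off = stupid_backoff(("on", "the", "cat"), counts, total_tokens=6)
print(seen, backed_off)
```

Only counts are needed at query time, which is why this kind of smoothing pairs naturally with a count-based randomized model.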
The following parameters are important in all cases:
struct: The randomized data structure used to represent the language model (currently only
order: The order of the n-gram model e.g., 3 for a trigram model.
falsepos: The false positive rate of the randomized data structure on an inverse log scale, so -falsepos 8 produces a false positive rate of 1/2^8.
values: The quantization range used by the model. For a CountRandLM, quantisation is performed by taking a logarithm. The base of the logarithm is set as 2^(1/values). For a BackoffRandLM, a binning quantisation algorithm is used. The size of the codebook is set as 2^values. A reasonable setting in both cases is
input-path: The location of data to be used to create the language model.
input-type: The format of the input data. The following four formats are supported:
corpus: tokenised corpora, one sentence per line;
counts: n-gram counts file (one count and one n-gram per line);
arpa: an ARPA backoff language model;
backoff: language model file (two floats and one n-gram per line).
For an arpa file, the toolkit will create a 'backoff' file which may be reused (see examples below).
output-prefix: Prefix added to all output files during the construction of a randomized language model.
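The arithmetic behind the falsepos and values parameters can be worked through in a few lines. This is a sketch of the formulas only, not of RandLM's code; the function names are hypothetical, and rounding the quantized count down to an integer is an assumption.

```python
import math

def false_positive_rate(falsepos):
    """-falsepos n gives a false positive rate of 1/2^n."""
    return 2.0 ** -falsepos

def codebook_size(values):
    """BackoffRandLM: the binning codebook has 2^values entries."""
    return 2 ** values

def quantize_count(count, values):
    """CountRandLM: take the logarithm of the count in base 2^(1/values).

    log_b(c) with b = 2^(1/values) equals values * log2(c).
    Rounding down is an assumption made for this sketch.
    """
    return math.floor(values * math.log2(count))

print(false_positive_rate(8))   # rate for -falsepos 8
print(codebook_size(8))         # codebook entries for -values 8
print(quantize_count(4, 8))     # quantized code for a count of 4
```

A larger values setting gives a finer logarithmic grid (more codes per doubling of the count) at the cost of more bits per stored value.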
Example 1: The command
./buildlm -struct BloomMap -falsepos 8 -values 8 -output-prefix model -order 3 < corpus
would produce the following files:
model.BloomMap       <- the randomized language model
model.counts.sorted  <- n-gram counts file
model.stats          <- statistics file (counts of counts)
model.vcb            <- vocabulary file (not needed)
model.BloomMap: This randomized language model is ready to use on its own (see 'Querying a randomized language model' below).
model.counts.sorted: This is a file in the RandLM 'counts' format with one count followed by one n-gram per line. It can be specified as shown in Example 3 below to avoid recomputation when building multiple randomized language models from the same corpus.
model.stats: This statistics file contains counts of counts and can be specified via the optional parameter '-stats-path' as shown in Example 3 to avoid recomputation when building multiple randomized language models from the same data.
Example 2: The command
./buildlm -struct BloomMap -falsepos 8 -values 8 -output-prefix model -order 3 \
  -input-path precomputed.bo -input-type arpa
(where precomputed.bo contains an ARPA-formatted backoff model) would produce the following files:
model.BloomMap  <- the randomized language model
model.backoff   <- RandLM backoff file
model.stats     <- statistics file (counts of counts)
model.vcb       <- vocabulary file (not needed)
model.backoff is a RandLM formatted copy of the ARPA model. It can be reused in the same manner as the model.counts.sorted file (see Example 3).
Example 3: The command
./buildlm -struct BloomMap -falsepos 4 -values 8 -output-prefix model4 -order 3 -input-path model.counts.sorted -input-type counts -stats-path model.stats
would construct a new randomized language model (model4.BloomMap) from the same data as used in Example 1 but with a different error rate (here -falsepos 4). This usage avoids re-tokenizing the corpus and recomputing the statistics file.
At some point you will discover that you cannot build an LM using your data. RandLM natively uses a disk-based method for creating n-grams and counts, but this will be slow for large corpora. Instead you can create these ngram-count pairs using Hadoop (Map-Reduce). The RandLM release has Hadoop scripts which take raw text files and create ngram-counts. We have built randomised LMs this way using more than 110 billion tokens.
The procedure for using Hadoop is as follows:
hadoop dfs -put myFile data/
perl hadoop-lm-count.prl data data-counts 5 data-counting
perl hadoopRead.prl data-counts | gzip - > /unix/path/to/counts.gz
./buildlm -estimator batch -smoothing WittenBell -order 5 \
  -values 12 -struct LogFreqBloomFilter -tmp-dir /disk5/miles \
  -output-prefix giga3.rlm -output-dir /disk5/miles -falsepos 12 \
  -keep-tmp-files -sorted-by-ngram -input-type counts \
  -input-path /disk5/miles/counts.gz
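The idea behind the Map-Reduce counting step can be sketched on a single machine. The code below is not the Hadoop scripts shipped with RandLM; the function names are hypothetical, and the tab between count and n-gram is an assumption about the exact layout of the 'counts' format (the text above only states one count followed by one n-gram per line).

```python
from collections import Counter

def count_shard(lines, order):
    """Map step: count all n-grams up to `order` in one shard of text."""
    counts = Counter()
    for line in lines:
        tokens = line.split()
        for n in range(1, order + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return counts

def merge_counts(partials):
    """Reduce step: sum the per-shard counts for each n-gram."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

def to_counts_lines(counts):
    """One count followed by one n-gram per line, sorted by n-gram."""
    return [f"{count}\t{ngram}" for ngram, count in sorted(counts.items())]

# Two toy shards standing in for pieces of a large corpus.
shards = [["a b a b"], ["a b c"]]
merged = merge_counts(count_shard(shard, 2) for shard in shards)
lines = to_counts_lines(merged)
print("\n".join(lines))
```

Each shard can be counted independently, which is what lets the work be spread across many machines before the final merge.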
Moses uses its own interface to RandLM, but it may be interesting to query the language model directly. The querylm binary (in randlm/bin) allows a randomized language model to be queried. Unless otherwise specified, the scores provided by the tool will be conditional log probabilities (subject to randomisation errors).
The following parameters are available:-
randlm: The path of the randomized language model built using the buildlm tool as described above.
test-path: The location of test data to be scored by the model.
test-type: The format of the test data, currently corpus or ngrams. corpus will treat each line in the test file as a sentence and provide scores for all n-grams (adding sentence boundary markers); ngrams will score each line once, treating each as an independent n-gram.
get-counts: Return the counts of n-grams rather than conditional log probabilities (only supported by CountRandLM).
checks: Applies sequential checks to n-grams to avoid unnecessary false positives.
Example: The command
./querylm -randlm model.BloomMap -test-path testfile -test-type ngrams -order 3 > scores
would write out conditional log probabilities, one for each line in the file testfile.
KenLM is a language model that is simultaneously fast and low memory. The probabilities returned are the same as SRI, up to floating point rounding. It is maintained by Ken Heafield, who provides additional information on his website, such as benchmarks comparing speed and memory use against the other language model implementations. KenLM is distributed with Moses and compiled by default. KenLM is fully thread-safe for use with multi-threaded Moses.
The lmplz program estimates language models with Modified Kneser-Ney smoothing and no pruning. Pass the order (-o), an amount of memory to use for building (-S), and a location to place temporary files (-T). Note that -S is compatible with GNU sort so e.g. 1G = 1 gigabyte and 80% means 80% of physical RAM. It scales to much larger models than SRILM can handle and does not resort to approximation like IRSTLM does.
bin/lmplz -o 5 -S 80% -T /tmp <text >text.arpa
See the page on estimation for more.
To use lmplz in EMS set the following three parameters to your needs and copy the fourth one as is.
# path to lmplz binary
lmplz = $moses-bin-dir/lmplz
# order of the language model
order = 3
# additional parameters to lmplz (check lmplz help message)
settings = "-T $working-dir/tmp -S 10G"
# this tells EMS to use lmplz and tells EMS where lmplz is located
lm-training = "$moses-script-dir/generic/trainlm-lmplz.perl -lmplz $lmplz"
ARPA files can be read directly:
KENLM factor=<factor> order=<order> path=filename.arpa
but the binary format loads much faster and provides more flexibility. The <order> field is ignored. By contrast, SRI silently returns incorrect probabilities if you get it wrong (Kneser-Ney smoothed probabilities for lower-order n-grams are conditioned on backing off).
Using the binary format significantly reduces loading time. It also exposes more configuration options. The kenlm/build_binary program converts ARPA files to binary files:
kenlm/build_binary filename.arpa filename.binary
This will build a binary file that can be used in place of the ARPA file. Note that, unlike IRST, the file extension does not matter; the binary format is recognized using magic bytes. You can also specify the data structure to use:
kenlm/build_binary trie filename.arpa filename.binary
where valid values are probing, sorted, and trie. The default is probing. Generally, I recommend using probing if you have the memory and trie if you do not. See benchmarks for details. To determine the amount of RAM each data structure will take, provide only the arpa file:
kenlm/build_binary filename.arpa
Bear in mind that this includes only language model size, not the phrase table or decoder state.
Building the trie entails an on-disk sort. You can optimize this by setting the sorting memory with -S using the same options as GNU sort e.g. 100M, 1G, 80%. Final model building will still use the amount of memory needed to store the model. The -T option lets you customize where to place temporary files (the default is based on the output file name).
kenlm/build_binary -T /tmp/trie -S 1G trie filename.arpa filename.binary
KenLM supports lazy loading via mmap. This allows you to further reduce memory usage, especially with trie, which has good memory locality. This is specified by another argument to the KENLM feature function:
KENLM ... lazyken=<true/false>
I recommend fully loading if you have the RAM for it; it actually takes less time to load the full model and use it because the disk does not have to seek during decoding. Lazy loading works best with local disk and is not recommended for networked filesystems.
Probing is the fastest and default data structure. Unigram lookups happen by array index. Bigrams and longer n-grams are hashed to 64-bit integers which have very low probability of collision, even with the birthday attack. This 64-bit hash is the key to a probing hash table where values are probability and backoff.
A linear probing hash table is an array consisting of blanks (zeros) and entries with non-zero keys. Lookup proceeds by hashing the key modulo the array size, starting at this point in the array, and scanning forward until the entry or a blank is found. The ratio of array size to number of entries is controlled by the probing multiplier parameter p. This is a time-space tradeoff: space is linear in p and time is O(p/(p-1)). The value of p can be set at binary building time e.g.
kenlm/build_binary -p 1.2 probing filename.arpa filename.binary
sets a value of 1.2. The default value is 1.5 meaning that one third of the array is blanks.
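The lookup procedure just described can be sketched in a few lines. This is an illustration of linear probing in general, not KenLM's code: the class and function names are hypothetical, and blake2b is used only as a convenient stand-in for a 64-bit hash.

```python
import hashlib

def hash64(ngram):
    """64-bit hash of an n-gram string (stand-in hash; 0 is reserved for blanks)."""
    digest = hashlib.blake2b(ngram.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "little") or 1

class ProbingTable:
    def __init__(self, num_entries, multiplier=1.5):
        # Array size = entries * p; always keep at least one blank slot.
        self.size = max(int(num_entries * multiplier), num_entries + 1)
        self.keys = [0] * self.size        # 0 marks a blank
        self.values = [None] * self.size   # (probability, backoff) pairs

    def insert(self, key, value):
        i = key % self.size
        while self.keys[i] != 0 and self.keys[i] != key:
            i = (i + 1) % self.size        # scan forward, wrapping around
        self.keys[i], self.values[i] = key, value

    def lookup(self, key):
        i = key % self.size
        while self.keys[i] != 0:           # a blank ends the search
            if self.keys[i] == key:
                return self.values[i]
            i = (i + 1) % self.size
        return None

# Hypothetical n-grams with (log probability, backoff) values.
entries = {"the cat": (-0.5, -0.3), "cat sat": (-0.7, -0.2),
           "sat on": (-0.4, -0.1), "on the": (-0.2, -0.6)}
table = ProbingTable(len(entries), multiplier=1.5)
for ngram, value in entries.items():
    table.insert(hash64(ngram), value)

print(table.lookup(hash64("the cat")), table.lookup(hash64("never seen")))
```

With four entries and a multiplier of 1.5 the array has six slots, two of them blank, matching the one-third figure given for the default. A smaller multiplier saves space but makes the forward scans longer.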
The trie data structure uses less memory than all other options (except RandLM with stupid backoff), has the best memory locality, and is still faster than any other toolkit. However, it does take longer to build. It works in much the same way as SRI and IRST's inverted option. Like probing, unigram lookup is an array index. Records in the trie have a word index, probability, backoff, and pointer. All of the records for n-grams of the same order are stored consecutively in memory. An n-gram's pointer is actually the index into the (n+1)-gram array where the block of (n+1)-grams with one more word of history starts. The end of this block is found by reading the next entry's pointer. Records within the block are sorted by word index. Because the vocabulary ids are randomly permuted, a uniform key distribution applies. Interpolation search within each block finds the word index and its corresponding probability, backoff, and pointer. The trie is compacted by using the minimum number of bits to store each integer. Probability is always non-positive, so the sign bit is also removed.
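A two-level toy version shows how the pointers delimit a block and how interpolation search works inside it. The arrays and names below are hypothetical and unpacked into plain Python lists; the real structure is bit-packed and has more levels.

```python
def interpolation_search(keys, target, begin, end):
    """Find `target` in the sorted slice keys[begin:end]; return its index or -1.

    The probe position is estimated from the key values, which works well
    when keys are roughly uniformly distributed.
    """
    lo, hi = begin, end - 1
    while lo <= hi and keys[lo] <= target <= keys[hi]:
        if keys[hi] == keys[lo]:
            return lo if keys[lo] == target else -1
        pos = lo + (target - keys[lo]) * (hi - lo) // (keys[hi] - keys[lo])
        if keys[pos] == target:
            return pos
        if keys[pos] < target:
            lo = pos + 1
        else:
            hi = pos - 1
    return -1

# unigram_pointer[w] is where the block of bigrams starting with word w begins.
# The block ends at the next entry's pointer; the last value is a sentinel.
unigram_pointer = [0, 3, 3, 5]
bigram_word = [1, 4, 9, 0, 7]            # second word, sorted within each block
bigram_logprob = [-0.5, -1.2, -2.0, -0.3, -0.9]

def bigram_lookup(w1, w2):
    begin, end = unigram_pointer[w1], unigram_pointer[w1 + 1]
    i = interpolation_search(bigram_word, w2, begin, end)
    return None if i < 0 else bigram_logprob[i]

print(bigram_lookup(0, 4), bigram_lookup(1, 4), bigram_lookup(2, 7))
```

Word 1 has an empty block here (its pointer equals the next one), so any bigram starting with it is reported as missing without touching the bigram array.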
Since the trie stores many vocabulary ids and uses the minimum number of bits to do so, vocabulary filtering is highly effective for reducing overall model size even if fewer n-grams of higher order are removed.
The trie supports quantization to any number of bits from 1 to 25. To quantize to 8 bits, use -q 8. If you want to separately control probability and backoff quantization, use -q for probability and -b for backoff.
The trie pointers comprise a sorted array. These can be compressed using a technique from Raj and Whittaker by chopping off bits and storing offsets instead. The -a option acts as an upper bound on the number of bits to chop; it will never chop more bits than minimizes memory use. Since this is a time-space tradeoff (time is linear in the number of bits chopped), you can set the upper bound number of bits to chop using -a. To minimize memory, use -a 64. To save time, specify a lower limit e.g. -a 10.
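The bit-chopping idea can be sketched as follows. This is an illustration in the spirit of the technique described above, not KenLM's actual layout: the function names and the toy array are hypothetical. Each entry keeps only its low bits, and a small offsets table records, for every value of the chopped high bits, the index at which that value first appears.

```python
import bisect

def compress(sorted_values, chopped_bits, total_bits):
    """Chop `chopped_bits` high bits off each value of a sorted array."""
    low_bits = total_bits - chopped_bits
    mask = (1 << low_bits) - 1
    lows = [v & mask for v in sorted_values]
    # offsets[h] = first index whose value has high bits >= h.
    offsets = [bisect.bisect_left(sorted_values, h << low_bits)
               for h in range(1 << chopped_bits)]
    return lows, offsets, low_bits

def read(lows, offsets, low_bits, index):
    """Recover the original value stored at `index`."""
    # The high bits are the last h whose block starts at or before `index`.
    # The binary search over 2^chopped_bits offsets takes about
    # chopped_bits steps, hence time linear in the number of bits chopped.
    high = bisect.bisect_right(offsets, index) - 1
    return (high << low_bits) | lows[index]

pointers = [0, 3, 5, 9, 12, 14, 15]      # sorted 4-bit values
lows, offsets, low_bits = compress(pointers, chopped_bits=2, total_bits=4)
restored = [read(lows, offsets, low_bits, i) for i in range(len(pointers))]
print(lows, offsets, restored)
```

Every entry shrinks by the number of chopped bits, while the offsets table doubles with each additional bit, so there is a point beyond which chopping more bits no longer saves memory.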
The original strings are kept at the end of the binary file and passed to Moses at load time to obtain or generate Moses IDs. This is why lazy binary loading still takes a few seconds. KenLM stores a vector mapping from Moses ID to KenLM ID. The cost of this vector and Moses-side vocabulary word storage are not included in the memory use reported by build_binary. However, benchmarks report the entire cost of running Moses.