Building a Language Model

Language Models in Moses

The language model should be trained on a corpus that is suitable to the domain. If the translation model is trained on a parallel corpus, then the language model should be trained on the output side of that corpus, although using additional training data is often beneficial.

Our decoder works with the following language models: SRILM, IRSTLM, RandLM, KenLM, and DALM.

To use these language models, Moses has to be compiled with the corresponding option (an example build command is shown after the list):

  • --with-srilm=<root dir of the SRILM toolkit>
  • --with-irstlm=<root dir of the IRSTLM toolkit>
  • --with-randlm=<root dir of the RandLM toolkit>
  • --with-dalm=<root dir of the DALM toolkit>
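
For example, a Moses build with SRILM support enabled might look like the following (a sketch; the path is a placeholder for your local SRILM installation):

 cd /path/to/mosesdecoder
 ./bjam --with-srilm=/path/to/srilm -j8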

KenLM is compiled by default. In the Moses configuration file, the type of LM (SRI/IRST/RandLM/KenLM/DALM) is specified by the feature function class, e.g.

 [feature]
 SRILM path=filename.srilm order=3 .....

or

 IRSTLM path=filename.irstlm ...

or

 RANDLM path=filename.randlm ...

or

 KENLM path=filename.arpa ...

or

 DALM path=filename.dalm ...

The toolkits all come with programs that create a language model file, as required by our decoder. ARPA files are generally exchangeable, so you can estimate with one toolkit and query with a different one.

Building a LM with the SRILM Toolkit

A language model can be created by calling:

 ngram-count -text CORPUS_FILE -lm SRILM_FILE

The command also works on compressed (gz) input and output. A variety of switches can be used; we recommend -interpolate -kndiscount.
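
For instance, a trigram model with the recommended smoothing switches could be estimated as follows (a sketch; the file names are placeholders):

 ngram-count -order 3 -interpolate -kndiscount -text corpus.txt.gz -lm corpus.srilm.gz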

On the IRSTLM Toolkit

Moses can also use language models created with the IRSTLM toolkit (see Federico & Cettolo, ACL WS-SMT, 2007). The commands described below are supplied with the IRSTLM toolkit, which has to be downloaded and compiled separately.

The IRSTLM toolkit handles LM formats that reduce both storage and decoding memory requirements and save time in LM loading. In particular, it provides tools for building LMs from huge amounts of data, quantizing probabilities and back-off weights, compiling LMs into a compact binary format, and accessing LMs through memory mapping.

Compiling IRSTLM

Compiling IRSTLM requires:

   1. automake 1.9 or higher
   2. autoconf 2.59 or higher
   3. libtool 2.2.6 or higher

These are the commands I (Hieu) executed to download and install IRSTLM:

   wget -O irstlm-5.80.03.tgz "http://downloads.sourceforge.net/project/irstlm/irstlm/irstlm-5.80/irstlm-5.80.03.tgz?r=http%3A%2F%2Fsourceforge.net%2Fprojects%2Firstlm%2F&ts=1389616103&use_mirror=garr"
   tar zxvf irstlm-5.80.03.tgz 
   cd irstlm-5.80.03/
   ./regenerate-makefiles.sh 
   ./configure --prefix=$PWD
   make -j8
   make install

This installs the binaries and libraries into the current directory, under bin/ and lib/.

Note - as of January 2014, there are problems with the source code in the IRSTLM sourceforge repository. Please use the version 5.80.03 available for download on the website.
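
With IRSTLM installed under that prefix, Moses can then be compiled against it using the --with-irstlm option listed earlier (a sketch; adjust both paths to your own checkout and install location):

 cd /path/to/mosesdecoder
 ./bjam --with-irstlm=/path/to/irstlm-5.80.03 -j8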

Building Huge Language Models

Training a language model from huge amounts of data can be very expensive in both memory and time. The IRSTLM toolkit features algorithms and data structures suitable to estimate, store, and access very large LMs. IRSTLM is open source and can be downloaded from its website.

Typically, LM estimation starts with the collection of n-grams and their frequency counters. Then, smoothing parameters are estimated for each n-gram level; infrequent n-grams are possibly pruned and, finally, a LM file is created containing n-grams with probabilities and back-off weights. This procedure can be very demanding in terms of memory and time if applied to huge corpora. IRSTLM provides a simple way to split LM training into smaller and independent steps, which can be distributed among independent processes.

The procedure relies on a training script that makes little use of computer memory and implements the Witten-Bell smoothing method. (An approximation of the modified Kneser-Ney smoothing method is also available.) First, create a special directory stat under your working directory, where the script will save lots of temporary files; then, simply run the script build-lm.sh as in the example:

 build-lm.sh -i "gunzip -c corpus.gz" -n 3 -o train.irstlm.gz -k 10

The script builds a 3-gram LM (option -n) from the specified input command (-i), by splitting the training procedure into 10 steps (-k). The LM will be saved in the output (-o) file train.irstlm.gz with an intermediate ARPA format. This format can be properly managed through the compile-lm command in order to produce a compiled version or a standard ARPA version of the LM.
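
For instance, a sketch of both conversions, starting from the train.irstlm.gz produced above: the compiled (binary) version used by Moses is obtained with

 compile-lm train.irstlm.gz train.blm

while a standard ARPA version is obtained with the text-output option (whose exact syntax, e.g. --text yes or --text=yes, depends on the IRSTLM version):

 compile-lm --text=yes train.irstlm.gz train.arpa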

For a detailed description of the procedure and of other commands available under IRSTLM please refer to the user manual supplied with the package.

Binary Language Models

You can convert your language model file (created either with the SRILM ngram-count command or with the IRSTLM toolkit) into a compact binary format with the command:

  compile-lm language-model.srilm language-model.blm

Moses compiled with the IRSTLM toolkit is able to properly handle that binary format; the setting of moses.ini for that file is:

 IRSTLM order=3 factor=0 path=language-model.blm

The binary format allows LMs to be efficiently stored and loaded. The implementation favors memory savings over access time.

Quantized Language Models

Before compiling the language model, you can quantize (see Federico & Bertoldi, ACL WS-SMT, 2006) its probabilities and back-off weights with the command:

 quantize-lm language-model.srilm language-model.qsrilm

The binary format for this file is then generated by the command:

 compile-lm language-model.qsrilm language-model.qblm

The resulting language model requires less memory because all its probabilities and back-off weights are now stored in 1 byte instead of 4. No special setting of the configuration file is required: Moses compiled with the IRSTLM toolkit is able to read the necessary information from the header of the file.

Memory Mapping

It is possible to avoid loading the LM into main memory by exploiting the memory mapping mechanism. Memory mapping permits the decoding process to directly access the (binary) LM file stored on the hard disk.

Warning: In the case of parallel decoding on a cluster of computers, each process will access the same file. The possibly large number of read requests could overload the driver of the hard disk on which the LM is stored, and/or the network. One possible solution is to store a copy of the LM on the local disk of each processing node, for example under the /tmp/ directory (see the sketch at the end of this section).

In order to activate access through memory mapping, simply add the suffix .mm to the name of the LM file (which must be stored in binary format) and update the Moses configuration file accordingly.

As an example, let us suppose that a 3-gram LM has been built and stored in binary format in the file

 language-model.blm

Rename it to add the .mm suffix:

 mv language-model.blm  language-model.blm.mm

or create a properly named symbolic link to the original file:

 ln -s language-model.blm  language-model.blm.mm

Now, memory mapping is activated simply by updating the Moses configuration file as follows:

 IRSTLM order=3 factor=0 path=language-model.blm.mm
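
As noted in the warning above, on a cluster each node can work from a local copy of the memory-mapped LM. A minimal sketch (the shared path is a placeholder):

 cp /shared/path/language-model.blm.mm /tmp/

The path in the IRSTLM feature line is then pointed at /tmp/language-model.blm.mm on each node.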

Class Language Models and more

Typically, the LMs employed by Moses provide the probability of n-grams of single factors. In addition to this standard use, the IRSTLM toolkit allows Moses to query LMs in several other ways. In the following description, it is assumed that the target side of the training texts contains words which are concatenations of N>=1 fields separated by the character #. Similarly to factored models, where the word is no longer a simple token but a vector of factors that can represent different levels of annotation, here the word can be the concatenation of different tags for the surface form of a word, e.g.:

 word#lemma#part-of-speech#word-class

Specific LMs for each tag can be queried by Moses simply by adding a fourth parameter to the line of the configuration file devoted to the specification of the LM. The additional parameter is a file containing (at least) the following header:

 FIELD <int>

Optionally, it can also include a one-to-one map which is applied to each component of the n-grams before the LM query (an illustrative file is sketched at the end of this subsection):

 w1 class(w1)
 w2 class(w2)
 ...
 wM class(wM)

The value of <int> determines the processing applied to the n-gram components, which are supposed to be strings like field0#field1#...#fieldN:

  • -1: the strings are used as they are; if the map is given, it is applied to the whole string before the LM query
  • 0-9: the field number <int> is selected; if the map is given, it is applied to the selected field
  • 00-99: the two fields corresponding to the two digits are selected and concatenated together using the character _ as separator. For example, if <int>=21, the LM is queried with n-grams of strings field2_field1. If the map is given, it is applied to the field corresponding to the first digit.

The last case is useful for the lexicalization of LMs: if fields 2 and 1 correspond to the POS and lemma of the actual word respectively, the LM is queried with n-grams of POS_lemma.
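
As an illustration, a hypothetical parameter file that selects the part-of-speech field (field 2 in the example word layout above) and maps POS tags onto coarser classes might look like this (a sketch; the tag names are invented):

 FIELD 2
 NN NOUN
 NNS NOUN
 VBD VERB
 VBZ VERB
 ...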

Chunk Language Models

A particular kind of processing is performed whenever fields are supposed to correspond to microtags, i.e. the per-word projections of chunk labels. The processing aims at collapsing the sequence of microtags defining a chunk into the label of that chunk. The chunk LM is then queried with n-grams of chunk labels, asynchronously with respect to the sequence of words, as chunks generally consist of more than one word.

The collapsing operation is automatically activated if the sequence of microtags is:

 (TAG TAG+ TAG+ ... TAG+ TAG)

or

 TAG( TAG+ TAG+ ... TAG+ TAG)

Both of these sequences are collapsed into a single chunk label (say CHNK) as long as (TAG (or TAG(), TAG+ and TAG) are all mapped into the same label CHNK. Mapping them into different labels, or a different use/position of the characters (, + and ) in the tag lexicon, prevents the collapsing operation.
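
For illustration, a fragment of a map in which all NP and VP microtags collapse to their chunk labels might look like this (a sketch; the tag inventory is invented):

 (NP NP
 NP+ NP
 NP) NP
 (VP VP
 VP+ VP
 VP) VP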

Currently (Aug 2008), lexicalized chunk LMs are still under investigation and only non-lexicalized chunk LMs are properly handled; hence, the range of admitted <int> values for this kind of LM is -1...9, with the meaning described above.

RandLM

If you really want to build the largest LMs possible (for example, a 5-gram trained on one hundred billion words), then you should look at RandLM. It takes a very different approach from either SRILM or IRSTLM: it represents LMs using a randomized data structure (technically, variants of Bloom filters). This can result in LMs that are ten times smaller than those created using SRILM (and also smaller than IRSTLM), but at the cost of making decoding about four times slower. RandLM is now multithreaded, so the speed reduction should be less of a problem.

Technical details of randomized language modelling can be found in an ACL paper (see Talbot and Osborne, ACL 2007).

Installing RandLM

RandLM is available at Sourceforge.

After extracting the tar ball, go to the directory src and type make.

For integrating RandLM into Moses, please see above.

Building a randomized language model

The buildlm binary (in randlm/bin) preprocesses and builds randomized language models.

The toolkit provides three ways of building a randomized language model:

  1. from a tokenised corpus (this is useful for files around 100 million words or less)
  2. from a precomputed backoff language model in ARPA format (this is useful if you want to use a precomputed SRILM model)
  3. from a set of precomputed ngram-count pairs (this is useful if you need to build LMs from billions of words. RandLM has supporting Hadoop scripts).

The first type of model will be referred to as a CountRandLM, while the second will be referred to as a BackoffRandLM. Models built from precomputed ngram-count pairs are also of type CountRandLM. CountRandLMs use either StupidBackoff or Witten-Bell smoothing. BackoffRandLM models can use any smoothing scheme that SRILM implements. Generally, CountRandLMs are smaller than BackoffRandLMs, but use less sophisticated smoothing. When using billions of words of training material there is less need for good smoothing, so CountRandLMs become appropriate.

The following parameters are important in all cases:

  • struct: The randomized data structure used to represent the language model (currently only BloomMap and LogFreqBloomFilter).
  • order: The order of the n-gram model e.g., 3 for a trigram model.
  • falsepos: The false positive rate of the randomized data structure on an inverse log scale, so -falsepos 8 produces a false positive rate of 1/2^8.
  • values: The quantization range used by the model. For a CountRandLM quantisation is performed by taking a logarithm. The base of the logarithm is set as 2^(1/values). For a BackoffRandLM a binning quantisation algorithm is used. The size of the codebook is set as 2^values. A reasonable setting in both cases is -values 8.
  • input-path: The location of data to be used to create the language model.
  • input-type: The format of the input data. The following four formats are supported
    • for a CountRandLM:
      • corpus tokenised corpora one sentence per line;
      • counts n-gram counts file (one count and one n-gram per line);
    • Given a 'corpus' file the toolkit will create a 'counts' file which may be reused (see examples below).
    • for a BackoffRandLM:
      • arpa an ARPA backoff language model;
      • backoff language model file (two floats and one n-gram per line).
    • Given an arpa file the toolkit will create a 'backoff' file which may be reused (see examples below).
  • output-prefix: Prefix added to all output files during the construction of a randomized language model.

Example 1: Building directly from corpora

The command

 ./buildlm -struct BloomMap -falsepos 8 -values 8 -output-prefix model -order 3 < corpus

would produce the following files:-

 model.BloomMap         <- the randomized language model
 model.counts.sorted    <- n-gram counts file
 model.stats            <- statistics file (counts of counts)
 model.vcb              <- vocabulary file (not needed)

model.BloomMap: This randomized language model is ready to use on its own (see 'Querying a randomized language model' below).

model.counts.sorted: This is a file in the RandLM 'counts' format with one count followed by one n-gram per line. It can be specified as shown in Example 3 below to avoid recomputation when building multiple randomized language models from the same corpus.

model.stats: This statistics file contains counts of counts and can be specified via the optional parameter '-statspath' as shown in Example 3 to avoid recomputation when building multiple randomized language models from the same data.

Example 2: Building from an ARPA file (from another toolkit)

The command

 ./buildlm -struct BloomMap -falsepos 8 -values 8 -output-prefix model -order 3 \
   -input-path precomputed.bo -input-type arpa

(where precomputed.bo contains an ARPA-formatted backoff model) would produce the following files:

 model.BloomMap	<- the randomized language model
 model.backoff  <- RandLM backoff file
 model.stats    <- statistics file (counts of counts)
 model.vcb      <- vocabulary file (not needed)

model.backoff is a RandLM formatted copy of the ARPA model. It can be reused in the same manner as the model.counts.sorted file (see Example 3).

Example 3: Building a second randomized language model from the same data

The command

 ./buildlm -struct BloomMap -falsepos 4 -values 8 -output-prefix model4 -order 3 \
   -input-path model.counts.sorted -input-type counts -stats-path model.stats

would construct a new randomized language model (model4.BloomMap) from the same data as used in Example 1 but with a different error rate (here -falsepos 4). This usage avoids re-tokenizing the corpus and recomputing the statistics file.

Building Randomised LMs from 100 Billion Words using Hadoop

At some point you will discover that you cannot build an LM using your data. RandLM natively uses a disk-based method for creating n-grams and counts, but this will be slow for large corpora. Instead, you can create these ngram-count pairs using Hadoop (MapReduce). The RandLM release includes Hadoop scripts which take raw text files and create ngram-counts. We have built randomised LMs this way using more than 110 billion tokens.

The procedure for using Hadoop is as follows:

  • You first load raw and possibly tokenised text files onto the Hadoop Distributed File System (DFS). This will probably involve commands such as:
 hadoop dfs -put myFile data/
  • You then create ngram-counts using Hadoop (here a 5-gram):
 perl hadoop-lm-count.prl data data-counts 5 data-counting
  • You then upload the counts to the Unix filesystem:
 perl hadoopRead.prl data-counts | gzip - > /unix/path/to/counts.gz
  • These counts can then be passed to RandLM:
 ./buildlm -estimator batch  -smoothing WittenBell  -order 5 \
 -values 12 -struct LogFreqBloomFilter -tmp-dir /disk5/miles \
 -output-prefix giga3.rlm -output-dir /disk5/miles -falsepos 12 \
 -keep-tmp-files -sorted-by-ngram -input-type counts \
 -input-path /disk5/miles/counts.gz

Querying Randomised Language Models

Moses uses its own interface to RandLM, but it may be interesting to query the language model directly. The querylm binary (in randlm/bin) allows a randomized language model to be queried. Unless otherwise specified, the scores provided by the tool will be conditional log probabilities (subject to randomisation errors).

The following parameters are available:-

  • randlm: The path of the randomized language model built using the buildlm tool as described above.
  • test-path: The location of test data to be scored by the model.
  • test-type: The format of the test data: currently corpus and ngrams are supported. corpus will treat each line in the test file as a sentence and provide scores for all n-grams (adding <s> and </s>). ngrams will score each line once treating each as an independent n-gram.
  • get-counts: Return the counts of n-grams rather than conditional log probabilities (only supported by CountRandLM).
  • checks: Applies sequential checks to n-grams to avoid unnecessary false positives.

Example: The command

 ./querylm -randlm model.BloomMap -test-path testfile -test-type ngrams -order 3 > scores

would write out conditional log probabilities, one for each line in the file testfile.
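
To score whole sentences instead (adding <s> and </s>), the test type can be switched to corpus (a sketch based on the parameters described above):

 ./querylm -randlm model.BloomMap -test-path testfile -test-type corpus -order 3 > scores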

KenLM

KenLM is a language model that is simultaneously fast and low memory. The probabilities returned are the same as SRI, up to floating point rounding. It is maintained by Ken Heafield, who provides additional information on his website, such as benchmarks comparing speed and memory use against the other language model implementations. KenLM is distributed with Moses and compiled by default. KenLM is fully thread-safe for use with multi-threaded Moses.

Estimation

The lmplz program estimates language models with Modified Kneser-Ney smoothing and no pruning. Pass the order (-o), an amount of memory to use for building (-S), and a location to place temporary files (-T). Note that -S accepts the same size syntax as GNU sort, so e.g. 1G means 1 gigabyte and 80% means 80% of physical RAM. lmplz scales to much larger models than SRILM can handle and does not resort to approximation as IRSTLM does.

 bin/lmplz -o 5 -S 80% -T /tmp <text >text.arpa

See the page on estimation for more.

Using the EMS

To use lmplz in EMS, set the first three parameters below to your needs and copy the fourth one as is.

 # path to lmplz binary
 lmplz = $moses-bin-dir/lmplz
 # order of the language model
 order = 3
 # additional parameters to lmplz (check lmplz help message)
 settings = "-T $working-dir/tmp -S 10G"
 # this tells EMS to use lmplz and tells EMS where lmplz is located
 lm-training = "$moses-script-dir/generic/trainlm-lmplz.perl -lmplz $lmplz"

Querying

ARPA files can be read directly:

 KENLM factor=<factor> order=<order> path=filename.arpa

but the binary format loads much faster and provides more flexibility. The <order> field is ignored (KenLM uses the model's actual order). By contrast, SRI silently returns incorrect probabilities if you get the order wrong (Kneser-Ney smoothed probabilities for lower-order n-grams are conditioned on backing off).

Binary file

Using the binary format significantly reduces loading time. It also exposes more configuration options. The kenlm/build_binary program converts ARPA files to binary files:

 kenlm/build_binary filename.arpa filename.binary

This will build a binary file that can be used in place of the ARPA file. Note that, unlike IRST, the file extension does not matter; the binary format is recognized using magic bytes. You can also specify the data structure to use:

 kenlm/build_binary trie filename.arpa filename.binary

where valid values are probing, sorted, and trie. The default is probing. Generally, I recommend using probing if you have the memory and trie if you do not. See benchmarks for details. To determine the amount of RAM each data structure will take, provide only the arpa file:

 kenlm/build_binary filename.arpa

Bear in mind that this includes only language model size, not the phrase table or decoder state.

Building the trie entails an on-disk sort. You can optimize this by setting the sorting memory with -S using the same options as GNU sort e.g. 100M, 1G, 80%. Final model building will still use the amount of memory needed to store the model. The -T option lets you customize where to place temporary files (the default is based on the output file name).

 kenlm/build_binary -T /tmp/trie -S 1G trie filename.arpa filename.binary

Full or lazy loading

KenLM supports lazy loading via mmap. This allows you to further reduce memory usage, especially with trie, which has good memory locality. Lazy loading is specified by an additional argument to the KENLM feature function:

   KENLM ... lazyken=<true/false>

I recommend fully loading if you have the RAM for it; it actually takes less time to load the full model and use it because the disk does not have to seek during decoding. Lazy loading works best with local disk and is not recommended for networked filesystems.
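
Putting the pieces together, a KENLM feature line that reads a binary file with lazy loading enabled might look like this (a sketch; the factor and order values are placeholders):

 KENLM factor=0 order=5 path=filename.binary lazyken=true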

Probing

Probing is the fastest and default data structure. Unigram lookups happen by array index. Bigrams and longer n-grams are hashed to 64-bit integers which have very low probability of collision, even with the birthday attack. This 64-bit hash is the key to a probing hash table where values are probability and backoff.

A linear probing hash table is an array consisting of blanks (zeros) and entries with non-zero keys. Lookup proceeds by hashing the key modulo the array size, starting at this point in the array, and scanning forward until the entry or a blank is found. The ratio of array size to number of entries is controlled by the probing multiplier parameter p. This is a time-space tradeoff: space is linear in p and time is O(p/(p-1)). The value of p can be set at binary building time e.g.

 kenlm/build_binary -p 1.2 probing filename.arpa filename.binary

sets a value of 1.2. The default value is 1.5, meaning that one third of the array consists of blanks.

Trie

The trie data structure uses less memory than all other options (except RandLM with stupid backoff), has the best memory locality, and is still faster than any other toolkit. However, it does take longer to build. It works in much the same way as SRI and IRST's inverted option. Like probing, unigram lookup is an array index. Records in the trie have a word index, probability, backoff, and pointer. All of the records for n-grams of the same order are stored consecutively in memory. An n-gram's pointer is actually the index into the (n+1)-gram array where the block of (n+1)-grams with one more word of history starts. The end of this block is found by reading the next entry's pointer. Records within the block are sorted by word index. Because the vocabulary ids are randomly permuted, a uniform key distribution applies. Interpolation search within each block finds the word index and its corresponding probability, backoff, and pointer. The trie is compacted by using the minimum number of bits to store each integer. Probability is always non-positive, so the sign bit is also removed.

Since the trie stores many vocabulary ids and uses the minimum number of bits to do so, vocabulary filtering is highly effective at reducing overall model size, even if fewer higher-order n-grams are removed.

Quantization

The trie supports quantization to any number of bits from 1 to 25. To quantize to 8 bits, use -q 8. If you want to separately control probability and backoff quantization, use -q for probability and -b for backoff.
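
For example, a trie model quantized to 8 bits for both probabilities and backoffs could be built as follows (a sketch based on the flags just described):

 kenlm/build_binary -q 8 -b 8 trie filename.arpa filename.binary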

Array compression (also known as Chop)

The trie pointers comprise a sorted array. These can be compressed using a technique from Raj and Whittaker by chopping off bits and storing offsets instead. Since this is a time-space tradeoff (time is linear in the number of bits chopped), the -a option sets an upper bound on the number of bits to chop; it will never chop more bits than minimizes memory use. To minimize memory, use -a 64. To save time, specify a lower limit, e.g. -a 10.

Vocabulary lookup

The original strings are kept at the end of the binary file and passed to Moses at load time to obtain or generate Moses IDs. This is why lazy binary loading still takes a few seconds. KenLM stores a vector mapping from Moses ID to KenLM ID. The cost of this vector and of the Moses-side vocabulary storage is not included in the memory use reported by build_binary. However, benchmarks report the entire cost of running Moses.
