Building a Language Model

Language Models in Moses

The language model should be trained on a corpus that is suitable to the domain. If the translation model is trained on a parallel corpus, then the language model should be trained on the output side of that corpus, although using additional training data is often beneficial.

Our decoder works with the following language models: SRILM, IRSTLM, RandLM, KenLM, and DALM.

To use these language models, Moses has to be compiled with the corresponding option (an example build command is shown after the list):

  • --with-srilm=<root dir of the SRILM toolkit>
  • --with-irstlm=<root dir of the IRSTLM toolkit>
  • --with-randlm=<root dir of the RandLM toolkit>
  • --with-dalm=<root dir of the DALM toolkit>
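
For example, a Moses build with SRILM support enabled might look like the following (a sketch; the path is a placeholder for your local SRILM installation):

 cd /path/to/mosesdecoder
 ./bjam --with-srilm=/path/to/srilm -j8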

KenLM is compiled by default. In the Moses configuration file, the type of LM (SRI/IRST/RandLM/KenLM/DALM) is specified by the feature function class, e.g.

 [feature]
 SRILM path=filename.srilm order=3 .....

or

 IRSTLM path=filename.irstlm ...

or

 RANDLM path=filename.randlm ...

or

 KENLM path=filename.arpa ...

or

 DALM path=filename.dalm ...

The toolkits all come with programs that create a language model file, as required by our decoder. ARPA files are generally exchangeable, so you can estimate with one toolkit and query with a different one.

Building a LM with the SRILM Toolkit

A language model can be created by calling:

 ngram-count -text CORPUS_FILE -lm SRILM_FILE

The command also works on compressed (gz) input and output. A variety of switches can be used; we recommend -interpolate -kndiscount.
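
For instance, a trigram model with the recommended smoothing switches could be estimated as follows (a sketch; the file names are placeholders):

 ngram-count -order 3 -interpolate -kndiscount -text corpus.txt.gz -lm corpus.srilm.gz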

On the IRSTLM Toolkit

Moses can also use language models created with the IRSTLM toolkit (see Federico & Cettolo, ACL WS-SMT, 2007). The commands described below are supplied with the IRSTLM toolkit, which has to be downloaded and compiled separately.

The IRSTLM toolkit handles LM formats that reduce both storage and decoding memory requirements and save time in LM loading. In particular, it provides tools for building LMs from huge amounts of data, quantizing probabilities and back-off weights, compiling LMs into a compact binary format, and accessing LMs through memory mapping.

Compiling IRSTLM

Compiling IRSTLM requires:

   1. automake 1.9 or higher
   2. autoconf 2.59 or higher
   3. libtool 2.2.6 or higher

These are the commands I (Hieu) executed to download and install IRSTLM:

   wget -O irstlm-5.80.03.tgz "http://downloads.sourceforge.net/project/irstlm/irstlm/irstlm-5.80/irstlm-5.80.03.tgz?r=http%3A%2F%2Fsourceforge.net%2Fprojects%2Firstlm%2F&ts=1389616103&use_mirror=garr"
   tar zxvf irstlm-5.80.03.tgz 
   cd irstlm-5.80.03/
   ./regenerate-makefiles.sh 
   ./configure --prefix=$PWD
   make -j8
   make install

This installs the binaries and libraries into the current directory, under bin/ and lib/.

Note - as of January 2014, there are problems with the source code in the IRSTLM sourceforge repository. Please use the version 5.80.03 available for download on the website.
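
With IRSTLM installed under that prefix, Moses can then be compiled against it using the --with-irstlm option listed earlier (a sketch; adjust both paths to your own checkout and install location):

 cd /path/to/mosesdecoder
 ./bjam --with-irstlm=/path/to/irstlm-5.80.03 -j8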

Building Huge Language Models

Training a language model from huge amounts of data can be very expensive in both memory and time. The IRSTLM toolkit features algorithms and data structures suitable to estimate, store, and access very large LMs. IRSTLM is open source and can be downloaded from its website.

Typically, LM estimation starts with the collection of n-grams and their frequency counters. Then, smoothing parameters are estimated for each n-gram level; infrequent n-grams are possibly pruned and, finally, a LM file is created containing n-grams with probabilities and back-off weights. This procedure can be very demanding in terms of memory and time if applied to huge corpora. IRSTLM provides a simple way to split LM training into smaller and independent steps, which can be distributed among independent processes.

The procedure relies on a training script that makes little use of computer memory and implements the Witten-Bell smoothing method. (An approximation of the modified Kneser-Ney smoothing method is also available.) First, create a special directory stat under your working directory, where the script will save lots of temporary files; then, simply run the script build-lm.sh as in the example:

 build-lm.sh -i "gunzip -c corpus.gz" -n 3 -o train.irstlm.gz -k 10

The script builds a 3-gram LM (option -n) from the specified input command (-i), by splitting the training procedure into 10 steps (-k). The LM will be saved in the output (-o) file train.irstlm.gz with an intermediate ARPA format. This format can be properly managed through the compile-lm command in order to produce a compiled version or a standard ARPA version of the LM.
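
For instance, a sketch of both conversions, starting from the train.irstlm.gz produced above: the compiled (binary) version used by Moses is obtained with

 compile-lm train.irstlm.gz train.blm

while a standard ARPA version is obtained with the text-output option (whose exact syntax, e.g. --text yes or --text=yes, depends on the IRSTLM version):

 compile-lm --text=yes train.irstlm.gz train.arpa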

For a detailed description of the procedure and of other commands available under IRSTLM please refer to the user manual supplied with the package.

Binary Language Models

You can convert your language model file (created either with the SRILM ngram-count command or with the IRSTLM toolkit) into a compact binary format with the command:

  compile-lm language-model.srilm language-model.blm

Moses compiled with the IRSTLM toolkit is able to properly handle that binary format; the setting of moses.ini for that file is:

 IRSTLM order=3 factor=0 path=language-model.blm

The binary format allows LMs to be efficiently stored and loaded. The implementation favors memory savings over access time.

Quantized Language Models

Before compiling the language model, you can quantize (see Federico & Bertoldi, ACL WS-SMT, 2006) its probabilities and back-off weights with the command:

 quantize-lm language-model.srilm language-model.qsrilm

The binary format for this file is then generated by the command:

 compile-lm language-model.qsrilm language-model.qblm

The resulting language model requires less memory because all its probabilities and back-off weights are now stored in 1 byte instead of 4. No special setting of the configuration file is required: Moses compiled with the IRSTLM toolkit is able to read the necessary information from the header of the file.

Memory Mapping

It is possible to avoid loading the LM into main memory by exploiting the memory mapping mechanism. Memory mapping permits the decoding process to directly access the (binary) LM file stored on the hard disk.

Warning: In the case of parallel decoding on a cluster of computers, each process will access the same file. The possibly large number of read requests could overload the driver of the hard disk on which the LM is stored, and/or the network. One possible solution is to store a copy of the LM on the local disk of each processing node, for example under the /tmp/ directory (see the sketch at the end of this section).

In order to activate access through memory mapping, simply add the suffix .mm to the name of the LM file (which must be stored in binary format) and update the Moses configuration file accordingly.

As an example, let us suppose that a 3-gram LM has been built and stored in binary format in the file

 language-model.blm

Rename it to add the .mm suffix:

 mv language-model.blm  language-model.blm.mm

or create a properly named symbolic link to the original file:

 ln -s language-model.blm  language-model.blm.mm

Now, memory mapping is activated simply by updating the Moses configuration file as follows:

 IRSTLM order=3 factor=0 path=language-model.blm.mm
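
As noted in the warning above, on a cluster each node can work from a local copy of the memory-mapped LM. A minimal sketch (the shared path is a placeholder):

 cp /shared/path/language-model.blm.mm /tmp/

The path in the IRSTLM feature line is then pointed at /tmp/language-model.blm.mm on each node.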

Class Language Models and more

Typically, the LMs employed by Moses provide the probability of n-grams of single factors. In addition to this standard use, the IRSTLM toolkit allows Moses to query LMs in several other ways. In the following description, it is assumed that the target side of the training texts contains words which are concatenations of N>=1 fields separated by the character #. Similarly to factored models, where the word is no longer a simple token but a vector of factors that can represent different levels of annotation, here the word can be the concatenation of different tags for the surface form of a word, e.g.:

 word#lemma#part-of-speech#word-class

Specific LMs for each tag can be queried by Moses simply by adding a fourth parameter to the line of the configuration file devoted to the specification of the LM. The additional parameter is a file containing (at least) the following header:

 FIELD <int>

Optionally, it can also include a one-to-one map which is applied to each component of the n-grams before the LM query (an illustrative file is sketched at the end of this subsection):

 w1 class(w1)
 w2 class(w2)
 ...
 wM class(wM)

The value of <int> determines the processing applied to the n-gram components, which are supposed to be strings like field0#field1#...#fieldN:

  • -1: the strings are used as they are; if the map is given, it is applied to the whole string before the LM query
  • 0-9: the field number <int> is selected; if the map is given, it is applied to the selected field
  • 00-99: the two fields corresponding to the two digits are selected and concatenated together using the character _ as separator. For example, if <int>=21, the LM is queried with n-grams of strings field2_field1. If the map is given, it is applied to the field corresponding to the first digit.

The last case is useful for the lexicalization of LMs: if fields 2 and 1 correspond to the POS and lemma of the actual word respectively, the LM is queried with n-grams of POS_lemma.
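
As an illustration, a hypothetical parameter file that selects the part-of-speech field (field 2 in the example word layout above) and maps POS tags onto coarser classes might look like this (a sketch; the tag names are invented):

 FIELD 2
 NN NOUN
 NNS NOUN
 VBD VERB
 VBZ VERB
 ...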

Chunk Language Models

A particular kind of processing is performed whenever fields are supposed to correspond to microtags, i.e. the per-word projections of chunk labels. The processing aims at collapsing the sequence of microtags defining a chunk into the label of that chunk. The chunk LM is then queried with n-grams of chunk labels, asynchronously with respect to the sequence of words, as chunks generally consist of more than one word.

The collapsing operation is automatically activated if the sequence of microtags is:

 (TAG TAG+ TAG+ ... TAG+ TAG)

or

 TAG( TAG+ TAG+ ... TAG+ TAG)

Both of these sequences are collapsed into a single chunk label (say CHNK) as long as (TAG (or TAG(), TAG+ and TAG) are all mapped into the same label CHNK. Mapping them into different labels, or a different use/position of the characters (, + and ) in the tag lexicon, prevents the collapsing operation.
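
For illustration, a fragment of a map in which all NP and VP microtags collapse to their chunk labels might look like this (a sketch; the tag inventory is invented):

 (NP NP
 NP+ NP
 NP) NP
 (VP VP
 VP+ VP
 VP) VP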

Currently (Aug 2008), lexicalized chunk LMs are still under investigation and only non-lexicalized chunk LMs are properly handled; hence, the range of admitted <int> values for this kind of LM is -1...9, with the meaning described above.

RandLM

If you really want to build the largest LMs possible (for example, a 5-gram trained on one hundred billion words), then you should look at RandLM. It takes a very different approach from either SRILM or IRSTLM: it represents LMs using a randomized data structure (technically, variants of Bloom filters). This can result in LMs that are ten times smaller than those created using SRILM (and also smaller than IRSTLM), but at the cost of making decoding about four times slower. RandLM is now multithreaded, so the speed reduction should be less of a problem.

Technical details of randomized language modelling can be found in an ACL paper (see Talbot and Osborne, ACL 2007).

Installing RandLM

RandLM is available at Sourceforge.

After extracting the tar ball, go to the directory src and type make.

For integrating RandLM into Moses, please see above.

Building a randomized language model

The buildlm binary (in randlm/bin) preprocesses and builds randomized language models.

The toolkit provides three ways of building a randomized language model:

  1. from a tokenised corpus (this is useful for files around 100 million words or less)
  2. from a precomputed backoff language model in ARPA format (this is useful if you want to use a precomputed SRILM model)
  3. from a set of precomputed ngram-count pairs (this is useful if you need to build LMs from billions of words. RandLM has supporting Hadoop scripts).

The first type of model will be referred to as a CountRandLM, while the second will be referred to as a BackoffRandLM. Models built from precomputed ngram-count pairs are also of type CountRandLM. CountRandLMs use either StupidBackoff or Witten-Bell smoothing. BackoffRandLM models can use any smoothing scheme that SRILM implements. Generally, CountRandLMs are smaller than BackoffRandLMs, but use less sophisticated smoothing. When using billions of words of training material there is less need for good smoothing, so CountRandLMs become appropriate.

The following parameters are important in all cases:

  • struct: The randomized data structure used to represent the language model (currently only BloomMap and LogFreqBloomFilter).
  • order: The order of the n-gram model e.g., 3 for a trigram model.
  • falsepos: The false positive rate of the randomized data structure on an inverse log scale, so -falsepos 8 produces a false positive rate of 1/2^8.
  • values: The quantization range used by the model. For a CountRandLM quantisation is performed by taking a logarithm. The base of the logarithm is set as 2^(1/values). For a BackoffRandLM a binning quantisation algorithm is used. The size of the codebook is set as 2^values. A reasonable setting in both cases is -values 8.
  • input-path: The location of data to be used to create the language model.
  • input-type: The format of the input data. The following four formats are supported
    • for a CountRandLM:
      • corpus tokenised corpora one sentence per line;
      • counts n-gram counts file (one count and one n-gram per line);
    • Given a 'corpus' file the toolkit will create a 'counts' file which may be reused (see examples below).
    • for a BackoffRandLM:
      • arpa an ARPA backoff language model;
      • backoff language model file (two floats and one n-gram per line).
    • Given an arpa file the toolkit will create a 'backoff' file which may be reused (see examples below).
  • output-prefix: Prefix added to all output files during the construction of a randomized language model.

Example 1: Building directly from corpora

The command

 ./buildlm -struct BloomMap -falsepos 8 -values 8 -output-prefix model -order 3 < corpus

would produce the following files:-

 model.BloomMap         <- the randomized language model
 model.counts.sorted    <- n-gram counts file
 model.stats            <- statistics file (counts of counts)
 model.vcb              <- vocabulary file (not needed)

model.BloomMap: This randomized language model is ready to use on its own (see 'Querying a randomized language model' below).

model.counts.sorted: This is a file in the RandLM 'counts' format with one count followed by one n-gram per line. It can be specified as shown in Example 3 below to avoid recomputation when building multiple randomized language models from the same corpus.

model.stats: This statistics file contains counts of counts and can be specified via the optional parameter '-statspath' as shown in Example 3 to avoid recomputation when building multiple randomized language models from the same data.

Example 2: Building from an ARPA file (from another toolkit)

The command

 ./buildlm -struct BloomMap -falsepos 8 -values 8 -output-prefix model -order 3 \
   -input-path precomputed.bo -input-type arpa

(where precomputed.bo contains an ARPA-formatted backoff model) would produce the following files:

 model.BloomMap	<- the randomized language model
 model.backoff  <- RandLM backoff file
 model.stats    <- statistics file (counts of counts)
 model.vcb      <- vocabulary file (not needed)

model.backoff is a RandLM formatted copy of the ARPA model. It can be reused in the same manner as the model.counts.sorted file (see Example 3).

Example 3: Building a second randomized language model from the same data

The command

 ./buildlm -struct BloomMap -falsepos 4 -values 8 -output-prefix model4 -order 3 \
   -input-path model.counts.sorted -input-type counts -stats-path model.stats

would construct a new randomized language model (model4.BloomMap) from the same data as used in Example 1 but with a different error rate (here -falsepos 4). This usage avoids re-tokenizing the corpus and recomputing the statistics file.

Building Randomised LMs from 100 Billion Words using Hadoop

At some point you will discover that you cannot build an LM using your data. RandLM natively uses a disk-based method for creating n-grams and counts, but this will be slow for large corpora. Instead, you can create these ngram-count pairs using Hadoop (MapReduce). The RandLM release includes Hadoop scripts which take raw text files and create ngram-counts. We have built randomised LMs this way using more than 110 billion tokens.

The procedure for using Hadoop is as follows:

  • You first load raw and possibly tokenised text files onto the Hadoop Distributed File System (DFS). This will probably involve commands such as:
 hadoop dfs -put myFile data/
  • You then create ngram-counts using Hadoop (here a 5-gram):
 perl hadoop-lm-count.prl data data-counts 5 data-counting
  • You then upload the counts to the Unix filesystem:
 perl hadoopRead.prl data-counts | gzip - > /unix/path/to/counts.gz
  • These counts can then be passed to RandLM:
 ./buildlm -estimator batch  -smoothing WittenBell  -order 5 \
 -values 12 -struct LogFreqBloomFilter -tmp-dir /disk5/miles \
 -output-prefix giga3.rlm -output-dir /disk5/miles -falsepos 12 \
 -keep-tmp-files -sorted-by-ngram -input-type counts \
 -input-path /disk5/miles/counts.gz

Querying Randomised Language Models

Moses uses its own interface to RandLM, but it may be interesting to query the language model directly. The querylm binary (in randlm/bin) allows a randomized language model to be queried. Unless otherwise specified, the scores provided by the tool will be conditional log probabilities (subject to randomisation errors).

The following parameters are available:-

  • randlm: The path of the randomized language model built using the buildlm tool as described above.
  • test-path: The location of test data to be scored by the model.
  • test-type: The format of the test data: currently corpus and ngrams are supported. corpus will treat each line in the test file as a sentence and provide scores for all n-grams (adding <s> and </s>). ngrams will score each line once treating each as an independent n-gram.
  • get-counts: Return the counts of n-grams rather than conditional log probabilities (only supported by CountRandLM).
  • checks: Applies sequential checks to n-grams to avoid unnecessary false positives.

Example: The command

 ./querylm -randlm model.BloomMap -test-path testfile -test-type ngrams -order 3 > scores

would write out conditional log probabilities, one for each line in the file testfile.
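
To score whole sentences instead (adding <s> and </s>), the test type can be switched to corpus (a sketch based on the parameters described above):

 ./querylm -randlm model.BloomMap -test-path testfile -test-type corpus -order 3 > scores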

KenLM

KenLM is a language model that is simultaneously fast and low memory. The probabilities returned are the same as SRI, up to floating point rounding. It is maintained by Ken Heafield, who provides additional information on his website, such as benchmarks comparing speed and memory use against the other language model implementations. KenLM is distributed with Moses and compiled by default. KenLM is fully thread-safe for use with multi-threaded Moses.

Estimation

The lmplz program estimates language models with Modified Kneser-Ney smoothing and no pruning. Pass the order (-o), an amount of memory to use for building (-S), and a location to place temporary files (-T). Note that -S accepts the same size syntax as GNU sort, so e.g. 1G means 1 gigabyte and 80% means 80% of physical RAM. lmplz scales to much larger models than SRILM can handle and does not resort to approximation as IRSTLM does.

 bin/lmplz -o 5 -S 80% -T /tmp <text >text.arpa

See the page on estimation for more.

Using the EMS

To use lmplz in EMS, set the first three parameters below to your needs and copy the fourth one as is.

 # path to lmplz binary
 lmplz = $moses-bin-dir/lmplz
 # order of the language model
 order = 3
 # additional parameters to lmplz (check lmplz help message)
 settings = "-T $working-dir/tmp -S 10G"
 # this tells EMS to use lmplz and tells EMS where lmplz is located
 lm-training = "$moses-script-dir/generic/trainlm-lmplz.perl -lmplz $lmplz"

Querying

ARPA files can be read directly:

 KENLM factor=<factor> order=<order> path=filename.arpa

but the binary format loads much faster and provides more flexibility. The <order> field is ignored (KenLM uses the model's actual order). By contrast, SRI silently returns incorrect probabilities if you get the order wrong (Kneser-Ney smoothed probabilities for lower-order n-grams are conditioned on backing off).

Binary file

Using the binary format significantly reduces loading time. It also exposes more configuration options. The kenlm/build_binary program converts ARPA files to binary files:

 kenlm/build_binary filename.arpa filename.binary

This will build a binary file that can be used in place of the ARPA file. Note that, unlike IRST, the file extension does not matter; the binary format is recognized using magic bytes. You can also specify the data structure to use:

 kenlm/build_binary trie filename.arpa filename.binary

where valid values are probing, sorted, and trie. The default is probing. Generally, I recommend using probing if you have the memory and trie if you do not. See benchmarks for details. To determine the amount of RAM each data structure will take, provide only the arpa file:

 kenlm/build_binary filename.arpa

Bear in mind that this includes only language model size, not the phrase table or decoder state.

Building the trie entails an on-disk sort. You can optimize this by setting the sorting memory with -S using the same options as GNU sort e.g. 100M, 1G, 80%. Final model building will still use the amount of memory needed to store the model. The -T option lets you customize where to place temporary files (the default is based on the output file name).

 kenlm/build_binary -T /tmp/trie -S 1G trie filename.arpa filename.binary

Full or lazy loading

KenLM supports lazy loading via mmap. This allows you to further reduce memory usage, especially with trie, which has good memory locality. Lazy loading is specified by an additional argument to the KENLM feature function:

   KENLM ... lazyken=<true/false>

I recommend fully loading if you have the RAM for it; it actually takes less time to load the full model and use it because the disk does not have to seek during decoding. Lazy loading works best with local disk and is not recommended for networked filesystems.
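
Putting the pieces together, a KENLM feature line that reads a binary file with lazy loading enabled might look like this (a sketch; the factor and order values are placeholders):

 KENLM factor=0 order=5 path=filename.binary lazyken=true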

Probing

Probing is the fastest and default data structure. Unigram lookups happen by array index. Bigrams and longer n-grams are hashed to 64-bit integers which have very low probability of collision, even with the birthday attack. This 64-bit hash is the key to a probing hash table where values are probability and backoff.

A linear probing hash table is an array consisting of blanks (zeros) and entries with non-zero keys. Lookup proceeds by hashing the key modulo the array size, starting at this point in the array, and scanning forward until the entry or a blank is found. The ratio of array size to number of entries is controlled by the probing multiplier parameter p. This is a time-space tradeoff: space is linear in p and time is O(p/(p-1)). The value of p can be set at binary building time e.g.

 kenlm/build_binary -p 1.2 probing filename.arpa filename.binary

sets a value of 1.2. The default value is 1.5, meaning that one third of the array consists of blanks.

Trie

The trie data structure uses less memory than all other options (except RandLM with stupid backoff), has the best memory locality, and is still faster than any other toolkit. However, it does take longer to build. It works in much the same way as SRI and IRST's inverted option. Like probing, unigram lookup is an array index. Records in the trie have a word index, probability, backoff, and pointer. All of the records for n-grams of the same order are stored consecutively in memory. An n-gram's pointer is actually the index into the (n+1)-gram array where the block of (n+1)-grams with one more word of history starts. The end of this block is found by reading the next entry's pointer. Records within the block are sorted by word index. Because the vocabulary ids are randomly permuted, a uniform key distribution applies. Interpolation search within each block finds the word index and its corresponding probability, backoff, and pointer. The trie is compacted by using the minimum number of bits to store each integer. Probability is always non-positive, so the sign bit is also removed.

Since the trie stores many vocabulary ids and uses the minimum number of bits to do so, vocabulary filtering is highly effective at reducing overall model size, even if fewer higher-order n-grams are removed.

Quantization

The trie supports quantization to any number of bits from 1 to 25. To quantize to 8 bits, use -q 8. If you want to separately control probability and backoff quantization, use -q for probability and -b for backoff.
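
For example, a trie model quantized to 8 bits for both probabilities and backoffs could be built as follows (a sketch based on the flags just described):

 kenlm/build_binary -q 8 -b 8 trie filename.arpa filename.binary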

Array compression (also known as Chop)

The trie pointers comprise a sorted array. These can be compressed using a technique from Raj and Whittaker by chopping off bits and storing offsets instead. Since this is a time-space tradeoff (time is linear in the number of bits chopped), the -a option sets an upper bound on the number of bits to chop; it will never chop more bits than minimizes memory use. To minimize memory, use -a 64. To save time, specify a lower limit, e.g. -a 10.

Vocabulary lookup

The original strings are kept at the end of the binary file and passed to Moses at load time to obtain or generate Moses IDs. This is why lazy binary loading still takes a few seconds. KenLM stores a vector mapping from Moses ID to KenLM ID. The cost of this vector and of the Moses-side vocabulary storage is not included in the memory use reported by build_binary. However, benchmarks report the entire cost of running Moses.
