Moses statistical machine translation system

Efficient Phrase and Rule Storage

Binary Phrase Tables with On-demand Loading

For larger tasks the phrase tables usually become huge, typically too large to fit into memory. Therefore, Moses supports a binary phrase table with on-demand loading, i.e. only the part of the phrase table that is required to translate a sentence is loaded into memory.

There are currently 3 binary formats to do this:

  • OnDisk phrase-table. Works with SCFG models and phrase-based models.
  • Binary phrase-table. Works with phrase-based models only.
  • Compact phrase-table. Works with phrase-based models only (may be extended in the near future). Small and fast. Described below.

On-Disk Phrase table

This phrase table can be used for both phrase-based models and hierarchical models. (It can be used for fully syntactic models too, but is likely to be very slow.)

You first need to convert the rule table into a binary prefix format. This is done with the command CreateOnDiskPt:

 CreateOnDiskPt [#source factors] [#target factors] [#scores] [ttable-limit] \
    [index of p(e|f) (usually 2)] [input text pt] [output directory]

e.g.

  ~/CreateOnDiskPt 1 1 4 100 2 phrase-table.1.gz phrase-table.1.folder

This will create a directory, phrase-table.1.folder, with the following files:

  Misc.dat
  Source.dat
  TargetColl.dat
  TargetInd.dat
  Vocab.dat
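
Before pointing the decoder at the new directory, it can be worth checking that the conversion wrote all five files. The following is a minimal sketch; check_ondisk_pt is a hypothetical helper name, not a Moses tool, and it only tests that the files listed above exist, not that their contents are valid.

```shell
# Sketch: verify that an OnDisk phrase-table directory contains the five
# files listed above. check_ondisk_pt is a hypothetical helper, not part of Moses.
check_ondisk_pt() {
  dir=$1
  status=0
  for f in Misc.dat Source.dat TargetColl.dat TargetInd.dat Vocab.dat; do
    if [ ! -f "$dir/$f" ]; then
      # Report every missing file, then fail at the end.
      echo "missing: $dir/$f" >&2
      status=1
    fi
  done
  return $status
}
```

For example, check_ondisk_pt phrase-table.1.folder returns a non-zero exit status and names the missing files on stderr if the directory is incomplete.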

The configuration file moses.ini should also be changed so that the binary files are used instead of the text file. You should change it from:

   [feature]
   PhraseDictionaryMemory path=phrase-table.1.gz ....

to

   [feature]
   PhraseDictionaryOnDisk path=phrase-table.1.folder ....
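
The config change above can also be scripted. This is a minimal sketch, assuming the feature line starts with PhraseDictionaryMemory and carries a path= argument as shown; use_ondisk_pt is a hypothetical helper name. It prints the rewritten configuration to stdout and leaves the original file untouched.

```shell
# Sketch: rewrite a PhraseDictionaryMemory feature line so that it uses
# PhraseDictionaryOnDisk with a new path. use_ondisk_pt is a hypothetical
# helper; the new path must not contain '|' or '&' (sed metacharacters here).
use_ondisk_pt() {
  ini=$1       # existing moses.ini
  new_path=$2  # directory written by CreateOnDiskPt
  sed -e "/^[[:space:]]*PhraseDictionaryMemory[[:space:]]/ {
    s|PhraseDictionaryMemory|PhraseDictionaryOnDisk|
    s|path=[^[:space:]]*|path=$new_path|
  }" "$ini"
}
```

A typical call would be use_ondisk_pt moses.ini phrase-table.1.folder > moses.ondisk.ini, where the output file name is only an example. All other arguments on the feature line are passed through unchanged.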

Compact Phrase Table

A compact phrase table implementation is available that is around 6 to 7 times smaller than the original binary phrase table. It can be used in-memory and for on-demand loading. Like the original phrase table, it can only be used for phrase-based models. If you use this or the compact lexical reordering table below, please cite the Junczys-Dowmunt (2012) papers referenced in the discussion of encoding methods and indexing options below.

Download the CMPH library from http://sourceforge.net/projects/cmph/ and install according to the included instructions. Make sure the installation target directory contains an "include" and a "lib" directory. Next you need to recompile Moses with

  ./bjam --with-cmph=/path/to/cmph

Now you can convert the standard ASCII phrase tables into the compact format. Phrase tables are required to be sorted as above. For maximal compression, it is advised to generate a phrase table with phrase-internal word alignment (this is the default). If you want to compress a phrase table without alignment information, use -encoding None instead (see the advanced options below). It is possible to use the default encoding (PREnc) without alignment information, but it will take much longer. For now, compacting phrase tables on 32-bit systems may fail, since virtual memory usage quickly exceeds the 3 GB barrier.

Here is an example (a standard phrase table phrase-table.gz with 4 scores) which produces a single file phrase-table.minphr:

  mosesdecoder/bin/processPhraseTableMin -in phrase-table.gz -out phrase-table -nscores 4 -threads 4

In the Moses config file, specify the WHOLE file name of the phrase table:

 [feature]
 PhraseDictionaryCompact path=phrase-table.minphr ...

Options:

  • -in string -- input table file name
  • -out string -- prefix of binary table file
  • -nscores int -- number of score components in phrase table
  • -no-alignment-info -- do not include alignment info in the binary phrase table
  • -threads int -- number of threads used for conversion
  • -T string -- path to custom temporary directory

As with the original phrase table, the option -no-alignment-info omits phrase-internal alignment information from the binary phrase table. It should also be used if the input phrase table contains no alignment information. In that case you should also use -encoding None (see below), since the default compression method assumes that alignment information is present.

Since compression is quite processor-heavy, it is advised to use the -threads option to increase speed.

Advanced options: Default settings should be fine for most of your needs, but the size of the phrase table can be tuned to your specific needs with the help of the advanced options.

Options:

  • -encoding string -- encoding type: PREnc REnc None (default PREnc)
  • -rankscore int -- score index of P(t|s) (default 2)
  • -maxrank int -- maximum rank for PREnc (default 100)
  • -landmark int -- use landmark phrase every 2^n source phrases (default 10)
  • -fingerprint int -- number of bits used for source phrase fingerprints (default 16)
  • -join-scores -- single set of Huffman codes for score components
  • -quantize int -- maximum number of scores per score component
  • -no-warnings -- suppress warnings about missing alignment data

Encoding methods: There are two encoding types that can be used on top of the standard compression methods, Phrasal Rank-Encoding (PREnc) and word-based Rank Encoding (REnc). PREnc (see Junczys-Dowmunt (MT Marathon 2012) for details) is used by default and requires a phrase table with phrase-internal alignment to reach its full potential. PREnc can also work without explicit alignment information; encoding is then slower and the resulting file is bigger, though still smaller than without PREnc. The tool will warn you about every line that lacks alignment information if you use PREnc or REnc. These warnings can be suppressed with -no-warnings. If you use PREnc with non-standard scores, you should specify which score is used for sorting with -rankscore int. By default this is P(t|s), which in the standard phrase table is the third score (index 2).

With PREnc available, there is basically no reason to use REnc unless you really want to. REnc requires the lexical translation table lex.f2e to be present in the same directory as the text version of the phrase table. If no alignment information is available, it falls back to None (see Junczys-Dowmunt (EAMT 2012) for details on REnc and None).

None is the fastest encoding method, but produces the biggest files. Concerning translation speed, there is virtually no difference between the encoding methods when the phrase tables are later used with Moses, but smaller files result in lower memory usage, especially if the phrase tables are loaded entirely into memory.

Indexing options: The properties of the source phrase index can be modified with the -landmark and -fingerprint options. Changing these options can affect file size and translation quality, so do it carefully. Junczys-Dowmunt (TSD 2012) contains a discussion of these values and their effects.

Scores and quantization: You can reduce the file size even more by using score quantization. For example, with -quantize 1000000, a phrase table is generated that uses at most one million different scores for each score type. Be careful: low values will affect translation quality. By default, each score type is encoded with its own set of Huffman codes; with the -join-scores option, only one set is used. If this option is combined with -quantize N, the total number of different scores across all score types will not exceed N.

In-memory loading: You can start Moses with the option -minphr-memory to load the compact phrase table directly into memory at start up. Without this option, on-demand loading is used by default.

Compact Lexical Reordering Table

The compact lexical reordering table produces files about 12 to 15 times smaller than the original Moses binary implementation. As for the compact phrase table, you need to install CMPH and link against it. Reordering tables must be sorted in the same way as the phrase tables above. The command below produces a single file reordering-table.minlexr.

  mosesdecoder/bin/processLexicalTableMin -in reordering-table.gz -out reordering-table -threads 4

If you include the prefix in the Moses config file, the compact reordering table will be recognized and loaded automatically. You can start Moses with the option -minlexr-memory to load the compact lexical reordering table directly into memory at start up.

Options: See the compact phrase table above for a description of available common options.

Pruning the Translation Table

The translation table contains all phrase pairs found in the parallel corpus, which includes a lot of noise. To reduce the noise, Johnson et al. suggested pruning out unlikely phrase pairs. For more detail, please refer to the paper:

H. Johnson, J. Martin, G. Foster and R. Kuhn. (2007) Improving Translation Quality by Discarding Most of the Phrasetable. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 967-975.

Build Instructions

Moses includes a re-implementation of this method in the directory contrib/sigtest-filter. You first need to build it from the source files.

This implementation relies on Joy Zhang's SALM. The SALM source code can be downloaded from GitHub; Joy's original code is also available online.

  1. Download and extract the SALM source release.
  2. In SALM/Distribution/Linux, type: make
  3. Enter the directory contrib/sigtest-filter in the main Moses distribution directory.
  4. Type: make SALMDIR=/path/to/SALM

Usage Instructions

Using SALM/Bin/Linux/Index/IndexSA.O32, create a suffix array index of the source and target sides of your training bitext (SOURCE, TARGET).

 % SALM/Bin/Linux/Index/IndexSA.O32 TARGET
 % SALM/Bin/Linux/Index/IndexSA.O32 SOURCE

Prune the phrase table:

 % cat phrase-table | ./filter-pt -e TARGET -f SOURCE -l FILTER-VALUE > phrase-table.pruned

FILTER-VALUE is the -log prob threshold described in Johnson et al. (2007). It may be either 'a+e', 'a-e', or a positive real value. Run filter-pt with no options to see more use cases. A good setting is -l a+e -n 30, which also keeps only the top 30 phrase translations for each source phrase, based on p(e|f).
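
To get a feel for what a numeric threshold means: as described in Johnson et al. (2007), a phrase pair that occurs exactly once on each side and once together has a -log p-value of α = log(N), where N is the number of sentence pairs in the bitext; 'a+e' and 'a-e' denote thresholds just above and just below that value. The following minimal sketch computes α from the line count of one side of the corpus. sig_alpha is a hypothetical helper name, and the use of the natural logarithm is an assumption about how the threshold is scaled.

```shell
# Sketch: print alpha = log(N), where N is the number of lines (sentence pairs)
# in one side of the training bitext. sig_alpha is a hypothetical helper;
# the natural logarithm is assumed.
sig_alpha() {
  awk 'END { if (NR > 0) printf "%.4f\n", log(NR) }' "$1"
}
```

For a corpus with 100 sentence pairs, sig_alpha SOURCE prints 4.6052, so a positive real FILTER-VALUE above that figure is stricter than a+e and one below it is more lenient than a-e.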

If you filter a hierarchical model, add the switch -h.

Using the EMS

To use this method in experiment.perl, you will have to add two settings in the TRAINING section:

 salm-index = /path/to/project/salm/Bin/Linux/Index/IndexSA.O64
 sigtest-filter = "-l a+e -n 50"

The setting salm-index points to the binary that builds the suffix array, and sigtest-filter contains the options for filtering (excluding -e, -f, -h). EMS automatically detects whether you are filtering a phrase-based or a hierarchical model and whether a reordering model is used.

Pruning the Phrase Table based on Relative Entropy

While the pruning method in Johnson et al. (2007) is designed to remove spurious phrase pairs due to noisy data, it is also possible to remove phrase pairs that are redundant, that is, phrase pairs that can be composed from smaller phrase pairs in the model with similar probabilities. For more detail, please refer to the following papers:

Ling, W., Graça, J., Trancoso, I., and Black, A. (2012). Entropy-based Pruning for Phrase-based Machine Translation. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 962-971.

Zens, R., Stanton, D., and Xu, P. (2012). A Systematic Comparison of Phrase Table Pruning Techniques. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 972-983.

The code from Ling et al. (2012) is available at contrib/relent-filter.

Update: The code in contrib/relent-filter no longer works with the current version of Moses. To compile it, use an older version of Moses with this command:

    git checkout RELEASE-0.91

Build Instructions

The binaries for Relative Entropy-based Pruning are built automatically with Moses. However, this implementation also calculates the significance scores (Johnson et al., 2007), using a slightly modified version of the code by Chris Dyer, which is in contrib/relent-filter/sigtest-filter. This must be built using the same procedure:

  1. Download and build SALM (see the build instructions for contrib/sigtest-filter above)
  2. Run "make SALMDIR=/path/to/SALM" in "contrib/relent-filter/sigtest-filter" to create the executable filter-pt

Usage Instructions

Checklist of required files (referred to below as <varname>):

  1. s_train - source training file
  2. t_train - target training file
  3. moses_ini - path to the Moses configuration file (after tuning)
  4. pruning_binaries - path to the relent pruning binaries (should be "bin" if no changes were made)
  5. pruning_scripts - path to the relent pruning scripts (should be "contrib/relent-filter/scripts" if no changes were made)
  6. sigbin - path to the sigtest filter binaries (should be "contrib/relent-filter/sigtest-filter" if no changes were made)
  7. output_dir - path to write the output

Build suffix arrays for the source and target parallel training data:

 % SALM/Bin/Linux/Index/IndexSA.O32 <s_train>
 % SALM/Bin/Linux/Index/IndexSA.O32 <t_train>

Calculate phrase pair scores by running:

 % perl <pruning_scripts>/calcPruningScores.pl -moses_ini <moses_ini> \
   -training_s <s_train> -training_t <t_train> \
   -prune_bin <pruning_binaries> -prune_scripts <pruning_scripts> \
   -moses_scripts <path_to_moses>/scripts/training/ \
   -workdir <output_dir> -dec_size 10000

This will create the following files in the <output_dir>/scores/ dir:

  1. count.txt - counts of the phrase pairs for N(s,t) N(s,*) and N(*,t)
  2. divergence.txt - negative log of the divergence of the phrase pair
  3. empirical.txt - empirical distribution of the phrase pairs N(s,t)/N(*,*)
  4. rel_ent.txt - relative entropy of the phrase pairs
  5. significance.txt - significance of the phrase pairs

You can use any one of these files for pruning and also combine these scores using the script <pruning_scripts>/interpolateScores.pl.

To actually prune a phrase table, run <pruning_scripts>/prunePT.pl, which prunes phrase pairs based on the score file that is used. The script prunes the phrase pairs with the lowest scores first.

For instance, to prune 30% of the phrase table using relative entropy run:

 % perl <pruning_scripts>/prunePT.pl -table <phrase_table_file> \
 -scores <output_dir>/scores/rel_ent.txt -percentage 70 > <pruned_phrase_table_file>

You can also prune by threshold:

 % perl <pruning_scripts>/prunePT.pl -table <phrase_table_file> \
 -scores <output_dir>/scores/rel_ent.txt -threshold 0.1 > <pruned_phrase_table_file>

The same must be done for the reordering table by replacing <phrase_table_file> with <reord_table_file>:

 % perl <pruning_scripts>/prunePT.pl -table <reord_table_file> \
 -scores <output_dir>/scores/rel_ent.txt -percentage 70 > <pruned_reord_table_file>

Parallelization

The script <pruning_scripts>/calcPruningScores.pl requires forced decoding of the whole set of phrase pairs in the phrase table, so unless it is used for a small corpus, it usually takes a long time to run. We therefore recommend running multiple instances of <pruning_scripts>/calcPruningScores.pl in parallel, each processing a different part of the phrase table.

To do this, run:

 % perl <pruning_scripts>/calcPruningScores.pl -moses_ini <moses_ini> \
 -training_s <s_train> -training_t <t_train> \
 -prune_bin <pruning_binaries> -prune_scripts <pruning_scripts> \
 -moses_scripts <path_to_moses>/scripts/training/ \
 -workdir <output_dir> -dec_size 10000 -start 0 -end 100000

The -start and -end options tell the script to only calculate the results for phrase pairs between 0 and 99999.

Thus, an example of a shell script to run for the whole phrase table would be:

 # number of phrase pairs (lines) in the phrase table
 size=`wc <phrase_table_file> | gawk '{print $1}'`
 phrases_per_process=100000

 # one run per chunk, each with its own working directory
 for i in $(seq 0 $phrases_per_process $size)
 do
   end=`expr $i + $phrases_per_process`
   perl <pruning_scripts>/calcPruningScores.pl -moses_ini <moses_ini> \
   -training_s <s_train> -training_t <t_train> \
   -prune_bin <pruning_binaries> -prune_scripts <pruning_scripts> \
   -moses_scripts <path_to_moses>/scripts/training/ \
   -workdir <output_dir>.$i-$end -dec_size 10000 -start $i -end $end
 done

After all processes finish, simply join the partial score files together in the same order.
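
Joining the partial files can be sketched as follows. This assumes that each run wrote its results to a scores/ subdirectory of its own working directory, named as in the loop above (base.START-END); join_scores is a hypothetical helper name. Chunks are visited in increasing order of their start offset so that the joined file lines up with the phrase table.

```shell
# Sketch: concatenate partial score files in chunk order.
# Usage: join_scores BASE SIZE STEP NAME > joined-file
#   BASE  common prefix of the per-chunk working directories
#   SIZE  number of phrase pairs in the phrase table
#   STEP  phrase pairs per process (same value as used when launching)
#   NAME  score file to join, e.g. rel_ent.txt
# join_scores is a hypothetical helper; the directory layout is an assumption.
join_scores() {
  base=$1; size=$2; step=$3; name=$4
  i=0
  # Same chunk starts as "seq 0 STEP SIZE": 0, STEP, 2*STEP, ... up to SIZE.
  while [ "$i" -le "$size" ]; do
    end=$((i + step))
    part="$base.$i-$end/scores/$name"
    if [ ! -f "$part" ]; then
      echo "missing partial file: $part" >&2
      return 1
    fi
    cat "$part"
    i=$end
  done
}
```

For example, join_scores "$output_dir" "$size" 100000 rel_ent.txt > rel_ent.txt, with output_dir and size set as in the launching script. The function stops with an error rather than silently skipping a chunk, since a gap would misalign every later score.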

Pruning Rules based on Low Scores

Rules can also be removed simply because some of their scores are too low. This can be done at the time of the phrase table creation.

  train-model.perl [...]  \
  -score-options="-MinScore FIELD1:THRESHOLD1[,FIELD2:THRESHOLD2[,FIELD3:THRESHOLD3]]"

where FIELDn is the position of the score (typically 2 for the direct phrase probability p(e|f), or 0 for the indirect phrase probability p(f|e)) and THRESHOLDn is the minimum probability allowed. A good setting is 2:0.0001, which removes all rules whose direct phrase translation probability is below 0.0001.
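
The effect of such a threshold can be illustrated on an existing text phrase table. The sketch below is not the scorer's implementation; it only mimics the filtering rule, assuming the usual text format in which fields are separated by " ||| " and the third field holds the space-separated scores. min_score_filter is a hypothetical helper name.

```shell
# Sketch: keep only phrase-table lines whose score at position FIELD
# (0-based, within the third " ||| "-separated field) is at least THRESHOLD.
# Usage: min_score_filter FIELD THRESHOLD < phrase-table > filtered
# min_score_filter is a hypothetical helper, not part of Moses.
min_score_filter() {
  awk -F ' [|][|][|] ' -v field="$1" -v thr="$2" '{
    split($3, s, " ")                          # s[1..n] are the scores
    if (s[field + 1] + 0 >= thr + 0) print     # awk arrays are 1-based
  }'
}
```

For a gzipped table, this can be used as zcat phrase-table.gz | min_score_filter 2 0.0001 | gzip > phrase-table.filtered.gz, where the output name is only an example.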

In EMS, this can be specified in the TRAINING:score-settings setting, for instance:

  score-settings = "--MinScore 2:0.0001"
Page last modified on March 11, 2015, at 05:39 PM