
Training: Overview

We will start with an overview of the training process. This should give a feel for what is going on and what files are produced. In the following, we will go into more details of the options of the training process and additional tools.

The training process takes place in nine steps, all of them executed by a single training script.

The nine steps are

  1. Prepare data (45 minutes)
  2. Run GIZA++ (16 hours)
  3. Align words (2:30 hours)
  4. Get lexical translation table (30 minutes)
  5. Extract phrases (10 minutes)
  6. Score phrases (1:15 hours)
  7. Build lexicalized reordering model (1 hour)
  8. Build generation models
  9. Create configuration file (1 second)

The run times mentioned in the steps refer to a recent training run on the 751,000-sentence, 16-million-word German-English Europarl corpus, on a 3 GHz Linux machine.

Running the training script

For a standard phrase model, you will run the training script like this:
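A sketch of a typical invocation, assuming the standard train-model.perl switches (--root-dir, --corpus, --f, --e, --lm); the corpus name and language-model path are placeholders:

```
train-model.perl --root-dir . \
    --corpus corpus/euro --f de --e en \
    --lm 0:3:/path/to/lm.gz
```

Here --f and --e name the source and target language extensions of the corpus files, and --lm specifies a language model as factor:order:filename.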

Alignment factors

It is usually better to carry out the word alignment (steps 2-3 of the training process) on more general word representations with richer statistics. Even successful word alignment with words stemmed to 4 characters has been reported. For factored models, this suggests that word alignment should be done on only one factor, either the surface form or the stem/lemma.

Which factors are used during word alignment is set with the --alignment-factors switch. Let us formally define the parameter syntax:

  • FACTOR = [ 0 - 9 ]+
  • FACTORMAP = FACTOR [ , FACTOR ]* - FACTOR [ , FACTOR ]*

The switch requires a FACTORMAP as argument, for instance 0-0 (using only factor 0 from source and target language) or 0,1,2-0,1 (using factors 0, 1, and 2 from the source language and 0 and 1 from the target language).
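To make the FACTORMAP syntax concrete, the following small sketch (a hypothetical helper, not part of Moses) splits a specification into its source and target factor lists:

```python
def parse_factormap(spec):
    """Parse a FACTORMAP such as '0-0' or '0,1,2-0,1' into a pair
    (source_factors, target_factors) of integer lists."""
    source, target = spec.split("-")
    return ([int(f) for f in source.split(",")],
            [int(f) for f in target.split(",")])

# '0-0': use only factor 0 on both sides.
print(parse_factormap("0-0"))        # ([0], [0])
# '0,1,2-0,1': factors 0, 1, 2 from the source, 0 and 1 from the target.
print(parse_factormap("0,1,2-0,1"))  # ([0, 1, 2], [0, 1])
```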

Translation factors

The purpose of factored translation model training is to create one or more translation tables between subsets of the factors. All translation tables are trained from the same word alignment, and are specified with the switch --translation-factors.

To define the syntax, we have to extend our parameter syntax with

  • FACTORMAPSET = FACTORMAP [ + FACTORMAP ]*

since we want to specify multiple mappings.

One example is 0-0+1-1, which creates two tables: one translating source factor 0 to target factor 0, and one translating source factor 1 to target factor 1.
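The full FACTORMAPSET syntax, with + separating the individual mappings, can be sketched as a small parser (a hypothetical helper, not part of Moses):

```python
def parse_factormapset(spec):
    """Parse a FACTORMAPSET such as '0-0+1-1' into a list of
    (source_factors, target_factors) pairs, one per translation table."""
    tables = []
    for factormap in spec.split("+"):
        source, target = factormap.split("-")
        tables.append(([int(f) for f in source.split(",")],
                       [int(f) for f in target.split(",")]))
    return tables

# '0-0+1-1' requests two translation tables:
# one mapping source factor 0 to target factor 0,
# one mapping source factor 1 to target factor 1.
print(parse_factormapset("0-0+1-1"))  # [([0], [0]), ([1], [1])]
```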
Reordering factors

Reordering tables can be trained with --reordering-factors, but this is currently not supported by any decoder. The syntax is the same as for translation factors.

Generation factors

Finally, we also want to create generation tables between target factors. Which tables to generate is specified with --generation-factors, which takes a FACTORMAPSET as a parameter. Note that this time the mapping is between target factors, not between source and target factors.

One example is 0-1, which creates a generation table between factors 0 and 1.
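As an illustration of what such a generation table contains (the words, tags, and probabilities below are invented examples, not real Moses output), take factor 0 to be the surface form and factor 1 a part-of-speech tag:

```python
# A generation table for '0-1' maps target factor 0 (surface form)
# to candidate values of target factor 1 (part-of-speech tag),
# each with a probability.
generation_table = {
    "house": {"NN": 1.0},
    "can":   {"MD": 0.7, "NN": 0.3},  # ambiguous: modal verb vs. noun
}

def generate(word):
    """Return the candidate factor-1 values for a factor-0 value."""
    return generation_table.get(word, {})

print(generate("can"))  # {'MD': 0.7, 'NN': 0.3}
```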

Page last modified on July 28, 2013, at 08:08 AM