
Tutorial for Using Factored Models

Note: There may be some discrepancies between this description and the actual workings of the training script.

To work through this tutorial, you first need to have the data in place. The instructions also assume that you have the training script and the decoder in your executable path.
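
If train-model.perl and moses are not yet on your path, a minimal sketch (assuming a standard Moses checkout under ~/mosesdecoder; adjust the paths to your installation) is:

 % export PATH=~/mosesdecoder/bin:~/mosesdecoder/scripts/training:$PATH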

You can obtain the data as follows:
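
A minimal sketch, assuming the corpus package is published as factored-corpus.tgz on the Moses download page (adjust the URL if it is hosted elsewhere):

 % wget http://www.statmt.org/moses/download/factored-corpus.tgz
 % tar xzf factored-corpus.tgz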

For more information on the training script, check the documentation, which is linked to on the right navigation column under "Training".

Train an unfactored model

The corpus package contains language models and parallel corpora with POS and lemma factors. Before playing with factored models, let us start with training a traditional phrase-based model:

 % train-model.perl \
    --root-dir unfactored \
    --corpus factored-corpus/proj-syndicate \
    --f de --e en \
    --lm 0:3:factored-corpus/surface.lm \
    --external-bin-dir .../tools \
    --input-factor-max 4

This creates a phrase-based model in the directory unfactored/model in about 20 minutes (on a 2.8 GHz machine). For a quicker training run that only takes a few minutes (with much worse results), use just the first 1000 sentence pairs of the corpus, contained in factored-corpus/proj-syndicate.1000.

 % train-model.perl \
    --root-dir unfactored \
    --corpus factored-corpus/proj-syndicate.1000 \
    --f de --e en \
    --lm 0:3:factored-corpus/surface.lm \
    --external-bin-dir .../tools \
    --input-factor-max 4

This creates a typical phrase-based model, as specified in the created configuration file moses.ini. Here is the part of the file that points to the phrase table:

 [feature]
 PhraseDictionaryMemory ... path=/.../phrase-table.gz ...

You can take a look at the generated phrase table, which as usual starts with rubbish but occasionally contains some nice entries. The scores (the phrase translation and lexical weighting probabilities in both directions, plus the constant phrase penalty 2.718) ensure that during decoding the good entries are preferred.
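
To inspect the table yourself, a quick sketch (assuming the default output location under unfactored/model/; the exact path is listed in moses.ini):

 % zcat unfactored/model/phrase-table.gz | head -4
 % zcat unfactored/model/phrase-table.gz | grep '^frage |||'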

 ! ||| ! ||| 1 1 1 1 2.718
 " ( ||| " ( ||| 1 0.856401 1 0.779352 2.718
 " ) , ein neuer film ||| " a new film ||| 1 0.0038467 1 0.128157 2.718
 " ) , ein neuer film über ||| " a new film about ||| 1 0.000831718 1 0.0170876 2.71
 [...]
 frage ||| issue ||| 0.25 0.285714 0.25 0.166667 2.718
 frage ||| question ||| 0.75 0.555556 0.75 0.416667 2.718

Train a model with POS tags

Take a look at the training data. Each word is not only represented by its surface form (as you would expect in raw text), but is also annotated with additional factors.

 % tail -n 1 factored-corpus/proj-syndicate.??
 ==> factored-corpus/proj-syndicate.de <==
 korruption|korruption|nn|nn.fem.cas.sg floriert|florieren|vvfin|vvfin .|.|per|per

 ==> factored-corpus/proj-syndicate.en <==
 corruption|corruption|nn flourishes|flourish|nns .|.|.

The German factors are

  • surface form,
  • lemma,
  • part of speech, and
  • part of speech with additional morphological information.

The English factors are

  • surface form,
  • lemma, and
  • part of speech.

Let us start simple and build a translation model that adds only the target part-of-speech factor on the output side:

 % train-model.perl \
    --root-dir pos \
    --corpus factored-corpus/proj-syndicate.1000 \
    --f de --e en \
    --lm 0:3:factored-corpus/surface.lm \
    --lm 2:3:factored-corpus/pos.lm \
    --translation-factors 0-0,2 \
    --external-bin-dir .../tools

Here, we specify with --translation-factors 0-0,2 that the input factor for the translation table is the (0) surface form, and the output factor is (0) surface form and (2) part of speech.

 [feature]
 PhraseDictionaryMemory ... input-factor=0 output-factor=0,2

The resulting phrase table looks very similar, but now also contains part-of-speech tags on the English side:

 ! ||| !|. ||| 1 1 1 1 2.718
 " ( ||| "|" (|( ||| 1 0.856401 1 0.779352 2.718
 " ) , ein neuer film ||| "|" a|dt new|jj film|nn ||| 1 0.00403191 1 0.128157 2.718
 " ) , ein neuer film über ||| "|" a|dt new|jj film|nn about|in ||| 1 0.000871765 1 0.0170876 2.718
 [...]
 frage ||| issue|nn ||| 0.25 0.285714 0.25 0.166667 2.718
 frage ||| question|nn ||| 0.75 0.625 0.75 0.416667 2.718

We also specified two language models. Besides the regular language model based on surface forms, we have a second language model that is trained on POS tags. In the configuration file this is indicated by two lines in the LM section:

 [feature]
 KENLM name=LM0 ...
 KENLM name=LM1 ...

Also, two language model weights are specified:

 [weight]
 LM0= 0.5
 LM1= 0.5

The part-of-speech language model encodes preferences such as: a determiner-adjective sequence is likely to be followed by a noun, and less likely by another determiner:

 -0.192859       dt jj nn
 -2.952967       dt jj dt

This model can be used just like a normal phrase-based model:

 % echo 'putin beschreibt menschen .' > in
 % moses -f pos/model/moses.ini < in
 [...]
 BEST TRANSLATION: putin|nnp describes|vbz people|nns .|. [1111]  [total=-6.049]
 <<0.000, -4.000, 0.000, -29.403, -11.731, -0.589, -1.303, -0.379, -0.556, 4.000>>
 [...]

During the decoding process, not only words (putin) but also part-of-speech tags (nnp) are generated.

Let's take a look at what happens if we input a German sentence that starts with the object:

 % echo 'menschen beschreibt putin .' > in
 % moses -f pos/model/moses.ini < in
 BEST TRANSLATION: people|nns describes|vbz putin|nnp .|. [1111]  [total=-8.030]
 <<0.000, -4.000, 0.000, -31.289, -17.770, -0.589, -1.303, -0.379, -0.556, 4.000>>

Now, this is not a very good translation. The model's aversion to reordering trumps its ability to come up with a good translation. If we downweight the reordering model, we get a better translation:

 % moses -f pos/model/moses.ini < in -d 0.2
 BEST TRANSLATION: putin|nnp describes|vbz people|nns .|. [1111]  [total=-7.649]
 <<-8.000, -4.000, 0.000, -29.403, -11.731, -0.589, -1.303, -0.379, -0.556, 4.000>>

Note that this better translation is mostly driven by the part-of-speech language model, which prefers the sequence nnp vbz nns . (-11.731) over the sequence nns vbz nnp . (-17.770). The surface form language model only shows a slight preference (-29.403 vs. -31.289). This is because these words have not been seen next to each other before, so the language model has very little to work with. The part-of-speech language model is aware of the grammatical number of the nouns involved and prefers a singular noun before a singular verb (nnp vbz) over a plural noun before a singular verb (nns vbz).
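
You can check these preferences directly by scoring the two tag sequences against the part-of-speech language model. A sketch, assuming the query tool built alongside the moses decoder is on your path (SRILM's ngram -ppl works similarly):

 % echo 'nnp vbz nns .' | query factored-corpus/pos.lm
 % echo 'nns vbz nnp .' | query factored-corpus/pos.lm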

To drive this point home: the unfactored model is not able to find the right translation, even with a downweighted reordering model:

 % moses -f unfactored/model/moses.ini < in -d 0.2
 people describes putin . [1111]  [total=-11.410]
 <<0.000, -4.000, 0.000, -31.289, -0.589, -1.303, -0.379, -0.556, 4.000>>

Train a model with generation and translation steps

Let us now train a slightly different factored model with the same factors. Instead of mapping from the German input surface form directly to the English output surface form and part of speech, we now break this up into two mapping steps, one translation step that maps surface forms to surface forms, and a second step that generates the part of speech from the surface form on the output side:

 % train-model.perl \
    --root-dir pos-decomposed \
    --corpus factored-corpus/proj-syndicate.1000 \
    --f de --e en \
    --lm 0:3:factored-corpus/surface.lm \
    --lm 2:3:factored-corpus/pos.lm \
    --translation-factors 0-0 \
    --generation-factors 0-2 \
    --decoding-steps t0,g0 \
    --external-bin-dir .../tools

Now, the translation step is specified only between surface forms (--translation-factors 0-0), and a generation step is specified (--generation-factors 0-2), mapping the (0) surface form to the (2) part of speech. We also need to specify the order in which the mapping steps are applied (--decoding-steps t0,g0).

Besides the phrase table, which has the same format as the unfactored phrase table, we now also have a generation table. It is referenced in the configuration file:

 [feature]
 Generation ... input-factor=0 output-factor=2

 [weight]
 GenerationModel0= 0.3 0

Let us take a look at the generation table:

 % more pos-decomposed/model/generation.0-2
 nigerian nnp 1.0000000  0.0008163
 proven vbn 1.0000000  0.0021142
 issue nn 1.0000000  0.0021591
 [...]
 control vb 0.1666667  0.0014451
 control nn 0.8333333  0.0017992
 [...]

The beginning is not very interesting. Like most words, nigerian, proven, and issue occur with only one part of speech, e.g., p(nnp|nigerian) = 1.0000000. Some words, however, such as control, occur with multiple parts of speech, such as base form verb (vb) and singular noun (nn).

The table also contains the reverse probability p(nigerian|nnp) = 0.0008163. In our example, this may not be a very useful feature. It mostly hurts open-class words, especially unusual ones. If we do not want this feature, we can train the generation model with a single feature using the switch --generation-type single.
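
For example, the model above could be retrained with a single-feature generation table as follows (a sketch; the root directory name pos-decomposed-single is just an illustrative choice, all other parameters are unchanged):

 % train-model.perl \
    --root-dir pos-decomposed-single \
    --corpus factored-corpus/proj-syndicate.1000 \
    --f de --e en \
    --lm 0:3:factored-corpus/surface.lm \
    --lm 2:3:factored-corpus/pos.lm \
    --translation-factors 0-0 \
    --generation-factors 0-2 \
    --generation-type single \
    --decoding-steps t0,g0 \
    --external-bin-dir .../tools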

Train a morphological analysis and generation model

Translating surface forms seems to be a somewhat questionable pursuit. It does not seem to make much sense to treat different word forms of the same lemma, such as mensch and menschen, differently. In the worst case, we will have seen only one of the word forms, so we are not able to translate the other. This is in fact what happens in this example:

 % echo 'ein mensch beschreibt putin .' > in
 % moses -f unfactored/model/moses.ini < in
 a mensch|UNK|UNK|UNK describes putin . [11111]  [total=-158.818] 
 <<0.000, -5.000, -100.000, -127.565, -1.350, -1.871, -0.301, -0.652, 4.000>>

Factored translation models allow us to create models that do morphological analysis and generation during the translation process. Let us now train such a model:

 % train-model.perl \
    --root-dir morphgen \
    --corpus factored-corpus/proj-syndicate.1000 \
    --f de --e en \
    --lm 0:3:factored-corpus/surface.lm \
    --lm 2:3:factored-corpus/pos.lm \
    --translation-factors 1-1+3-2 \
    --generation-factors 1-2+1,2-0 \
    --decoding-steps t0,g0,t1,g1 \
    --external-bin-dir .../tools

We have a total of four mapping steps:

  • a translation step that maps lemmas (1-1),
  • a generation step that sets possible part-of-speech tags for a lemma (1-2),
  • a translation step that maps morphological information to part-of-speech tags (3-2), and
  • a generation step that maps part-of-speech tag and lemma to a surface form (1,2-0).

This enables us now to translate the sentence above:

 % echo 'ein|ein|art|art.indef.z mensch|mensch|nn|nn.masc.nom.sg \
   beschreibte|beschreiben|vvfin|vvfin putin|putin|nn|nn.masc.cas.sg \
   .|.|per|per' > in
 % moses -f morphgen/model/moses.ini < in
 BEST TRANSLATION: a|a|dt individual|individual|nn describes|describe|vbz \
   putin|putin|nnp .|.|. [11111]  [total=-17.269] 
 <<0.000, -5.000, 0.000, -38.631, -13.357, -2.773, -21.024, 0.000, -1.386, \
   -1.796, -4.341, -3.189, -4.630, 4.999, -13.478, -14.079, -4.911, -5.774, 4.999>>

Note that this is only possible because we have seen an appropriate word form in the output language. The word individual occurs as a singular noun in the parallel corpus, as a translation of einzelnen. To overcome this limitation, we may train generation models on large monolingual corpora, where we expect to see all possible word forms.

Train a model with multiple decoding paths

Decomposing translation into a process of morphological analysis and generation will make our translation model more robust. However, if we have seen a phrase of surface forms before, it may be better to take advantage of such rich evidence.

The above model translates sentences poorly, as it does not use the source surface form at all, relying instead on translating the properties of the surface forms.

In practice, we fare better when we allow both ways to translate in parallel. Such a model is trained by the introduction of decoding paths. In our example, one decoding path is the morphological analysis and generation as above; the other path is the direct mapping of surface forms to surface forms (and part-of-speech tags, since we are using a part-of-speech tag language model):

 % train-model.perl \
    --corpus factored-corpus/proj-syndicate.1000 \
    --root-dir morphgen-backoff \
    --f de --e en \
    --lm 0:3:factored-corpus/surface.lm \
    --lm 2:3:factored-corpus/pos.lm \
    --translation-factors 1-1+3-2+0-0,2 \
    --generation-factors 1-2+1,2-0 \
    --decoding-steps t0,g0,t1,g1:t2 \
    --external-bin-dir .../tools

This command is almost identical to the previous training run, except for the additional translation table 0-0,2 and its inclusion as a different decoding path :t2.

A strategy for translating a surface form that has not been seen in the training corpus is to translate its lemma instead. This is especially useful when translating from a morphologically rich language into a morphologically simpler one, such as from German into English.

 % train-model.perl \
    --corpus factored-corpus/proj-syndicate.1000 \
    --root-dir lemma-backoff \
    --f de --e en \
    --lm 0:3:factored-corpus/surface.lm \
    --lm 2:3:factored-corpus/pos.lm \
    --translation-factors 0-0,2+1-0,2 \
    --decoding-steps t0:t1 \
    --external-bin-dir .../tools
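
As a quick check of the back-off behaviour, you can decode a factored version of the earlier example sentence with this model, supplying just the surface form and lemma factors it uses (a sketch; the exact output and scores will depend on your training run):

 % echo 'ein|ein mensch|mensch beschreibt|beschreiben putin|putin .|.' > in
 % moses -f lemma-backoff/model/moses.ini < in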