Note: There may be some discrepancies between this description and the actual workings of the training script.
To work through this tutorial, you first need to have the data in place. The instructions also assume that you have the training script and the decoder in your executable path.
You can obtain the data as follows:
  % tar xzf factored-corpus.tgz
For more information on the training script, check the documentation, which is linked to on the right navigation column under "Training".
The corpus package contains language models and parallel corpora with POS and lemma factors. Before playing with factored models, let us start with training a traditional phrase-based model:
  % train-model.perl \
     --root-dir unfactored \
     --corpus factored-corpus/proj-syndicate \
     --f de --e en \
     --lm 0:3:factored-corpus/surface.lm:0 \
     --external-bin-dir .../tools
This creates a phrase-based model in the directory
unfactored/model in about 20 minutes (on a 2.8 GHz machine). For a quicker training run that only takes a few minutes (with much worse results), use just the first 1000 sentence pairs of the corpus, contained in factored-corpus/proj-syndicate.1000:
  % train-model.perl \
     --root-dir unfactored \
     --corpus factored-corpus/proj-syndicate.1000 \
     --f de --e en \
     --lm 0:3:factored-corpus/surface.lm:0 \
     --external-bin-dir .../tools
This creates a typical phrase-based model, as specified in the created configuration file
moses.ini. Here is the part of the file that points to the phrase table:
  [feature]
  PhraseDictionaryMemory ... path=/.../phrase-table.gz ...
You can take a look at the generated phrase table, which starts as usual with rubbish but then occasionally contains some nice entries. The scores ensure that during decoding the good entries are preferred.
  ! ||| ! ||| 1 1 1 1 2.718
  " ( ||| " ( ||| 1 0.856401 1 0.779352 2.718
  " ) , ein neuer film ||| " a new film ||| 1 0.0038467 1 0.128157 2.718
  " ) , ein neuer film über ||| " a new film about ||| 1 0.000831718 1 0.0170876 2.718
  [...]
  frage ||| issue ||| 0.25 0.285714 0.25 0.166667 2.718
  frage ||| question ||| 0.75 0.555556 0.75 0.416667 2.718
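Each line of the phrase table holds a source phrase, a target phrase, and five score columns, separated by |||; the last score column is the constant phrase penalty e ≈ 2.718. A minimal parsing sketch in Python (the helper name is invented for illustration, not part of the Moses toolkit):

```python
# Sketch: parse one line of a Moses phrase table into its fields.
# The function name is made up; the field layout matches the excerpt above.
def parse_phrase_table_line(line):
    source, target, scores = [f.strip() for f in line.split("|||")]
    return source, target, [float(s) for s in scores.split()]

src, tgt, scores = parse_phrase_table_line(
    "frage ||| question ||| 0.75 0.555556 0.75 0.416667 2.718")
print(src, "->", tgt, scores)
```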
Take a look at the training data. Each word is not only represented by its surface form (as you would expect in raw text), but also with additional factors.
  % tail -n 1 factored-corpus/proj-syndicate.??

  ==> factored-corpus/proj-syndicate.de <==
  korruption|korruption|nn|nn.fem.cas.sg floriert|florieren|vvfin|vvfin .|.|per|per

  ==> factored-corpus/proj-syndicate.en <==
  corruption|corruption|nn flourishes|flourish|nns .|.|.
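The factored format itself is easy to process: the factors of a token are separated by | characters. A small illustrative sketch in plain Python, not part of the Moses toolkit:

```python
# Sketch: split factored tokens ("surface|lemma|pos|morph" on the German
# side) into their individual factors.
de_line = ("korruption|korruption|nn|nn.fem.cas.sg "
           "floriert|florieren|vvfin|vvfin .|.|per|per")
factors = [token.split("|") for token in de_line.split()]
surface = [f[0] for f in factors]
lemmas = [f[1] for f in factors]
print(surface)  # ['korruption', 'floriert', '.']
print(lemmas)   # ['korruption', 'florieren', '.']
```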
The German factors are the surface form, the lemma, the part of speech, and the morphological features. The English factors are the surface form, the lemma, and the part of speech.
Let us start simple and build a translation model that adds only the target part-of-speech factor on the output side:
  % train-model.perl \
     --root-dir pos \
     --corpus factored-corpus/proj-syndicate.1000 \
     --f de --e en \
     --lm 0:3:factored-corpus/surface.lm:0 \
     --lm 2:3:factored-corpus/pos.lm:0 \
     --translation-factors 0-0,2 \
     --external-bin-dir .../tools
Here, we specify with --translation-factors 0-0,2 that the input factor for the translation table is the (0) surface form, and the output factors are the (0) surface form and the (2) part of speech. This is reflected in the phrase table specification in the configuration file:
  [feature]
  PhraseDictionaryMemory ... input-factor=0 output-factor=0,2
The resulting phrase table looks very similar, but now also contains part-of-speech tags on the English side:
  ! ||| !|. ||| 1 1 1 1 2.718
  " ( ||| "|" (|( ||| 1 0.856401 1 0.779352 2.718
  " ) , ein neuer film ||| "|" a|dt new|jj film|nn ||| 1 0.00403191 1 0.128157 2.718
  " ) , ein neuer film über ||| "|" a|dt new|jj film|nn about|in ||| 1 0.000871765 1 0.0170876 2.718
  [...]
  frage ||| issue|nn ||| 0.25 0.285714 0.25 0.166667 2.718
  frage ||| question|nn ||| 0.75 0.625 0.75 0.416667 2.718
We also specified two language models. Besides the regular language model based on surface forms, we have a second language model that is trained on POS tags. In the configuration file this is indicated by two lines in the LM section:
  [feature]
  SRILM name=LM0 ...
  SRILM name=LM1 ...
Also, two language model weights are specified:
  [weight]
  LM0= 0.5
  LM1= 0.5
The part-of-speech language model encodes preferences such as that a determiner-adjective sequence is likely followed by a noun, and less likely by another determiner:
  -0.192859  dt jj nn
  -2.952967  dt jj dt
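Since language model entries like these are log10 probabilities, the difference between the two scores translates into an odds ratio. A quick back-of-the-envelope check:

```python
# The POS LM scores above are log10 probabilities, so subtracting them
# and exponentiating gives the odds ratio between the two trigrams.
logprob = {("dt", "jj", "nn"): -0.192859,
           ("dt", "jj", "dt"): -2.952967}
ratio = 10 ** (logprob[("dt", "jj", "nn")] - logprob[("dt", "jj", "dt")])
print(round(ratio))  # a noun is several hundred times more likely here
```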
This model can be used just like normal phrase-based models:
  % echo 'putin beschreibt menschen .' > in
  % moses -f pos/model/moses.ini < in
  [...]
  BEST TRANSLATION: putin|nnp describes|vbz people|nns .|. [total=-6.049]
  <<0.000, -4.000, 0.000, -29.403, -11.731, -0.589, -1.303, -0.379, -0.556, 4.000>>
  [...]
During the decoding process, not only the words (putin) but also their parts of speech (nnp) are generated.
Let us take a look at what happens if we input a German sentence that starts with the object:
  % echo 'menschen beschreibt putin .' > in
  % moses -f pos/model/moses.ini < in
  BEST TRANSLATION: people|nns describes|vbz putin|nnp .|. [total=-8.030]
  <<0.000, -4.000, 0.000, -31.289, -17.770, -0.589, -1.303, -0.379, -0.556, 4.000>>
Now, this is not a very good translation. The model's aversion to reordering trumps its ability to come up with a good translation. If we downweight the reordering model, we get a better translation:
  % moses -f pos/model/moses.ini < in -d 0.2
  BEST TRANSLATION: putin|nnp describes|vbz people|nns .|. [total=-7.649]
  <<-8.000, -4.000, 0.000, -29.403, -11.731, -0.589, -1.303, -0.379, -0.556, 4.000>>
Note that this better translation is mostly driven by the part-of-speech language model, which prefers the sequence
nnp vbz nns . (-11.731) over the sequence
nns vbz nnp . (-17.770). The surface form language model only shows a slight preference (-29.403 vs. -31.289). This is because these words have not been seen next to each other before, so the language model has very little to work with. The part-of-speech language model is aware of the grammatical number of the nouns involved and prefers a singular noun before a singular verb (nnp vbz) over a plural noun before a singular verb (nns vbz).
To drive this point home: the unfactored model is not able to find the right translation, even with a downweighted reordering model:
  % moses -f unfactored/model/moses.ini < in -d 0.2
  people describes putin . [total=-11.410]
  <<0.000, -4.000, 0.000, -31.289, -0.589, -1.303, -0.379, -0.556, 4.000>>
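The component scores in the two factored score vectors above can be compared directly. A small sketch, using the surface LM and POS LM values copied from the decoder output:

```python
# Comparing the two factored hypotheses via the LM component scores
# reported in their score vectors (values copied from the output above).
good = {"surface_lm": -29.403, "pos_lm": -11.731}  # putin describes people .
bad = {"surface_lm": -31.289, "pos_lm": -17.770}   # people describes putin .
surface_margin = good["surface_lm"] - bad["surface_lm"]
pos_margin = good["pos_lm"] - bad["pos_lm"]
print(f"surface LM margin: {surface_margin:.3f}")  # slight preference
print(f"POS LM margin: {pos_margin:.3f}")          # much stronger preference
```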
Let us now train a slightly different factored model with the same factors. Instead of mapping from the German input surface form directly to the English output surface form and part of speech, we now break this up into two mapping steps, one translation step that maps surface forms to surface forms, and a second step that generates the part of speech from the surface form on the output side:
  % train-model.perl \
     --root-dir pos-decomposed \
     --corpus factored-corpus/proj-syndicate.1000 \
     --f de --e en \
     --lm 0:3:factored-corpus/surface.lm:0 \
     --lm 2:3:factored-corpus/pos.lm:0 \
     --translation-factors 0-0 \
     --generation-factors 0-2 \
     --decoding-steps t0,g0 \
     --external-bin-dir .../tools
Now, the translation step is specified only between surface forms (--translation-factors 0-0), and a generation step is specified (--generation-factors 0-2), mapping the (0) surface form to the (2) part of speech. We also need to specify in which order the mapping steps are applied (--decoding-steps t0,g0).
Besides the phrase table that has the same format as the unfactored phrase table, we now also have a generation table. It is referenced in the configuration file:
  [feature]
  Generation ... input-factor=0 output-factor=2

  [weight]
  GenerationModel0= 0.3 0
Let us take a look at the generation table:
  % more pos-decomposed/model/generation.0-2
  nigerian nnp 1.0000000 0.0008163
  proven vbn 1.0000000 0.0021142
  issue nn 1.0000000 0.0021591
  [...]
  control vb 0.1666667 0.0014451
  control nn 0.8333333 0.0017992
  [...]
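The generation probabilities are simply relative frequencies over the target side of the training corpus. The following toy sketch (with invented counts of 5 and 1 for control, chosen so that the probabilities reproduce the table entries above) illustrates how p(pos|word) arises:

```python
from collections import Counter

# Sketch of how generation table entries arise: p(pos | word) is a
# relative frequency over the target side of the corpus. The counts
# here (5 x nn, 1 x vb for "control") are invented to match the table.
observations = [("control", "nn")] * 5 + [("control", "vb")]
word_totals = Counter(word for word, _ in observations)
p_pos_given_word = {(word, pos): count / word_totals[word]
                    for (word, pos), count in Counter(observations).items()}
print(p_pos_given_word)
```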
The beginning is not very interesting. Most words, such as nigerian, proven, and issue, occur with only one part of speech, e.g., p(nnp|nigerian) = 1.0000000. Some words, however, such as control, occur with multiple parts of speech, such as the base form verb (vb) and the singular noun (nn).
The table also contains the reverse probability p(nigerian|nnp) = 0.0008163. In our example, this may not be a very useful feature. It basically hurts open class words, especially unusual ones. If we do not want this feature, we can train the generation model as single-featured with the switch --generation-type single.
Translating surface forms seems to be a somewhat questionable pursuit. It does not seem to make much sense to treat different word forms of the same lemma, such as mensch and menschen, differently. In the worst case, we will have seen only one of the word forms, so we are not able to translate the other. This is in fact what happens in this example:
  % echo 'ein mensch beschreibt putin .' > in
  % moses.1430.srilm -f unfactored/model/moses.ini < in
  a mensch|UNK|UNK|UNK describes putin . [total=-158.818]
  <<0.000, -5.000, -100.000, -127.565, -1.350, -1.871, -0.301, -0.652, 4.000>>
Factored translation models allow us to create models that do morphological analysis and decomposition during the translation process. Let us now train such a model:
  % train-model.perl \
     --root-dir morphgen \
     --corpus factored-corpus/proj-syndicate.1000 \
     --f de --e en \
     --lm 0:3:factored-corpus/surface.lm:0 \
     --lm 2:3:factored-corpus/pos.lm:0 \
     --translation-factors 1-1+3-2 \
     --generation-factors 1-2+1,2-0 \
     --decoding-steps t0,g0,t1,g1 \
     --external-bin-dir .../tools
We have a total of four mapping steps:

- a translation step t0 that maps the (1) German lemma to the (1) English lemma,
- a generation step g0 that maps the (1) English lemma to its (2) part of speech,
- a translation step t1 that maps the (3) German morphology to the (2) English part of speech, and
- a generation step g1 that maps the (1) English lemma and (2) part of speech to the (0) English surface form.
This enables us now to translate the sentence above:
  % echo 'ein|ein|art|art.indef.z mensch|mensch|nn|nn.masc.nom.sg \
    beschreibte|beschreiben|vvfin|vvfin putin|putin|nn|nn.masc.cas.sg \
    .|.|per|per' > in
  % moses -f morphgen/model/moses.ini < in
  BEST TRANSLATION: a|a|dt individual|individual|nn describes|describe|vbz \
    putin|putin|nnp .|.|. [total=-17.269]
  <<0.000, -5.000, 0.000, -38.631, -13.357, -2.773, -21.024, 0.000, -1.386, \
    -1.796, -4.341, -3.189, -4.630, 4.999, -13.478, -14.079, -4.911, -5.774, 4.999>>
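The chain of four mapping steps can be illustrated as a toy pipeline of lookups. The dictionaries below are invented single-entry stand-ins for the real probabilistic phrase and generation tables; they only trace the flow of factors for one word:

```python
# Toy pipeline for the four mapping steps (t0, g0, t1, g1); the dicts
# are invented stand-ins for the probabilistic phrase/generation tables.
t0 = {"mensch": "individual"}              # t0: (1) lemma -> (1) lemma
g0 = {"individual": "nn"}                  # g0: (1) lemma -> (2) part of speech
t1 = {"nn.masc.nom.sg": "nn"}              # t1: (3) morphology -> (2) part of speech
g1 = {("individual", "nn"): "individual"}  # g1: lemma + pos -> (0) surface form

lemma, morph = "mensch", "nn.masc.nom.sg"
en_lemma = t0[lemma]
en_pos = g0[en_lemma]
assert en_pos == t1[morph]   # both paths must agree on the part of speech
surface = g1[(en_lemma, en_pos)]
print(surface)  # individual
```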
Note that this is only possible because we have seen an appropriate word form in the output language. The word individual occurs as a singular noun in the parallel corpus, as a translation of
einzelnen. To overcome this limitation, we may train generation models on large monolingual corpora, where we expect to see all possible word forms.
Decomposing translation into a process of morphological analysis and generation will make our translation model more robust. However, if we have seen a phrase of surface forms before, it may be better to take advantage of such rich evidence.
On its own, the model above translates sentences poorly, since it does not use the source surface form at all, relying instead on translating the properties of the surface forms.
In practice, we fare better when we allow both ways to translate in parallel. Such a model is trained by introducing decoding paths. In our example, one decoding path is the morphological analysis and generation described above; the other path is the direct mapping of surface forms to surface forms (and part-of-speech tags, since we are using a part-of-speech language model):
  % train-model.perl \
     --corpus factored-corpus/proj-syndicate.1000 \
     --root-dir morphgen-backoff \
     --f de --e en \
     --lm 0:3:factored-corpus/surface.lm:0 \
     --lm 2:3:factored-corpus/pos.lm:0 \
     --translation-factors 1-1+3-2+0-0,2 \
     --generation-factors 1-2+1,2-0 \
     --decoding-steps t0,g0,t1,g1:t2 \
     --external-bin-dir .../tools
This command is almost identical to the previous training run, except for the additional translation table 0-0,2 and its inclusion as an alternative decoding path (--decoding-steps t0,g0,t1,g1:t2).
A strategy for translating surface forms that have not been seen in the training corpus is to translate their lemma instead. This is especially useful for translation from morphologically rich languages into morphologically simpler ones, such as German-to-English translation:
  % train-model.perl \
     --corpus factored-corpus/proj-syndicate.1000 \
     --root-dir lemma-backoff \
     --f de --e en \
     --lm 0:3:factored-corpus/surface.lm:0 \
     --lm 2:3:factored-corpus/pos.lm:0 \
     --translation-factors 0-0,2+1-0,2 \
     --decoding-steps t0:t1 \
     --external-bin-dir .../tools
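The effect of the two decoding paths can be illustrated with a toy lookup: the surface form is translated directly when the surface table covers it, and the lemma path takes over otherwise. Note that in actual Moses decoding both paths contribute competing translation options that are scored against each other, rather than a hard backoff; the sketch below (with invented single-entry tables) only shows the coverage idea:

```python
# Toy illustration of the two decoding paths: translate the surface form
# when the surface table covers it, otherwise fall back to the lemma.
# (In real Moses decoding, both paths produce competing options.)
surface_table = {"menschen": "people"}  # invented one-entry tables
lemma_table = {"mensch": "person"}

def translate(surface, lemma):
    if surface in surface_table:
        return surface_table[surface]
    return lemma_table.get(lemma, surface)  # unknown words pass through

print(translate("menschen", "mensch"))  # seen surface form: people
print(translate("mensch", "mensch"))    # unseen surface form: person
```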