Moses is an implementation of a factored translation model. This means that each word is represented by a vector of factors, typically the surface form, the part-of-speech tag, and so on. It also means that the implementation is a bit more complicated than that of a non-factored translation model.
This section documents how factors, words, and phrases are implemented in Moses.
Factor implements the most basic unit of representing text in Moses. In essence it is a string.
Factors do not know their own type, that is, which component of the word vector they represent. When needed, the type is given by a FactorType, which is implemented as a size_t, i.e. an integer. What a factor really represents (be it a surface form or a part-of-speech tag) does not concern the decoder at all. All the decoder knows is that there are a number of factors that are referred to by their factor type, i.e. an integer index.
Since we do not want to store the same strings over and over again, the
FactorCollection contains all known factors. The class has one global instance, and it provides the essential functions to check whether a newly constructed factor already exists and to add a factor. Because each distinct factor is stored only once, two factors can be compared by comparing their pointers, which is cheaper than comparing strings. Think of the
FactorCollection as the global factor dictionary.
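The idea of a global factor dictionary can be illustrated with a minimal sketch. The names `SketchFactor`, `SketchFactorCollection`, `Instance`, `AddFactor`, and `Exists` are hypothetical and chosen for illustration only; they are not the actual Moses interface. The sketch assumes that a factor is nothing more than a string and that a hash set is an acceptable container.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>

// Hypothetical stand-in for a factor: in essence just a string.
using SketchFactor = std::string;

// Hypothetical stand-in for the global factor dictionary.
class SketchFactorCollection {
public:
  // Access to the single global instance.
  static SketchFactorCollection& Instance() {
    static SketchFactorCollection instance;
    return instance;
  }

  // Returns a pointer to the unique stored copy of this string,
  // adding it first if it is not yet known.
  const SketchFactor* AddFactor(const std::string& text) {
    // Pointers to elements of std::unordered_set stay valid across rehashing.
    return &*m_factors.insert(text).first;
  }

  // Checks whether a factor with this string is already stored.
  bool Exists(const std::string& text) const {
    return m_factors.count(text) > 0;
  }

private:
  SketchFactorCollection() {}
  std::unordered_set<SketchFactor> m_factors;
};

// Usage: the same string always yields the same pointer, so factors
// can be compared by pointer instead of by string content.
inline void FactorCollectionDemo() {
  SketchFactorCollection& fc = SketchFactorCollection::Instance();
  const SketchFactor* a = fc.AddFactor("house");
  const SketchFactor* b = fc.AddFactor("house");
  assert(a == b);
}
```

The design choice shown here is string interning: the cost of a string comparison is paid once, when the factor is added, and every later comparison is a pointer comparison.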
A word is, as we said, a vector of factors. The class
Word implements this. As a data structure, it is an array of pointers to factors. This requires the code to know the array size, which is set by the global
The Word class implements a number of functions for comparing and copying words, and for addressing individual factors.
Again, a word does not know how many factors it really has. So, for instance, when you want to print out a word with all its factors, you also need to provide the factor types that are valid within the word. See the function
Word::GetString for details.
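As a rough illustration of a word as an array of factor pointers, consider the following sketch. The class `SketchWord`, the constant `kMaxFactors`, and the signature of its `GetString` are assumptions made for this example and do not reproduce the actual Moses declarations. The `|` separator between factors is likewise an assumption of the sketch.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

typedef std::size_t FactorType;        // an integer index, as described above
const std::size_t kMaxFactors = 4;     // hypothetical fixed array size
using SketchFactor = std::string;      // hypothetical factor: just a string

class SketchWord {
public:
  SketchWord() { m_factors.fill(nullptr); }

  // Addressing of individual factors by their factor type.
  // at() throws std::out_of_range if type >= kMaxFactors.
  void SetFactor(FactorType type, const SketchFactor* factor) {
    m_factors.at(type) = factor;
  }
  const SketchFactor* GetFactor(FactorType type) const {
    return m_factors.at(type);
  }

  // The word does not know which of its slots are meaningful,
  // so the caller passes the factor types to print.
  std::string GetString(const std::vector<FactorType>& types) const {
    std::string out;
    for (std::size_t i = 0; i < types.size(); ++i) {
      if (i > 0) out += '|';
      const SketchFactor* f = GetFactor(types[i]);
      if (f != nullptr) out += *f;   // unset factors print as empty
    }
    return out;
  }

private:
  std::array<const SketchFactor*, kMaxFactors> m_factors;
};

// Usage: the same word printed with two factors and with one factor.
inline void WordDemo() {
  const SketchFactor surface = "house";
  const SketchFactor pos = "NN";
  SketchWord w;
  w.SetFactor(0, &surface);
  w.SetFactor(1, &pos);
  assert(w.GetString({0, 1}) == "house|NN");
  assert(w.GetString({0}) == "house");
}
```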
This is a good place to note that referring to words gets a bit more complicated. If more than one factor is used, it does not mean that all the words in the models have all the factors. Take again the example of a two-factored representation of words as surface form and part-of-speech. We may still use a simple surface word language model, so for that language model, a word only has one factor.
We expect the input to the decoder to have all factors specified, and during decoding the output will have all factors of all words set. The process need not be a straightforward mapping of the input word to the output word; it may be decomposed into several mapping steps that either translate input factors into output factors or generate additional output factors from existing output factors.
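The two kinds of mapping steps can be pictured with a toy sketch. It is deliberately simplified: it works on a single word with one candidate per lookup, whereas the decoder works on phrases and many options. The names `ToyWord`, `ToyTable`, `TranslateStep`, and `GenerateStep` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Toy word: a vector of factor strings, indexed by factor type.
// An empty string means the factor has not been set yet.
using ToyWord = std::vector<std::string>;
using ToyTable = std::map<std::string, std::string>;

// Translation step: fills one output factor from one input factor.
// at() throws std::out_of_range for unknown entries or bad indices.
inline void TranslateStep(const ToyWord& in, std::size_t inType,
                          ToyWord& out, std::size_t outType,
                          const ToyTable& table) {
  out.at(outType) = table.at(in.at(inType));
}

// Generation step: fills one output factor from another output factor.
inline void GenerateStep(ToyWord& out, std::size_t fromType,
                         std::size_t toType, const ToyTable& table) {
  out.at(toType) = table.at(out.at(fromType));
}

// Usage: translate the surface form, then generate the output
// part-of-speech tag from the output surface form.
inline ToyWord MappingDemo() {
  const ToyWord in = {"Haus", "NN"};
  ToyWord out(2);
  const ToyTable translation = {{"Haus", "house"}};
  const ToyTable generation = {{"house", "NN"}};
  TranslateStep(in, 0, out, 0, translation);
  GenerateStep(out, 0, 1, generation);
  return out;
}
```

After both steps, every factor of the output word is set, even though only the surface form was translated directly.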
At this point, keep in mind that a
Factor has a
FactorType and a
Word has a
vector<FactorType>, but these are not internally stored with the
Factor and the
Related to factor types is the class
FactorMask, which is a bit array indicating which factors are valid for a particular word.
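A bit array over factor types can be sketched with `std::bitset`. The alias `SketchFactorMask`, the constant `kMaxFactors`, and the helper `MakeMask` are hypothetical names for this illustration, not the Moses definitions.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <vector>

typedef std::size_t FactorType;       // an integer index
const std::size_t kMaxFactors = 4;    // hypothetical upper bound on factors

// One bit per factor type; a set bit marks the factor as valid.
using SketchFactorMask = std::bitset<kMaxFactors>;

// Builds a mask from a list of factor types.
// bitset::set throws std::out_of_range if a type is >= kMaxFactors.
inline SketchFactorMask MakeMask(const std::vector<FactorType>& types) {
  SketchFactorMask mask;
  for (FactorType t : types) {
    mask.set(t);
  }
  return mask;
}

// Usage: a word for which only factors 0 and 2 are valid.
inline void FactorMaskDemo() {
  const SketchFactorMask mask = MakeMask({0, 2});
  assert(mask.test(0));
  assert(!mask.test(1));
  assert(mask.test(2));
}
```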
Since decoding proceeds by translating input phrases into output phrases, many operations involve the
Since the total number of input and output factors is known to the decoder (it has to be specified in the configuration file
moses.ini), phrases are also a bit smarter about copying and comparing.
The Phrase class implements many useful functions, and two other classes are derived from it:
TargetPhrase may be somewhat misleadingly named, since it contains not only an output phrase but also a phrase translation score, a future cost estimate, a pointer to the source phrase, and potentially word alignment information.
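The extra information carried by a target phrase can be pictured as follows. This is only a sketch under simplifying assumptions: words are reduced to plain strings, and all class and member names (`SketchPhrase`, `SketchTargetPhrase`, `translationScore`, and so on) are invented for illustration rather than taken from the Moses source.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical phrase: a sequence of words (reduced to strings here).
class SketchPhrase {
public:
  void AddWord(const std::string& word) { m_words.push_back(word); }
  std::size_t GetSize() const { return m_words.size(); }
  const std::string& GetWord(std::size_t pos) const { return m_words.at(pos); }

private:
  std::vector<std::string> m_words;
};

// Hypothetical target phrase: an output phrase plus the additional
// information listed in the text.
class SketchTargetPhrase : public SketchPhrase {
public:
  float translationScore = 0.0f;               // phrase translation score
  float futureCostEstimate = 0.0f;             // future cost estimate
  const SketchPhrase* sourcePhrase = nullptr;  // pointer to source phrase
  // Optional word alignment as (source position, target position) pairs.
  std::vector<std::pair<std::size_t, std::size_t>> alignment;
};

// Usage: a two-word target phrase linked back to its source phrase.
inline void TargetPhraseDemo() {
  SketchPhrase source;
  source.AddWord("das");
  source.AddWord("Haus");

  SketchTargetPhrase target;
  target.AddWord("the");
  target.AddWord("house");
  target.sourcePhrase = &source;
  target.alignment = {{0, 0}, {1, 1}};

  assert(target.GetSize() == 2);
  assert(target.sourcePhrase->GetWord(1) == "Haus");
}
```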