Moses statistical machine translation system

Cache-based Models

Dynamic Cache-Based Phrase Table

A cache-based implementation of the phrase table is available; it can be updated on the fly, without reloading data or restarting the decoder. It is considered dynamic in two respects:

  • entries can be inserted and deleted at any time
  • scores can change over time.

From the perspective of Moses, the cache-based dynamic phrase table (CBPT) is simply another type of phrase table; hence, during the pre-fetching phase, Moses collects translation options from the CBPT as well as from any other phrase table.

Entries of the CBPT can be inserted and deleted by means of XML-based annotations read from the input. The CBPT can also be pre-populated by loading entries from a file when Moses starts up.

Each phrase pair in the CBPT is associated with an age, which reflects when it was inserted into the cache, and its score depends on this age according to a parametrizable scoring function. Depending on the CBPT settings, the age of all entries either increases by 1 whenever a new entry is inserted or stays fixed at its original value. Consequently, the corresponding scores either change or remain constant over time. See the section on ageing below for further comments.

In order to activate the CBPT feature, specify parameters and weight for the CBPT in the Moses config file.

 [feature]
 PhraseDictionaryDynamicCacheBased name=CBPT0 num-features=1 [feature-parameters]

 [weight]
 CBPT0= 1.0

Moreover, enable the facility to interpret xml-based tags

 [xml-input]
 inclusive

Finally, if you use the CBPT in addition to other phrase tables (one in this example), add an additional translation step

 [mapping]
 0 T 0
 1 T 1

Feature Parameters

CBPT exposes the following parameters:

  • name string -- Moses feature name
  • num-features int -- number of score components in phrase table [1, fixed value]
  • cbtm-name string -- internal CBPT name ["default", by default]
  • cbtm-file string -- file name of the entries to pre-populate the cache
  • cbtm-score-type -- scoring type ["0", by default]
  • cbtm-max-age -- maximum age of an entry ["1000", by default]
  • cbtm-constant -- flag to disable ageing of entries ["false", by default]

Moses can handle multiple CBPTs; each one is identified by an internal, parametrizable name that must be specified in the annotation string (see below). When using several CBPTs, give each a different internal name (cbtm-name). Note that cbtm-name is an internal parameter of each individual CBPT and is distinct from the value of the name parameter, which Moses uses at a higher level to distinguish features.
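
As an illustration of the distinction between name and cbtm-name, the following Python sketch assembles the [feature] line for one CBPT from the parameters listed above. The helper cbpt_feature_line and the values "CBPT0" and "news" are hypothetical and only serve the example; the key=value layout is assumed to follow the configuration line shown earlier.

```python
def cbpt_feature_line(name, cbtm_name, **params):
    """Build one [feature] line for a CBPT (illustrative helper, not part of Moses)."""
    parts = [
        "PhraseDictionaryDynamicCacheBased",
        f"name={name}",            # Moses-level feature name, also used in [weight]
        "num-features=1",          # fixed value for the CBPT
        f"cbtm-name={cbtm_name}",  # internal name, used as id in the annotations
    ]
    # Remaining parameters are passed as keyword arguments, e.g. cbtm_max_age=500.
    for key, value in params.items():
        parts.append(f"{key.replace('_', '-')}={value}")
    return " ".join(parts)


line = cbpt_feature_line("CBPT0", "news", cbtm_max_age=500)
```

With two CBPTs, each call would use a different name and a different cbtm_name.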

Ageing of the entries

Ageing, i.e. increasing the age of every entry after each new insertion, is useful in scenarios such as news translation and computer-assisted translation, where the domain, lexicon, and/or genre may change over time and older entries may no longer be valid. Note that an entry that becomes too old, i.e. older than a parametrizable threshold (see cbtm-max-age), is removed from the cache. On the other hand, constant, pre-defined ages (and hence scores) can be useful in scenarios such as the translation of product manuals, where human-approved lexicons are mostly required. Ageing is enabled by default and is controlled by the parameter cbtm-constant (false by default).
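
The ageing policy can be pictured with a small Python model. This is only a sketch of the behaviour described above, not the Moses implementation: the class AgeingCache is hypothetical, and it assumes that an entry is dropped as soon as its age exceeds the maximum age.

```python
class AgeingCache:
    """Toy model of the ageing policy (illustrative only)."""

    def __init__(self, max_age=1000, constant=False):
        self.max_age = max_age
        self.constant = constant  # True disables ageing, like cbtm-constant
        self.ages = {}            # entry -> age

    def insert(self, entries, age=1):
        """Insert a batch of entries that all receive the same age."""
        if not self.constant:
            # Each insertion makes every existing entry one step older ...
            aged = {entry: a + 1 for entry, a in self.ages.items()}
            # ... and entries older than the threshold are dropped (assumption: age > max_age).
            self.ages = {entry: a for entry, a in aged.items() if a <= self.max_age}
        for entry in entries:
            self.ages[entry] = age


cache = AgeingCache(max_age=2)
cache.insert([("face", "visage")])
cache.insert([("of supremacy", "de la domination")])
cache.insert([("supremacy", "la domination")])
# ("face", "visage") has reached age 3 and is no longer in cache.ages
```

With constant=True the ages never change, so the scores stay fixed as well.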

Scoring function

The score associated with an entry depends on its age x according to one of the pre-defined functions listed below. Scoring functions fall into two main classes:

  • penalties, which always give a non-positive score to the entry according to the policy "the less recent, the more penalized"; entries which are not present receive the lowest score (i.e. the highest penalty);
  • rewards, which always give a positive score if the entry is present in the cache, or 0 otherwise.

  index  score type                 function (match)                    function (no match)
  0      hyperbola-based penalty    x^(-1) - 1.0                        maxAge^(-1) - 1.0
  1      power-based penalty        x^(-1/4) - 1.0                      maxAge^(-1/4) - 1.0
  2      exponential-based penalty  exp(x^(-1))/exp(1.0) - 1.0          exp(maxAge^(-1))/exp(1.0) - 1.0
  3      cosine-based penalty       cos(3.14/2 * (x-1) / maxAge) - 1.0  cos(3.14/2 * (maxAge-1) / maxAge) - 1.0
  10     hyperbola-based reward     x^(-1)                              0.0
  11     power-based reward         x^(-1/4)                            0.0
  12     exponential-based reward   exp(x^(-1))/exp(1.0)                0.0


The indexes in the first column identify the scoring function, which is selected in the configuration file with the parameter cbtm-score-type.
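
To make the table concrete, here is a minimal Python sketch of the scoring functions. It follows the table and is not taken from the Moses source; the function name cache_score and its present flag are hypothetical, and math.pi is used in place of the 3.14 written in the table.

```python
import math


def cache_score(score_type, age, max_age=1000, present=True):
    """Score for an entry of the given age, following the table above."""
    if not present:
        if score_type >= 10:
            return 0.0   # rewards: a missing entry scores 0
        age = max_age    # penalties: a missing entry is scored at the maximum age
    if score_type in (0, 10):
        base = age ** -1
    elif score_type in (1, 11):
        base = age ** -0.25
    elif score_type in (2, 12):
        base = math.exp(age ** -1) / math.exp(1.0)
    elif score_type == 3:
        base = math.cos(math.pi / 2 * (age - 1) / max_age)
    else:
        raise ValueError(f"unknown score type: {score_type}")
    # Penalties (types 0 to 3) are shifted down by 1, so they are never positive.
    return base - 1.0 if score_type < 10 else base
```

For example, cache_score(0, 2) gives -0.5 and cache_score(10, 4) gives 0.25, so a newer entry (smaller age) always scores higher than an older one.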

Annotation

The content of the CBPT cache can be changed by feeding the decoder with XML-based annotations from stdin.

The annotation must contain the following fields:

  • type, which identifies the type of feature it refers to; the type of any CBPT is cbtm
  • id, which identifies the specific CBPT it refers to (in case of multiple CBPTs); its value is the internal name set in the Moses configuration file (cbtm-name), "myCBPT" in the following examples.

  <dlt type="cbtm" id="myCBPT" ....

Note that dlt stands for Document Level Translation, because the dynamic models were originally intended for that task; cbtm stands for Cache-Based Translation Model.

Several annotations can be provided on the same line; in this case, they are processed sequentially, left to right.

Inserting entries

With the following annotation, 3 entries are added simultaneously, i.e. they are all associated with the same age, 1. Quadruple vertical bars separate phrase pairs; triple vertical bars separate the source and target sides of a phrase pair.

  <dlt type="cbtm" id="myCBPT" cbtm="The crude face of supremacy ||| Le visage rustre de la domination |||| of supremacy ||| de la domination |||| face ||| visage"/>

Optionally, the word-to-word alignment between the source and target words of any phrase pair can be specified. In this case, the word alignment is placed after the source and target phrases, separated from them by triple vertical bars. A word alignment is a space-separated list of index pairs, each written as a source word index and a target word index joined by a dash; indexes start from 0. The previous example could become

  <dlt type="cbtm" id="myCBPT" cbtm="The crude face of supremacy ||| Le visage rustre de la domination |||| of supremacy ||| de la domination ||| 0-0 0-1 1-2 |||| face ||| visage"/>

In this case, the alignment is specified for only one phrase pair; more precisely, the tuple "of supremacy ||| de la domination ||| 0-0 0-1 1-2" means that the word "of" is aligned to "de la" and the word "supremacy" to "domination".
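
The reading of the alignment string can be checked with a few lines of Python. This is an illustrative sketch; the function aligned_words is hypothetical and plays no role in Moses itself.

```python
def aligned_words(source, target, alignment):
    """Return, for each aligned source word, the target words it is linked to."""
    src, trg = source.split(), target.split()
    links = {}
    for pair in alignment.split():
        # Each pair is "sourceIndex-targetIndex", with indexes starting from 0.
        s, t = (int(i) for i in pair.split("-"))
        links.setdefault(s, []).append(t)
    return [
        (src[s], " ".join(trg[t] for t in sorted(ts)))
        for s, ts in sorted(links.items())
    ]


pairs = aligned_words("of supremacy", "de la domination", "0-0 0-1 1-2")
# pairs == [("of", "de la"), ("supremacy", "domination")]
```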

With the following annotation, 3 entries are added sequentially, left to right. Hence, the leftmost insertion ("The crude face of supremacy ||| Le visage rustre de la domination") is the oldest and its phrase pair is associated with an age of 3, while the rightmost insertion ("face|||visage") is the newest and its phrase pair is associated with an age of 1.

  <dlt type="cbtm" id="myCBPT" cbtm="The crude face of supremacy ||| Le visage rustre de la domination"/><dlt cbtm="of supremacy ||| de la domination ||| 0-0 0-1 1-2"/>
<dlt cbtm="face|||visage"/>

Entries to be inserted can also be loaded from one or more files. Double vertical bars separate filenames. The file format is described below.

  <dlt type="cbtm" id="myCBPT" cbtm-file="filename-1 || filename-2"/>
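
When annotations are generated by a script, the separators described in this section can be applied mechanically. The following Python sketch builds an insertion annotation; the helper cbtm_insert_annotation is hypothetical, and it assumes the phrases contain neither double quotes nor vertical bars, since no escaping is performed.

```python
def cbtm_insert_annotation(cbpt_id, phrase_pairs):
    """Build a <dlt .../> tag that inserts all phrase pairs with the same age."""
    tuples = []
    for pair in phrase_pairs:
        # pair is (source, target) or (source, target, alignment);
        # its fields are joined by triple vertical bars.
        tuples.append(" ||| ".join(pair))
    # Phrase pairs are separated by quadruple vertical bars.
    body = " |||| ".join(tuples)
    return f'<dlt type="cbtm" id="{cbpt_id}" cbtm="{body}"/>'


tag = cbtm_insert_annotation(
    "myCBPT",
    [("of supremacy", "de la domination", "0-0 0-1 1-2"), ("face", "visage")],
)
```

The resulting tag is a single line, ready to be placed on the same input line as the sentence to translate.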

Deleting entries

With the following annotation, 3 entries are deleted simultaneously. Quadruple vertical bars separate phrase pairs; triple vertical bars separate the source and target sides of a phrase pair.

  <dlt type="cbtm" id="myCBPT" cbtm-clear-option="of supremacy ||| de la domination |||| The crude face ||| Le visage rustre |||| face ||| visage"/>

Similarly, the same 3 entries are deleted sequentially.

  <dlt type="cbtm" id="myCBPT" cbtm-clear-option="of supremacy ||| de la domination"/>
  <dlt type="cbtm" id="myCBPT" cbtm-clear-option="The crude face ||| Le visage rustre"/>
  <dlt type="cbtm" id="myCBPT" cbtm-clear-option="face ||| visage"/>

Note that the two examples above are equivalent, because deleting an entry from the CBPT has no effect on the remaining entries.

With the following annotation, all entries associated with the specified source phrases are deleted.

 <dlt cbtm-clear-source="The crude face |||| of supremacy"/>

or similarly

 <dlt cbtm-clear-source="The crude face"/><dlt cbtm-clear-source="of supremacy"/>

With either of the two annotations below, all entries in the CBPT are deleted

  <dlt cbtm-clear-all=""/>
  <dlt cbtm-command="clear"/>

Important: there is no way to recover the deleted entries.

File format

[Note that the file format was changed in July 2014.]

The CBPT can also be populated by loading entries from a file, either when Moses starts up or during decoding, using the dedicated annotation string. Each line must contain one field with the age (at the beginning), followed by a list of one or more tuples representing the phrase pairs to insert with that age. Each tuple must contain the source and the target phrase, and optionally their word-to-word alignment, in the format explained above (see Section "Inserting entries").

Age and tuples must be separated by quadruple vertical bars. Source phrase, target phrase and their alignment (if any) must be separated by triple vertical bars.

  age |||| src_phr ||| trg_phr ||| wa_align |||| src_phr ||| trg_phr ||| wa_align |||| ....

Here is an example:

  1 |||| The crude face ||| Le visage rustre ||| 0-0 1-1 2-2
  3 |||| supremacy ||| la domination
  2 |||| of supremacy ||| de la domination ||| 0-0 0-1 1-2 |||| crude face ||| visage rustre ||| 0-0 1-1
  ...

Note that the tuple "of supremacy ||| de la domination ||| 0-0 0-1 1-2" means that the word "of" is aligned to "de la" and the word "supremacy" to "domination".

If the same phrase pair appears more than once, the last age value read for it is used.
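
The file format can be read with a short parser. The Python sketch below is only an illustration of the layout described in this section; the function parse_cbpt_line is hypothetical and performs no validation beyond converting the age to an integer.

```python
def parse_cbpt_line(line):
    """Split one line of a CBPT file into (source, target, alignment, age) entries."""
    # Quadruple vertical bars separate the age from the tuples, and the tuples from each other.
    fields = [field.strip() for field in line.split("||||")]
    age = int(fields[0])
    entries = []
    for tup in fields[1:]:
        # Triple vertical bars separate source, target and the optional alignment.
        parts = [part.strip() for part in tup.split("|||")]
        source, target = parts[0], parts[1]
        alignment = parts[2] if len(parts) > 2 else None
        entries.append((source, target, alignment, age))
    return entries


entries = parse_cbpt_line(
    "2 |||| of supremacy ||| de la domination ||| 0-0 0-1 1-2 |||| crude face ||| visage rustre ||| 0-0 1-1"
)
```

Here entries holds two phrase pairs, both with age 2; a pair without alignment gets None in the third position.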

General notes

At present, the CBPT is the only phrase table implementation in Moses (alongside the CBLM feature) that can be modified on the fly by means of commands passed through the input channel.

Moses is already able to modify its behaviour at run time by means of the "xml-input" function. Phrase pairs and scores can be provided to the decoder and used as exclusive or additional translation options. Nevertheless, this approach has a few weaknesses:

  • the suggested options refer to a specific input span;
  • it is not possible to provide options for overlapping spans;
  • the suggested options are at disposal only for the current sentence;
  • it has no impact on the language model; hence, if any word within the suggested option is unknown, the language model still penalizes it.

Moses also includes an implementation of the phrase table based on a suffix-array data structure. The phrase table is not created in the training phase; instead, translation options are collected by sampling and scored on the fly at translation time, using an extremely efficient method for storing and searching the training corpus. Recently, the suffix-array phrase table has been enhanced so that new options can be dynamically added to the training corpus (see here for details). In this way, it can be exploited for incremental training. Nevertheless, this implementation has a few weaknesses:

  • as the suggested options are merged with the training corpus, it is not trivial to reward them over those already existing;
  • even assuming that the corpus can be extended quickly, the modification is permanent.

CBPT overcomes the drawbacks of the mentioned approaches. In particular,

  • the entries inserted in CBPT are available for the translation of the future sentences, but it is also possible to remove them at any time;
  • if the available suggested options refer to overlapping spans, the choice of the best alternative is made in the decoding phase by avoiding any potentially dangerous greedy decision;
  • thanks to the age-dependent scoring function, it is possible to reward specific translation options, with respect to others.

Dynamic Cache-Based Language Model

The cache-based dynamic language model (CBLM) is a feature that scores the target n-grams of the translation alternatives. This feature is based on caches and can be updated on the fly, without reloading data or restarting the decoder. It is considered dynamic in two respects:

  • entries can be inserted and deleted at any time
  • scores can change over time.

Although CBLM evokes the characteristics of a language model, CBLM is currently implemented as a stateless feature; indeed, it does not support the computation of scores for n-grams across different translation options. This implementation choice is mainly justified by an efficiency reason: the lookup in the dynamic language model is performed only once and only for the n-grams included in the pre-fetched translation options; if we admitted the lookup of all possible n-grams created at translation time, like for a standard LM feature, the computational cost could become unaffordable. In fact, the structure was not developed to achieve extreme speed performance.

The entries of CBLM consist of target n-grams of any length.

As with the CBPT, the entries of the CBLM can be inserted and deleted by means of XML-based annotations read from the input. The CBLM can also be pre-populated by loading entries from a file when Moses starts up.

Each n-gram in the CBLM is associated with an age, which reflects when it was inserted into the cache, and its score depends on this age according to a parametrizable scoring function. Depending on the CBLM settings, the age of all entries either increases by 1 whenever a new entry is inserted or stays fixed at its original value. Consequently, the corresponding scores either change or remain constant over time. See the section on CBPT ageing for further comments.

In order to activate the CBLM feature, specify its parameters and weight in the Moses config file.

 [feature]
 DynamicCacheBasedLanguageModel name=CBLM0 num-features=1 [feature-parameters]

 [weight]
 CBLM0= 1.0

Moreover, enable the facility to interpret xml-based tags

 [xml-input]
 inclusive

Feature Parameters

CBLM exposes the following parameters:

  • name string -- Moses feature name
  • num-features int -- number of score components in CBLM feature [1, fixed value]
  • cblm-name string -- internal CBLM name ["default", by default]
  • cblm-file string -- file name of the entries to pre-populate the cache
  • cblm-score-type -- scoring type ["0", by default]
  • cblm-query-type -- querying type ["0", by default]
  • cblm-max-age -- maximum age of an entry ["1000", by default]
  • cblm-constant -- flag to disable ageing of entries ["false", by default]

Moses can handle multiple CBLMs; each one is identified by an internal, parametrizable name that must be specified in the annotation string (see below). When using several CBLMs, give each a different internal name (cblm-name) as well as a different Moses feature name (name).

Ageing of the entries

As with the CBPT, entries of the CBLM are subject to ageing. Please refer to the section on CBPT ageing for details. Ageing of the CBLM entries is enabled by default and is controlled by the parameter cblm-constant (false by default).

Scoring function

The score associated with an n-gram depends on its age x according to the same scoring functions as for the CBPT. The type of scoring function is set with the parameter cblm-score-type.

Querying type

CBLM provides two modalities for computing the score of a target n-gram (w1, ..., wn) of age x. In the first modality (cblm-query-type=0), all its substrings of any length (wi, ..., wj) (1<=i<=j<=n) are searched in the cache, their scores are computed according to the chosen scoring function, and averaged according to the following formula:

 avg_score(w1, ..., wn) = 
1/n * ( score(w1) + score(w2) + ... + score(wn) )
1/(n-1) * ( score(w1, w2) + score(w2, w3) + ... + score(w_(n-1),wn) )
1/(n-2) * ( score(w1, w2, w3) + score(w2, w3, w4) + ... + score(w_(n-2), w_(n-1), wn) )
...
( score(w1, w2, ..., wn) )

The average score avg_score(w1, ..., wn) is then associated with the full n-gram. Note that the scores of the substrings of a given length are normalized by the number of substrings of that length.

In the second modality (cblm-query-type=1), the whole string is looked up in the cache, and its score is computed according to the chosen scoring function.

The querying type is selected by means of the parameter (cblm-query-type), whose default is 0.
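
The per-length normalization used by the first modality can be sketched in Python as follows. The function per_length_averages and the toy cache are hypothetical; the sketch returns one average per row of the formula above and makes no assumption about how Moses combines the rows into the final score.

```python
def per_length_averages(words, score):
    """Average the scores of all substrings of each length, from 1 to n."""
    n = len(words)
    averages = []
    for length in range(1, n + 1):
        # There are n - length + 1 substrings of this length.
        substrings = [" ".join(words[i:i + length]) for i in range(n - length + 1)]
        averages.append(sum(score(s) for s in substrings) / len(substrings))
    return averages


# Toy scores, as if a reward-type scoring function were in use (0.0 for missing entries).
toy_cache = {"de la": 1.0, "la": 0.5}
rows = per_length_averages(["de", "la", "domination"], lambda s: toy_cache.get(s, 0.0))
```

In this example the bigram row is (1.0 + 0.0) / 2 = 0.5, because only "de la" is found among the two bigrams. With cblm-query-type=1 only the single lookup of the whole string would be performed.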

Annotation

The content of the CBLM cache can be changed by feeding the decoder with XML-based annotations from stdin.

The annotation must contain the following fields:

  • type, which identifies the type of feature it refers to; the type of any CBLM is cblm
  • id, which identifies the specific CBLM it refers to (in case of multiple CBLMs); its value is the internal name set in the Moses configuration file (cblm-name), "myCBLM" in the following examples.

  <dlt type="cblm" id="myCBLM"

Note that dlt stands for Document Level Translation, because the dynamic models were originally intended for that task; cblm stands for Cache-Based Language Model.

Several annotations can be provided on the same line; in this case, they are processed sequentially, left to right.

Inserting entries

With the following annotation, 3 entries are added simultaneously, i.e. they are all associated with the same age, 1. Double vertical bars separate n-grams.

  <dlt type="cblm" id="myCBLM" cblm="Le visage rustre de la domination || de la domination || visage"/>

With the following annotation, 3 entries are added sequentially, left to right. Hence, the leftmost insertion ("Le visage rustre de la domination") is the oldest and its n-gram is associated with an age of 3, while the rightmost insertion ("visage") is the newest and its n-gram is associated with an age of 1.

  <dlt type="cblm" id="myCBLM" cblm="Le visage rustre de la domination"/>
<dlt cblm="de la domination"/> <dlt cblm="visage"/>

Entries to be inserted can also be loaded from one or more files, as follows. Double vertical bars separate filenames. The file format is described below.

  <dlt type="cblm" id="myCBLM" cblm-file="filename-1 || filename-2"/>

Deleting entries

With the following annotation, 3 entries are deleted simultaneously. Double vertical bars separate n-grams.

  <dlt type="cblm" id="myCBLM" cblm-clear-entry="de la domination || Le visage rustre || visage"/>

Similarly, the same 3 entries are deleted sequentially.

  <dlt type="cblm" id="myCBLM" cblm-clear-entry="de la domination"/>
  <dlt type="cblm" id="myCBLM" cblm-clear-entry="Le visage rustre"/>
  <dlt type="cblm" id="myCBLM" cblm-clear-entry="visage"/>

Note that the two examples above are equivalent, because deleting an entry from the CBLM has no effect on the remaining entries.

With either of the two annotations below, all entries in the CBLM are deleted

  <dlt cblm-clear-all=""/>
  <dlt cblm-command="clear"/>

Important: there is no way to recover the deleted entries.

File format

[Note that the file format was changed in July 2014.]

The CBLM can also be populated by loading entries from a file, either when Moses starts up or during decoding, using the dedicated annotation string. Each line must contain one field with the age (at the beginning), followed by one or more fields with the n-grams to insert with that age. Age and n-grams must be separated by double vertical bars.

  age || n-gram || n-gram || ...

Here is an example:

  1 || Le visage rustre
  3 || la domination
  2 || de la domination || visage rustre
  ... || ...

If the same n-gram appears more than once, the last age value read for it is used.
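
A reader for this format takes only a few lines of Python. The sketch below is illustrative: the functions parse_cblm_line and load_cblm are hypothetical, and the handling of repeated n-grams assumes that a later line simply overrides the age read earlier.

```python
def parse_cblm_line(line):
    """Split one line of a CBLM file into (n-gram, age) entries."""
    # Double vertical bars separate the age from the n-grams, and the n-grams from each other.
    fields = [field.strip() for field in line.split("||")]
    age = int(fields[0])
    return [(ngram, age) for ngram in fields[1:] if ngram]


def load_cblm(lines):
    """Collect the entries of several lines; a repeated n-gram keeps the last age read."""
    cache = {}
    for line in lines:
        for ngram, age in parse_cblm_line(line):
            cache[ngram] = age
    return cache


cblm = load_cblm([
    "1 || Le visage rustre",
    "3 || la domination",
    "2 || de la domination || visage rustre",
    "5 || la domination",
])
```

In this example "la domination" ends up with age 5, the value on the last line that mentions it.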

General notes

At present, CBLM (together with CBPT) is the only feature in Moses that can be modified on the fly by means of commands passed through the input channel. However, as mentioned above, CBLM is not actually a language model, because it does not compute scores for n-grams that span different translation options. Furthermore, the computed scores are not related to any probability distribution and can change over time.

Page last modified on May 11, 2015, at 08:54 AM