
Tree-Based Model Decoding

The chart decoder is a recursive variant of CKY+ parsing/decoding which is able to process arbitrary context-free grammars with no limitations on the number of terminals or non-terminals in a rule. During decoding, all contiguous spans over the input are filled with hypotheses (partial translations). Rules are stored in a prefix tree, which is processed incrementally, in a way akin to Earley parsing. Once rules are looked up, cube pruning is applied to pick off the most likely applicable rules and hypotheses from underlying spans.


Looping over the Spans


The main loop of the decoding process fills up the stacks bottom-up: first looping over the width of the span, and then over the starting position of the span.

  for (size_t width = 1; width <= size; ++width) {
    for (size_t startPos = 0; startPos <= size-width; ++startPos) {

For each span, first the applicable rules are created and then the rules are applied.

       m_transOptColl.CreateTranslationOptionsForRange(startPos, endPos);
       ChartCell &cell = m_hypoStackColl.Get(range);

Processing a span concludes with pruning, cleaning, and sorting of the hypotheses that were placed into the span.
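The bottom-up span enumeration can be sketched as a self-contained helper (a simplification; EnumerateSpans is a hypothetical function, not part of Moses, but it mirrors the loop shown above):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Enumerate all contiguous spans of an input of the given size, bottom-up:
// shorter spans first, so every sub-span of a span has already been
// processed by the time the span itself is reached.
std::vector<std::pair<std::size_t, std::size_t>> EnumerateSpans(std::size_t size) {
  std::vector<std::pair<std::size_t, std::size_t>> spans;
  for (std::size_t width = 1; width <= size; ++width) {
    for (std::size_t startPos = 0; startPos <= size - width; ++startPos) {
      std::size_t endPos = startPos + width - 1;
      spans.emplace_back(startPos, endPos);  // process chart cell [startPos, endPos]
    }
  }
  return spans;
}
```

For a three-word input this yields the three single-word spans, then the two two-word spans, then the full span, which is exactly the order in which the chart cells are filled.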


Consulting Rule Tables


First, the existing translation option list for the span is retrieved from the global rule collection:

 ChartTranslationOptionList &chartRuleColl = GetTranslationOptionList(startPos, endPos);

Multiple rule tables may be consulted. In fact, in most setups there will be a main rule table and a rule table for glue rules. So, we need to consult each of them.

 ChartRuleLookupManager &ruleLookupManager;
 ruleLookupManager.GetChartRuleCollection(wordsRange, true, chartRuleColl);

Looking up Applicable Rules


There are multiple implementations of the rule table. At the time of this writing, there are two: one that is kept entirely in memory and a second one that keeps the data on disk. There is an obvious RAM/speed trade-off here: the in-memory rule table is faster, but for some models there may not be sufficient RAM. Both are implemented very similarly, however.

For a given span, there may be many rules that could apply. Each rule may consume input words directly or build on any number of hypotheses that were generated for sub-spans. This creates a combinatorial explosion of ways in which sub-spans and input words could be combined, each of which is only relevant if a matching rule exists in the rule table.

The implementation of rule lookup is inspired by Earley parsing, which allows for incremental lookup. Consider the English-German rule:

 PP/NP -> of the JJ_1 problem of NP_2 ; des ADJ_1 Problems NP-GEN_2 

This is a good time to clarify some terminology:

  • PP is the source side parent non-terminal
  • NP is the target side parent non-terminal
  • JJ and NP are the source side child non-terminals
  • of, the, problem and of are the source side child terminals (or words)
  • ADJ and NP-GEN are the target side child non-terminals
  • des and Problems are the target side child terminals (or words)

Instead of child, the term right hand side, and correspondingly instead of parent, the term left hand side is often used.

To check if the rule matches, we have to see if there are indeed English words of, the, problem, and of in the input and the intervening spans have the label JJ and NP. In addition, the rule can only apply if we have hypotheses with the constituent labels ADJ and NP-GEN in the corresponding spans.

We store the rule in a prefix structure with the following nodes:

 of -> the -> JJ/ADJ -> problem -> of -> NP/NP-GEN -> des ADJ_1 Problems NP-GEN_2

The key insight is the following: If there is such an applicable rule for the span under consideration, then a sub-span with the same start position matched part of this prefix path, namely:

 of -> the -> JJ/ADJ -> problem -> of

If, by the time we build the prefix, we know all possible terminal or non-terminal extensions (such as NP/NP-GEN) that match the prefix and start at the end of the sub-span, we can recursively find all rule applications, visiting each prefix and sub-span exactly once. Note: it does not matter whether there is a translation rule associated with the prefix path; all that matters is that it matches the sub-span.
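The prefix-tree idea can be sketched with a minimal trie (the node layout and the names RuleTrieNode, AddRule, and Extend are illustrative, not the Moses data structures):

```cpp
#include <map>
#include <string>
#include <vector>

// Minimal sketch of a rule prefix tree, assuming each rule's source side is
// a sequence of symbols ("of", "the", "JJ/ADJ", ...).
struct RuleTrieNode {
  std::map<std::string, RuleTrieNode> children;  // extensions of this prefix
  std::vector<std::string> targetSides;          // rules ending at this prefix
};

void AddRule(RuleTrieNode &root, const std::vector<std::string> &source,
             const std::string &target) {
  RuleTrieNode *node = &root;
  for (const std::string &symbol : source)
    node = &node->children[symbol];              // extend the prefix path
  node->targetSides.push_back(target);           // attach the target side
}

// Incremental lookup: from the node reached for some sub-span prefix, try to
// extend by one more symbol. Returns nullptr when no rule shares the prefix.
const RuleTrieNode *Extend(const RuleTrieNode *node, const std::string &symbol) {
  auto it = node->children.find(symbol);
  return it == node->children.end() ? nullptr : &it->second;
}
```

A prefix path that dead-ends (Extend returns nullptr) can be abandoned for all longer spans with the same start position, which is what makes the incremental lookup efficient.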

We traverse the chart in a depth-first right-to-left order to ensure that all chart cells starting at the end position of the sub-span under consideration have been processed before (and that we can immediately find all applicable extensions of the prefix, such as NP/NP-GEN).

The code in ChartRuleLookupManagerMemory::GetChartRuleCollection implements the lookup that we just described. It is extensively commented, so it should be understandable without further explanation.

From the outside, each time the function is called for a span, it updates its internal data structures for processed rules and returns applicable rules for the span (with their target side) as a ChartTranslationOptionList.

Let us take a closer look at the data structures used in rule lookup in the function ChartRuleLookupManagerMemory::GetChartRuleCollection.


Recall that processed rules are looked up using a prefix tree, for example:

 of -> the -> JJ/ADJ -> problem -> of -> NP/NP-GEN -> des ADJ_1 Problems NP-GEN_2

To use the rule, we need to know how each of these nodes in this path matches the chart, in other words: how each node matches a span. This is encoded in a linked list of CoveredChartSpans.

Each CoveredChartSpan contains the start and end position of the span (stored in a WordsRange, which is essentially a pair of integers with additional utility functions), the source word or source span label that is matched (as a Word), and a back-pointer to the previous CoveredChartSpan (hence the linked list).

The linked list of CoveredChartSpans contains all the necessary information about applying the rule to the source side. Target side information is stored elsewhere. Note that target-side non-terminals are not stored anymore, since they will be contained in the target side information.

The core operations of the data structure are GetSourceWord, GetWordsRange, and GetPrevCoveredChartSpan. There is also the utility function IsNonTerminal which checks if the last word in the linked list is a non-terminal.
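A minimal sketch of this linked list, with WordsRange reduced to a plain pair of positions (the names follow the text, but this is a simplification, not the actual Moses classes):

```cpp
#include <cstddef>
#include <string>

// Reduced stand-in for Moses' WordsRange: just the two span boundaries.
struct WordsRange {
  std::size_t startPos, endPos;
};

// Sketch of CoveredChartSpan: one node in the backward-linked list that
// records how each symbol of a rule's source side matches the chart.
class CoveredChartSpan {
public:
  CoveredChartSpan(std::size_t start, std::size_t end, const std::string &word,
                   bool isNonTerm, const CoveredChartSpan *prev)
      : m_range{start, end}, m_word(word), m_isNonTerm(isNonTerm), m_prev(prev) {}

  const std::string &GetSourceWord() const { return m_word; }
  const WordsRange &GetWordsRange() const { return m_range; }
  const CoveredChartSpan *GetPrevCoveredChartSpan() const { return m_prev; }
  // In Moses this is decided from the matched Word; here it is a plain flag.
  bool IsNonTerminal() const { return m_isNonTerm; }

private:
  WordsRange m_range;
  std::string m_word;              // source word or source span label
  bool m_isNonTerm;
  const CoveredChartSpan *m_prev;  // back-pointer forming the linked list
};
```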

ChartTranslationOptionList and ChartTranslationOption

A ChartTranslationOptionList is a vector of ChartTranslationOptions, with additional utility functions.

Rules are added to the list in batches, since the prefix tree stores rules first by their matching source words and child non-terminals, with each prefix path pointing to a list of target sides.

The list contains fully fledged rules with target sides (as opposed to the cube pruning algorithm described by Chiang, where rules are grouped modulo the target side words). The list also observes rule limits, i.e., the maximum number of rules considered for a span. It sorts the rules by the future score of the target side (weighted translation model costs and language model estimates), and prunes out the worst ones when the limit is exceeded. Internally a score threshold is kept, so that rules that would be pruned out anyway are not even added.
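The limit-and-threshold behavior can be sketched as follows (a simplification with illustrative names; the real list stores ChartTranslationOptions with full feature scores, not bare numbers):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Option {
  double futureScore;  // higher is better
};

// Sketch of a rule-limited option list: options are collected with their
// future score; once the limit is enforced, a threshold rejects later
// additions that would be pruned anyway.
class OptionList {
public:
  explicit OptionList(std::size_t ruleLimit) : m_ruleLimit(ruleLimit) {}

  void Add(double futureScore) {
    if (futureScore < m_threshold) return;  // would be pruned out anyway
    m_options.push_back({futureScore});
  }

  // Sort by future score (best first) and enforce the rule limit.
  void ApplyThreshold() {
    std::sort(m_options.begin(), m_options.end(),
              [](const Option &a, const Option &b) {
                return a.futureScore > b.futureScore;
              });
    if (m_options.size() > m_ruleLimit) {
      m_options.resize(m_ruleLimit);
      m_threshold = m_options.back().futureScore;  // raise the bar
    }
  }

  std::size_t GetSize() const { return m_options.size(); }
  const Option &Get(std::size_t i) const { return m_options[i]; }

private:
  std::vector<Option> m_options;
  std::size_t m_ruleLimit;
  double m_threshold = -1e30;  // effectively no threshold at first
};
```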

Note that when adding a ChartTranslationOption, some additional processing ultimately has to be done --- the computation of the alignment between non-terminals by calling ChartTranslationOption::CreateNonTermIndex. This is done lazily, once the list is finalized and pruned.


A ChartTranslationOption contains all the information about a rule and its application to the span. It contains a linked list of CoveredChartSpans that details how the rule matches the input side, a TargetPhrase with the output, and the span it applies to (in a WordsRange).

This information has to be specified at instantiation and can be queried with the functions GetLastCoveredChartSpan, GetTargetPhrase, and GetSourceWordsRange.

Once created, the mapping between the source and target non-terminals is computed by calling CreateNonTermIndex and can be queried with GetCoveredChartSpanTargetOrder. That information is already stored with the TargetPhrase, but it is reformatted here for easier querying (at the cost of a higher memory footprint).


This class implements the prefix tree that contains the rules.



Applying the Rules: Cube Pruning

Above, we described how all the rules that apply to the current span are retrieved. In fact, more than that is done: we also annotate each rule with how it applies to the span, especially how the non-terminals match the sub-spans.

Applying a rule now only requires the selection of the hypotheses in the specified sub-spans that match the non-terminals in the rule.

To repeat our example:

 PP/NP -> of the JJ_1 problem of NP_2 ; des ADJ_1 Problems NP-GEN_2 

We have already identified which sub-spans the target non-terminals ADJ and NP-GEN apply to. For each of these, however, we may have multiple choices (note that there will be at least one for each, otherwise we would not consider the rule).

The term cube pruning derives from the fact that we explore, for each rule, a multi-dimensional space with one dimension for each non-terminal (and, in the original Chiang algorithm, but not here, one dimension for the target phrase). This space is only a cube if there are exactly three dimensions (two non-terminals in the Chiang algorithm), and even then it is not a proper cube, since the dimensions typically have differing lengths. Nor is it technically a pruning algorithm (which would remove bad entries after the fact), but a greedy search for the best rule applications. But hey, what's in a name?

Given the cube, we sort each dimension, so that the most promising rule application is in the top left front corner. Most promising, because for each of the non-terminals, we use the matching hypothesis with the best score (and the target phrase with the best future cost estimate).

Finally recall that there are multiple cubes, one for each applicable rule.


The cube pruning algorithm is given two data structures, a ChartTranslationOptionList that contains all applicable rules (they are now called ChartTranslationOption) and the ChartCellCollection that contains the chart as it has been filled in so far. The cube pruning algorithm is housed in the ChartCell that corresponds to the span we are now filling.

First, we have to build the RuleCubeQueue. Recall that each applicable rule has a cube. We throw them all together into one big cube (not really a regular geometric shape, since the dimensions differ for each rule application). Be that as it may, the first part of the algorithm loops through the ChartTranslationOptionList, creates a RuleCube for each option, and adds it to the queue.

The RuleCubeQueue keeps the RuleCubes sorted, so that we can always pop off the most promising rule application with its most promising underlying sub-span hypotheses. For a specified number of times (staticData.GetCubePruningPopLimit())

  • we pop off the most promising RuleCube (by calling ruleCubeQueue.Pop())
  • build a new ChartHypothesis and calculate its score (hypo->CalcScore())
  • add the hypothesis to the ChartCell
  • add the neighbors of the hypothesis to the RuleCubeQueue (ruleCube->CreateNeighbors(ruleCubeQueue))
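The loop above can be sketched for a single rule with two non-terminals (illustrative names and a plain score sum; the real queue mixes the cubes of all applicable rules and builds full hypotheses):

```cpp
#include <cstddef>
#include <queue>
#include <set>
#include <utility>
#include <vector>

// One candidate rule application: an index into each sorted child
// hypothesis list, plus its estimated score.
struct CubeItem {
  std::size_t i, j;
  double score;
  bool operator<(const CubeItem &o) const { return score < o.score; }  // max-heap
};

// Pop-limit loop: repeatedly take the best candidate, "build" it, and push
// its neighbors, never inserting the same cube position twice.
std::vector<double> CubePrune(const std::vector<double> &childA,
                              const std::vector<double> &childB,
                              std::size_t popLimit) {
  std::priority_queue<CubeItem> queue;
  std::set<std::pair<std::size_t, std::size_t>> seen;  // duplicate check
  queue.push({0, 0, childA[0] + childB[0]});           // top-left-front corner
  seen.insert({0, 0});

  std::vector<double> popped;
  while (!queue.empty() && popped.size() < popLimit) {
    CubeItem best = queue.top();
    queue.pop();
    popped.push_back(best.score);  // a real decoder builds a hypothesis here
    // Push the two neighbors, skipping any position already queued.
    if (best.i + 1 < childA.size() && seen.insert({best.i + 1, best.j}).second)
      queue.push({best.i + 1, best.j, childA[best.i + 1] + childB[best.j]});
    if (best.j + 1 < childB.size() && seen.insert({best.i, best.j + 1}).second)
      queue.push({best.i, best.j + 1, childA[best.i] + childB[best.j + 1]});
  }
  return popped;
}
```

The `seen` set implements the duplicate check discussed below: position (1,1,0) is reachable from both (1,0,0) and (0,1,0), but must be queued only once.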


A chart cell contains the hypotheses that were created for a span. These hypotheses are grouped by their target side non-terminal.


RuleCubeQueue is a priority queue of RuleCubes (candidate rule applications). Initially it contains the top-left-front rule application for each ChartTranslationOption. When these are expanded, their neighbors are added to the RuleCubeQueue.

Note that the same neighbor might be reached in multiple ways. If the rule applications (0,0,0), (1,0,0) and (0,1,0) are popped off, then the latter two point to (1,1,0). This is checked in RuleCubeQueue, which does not allow insertion of duplicates.


This class contains the cube for a rule. It contains information about the ChartTranslationOption and a list of underlying sub-span hypotheses for each non-terminal in the rule. The latter is represented as a vector of ChildEntrys, which are essentially ordered lists of hypotheses with additional utility functions.

When the RuleCube is created from a ChartTranslationOption, the vector of ChildEntrys is assembled from the information in the chart. Also, the estimated score of the top-left-front rule application is computed and stored. Note that this is an estimated score; it does not include the real language model cost.

A RuleCube always points to a particular rule application (i.e., particular sub-span hypotheses) in the cube. If it is picked to create a hypothesis, then its neighbors are added to the RuleCubeQueue --- this is implemented in the function CreateNeighbors. Consequently, for a particular ChartTranslationOption, there may be multiple RuleCubes in the RuleCubeQueue.

Hypotheses and Pruning

New hypotheses are built in the ChartCell::ProcessSentence function from a RuleCube.

 ChartHypothesis *hypo = new ChartHypothesis(*ruleCube, m_manager);

A ChartHypothesis contains various types of information about an entry in the chart, i.e., a translation of the covered span.

  • Book-keeping
    • size_t m_id hypothesis ID
    • Manager& m_manager reference to manager
    • WordsRange m_currSourceWordsRange covered span
    • ChartTranslationOption &m_transOpt rule that created it
    • vector<size_t> &m_coveredChartSpanTargetOrder covered sub-spans
  • Scores
    • ScoreComponentCollection m_scoreBreakdown all scores
    • ScoreComponentCollection m_lmNGram language model scores
    • ScoreComponentCollection m_lmPrefix estimated language model scores for prefix
    • float m_totalScore total weighted score
  • Information relevant for recombination and later use
    • Phrase m_contextPrefix first words (not yet fully LM-scored)
    • Phrase m_contextSuffix last words (affect later attached words)
    • size_t m_numTargetTerminals length of phrase (number of words)
  • Back-tracking
    • ChartHypothesis *m_winningHypo points to superior hypothesis if recombined away
    • ArcList *m_arcList all arcs that end at the same trellis point as this hypothesis
    • vector<const ChartHypothesis*> m_prevHypos underlying hypotheses

When a hypothesis is created, the book-keeping and information relevant for recombination and later use is set.

Scores are computed by the function CalcScore, by adding up the scores from the underlying hypotheses, the rule application, and language model scoring of the resulting phrase. Language model scoring (in function CalcLMScore) is a bit complex, since we do not want to re-compute any of the language model scores that we already computed for the underlying hypotheses. See the documented code for details.
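The core idea of not re-computing underlying scores can be illustrated with a toy bigram model (a deliberate simplification of CalcLMScore, which handles full n-gram contexts, prefixes, and the rule's own terminals; all names here are illustrative):

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Bigram = std::pair<std::string, std::string>;
using BigramLM = std::map<Bigram, double>;

// Score a phrase from scratch: sum the log-probabilities of all bigrams.
double ScorePhrase(const BigramLM &lm, const std::vector<std::string> &words) {
  double score = 0.0;
  for (std::size_t i = 1; i < words.size(); ++i)
    score += lm.at({words[i - 1], words[i]});
  return score;
}

struct ScoredHypo {
  std::vector<std::string> words;
  double lmScore;  // LM score already computed for this hypothesis
};

// Combine two scored hypotheses: the only new bigram is the one crossing the
// boundary, so the internal scores are reused instead of re-computed.
ScoredHypo Combine(const BigramLM &lm, const ScoredHypo &l, const ScoredHypo &r) {
  ScoredHypo out;
  out.words = l.words;
  out.words.insert(out.words.end(), r.words.begin(), r.words.end());
  out.lmScore = l.lmScore + r.lmScore + lm.at({l.words.back(), r.words.front()});
  return out;
}
```

With a trigram or higher-order model the same principle holds, except that a window of boundary words (the m_contextPrefix and m_contextSuffix mentioned above) must be kept instead of a single word.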

Hypothesis recombination is handled by ChartHypothesisCollection. The function Add is called to check if the hypothesis is recombinable with anything already in the collection (the class ChartHypothesisRecombinationOrderer handles state matching and calls ChartHypothesis::LMContextCompare). AddHypothesis calls Add, and handles the recombination by potentially replacing the existing hypothesis and setting arcs (ChartHypothesis::AddArc).
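The recombination logic can be sketched by reducing the matching state to the prefix/suffix pair (illustrative names; the real state comparison lives in ChartHypothesisRecombinationOrderer and ChartHypothesis::LMContextCompare):

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Sketch of a hypothesis with only recombination-relevant fields.
struct Hypo {
  std::string contextPrefix, contextSuffix;  // LM-relevant boundary context
  double score;
  std::vector<const Hypo *> arcs;            // hypotheses recombined into this one
};

// Hypotheses with identical context are interchangeable for all future
// scoring, so only the best is kept; the loser survives as an arc.
class HypoCollection {
public:
  // Returns the surviving hypothesis for hypo's recombination state.
  Hypo *AddHypothesis(Hypo *hypo) {
    auto key = std::make_pair(hypo->contextPrefix, hypo->contextSuffix);
    auto it = m_best.find(key);
    if (it == m_best.end()) {
      m_best[key] = hypo;                  // first hypothesis with this state
      return hypo;
    }
    if (hypo->score > it->second->score) {
      hypo->arcs.push_back(it->second);    // old winner becomes an arc
      it->second = hypo;
      return hypo;
    }
    it->second->arcs.push_back(hypo);      // hypo is recombined away
    return it->second;
  }

private:
  std::map<std::pair<std::string, std::string>, Hypo *> m_best;
};
```

Keeping the arcs (rather than discarding the losers) is what later makes n-best list extraction possible.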

Page last modified on September 29, 2014, at 10:29 AM