Moses statistical machine translation system

Sparse Features and the Moses Training Pipeline

This is a design discussion document about the best way to incorporate sparse features into the Moses experimental pipeline. Any design for sparse features may need to trade off performance (both decoding and training speed) against ease of implementation and experimentation.

Feature Implementation Points

New features can be added to the decoder by implementing the FeatureFunction interface and adding appropriate initialisation in Moses' God class (StaticData). They can also be added to the phrase table during scoring (in score.cpp), but this is currently only possible for phrase-based Moses.
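As a rough illustration of the pattern, the sketch below shows a stateless feature that emits sparse named scores for a phrase pair. Note this is a simplified, self-contained mock-up: the class and method names are invented for illustration, and the real Moses FeatureFunction interface has more methods and is wired up through StaticData.

```cpp
#include <map>
#include <string>
#include <vector>

// Simplified stand-in for a phrase pair; the real decoder passes richer objects.
struct PhrasePair {
    std::vector<std::string> source;
    std::vector<std::string> target;
};

// Illustrative base class mimicking the FeatureFunction pattern.
class FeatureFunction {
public:
    virtual ~FeatureFunction() {}
    // Score one phrase pair; sparse features write named scores into the map.
    virtual void Evaluate(const PhrasePair& pp,
                          std::map<std::string, float>& scores) const = 0;
};

// A toy stateless feature: fires one sparse indicator keyed on the
// source/target phrase lengths, analogous to a phrase-length feature.
class PhraseLengthFeature : public FeatureFunction {
public:
    void Evaluate(const PhrasePair& pp,
                  std::map<std::string, float>& scores) const override {
        std::string name = "pl_" + std::to_string(pp.source.size()) + "_" +
                           std::to_string(pp.target.size());
        scores[name] += 1.0f; // sparse: only this one indicator is emitted
    }
};
```

Because the feature depends only on the phrase pair itself, it could equally be pre-computed at scoring time rather than in the decoder, which is exactly the trade-off discussed below.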

Types of Features

The information required for a feature function dictates where and how it is implemented.

Stateful Feature Functions
Are those which cause extra state-splitting during decoding, e.g. language model features. They can only be added to the decoder and cannot be pre-calculated.
Stateless Feature Functions
Are those which don't cause extra state-splitting. Most features fall into this category.
Features which depend on the search graph
For example, features which depend on the coverage vector (for phrase-based) or span length (for chart-based) can only be calculated during decoding.
Features which depend on the source sentence
For example, Eva's topic-based word translation features. These must be calculated in the decoder, but can be pre-calculated when the translation options are loaded, since at that point the source sentence is available. This pre-calculation has been implemented, and seems to offer a significant speed-up.
Features which only depend on the phrase-pair / rule
For example word-translation and phrase-pair features. Currently these are all implemented in the decoder, but could be inserted during scoring (as in cdec and Joshua). Inserting them during scoring means that the phrase table has to be rebuilt if you want to vary the options used to build the feature (e.g. the vocabulary for a word-translation feature).
Features which depend on extraction
For example, features which indicate what corpora a phrase-pair was found in, or the contexts in which it was found. These features can only really be calculated at scoring time, unless the extra information required for them is carried through on the phrase table. We already add counts and (optionally) alignments to the phrase table, but we don't necessarily want to add any more information to the phrase table.
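The stateful/stateless distinction above matters mainly for hypothesis recombination: a stateful feature such as a language model introduces extra state that must match before two hypotheses can be merged. The sketch below illustrates this with invented names (it is not the Moses API), using a bigram LM whose state is just the last emitted word.

```cpp
#include <string>
#include <vector>

// Illustrative state exposed by a stateful feature (here, a bigram LM).
struct LMState {
    std::string lastWord; // for a bigram LM, the state is the last word emitted
    bool operator==(const LMState& o) const { return lastWord == o.lastWord; }
};

struct Hypothesis {
    std::vector<bool> coverage; // which source words are translated so far
    LMState lmState;            // extra state introduced by the LM feature
};

// With only stateless features, equal coverage would be enough to recombine;
// each stateful feature adds a further equality condition, splitting the
// search space into more distinct states.
bool Recombinable(const Hypothesis& a, const Hypothesis& b) {
    return a.coverage == b.coverage && a.lmState == b.lmState;
}
```

This is why stateful features can only live in the decoder: their contribution depends on search state that simply does not exist at scoring time.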

EMS Integration

What would be the ideal way of configuring extra features in EMS? Ideally, each one would be turned on with a single configuration, either in the EMS config file itself or in a separate configuration file. The advantage of having features configured in the EMS config file is that it makes it easier for EMS to know what to rebuild if the configuration changes.

Many of the feature functions depend on having certain other options on at other points in the training pipeline, or having other files created that they can use. For example some feature functions require alignments included in the phrase table, and some require a list of (say) the 50 most common source words. All the features require certain options on at tuning time to make sure that the sparse values are included in the n-best list, and they require an extra sparse-weights file. I think that in principle this could all be taken care of by EMS, but it can be a headache keeping track of which information is required by which feature.

The difficulty with EMS integration is that adding a feature may trigger several things to be added at several points, and EMS does not support this very well. Let's see what is required by each of the extra features:

domain features
  • Added at scoring time.
  • Require sentence id in extract
  • They may be sparse (in which case the word 'sparse' should be added to the phrase-table line) or dense (in which case the number of ttable scores changes).
  • They trigger an extra EMS step (building the domain table)
sparse lexical features
  • include word translation, source word deletion, target word insertion and phrase length
  • require alignments
  • EMS has build-sparse-lexical-features.perl, which handles building vocab files etc.
  • This generates additional-ini to be passed in to create-config
  • The word translation feature has many options which cannot be configured from EMS. The wt configuration is created in build-sparse-lexical-features.perl, so it would have to know about the wt configuration options.
target ngram features
  • No current support in EMS.
  • requires report-sparse-features in the ini file - could use -additional-ini
  • can use a vocabulary file, but no support for building one
phrase pair feature
  • Again, no support in EMS
  • Similar requirements to word translation feature
  • can use a vocab file (restricted) or 'domain source triggers'
  • Also requires report-sparse-features in the ini file

So that gives the following EMS extension points for adding extra features:

  1. Adding an extra step - this is always possible using experiment.meta and possibly a perl function in experiment.perl
  2. Adding arguments to extraction.
  3. Adding arguments to scoring.
  4. Changing the phrase-table config line (adding the keyword 'sparse')
  5. Changing the number of phrase-table features.
  6. Passing additional-ini to create-config
  7. Adding report-sparse-features to the ini file - always required for sparse features. For sparse features added in scoring, this is "stm".
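Concretely, points 4 and 7 both touch the moses.ini. A sketch of the affected fragments is shown below; the exact field layout of the ttable line varies between Moses versions, so treat the first line as illustrative, while the "stm" value is the one used for sparse features added at scoring time:

```ini
# point 4: the keyword 'sparse' added to the phrase-table line
[ttable-file]
0 0 0 5 sparse /path/to/phrase-table.gz

# point 7: sparse features produced at scoring time are reported as "stm"
[report-sparse-features]
stm
```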

In addition, we must ensure that the correct steps are rerun if the feature configuration is changed. This is automatic if the configuration is inserted through the standard EMS mechanisms.

A Proposal for EMS Integration

The idea is to make most of the work in adding a new feature function to EMS declarative. So there would be a new file for experiment.perl to process, called features.meta. It would have a section for each feature, specifying (optionally) what needs to be added for the feature at each extension point mentioned in the previous section.

In the config of experiment.perl, there would be an additional section (say, [EXTRA-FEATURES]), listing the features and their associated options. For example:

 [EXTRA-FEATURES]

 features = wordtranslation domain phrase-pair

 domain-type = subset
 domain-sparse = yes

 wordtranslation-factor = 1

In this case, there are three extra features added: word translation, domain and phrase pair. The domain feature receives the additional configuration {type=subset, sparse=yes}, and the word translation feature gets {factor=1}. The phrase pair feature gets the default configuration.

Why doesn't this work?

The problem is that a lot of the information is not really declarative. For example, the domain feature needs to construct arguments like --SparseDomainSubset. Also figuring out how many phrase features to add requires counting the domains. There's also the problem of ensuring that the correct steps get rerun when the feature configuration changes.

Ideally, what I'd like to be able to do is specify an "interface" which should be implemented for each new feature. However it's not clear to me what the Perl idiom is for this.

Decoder Configuration Files

There are really three types of information that get passed to the decoder at runtime:

  1. Information about the model (e.g. which feature functions, options for feature functions, locations of tables)
  2. Weights for the model
  3. Decoding options (e.g. stack size, verbosity, threads etc)

Actually the division into those three types is debatable, but it is somewhat useful. The first type of configuration is what you set when you're designing the model. The second is what gets set during discriminative training. And the third is things you might vary in a trained model.
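A hypothetical moses.ini fragment, grouped by these three types (the section names follow the standard phrase-based configuration; paths and values are placeholders), makes the split concrete:

```ini
# 1. Model: feature functions and table locations
[ttable-file]
0 0 0 5 /path/to/phrase-table.gz

# 2. Weights: what discriminative training adjusts
[weight-t]
0.2 0.2 0.2 0.2 0.2

# 3. Decoding options: what you might vary on a trained model
[stack]
100
[threads]
4
```

Seen this way, the case for putting all weights (core and sparse) in one separate file is that type 2 is exactly the slice of the configuration that tuning rewrites on every iteration.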

Distinguishing different types of configuration information is useful because they are generated and used differently, and maybe should be configured separately. In particular, should all the weights be stored in their own (separate) file? At the moment there's a distinction between core and sparse features in the way that they are configured and this makes handling the weights during tuning awkward. On the other hand (as Eva found) different types of weights do sometimes need to be treated differently.

Hieu has made some progress in moving towards a common weight file, in the mert-new branch, but this is now going to have to be merged with the sparse feature code. Moses used to support a weights file for core features, but it didn't work properly and got removed.

Managing Feature Functions at Runtime

This is done by the increasingly omnipotent StaticData object. Really, feature management (and weight management?) should be offloaded to another class. In fact TranslationSystem already contains pointers to all the feature functions, so maybe it could be co-opted (and renamed) for this purpose? Using extra feature functions interacts badly with the multiple-models functionality (which TranslationSystem was added to support), but then perhaps it's time to retire the multiple models feature? There's much less need for it now that we have KenLM and memory mapping.

Page last modified on November 15, 2012, at 10:44 AM