Sometimes we have external knowledge that we want to bring to the decoder. For instance, we might have a better translation system for translating numbers or dates. We would like to plug these translations into the decoder without changing the model.
The -xml-input flag activates this feature. It can take one of five values:
exclusive: Only the XML-specified translation is used for the input phrase. Any phrases from the phrase table that overlap with that span are ignored.

inclusive: The XML-specified translation competes with all the phrase table choices for that span.

constraint: The XML-specified translation competes with phrase table choices that contain the specified translation.

ignore: The XML-specified translation is ignored completely.

pass-through (default): For backwards compatibility, the XML data is fed straight through to the decoder. This will produce erroneous results if the decoder is fed data that contains XML markup.
The decoder has an XML markup scheme that allows the specification of translations for parts of the sentence. In its simplest form, we can tell the decoder what to use to translate certain words or phrases in the sentence:
% echo 'das ist <np translation="a cute place">ein kleines haus</np>' \
  | moses -xml-input exclusive -f moses.ini
this is a cute place

% echo 'das ist ein kleines <n translation="dwelling">haus</n>' \
  | moses -xml-input exclusive -f moses.ini
this is a little dwelling
The words have to be surrounded by tags, such as <np> and </np>. The name of the tags can be chosen freely. The target output is specified in the opening tag as the value of a parameter called translation (or english, for historical reasons: English is the canonical target language).
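Conceptually, the decoder first strips the markup from the input, remembering which source span each forced translation covers. A minimal Python sketch of that parsing step (this is an illustration, not the actual Moses implementation; the regex and data shapes are assumptions):

```python
import re

# Matches one tag pair, e.g. <np translation="a cute place">ein kleines haus</np>
TAG_RE = re.compile(r'<(\w+)\s+translation="([^"]*)"[^>]*>(.*?)</\1>')

def parse_xml_input(line):
    """Return the plain token sequence plus forced translations per token span."""
    tokens, options = [], []
    pos = 0
    for m in TAG_RE.finditer(line):
        tokens.extend(line[pos:m.start()].split())
        start = len(tokens)
        tokens.extend(m.group(3).split())                    # source words inside the tag
        options.append(((start, len(tokens)), m.group(2)))   # (source span, target phrase)
        pos = m.end()
    tokens.extend(line[pos:].split())
    return tokens, options

tokens, opts = parse_xml_input(
    'das ist <np translation="a cute place">ein kleines haus</np>')
print(tokens)   # ['das', 'ist', 'ein', 'kleines', 'haus']
print(opts)     # [((2, 5), 'a cute place')]
```

The decoder then treats each recorded span/translation pair as a translation option, alongside or instead of the phrase table entries, depending on the -xml-input mode.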
We can also provide a probability along with these translation choices. The parameter must be named prob and contain a single float value. If it is not present, an XML translation option is given a probability of 1.
% echo 'das ist ein kleines <n translation="dwelling" prob="0.8">haus</n>' \
  | moses -xml-input exclusive -f moses.ini
this is a little dwelling
This probability isn't very useful without letting other phrase table entries compete with the XML entry, so we switch to inclusive mode, which allows the decoder to use either translations from the model or the specified XML translation:
% echo 'das ist ein kleines <n translation="dwelling" prob="0.8">haus</n>' \
  | moses -xml-input inclusive -f moses.ini
this is a small house
-xml-input inclusive gives the decoder a choice between using the specified translations or its own. This choice, again, is ultimately made by the language model, which takes the sentence context into account.
This doesn't change the output from the non-XML sentence because the prob value is first logged, then split evenly among the number of scores present in the phrase table. Additionally, the toy model used here has a very weak language model and phrase table. Setting the probability value to something astronomical forces our option to be chosen:
% echo 'das ist ein kleines <n translation="dwelling" prob="0.8">haus</n>' \
  | moses -xml-input inclusive -f moses.ini
this is a little dwelling
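The scoring described above can be made concrete with a toy Python calculation (the four-score phrase table is an assumption about the model; the exact feature layout in a real model may differ):

```python
import math

def xml_option_scores(prob, num_scores):
    """Split log(prob) evenly among the phrase-table feature scores,
    as described for XML translation options."""
    return [math.log(prob) / num_scores] * num_scores

# With prob="0.8" and a 4-score phrase table, each feature gets log(0.8)/4,
# a tiny penalty that the model's own translation options can easily outweigh.
print(xml_option_scores(0.8, 4))

# An astronomically large prob yields strongly positive scores, which is why
# it forces the XML option to be chosen.
print(xml_option_scores(1e9, 4))
```

This also shows why prob="0.8" alone rarely changes the decoder's decision: after logging and splitting, it contributes almost nothing to the total score.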
Multiple translations can be specified, separated by two bars (||):
% echo 'das ist ein kleines <n translation="dwelling||house" prob="0.8||0.2">haus</n>' \
  | moses -xml-input inclusive -f moses.ini
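Splitting the ||-separated attributes into individual translation options can be sketched in a few lines of Python (an illustration of the format, not the Moses parser itself):

```python
def parse_multi(translation_attr, prob_attr=None):
    """Split ||-separated translations (and optional probabilities) into options."""
    translations = translation_attr.split("||")
    if prob_attr is None:
        probs = [1.0] * len(translations)      # default probability is 1
    else:
        probs = [float(p) for p in prob_attr.split("||")]
    assert len(probs) == len(translations), "one prob per translation expected"
    return list(zip(translations, probs))

print(parse_multi("dwelling||house", "0.8||0.2"))
# [('dwelling', 0.8), ('house', 0.2)]
```

Each resulting (translation, probability) pair then competes as its own option for the tagged span.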
The XML-input implementation is NOT currently compatible with factored models or confusion networks.
For various reasons, it may be useful to specify reordering constraints to the decoder, for instance because of punctuation. Consider the sentence:
I said " This is a good idea . " , and pursued the plan .
The quoted material should be translated as a block, meaning that once we start translating some of the quoted words, we need to finish all of them. We call such a block a zone and allow the specification of such constraints using XML markup.
I said <zone> " This is a good idea . " </zone> , and pursued the plan .
Another type of constraint is the wall, a hard reordering constraint: all words before a wall have to be translated before any words after it are translated. For instance:
This is the first part . <wall /> This is the second part .
Walls may be specified within zones, where they act as local walls, i.e. they are only valid within the zone.
I said <zone> " <wall /> This is a good idea . <wall /> " </zone> , and pursued the plan .
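One way to picture what walls and zones do is to take the order in which source positions are translated and check it against the constraints. A simplified Python checker (the representation of constraints as index positions is an assumption for illustration, not Moses's internal data structure):

```python
def respects_wall(order, w):
    """All source positions < w must be translated before any position >= w.
    `order` lists source token indices in the order they are translated."""
    seen_after = False
    for i in order:
        if i >= w:
            seen_after = True
        elif seen_after:           # a pre-wall position translated too late
            return False
    return True

def respects_zone(order, start, end):
    """Once translation enters the zone [start, end), it must finish the
    whole zone before translating anything outside it."""
    inside = [start <= i < end for i in order]
    first = inside.index(True)
    # the zone's positions must form one contiguous block in the order
    return all(inside[first:first + (end - start)])

print(respects_wall([0, 1, 2, 3], 2))    # True: left of wall finished first
print(respects_wall([0, 2, 1, 3], 2))    # False: jumped the wall
print(respects_zone([0, 3, 1, 2, 4], 1, 4))   # True: zone done as a block
print(respects_zone([0, 3, 4, 1, 2], 1, 4))   # False: left the zone early
```

A local wall inside a zone would apply the same wall check, but only to the positions within that zone.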
If you add such markup to the input, you need to use the option -xml-input with either exclusive or inclusive (there is no difference between these options in this context).
Specifying reordering constraints around punctuation is often a good idea, so the decoder offers the shorthand switch -monotone-at-punctuation (-mp), which introduces walls around punctuation tokens.
To use this extraction method in the decoder, add this to the moses.ini configuration file:
[feature]
PhraseDictionaryFuzzyMatch source=<source/path> target=<target/path> alignment=<alignment/path>
It has not yet been integrated into the EMS.
Note: the translation rules generated by this algorithm are intended to be used in the chart decoder. They cannot be used in the phrase-based decoder.
Placeholders are symbols that replace a word or phrase. For example, numbers ('42.85') can be replaced with the symbol '@num@'. Other words and phrases that could be replaced with placeholders include dates, times, and named entities.
This is good in training since the sparse numbers are replaced with more numerous placeholder symbols, producing more reliable statistics for the MT models.
The same reasoning applies during decoding: a raw number is often an unknown symbol to the phrase table and language models. Unknown symbols are translated as single words, losing the advantage of phrasal translation, and their reordering can also be unreliable since we have no statistics for them.
However, two issues arise when using placeholders:

1. The original word or phrase must still be translated. In the example, '42.85' should be translated; if the language pair is en-fr, it may be translated as '42,85'.
2. How do we insert this translation into the output if the word has been replaced with the placeholder?
Moses has support for placeholders in training and decoding.
When preparing your data, process the data with the script scripts/generic/ph_numbers.perl.
The script is designed to run after tokenization; that is, instead of tokenizing like this:

cat [RAW-DATA] | ./scripts/tokenizer/tokenizer.perl -a -l en > TOK-DATA

do this:

cat [RAW-DATA] | ./scripts/tokenizer/tokenizer.perl -a -l en | scripts/generic/ph_numbers.perl -c > TOK-DATA
Do this for both source and target language, for parallel and monolingual data.
The script will replace numbers with the symbol @num@.
NB: this script is currently very simple and language-independent. It could be improved to create better translations.
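To make the two modes of the script concrete, here is a rough Python sketch of what ph_numbers.perl does, under the assumption that it simply regex-matches numeric tokens (the real Perl script may recognize more or fewer patterns):

```python
import re

# Assumption: integers and simple decimals with '.' or ',' as separator.
NUM_RE = re.compile(r'^\d+([.,]\d+)?$')

def replace_numbers(tokens, corpus_mode):
    """corpus_mode=True mimics the -c flag: plain @num@ symbols for training data.
    corpus_mode=False wraps the placeholder in XML that retains the original."""
    out = []
    for tok in tokens:
        if not NUM_RE.match(tok):
            out.append(tok)
        elif corpus_mode:
            out.append('@num@')
        else:
            out.append('<ne translation="@num@" entity="%s">@num@</ne>' % tok)
    return ' '.join(out)

print(replace_numbers('you owe me $ 100 .'.split(), corpus_mode=True))
# you owe me $ @num@ .
print(replace_numbers('you owe me $ 100 .'.split(), corpus_mode=False))
# you owe me $ <ne translation="@num@" entity="100">@num@</ne> .
```

Training data uses the first form; decoder input uses the second, so the original number survives as the entity attribute.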
During extraction, add the following to the extract command (phrase-based only for now):
./extract --Placeholders @num@ ....
This will discard any extracted translation rules that are inconsistent with the placeholders; that is, every placeholder must be aligned 1-to-1 with a placeholder in the other language.
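The 1-to-1 consistency check can be sketched as follows; the data shapes (token lists plus word-alignment index pairs) are assumptions for illustration, not the extract tool's internal representation:

```python
def placeholders_consistent(src, tgt, alignment, ph='@num@'):
    """Every placeholder must align 1-to-1 with a placeholder on the other side."""
    for s, t in alignment:
        if (src[s] == ph) != (tgt[t] == ph):
            return False                 # placeholder aligned to a normal word
    # each placeholder needs exactly one alignment link
    for i, w in enumerate(src):
        if w == ph and sum(1 for s, _ in alignment if s == i) != 1:
            return False
    for j, w in enumerate(tgt):
        if w == ph and sum(1 for _, t in alignment if t == j) != 1:
            return False
    return True

print(placeholders_consistent(
    ['@num@', 'euro'], ['@num@', 'euros'], [(0, 0), (1, 1)]))   # True
print(placeholders_consistent(
    ['@num@', 'euro'], ['many', 'euros'], [(0, 0), (1, 1)]))    # False
```

Rules failing this check are dropped, so only phrase pairs where placeholders map cleanly across languages survive into the model.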
The input sentence must also be processed with the placeholder script to replace numbers with the placeholder symbol. However, don't add the -c argument, so that the original number is retained in the output as an XML entry. For example:
$ echo "you owe me $ 100 ." | ./ph_numbers.perl
you owe me $ <ne translation="@num@" entity="100">@num@</ne> .
Add this to the decoder command when executing the decoder (phrase-based only for now):
./moses -placeholder-factor 1 -xml-input exclusive
The factor must NOT be one which is being used by the source side of the translation model. For vanilla models, only factor 0 is used.
The argument -xml-input can be any permitted value, except 'pass-through'.
The output from the decoder will contain the number, not the placeholder. This is the case in both the best output and the n-best list.
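Conceptually, the decoder carries the entity attribute through and substitutes it back when writing output. A simplified Python sketch of that substitution (an assumed simplification that ignores reordering of multiple placeholders, not the actual Moses code path):

```python
def restore_placeholders(output_tokens, entities):
    """Replace each @num@ in the decoder output with the next original value.
    `entities` holds the original numbers in source order."""
    values = iter(entities)
    return ' '.join(next(values) if t == '@num@' else t
                    for t in output_tokens)

print(restore_placeholders('you owe me $ @num@ .'.split(), ['100']))
# you owe me $ 100 .
```

This is why the final best output and n-best list show '100' rather than '@num@'.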
The above changes can be added to the EMS config file.
For my (Hieu) experiment, these are the changes I made:
1. In the [GENERAL] section, change

input-tokenizer = "$misc-script-dir/normalize-punctuation.perl $input-extension | $moses-script-dir/tokenizer/tokenizer.perl -a -l $input-extension"

to

input-tokenizer = "$misc-script-dir/normalize-punctuation.perl $input-extension | $moses-script-dir/tokenizer/tokenizer.perl -a -l $input-extension | $moses-script-dir/generic/ph_numbers.perl -c"

and change

output-tokenizer = "$misc-script-dir/normalize-punctuation.perl $output-extension | $moses-script-dir/tokenizer/tokenizer.perl -a -l $output-extension"

to

output-tokenizer = "$misc-script-dir/normalize-punctuation.perl $output-extension | $moses-script-dir/tokenizer/tokenizer.perl -a -l $output-extension | $moses-script-dir/generic/ph_numbers.perl -c"

2. In the [TRAINING] section, add

extract-settings = "--Placeholders @num@"

3. In the [TUNING] section, change

decoder-settings = "-threads 8"

to

decoder-settings = "-threads 8 -placeholder-factor 1 -xml-input exclusive"

and in the [EVALUATION] section, change

decoder-settings = "-mbr -mp -search-algorithm 1 -cube-pruning-pop-limit 5000 -s 5000 -threads 8"

to

decoder-settings = "-mbr -mp -search-algorithm 1 -cube-pruning-pop-limit 5000 -s 5000 -threads 8 -placeholder-factor 1 -xml-input exclusive"

4. In the [EVALUATION] section, add

input-tokenizer = "$misc-script-dir/normalize-punctuation.perl $input-extension | $moses-script-dir/tokenizer/tokenizer.perl -a -l $input-extension | $moses-script-dir/generic/ph_numbers.perl"
output-tokenizer = "$misc-script-dir/normalize-punctuation.perl $output-extension | $moses-script-dir/tokenizer/tokenizer.perl -a -l $output-extension"
This was tested in some experiments trained with Europarl data. It didn't have a positive effect on the BLEU score, even reducing it slightly.
However, it may still be helpful to users who translate text with lots of numbers or dates etc. Also, the recognizer script could be improved.
en-es: baseline 24.59, with placeholder 24.68
es-en: baseline 23.00, with placeholder 22.84
en-cs: baseline 11.05, with placeholder 10.62
cs-en: baseline 15.80, with placeholder 15.62