--root-dir
-- root directory, where output files are stored
--corpus
-- corpus file name (full pathname), excluding extension
--e
-- extension of the English corpus file
--f
-- extension of the foreign corpus file
--lm
-- language model: <factor>:<order>:<filename> (option can be repeated)
--first-step
-- first step in the training process (default 1)
--last-step
-- last step in the training process (default 7)
--parts
-- break up the corpus into smaller parts before GIZA++ training
--corpus-dir
-- corpus directory (default $ROOT/corpus)
--lexical-dir
-- lexical translation probability directory (default $ROOT/model)
--model-dir
-- model directory (default $ROOT/model)
--extract-file
-- extraction file (default $ROOT/model/extract)
--giza-f2e
-- GIZA++ directory (default $ROOT/giza.$F-$E)
--giza-e2f
-- inverse GIZA++ directory (default $ROOT/giza.$E-$F)
--alignment
-- heuristic used for word alignment: intersect, union, grow, grow-final, grow-diag, grow-diag-final (default), grow-diag-final-and, srctotgt, tgttosrc
--max-phrase-length
-- maximum length of phrases entered into phrase table (default 7)
--giza-option
-- additional options for GIZA++ training
--verbose
-- prints additional word alignment information
--no-lexical-weighting
-- only use conditional probabilities for the phrase table, not lexical weighting
--parts
-- prepare data for GIZA++ by running snt2cooc in parts
--direction
-- run training step 2 only in direction 1 or 2 (for parallelization)
--reordering
-- specifies which reordering models to train using a comma-separated list of config-strings, see FactoredTraining.BuildReorderingModel. (default distance)
--reordering-smooth
-- specifies the smoothing constant to be used for training lexicalized reordering models. If the letter "u" follows the constant, smoothing is based on actual counts. (default 0.5)
--alignment-factors
--
--translation-factors
--
--reordering-factors
--
--generation-factors
--
--decoding-steps
--
A number of parameters are required to point the training script to the correct training data. We will describe them in this section. Other options allow for partial training runs and alternative settings.
As mentioned before, you want to create a special directory for
training.
The path to that directory has to be specified with the parameter --root-dir.
The root directory has to contain a subdirectory (called corpus)
that contains the training data. The training data is a parallel
corpus, stored in two files, one for the English
sentences, one for the foreign sentences. The corpus has to be
sentence-aligned, meaning that the 1624th line in the English file
is the translation of the 1624th line in the foreign file.
Typically, the data is lowercased, no empty lines are allowed, and having multiple spaces between words may cause problems. Also, sentence length is limited to 100 words per sentence. The sentence length ratio for a sentence pair can be at most 9 (i.e., a 10-word sentence aligned to a 1-word sentence is disallowed). These restrictions on sentence length are caused by GIZA++ and may be changed (see below).
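If you want to verify these requirements before training, a couple of quick sanity checks along these lines may help (the file names follow the euro.de/euro.en example introduced below):
wc -l corpus/euro.de corpus/euro.en   # line counts must match for sentence alignment
grep -c '^$' corpus/euro.de           # number of empty lines, should be 0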
The two corpus files have a common file stem (say, euro)
and extensions indicating the language (say, en and de).
The file stem (--corpus) and the language extensions
(--e and --f) have to be specified to the training script.
In summary, the training script may be invoked as follows:
train-model.perl --root-dir . --f de --e en --corpus corpus/euro >& LOG
After training, typically the following files can be found in the root directory (note the time stamps, which tell you how much time was spent on each step for this data):
> ls -lh *
-rw-rw-r-- 1 koehn user 110K Jul 13 21:49 LOG

corpus:
total 399M
-rw-rw-r-- 1 koehn user 104M Jul 12 19:58 de-en-int-train.snt
-rw-rw-r-- 1 koehn user 4.2M Jul 12 19:56 de.vcb
-rw-rw-r-- 1 koehn user 3.2M Jul 12 19:42 de.vcb.classes
-rw-rw-r-- 1 koehn user 2.6M Jul 12 19:42 de.vcb.classes.cats
-rw-rw-r-- 1 koehn user 104M Jul 12 19:59 en-de-int-train.snt
-rw-rw-r-- 1 koehn user 1.1M Jul 12 19:56 en.vcb
-rw-rw-r-- 1 koehn user 793K Jul 12 19:56 en.vcb.classes
-rw-rw-r-- 1 koehn user 614K Jul 12 19:56 en.vcb.classes.cats
-rw-rw-r-- 1 koehn user 94M Jul 12 18:08 euro.de
-rw-rw-r-- 1 koehn user 84M Jul 12 18:08 euro.en

giza.de-en:
total 422M
-rw-rw-r-- 1 koehn user 107M Jul 13 03:57 de-en.A3.final.gz
-rw-rw-r-- 1 koehn user 314M Jul 12 20:11 de-en.cooc
-rw-rw-r-- 1 koehn user 2.0K Jul 12 20:11 de-en.gizacfg

giza.en-de:
total 421M
-rw-rw-r-- 1 koehn user 107M Jul 13 11:03 en-de.A3.final.gz
-rw-rw-r-- 1 koehn user 313M Jul 13 04:07 en-de.cooc
-rw-rw-r-- 1 koehn user 2.0K Jul 13 04:07 en-de.gizacfg

model:
total 2.1G
-rw-rw-r-- 1 koehn user 94M Jul 13 19:59 aligned.de
-rw-rw-r-- 1 koehn user 84M Jul 13 19:59 aligned.en
-rw-rw-r-- 1 koehn user 90M Jul 13 19:59 aligned.grow-diag-final
-rw-rw-r-- 1 koehn user 214M Jul 13 20:33 extract.gz
-rw-rw-r-- 1 koehn user 212M Jul 13 20:35 extract.inv.gz
-rw-rw-r-- 1 koehn user 78M Jul 13 20:23 lex.f2n
-rw-rw-r-- 1 koehn user 78M Jul 13 20:23 lex.n2f
-rw-rw-r-- 1 koehn user 862 Jul 13 21:49 pharaoh.ini
-rw-rw-r-- 1 koehn user 1.2G Jul 13 21:49 phrase-table
Summary
--root-dir
-- root directory, where output files are stored
--corpus
-- corpus, expected in $ROOT/corpus
--e
-- extension of the English corpus file
--f
-- extension of the foreign corpus file
--lm
-- language model file
More on factored translation models in the Overview.
Summary
--alignment-factors
--
--translation-factors
--
--reordering-factors
--
--generation-factors
--
--decoding-steps
--
More on lexicalized reordering in the description of Training step 7: build reordering model.
Summary
--reordering
-- specifies which reordering models to train (default distance)
--reordering-smooth
-- smoothing constant for lexicalized reordering models (default 0.5)
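For instance, a run that trains a lexicalized reordering model in addition to the default distance-based one might look like this (msd-bidirectional-fe is one commonly used config-string; treat the exact choice here as illustrative):
train-model.perl [...] --reordering distance,msd-bidirectional-fe --reordering-smooth 0.5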
You may have better ideas of how to do word alignment,
extract phrases, or score phrases. Since the training is modular,
you can start training at any of the nine training steps with
--first-step and end it at any subsequent step with --last-step.
Again, the nine training steps are: (1) prepare data, (2) run GIZA++, (3) align words, (4) get lexical translation table, (5) extract phrases, (6) score phrases, (7) build lexicalized reordering model, (8) build generation models, (9) create configuration file.
For instance, if you have your own method to generate a word alignment, you may want to skip the first training steps and start with the generation of the lexical translation table. You can specify this by
train-model.perl [...] --first-step 4
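The same mechanism works in the other direction: with --last-step you can stop early, e.g. to run only the lexical table, phrase extraction, and phrase scoring steps (4 through 6) on an existing alignment:
train-model.perl [...] --first-step 4 --last-step 6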
Summary
--first-step
-- first step in the training process (default 1)
--last-step
-- last step in the training process (default 7)
A number of parameters allow you to break out of the rigid file name conventions of the training script. This is typically useful when you want to try alternative training runs without repeating all the training steps.
For instance, you may want to try an alternative alignment heuristic.
There is no need to rerun GIZA++. You could copy the
necessary files from the corpus
and the giza.*
directories into a new root directory, but this takes up a lot
of additional disk space and makes the file organization
unnecessarily complicated.
Since you only need a new model directory, you can specify
this with the parameter --model-dir, and stay within the
previous root directory structure:
train-model.perl [...] --first-step 3 --alignment union --model-dir model-union
The other parameters for file and directory names fulfill similar purposes.
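For example, one could redo everything from the lexical translation tables onward while keeping all outputs of the alternative run separate (the directory and file names here are only illustrative):
train-model.perl [...] --first-step 4 --lexical-dir lex-alt --extract-file extract-alt --model-dir model-alt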
Summary
--corpus-dir
-- corpus directory (default $ROOT/corpus)
--lexical-dir
-- lexical translation probability directory (default $ROOT/model)
--model-dir
-- model directory (default $ROOT/model)
--extract-file
-- extraction file (default $ROOT/model/extract)
--giza-f2e
-- GIZA++ directory (default $ROOT/giza.$F-$E)
--giza-e2f
-- inverse GIZA++ directory (default $ROOT/giza.$E-$F)
A number of different word alignment heuristics are implemented, and can be
specified with the parameter --alignment.
The options are:
intersect
-- the intersection of the two GIZA++ alignments is taken. This usually creates a lot of extracted phrases, since the unaligned words create a lot of freedom to align phrases.
union
-- the union of the two GIZA++ alignments is taken
grow-diag-final
-- the default heuristic
grow-diag
-- same as above, but without a call to the function FINAL() (see the background on word alignment)
grow
-- same as above, but with a different definition of neighboring. Now diagonally adjacent alignment points are excluded.
grow-final
-- no diagonal neighbors, but with FINAL()
Different heuristics may show better performance for a specific language pair or corpus, so some experimentation may be useful.
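If you want to compare several heuristics while reusing the same GIZA++ output, a small shell loop is enough (a sketch; the model-$h directory names are hypothetical):
for h in intersect union grow-diag grow-diag-final; do
  train-model.perl [...] --first-step 3 --alignment $h --model-dir model-$h
done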
Summary
--alignment
-- heuristic used for word alignment: intersect, union, grow, grow-final, grow-diag, grow-diag-final (default), grow-diag-final-and, srctotgt, tgttosrc
The maximum length of phrases is limited to 7 words by default. The maximum phrase length impacts the size of the phrase translation table, so shorter limits may be desirable if phrase table size is an issue. Previous experiments have shown that performance increases only slightly when including phrases of more than 3 words.
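For instance, to cap phrases at 4 words:
train-model.perl [...] --max-phrase-length 4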
Summary
--max-phrase-length
-- maximum length of phrases entered into phrase table (default 7)
GIZA++ takes a lot of parameters to specify the behavior of the training process and limits on sentence length, etc. Please refer to the corresponding documentation for details on this.
Parameters can be passed on to GIZA++ with the switch --giza-option.
For instance, if you want to change the number of iterations for the different IBM models to 4 iterations of Model 1, 0 iterations of Model 2, 4 iterations of the HMM model, 0 iterations of Model 3, and 3 iterations of Model 4, you can specify this by
train-model.perl [...] --giza-option m1=4,m2=0,mh=4,m3=0,m4=3
Summary
--giza-option
-- additional options for GIZA++ training
Training on large training corpora may become a problem for the GIZA++ word alignment tool. Since it stores the word translation table in memory, the size of this table may become too large for the available RAM of the machine. For instance, the data sets for the NIST Arabic-English and Chinese-English competitions require more than 4 GB of RAM, which is a problem for current 32-bit machines.
This problem can be remedied to some degree by a more efficient data structure in GIZA++, which requires running snt2cooc in advance on parts of the corpus and merging the resulting output. All you need to know is that running the training script with the option --parts n, e.g. --parts 3, may allow you to train on a corpus that was too large for a regular run.
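For example:
train-model.perl [...] --parts 3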
Somewhat related to the problems caused by large training corpora is the long run time of GIZA++. It is possible to run the two GIZA++ training directions separately on two machines with the switch --direction. When one run is started on one machine with --direction 1 and the other on a different machine or CPU with --direction 2, the processing time for training step 2 can be cut in half.
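A possible division of labor looks like this (a sketch; both runs must see the same root directory, e.g. on a shared file system):
train-model.perl [...] --last-step 2 --direction 1    # on machine 1
train-model.perl [...] --last-step 2 --direction 2    # on machine 2
train-model.perl [...] --first-step 3                 # afterwards, on one machine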
Summary
--parts
-- prepare data for GIZA++ by running snt2cooc in parts
--direction
-- run training step 2 only in direction 1 or 2 (for parallelization)