We will start with an overview of the training process. This should give a feel for what is going on and what files are produced. In the following sections, we will go into more detail on the options of the training process and on additional tools.
The training process takes place in nine steps, all of them executed by the script
train-model.perl
The nine steps are
If you are running on a machine with multiple processors, some of these steps can be considerably sped up with the following option:
--parallel
The run times mentioned in the steps refer to a recent training run on the 751'000 sentence, 16 million word German-English Europarl corpus, on a 3GHz Linux machine.
If you wish to experiment with translation in both directions, steps 1 and 2 can be reused; from step 3 onward, the contents of the model directory are direction-dependent. In other words, run steps 1 and 2, then make a copy of the whole experiment directory and continue the two trainings separately from step 3.
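The reuse of steps 1 and 2 can be sketched as a short plan of commands. This is only an illustration under stated assumptions: the helper name plan_bidirectional is hypothetical, it prints commands rather than running them, and it assumes that --first-step and --last-step are the options that restrict which steps are run and that the second training swaps --f and --e. Check both against the parameter reference before use.

```shell
#!/bin/sh
# plan_bidirectional is a hypothetical helper, not part of Moses.
# It only PRINTS the commands; nothing is executed.
# Usage: plan_bidirectional EXPERIMENT_DIR CORPUS_STEM F E
plan_bidirectional() {
    dir=$1; corpus=$2; f=$3; e=$4
    # Steps 1 and 2 are shared by both directions, so run them once.
    # Assumption: --first-step/--last-step select the range of steps.
    echo "(cd $dir && train-model.perl --root-dir . --corpus $corpus --f $f --e $e --first-step 1 --last-step 2)"
    # Copy the whole experiment directory for the reverse direction.
    echo "cp -r $dir $dir.$e-$f"
    # Continue from step 3 in each copy.
    # Assumption: the reverse training swaps --f and --e.
    echo "(cd $dir && train-model.perl --root-dir . --corpus $corpus --f $f --e $e --first-step 3)"
    echo "(cd $dir.$e-$f && train-model.perl --root-dir . --corpus $corpus --f $e --e $f --first-step 3)"
}
```

Calling plan_bidirectional exp corpus/euro de en prints four commands, which can be reviewed and then run by hand.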
For a standard phrase model, you will typically run the training script as follows:
train-model.perl --root-dir . --corpus corpus/euro --f de --e en
There should be two files in the corpus/ directory, called euro.de and euro.en. These files should be the sentence-aligned halves of the parallel corpus: euro.de should contain the German sentences, and euro.en should contain the corresponding English sentences, line by line.
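Because the two files must be sentence-aligned, a quick sanity check before training is to confirm that both exist and have the same number of lines. The following is a minimal sketch, not part of the Moses scripts; the function name check_parallel is hypothetical. Equal line counts are necessary for alignment but do not prove it.

```shell
#!/bin/sh
# check_parallel is a hypothetical helper, not part of Moses.
# Usage: check_parallel CORPUS_STEM F E   (e.g. check_parallel corpus/euro de en)
check_parallel() {
    stem=$1; f=$2; e=$3
    # Both halves of the corpus must be present.
    for lang in "$f" "$e"; do
        if [ ! -f "$stem.$lang" ]; then
            echo "missing file: $stem.$lang" >&2
            return 2
        fi
    done
    # Sentence-aligned files have one sentence per line and equal line counts.
    n_f=$(wc -l < "$stem.$f" | tr -d ' ')
    n_e=$(wc -l < "$stem.$e" | tr -d ' ')
    if [ "$n_f" -ne "$n_e" ]; then
        echo "line count mismatch: $stem.$f=$n_f $stem.$e=$n_e" >&2
        return 1
    fi
    echo "$n_f sentence pairs"
}
```

For the example above, the check would be run as check_parallel corpus/euro de en from the experiment directory.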
The training parameters are described in more detail at the end of this manual. For corpus preparation, see the section on how to prepare training data.