
Training Step 1: Prepare Data

The parallel corpus must be converted into a format suitable for the GIZA++ toolkit. Two vocabulary files are generated, and the parallel corpus is converted into a numberized format.

The vocabulary files contain words, integer word identifiers, and word counts:

 ==> corpus/de.vcb <==
 1       UNK     0
 2       ,       928579
 3       .       723187
 4       die     581109
 5       der     491791
 6       und     337166
 7       in      230047
 8       zu      176868
 9       den     168228
 10      ich     162745

 ==> corpus/en.vcb <==
 1       UNK     0
 2       the     1085527
 3       .       714984
 4       ,       659491
 5       of      488315
 6       to      481484
 7       and     352900
 8       in      330156
 9       is      278405
 10      that    262619
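
The exact conversion is performed by the Moses training scripts, but the shape of these files is easy to reproduce. Below is a minimal sketch in Python (file names are hypothetical); the assignment of ids in descending order of frequency, with id 1 reserved for UNK, is an assumption read off the listings above:

 from collections import Counter

 def build_vcb(tokens_path, vcb_path):
     # Count token frequencies over the tokenized corpus (one sentence per line).
     counts = Counter()
     with open(tokens_path, encoding="utf-8") as f:
         for line in f:
             counts.update(line.split())
     with open(vcb_path, "w", encoding="utf-8") as out:
         out.write("1\tUNK\t0\n")  # id 1 appears to be reserved for unknown words
         # Assumption: remaining ids are assigned by descending corpus frequency.
         words = sorted(counts, key=lambda w: -counts[w])
         for wid, word in enumerate(words, start=2):
             out.write(f"{wid}\t{word}\t{counts[word]}\n")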

The sentence-aligned corpus now looks like this:

 > head -9 corpus/en-de-int-train.snt
 1
 3469 5 2049
 4107 5 2 1399
 1
 10 3214 4 116 2007 2 9 5254 1151 985 6447 2049 21 44 141 14 2580 3
 14 2213 1866 2 1399 5 2 29 46 3256 18 1969 4 2363 1239 1111 3
 1
 7179
 306

A sentence pair now consists of three lines: first, the frequency of this sentence pair, which in our training process is always 1 (this number can be used to weight different parts of the training corpus differently); then two lines containing the word ids of the foreign sentence and of the English sentence. In the sequence 4107 5 2 1399 we can recognize of (5) and the (2).
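
To sanity-check the numberized corpus, the triples can be mapped back to words using the two vocabulary files. A minimal sketch in Python (the file-parsing details are assumptions based on the listings above):

 def load_vcb(path):
     # Map integer word ids back to surface words.
     id2word = {}
     with open(path, encoding="utf-8") as f:
         for line in f:
             wid, word, _count = line.split()
             id2word[int(wid)] = word
     return id2word

 def read_snt(snt_path, foreign_vcb, english_vcb):
     foreign = load_vcb(foreign_vcb)   # e.g. corpus/de.vcb
     english = load_vcb(english_vcb)   # e.g. corpus/en.vcb
     with open(snt_path, encoding="utf-8") as f:
         lines = [line.split() for line in f]
     # Each sentence pair occupies three lines:
     # frequency, foreign word ids, English word ids.
     for i in range(0, len(lines), 3):
         freq = int(lines[i][0])
         src = [foreign[int(w)] for w in lines[i + 1]]
         tgt = [english[int(w)] for w in lines[i + 2]]
         yield freq, src, tgt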

GIZA++ also requires words to be assigned to word classes. This is done automatically by calling the mkcls program. Word classes are used only by the IBM reordering model in GIZA++. A peek into the foreign word class file (a short sketch for reading it follows the listing):

 > head corpus/de.vcb.classes
 !       14
 "       14
 #       30
 %       31
 &       10
 '       14
 (       10
 )       14
 +       31
 ,       11
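
Each line of the class file pairs a word with its integer class id. A minimal sketch in Python for loading this mapping (the two-column, whitespace-separated format is an assumption from the listing above):

 def load_word_classes(path):
     # Map each word to the integer class id mkcls assigned to it.
     word2class = {}
     with open(path, encoding="utf-8") as f:
         for line in f:
             word, cls = line.rsplit(None, 1)  # class id is the last field
             word2class[word] = int(cls)
     return word2class

 # Usage: classes = load_word_classes("corpus/de.vcb.classes")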