Preparing Training Data

Training data must be sentence-aligned (one sentence per line) and provided in two files, one for the foreign sentences and one for the English sentences:

 >head -3 corpus/euro.*
 ==> corpus/euro.de <==
 wiederaufnahme der sitzungsperiode
 ich erklaere die am donnerstag , den 28. maerz 1996 unterbrochene sitzungsperiode des europaeischen parlaments fuer wiederaufgenommen .
 begruessung

 ==> corpus/euro.en <==
 resumption of the session
 i declare resumed the session of the european parliament adjourned on thursday , 28 march 1996 .
 welcome

A few other points have to be taken care of:

unix commands require the environment variable LC_ALL=C
one sentence per line, no empty lines
sentences longer than 100 words (and their corresponding translations) have to be eliminated (note that a shorter sentence length limit will speed up training)
everything lowercased (use lowercase.perl)
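
The points above can be checked before training. The following is a minimal sketch in Python, not part of Moses; the function name check_parallel_corpus and its problem messages are assumptions made for illustration. It takes the two files as lists of lines and reports empty lines, over-long sentences, and lines that are not lowercased.

```python
def check_parallel_corpus(foreign_lines, english_lines, max_words=100):
    """Return (line_number, problem) tuples for a sentence-aligned corpus.

    Hypothetical helper: line number 0 is used for problems that concern
    the files as a whole rather than a single sentence pair.
    """
    problems = []
    # Sentence alignment requires the same number of lines in both files.
    if len(foreign_lines) != len(english_lines):
        problems.append((0, "files differ in number of lines"))
    pairs = zip(foreign_lines, english_lines)
    for number, (foreign, english) in enumerate(pairs, start=1):
        for side, sentence in (("foreign", foreign), ("english", english)):
            words = sentence.split()
            if not words:
                problems.append((number, f"{side}: empty line"))
            elif len(words) > max_words:
                problems.append((number, f"{side}: longer than {max_words} words"))
            # Everything is expected to be lowercased already.
            if sentence != sentence.lower():
                problems.append((number, f"{side}: not lowercased"))
    return problems
```

An empty result means none of the checked conditions was violated. The lines can be read from the two corpus files, for instance with open(path, encoding="utf-8").read().splitlines().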

Training data for factored models

You will have to provide training data in the format

 word0factor0|word0factor1|word0factor2 word1factor0|word1factor1|word1factor2 ...

instead of the un-factored

 word0 word1 word2
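
To illustrate the factored format, here is a small Python sketch that joins per-word factor sequences into one training line. It is not a Moses tool; the function name to_factored_line and the lemma and part-of-speech values in the example are assumptions chosen for illustration.

```python
def to_factored_line(*factor_sequences, separator="|"):
    """Join parallel per-word factor sequences into one factored line.

    Each argument holds one factor for every word, e.g. surface forms,
    lemmas, part-of-speech tags (hypothetical choice of factors).
    """
    lengths = {len(sequence) for sequence in factor_sequences}
    if len(lengths) != 1:
        raise ValueError("every factor sequence needs one entry per word")
    for sequence in factor_sequences:
        for factor in sequence:
            # A factor must not be empty or contain the separator or a space,
            # otherwise the line could not be split back into its factors.
            if not factor or separator in factor or " " in factor:
                raise ValueError(f"invalid factor value: {factor!r}")
    # zip(*...) regroups the factors word by word.
    return " ".join(separator.join(factors) for factors in zip(*factor_sequences))


surface = ["resumption", "of", "the", "session"]
lemma = ["resumption", "of", "the", "session"]   # illustrative lemmas
pos = ["NN", "IN", "DT", "NN"]                   # illustrative tags
print(to_factored_line(surface, lemma, pos))
# resumption|resumption|NN of|of|IN the|the|DT session|session|NN
```

With a single factor sequence the result is the ordinary un-factored line.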

Cleaning the corpus

The script clean-corpus-n.perl is a small script that cleans up a parallel corpus so that it works well with the training script.

It performs the following steps:

removes empty lines
removes redundant space characters
drops lines (and their corresponding lines in the other language) that are empty, too short, too long, or violate the 9-1 sentence ratio limit of GIZA++

The command syntax is:

 clean-corpus-n.perl CORPUS L1 L2 OUT MIN MAX

For example, clean-corpus-n.perl raw de en clean 1 50 takes the corpus files raw.de and raw.en, deletes sentence pairs in which either line is longer than 50 words, and creates the output files clean.de and clean.en.
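
The cleaning steps can be approximated in a few lines of Python. This is a sketch of the described behaviour, not a port of clean-corpus-n.perl: the function name clean_corpus is hypothetical, and the exact way the real script counts words and compares the length ratio is an assumption here.

```python
def clean_corpus(source_lines, target_lines, min_words=1, max_words=50, max_ratio=9):
    """Return the sentence pairs that pass the length and ratio filters.

    Assumption: words are whitespace-separated tokens, and a pair violates
    the ratio limit when one side has more than max_ratio times as many
    words as the other.
    """
    kept_source, kept_target = [], []
    for source, target in zip(source_lines, target_lines):
        source_words = source.split()
        target_words = target.split()
        n_source, n_target = len(source_words), len(target_words)
        # Drop pairs where either side is empty or too short.
        if n_source < min_words or n_target < min_words:
            continue
        # Drop pairs where either side is too long.
        if n_source > max_words or n_target > max_words:
            continue
        # Drop pairs with a very uneven length ratio.
        if n_source > max_ratio * n_target or n_target > max_ratio * n_source:
            continue
        # Re-joining the tokens removes redundant space characters.
        kept_source.append(" ".join(source_words))
        kept_target.append(" ".join(target_words))
    return kept_source, kept_target
```

Both sides of a pair are dropped together, so the two output lists stay sentence-aligned.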

Moses statistical machine translation system
