Moses statistical machine translation system

Training Step 3: Align Words

To establish word alignments based on the two GIZA++ alignments, a number of heuristics may be applied. The default heuristic grow-diag-final starts with the intersection of the two alignments and then adds additional alignment points.

Other possible alignment methods:

  • intersection
  • grow (only add block-neighboring points)
  • grow-diag (without final step)
  • union
  • srctotgt (only consider word-to-word alignments from the source-target GIZA++ alignment file)
  • tgttosrc (only consider word-to-word alignments from the target-source GIZA++ alignment file)

Alternative alignment methods can be specified with the switch --alignment.
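
The four simplest methods in the list above can be sketched as plain set operations. This is only an illustration, not the Moses implementation: the function name `combine` and the representation of each GIZA++ alignment as a Python set of position pairs are assumptions made for this example.

```python
def combine(src_tgt, tgt_src, method):
    """Combine two directional alignments (sets of position pairs).

    src_tgt: points from the source-target GIZA++ alignment file
    tgt_src: points from the target-source GIZA++ alignment file
    Both are assumed to use the same pair order.
    """
    if method == "intersection":
        return src_tgt & tgt_src      # points both directions agree on
    if method == "union":
        return src_tgt | tgt_src      # points found in either direction
    if method == "srctotgt":
        return set(src_tgt)           # one direction only
    if method == "tgttosrc":
        return set(tgt_src)           # the other direction only
    raise ValueError("method not covered by this sketch: " + method)
```

The grow, grow-diag and grow-diag-final methods start from the intersection and add points from the union, so their result lies between these two extremes.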

Here is the pseudocode for the default heuristic:

 GROW-DIAG-FINAL(e2f,f2e):
  neighboring = ((-1,0),(0,-1),(1,0),(0,1),(-1,-1),(-1,1),(1,-1),(1,1))
  alignment = intersect(e2f,f2e); 
  GROW-DIAG(); FINAL(e2f); FINAL(f2e);

 GROW-DIAG():
  iterate until no new points added
    for english word e = 0 ... en
      for foreign word f = 0 ... fn
        if ( e aligned with f )
          for each neighboring point ( e-new, f-new ):
            if ( ( e-new not aligned or f-new not aligned ) and
                 ( e-new, f-new ) in union( e2f, f2e ) ) 
              add alignment point ( e-new, f-new )

 FINAL(a):
  for english word e-new = 0 ... en
    for foreign word f-new = 0 ... fn
      if ( ( e-new not aligned or f-new not aligned ) and
           ( e-new, f-new ) in alignment a )
        add alignment point ( e-new, f-new )
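
The pseudocode can be transcribed into a runnable form fairly directly. The following is a minimal sketch, not the code Moses actually runs. It assumes that each alignment is a set of zero-based (e, f) position pairs and that the loops cover positions 0 to length-1; the function and parameter names (`grow_diag_final`, `en_len`, `fn_len`) are made up for this example.

```python
# The eight neighbors of a point: four block neighbors, four diagonal ones.
NEIGHBORING = [(-1, 0), (0, -1), (1, 0), (0, 1),
               (-1, -1), (-1, 1), (1, -1), (1, 1)]


def grow_diag_final(e2f, f2e, en_len, fn_len):
    """Symmetrize two alignments given as sets of (e, f) pairs."""
    union = e2f | f2e
    alignment = e2f & f2e                    # start with the intersection
    aligned_e = {e for e, _ in alignment}    # English positions aligned so far
    aligned_f = {f for _, f in alignment}    # foreign positions aligned so far

    def add(e, f):
        alignment.add((e, f))
        aligned_e.add(e)
        aligned_f.add(f)

    def unaligned(e, f):
        # True if at least one of the two words has no alignment point yet.
        return e not in aligned_e or f not in aligned_f

    # GROW-DIAG: iterate until no new points are added.
    added = True
    while added:
        added = False
        for e in range(en_len):
            for f in range(fn_len):
                if (e, f) not in alignment:
                    continue
                for de, df in NEIGHBORING:
                    e_new, f_new = e + de, f + df
                    if unaligned(e_new, f_new) and (e_new, f_new) in union:
                        add(e_new, f_new)
                        added = True

    # FINAL(e2f), then FINAL(f2e): add remaining points from each
    # direction, whether or not they neighbor an existing point.
    for a in (e2f, f2e):
        for e_new in range(en_len):
            for f_new in range(fn_len):
                if unaligned(e_new, f_new) and (e_new, f_new) in a:
                    add(e_new, f_new)

    return alignment
```

On a toy input such as `e2f = {(0, 0), (0, 1), (1, 1), (1, 2)}` and `f2e = {(0, 0), (1, 1), (2, 2), (3, 0)}` with `en_len = 4` and `fn_len = 3`, the grow step adds (1, 2) and (2, 2), the final step adds (3, 0), and the union point (0, 1) is never added because both of its words are already aligned.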

To illustrate this heuristic, consider the second sentence in the corpus above (see the figure). Start with the intersection of the two alignments, and then add some additional alignment points that lie in the union of the two alignments.

This alignment has a blatant error: the alignment of the two verbs is mixed up. resumed is aligned to unterbrochene, and adjourned is aligned to wiederaufgenommen, but it should be the other way around.

To conclude this section, here is a quick look at the files generated by the word alignment process:

 ==> model/aligned.de <==
 wiederaufnahme der sitzungsperiode
 ich erklaere die am donnerstag , den 28. maerz 1996 unterbrochene sitzungsperiode
   des europaeischen parlaments fuer wiederaufgenommen .
 begruessung

 ==> model/aligned.en <==
 resumption of the session
 i declare resumed the session of the european parliament adjourned on 
   thursday , 28 march 1996 .
 welcome

 ==> model/aligned.grow-diag-final <==
 0-0 0-1 1-2 2-3
 0-0 1-1 2-3 3-10 3-11 4-11 5-12 7-13 8-14 9-15 10-2 11-4 12-5 12-6 13-7 
   14-8 15-9 16-9 17-16
 0-0

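The third file contains the alignment information, one line per sentence pair. Each alignment point is written as a pair of zero-based word positions, the foreign word first and the English word second. For example, 1-2 in the first line links der to the.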
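
To inspect an alignment line by hand, the positions can be mapped back to words. The sketch below assumes whitespace-tokenized sentences and zero-based positions with the foreign position first, as the sample files above suggest; the helper names `parse_alignment` and `aligned_word_pairs` are made up for this example.

```python
def parse_alignment(line):
    """Turn a line such as '0-0 0-1 1-2' into a list of (f, e) index pairs."""
    points = []
    for point in line.split():
        f, e = point.split("-")
        points.append((int(f), int(e)))
    return points


def aligned_word_pairs(foreign_sentence, english_sentence, alignment_line):
    """Return (foreign word, English word) for every alignment point."""
    foreign = foreign_sentence.split()
    english = english_sentence.split()
    return [(foreign[f], english[e]) for f, e in parse_alignment(alignment_line)]


# First sentence pair from the sample files above.
pairs = aligned_word_pairs("wiederaufnahme der sitzungsperiode",
                           "resumption of the session",
                           "0-0 0-1 1-2 2-3")
for foreign_word, english_word in pairs:
    print(foreign_word, "-", english_word)
```

Applied to the second sentence pair, the same helper pairs unterbrochene with resumed (10-2) and wiederaufgenommen with adjourned (16-9), which is the verb mix-up described above.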

Page last modified on April 26, 2012, at 05:17 PM