DEVELOPMENT OF AN OPEN SOURCE WORD ALIGNMENT FRAMEWORK
======================================================

Alex Fraser
Institute for NLP (IMS), University of Stuttgart, Germany
fraser@ims.uni-stuttgart.de

Motivation
==========

Obtaining high quality word alignments is critical to the future development of richer models of SMT. For instance, initial experiments have shown that hierarchical phrase models are more sensitive to alignment quality than non-hierarchical phrase models. The current word alignment pipeline is unstable and difficult to optimize.

Summary
=======

We propose the creation of an open source word alignment framework enabling the use of the LEAF model and the EMD training algorithm to produce alignments of higher quality than those produced by the alignment pipeline used by most MT researchers. The work-intensive portion of the proposal is the implementation of search for the log-linear alignment model, because the proposed system will directly incorporate the improved open source MERT implementation created at Euromatrix MTM2.

Details
=======

The EMD training algorithm implements Expectation, Maximization and Discrimination steps. The Expectation step implements a hill-climbing search for the most probable alignment under a log-linear model incorporating as sub-models the steps of the LEAF generative model, together with additional backed-off and heuristic sub-models. The Maximization step calculates sufficient statistics from alignments generated by the LEAF model. The Discrimination step requires optimizing the weights of the log-linear model according to an error criterion calculated over a small discriminative training corpus. We discuss the implementation of these steps in additional detail below.

In the E step, the Viterbi alignments for the full training corpus are found by searching for the alignment which maximizes the log-linear formulation of LEAF.
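The log-linear objective maximized here can be sketched as follows. This is an illustrative sketch only: the function and sub-model names are hypothetical, and a real system explores candidates by hill climbing rather than enumerating them exhaustively.

```python
import math

def loglinear_score(alignment, submodels, weights):
    """Score an alignment as a weighted sum of log sub-model probabilities.

    `submodels` is a list of functions mapping an alignment to a probability
    (standing in for the LEAF generative steps and the backed-off and
    heuristic sub-models); `weights` are the corresponding log-linear
    weights, which the D step tunes. Names are illustrative, not the
    actual LEAF interface.
    """
    return sum(w * math.log(m(alignment))
               for w, m in zip(weights, submodels))

def viterbi_alignment(candidates, submodels, weights):
    """Pick the highest-scoring alignment from a candidate set.

    Exhaustive enumeration is used here only to make the objective
    concrete; in practice the candidate space is far too large for this.
    """
    return max(candidates,
               key=lambda a: loglinear_score(a, submodels, weights))
```

A hill-climbing search replaces the `max` over all candidates with a walk through a neighborhood of a starting alignment, scoring only the local changes.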
This will be implemented using a hill-climbing local search with tabu alignments (a tabu list implements a local memory of which hypotheses not to expand during the search), though we expect improvements in search quality to be an active area of research. The implementation of the E step involves:

- Definition of efficient data structures implementing the model, including representations of the LEAF alignment structure and the model parameters.
- Implementation of an optimized hill-climbing search algorithm and the required search operations, which make small changes to a starting alignment and efficiently score these changes by examining the difference in log probability between the starting alignment and the proposed change.
- Extension of the basic hill-climbing search to support two search modes for finding the most probable alignment according to the log-linear model. A highly accurate search will be used over the small discriminative training corpus in the D step, where search errors should be minimized. A highly efficient search will be used in the E step, where a small number of search errors is acceptable.
- Distribution of the search across multiple nodes of a cluster. This can be implemented efficiently by filtering the model parameters.
- Creation of tools for measuring known search errors as a function of time, and development of a framework for optimizing the search operations.

The M step involves estimation of the (generative) model from alignments generated by the system, which is straightforward.

The D step can be implemented using the new flexible MERT framework for maximizing a final performance criterion developed at MTM2. This will require minimal work to support maximizing word alignment quality metrics such as F-alpha. We will implement a function which attaches the required error counts to each hypothesized alignment generated during the search.
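The hill-climbing search with tabu memory described above might be sketched as follows. All names are hypothetical: moves here are single link toggles rather than LEAF's actual search operations, and each neighbor is rescored from scratch instead of by the delta in log probability that an optimized implementation would use.

```python
import itertools

def neighbors(alignment, m, n):
    """Generate alignments one link-toggle away, for an m-by-n sentence
    pair (a stand-in for richer move/swap operations)."""
    for i, j in itertools.product(range(m), range(n)):
        yield alignment ^ {(i, j)}  # add the link if absent, drop it if present

def hill_climb(start, score, m, n, max_steps=100):
    """Greedy local search with a tabu memory of visited alignments.

    `score` maps a frozenset of (i, j) links to a model score; the tabu
    set records hypotheses that must not be expanded again.
    """
    current = frozenset(start)
    tabu = {current}
    for _ in range(max_steps):
        best = max((a for a in neighbors(current, m, n) if a not in tabu),
                   key=score, default=None)
        if best is None or score(best) <= score(current):
            return current  # local maximum (or all neighbors tabu)
        current = best
        tabu.add(current)
    return current
```

Measuring how often such a search returns something other than the true model maximum, as a function of time spent, is exactly the kind of search-error diagnostic the tools above would support.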
We will also provide the functions used during the optimization step to evaluate the F-alpha metric (these are a function to calculate the error given the error counts, and a function to calculate the delta in the error given a delta in the error counts). The D step also requires adaptation of the basic Viterbi alignment search algorithm to produce the higher quality alignments required for optimization of an error criterion.

Work effort
===========

- M step implementation: 2 weeks
- E step implementation: 3 months
- D step adaptation and tuning: 4 weeks
- Experimentation and improvement: 2 months

Language pairs: DE/EN and ES/EN, Europarl data sets.

Contributions
=============

1) Higher quality alignments which will be made available for use in MT research, including high quality N-best alignment lists predicted for the WMT language pairs.

2) An improved alignment pipeline for use together with MOSES+MERT. GIZA++ suffers from instability resulting in run-time errors, has many free parameters which are difficult to optimize, and cannot easily be distributed across multiple compute nodes.

3) LEAF+EMD will be a useful framework for testing the generalization of the MTM2 MERT framework to new problems (beyond optimizing BLEU for decoding quality).