DEVELOPMENT OF AN OPEN SOURCE WORD ALIGNMENT FRAMEWORK
======================================================

Alex Fraser
Institute for NLP (IMS), University of Stuttgart, Germany
fraser@ims.uni-stuttgart.de

Motivation
==========

Obtaining high quality word alignments is critical to the future development of richer models of SMT. For instance, initial experiments have shown that hierarchical phrase models are more sensitive to alignment quality than non-hierarchical phrase models. The current word alignment pipeline is unstable and difficult to optimize.

Summary
=======

We propose the creation of an open source word alignment framework enabling the use of the LEAF model and the EMD training algorithm to produce alignments of higher quality than those produced by the alignment pipeline used by most MT researchers. The work-intensive portion of the proposal is the implementation of search for the log-linear alignment model, because the proposed system will directly incorporate the improved open source MERT implementation created at Euromatrix MTM2.

Details
=======

The EMD training algorithm implements Expectation, Maximization and Discrimination steps. The Expectation step implements a hill-climbing search for the most probable alignment under a log-linear model incorporating as sub-models the steps of the LEAF generative model, together with additional backed-off and heuristic sub-models. The Maximization step calculates sufficient statistics from alignments generated by the LEAF model. The Discrimination step requires optimizing the weights of the log-linear model according to an error criterion calculated over a small discriminative training corpus. We discuss the implementation of these steps in additional detail below.

In the E step, the Viterbi alignments for the full training corpus are found by searching for the alignment which maximizes the log-linear formulation of LEAF.
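The log-linear objective maximized here can be sketched as follows. This is an illustrative sketch only: the function and sub-model names are hypothetical, and a real system explores candidates by hill climbing rather than enumerating them exhaustively.

```python
import math

def loglinear_score(alignment, submodels, weights):
    """Score an alignment as a weighted sum of log sub-model probabilities.

    `submodels` is a list of functions mapping an alignment to a probability
    (standing in for the LEAF generative steps and the backed-off and
    heuristic sub-models); `weights` are the corresponding log-linear
    weights, which the D step tunes. Names are illustrative, not the
    actual LEAF interface.
    """
    return sum(w * math.log(m(alignment))
               for w, m in zip(weights, submodels))

def viterbi_alignment(candidates, submodels, weights):
    """Pick the highest-scoring alignment from a candidate set.

    Exhaustive enumeration is used here only to make the objective
    concrete; in practice the candidate space is far too large for this.
    """
    return max(candidates,
               key=lambda a: loglinear_score(a, submodels, weights))
```

A hill-climbing search replaces the `max` over all candidates with a walk through a neighborhood of a starting alignment, scoring only the local changes.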
This will be implemented using a hill-climbing local search with tabu alignments (a tabu list implements a local memory of which hypotheses not to expand during the search), though we expect improvements in search quality to be an active area of research. The implementation of the E step involves:

- Definition of efficient data structures implementing the model, including representations of the LEAF alignment structure and the model parameters.
- Implementation of an optimized hill-climbing search algorithm and the required search operations, which make small changes to a starting alignment and efficiently score these changes by examining the difference in log probability between the starting alignment and the proposed change.
- Extension of the basic hill-climbing search to support two search modes for finding the most probable alignment according to the log-linear model. A highly accurate search will be used over the small discriminative training corpus in the D step, where search errors should be minimized. A highly efficient search will be used in the E step, where a small number of search errors is acceptable.
- Distribution of the search across multiple nodes of a cluster. This can be implemented efficiently by filtering the model parameters.
- Creation of tools for measuring known search errors as a function of time, and development of a framework for optimizing the search operations.

The M step involves estimation of the (generative) model from alignments generated by the system, which is straightforward.

The D step can be implemented using the new flexible MERT framework for maximizing a final performance criterion developed at MTM2. This will require minimal work to support maximizing word alignment quality metrics such as F-alpha. We will implement a function which attaches the required error counts to each hypothesized alignment generated during the search.
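The hill-climbing search with tabu memory described above might be sketched as follows. All names are hypothetical: moves here are single link toggles rather than LEAF's actual search operations, and each neighbor is rescored from scratch instead of by the delta in log probability that an optimized implementation would use.

```python
import itertools

def neighbors(alignment, m, n):
    """Generate alignments one link-toggle away, for an m-by-n sentence
    pair (a stand-in for richer move/swap operations)."""
    for i, j in itertools.product(range(m), range(n)):
        yield alignment ^ {(i, j)}  # add the link if absent, drop it if present

def hill_climb(start, score, m, n, max_steps=100):
    """Greedy local search with a tabu memory of visited alignments.

    `score` maps a frozenset of (i, j) links to a model score; the tabu
    set records hypotheses that must not be expanded again.
    """
    current = frozenset(start)
    tabu = {current}
    for _ in range(max_steps):
        best = max((a for a in neighbors(current, m, n) if a not in tabu),
                   key=score, default=None)
        if best is None or score(best) <= score(current):
            return current  # local maximum (or all neighbors tabu)
        current = best
        tabu.add(current)
    return current
```

Measuring how often such a search returns something other than the true model maximum, as a function of time spent, is exactly the kind of search-error diagnostic the tools above would support.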
We will also provide the functions used during the optimization step to evaluate the F-alpha metric (these are a function to calculate the error given the error counts, and a function to calculate the delta in the error given a delta in the error counts). The D step also requires adaptation of the basic Viterbi alignment search algorithm to produce the higher quality alignments required for optimization of an error criterion.

Work effort
===========

- M step implementation: 2 weeks
- E step implementation: 3 months
- D step adaptation and tuning: 4 weeks
- Experimentation and improvement: 2 months

Language pairs: DE/EN and ES/EN, Europarl data sets.

Contributions
=============

1) Higher quality alignments which will be made available for use in MT research, including high quality N-best alignment lists predicted for the WMT language pairs.

2) An improved alignment pipeline for use together with MOSES+MERT. GIZA++ suffers from instability resulting in run-time errors, has many free parameters which are difficult to optimize, and cannot easily be distributed across multiple compute nodes.

3) LEAF+EMD will be a useful framework for testing the generalization of the MTM2 MERT framework to new problems (beyond optimizing BLEU for decoding quality).