Word Alignment Challenge

Aligning words is a key task in data-driven machine translation. We start with a large parallel corpus where the translation of each sentence is known. So the input consists of sentence pairs like this one:

le droit de permis passe donc de $ 25 à $ 500 . — we see the licence fee going up from $ 25 to $ 500 .

It is relatively easy to obtain data in this form, though not completely trivial. This particular example is from the Canadian Hansards, proceedings of government meetings that are required to be published in both English and French. To align the sentences themselves, we can exploit document boundaries and structural cues that follow from the documents being translations of each other: translated sentences appear in the same order, have similar lengths, and so forth. The problem is that we don't know the word-to-word correspondences within each sentence pair. That's where you come in: your challenge is to write a program that aligns the words automatically. For the sentence pair above, you might output the following correspondences:

le — the, droit — fee, permis — licence, passe — going, passe — up, donc — from, $ — $, 25 — 25, à — to, $ — $, 500 — 500

Some words might be unaligned (e.g. we and see), while some words might be aligned to multiple words in the corresponding sentence (e.g. passe is aligned to going up). Sometimes words aren't exact translations of each other. That's ok: even experienced bilinguals will sometimes disagree on the best alignment of a sentence. But while word alignment doesn't capture every nuance, it is very useful for learning translation models or bilingual dictionaries.

Getting Started

All you need to run the exercises is git and python (v. 2.7 recommended). To get started, run this command:

git clone git://github.com/alopez/dreamt.git

We have provided you with a very simple aligner written in Python and around two million words of the Canadian Hansards. The source code of the aligner is in a file called align under the directory aligner. The baseline algorithm works in two phases. First, it trains a model: the aligner observes the training sentences and assigns a score between 0 and 1 to each pair of an English word and a foreign word; the baseline simply computes Dice's coefficient for each pair. Second, it aligns all pairs of words whose score exceeds 0.5 (a sketch of this computation appears below). Run the baseline heuristic model on 1,000 sentences using the command:

align -n 1000 > dice.a

This runs the aligner and stores the output in dice.a. To view and score the alignments, run this command:

grade < dice.a

This command scores the alignment quality by comparing the output alignments against a set of human alignment annotations, using a metric called the alignment error rate, which balances precision and recall of the guessed alignments (paper). Look at the terrible output of this heuristic model: it's better than chance, but not by much. Try training on 10,000 sentences instead of 1,000 by specifying the change on the command line:

align -n 10000 | grade

Performance should improve, but only slightly! Another experiment you can try is changing the threshold criterion used by the aligner. How does this affect the alignment error rate?
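To make the baseline concrete, here is a minimal sketch of Dice-based alignment. It is not the provided align script: the bitext representation (a list of (foreign, English) token-list pairs) and the returned index-pair format are assumptions made for illustration.

from collections import defaultdict

def dice_align(bitext, threshold=0.5):
    """Align word pairs whose Dice coefficient exceeds `threshold`.

    `bitext` is a list of (foreign_words, english_words) sentence pairs.
    Returns, for each sentence pair, a list of (i, j) index pairs meaning
    "foreign word i is aligned to English word j".
    """
    f_count = defaultdict(int)   # number of sentence pairs containing foreign word f
    e_count = defaultdict(int)   # number of sentence pairs containing English word e
    fe_count = defaultdict(int)  # number of sentence pairs containing both f and e

    # Phase 1: collect co-occurrence counts over sentence pairs.
    for f_sent, e_sent in bitext:
        f_set, e_set = set(f_sent), set(e_sent)
        for f in f_set:
            f_count[f] += 1
            for e in e_set:
                fe_count[(f, e)] += 1
        for e in e_set:
            e_count[e] += 1

    # Dice's coefficient: 2 * (co-occurrence count) / (count of f + count of e).
    def dice(f, e):
        return 2.0 * fe_count[(f, e)] / (f_count[f] + e_count[e])

    # Phase 2: threshold the scores to produce alignments.
    alignments = []
    for f_sent, e_sent in bitext:
        pairs = [(i, j)
                 for i, f in enumerate(f_sent)
                 for j, e in enumerate(e_sent)
                 if dice(f, e) > threshold]
        alignments.append(pairs)
    return alignments

Lowering the threshold trades precision for recall, and Dice scores for rare word pairs are noisy; this is part of what the probabilistic model described in the next section addresses.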

It is a good idea to experiment with variations on the algorithm using a small number of sentences, but in order to compete you will need to align the entire dataset. Aligning the complete data will take several minutes! Your alignment of the full data should be saved in a file called result.out:

align > result.out

The Challenge

Your task for this assignment is to improve the alignment error rate as much as possible. It shouldn't be hard to improve the alignments at least somewhat: you've probably noticed that thresholding a Dice coefficient is a bad idea because alignments don't compete against one another. A good way to correct this is with a probabilistic model like IBM Model 1, which forces all of the English words in a sentence to compete as the explanation for each foreign word.

Formally, IBM Model 1 is a probabilistic model that generates each word of the foreign sentence \( {\bf f} \) independently, conditioned on some word in the English sentence \( {\bf e} \). The likelihood of a particular alignment \( {\bf a} \) of the foreign sentence therefore factors across words: \( P({\bf f}, {\bf a} \mid {\bf e}) = \prod_i P(a_i \mid |{\bf e}|) \times P(f_i \mid e_{a_i}) \). In Model 1, we fix \( P(a_i \mid |{\bf e}|) \) to be uniform (i.e. equal to \( \frac{1}{|{\bf e}|} \)), so the likelihood only depends on the conditional word translation parameters \( P(f \mid e) \).

The iterative EM update for this model is straightforward. For every pair of an English word type \( e \) and a French word type \( f \), you count up the expected (fractional) number of times tokens of \( f \) are aligned to tokens of \( e \) and normalize over values of \( f \). That gives you a new estimate of the translation probabilities \( P(f \mid e) \), which leads to new alignment posteriors, and so on. If you're still confused, it may help to read a more detailed explanation.
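If it helps to see the update spelled out, here is a rough sketch of Model 1 EM, using the same assumed bitext representation as the Dice sketch above. It omits the NULL English word and other refinements, and the iteration count is arbitrary; treat it as a starting point rather than the required solution.

from collections import defaultdict

def train_model1(bitext, iterations=5):
    """Estimate t(f|e) with EM for IBM Model 1 (no NULL word in this sketch)."""
    # Initialize t(f|e) uniformly over the foreign vocabulary.
    t = defaultdict(float)
    f_vocab = set(f for f_sent, _ in bitext for f in f_sent)
    uniform = 1.0 / len(f_vocab)
    for f_sent, e_sent in bitext:
        for f in f_sent:
            for e in e_sent:
                t[(f, e)] = uniform

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (f, e) links
        total = defaultdict(float)  # expected count of links to e
        # E-step: each foreign token distributes one unit of count over
        # the English tokens in its sentence, in proportion to t(f|e).
        for f_sent, e_sent in bitext:
            for f in f_sent:
                z = sum(t[(f, e)] for e in e_sent)
                for e in e_sent:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        # M-step: renormalize over f for each e to get new probabilities.
        for (f, e) in count:
            t[(f, e)] = count[(f, e)] / total[e]
    return t

def align_model1(bitext, t):
    """Align each foreign word to its most probable English word under t."""
    for f_sent, e_sent in bitext:
        pairs = []
        for i, f in enumerate(f_sent):
            j = max(range(len(e_sent)), key=lambda j: t[(f, e_sent[j])])
            pairs.append((i, j))
        yield pairs

Note that direction matters (modelling \( P(f \mid e) \) versus \( P(e \mid f) \)), and combining alignments from both directions, as suggested in the ideas below, usually helps.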

We recommend developing on a small data set (1,000 sentences) and running only a few iterations of EM. When you are finished, you should see some improvement in your alignments. But alignment isn't a solved problem, and the goal of this challenge isn't for you to simply implement a well-known algorithm (in fact, you are not required to implement Model 1 at all, but you should try to beat it). To do really well you'll need to experiment. Here are some ideas:

  • Implement an HMM aligner (paper).
  • Add a sparse prior to word alignment probabilities and learn the model using Bayesian inference (paper).
  • Train a French-English model and an English-French model, then combine their predictions in some way (paper).
  • Train a supervised discriminative feature-based aligner using the annotated development set (paper).
  • Train an unsupervised feature-based aligner using the annotated development set (paper).
  • Seek out additional inspiration.

But the sky's the limit! You can try anything you want. If you have any questions or you're confused about anything, just ask!