Joshua Baseline System - EMNLP 2011 Sixth Workshop on Statistical Machine Translation

EMNLP 2011 SIXTH WORKSHOP
ON STATISTICAL MACHINE TRANSLATION

Baseline System: Joshua

July 30 - 31, 2011
Edinburgh, UK

Joshua is an open-source MT system developed at Johns Hopkins University. It uses a hierarchical phrase-based translation model. What follows are step-by-step instructions. The list may look long at first glance, but it should make it straightforward to build a machine translation system and all its components, and it should make the process of tuning, testing, and evaluating the system transparent.

These instructions are adapted from Chris Callison-Burch's Joshua guide. More instructions and documentation for Thrax, the translation model extractor, can be found on its GitHub wiki.

If you have problems running this pipeline, please email jonny at cs dot jhu dot edu. Mention the WMT11 baseline in your subject line.

Installation

Joshua has the following requirements.

You will need Apache Ant.
You also need Swig to connect SRILM's C++ components to Joshua's Java components.
We use the Berkeley Aligner to do word-level alignment of parallel corpora.
Download SRILM and install it.
At this point you need to set some environment variables:
export SRILM=/path/to/srilm
export JAVA_HOME=/Library/Java/Home
The JAVA_HOME path above is for OS X; other operating systems use a different location.
Get the Joshua 1.3 tarball. You can install it with
tar xzf joshua.tar.gz
cd joshua
ant
Later commands refer to the Joshua directory as $JOSHUA, so set that variable as well:
export JOSHUA=/path/to/joshua
If ant returns successfully, the decoder is ready to use. To build translation models from the training data, however, we recommend using Thrax. (If you're also following Chris Callison-Burch's guide, use of Thrax replaces step 5.)
To install Thrax:
Download and unpack the Hadoop tarball (or get access to a Hadoop cluster)
wget http://apache.cs.utah.edu//hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz
tar -xzf hadoop-0.20.2.tar.gz
If you don't have a cluster, some basic Hadoop setup for standalone mode is here.
Download and unpack the Amazon Web Services SDK. This is a compilation requirement for Thrax, even though you don't necessarily have to use Amazon's cloud services to run it.
wget http://ds60ft5bv5jal.cloudfront.net/aws-java-sdk-1.1.3.zip
unzip aws-java-sdk-1.1.3.zip
A couple more environment variables:
export HADOOP=/path/to/hadoop
export AWS_SDK=/path/to/aws/sdk
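Compile Thrax. Run ant from inside the cloned directory. Later commands refer to that directory as $THRAX.
git clone https://github.com/jweese/thrax.git
cd thrax
ant
export THRAX=/path/to/thrax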
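Because the later steps depend on several environment variables and external tools, it can help to check for them before starting. The following is a minimal sketch, not part of the Joshua distribution: the function names missing_vars and missing_tools are hypothetical, and the variable and tool names passed to them are the ones used in this guide.

```shell
#!/bin/sh
# Preflight sketch (hypothetical helpers, not shipped with Joshua or Thrax).

# Print the name of every listed variable that is unset or empty.
missing_vars() {
    for name in "$@"; do
        # Indirect lookup: read the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        [ -n "$value" ] || printf '%s\n' "$name"
    done
}

# Print the name of every listed command that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example calls (left as comments so that sourcing this file has no side effects):
#   missing_vars SRILM JAVA_HOME JOSHUA HADOOP AWS_SDK THRAX
#   missing_tools ant swig java git
```

Both functions print nothing when everything is in place, so empty output means the check passed.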

Install Additional Scripts

Download scripts.tgz and extract them:
tar xzf scripts.tgz
These scripts include:
- Tokenizer scripts/tokenizer.perl
- Lowercaser scripts/lowercase.perl
- SGML-Wrapper scripts/wrap-xml.perl

Prepare Data

Tokenize training data
mkdir -p working-dir/corpus
scripts/tokenizer.perl -l fr < wmt08/training/europarl-v3.fr-en.fr > working-dir/corpus/europarl.tok.fr
scripts/tokenizer.perl -l en < wmt08/training/europarl-v3.fr-en.en > working-dir/corpus/europarl.tok.en
Lowercase training data
scripts/lowercase.perl < working-dir/corpus/europarl.tok.fr > working-dir/corpus/europarl.lowercased.fr
scripts/lowercase.perl < working-dir/corpus/europarl.tok.en > working-dir/corpus/europarl.lowercased.en
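A parallel corpus is only usable if line N of one file translates line N of the other, so the tokenized and lowercased files on both sides should have the same number of lines. The sketch below checks this; same_line_count is a hypothetical helper name, not one of the provided scripts.

```shell
#!/bin/sh
# Sketch: confirm that two files have the same number of lines.
# same_line_count is a hypothetical helper, not part of scripts.tgz.
same_line_count() {
    # Exit status 2 means one of the files could not be read.
    [ -r "$1" ] && [ -r "$2" ] || return 2
    # Some wc implementations pad the count with spaces, so strip them.
    a=$(wc -l < "$1" | tr -d ' ')
    b=$(wc -l < "$2" | tr -d ' ')
    if [ "$a" -ne "$b" ]; then
        echo "line count mismatch: $1 has $a, $2 has $b" >&2
        return 1
    fi
    return 0
}
```

For example, `same_line_count working-dir/corpus/europarl.lowercased.fr working-dir/corpus/europarl.lowercased.en` returns 0 when the two sides match and prints a message to standard error otherwise.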

Align Parallel Corpus

We give Berkeley instructions here; GIZA++ could also be used.

Write a configuration file called word-align.conf (example here)
mkdir -p example/test
java -d64 -Xmx10g -jar /path/to/aligner/berkeleyaligner.jar ++word-align.conf
cp working-dir/alignments/europarl.align working-dir/corpus/europarl.fr-en.alignments
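Before moving on, it may be worth checking that the alignment file looks well formed. The sketch below assumes the aligner writes one line per sentence pair, each a space-separated list of index pairs such as 0-0 1-2; that format is an assumption to confirm against your own output, and bad_alignment_lines is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: print the line numbers of alignment lines that are not a
# space-separated list of INDEX-INDEX pairs. Empty lines are accepted.
# Assumes the "0-0 1-2 ..." format; bad_alignment_lines is hypothetical.
bad_alignment_lines() {
    grep -n -v -E '^([0-9]+-[0-9]+( [0-9]+-[0-9]+)*)? *$' "$1" | cut -d: -f1
}
```

Empty output from `bad_alignment_lines working-dir/corpus/europarl.fr-en.alignments` means every line matched the assumed format.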

Build Language Model

Tokenize English language model data
mkdir -p working-dir/lm
scripts/tokenizer.perl -l en < wmt08/training/europarl-v3.en > working-dir/lm/europarl.tok
Lowercase language model data
scripts/lowercase.perl < working-dir/lm/europarl.tok > working-dir/lm/europarl.lowercased
Use SRILM to build language model
SRILM creates a platform-specific folder within its bin directory; the command below assumes i686.
/path-to-srilm/bin/i686/ngram-count -order 5 -interpolate -kndiscount -text working-dir/lm/europarl.lowercased -lm working-dir/lm/europarl.lm
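Since the platform folder differs between machines, one option is to search for the binary under $SRILM/bin rather than hard-coding the folder name. The following is a sketch under the assumption that $SRILM is set as described under Installation; find_srilm_tool is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: locate an SRILM binary without hard-coding the platform folder.
# find_srilm_tool is a hypothetical helper; it relies on $SRILM being set.
find_srilm_tool() {
    for candidate in "$SRILM"/bin/*/"$1"; do
        # Accept the first match that is an executable regular file.
        if [ -x "$candidate" ] && [ ! -d "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    echo "no executable $1 found under $SRILM/bin" >&2
    return 1
}
```

The result can stand in for the hard-coded path, for example `"$(find_srilm_tool ngram-count)" -order 5 -interpolate -kndiscount -text working-dir/lm/europarl.lowercased -lm working-dir/lm/europarl.lm`.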

Train Translation Model

This example will build a Hiero-style translation model.

Combine source, target, and alignments into one file:
paste working-dir/corpus/europarl.lowercased.fr working-dir/corpus/europarl.lowercased.en working-dir/corpus/europarl.fr-en.alignments | perl -pe 's/\t/ ||| /g' >working-dir/corpus/europarl.unified
Write a Thrax configuration file (example here, specification here)
Extract:
hadoop jar $THRAX/bin/thrax.jar thrax.conf europarl
hadoop fs -getmerge europarl working-dir/corpus/grammar
Build glue grammar
$THRAX/scripts/create_glue_grammar.sh thrax.conf <working-dir/corpus/grammar >working-dir/corpus/glue.grammar
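The unified file built in the first step of this section should contain exactly three fields per line (source, target, alignment) separated by " ||| ". A blank line in any of the three inputs shows up as an empty field. The sketch below flags such lines; bad_unified_lines is a hypothetical helper name, and treating an empty field as an error is a choice made for this check rather than a documented Thrax requirement.

```shell
#!/bin/sh
# Sketch: print the line numbers of lines that do not split into exactly
# three " ||| "-separated fields, or that have an empty field.
# bad_unified_lines is a hypothetical helper, not part of Thrax.
bad_unified_lines() {
    awk -F' [|][|][|] ' \
        'NF != 3 || $1 == "" || $2 == "" || $3 == "" { print NR }' "$1"
}
```

Running `bad_unified_lines working-dir/corpus/europarl.unified` before extraction prints nothing when every line has the expected shape.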

Tuning (i.e., Optimize System Component Weights, a.k.a. Minimum Error Rate Training)

Tokenize tuning sets
mkdir -p working-dir/tuning
scripts/tokenizer.perl -l fr < wmt08/dev/dev2006.fr > working-dir/tuning/input.tok
scripts/tokenizer.perl -l en < wmt08/dev/dev2006.en > working-dir/tuning/reference.tok
Lowercase tuning sets
scripts/lowercase.perl < working-dir/tuning/input.tok > working-dir/tuning/input
scripts/lowercase.perl < working-dir/tuning/reference.tok > working-dir/tuning/reference
Filter translation model
$THRAX/scripts/filter_rules.sh 10 working-dir/tuning/input <working-dir/corpus/grammar >working-dir/corpus/grammar.dev2006
Create ZMERT configuration file (here), parameters file (here), decoder command (here), and joshua configuration file (here) in folder working-dir/mert
Run tuning script
Note that this step can take many hours, even days, to run.
java -cp $JOSHUA/bin joshua.zmert.ZMERT -maxMem 1500 mert/mert.config
Copy final joshua configuration file
mkdir -p working-dir/evaluation
cp mert/joshua.config.ZMERT.final working-dir/evaluation/joshua.config
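Because the tuning run can last hours or days, it can be useful to keep a log with start and end times and the exit status. The following is one possible wrapper, offered as a sketch; run_logged is a hypothetical helper and is not part of Joshua or ZMERT.

```shell
#!/bin/sh
# Sketch: run a command, appending its output plus start time, end time,
# and exit status to a log file. run_logged is a hypothetical helper.
# Usage: run_logged LOGFILE COMMAND [ARGS...]
run_logged() {
    log=$1
    shift
    echo "started: $(date)" >> "$log"
    # Capture the exit status without aborting under "set -e".
    "$@" >> "$log" 2>&1 && status=0 || status=$?
    echo "finished: $(date) (exit status $status)" >> "$log"
    return "$status"
}
```

For instance, `run_logged zmert.log java -cp $JOSHUA/bin joshua.zmert.ZMERT -maxMem 1500 mert/mert.config &` would run the tuning command from this section in the background; the log file name zmert.log is only an example.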

Run System on Development Test Set

Tokenize test set
mkdir -p working-dir/evaluation
scripts/tokenizer.perl -l fr < wmt08/devtest/devtest2006.fr > working-dir/evaluation/devtest2006.input.tok
scripts/tokenizer.perl -l en < wmt08/devtest/devtest2006.en > working-dir/evaluation/devtest2006.reference.tok
Lowercase test set
scripts/lowercase.perl < working-dir/evaluation/devtest2006.input.tok > working-dir/evaluation/devtest2006.input
scripts/lowercase.perl < working-dir/evaluation/devtest2006.reference.tok > working-dir/evaluation/devtest2006.reference
Filter the model to fit into memory
$THRAX/scripts/filter_rules.sh 10 working-dir/evaluation/devtest2006.input <working-dir/corpus/grammar >working-dir/corpus/grammar.devtest2006
Change translation model in working-dir/evaluation/joshua.config
tm_file=working-dir/corpus/grammar.devtest2006
Decode with Joshua
java -Xmx1g -cp $JOSHUA/bin -Djava.library.path=$JOSHUA/lib -Dfile.encoding=utf8 joshua.decoder.JoshuaDecoder working-dir/evaluation/joshua.config working-dir/evaluation/devtest2006.input working-dir/evaluation/devtest2006.output
Extract the 1-best candidates:
java -cp $JOSHUA/bin -Dfile.encoding=utf8 joshua.util.ExtractTopCand working-dir/evaluation/devtest2006.output working-dir/evaluation/devtest2006.output.1best
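The step that changes the translation model in joshua.config can be done by hand in an editor, or scripted. The sketch below assumes the entry is written as tm_file=PATH at the start of a line, as shown in that step; set_tm_file is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: point the tm_file entry of a Joshua configuration at a new grammar.
# Assumes the entry has the form "tm_file=PATH" at the start of a line.
# set_tm_file is a hypothetical helper. Usage: set_tm_file CONFIG GRAMMAR
set_tm_file() {
    config=$1
    grammar=$2
    tmp=$(mktemp) || return 1
    # Pass the path through the environment so awk does not interpret
    # backslash escapes in it; all other lines are copied unchanged.
    GRAMMAR="$grammar" awk '
        /^tm_file=/ { print "tm_file=" ENVIRON["GRAMMAR"]; next }
        { print }
    ' "$config" > "$tmp" || { rm -f "$tmp"; return 1; }
    # Write back in place so the file keeps its original permissions.
    cat "$tmp" > "$config"
    rm -f "$tmp"
}
```

For example, `set_tm_file working-dir/evaluation/joshua.config working-dir/corpus/grammar.devtest2006` makes the same change as the manual edit described above.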

Evaluation

Recase the output
- Train true case LM:
  $SRILM/bin/macosx64/ngram-count -unk -order 5 -kndiscount1 -kndiscount2 -kndiscount3 -kndiscount4 -kndiscount5 -text working-dir/lm/europarl.tok -lm working-dir/lm/training.TrueCase.5gram.lm
- Create true case map (perl script here):
  perl truecase-map.perl <working-dir/lm/europarl.tok >working-dir/lm/true-case.map
- Disambiguate (strip-sent-tags.perl):
  $SRILM/bin/macosx64/disambig -lm working-dir/lm/training.TrueCase.5gram.lm -keep-unk -order 5 -map working-dir/lm/true-case.map -text working-dir/evaluation/devtest2006.output.1best | perl strip-sent-tags.perl > working-dir/evaluation/devtest2006.output.recased
Detokenize the output
scripts/detokenizer.perl -l en < working-dir/evaluation/devtest2006.output.recased > working-dir/evaluation/devtest2006.output.detokenized
Wrap the output in SGML
scripts/wrap-xml.perl wmt08/devtest/devtest2006-ref.en.sgm en < working-dir/evaluation/devtest2006.output.detokenized > working-dir/evaluation/devtest2006.output.sgm
Score with NIST BLEU scoring tool
mteval-v11b.pl -r wmt08/devtest/devtest2006-ref.en.sgm -t working-dir/evaluation/devtest2006.output.sgm -s wmt08/devtest/devtest2006-src.fr.sgm -c

supported by the EuroMatrixPlus project
P7-IST-231720-STP
funded by the European Commission
under Framework Programme 7
