A very active community is engaged in statistical machine translation research, which has produced a number of tools that may be useful for training a Moses system. In addition, the more linguistically motivated models (factored model, syntax model) require tools for the linguistic annotation of corpora.
In this section, we list some useful tools. If you know (or are the developer of) anything we missed here, please contact us and we can add it to the list. For more comprehensive listings of MT tools, refer to the following pages:
The BerkeleyAligner (available at Sourceforge) is a word alignment software package that implements recent innovations in unsupervised word alignment. It is implemented in Java and distributed in compiled format.
mkdir /my/installation/dir
cd /my/installation/dir
wget http://berkeleyaligner.googlecode.com/files/berkeleyaligner_unsupervised-2.1.tar.gz
tar xzf berkeleyaligner_unsupervised-2.1.tar.gz
cd berkeleyaligner
chmod +x align
./align example.conf
MGIZA was developed by Qin Gao. It is a multi-threaded implementation of the popular GIZA++ word alignment toolkit, designed to run on multi-core machines. Check the web site for more recent versions.
git clone https://github.com/moses-smt/mgiza.git
cd mgiza/mgizapp
cmake .
make
make install
Compiling MGIZA requires the Boost library. If your Boost library is in a non-system directory, use the script
to compile MGIZA.
The MGIZA binary and the script merge_alignment.py need to be copied into the binary directory where Moses looks for word alignment tools. These are the exact commands I use to copy MGIZA to its final destination:
export BINDIR=~/workspace/bin/training-tools
cp bin/* $BINDIR/mgizapp
cp scripts/merge_alignment.py $BINDIR
MGIZA works with the training script train-model.perl. You indicate its use (as opposed to regular GIZA++) with the switch -mgiza. The switch -mgiza-cpus NUMBER allows you to specify the number of CPUs.
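As a rough sketch of how these two switches fit together, the shell function below assembles the MGIZA-related part of a train-model.perl call and prints it instead of running anything. The function name, the default of 4 CPUs, and the placeholder for the remaining training options are assumptions made for illustration; only -mgiza and -mgiza-cpus come from the description above.

```shell
#!/bin/sh
# Sketch only: prints the MGIZA-related switches for train-model.perl.
# mgiza_switches and its default CPU count are hypothetical helpers.

mgiza_switches() {
    # $1: number of CPUs MGIZA should use (assumed default: 4)
    cpus="${1:-4}"
    printf '%s %s %s\n' '-mgiza' '-mgiza-cpus' "$cpus"
}

# Dry run: show the command line instead of executing it.
echo "train-model.perl [your usual training options] $(mgiza_switches 8)"
```

Printing the command first makes it easy to check the switches before starting a long training run.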
mkdir /my/installation/dir
cd /my/installation/dir
git clone https://github.com/clab/fast_align.git
cd fast_align
make
Anymalign is a multilingual sub-sentential aligner. It can extract lexical equivalences from sentence-aligned parallel corpora. Its main advantage over similar tools is that it can align any number of languages simultaneously. The details are described in Lardilleux and Lepage (2009). To help in understanding the algorithm, a pure Python implementation is available in minimalign.py, but it is advisable to use the main implementation for realistic usage.
mkdir /your/installation/dir
cd /your/installation/dir
wget https://anymalign.limsi.fr/latest/anymalign2.5.zip
unzip anymalign2.5.zip
Translation Error Rate is an error metric for machine translation that measures the number of edits required to change a system output into one of the references. It is implemented in Java.
mkdir /my/installation/dir
cd /my/installation/dir
wget http://www.cs.umd.edu/~snover/tercom/tercom-0.7.25.tgz
tar xzf tercom-0.7.25.tgz
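The edit-counting idea behind Translation Error Rate can be illustrated with a small, simplified sketch. It is not the tercom implementation: it assumes a single reference, counts only word insertions, deletions, and substitutions (the full metric also allows shifts of word sequences), and divides by the reference length. The function name word_edit_rate is made up for this example.

```shell
#!/bin/sh
# Simplified, TER-like score: word-level edit distance divided by the
# reference length. Shifts and multiple references are NOT handled here.

word_edit_rate() {
    # $1: system output (hypothesis), $2: reference translation
    awk -v hyp="$1" -v ref="$2" 'BEGIN {
        n = split(hyp, h, " ")
        m = split(ref, r, " ")
        # d[i, j] = edits needed to turn the first i hypothesis words
        # into the first j reference words
        for (i = 0; i <= n; i++) d[i, 0] = i
        for (j = 0; j <= m; j++) d[0, j] = j
        for (i = 1; i <= n; i++) {
            for (j = 1; j <= m; j++) {
                # force a string comparison of the two words
                cost = ((h[i] "") == (r[j] "")) ? 0 : 1
                best = d[i - 1, j - 1] + cost
                if (d[i - 1, j] + 1 < best) best = d[i - 1, j] + 1
                if (d[i, j - 1] + 1 < best) best = d[i, j - 1] + 1
                d[i, j] = best
            }
        }
        rate = (m > 0) ? d[n, m] / m : 0
        # output: number of edits, reference length, edits per reference word
        printf "%d %d %.3f\n", d[n, m], m, rate
    }'
}

word_edit_rate "the cat sat on mat" "the cat sat on the mat"
```

For real evaluations, use the tercom package installed above, which implements the complete metric.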
METEOR is a metric that includes stemmed and synonym matches when measuring the similarity between system output and human reference translations.
mkdir /my/installation/dir
cd /my/installation/dir
wget http://www.cs.cmu.edu/~alavie/METEOR/install-meteor-1.0.sh
sh install-meteor-1.0.sh
MXPOST was developed by Adwait Ratnaparkhi as part of his PhD thesis. It is a Java implementation of a maximum entropy model and is distributed as compiled code. It can be trained for any language for which annotated POS data exists.
mkdir /your/installation/dir
cd /your/installation/dir
wget ftp://ftp.cis.upenn.edu/pub/adwait/jmx/jmx.tar.gz
tar xzf jmx.tar.gz
echo '#!/bin/ksh' > mxpost
echo 'export CLASSPATH=/your/installation/dir/mxpost.jar' >> mxpost
echo 'java -mx30m tagger.TestTagger /your/installation/dir/tagger.project' >> mxpost
chmod +x mxpost
echo 'This is a test .' | ./mxpost
scripts/training/wrappers/make-factor-en-pos.mxpost.perl is a wrapper script to create factors for a factored translation model. You have to adapt the definition of $MXPOST to point to your installation directory.
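To give an idea of the kind of conversion involved in producing factors, here is a minimal sketch. It is not the wrapper script itself. It assumes the tagger writes each token as word_TAG and that the target is the word|TAG notation used for factored corpora; the function name tagged_to_factors is hypothetical.

```shell
#!/bin/sh
# Sketch: turn "word_TAG" tokens into "word|TAG" factors.
# Assumes one sentence per line with tokens separated by whitespace.

tagged_to_factors() {
    awk '{
        for (i = 1; i <= NF; i++) {
            # locate the last underscore, so words containing "_" survive
            pos = match($i, /_[^_]*$/)
            if (pos > 0)
                $i = substr($i, 1, pos - 1) "|" substr($i, pos + 1)
        }
        print
    }'
}

printf 'This_DT is_VBZ a_DT test_NN ._.\n' | tagged_to_factors
```

Splitting on the last underscore rather than the first keeps tokens such as a_b_NN intact as the word a_b with the tag NN.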
TreeTagger is a tool for annotating text with part-of-speech and lemma information.
Installation (Linux, check web site for other platforms):
mkdir /my/installation/dir
cd /my/installation/dir
wget ftp://ftp.ims.uni-stuttgart.de/pub/corpora/tree-tagger-linux-3.2.tar.gz
wget ftp://ftp.ims.uni-stuttgart.de/pub/corpora/tagger-scripts.tar.gz
wget ftp://ftp.ims.uni-stuttgart.de/pub/corpora/install-tagger.sh
wget ftp://ftp.ims.uni-stuttgart.de/pub/corpora/english-par-linux-3.1.bin.gz
wget ftp://ftp.ims.uni-stuttgart.de/pub/corpora/french-par-linux-3.2-utf8.bin.gz
wget ftp://ftp.ims.uni-stuttgart.de/pub/corpora/spanish-par-linux-3.1.bin.gz
wget ftp://ftp.ims.uni-stuttgart.de/pub/corpora/german-par-linux-3.2.bin.gz
wget ftp://ftp.ims.uni-stuttgart.de/pub/corpora/italian-par-linux-3.2-utf8.bin.gz
wget ftp://ftp.ims.uni-stuttgart.de/pub/corpora/dutch-par-linux-3.1.bin.gz
wget ftp://ftp.ims.uni-stuttgart.de/pub/corpora/bulgarian-par-linux-3.1.bin.gz
wget ftp://ftp.ims.uni-stuttgart.de/pub/corpora/greek-par-linux-3.2.bin.gz
sh install-tagger.sh
The wrapper script scripts/training/wrappers/make-pos.tree-tagger.perl creates part-of-speech factors using TreeTagger in the format expected by Moses. The command has the required parameters -tree-tagger DIR to specify the location of your installation and -l LANGUAGE to specify the two-letter code for the language (fr, ...). Optional parameters are -basic to output only basic part-of-speech tags (VER instead of VER:simp; not available for all languages), and --stem to output stems instead of part-of-speech tags.
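The difference between a full tag and a basic tag can be sketched in a few lines of shell. This is an illustration only, under the assumption that the basic tag is whatever precedes the first colon (as in VER versus VER:simp); the helper name basic_tag is made up and is not part of the wrapper.

```shell
#!/bin/sh
# Sketch: reduce a fine-grained tag such as VER:simp to its basic tag VER.
# Assumption: the basic tag is the part before the first colon.

basic_tag() {
    # ${1%%:*} removes the longest suffix that starts with a colon
    printf '%s\n' "${1%%:*}"
}

basic_tag "VER:simp"
basic_tag "NOM"
```

Tags without a colon pass through unchanged, so the helper can be applied to every tag regardless of language.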
TreeTagger can also shallow parse the sentence, labelling it with chunk tags. See the TreeTagger web site for details.
FreeLing is a set of tokenizers, morphological analyzers, syntactic parsers, and other language tools for Asturian, Catalan, English, Galician, Italian, Portuguese, Russian, Spanish, and Welsh.
Michael Collins developed one of the first statistical parsers as part of his PhD thesis. It is implemented in C.
mkdir /your/installation/dir
cd /your/installation/dir
wget http://people.csail.mit.edu/mcollins/PARSER.tar.gz
tar xzf PARSER.tar.gz
cd COLLINS-PARSER/code
make
The Collins parser also requires the installation of MXPOST. A wrapper file to generate parse trees in the format required to train syntax models with Moses is provided in
Helmut Schmid developed BitPar, a parser for highly ambiguous probabilistic context-free grammars (such as treebank grammars). BitPar uses bit-vector operations to speed up the basic parsing operations by parallelization. It is implemented in C and distributed as compiled code.
mkdir /your/installation/dir
cd /your/installation/dir
wget ftp://ftp.ims.uni-stuttgart.de/pub/corpora/BitPar/BitPar.tar.gz
tar xzf BitPar.tar.gz
cd BitPar/src
make
cd ../..
You will also need the parsing model for German which was trained on the Tiger treebank:
wget ftp://ftp.ims.uni-stuttgart.de/pub/corpora/BitPar/GermanParser.tar.gz
tar xzf GermanParser.tar.gz
cd GermanParser/src
make
cd ../..
There is also an English parsing model.
LoPar is an implementation of a parser for head-lexicalized probabilistic context-free grammars, which can also be used for morphological analysis. The program is distributed without source code.
mkdir /my/installation/dir
cd /my/installation/dir
wget ftp://ftp.ims.uni-stuttgart.de/pub/corpora/LoPar/lopar-3.0.linux.tar.gz
tar xzf lopar-3.0.linux.tar.gz
cd LoPar-3.0
The Berkeley Parser is a phrase structure grammar parser implemented in Java and distributed as open source. Models are provided for English, Bulgarian, Arabic, Chinese, French, and German.
Joshua is a machine translation decoder for hierarchical models. Joshua development is centered at the Center for Language and Speech Processing at the Johns Hopkins University in Baltimore, Maryland. It is implemented in Java.
Cdec is a decoder, aligner, and learning framework for statistical machine translation and other structured prediction models, written by Chris Dyer at the University of Maryland Department of Linguistics. It is written in C++.
Apertium is an open source rule-based machine translation (RBMT) system, maintained principally by the University of Alicante and Prompsit Engineering.
Docent is a decoder for phrase-based SMT that treats complete documents, rather than single sentences, as translation units and permits the inclusion of features with cross-sentence dependencies. It is developed by Christian Hardmeier and implemented in C++.
Phrasal is a phrase-based SMT toolkit written in Java. http://www-nlp.stanford.edu/wiki/Software/Phrasal2
COSTA MT Evaluation Tool is an open-source Java program for manually evaluating the quality of MT output. It is simple to use and designed to let potential MT users and developers analyse their engines in a friendly environment. It enables segment-by-segment ranking of the quality of MT output for a particular language pair.
Appraise is an open-source tool for manual evaluation of machine translation output. Appraise allows the collection of human judgments on translation output, implementing annotation tasks such as translation quality checking, ranking of translations, error classification, and manual post-editing. It is used in the ACL WMT evaluation campaign.
Python-based libraries for common text processing and natural language processing in Indian languages. Indian languages share many similarities in script, phonology, syntax, etc., and this library is an attempt to provide a general solution for very commonly required toolsets for Indian language text.
The library provides the following functionalities: text normalization, transliteration, tokenization, and morphological analysis.