Moses statistical machine translation system

External Tools

A very active community is engaged in statistical machine translation research, which has produced a number of tools that may be useful for training a Moses system. Also, the more linguistically motivated models (factored model, syntax model) require tools for the linguistic annotation of corpora.

In this section, we list some useful tools. If you know (or are the developer of) anything we missed here, please contact us and we can add it to the list. For more comprehensive listings of MT tools, refer to the following pages:

Word Alignment Tools

Berkeley Word Aligner

The BerkeleyAligner (available at Sourceforge) is a word alignment software package that implements recent innovations in unsupervised word alignment. It is implemented in Java and distributed in compiled format.

Installation:

 mkdir /my/installation/dir
 cd /my/installation/dir
 wget http://berkeleyaligner.googlecode.com/files/berkeleyaligner_unsupervised-2.1.tar.gz
 tar xzf berkeleyaligner_unsupervised-2.1.tar.gz

Test:

 cd berkeleyaligner
 chmod +x align 
 ./align example.conf

Multi-threaded GIZA++

MGIZA was developed by Qin Gao. It is an implementation of the popular GIZA++ word alignment toolkit that runs multi-threaded on multi-core machines. Check the web site for more recent versions.

Installation:

   git clone https://github.com/moses-smt/mgiza.git
   cd mgiza/mgizapp
   cmake .
   make
   make install

Compiling MGIZA requires the Boost library. If your Boost libraries are in a non-system directory, use the script

   manual-compile/compile.sh

to compile MGIZA.

The MGIZA binary and the script merge_alignment.py need to be copied to the binary directory in which Moses looks for word alignment tools. These are the exact commands I use to copy MGIZA to its final destination:

  export BINDIR=~/workspace/bin/training-tools
  cp bin/* $BINDIR/mgizapp
  cp scripts/merge_alignment.py $BINDIR

MGIZA works with the training script train-model.perl. You indicate its use (as opposed to regular GIZA++) with the switch -mgiza. The switch -mgiza-cpus NUMBER allows you to specify the number of CPUs.
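
As a rough sketch of how these switches fit into a training call, the helper below only assembles and prints a train-model.perl command line; it does not run anything. The function name, corpus stem, language codes, and directories are placeholders chosen for illustration, and the options other than -mgiza and -mgiza-cpus (-root-dir, -corpus, -f, -e, -external-bin-dir) should be checked against your Moses version.

```shell
# Hypothetical helper: prints (does not run) a train-model.perl call with MGIZA.
# Corpus stem, language codes and directories are placeholders.
mgiza_train_cmd() {
  bindir=$1   # directory holding the MGIZA binaries and merge_alignment.py
  cpus=$2     # value passed to -mgiza-cpus
  echo "train-model.perl -root-dir train -corpus corpus/train -f fr -e en" \
       "-external-bin-dir $bindir -mgiza -mgiza-cpus $cpus"
}

mgiza_train_cmd "$HOME/workspace/bin/training-tools" 4
```

Printing the command first makes it easy to confirm that the binary directory matches the one MGIZA was copied to before starting a long training run.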

Dyer et al.'s Fast Align

Fast Align is a comparatively fast unsupervised word aligner that nevertheless gives results comparable to GIZA++. Its details are described in a NAACL 2013 paper.

Installation:

 mkdir /my/installation/dir
 cd /my/installation/dir
 git clone https://github.com/clab/fast_align.git
 cd fast_align
 make
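
Fast Align reads a single file in which each line holds a source sentence and its translation separated by " ||| ". The sketch below shows one way to build such a file from two sentence-aligned files; the function name and file names are hypothetical, and the commented-out invocation uses the -i, -d, -o and -v options as described in the Fast Align README, so verify them (and the location of the binary) for the version you built.

```shell
# Join two sentence-aligned files line by line into "source ||| target".
# The function name and the file names used below are placeholders.
to_fast_align_format() {
  paste "$1" "$2" | awk -F '\t' '{ print $1 " ||| " $2 }'
}

# Typical use (not run here; corpus.fr and corpus.en are hypothetical files):
#   to_fast_align_format corpus.fr corpus.en > corpus.fr-en
#   ./fast_align -i corpus.fr-en -d -o -v > forward.align
```

The helper assumes that neither input file contains tab characters, since paste uses a tab to join the two sides.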

Anymalign

Anymalign is a multilingual sub-sentential aligner. It can extract lexical equivalences from sentence-aligned parallel corpora. Its main advantage over other similar tools is that it can align any number of languages simultaneously. The details are described in Lardilleux and Lepage (2009). To understand the algorithm, a pure Python implementation can be found in minimalign.py, but it is advisable to use the main implementation for realistic usage.

Installation:

 mkdir /your/installation/dir
 cd /your/installation/dir
 wget https://anymalign.limsi.fr/latest/anymalign2.5.zip
 unzip anymalign2.5.zip

Evaluation Metrics

Translation Error Rate (TER)

Translation Error Rate is an error metric for machine translation that measures the number of edits required to change a system output into one of the references. It is implemented in Java.
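
To make the idea of counting edits concrete, here is a simplified sketch. It is not the tercom implementation: it computes a plain word-level edit distance (insertions, deletions, substitutions) against a single reference and divides by the reference length, and it leaves out the block shifts that TER also counts as edits. The function name is made up for this example.

```shell
# Simplified illustration only: word-level edit distance to ONE reference,
# divided by the reference length. Real TER additionally allows block shifts.
word_edit_rate() {
  LC_ALL=C awk -v hyp="$1" -v ref="$2" 'BEGIN {
    n = split(hyp, h, " "); m = split(ref, r, " ")
    for (i = 0; i <= n; i++) d[i, 0] = i      # delete all hypothesis words
    for (j = 0; j <= m; j++) d[0, j] = j      # insert all reference words
    for (i = 1; i <= n; i++)
      for (j = 1; j <= m; j++) {
        cost = (h[i] == r[j]) ? 0 : 1         # match or substitution
        best = d[i-1, j-1] + cost
        if (d[i-1, j] + 1 < best) best = d[i-1, j] + 1   # deletion
        if (d[i, j-1] + 1 < best) best = d[i, j-1] + 1   # insertion
        d[i, j] = best
      }
    rate = (m > 0) ? d[n, m] / m : 0          # avoid division by zero
    printf "%d %.3f\n", d[n, m], rate         # edits, then edits per reference word
  }'
}

word_edit_rate "the cat sat" "the cat sat down"
```

The call above prints the number of edits followed by the rate; here one inserted word against a four-word reference gives `1 0.250`.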

Installation:

 mkdir /my/installation/dir
 cd /my/installation/dir 
 wget http://www.cs.umd.edu/~snover/tercom/tercom-0.7.25.tgz
 tar xzf tercom-0.7.25.tgz

METEOR

METEOR is a metric that includes stemmed and synonym matches when measuring the similarity between system output and human reference translations.

Installation:

 mkdir /my/installation/dir
 cd /my/installation/dir 
 wget http://www.cs.cmu.edu/~alavie/METEOR/install-meteor-1.0.sh
 sh install-meteor-1.0.sh

RIBES

RIBES is a word rank-based metric that compares the ratio of contiguous and discontiguous word pairs between the system output and human translations.

Installation:

 # First download from http://www.kecl.ntt.co.jp/icl/lirg/ribes/ 
 # (you need to agree to the free license, so there is no direct URL)
 tar -xvzf RIBES-1.03.1.tar.gz 
 cd RIBES-1.03.1/
 python RIBES.py --help

Part-of-Speech Taggers

MXPOST (English)

MXPOST was developed by Adwait Ratnaparkhi as part of his PhD thesis. It is a Java implementation of a maximum entropy model and is distributed as compiled code. It can be trained for any language for which annotated POS data exists.

Installation:

 mkdir /your/installation/dir
 cd /your/installation/dir
 wget ftp://ftp.cis.upenn.edu/pub/adwait/jmx/jmx.tar.gz
 tar xzf jmx.tar.gz 
 echo '#!/usr/bin/env bash' > mxpost
 echo 'export CLASSPATH=/your/installation/dir/mxpost.jar' >> mxpost
 echo 'java -mx30m tagger.TestTagger /your/installation/dir/tagger.project' >> mxpost
 chmod +x mxpost

Test:

 echo 'This is a test .' | ./mxpost

The script scripts/training/wrappers/make-factor-en-pos.mxpost.perl is a wrapper script to create factors for a factored translation model. You have to adapt the definition of $MXPOST to point to your installation directory.

TreeTagger (English, French, Spanish, German, Italian, Dutch, Bulgarian, Greek)

TreeTagger is a tool for annotating text with part-of-speech and lemma information.

Installation (Linux, check web site for other platforms):

 mkdir /my/installation/dir
 cd /my/installation/dir
 wget ftp://ftp.ims.uni-stuttgart.de/pub/corpora/tree-tagger-linux-3.2.tar.gz
 wget ftp://ftp.ims.uni-stuttgart.de/pub/corpora/tagger-scripts.tar.gz
 wget ftp://ftp.ims.uni-stuttgart.de/pub/corpora/install-tagger.sh
 wget ftp://ftp.ims.uni-stuttgart.de/pub/corpora/english-par-linux-3.1.bin.gz
 wget ftp://ftp.ims.uni-stuttgart.de/pub/corpora/french-par-linux-3.2-utf8.bin.gz
 wget ftp://ftp.ims.uni-stuttgart.de/pub/corpora/spanish-par-linux-3.1.bin.gz
 wget ftp://ftp.ims.uni-stuttgart.de/pub/corpora/german-par-linux-3.2.bin.gz
 wget ftp://ftp.ims.uni-stuttgart.de/pub/corpora/italian-par-linux-3.2-utf8.bin.gz
 wget ftp://ftp.ims.uni-stuttgart.de/pub/corpora/dutch-par-linux-3.1.bin.gz
 wget ftp://ftp.ims.uni-stuttgart.de/pub/corpora/bulgarian-par-linux-3.1.bin.gz
 wget ftp://ftp.ims.uni-stuttgart.de/pub/corpora/greek-par-linux-3.2.bin.gz
 sh install-tagger.sh

The wrapper script scripts/training/wrappers/make-pos.tree-tagger.perl creates part-of-speech factors using TreeTagger in the format expected by Moses. The command has the required parameters -tree-tagger DIR to specify the location of your installation and -l LANGUAGE to specify the two-letter code for the language (de, fr, ...). Optional parameters are -basic to output only basic part-of-speech tags (VER instead of VER:simp -- not available for all languages), and --stem to output stems instead of part-of-speech tags.

TreeTagger can also shallow parse the sentence, labelling it with chunk tags. See the TreeTagger website for details.

FreeLing

FreeLing is a set of tokenizers, morphological analyzers, syntactic parsers, and other language tools for Asturian, Catalan, English, Galician, Italian, Portuguese, Russian, Spanish, and Welsh.

Syntactic Parsers

Collins (English)

Michael Collins developed the first statistical parser as part of his PhD thesis. It is implemented in C.

Installation:

 mkdir /your/installation/dir
 cd /your/installation/dir
 wget http://people.csail.mit.edu/mcollins/PARSER.tar.gz
 tar xzf PARSER.tar.gz
 cd COLLINS-PARSER/code
 make

The Collins parser also requires the installation of MXPOST. A wrapper file to generate parse trees in the format required to train syntax models with Moses is provided in scripts/training/wrappers/parse-en-collins.perl.

BitPar (German, English)

Helmut Schmid developed BitPar, a parser for highly ambiguous probabilistic context-free grammars (such as treebank grammars). BitPar uses bit-vector operations to speed up the basic parsing operations by parallelization. It is implemented in C and distributed as compiled code.

Installation:

 mkdir /your/installation/dir
 cd /your/installation/dir
 wget ftp://ftp.ims.uni-stuttgart.de/pub/corpora/BitPar/BitPar.tar.gz
 tar xzf BitPar.tar.gz 
 cd BitPar/src
 make
 cd ../..

You will also need the parsing model for German which was trained on the Tiger treebank:

 wget ftp://ftp.ims.uni-stuttgart.de/pub/corpora/BitPar/GermanParser.tar.gz
 tar xzf GermanParser.tar.gz
 cd GermanParser/src
 make
 cd ../..

There is also an English parsing model.

LoPar (German)

LoPar is an implementation of a parser for head-lexicalized probabilistic context-free grammars, which can also be used for morphological analysis. The program is distributed without source code.

Installation:

 mkdir /my/installation/dir
 cd /my/installation/dir
 wget ftp://ftp.ims.uni-stuttgart.de/pub/corpora/LoPar/lopar-3.0.linux.tar.gz
 tar xzf lopar-3.0.linux.tar.gz
 cd LoPar-3.0

Berkeley Parser

The Berkeley Parser is a phrase structure grammar parser implemented in Java and distributed as open source. Models are provided for English, Bulgarian, Arabic, Chinese, French, and German.

http://code.google.com/p/berkeleyparser/

Other Open Source Machine Translation Systems

Joshua

Joshua is a machine translation decoder for hierarchical models. Joshua development is centered at the Center for Language and Speech Processing at the Johns Hopkins University in Baltimore, Maryland. It is implemented in Java.

cdec

Cdec is a decoder, aligner, and learning framework for statistical machine translation and other structured prediction models, written by Chris Dyer at the University of Maryland Department of Linguistics. It is written in C++.

Apertium

Apertium is an open source rule-based machine translation (RBMT) system, maintained principally by the University of Alicante and Prompsit Engineering.

Docent

Docent is a decoder for phrase-based SMT that treats complete documents, rather than single sentences, as translation units and permits the inclusion of features with cross-sentence dependencies. It is developed by Christian Hardmeier and implemented in C++.

Phrasal

Phrasal is a phrase-based SMT toolkit written in Java. http://www-nlp.stanford.edu/wiki/Software/Phrasal2

Other Translation Tools

COSTA MT Evaluation Tool

COSTA MT Evaluation Tool is an open-source Java program that can be used to manually evaluate the quality of MT output. It is simple to use and designed to let potential MT users and developers analyse their engines in a friendly environment. It enables the ranking of the quality of MT output segment by segment for a particular language pair.

Appraise

Appraise is an open-source tool for manual evaluation of machine translation output. Appraise supports the collection of human judgments on translation output, implementing annotation tasks such as translation quality checking, ranking of translations, error classification, and manual post-editing. It is used in the ACL WMT evaluation campaign.

Indic NLP Library

The Indic NLP Library is a set of Python-based libraries for common text processing and natural language processing in Indian languages. Indian languages share many similarities in script, phonology, syntax, and so on, and this library is an attempt to provide a general solution to very commonly required toolsets for Indian language text.

The library provides the following functionalities:

    Text Normalization
    Transliteration
    Tokenization
    Morphological Analysis

https://github.com/anoopkunchukuttan/indic_nlp_library

Page last modified on July 05, 2017, at 08:45 AM