The single best thing you can do is to binarize the phrase tables and language models. See also the question below.
Filter and binarize your phrase tables. Binarize your language models using IRSTLM. Binarize your lexicalized reordering table.
Binarizing the phrase table reduces memory usage because only the phrase pairs needed for each sentence are read from file into memory. The same applies to language models and lexicalized reordering models.
This webpage tells you how to binarize the models.
We are always grateful for bug reports and code contributions. Send them to an existing Moses developer you work with, or to Hieu Hoang at Edinburgh University.
If you want to check in code yourself, create a GitHub account here
Then ask one of the project admins to add you to the Moses project. The admins are currently
We'll probably ask to review your code a few times before giving you free rein. However, there is less oversight if you intend to work on your own branch rather than the trunk.
The best way is using git.
From the command line, type
Or use whatever GUI client you have.
Email the mailing list with the title: 'Code monkey available. Will work for peanuts'! Seriously, there are lots and lots of projects available. In the past, 3-4 month projects have made a significant contribution to the community and have been integrated into the Moses toolkit. Your contribution will be gratefully appreciated. Talk to your professor in the first instance, then talk to us.
See the section on phrase scoring
It depends on which part.
The decoder can be compiled and run on Linux (32- and 64-bit), Windows, Cygwin, and Mac OS X (Intel and PowerPC). There are unconfirmed reports of the decoder running on Solaris and BSD too.
The training and tuning scripts are regularly run on Linux (32- and 64-bit), and occasionally on Mac (Intel). The whole of the Moses pipeline should also run on Windows under Cygwin, but this has not been confirmed. If you are able to run it under Windows/Cygwin, please let us know so we can update this FAQ.
When running on non-Linux platforms, beware of the following issues:
gzip command line programs missing
Therefore, the only realistic platforms on which to run the whole SMT pipeline are Linux and Intel Mac.
Yes. Moses compiles and runs in Cygwin exactly the same way as on Linux.
There is one proviso, though:
Cygwin is 32-bit, even on 64-bit Windows. The binary language models (KenLM, IRSTLM) need 64-bit to work with language models larger than about 2 GB. The same is true of 32-bit Linux.
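As a rough illustration of this proviso, the following Python sketch checks whether a language model file is above the size at which a 32-bit build is likely to run into trouble. The function names and the exact 2 GiB threshold are assumptions made for this example; the FAQ itself only says "about 2gb".

```python
import os

# Assumed threshold: the FAQ only says "about 2gb"; 2 GiB is used for illustration.
LIMIT_BYTES = 2 * 1024 ** 3


def needs_64bit(size_bytes, limit=LIMIT_BYTES):
    """Return True if a file of this size exceeds the assumed 32-bit limit."""
    return size_bytes > limit


def check_lm(path):
    """Return a short verdict for one language model file (hypothetical helper)."""
    size = os.path.getsize(path)
    if needs_64bit(size):
        return "%s: %d bytes, above the assumed limit; use a 64-bit build" % (path, size)
    return "%s: %d bytes, within the assumed 32-bit limit" % (path, size)
```

Calling `check_lm` on each language model before moving an experiment to Cygwin or 32-bit Linux gives an early warning; it does not replace actually loading the model.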
The Moses toolkit uses an SGE (Sun Grid Engine) cluster to parallelize tasks. A cluster is not strictly necessary to run your experiments, but it is highly advisable because it makes them run faster.
The most CPU-intensive task is the tuning of the weights (MERT tuning). As an indication, a Europarl-trained model, using 2000 sentences for tuning, takes 1-2 days to tune using 15 CPUs. 10-15 iterations are typical.
Moses shouldn't segfault, so the Moses developers would like to hear about it.
First of all, try to identify the fault yourself. The most common causes are an incorrect ini file or badly formatted input sentences.
If necessary, you can debug the system by stepping through the source code. We put a lot of effort into making the code easy to read and debug. The decoder also comes with Visual Studio and Xcode project files to help you debug in a GUI environment.
If you still can't find the solution, email the mailing list. It's useful to attach the ini file, the output just before the crash, and any other information that you think may help resolve the problem.
This is now documented in its own section.
Firstly, make sure SRILM/IRSTLM themselves have compiled successfully. You should see libflm.a, libdstruct.a, etc. (for SRILM), or libirstlm.a. If these aren't available, then something went wrong. SRILM and IRSTLM are external libraries, so the Moses developers have limited say over them and limited knowledge of them.
SRILM and IRSTLM each have their own mailing list where you can ask questions if you have problems compiling them. See here for details:
If Moses still doesn't compile successfully, look at the compile error to see where the compiler is trying to find these external libraries. Occasionally (especially when compiling on 64-bit machines), Moses expects the .a files in one subdirectory but they are in another. This is easily solved by copying the .a files to the place where Moses expects them to be.
There's a subproject in Moses, in contrib/web, which allows you to set up a web page to translate other web pages. It's written in Perl and the installation is non-trivial. Follow the instructions carefully.
It doesn't translate ad-hoc sentences. If you have some code which allows translation of ad-hoc sentences, please share it with us!
You need to do everything twice, and run two decoders. There is a lot of overlap between them, but the toolkit is designed to go one direction at a time.
This may happen because you have a null byte in your data. Look at line 2 of model/lex.f2e.
Try this to find lines with null bytes in your original data:
grep -Pc '[\000]' <files ...>
(If your grep doesn't support Perl-style regexp syntax (-P), you'll have to express that a different way.)
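If grep without -P is all you have, one alternative is a few lines of Python. This is a minimal sketch, not part of the Moses toolkit; the function names `count_null_lines` and `report` are made up for this example. It mirrors `grep -c` by reporting, for each file, how many lines contain at least one null byte.

```python
def count_null_lines(path):
    """Count the lines in `path` that contain at least one null byte."""
    count = 0
    # Binary mode keeps null bytes intact and avoids any decoding errors.
    with open(path, "rb") as handle:
        for line in handle:
            if b"\x00" in line:
                count += 1
    return count


def report(paths):
    """Return one 'file:count' string per file, similar to grep -c output."""
    return ["%s:%d" % (path, count_null_lines(path)) for path in paths]
```

For example, `print("\n".join(report(sys.argv[1:])))` in a small script (after `import sys`) checks every file named on the command line. Any file with a non-zero count is worth inspecting.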
If this turns out to be the problem, and you don't want to run GIZA again from scratch, you can try the following:
First go into working-dir/model and delete everything but the following:
aligned.grow-diag-final-and aligned.0.fr aligned.0.en lex.0-0.n2f lex.0-0.f2n
Now run this fragment of Perl:
perl -i.BAD -pe 's/[\000]/NULLBYTE/g;' aligned.0* lex.0*
This will replace every null byte in those four files with the string NULLBYTE, saving the old versions to *.BAD. (This may be overkill, for instance if only the foreign side has the problem.)
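For readers who would rather not use the Perl one-liner, the same repair can be sketched in Python. This is an illustrative equivalent under stated assumptions, not a Moses script: `replace_null_bytes` is a hypothetical helper, and it reads the whole file into memory, which may be impractical for very large files.

```python
import shutil


def replace_null_bytes(path, placeholder=b"NULLBYTE", backup_suffix=".BAD"):
    """Replace null bytes in `path` in place, keeping the original as path + suffix.

    Returns the number of null bytes that were replaced.
    """
    with open(path, "rb") as handle:
        data = handle.read()
    # Keep an untouched copy of the original, as `perl -i.BAD` does.
    shutil.copyfile(path, path + backup_suffix)
    with open(path, "wb") as handle:
        handle.write(data.replace(b"\x00", placeholder))
    return data.count(b"\x00")
```

Running it only on the files that actually contain null bytes avoids the overkill mentioned above.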
Now restart the Moses training script with the same invocation as before, but tell it to start at step 5:
train-model.perl ... --first-step 5
Yes. Check the Syntax Tutorial.
Moses is licensed under the LGPL. See here for a thorough explanation of what this means.
Basically, if you're just using Moses unchanged, there are no license issues. You can also use the Moses library (libmoses.a) in your own applications.
But if you want to distribute a modified version of Moses, you have to distribute the source code to the modifications.
You have a version of GIZA++ which doesn't support cooccurrence files. To add
support for cooccurrence files, you need to edit the GIZA++ Makefile and add
CFLAGS_OPT. Then you should rebuild
You shouldn't be running this script. Moses moved from autotools to bjam in Autumn 2011.
This error occurs during the word alignment step and is related to GIZA++, not directly to the Moses toolkit. Nevertheless, the solution is described here.
In general, machine translation training is non-convex. This means that there are multiple solutions, and each time you run a full training job you will get different results. In particular, you will see different results when running GIZA++ (any flavour) and MERT.
The best (and most expensive) way to deal with this is to run the full pipeline from scratch multiple times. This will give you a feel for the variance, that is, the differences in results between runs. In general, variance arising from GIZA++ is less damaging than variance from MERT.
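Once you have scores from several complete runs, a small script can summarize how much they vary. The sketch below is only an illustration: the function name `summarize_runs` is hypothetical, the scores are invented placeholders rather than real results, and the choice of metric (for example BLEU) is up to you.

```python
import statistics


def summarize_runs(scores):
    """Return (mean, sample standard deviation, range) for repeated-run scores."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate variance")
    mean = statistics.mean(scores)
    stdev = statistics.stdev(scores)  # sample standard deviation (n - 1)
    spread = max(scores) - min(scores)
    return mean, stdev, spread


# Placeholder scores, invented purely for illustration; they are not real results.
example_scores = [25.0, 26.0, 27.0]
mean, stdev, spread = summarize_runs(example_scores)
print("mean=%.2f stdev=%.2f range=%.2f" % (mean, stdev, spread))
```

A difference between two systems that is smaller than the spread across reruns of a single system should be treated with caution.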
To reduce variance it is best to use as much data as possible at each stage. It is possible to reduce this variability by using better machine learning, but in general it will always be there.
Another strategy is to fix everything once you have a set of good weights and never rerun MERT. Should you need to change, say, the language model, you would then manually alter the associated weight. This gives stability, but at the obvious cost of generality. It is also ugly.
See Clark et al. for a discussion of some of these issues.
The ranges that you pass to mert-moses.pl (using the --range argument) are only used in the random restarts, so they serve to guide MERT rather than restrict it.