The main forum for communication on Moses is the Moses support mailing list.
We'd like to hear what you want from Moses. We can't promise to implement the suggestions, but they can be used as input into research and student projects, as well as Marathon projects. If you have a suggestion or wish for a new feature or improvement, then either report it via the issue tracker, contact the mailing list, or drop Barry or Hieu a line (addresses on the mailing list page).
Moses is an open source project that is at home in the academic research community. There are several venues where this community gathers, such as:
Moses is being developed as a reference implementation of state-of-the-art methods in statistical machine translation. Extending this implementation may be the subject of undergraduate or graduate theses, or class projects. Typically, developers extend functionality that they require for their projects, or to explore novel methods. Let us know if you made an improvement, no matter how minor. Also let us know if you found or fixed a bug.
We are aware of many commercial deployments of Moses, for instance as described by TAUS. Please let us know if you use Moses commercially. Do not hesitate to contact the core developers of Moses. They are willing to answer questions and may even be available for consulting services.
There are many ways you can contribute to Moses.
- To get started, build systems with your data and get familiar with how Moses works.
- Test out alternative settings for building a system. The shared tasks organized around the ACL Workshop on Statistical Machine Translation are a good forum to publish such results on standard data conditions.
- Read the code. While you are at it, feel free to add comments or contribute to the Code Guide to make it easier for others to understand the code.
- If you come across inefficient implementations (e.g., bad algorithms or code in Perl that should be ported to C++), program more efficient implementations.
- If you have new ideas for features, tools, and functionality, add them.
- Help out with some of the projects listed below.
If you are looking for projects to improve Moses, please consider the following list:
- OpenOffice/Microsoft Word, Excel or Access plugins: (Hieu Hoang) Create wrappers for the Moses decoder to translate within user apps. Skills required - Windows, VBA, Moses. (GSOC)
- Moses on the OLPC: (Hieu Hoang) Create a front-end for the decoder, and possibly the training pipeline, so that it can be run on the OLPC. Some preliminary work has been done here.
- Rule-based numbers, currency, date translation: (Hieu Hoang) SMT is bad at translating numbers and dates. Write some simple rules to identify and translate these for the language pairs of your choice. Integrate it into Moses and combine it with the placeholder feature. Skills required - C++, Moses. (GSOC)
- Named entity translation: (Hieu Hoang) Text with lots of names, trademarks, etc. is difficult for SMT to translate. Integrate named entity recognition into Moses. Translate the entities using the transliteration phrase-table, placeholder feature, or a secondary phrase-table. Skills required - C++, Moses. (GSOC)
- Interactive visualization for SCFG decoding: (Hieu Hoang) Create a front-end to the hiero/syntax decoder that enables the user to re-translate a part of the sentence, change parameters in the decoder, add or delete translation rules etc. Skills required - C++, GUI, Moses. (GSOC)
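As a rough illustration of the rule-based number item in the list above, the sketch below finds numbers with a regular expression, swaps them for a placeholder token, and keeps the original strings so that a simple rule can translate them separately. The names `mask_numbers` and `translate_number_en_de`, the `@num@` token, and the regular expression are assumptions made for this example; they are not the input format of the Moses placeholder feature.

```python
import re

# Hypothetical pattern: digits with optional thousands separators and decimals.
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")


def mask_numbers(sentence, token="@num@"):
    """Replace each number with a placeholder and return the originals in order."""
    found = NUMBER_RE.findall(sentence)
    masked = NUMBER_RE.sub(token, sentence)
    return masked, found


def translate_number_en_de(number):
    """Toy rule: swap English separators (1,234.5) for German ones (1.234,5)."""
    return number.translate(str.maketrans({",": ".", ".": ","}))
```

A real implementation would need separate rules per language pair and for dates and currencies; this only shows the identify, mask, and translate steps.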
Training & Tuning
- Incremental updating of translation and language model: When you add new sentences to the training data, you don't want to re-run the whole training pipeline (do you?). Abby Levenberg has implemented incremental training for Moses but what it lacks is a nice How-To guide.
- Compression for lmplz: (Kenneth Heafield) lmplz trains language models on disk. The temporary data on disk is not compressed, but it could be, especially with a fast compression algorithm like zippy. This will enable us to build much larger models. Skills required: C++. No SMT knowledge required. (GSOC)
- Faster tuning by reuse: In tuning, you constantly re-decode the same set of sentences and this can be very time-consuming. What if you could reuse part of the calculation each time? This has been previously proposed as a marathon project.
- Use binary files to speed up phrase scoring: Phrase extraction and scoring involve a lot of processing of text files, which is inefficient in both time and disk usage. Using binary files and vocabulary ids has the potential to make training more efficient, although more opaque.
- Lattice training: At the moment lattices can be used for decoding, and also for MERT but they can't be used in training. It would be pretty cool if they could be used for training, but this is far from trivial.
- Training via forced decoding: (Matthias Huck) Implement leave-one-out phrase model training in Moses. Skills required - C++, SMT.
- Faster training for the global lexicon model: Moses implements the global lexicon model proposed by Mauser et al. (2009), but training features for each target word using a maximum entropy trainer is very slow (years of CPU time). More efficient training or accommodation of training of only frequent words would be useful.
- Letter-based TER: Implement an efficient version of letter-based TER as a metric for tuning and evaluation, geared towards morphologically complex languages.
- New Feature Functions: Many new feature functions could be implemented and tested. For some ideas, see Green et al. (2014).
- Character count feature: The word count feature is very valuable, but may be geared towards producing superfluous function words. To encourage the production of longer words, a character count feature could be useful. Maybe a unigram language model fulfills the same purpose.
- Training with comparable corpora, related language, monolingual data: (Hieu Hoang) High quality parallel corpora are difficult to obtain. There is a large amount of work on using comparable corpora, monolingual data, and parallel data in closely related languages to create translation models. This project will re-implement and extend some of the prior work.
- Decoding algorithms for syntax-based models: Moses generally supports a large set of grammar types. For some of these (for instance ones with source syntax, or a very large set of non-terminals), the implemented CYK+ decoding algorithm is not optimal. Implementing search algorithms for dedicated models, or just to explore alternatives, would be of great interest.
- Source cardinality synchronous cube pruning for the chart-based decoder: (Matthias Huck) Pooling hypotheses by number of covered source words. Skills required - C++, SMT.
- Cube pruning for factored models: Complex factored models with multiple translation and generation steps push the limits of the current factored model implementation, which exhaustively computes all translation options up front. Using ideas from cube pruning (sorting the most likely rules and partial translation options) may be the basis for more efficient factored model decoding.
- Word class models for syntax-based translation: (Matthias Huck) In particular, add class-based LMs for the chart decoder (independently of factored translation). Skills required - C++, SMT/NLP.
- Missing features for chart decoder: A number of features are missing for the chart decoder, such as: MBR decoding (should be simple) and lattice decoding. In general, reporting and analysis within experiment.perl could be improved.
- More efficient rule table for chart decoder: (Marcin) The in-memory rule table for the hierarchical decoder loads very slowly and uses a lot of RAM. An optimized implementation that is vastly more efficient on both fronts should be feasible. Skills required - C++, NLP, Moses. (GSOC)
- Only maintain total hypothesis weight in decoding: At the moment, each hypothesis (partial translation) contains the full feature vector, but really all that is required is the weighted score. The feature vectors could then be supplied lazily, if needed for n-best lists, and decoding would be more efficient.
- More features for incremental search: Kenneth Heafield presented a faster search algorithm for chart decoding in "Grouping Language Model Boundary Words to Speed K-Best Extraction from Hypergraphs" (NAACL 2013). This is implemented as a separate search algorithm in Moses (called 'incremental search'), but it lacks many features of the default search algorithm (such as sparse feature support, or support for multiple stateful features). Implementing these features for the incremental search would be of great interest.
- Scope-0 grammar and phrase-table: (Hieu Hoang) The most popular decoding algorithm for syntax MT is the CYK+ algorithm. This is a parsing algorithm which is able to decode with an unnormalized, unpruned grammar. However, the disadvantage of using such a general algorithm is its speed; Hopkins and Langmead (2010) showed that a sentence of length n can be parsed using a scope-k grammar in O(n^k) chart updates. For an unpruned grammar with 2 non-terminals (the usual SMT setup), the scope is 3.
This project proposes to quantify the advantages and disadvantages of scope-0 grammar. A scope-0 grammar lacks application ambiguity, therefore, decoding can be fast and memory efficient. However, this must be offset against potential translation quality degradation due to the lack of coverage.
It may be that the advantages of a scope-0 grammar can only be realized through specifically developed algorithms, such as parsing algorithms or data structures. The phrase-table lookup for a scope-0 grammar can be significantly simplified, made faster, and applied to much larger span widths.
This project will also aim to explore this potentially rich research area.
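To make the notion of scope concrete, here is a small sketch that computes the scope of a rule's source side. It assumes the usual reading of Hopkins and Langmead (2010): one point for a non-terminal at each edge of the rule, plus one point for every pair of adjacent non-terminals. The function names and the convention that non-terminals are written as `X` are assumptions for this example.

```python
def rule_scope(source_side, is_nonterminal):
    """Count non-terminals at either edge plus adjacent non-terminal pairs."""
    flags = [is_nonterminal(sym) for sym in source_side]
    if not flags:
        return 0
    # An edge non-terminal leaves the start or end of its span undetermined.
    scope = int(flags[0]) + int(flags[-1])
    # Two adjacent non-terminals leave their shared split point undetermined.
    scope += sum(1 for a, b in zip(flags, flags[1:]) if a and b)
    return scope


def is_x(sym):
    # Assumption for this sketch: non-terminals are written as "X".
    return sym == "X"
```

Under this definition `rule_scope(["X", "X"], is_x)` is 3, matching the unpruned two-non-terminal case mentioned above, while a rule such as `["a", "X", "b"]` has scope 0 because terminals anchor both ends of the gap.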
- A better phrase table: The current binarised phrase table suffers from (i) far too many layers of indirection in the code, making it hard to follow and inefficient; (ii) a cache-locking mechanism which creates excessive contention; and (iii) a lack of extensibility, meaning that (e.g.) word alignments were added by extensively duplicating code, and additional phrase properties are not available. A new phrase table could make Moses faster and more extensible.
- Multi-threaded decoding: Moses uses a simple "thread per sentence" model for multi-threaded decoding. However this means that if you have a single sentence to decode, then multi-threading will not get you the translation any faster. Is it possible to have a finer-grained threading model that can use multiple threads on a single sentence? This would call for a new approach to decoding.
- Better reordering: (Matthias Huck, Hieu Hoang) E.g. with soft constraints on reordering: Moses currently allows you to specify hard constraints on reordering, but it might be useful to have "soft" versions of these constraints. This would mean that the translation would incur a trainable penalty for violating the constraints, implemented by adding a feature function. Skills required - C++, SMT.
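The soft-constraint idea can be sketched as a feature that counts violations instead of forbidding them. The example below takes a distortion limit as the constraint (an assumption of this sketch; other hard constraints could be softened the same way) and returns the number of jumps that exceed it. The function names and the span representation are hypothetical and do not reflect the Moses feature function interface.

```python
def soft_distortion_violations(spans, limit):
    """Count jumps between consecutive source spans that exceed `limit`.

    `spans` lists the (start, end) source positions of the phrases in the
    order they are translated; `end` is inclusive.
    """
    violations = 0
    prev_end = -1  # position just before the first source word
    for start, end in spans:
        # Distance between where monotone decoding would continue and
        # where this phrase actually starts.
        jump = abs(start - (prev_end + 1))
        if jump > limit:
            violations += 1
        prev_end = end
    return violations


def penalty_score(spans, limit, weight):
    """Feature value times its weight; the weight would be set by tuning."""
    return weight * soft_distortion_violations(spans, limit)
```

With a negative weight, a hypothesis that breaks the limit is still allowed but pays a penalty per violation, and the size of that penalty can be learned during tuning like any other feature weight.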
More ideas related to reordering:
- Merging the phrase table and lexicalized reordering table: (Matthias Huck, Hieu Hoang) They contain the same source and target phrases, but different probabilities, and how those probabilities are applied. Merging the 2 models would halve the number of lookups. Skills required - C++, Moses. (GSOC)
- Using artificial neural networks as memory to store the phrase table: (Hieu Hoang) ANNs can be used as associative memory to store information in a lossy manner [http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4634358&tag=1]. It would be interesting to investigate how useful they are for storing the phrase table. Further research can focus on how they can be used to store morphologically similar translations.
- Entropy-based pruning: (Matthias Huck) A more consistent method for pre-pruning phrase tables. Skills required - C++, NLP.
- Faster phrase-based decoding by refining feature state: Implement Heafield's Faster Phrase-Based Decoding by Refining Feature State (ACL 2014).
- Multi-pass decoding: (Hieu Hoang) Some features may be too expensive to use during decoding - maybe due to their computational cost, or due to their wider use of context which leads to more state splitting. Think of a recurrent neural network language model that both uses too much context (the entire output string) and is costly to compute. We would like to use these features in a reranking phase, but dumping out the search graph and then re-decoding it outside of Moses creates a lot of additional overhead. So, it would be nicer to integrate second-pass decoding within the decoder. This idea is related to coarse-to-fine decoding. Technically, we would like to be able to specify any feature function as a first-pass or second-pass feature function. There are some major issues that have to be tackled with multi-pass decoding:
- A losing hypothesis which has been recombined with the winning hypothesis may now be the new winning hypothesis. The output search graph has to be reordered to reflect this.
- The feature functions in the 2nd pass produce state information. Recombined hypotheses may no longer be recombined and have to be split.
- It would be useful for feature function scores to be able to be evaluated asynchronously. That is, a function to calculate the score is called, but the score is calculated later. Skills required - C++, NLP, Moses. (GSOC)
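The asynchronous scoring idea in the last item can be sketched with futures: every score is requested up front and only collected when it is needed. The example below rescores a tiny n-best list this way. The function `expensive_score` is a stand-in invented for this example (a simple length penalty), and nothing here reflects the actual Moses feature function interface.

```python
from concurrent.futures import ThreadPoolExecutor


def expensive_score(hypothesis):
    """Stand-in for a costly second-pass feature; here just a length penalty."""
    return -0.1 * len(hypothesis.split())


def rescore(nbest, first_pass_scores, executor):
    """Request all second-pass scores first, then combine them once ready."""
    # submit() returns immediately; the score is computed in the background.
    pending = [executor.submit(expensive_score, hyp) for hyp in nbest]
    # result() blocks only at the point where the value is actually needed.
    totals = [first + fut.result()
              for first, fut in zip(first_pass_scores, pending)]
    best = max(range(len(nbest)), key=totals.__getitem__)
    return nbest[best], totals
```

Calling `rescore(["a b c d", "a b"], [-1.0, -1.1], executor)` with a `ThreadPoolExecutor` picks `"a b"`, even though the longer hypothesis led after the first pass. This is the same effect as the first issue listed above, where a previously losing hypothesis becomes the winner.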
General Framework & Tools
- Out-of-vocabulary (OOV) word handling: Currently there are two choices for OOVs - pass them through or drop them. Often neither is appropriate, and Moses lacks both good hooks for adding new OOV strategies and alternative strategies of its own. A new phrase table class should be created which processes OOVs. To create a new phrase-table type, make a copy of
moses/TranslationModel/SkeletonPT.*, rename the class and follow the example in the file to implement your own code. Skills required - C++, Moses. (GSOC)
- Tokenization for your language: Tokenization is the only part of the basic SMT process that is language-specific. You can help make translation for your language better. Make a copy of the file
scripts/share/nonbreaking_prefixes/nonbreaking_prefix.en and replace its contents with the non-breaking words of your language. Skills required - SMT, Moses, lots of human languages. (GSOC)
- Python interface: A Python interface to the decoder could enable easy experimentation and incorporation into other tools. cdec has one, and Moses has a Python interface to the on-disk phrase tables (implemented by Wilker Aziz), but it would be useful to be able to call the decoder from Python.
- Analysis of results: (Philipp Koehn) Assessing the impact of variations in the design of a machine translation system by observing the fluctuations of the BLEU score may not be sufficiently enlightening. Having more analysis of the types of errors a system makes should be very useful.
- Integration of sigfilter: The filtering algorithm of Johnson et al. is available in Moses, but it is not well integrated and has awkward external dependencies, so it is seldom used. At the moment the code is in the contrib directory. A useful project would be to refactor this code to use the Moses libraries for suffix arrays, and to integrate it with the Moses experiment management system (EMS). The goal would be to enable the filtering to be turned on with a simple switch in the EMS config file.
- Boostification: Moses has allowed Boost since Autumn 2011, but there are still many areas of the code that could be improved by usage of the Boost libraries, for instance using shared pointers in collections.
- Cruise control: Moses has cruise control running on a server at the University of Edinburgh; however, this only tests one platform (Ubuntu 12.04). If you have a different platform, and care about keeping Moses stable on that platform, then you could set up a cruise control instance too. The code is all in the standard Moses distribution.
- Maintenance: The documentation always needs maintenance as new features are introduced and old ones are updated. Such a large body of documentation inevitably contains mistakes and inconsistencies, so any help in fixing these would be most welcome. If you want to work on the documentation, just introduce yourself on the mailing list.
- Help messages: Moses has a lot of executables, and often the help messages are quite cryptic or missing. A help message in the code is more likely to be maintained than separate documentation, and easier to locate when you're trying to find the right options. Fixing the help messages would be a useful contribution to making Moses easier to use.
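For the tokenization item in the list above, the role of a list of non-breaking words can be sketched as follows: a trailing period is split off as its own token unless the word in front of it is on the list. The word set and the function name are hypothetical, and the logic is deliberately simplified; it is not the behaviour of the Moses tokenizer script.

```python
# Hypothetical sample; a real list would come from a nonbreaking_prefix file.
NONBREAKING = {"Mr", "Dr", "e.g", "No"}


def split_periods(text, nonbreaking=NONBREAKING):
    """Split a trailing period off each token unless the token is listed."""
    out = []
    for tok in text.split():
        stem = tok[:-1]
        if tok.endswith(".") and stem and stem not in nonbreaking:
            # Ordinary word followed by a period: separate the two.
            out.extend([stem, "."])
        else:
            # Listed abbreviation, bare period, or token without a period.
            out.append(tok)
    return out
```

Supporting a new language then largely comes down to supplying the right word list, which is why this task needs knowledge of the language more than programming skill.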