SMT Glossary v 1.0

This glossary includes common terms that are helpful for new users of statistical machine translation (SMT) and the open-source Moses Decoder project.

Term	Source	Description
aligned data	SMT	Aligned data are the elements of a parallel corpus consisting of two or more languages. Each element in one language matches the corresponding element in the other language(s). The elements, sometimes called segments, can be block-aligned, paragraph-aligned, sentence-aligned, phrase-aligned or token-aligned.
alignment process	SMT	There are two alignment processes. In corpus preparation, the alignment process creates aligned data. During training, the alignment process uses a program such as MGIZA++ to create word alignment files.
BLEU score	SMT	BLEU stands for “Bilingual Evaluation Understudy”. A BLEU score indicates how closely the token sequences in one set of data, such as machine translation output, correlate with (match) the token sequences in another set of data, such as a reference human translation. See: evaluation process
corpus preparation	SMT	Corpus preparation is the general process of extracting, transforming and categorizing documents that were created for other purposes, and aligning the resulting data into a parallel corpus for training a translation model.
development (dev) set	SMT	See “tuning set”
eval set	SMT	See “test set”
evaluation process	SMT	The evaluation process uses a translation model, made up of components created in the training process and configured by the tuning process, to translate several thousand source language sentences in the eval set. This process then compares the resulting machine translations to reference translations, also in the eval set. The final BLEU score evaluation report shows how well the machine translations match the reference translations.
hierarchical model	SMT	An SMT translation model that uses a hierarchical training corpus. See: hierarchical training data
hierarchical training data	SMT	A training corpus with each phrase annotated with the hierarchical structure of the language, such as parts of speech, word function, etc.
language model	SMT	A “language model” or “lm” is a statistical description of one language that includes the frequencies of token-based n-gram occurrences in a corpus. The “lm” is trained from a large monolingual corpus and saved as a file. The language model file is a required component of every translation model. Moses uses the language model to select the most “probable” target language sentence from a large set of “possible” translations it generated using the phrase table and reordering table.
language model types	SMT	Language model files contain statistical data generated by one of various programs. Moses Decoder can use language model file types including: KenLM, SRILM, RandLM and IRSTLM. The SRILM, RandLM and IRSTLM toolkits include tools that train new language model files. KenLM, however, only reads ARPA standard language model files, which can be created by SRILM or IRSTLM.
moses configuration file	SMT	The moses configuration file is a text file created during the tuning process. The file contains the paths to the phrase table(s), reordering table and language model(s), along with other codes and numeric values that control how the Moses Decoder works.
n-grams	SMT	An n-gram is a subsequence of n items (n = 1, 2, 3, etc.) in a larger sequence. In a language model (lm), n-grams are sequences of tokens. In phrase tables and reordering tables, n-grams are sequences of pairs of source and target language tokens.
parallel corpus	SMT	See “parallel data”
parallel data	SMT	A linguistic corpus of two or more languages where each element in one language corresponds to an element with the same meaning in the other language(s). The original, authored language is identified as the “source” language. Non-source languages are referred to as “target” languages. For Moses SMT, parallel data takes the form of one source language text file and one target language text file, where line N of one file is the translation of line N of the other.
phrase table	SMT	A “phrase table” is a statistical description of a parallel corpus of source-target language sentence pairs. The frequencies with which n-grams in a source language text co-occur with n-grams in a parallel target language text represent the probability that those source-target paired n-grams will occur again in other texts similar to the parallel corpus. In practical terms, the phrase table is a file created during the training process and saved in the translation model folder. It functions as a sophisticated dictionary between the source and target languages. Phrase tables and reordering tables are translation model components.
pipeline	SMT	A “pipeline” is a toolchain of processes connected by standard streams, so that the output of each process (stdout) feeds directly as input (stdin) to the next one.
recaser model	SMT	A recaser model is a special translation model that translates lowercased data to “natural” cased text (upper and lower casing).
reordering table	SMT	A “reordering table” contains the statistical frequencies that describe the changes in word order between source and target languages, such as “big house” versus “house big”. In practical terms, a “reordering table” is a file created during the training process and saved in the translation model folder. The reordering table is a translation model component.
source language	SMT	The source language is the language of the text that is to be translated. Typically, this is the authored language of the text. The source language is the same as the TMX specification “srclang” attribute of the <tu> tag.
target language	SMT	The target language is the language the source language text should be translated to.
test set	SMT	A pair of source and target language data, typically consisting of several thousand sentence pairs, used in the evaluation process.
tokenization	SMT	Tokenization is the process of separating words from punctuation and symbols into tokens.
tokens	SMT	Tokens are the basic units in a machine translation process. A token is a sequence of characters, such as a word, punctuation mark or symbol, separated from neighboring tokens by a space. See: BLEU score
toolchain	SMT	A “toolchain” is a series of linked or “chained” programming tools in which the output of an “upstream” tool becomes the input for a “downstream” tool.
training corpus	SMT	A linguistic corpus with parallel data prepared for training the phrase table and reordering table components of a translation model.
training data	SMT	See: training corpus
training process	SMT	Training is a process in the machine learning branch of the artificial intelligence field. In the training process, a system “learns” the relationships between parallel data. In SMT, the source language texts are stimuli that generate the target language text as a response. In practical terms, training starts with the bitext files and creates the phrase table and reordering table that are components of a translation model.
translation memory	SMT	A translation memory (tm) is parallel data that was collected for the purpose of aiding future translations.
translation model	SMT	A “translation model” consists of one or more phrase tables, zero or more reordering tables, one or more language models and one moses configuration file that were created during the training and tuning processes.
tuning process	SMT	Tuning is a process that finds the optimized configuration file settings for a translation model when used for a specific purpose. The tuning process translates thousands of source language sentences in the tuning set with a translation model, compares the model's output to a set of reference human translations, and adjusts the settings with the intention of improving the translation quality. The process repeats these steps through numerous iterations until it reaches an optimized translation quality.
tuning set	SMT	A pair of source and target language data, typically consisting of several thousand sentence pairs, used in the tuning process.
word aligner	SMT	A word aligner is a program that creates word alignment files during the word alignment process. Moses currently supports these word aligners: GIZA++, MGIZA++, and BerkeleyAligner.
word alignment	SMT	The word alignment process uses a word aligner to create a word alignment file during the training process.
words	SMT	A word is the smallest unit of meaning in a language that will stand on its own. In SMT, a word is a token created in the tokenization process that is not a punctuation mark or symbol.
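
Example: tokens and n-grams. The entries for tokenization, tokens, n-grams and language model can be illustrated with a minimal Python sketch. It is not the Moses tokenizer; the regular expression and the function names (tokenize, ngrams, ngram_counts) are assumptions made only for this illustration, and real tokenizers handle many language-specific cases that this one ignores.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into tokens: runs of word characters, or single
    punctuation marks / symbols. Whitespace is discarded."""
    return re.findall(r"\w+|[^\w\s]", text)


def ngrams(tokens, n):
    """Return every contiguous subsequence of n tokens, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(sentences, n):
    """Count how often each n-gram occurs in a small monolingual corpus.
    A language model is built from frequencies of this kind."""
    counts = Counter()
    for sentence in sentences:
        counts.update(ngrams(tokenize(sentence.lower()), n))
    return counts


if __name__ == "__main__":
    print(tokenize("Hello, world!"))
    print(ngram_counts(["The big house.", "The big dog."], 2))
```

Joining the tokens with single spaces gives the space-separated form described under “tokens”. Counting n-grams over a large monolingual corpus is the starting point for the statistics stored in a language model file.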

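Example: matching token sequences. The BLEU score entry describes how closely the token sequences in machine translation output match those in a reference translation. The Python sketch below shows only one ingredient of that comparison, a clipped n-gram precision for a single sentence pair. It is not a full BLEU implementation (there is no brevity penalty and no combination of several n-gram orders), and the function names are assumptions made for this illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every contiguous subsequence of n tokens, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also occur in the reference.
    Each n-gram is credited at most as often as it occurs in the
    reference, so repeating a matching word does not inflate the score."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram])
                  for gram, count in cand_counts.items())
    return matched / total


if __name__ == "__main__":
    machine = "the cat sat on the mat".split()
    human = "the cat is on the mat".split()
    print(clipped_precision(machine, human, 1))
    print(clipped_precision(machine, human, 2))
```

Both inputs are lists of tokens, so the text should be tokenized the same way on both sides before it is compared.
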
(Excerpts from the "DoMY Glossary" in Do Moses Yourself Community Edition)
2011-10-22 21:15
Copyright © 2011 Precision Translation Tools Co., Ltd.
SMT Glossary by Precision Translation Tools Co., Ltd. is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.