Sparse Data

Building machine translation systems for under-resourced languages, or under sparse data conditions that arise for other reasons, is a special challenge and may require special methods.

Sparse Data is the main subject of 15 publications. 11 are discussed here.

Publications

Several reports show how statistical machine translation allows for rapid development with limited resources (Al-Onaizan et al., 2000; Al-Onaizan et al., 2002; Foster et al., 2003; Oard and Och, 2003).

A practical example of this is the rapid development of a Haitian Creole to English machine translation system to assist first responders in the aftermath of the 2010 earthquake in Haiti (Lewis et al., 2011). The training data made available and extended during this effort was the topic of a shared task (Callison-Burch et al., 2011), in which several research teams participated (Eidelman et al., 2011; Hewavitharana et al., 2011; Hu et al., 2011; Stymne, 2011).

Another good example is the development of a Yiddish-English system (Genzel et al., 2009), where a range of methods were explored, such as taking advantage of the close relation of Yiddish to German and the existence of Polish and Hebrew loan words.

Benchmarks

A shared task on Haitian Creole organized at the 2011 ACL Workshop on Statistical Machine Translation (Callison-Burch et al., 2011) provides a data set that has been used by several research groups.

Discussion

New Publications

Jeff Ma and Spyros Matsoukas and Richard Schwartz (2011): Improving Low-Resource Statistical Machine Translation with a Novel Semantic Word Clustering Algorithm, Proceedings of the 13th Machine Translation Summit (MT Summit XIII)
@inproceedings{MTS-2011-Ma-2,
author = {Jeff Ma and Spyros Matsoukas and Richard Schwartz},
title = {Improving Low-Resource Statistical Machine Translation with a Novel Semantic Word Clustering Algorithm},
url = {http://www.mt-archive.info/MTS-2011-Ma-2.pdf},
pages = {352--359},
booktitle = {Proceedings of the 13th Machine Translation Summit (MT Summit XIII)},
publisher = {International Association for Machine Translation},
location = {Xiamen, China},
year = 2011
}
Ma et al. (2011)
Steve DeNeefe and Ulf Hermjakob and Kevin Knight (2008): Overcoming Vocabulary Sparsity in MT Using Lattices, Proceedings of the Eighth Conference of the Association for Machine Translation in the Americas (AMTA)
@inproceedings{amta08:DeNeefe,
author = {Steve DeNeefe and Ulf Hermjakob and Kevin Knight},
title = {Overcoming Vocabulary Sparsity in {MT} Using Lattices},
url = {http://www.isi.edu/natural-language/mt/amta2008su.pdf},
googlescholar = {4025401116724932831},
pages = {89--96},
booktitle = {Proceedings of the Eighth Conference of the Association for Machine Translation in the Americas (AMTA)},
location = {Waikiki, Hawaii},
year = 2008
}
DeNeefe et al. (2008)
Pidong Wang and Preslav Nakov and Hwee Tou Ng (2012): Source Language Adaptation for Resource-Poor Machine Translation, Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning
@InProceedings{wang-nakov-ng:2012:EMNLP-CoNLL,
author = {Wang, Pidong and Nakov, Preslav and Ng, Hwee Tou},
title = {Source Language Adaptation for Resource-Poor Machine Translation},
booktitle = {Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning},
month = {July},
address = {Jeju Island, Korea},
publisher = {Association for Computational Linguistics},
pages = {286--296},
url = {http://www.aclweb.org/anthology/D12-1027},
year = 2012
}
Wang et al. (2012)
William Lewis and Phong Yang (2012): Building MT for a Severely Under-Resourced Language: White Hmong, Proceedings of the Tenth Conference of the Association for Machine Translation in the Americas (AMTA)
@inproceedings{AMTA-2012-Lewis,
author = {William Lewis and Phong Yang},
title = {Building {MT} for a Severely Under-Resourced Language: White Hmong},
url = {http://www.mt-archive.info/AMTA-2012-Lewis.pdf},
booktitle = {Proceedings of the Tenth Conference of the Association for Machine Translation in the Americas (AMTA)},
location = {San Diego, California},
year = 2012
}
Lewis and Yang (2012)

MT Research Survey Wiki

A Comprehensive Survey of Neural and Statistical Machine Translation Research Publications
