Sparse Data

Building machine translation systems for under-resourced languages, or under sparse data conditions that arise for other reasons, is a special challenge and may require special methods.

Sparse Data is the main subject of 15 publications.


Several reports show how statistical machine translation allows for rapid development with limited resources (Al-Onaizan et al., 2000; Al-Onaizan et al., 2002; Foster et al., 2003; Oard and Och, 2003).
A practical example is the rapid development of a Haitian Creole to English machine translation system to assist first responders in the aftermath of the 2010 earthquake in Haiti (Lewis et al., 2011). The training data made available and extended during this effort was the topic of a shared task (Callison-Burch et al., 2011), in which several research teams participated (Eidelman et al., 2011; Hewavitharana et al., 2011; Hu et al., 2011; Stymne, 2011).
Another good example is the development of a Yiddish-English system by Genzel et al. (2009), who explored a range of methods, such as taking advantage of the close relation of Yiddish to German and the existence of Polish and Hebrew loan words.


A shared task on Haitian Creole organized at the 2011 ACL Workshop on Statistical Machine Translation (Callison-Burch et al., 2011) provides a data set that has been used by several research groups.


Related Topics

Sparse data aggravates the problem of Unknown Words, which may be replaced through Paraphrasing. If training data into a bridge language is available, such Pivot Languages can be exploited. The need to make use of any available data resource, even Comparable Corpora, becomes more urgent.

In general, since many methods in statistical machine translation are geared towards making effective use of the training data, they are more likely to make a difference in a sparse data scenario.

New Publications

  • Ma et al. (2011)
  • DeNeefe et al. (2008)
  • Wang et al. (2012)
  • Lewis and Yang (2012)