Featured Translation Task - EMNLP 2011 Sixth Workshop on Statistical Machine Translation

EMNLP 2011 SIXTH WORKSHOP
ON STATISTICAL MACHINE TRANSLATION

Featured Translation Task: Translating Haitian Creole Emergency SMS messages

July 30 - 31, 2011
Edinburgh, UK

The featured translation task of WMT11 is to translate Haitian Creole SMS messages into English. These text messages (SMS) were sent by people in Haiti in the aftermath of the January 2010 earthquake. The messages were sent to an emergency response and information service called "Mission 4636". They were originally written in Haitian Creole and were translated into English by a group of volunteers during the disaster response so that first responders (many of whom did not speak Haitian Creole) could understand and act on them. At the same time, volunteers were making maps of Haiti and helping to pinpoint the locations described in the messages. More than 30,000 messages were sent to the 4636 number. First responders used the volunteer-created translations and maps, and were able to act on the vast majority of requests for help.

Secretary of State Clinton described one success of the Mission 4636 program: "The technology community has set up interactive maps to help us identify needs and target resources. And on Monday, a seven-year-old girl and two women were pulled from the rubble of a collapsed supermarket by an American search-and-rescue team after they sent a text message calling for help." Ushahidi@Tufts described another: "The World Food Program delivered food to an informal camp of 2500 people, having yet to receive food or water, in Diquini to a location that 4636 had identified for them."

In this featured task, we will provide the Haitian Creole SMS messages along with the translations that the volunteers created. We have split the messages into training / dev / devtest / test sets, and have assembled additional out-of-domain parallel corpora.

GOALS

The goals of the Haitian Creole to English translation task are:

To focus researchers on the problems presented by low resource languages
To provide a real-world data set consisting of SMS messages, which contains noisy, abbreviated language
To develop techniques for building translation systems that will be useful in future crises

We hope that both beginners and established research groups will participate in this task.

TASK DESCRIPTION

We provide data for translating Haitian Creole SMS messages. You may use any of the resources from the standard translation task. The goal is to improve the quality of translating noisy data in a low resource language. You might consider:

Doing automated cleaning of the raw (noisy) SMS data in the training set.
Trying to map out-of-vocabulary Haitian words onto French and then using a French-English model to translate the unknown word.
Incorporating morphological and/or syntactic models to better cope with the low resource language pair.
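
As a starting point for the first two suggestions, here is a minimal sketch in Python. It is not part of the task tooling: the function names `clean_sms` and `oov_words`, the particular normalization rules, and the example message are all assumptions chosen for illustration. It applies a few generic normalizations to a raw message and then lists the tokens that are missing from a known vocabulary, which are the candidates one might try to map onto French.

```python
import re
import unicodedata


def clean_sms(text: str) -> str:
    """Apply a few generic normalizations to one raw SMS message (illustrative rules)."""
    text = unicodedata.normalize("NFC", text)        # unify accent encodings
    text = re.sub(r"([^\W\d_])\1{2,}", r"\1", text)  # letters repeated 3+ times -> one
    text = re.sub(r"\s+", " ", text)                 # collapse runs of whitespace
    return text.strip().lower()


def oov_words(sentence: str, vocab: set[str]) -> list[str]:
    """Return the word tokens of a sentence that are absent from the vocabulary."""
    return [tok for tok in re.findall(r"\w+", sentence) if tok not in vocab]


# Hypothetical example message and vocabulary, for illustration only.
message = clean_sms("  Tanpriiii   ede  nou!!! ")
unknown = oov_words(message, vocab={"ede", "nou"})
```

Only letters are collapsed, so digit runs such as phone numbers or quantities are left untouched. The vocabulary would normally be built from the training data; what to do with each unknown token (for example, a French lookup) is left open here.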

Participants will use their systems to translate two test sets consisting of 849 unseen Haitian Creole SMS messages. One of the test sets contains the "raw" SMS messages, and the other contains messages that were cleaned up by human post-editors. Translation quality will be measured by a manual evaluation and by various automatic evaluation metrics. We hope that the difference in performance on the raw vs. cleaned test sets will highlight the importance of handling noisy input data.

TRAINING DATA

We provide the following data:

Training set	parallel sentences	words per lang	Comments / source
In-domain SMS data	17,192	35k	This data consists primarily of raw (noisy) SMS data. Courtesy of Mission 4636.
Medical domain	1,619	10k	Courtesy of CMU.
Newswire domain	13,517	30k	Courtesy of CMU.
Glossary	35,728	85k	Courtesy of CMU.
Wikipedia parallel sentences	8,476	90k	Data automatically extracted from Wikipedia. Courtesy of MSR.
Wikipedia named entities	10,499	25k	Courtesy of MSR.
The Bible	30,715	850k	Courtesy of MSR.
Haitisurf dictionary	3,763	4k	Courtesy of Haitisurf.com (with assistance from MSR).
Krengle dictionary	1,687	3k	Courtesy of Krengle.net (with assistance from MSR).
Krengle sentences	658	3k	Courtesy of Krengle.net (with assistance from MSR).

Please note: We have anonymized the SMS messages, but in some cases the anonymization may be incorrect or incomplete. Since this is the first release of this data, we are going to control the release a little more closely and ask researchers participating in WMT11 to help identify messages that need to be anonymized. To receive the data, sign up for a GitHub account and send your username to Chris Callison-Burch (ccb@cs.jhu.edu).

If you find additional Haitian Creole training data we ask that you add it to the git repository.

In addition to this data, you may use any of the data provided in the standard translation task. You are also welcome to use any linguistic tools such as taggers, parsers, or morphological analyzers.

DEVELOPMENT DATA

Development set	parallel sentences	words per lang	Comments
SMS dev clean	925	12k	This set of SMS data was manually cleaned.
SMS dev raw	925	12k	This set of SMS data was not manually cleaned. It is parallel to the clean set: the messages are the same, but in their original, noisy form.
SMS devtest clean	925	19k	This set of SMS data was manually cleaned.
SMS devtest raw	925	19k	This set of SMS data was not manually cleaned. It is parallel to devtest clean, but contains the uncleaned SMS messages.
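
Because each raw set is parallel to its clean counterpart, one can get a rough picture of how much the manual cleaning changed. The sketch below is only an illustration: the function name `cleaning_stats` is hypothetical, it assumes the two sets have been loaded as line-aligned lists with one message per line, and the sample messages are made up rather than taken from the data.

```python
from difflib import SequenceMatcher


def cleaning_stats(raw_lines, clean_lines):
    """Compare line-aligned raw and cleaned messages.

    Returns a pair: (fraction of messages whose tokens were changed by
    cleaning, mean token-level similarity between 0 and 1).
    """
    if len(raw_lines) != len(clean_lines):
        raise ValueError("raw and clean sets must be line-aligned")
    if not raw_lines:
        raise ValueError("no messages to compare")

    changed = 0
    similarity = 0.0
    for raw, clean in zip(raw_lines, clean_lines):
        raw_tok, clean_tok = raw.split(), clean.split()
        changed += raw_tok != clean_tok  # True counts as 1
        similarity += SequenceMatcher(None, raw_tok, clean_tok).ratio()

    n = len(raw_lines)
    return changed / n, similarity / n


# Made-up messages standing in for a raw set and its cleaned counterpart.
raw = ["mwen bezwen dlo", "nou bzwen dlo"]
clean = ["mwen bezwen dlo", "nou bezwen dlo"]
changed_fraction, mean_similarity = cleaning_stats(raw, clean)
```

The similarity here is whitespace-token based, so it says nothing about translation quality; it only indicates how far the raw input is from its cleaned form.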

DOWNLOAD

To download the data, sign up for a GitHub account and send your username to Chris Callison-Burch (ccb@cs.jhu.edu).

EVALUATION

Evaluation will be done both automatically and by human judgment.

Manual Scoring: We will collect subjective judgments about translation quality from human annotators. If you participate in the shared task, we ask you to commit about 8 hours of time to do the manual evaluation. The evaluation will be done with an online tool.
As in previous years, we expect the translated submissions to be recased, detokenized, and in XML format, just as in most other translation campaigns (NIST, TC-Star).

DATES

Release of training data	February 4, 2011
Test set distributed for translation task	March 14, 2011
Submission deadline for translation task	March 18, 2011
Paper due date	May 19, 2011

OTHER REQUIREMENTS

You are invited to submit a report about your approach. Your submission report should highlight in which ways your own methods and data differ from the standard approaches.

As with the other tasks, participants agree to contribute about eight hours of work to the manual evaluation.

ACKNOWLEDGEMENTS

We thank Rob Munro and Mission 4636 for providing this unique data for scientific study. We thank the Microsoft Translator team at Microsoft Research (especially Will Lewis) for sponsoring the Haitian Creole-English translation task. They generously provided cleaned and re-translated SMS content, negotiated on our behalf for additional data that could be used for the workshop, and helped define the scope of the task. Thanks to CMU for providing further training data.

supported by the EuroMatrixPlus project
P7-IST-231720-STP
funded by the European Commission
under Framework Programme 7