June 29-30, 2005

Workshop Program

The goal of this workshop is to provide a forum for researchers working on problems related to the creation and use of parallel text. Recent events have demonstrated once again the importance of inter-language communication across a broad range of languages. This reinforces the need for advances in machine translation (MT) and multi-lingual processing tools, especially for languages with scarce resources.

This is a two-day workshop featuring two tracks:

  1. Building and Using Parallel Texts for Languages with Scarce Resources (day 1)
  2. Exploiting Parallel Texts for Statistical Machine Translation (day 2)
Each track includes a shared task that lets participants compare their results on common data. Although participation is not required, we encourage authors to take part in the shared tasks for benchmarking purposes.



The aim of the first track is to bring together researchers who study the creation and use of parallel corpora for minority languages. The track will therefore center on issues related to the manual or automatic collection of parallel corpora, studies of the "import" of knowledge from a well-studied language via parallel alignments, and evaluations of the quality of collected corpora or of the tools derived from these corpora.

We invite submissions of papers addressing any of the following issues:

While we invite submissions addressing any of the above topics, or related issues, we particularly welcome work involving parallel corpora addressing languages with scarce resources.

Shared task

In addition to regular paper presentations, the track will include a shared task for the evaluation of word alignment techniques. Word alignment is an important step in exploiting parallel corpora, and yet there is no common evaluation framework for word alignment systems. The shared task follows on the success of the word alignment task that took place as part of the NAACL 2003 workshop on parallel text. This year's edition differs in that it will focus on Inuktitut-English and Romanian-English alignment. This choice fits the theme of our track, since neither Inuktitut nor Romanian is a widely studied language, and relatively few online resources and tools are available for either.

Teams that participate in the alignment exercise will be provided with training data for each language pair, along with development data taken from the gold standard data, in order to build their systems. Thereafter they will be provided with the unaligned gold standard data and asked to submit their proposed alignments within a short time frame. There will be two tracks for each language pair: one for teams that augment the training data with additional resources, and another for teams that use only the training data. The resulting alignments will be evaluated against the gold standard data prior to the workshop. Short papers describing the systems participating in this shared task and the evaluation methodologies employed will constitute a separate section in the workshop proceedings.
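This call does not state which scoring metrics will be used. Purely as an illustration, the sketch below assumes the precision, recall, and alignment error rate (AER) measures that are commonly reported for word alignment, with gold links divided into "sure" and "possible" sets. The function name, the link representation as (source index, target index) pairs, and the toy data are all hypothetical and are not part of the shared task definition.

```python
def alignment_scores(predicted, sure, possible=frozenset()):
    """Score predicted word links against gold-standard links.

    Each link is a (source_index, target_index) pair. This is an
    illustrative sketch, not the official shared task scorer.
    """
    predicted = set(predicted)
    sure = set(sure)
    # Assumption: every sure link also counts as a possible link.
    possible = set(possible) | sure

    hits_sure = len(predicted & sure)
    hits_possible = len(predicted & possible)

    # Precision is measured against possible links, recall against sure links.
    precision = hits_possible / len(predicted) if predicted else 0.0
    recall = hits_sure / len(sure) if sure else 0.0

    # AER = 1 - (|A & S| + |A & P|) / (|A| + |S|)
    denom = len(predicted) + len(sure)
    aer = 1.0 - (hits_sure + hits_possible) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "aer": aer}


# Toy sentence pair (hypothetical data): two sure links, one extra possible link.
gold_sure = {(0, 0), (1, 2)}
gold_possible = {(2, 1)}
system = {(0, 0), (1, 1), (2, 1)}
scores = alignment_scores(system, gold_sure, gold_possible)
```

For the toy example, two of the three proposed links are at least possible and one of the two sure links is found, which gives a precision of 2/3, a recall of 1/2, and an AER of 0.4.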

A more detailed description, training, development, and test data, and a number of other related resources will be made available from


The focus of the second track is the use of parallel corpora for machine translation.

Translating documents from foreign languages into English (or between any two languages) by computer is one of the oldest goals in computational linguistics. Now, armed with vast amounts of digitally available translated text and powerful computers, we are witnessing significant progress toward achieving that goal. Statistical methods allow the analysis of parallel text corpora and the automatic construction of machine translation systems. Already, for some language pairs such as Chinese-English or Arabic-English, statistical machine translation (SMT) systems built at research labs outperform commercial systems.

Recent experimentation has shown that the performance of SMT systems varies greatly with the source language. In this workshop we would like to encourage researchers to investigate ways to improve the performance of SMT systems for diverse languages, including morphologically complex languages (e.g., Finnish) and languages with partially free word order (e.g., German). These issues lie on the border of linguistic analysis and statistical modeling, and the ACL conference is the most appropriate forum to investigate them, as ACL has a long tradition of hosting high-quality research in both areas.

Topics of interest include, but are not limited to:

In addition to submissions on the topics listed above, this track of the workshop features a shared task, and we encourage participants to evaluate their approaches on that task. The shared task is to evaluate your approach to machine translation (see the list of topics of interest above) on the Europarl corpus.

A more detailed description of the shared task, the test and training corpora, a freely available MT system, and a number of other resources are available from


Submissions will consist of regular full papers of max. 8 pages, formatted following the ACL 2005 guidelines. Authors of regular full papers will be required to indicate a track for their submission. In addition, teams participating in the shared tasks will be invited to submit short papers (max. 4 pages) describing their systems. Both submission and review processes will be handled electronically.


Regular paper submissions: April 10
Shared task results submissions: April 10
Shared task short paper submissions: April 17
Notification (short and regular papers): May 4
Camera-ready papers: May 15


Philipp Koehn (University of Edinburgh)
Joel Martin (National Research Council of Canada)
Rada Mihalcea (University of North Texas)
Christof Monz (University of Maryland)
Ted Pedersen (University of Minnesota, Duluth)


For questions, comments, etc. please send email to


Lars Ahrenberg (Linkoping University)
Bill Byrne (University of Cambridge)
Chris Callison-Burch (University of Edinburgh)
Nicoletta Calzolari (University of Pisa)
Francisco Casacuberta (University of Valencia)
David Chiang (University of Maryland)
Mona Diab (Columbia University)
George Foster (National Research Council of Canada)
Alexander Fraser (ISI/University of Southern California)
Pascale Fung (Hong Kong University of Science and Technology)
Rob Gaizauskas (University of Sheffield)
Ulrich Germann (University of Toronto)
Dan Gildea (University of Rochester)
Jan Hajic (Charles University)
Andrew Hardie (University of Lancaster)
Rebecca Hwa (University of Pittsburgh)
Nancy Ide (Vassar College)
Kevin Knight (ISI/University of Southern California)
Greg Kondrak (University of Alberta)
Roland Kuhn (National Research Council of Canada)
Shankar Kumar (Johns Hopkins University)
Philippe Langlais (University of Montreal)
Alon Lavie (Carnegie Mellon University)
Lori Levin (Carnegie Mellon University)
Daniel Marcu (ISI/University of Southern California)
Tony McEnery (University of Lancaster)
Bridget McInnes (University of Minnesota)
Magnus Merkel (Linkoping University)
Bob Moore (Microsoft Research)
Hermann Ney (RWTH Aachen)
Maria das Gracas Volpe Nunes (University of Sao Paulo)
Franz-Josef Och (Google)
Kemal Oflazer (Sabanci University)
Miles Osborne (University of Edinburgh)
Andrei Popescu-Belis (University of Geneva)
Katharina Probst (Carnegie Mellon University)
Amruta Purandare (University of Pittsburgh)
Florence Reeder (MITRE)
Philip Resnik (University of Maryland)
Antonio Ribeiro (European Commission Joint Research Centre)
Michel Simard (Xerox)
Kevin Scannell (St. Louis University)
Libin Shen (University of Pennsylvania)
Eiichiro Sumita (ATR Spoken Language Translation Research Lab)
Joerg Tiedemann (University of Groningen)
Christoph Tillmann (IBM)
Hajime Tsukada (NTT Communication Science Laboratories)
Dan Tufis (Research Institute for AI of the Romanian Academy)
Jean Veronis (Universite de Provence)
Michelle Vanni (Army Research Lab)
Stephan Vogel (Carnegie Mellon University)
Clare Voss (Army Research Lab)
Taro Watanabe (ATR Spoken Language Translation Research Laboratories)
Dekai Wu (Hong Kong University of Science and Technology)