European Parliament Proceedings Parallel Corpus 1996-2003


This page contains information on previous releases of the Europarl corpus. Most users will want to look at the current data instead.

Version 1, the original release, contains data from April 1996 to December 2001.

Version 2 adds January 2002 to September 2003.

Unlike the current release, v1 and v2 are not encoded in UTF-8. All languages except Greek use ISO-8859-1 (Latin-1) encoding; the Greek data uses ISO-8859-7.
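For tools that expect UTF-8, the older files need to be transcoded first. The following is a minimal sketch of that step, not part of the release: the function names are hypothetical, and it assumes that the Greek data is identified by the language code "el".

```python
def source_encoding(language_code: str) -> str:
    """Return the v1/v2 encoding for a language, as described above."""
    # Assumption: Greek files are identified by the language code "el".
    return "iso-8859-7" if language_code == "el" else "iso-8859-1"


def transcode_to_utf8(raw: bytes, language_code: str, errors: str = "strict") -> bytes:
    """Decode bytes from the legacy encoding and re-encode them as UTF-8."""
    # ISO-8859-7 leaves a few byte values undefined, so noisy input can raise
    # UnicodeDecodeError; pass errors="replace" to substitute U+FFFD instead.
    text = raw.decode(source_encoding(language_code), errors=errors)
    return text.encode("utf-8")
```

A caller would read each file in binary mode, pass the bytes through transcode_to_utf8, and write the result to a new file, leaving the original untouched.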


Release v2

On 4 December 2003 we released an extended and improved version of the corpus. Most of what is written below for Version 1 still applies.

Changes

The corpus is distributed as a source release, containing the document files and a sentence aligner, and as parallel corpora for language pairs that include English.

To use the parallel corpora with tools like Giza++, you want to

Download


Release v1

The goal of the processing was to generate sentence-aligned text for statistical machine translation systems. For this purpose we extracted matching items and labeled them with corresponding document IDs. Using a preprocessor, we separated out punctuation and identified sentence boundaries. We sentence-aligned the data using a tool based on the Gale and Church algorithm.
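To illustrate the idea behind length-based sentence alignment, here is a simplified sketch. It is not the aligner shipped with the release. It keeps the dynamic program over alignment "beads" (1-1, 1-0, 0-1, 2-1, 1-2, 2-2) but replaces the probabilistic length model of the published algorithm with a plain length-difference cost. The penalty values and function names are assumptions chosen for illustration.

```python
import math

# Bead types as (source sentences, target sentences) with a fixed penalty each.
# Assumption: illustrative values, not those used by the released aligner.
BEADS = {(1, 1): 0.0, (1, 0): 4.5, (0, 1): 4.5,
         (2, 1): 2.3, (1, 2): 2.3, (2, 2): 4.4}


def length_cost(src_len, tgt_len):
    """Penalise beads whose two sides differ in character length."""
    if src_len == 0 and tgt_len == 0:
        return 0.0
    return abs(src_len - tgt_len) / math.sqrt(src_len + tgt_len)


def align(src, tgt):
    """Return the cheapest sequence of (source_sentences, target_sentences) beads."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    # Forward dynamic program: cost[i][j] is the cheapest way to align
    # the first i source sentences with the first j target sentences.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                s_len = sum(len(s) for s in src[i:ni])
                t_len = sum(len(t) for t in tgt[j:nj])
                c = cost[i][j] + penalty + length_cost(s_len, t_len)
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)
    # Follow the back-pointers from the end to recover the beads in order.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    beads.reverse()
    return beads
```

Called on two lists of sentences from the same document, align returns a list of beads, where each bead pairs zero, one, or two source sentences with zero, one, or two target sentences.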

Size of the corpus

Version 1.1 covers April 1996 to December 2001. It contains roughly 20 million words in 740,000 sentences per language.

Formats

The data is available in two formats. Both contain document (<CHAPTER id>), speaker (<SPEAKER id name>), and paragraph (<P>) mark-up, each on a separate line. The data is stored in one file per day.
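Because each mark-up tag sits on its own line, text and mark-up can be separated line by line. The sketch below shows one way to do that. The exact attribute syntax inside the tags is not specified here, so the pattern only checks the tag name and accepts any attributes; the function name and the sample tag spelling in the comments are assumptions.

```python
import re

# Matches a line consisting solely of one CHAPTER, SPEAKER, or P tag.
# Assumption: attributes (e.g. an ID or NAME) may follow the tag name in any form.
MARKUP = re.compile(r"^<(CHAPTER|SPEAKER|P)\b[^>]*>\s*$")


def split_markup(lines):
    """Yield (kind, line) pairs; kind is 'CHAPTER', 'SPEAKER', 'P', or 'TEXT'."""
    for line in lines:
        line = line.rstrip("\n")
        match = MARKUP.match(line)
        if match:
            yield match.group(1), line
        elif line.strip():
            # Any other non-empty line is treated as corpus text.
            yield "TEXT", line
```

Keeping only the pairs whose kind is "TEXT" yields the plain sentences of a file, while the "CHAPTER" and "SPEAKER" pairs can be used to group that text by document or by speaker.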

Download

Currently available for download:


Test Sets

This common test set was used in the Koehn/Och/Marcu ACL 2003 paper. It is taken from the Q4/2000 portion of the data (2000-10 to 2000-12), with the other parts used for training.
  • common test set (7MB).

This is a superset of that test set, with true-casing:

  • common test set 2 (14MB).

Known Bugs

Some special HTML entities and noisy characters are not removed from the data.

Terms of Use

We are not aware of any copyright restrictions on the material. If you use this data in your research, please contact pkoehn@inf.ed.ac.uk. Please let us know if you find problems with the data or if you want the data for other language pairs. For consistency in reporting research results on this data, we recommend using the last quarter of 2000 (2000-10 to 2000-12) for testing.
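Since the data is stored in one file per day, the recommended split can be applied by date. The sketch below assigns a sitting date to the test or training portion. How the date is obtained from a file name is left to the caller, because the naming scheme is not described on this page; the function and constant names are hypothetical.

```python
from datetime import date

# Recommended test period: the last quarter of 2000 (2000-10 to 2000-12).
TEST_START = date(2000, 10, 1)
TEST_END = date(2000, 12, 31)


def split_for(day: date) -> str:
    """Return 'test' for days in the recommended test period, else 'train'."""
    return "test" if TEST_START <= day <= TEST_END else "train"
```

For example, split_for(date(2000, 11, 15)) falls in the test portion, while any day before October 2000 or after December 2000 falls in the training portion.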