European Parliament Proceedings Parallel Corpus 1996-2006
For a detailed description of this corpus, please read:
Europarl: A Parallel Corpus for Statistical Machine Translation, Philipp Koehn, MT Summit 2005, pdf.
Please cite the paper, if you use this corpus in your work.
See also the extended (but earlier) version of the report (ps, pdf).
The Europarl parallel corpus is extracted from the proceedings of the European
Parliament. It includes versions in 11 European languages: Romance (French,
Italian, Spanish, Portuguese), Germanic (English, Dutch, German,
Danish, Swedish), Greek and Finnish.
The goal of the extraction and processing was to generate sentence-aligned text
for statistical machine translation systems. For this purpose, we extracted
matching items and labeled them with corresponding document IDs. Using a
preprocessor, we identified sentence boundaries. We sentence-aligned the data
using a tool based on the Gale and Church algorithm.
Release v3
On 28 September 2007 we released a further expanded and improved version of the corpus. Previous versions are available here. The corpus is released as a source release with the document files and a sentence aligner, and parallel corpora of language pairs that include English.
Changes since v2
added 10/2003 - 10/2006 data, now up to 44 million words per language
all data is released in UTF-8 encoding
some data now includes mark-up information on the text's original language
data previously in the wrong language has been detected and removed
aligned data is not tokenized, but tokenizer is provided
further refined preprocessing
All formats contain document (<CHAPTER id>), speaker (<SPEAKER id name language>),
and paragraph (<P>) mark-up on a separate line. The data is stored in one file per day.
Some documents have the SPEAKER tag attribute LANGUAGE which indicates what language the original speaker was using.
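The mark-up lines described above can be separated from the text with a few lines of Python. This is a minimal sketch, not part of the released tools: the exact attribute syntax (for example <SPEAKER ID=3 NAME="Jane Doe" LANGUAGE="DE">) is an assumption based on the description above, and parse_markup is a hypothetical helper name.

```python
import re

# Assumed tag syntax, e.g. <SPEAKER ID=3 NAME="Jane Doe" LANGUAGE="DE">.
# Check a few real files before relying on these patterns.
TAG_RE = re.compile(r"^<(CHAPTER|SPEAKER|P)\b([^>]*)>\s*$")
ATTR_RE = re.compile(r'(\w+)=(?:"([^"]*)"|(\S+))')


def parse_markup(line):
    """Return (tag, attributes) for a mark-up line, or None for a text line."""
    match = TAG_RE.match(line.strip())
    if match is None:
        return None
    tag, rest = match.groups()
    attrs = {}
    for name, quoted, bare in ATTR_RE.findall(rest):
        # Exactly one of quoted/bare is filled in; the other is "".
        attrs[name.upper()] = quoted if bare == "" else bare
    return tag, attrs
```

Since the LANGUAGE attribute is only present in some documents, callers would look it up with attrs.get("LANGUAGE") rather than indexing directly.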
To use the parallel corpora with tools like Giza++, you want to:
tokenize the text (recommended)
lowercase the text (recommended)
strip empty lines and their correspondences (highly recommended)
remove lines with XML-Tags (starting with "<") (required)
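The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the provided tooling: the regex tokenizer is a crude stand-in for the tokenizer shipped with the corpus, and clean_pair is a hypothetical helper. The key point is that a line is dropped from both sides at once, so the two files stay aligned line by line.

```python
import re

# Crude stand-in for the tokenizer shipped with the corpus tools:
# it only separates punctuation from words.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(line):
    return " ".join(TOKEN_RE.findall(line))


def clean_pair(src_lines, tgt_lines):
    """Return cleaned (source, target) line lists of equal length."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("aligned files must have the same number of lines")
    src_out, tgt_out = [], []
    for src, tgt in zip(src_lines, tgt_lines):
        src, tgt = src.strip(), tgt.strip()
        # Required: drop lines with XML tags (starting with "<").
        if src.startswith("<") or tgt.startswith("<"):
            continue
        # Highly recommended: drop the pair if either side is empty.
        if not src or not tgt:
            continue
        # Recommended: tokenize and lowercase.
        src_out.append(tokenize(src).lower())
        tgt_out.append(tokenize(tgt).lower())
    return src_out, tgt_out
```

For real use, the regex tokenizer should be replaced by the tokenizer provided with the release, since language-specific rules (abbreviations, apostrophes) matter for alignment quality.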
Download
source release (text files with preprocessing tools and sentence aligner), 783 MB
tools (preprocessing tools and sentence aligner only), 8.0 KB
Sizes for single-language data after tokenizing and removing XML.
Language      Sentences    Words
Danish        1,563,012    37,467,445
German        1,517,987    37,614,344
Greek           962,820    26,306,875
English       1,461,429    39,618,240
Spanish       1,476,106    41,408,300
Finnish       1,407,544    26,413,278
French        1,487,459    44,688,872
Italian       1,405,282    39,504,158
Dutch         1,616,104    39,778,617
Portuguese    1,441,203    40,862,310
Swedish       1,475,195    33,407,005
Sizes for parallel corpora after sentence aligning, tokenizing, and removing XML.
Parallel Corpus (L1-L2)    Sentences    L1 Words      L2 Words
Danish-English             1,304,947    34,169,707    36,225,880
German-English             1,313,096    34,700,362    36,663,083
Greek-English                662,090    18,834,758    18,827,241
Spanish-English            1,304,116    37,870,751    36,429,274
Finnish-English            1,257,720    24,895,790    34,802,617
French-English             1,334,080    41,573,117    37,436,222
Italian-English            1,251,315    36,411,166    36,510,033
Dutch-English              1,326,412    36,784,168    36,690,392
Portuguese-English         1,287,757    37,342,426    36,355,907
Swedish-English            1,164,536    28,882,142    32,053,628
Test Sets
Several test sets have been released for the Europarl corpus.
In general, the Q4/2000 portion of the data (2000-10 to 2000-12) should be reserved for
testing. All released test sets have been selected from this quarter. The shared tasks for the 2006 and 2007 ACL Workshops on
Statistical Machine Translation provide test sets from the Europarl corpus.
The original common test set from the Koehn/Och/Marcu ACL 2003 Paper is available in the archives.
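Because the data is stored in one file per day, the Q4/2000 portion can be held out by file name. The sketch below assumes file names of the form ep-YY-MM-DD.txt, which is an assumption and should be checked against the actual release; is_test_file and split_files are hypothetical helper names.

```python
import re

# Assumed file name pattern: ep-YY-MM-DD, optionally followed by a suffix.
DATE_RE = re.compile(r"^ep-(\d{2})-(\d{2})-(\d{2})")


def is_test_file(filename):
    """True if the file falls in 2000-10 to 2000-12."""
    match = DATE_RE.match(filename)
    if match is None:
        return False
    year, month, _day = (int(part) for part in match.groups())
    return year == 0 and 10 <= month <= 12


def split_files(filenames):
    """Split file names into (train, test) lists, each sorted by name."""
    train, test = [], []
    for name in sorted(filenames):
        (test if is_test_file(name) else train).append(name)
    return train, test
```

Files whose names do not match the assumed pattern end up in the training list, so it is worth inspecting that list for unexpected entries.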
Known Bugs
Some special HTML entities and noisy characters are not
removed from the data.
Some recent Greek data has only parts of transcripts in the files.
Terms of Use
We are not aware of any copyright restrictions on the material.
If you use this data in your research, please contact
pkoehn@inf.ed.ac.uk.
Please let us know if you find problems with the data
or if you want the data for other language pairs.
We recommend using the last quarter of 2000 for testing
(2000-10 until 2000-12) for consistency in reporting
research results on this data.
Acknowledgments
Version 3 of this corpus was prepared by Cameron Shaw Fordyce (CELCT), Josh Schroeder, and Philipp Koehn (both University of Edinburgh). The work was in part supported by the EuroMatrix project funded by the European Commission (6th Framework Programme).