European Parliament Proceedings Parallel Corpus 1996-2009


For a detailed description of this corpus, please read:

Europarl: A Parallel Corpus for Statistical Machine Translation, Philipp Koehn, MT Summit 2005, pdf.

Please cite the paper if you use this corpus in your work. See also the extended (but earlier) version of the report (ps, pdf).

The Europarl parallel corpus is extracted from the proceedings of the European Parliament. It includes versions in 11 European languages: Romance (French, Italian, Spanish, Portuguese), Germanic (English, Dutch, German, Danish, Swedish), Greek and Finnish.

The goal of the extraction and processing was to generate sentence-aligned text for statistical machine translation systems. For this purpose we extracted matching items and labeled them with corresponding document IDs. Using a preprocessor, we identified sentence boundaries. We then sentence-aligned the data using a tool based on the Gale and Church algorithm.
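
To illustrate the idea behind length-based sentence alignment, here is a minimal Python sketch that aligns two sentence lists by dynamic programming over character lengths. It is not the aligner shipped with the corpus: the cost function is a simplified relative length mismatch rather than the probabilistic cost of the published algorithm, and the names BEADS, length_cost and align, as well as the penalty values, are assumptions made for this example.

```python
from typing import List, Tuple

# Allowed alignment "beads": (source sentences used, target sentences used),
# each with a fixed penalty. The penalty values are illustrative assumptions.
BEADS = {
    (1, 1): 0.0,  # one sentence matches one sentence
    (1, 0): 3.0,  # source sentence with no counterpart
    (0, 1): 3.0,  # target sentence with no counterpart
    (2, 1): 1.0,  # two source sentences merged into one target sentence
    (1, 2): 1.0,  # one source sentence split into two target sentences
}


def length_cost(src_len: int, tgt_len: int) -> float:
    """Simplified mismatch cost: relative difference of character lengths."""
    if src_len == 0 and tgt_len == 0:
        return 0.0
    return abs(src_len - tgt_len) / max(src_len, tgt_len)


def align(src: List[str], tgt: List[str]) -> List[Tuple[List[int], List[int]]]:
    """Return a minimum-cost list of (source indices, target indices) beads."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                s_len = sum(len(s) for s in src[i:ni])
                t_len = sum(len(t) for t in tgt[j:nj])
                c = cost[i][j] + penalty + length_cost(s_len, t_len)
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)
    # Follow the back-pointers from the end to recover the beads in order.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    beads.reverse()
    return beads
```

Calling align(src, tgt) on two lists of sentences from matching documents returns index groups, so a bead such as ([1], [1, 2]) means that source sentence 1 corresponds to target sentences 1 and 2 together.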


Release v6

On 4 February 2011 we released a further expanded and improved version of the corpus. Previous versions are available here. The corpus is released as a source release with the document files and a sentence aligner, and parallel corpora of language pairs that include English.

Changes since v5

All formats contain document (<CHAPTER id>), speaker (<SPEAKER id name language>), and paragraph (<P>) mark-up, each on a separate line. The data is stored in one file per day, and in smaller units for newer data.

Some documents have the SPEAKER tag attribute LANGUAGE which indicates what language the original speaker was using.
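
Because each mark-up tag sits on its own line, the files can be scanned line by line. The Python sketch below recognizes the three tags and collects their attributes. The exact attribute syntax (for example ID=2 or NAME="...") is an assumption based on the description above, the function name parse_markup is hypothetical, and the speaker name in the usage note is made up.

```python
import re
from typing import Dict, Optional, Tuple

# A mark-up line holds exactly one CHAPTER, SPEAKER or P tag.
TAG_RE = re.compile(r"^<(CHAPTER|SPEAKER|P)\b([^>]*)>\s*$")
# Attributes are assumed to look like KEY=value or KEY="quoted value".
ATTR_RE = re.compile(r'(\w+)=(?:"([^"]*)"|(\S+))')


def parse_markup(line: str) -> Optional[Tuple[str, Dict[str, str]]]:
    """Return (tag, attributes) for a mark-up line, or None for ordinary text."""
    match = TAG_RE.match(line.strip())
    if match is None:
        return None
    tag, raw_attrs = match.groups()
    attrs = {}
    for name, quoted, bare in ATTR_RE.findall(raw_attrs):
        attrs[name.upper()] = quoted if quoted else bare
    return tag, attrs
```

For example, parse_markup('<SPEAKER ID=2 NAME="Jane Doe" LANGUAGE="EN">') yields the tag SPEAKER together with its ID, NAME and LANGUAGE attributes, while a line of ordinary text yields None.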

To use the parallel corpora with tools like GIZA++, you want to:
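
The individual steps are not reproduced here. As a rough sketch only, and assuming that preparation includes dropping mark-up lines and dropping sentence pairs in which one side is empty, a clean-up pass in Python could look like the following. The function name clean_pairs is hypothetical, and tokenization is left to a separate tool.

```python
from typing import Iterable, Iterator, Tuple


def clean_pairs(src_lines: Iterable[str],
                tgt_lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield aligned sentence pairs, skipping mark-up lines and empty pairs."""
    for src, tgt in zip(src_lines, tgt_lines):
        src, tgt = src.strip(), tgt.strip()
        # Mark-up such as <CHAPTER ...>, <SPEAKER ...> or <P> starts with "<".
        if src.startswith("<") or tgt.startswith("<"):
            continue
        # Drop the pair when either side is empty, so both sides stay in step.
        if not src or not tgt:
            continue
        yield src, tgt
```

Both inputs are expected to be line-aligned, that is, line k of one file corresponds to line k of the other.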

Download


Size of the Corpus

Sizes for single-language data after tokenizing and removing XML.

Language    Sentences       Words
Bulgarian     229,649           -
Czech         479,636  10,770,230
Danish      2,117,839  49,615,228
German      1,985,560  48,648,697
Greek       1,344,198           -
English     2,032,006  54,720,731
Spanish     1,942,761  55,105,479
Estonian      493,198   9,455,337
Finnish     1,929,054  35,799,132
French      2,002,266  57,860,307
Hungarian     479,676  10,601,411
Italian     1,905,555  52,306,430
Lithuanian    493,204   9,731,052
Latvian       473,276  10,024,350
Dutch       2,147,195  53,459,456
Polish        387,537   8,142,067
Portuguese  1,942,700  53,799,459
Romanian      224,805   5,891,952
Slovak        487,416  10,783,688
Slovene       465,985  10,616,127
Swedish     2,037,945  45,562,972

Sizes for parallel corpora after sentence aligning, tokenizing, and removing XML.

Parallel Corpus (L1-L2)  Sentences    L1 Words  English Words
Bulgarian-English          226,768           -      6,011,944
Czech-English              462,351  10,573,983     12,296,772
Danish-English           1,785,775  46,102,455     48,833,481
German-English           1,739,154  45,607,269     47,978,832
Greek-English            1,064,544           -     30,325,647
Spanish-English          1,786,594  51,551,485     49,411,045
Estonian-English           469,622   9,318,986     12,452,336
Finnish-English          1,742,553  34,123,013     47,601,416
French-English           1,825,077  54,568,499     50,551,047
Hungarian-English          455,270  10,429,935     12,111,122
Italian-English          1,737,081  49,065,283     49,981,015
Lithuanian-English         456,796   9,489,997     12,144,335
Latvian-English            453,879   9,854,124     12,051,769
Dutch-English            1,822,036  50,315,412     49,938,127
Polish-English             448,433  10,317,697     11,910,117
Portuguese-English       1,783,437  50,267,741     49,634,127
Romanian-English           222,854   5,866,203      5,908,150
Slovak-English             460,780  10,602,998     12,228,702
Slovene-English            456,818  10,475,913     12,121,729
Swedish-English          1,678,333  41,031,740     45,628,613


Test Sets

Several test sets have been released for the Europarl corpus. In general, the Q4/2000 portion of the data (2000-10 to 2000-12) should be reserved for testing. All released test sets have been selected from this quarter. The shared tasks for the 2006 and 2007 ACL Workshops on Statistical Machine Translation provide test sets from the Europarl corpus.
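
Since the data is stored in one file per day, the test quarter can be held out by file date. The following Python sketch assumes a hypothetical file naming scheme of the form ep-YY-MM-DD (for example ep-00-10-09.txt). That pattern and the function names file_date and is_test_file are assumptions for illustration and should be adapted to the actual file names in the release.

```python
import re
from datetime import date
from typing import Optional

# Hypothetical naming scheme: "ep-" followed by two-digit year, month and day.
FILENAME_RE = re.compile(r"^ep-(\d{2})-(\d{2})-(\d{2})")


def file_date(filename: str) -> Optional[date]:
    """Extract the session date from a file name, or None if it does not match."""
    match = FILENAME_RE.match(filename)
    if match is None:
        return None
    yy, mm, dd = (int(group) for group in match.groups())
    # The corpus starts in 1996, so 96-99 map to 19xx and the rest to 20xx.
    year = 1900 + yy if yy >= 96 else 2000 + yy
    return date(year, mm, dd)


def is_test_file(filename: str) -> bool:
    """True if the file falls in the reserved test quarter (2000-10 to 2000-12)."""
    day = file_date(filename)
    return day is not None and day.year == 2000 and day.month in (10, 11, 12)
```

Filtering a directory listing with is_test_file then splits the files into a held-out test portion and the remaining training data.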

The original common test set from the Koehn/Och/Marcu ACL 2003 Paper is available in the archives.

Extended versions of these test sets are available in the Evaluation Matrix of the EuroMatrix project.

Known Bugs

Terms of Use

We are not aware of any copyright restrictions on the material. If you use this data in your research, please contact pkoehn@inf.ed.ac.uk. Please let us know if you find problems with the data or if you want the data for other language pairs. For consistency in reporting research results on this data, we recommend using the last quarter of 2000 (2000-10 to 2000-12) for testing.

Acknowledgments

The work was in part supported by the EuroMatrixPlus project funded by the European Commission (7th Framework Programme).