Europarl Parallel Corpus Archives

Prior Releases of the European Parliament Proceedings Parallel Corpus

This page contains information on previous releases of the Europarl corpus. Most users will want to look at the current data instead.

Version 1, the original release, contains data from April 1996 to December 2001.

Version 2 adds January 2002 to September 2003.

Unlike the current release, v1 and v2 are not in UTF-8. All languages excluding Greek are in ISO-8859-1 (Latin 1) encoding. Greek data is in ISO-8859-7.
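
For v1 and v2 data, a conversion to UTF-8 might look like the following minimal sketch. The function names and the use of the code `el` to identify Greek are assumptions made for illustration; they are not part of any release.

```python
def legacy_encoding(lang: str) -> str:
    """Return the legacy encoding of v1/v2 data for a language.

    Assumption: Greek is identified by the code "el"; every other
    language is treated as ISO-8859-1 (Latin 1).
    """
    return "iso-8859-7" if lang == "el" else "iso-8859-1"


def to_utf8(raw: bytes, lang: str) -> bytes:
    """Decode v1/v2 bytes with the legacy encoding and re-encode as UTF-8."""
    return raw.decode(legacy_encoding(lang)).encode("utf-8")
```

Applying the Latin 1 decoder to Greek data would not raise an error but would silently produce wrong characters, so the language has to be known before converting.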

Version 3 adds October 2003 to October 2006. All data is now in UTF-8.

Version 5 adds November 2007 to October 2009.

Version 6 adds November 2009 to December 2010.

Release v6

On 4 February 2011 we released a further expanded and improved version of the corpus. Previous versions are available here. The corpus is released as a source release with the document files and a sentence aligner, and parallel corpora of language pairs that include English.

Changes since v5

added 11/2009 - 12/2010 data, now up to around 50 million words per language
added corpora for 10 more official languages of more recent EU member countries (Bulgarian, Czech, Estonian, Hungarian, Latvian, Lithuanian, Polish, Romanian, Slovak, and Slovene), albeit smaller in size, from 01/2007
further refined preprocessing, cleaning

Download

source release (text files with preprocessing tools and sentence aligner), 1.3 GB
tools (preprocessing tools and sentence aligner only), 8.6 KB
parallel corpus Bulgarian-English, 23 MB, 01/2007-12/2010
parallel corpus Czech-English, 43 MB, 01/2007-12/2010
parallel corpus Danish-English, 164 MB, 04/1996-12/2010
parallel corpus German-English, 172 MB, 04/1996-12/2010
parallel corpus Greek-English, 125 MB, 04/1996-12/2010
parallel corpus Spanish-English, 170 MB, 04/1996-12/2010
parallel corpus Estonian-English, 41 MB, 01/2007-12/2010
parallel corpus Finnish-English, 163 MB, 01/1997-12/2010
parallel corpus French-English, 177 MB, 04/1996-12/2010
parallel corpus Hungarian-English, 43 MB, 01/2007-12/2010
parallel corpus Italian-English, 172 MB, 04/1996-12/2010
parallel corpus Lithuanian-English, 41 MB, 01/2007-12/2010
parallel corpus Latvian-English, 41 MB, 01/2007-12/2010
parallel corpus Dutch-English, 174 MB, 04/1996-12/2010
parallel corpus Polish-English, 42 MB, 01/2007-12/2010
parallel corpus Portuguese-English, 173 MB, 04/1996-12/2010
parallel corpus Romanian-English, 21 MB, 01/2007-12/2010
parallel corpus Slovak-English, 43 MB, 01/2007-12/2010
parallel corpus Slovene-English, 40 MB, 01/2007-12/2010
parallel corpus Swedish-English, 155 MB, 01/1997-12/2010

Size of the Corpus

Sizes for single-language data after tokenizing and removing XML.

Language	Sentences	Words
Bulgarian	229,649	-
Czech	479,636	10,770,230
Danish	2,117,839	49,615,228
German	1,985,560	48,648,697
Greek	1,344,198	-
English	2,032,006	54,720,731
Spanish	1,942,761	55,105,479
Estonian	493,198	9,455,337
Finnish	1,929,054	35,799,132
French	2,002,266	57,860,307
Hungarian	479,676	10,601,411
Italian	1,905,555	52,306,430
Lithuanian	493,204	9,731,052
Latvian	473,276	10,024,350
Dutch	2,147,195	53,459,456
Polish	387,537	8,142,067
Portuguese	1,942,700	53,799,459
Romanian	224,805	5,891,952
Slovak	487,416	10,783,688
Slovene	465,985	10,616,127
Swedish	2,037,945	45,562,972
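
The rows of such a table can be read programmatically, for example to derive the average sentence length per language. The sketch below assumes tab-separated rows as shown above, with `-` marking a missing word count; the function names are illustrative only.

```python
def parse_count(field: str):
    """Return an int for a value like '2,117,839', or None for the '-' placeholder."""
    field = field.strip()
    return None if field == "-" else int(field.replace(",", ""))


def words_per_sentence(row: str):
    """Split one tab-separated row and return (language, average words per sentence).

    The average is None when the word count is missing.
    """
    language, sentences, words = row.split("\t")
    n_sentences, n_words = parse_count(sentences), parse_count(words)
    average = None if n_words is None else n_words / n_sentences
    return language, average
```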

Sizes for parallel corpora after sentence aligning, tokenizing, and removing XML.

Parallel Corpus (L1-L2)	Sentences	L1 Words	English Words
Bulgarian-English	226,768	-	6,011,944
Czech-English	462,351	10,573,983	12,296,772
Danish-English	1,785,775	46,102,455	48,833,481
German-English	1,739,154	45,607,269	47,978,832
Greek-English	1,064,544	-	30,325,647
Spanish-English	1,786,594	51,551,485	49,411,045
Estonian-English	469,622	9,318,986	12,452,336
Finnish-English	1,742,553	34,123,013	47,601,416
French-English	1,825,077	54,568,499	50,551,047
Hungarian-English	455,270	10,429,935	12,111,122
Italian-English	1,737,081	49,065,283	49,981,015
Lithuanian-English	456,796	9,489,997	12,144,335
Latvian-English	453,879	9,854,124	12,051,769
Dutch-English	1,822,036	50,315,412	49,938,127
Polish-English	448,433	10,317,697	11,910,117
Portuguese-English	1,783,437	50,267,741	49,634,127
Romanian-English	222,854	5,866,203	5,908,150
Slovak-English	460,780	10,602,998	12,228,702
Slovene-English	456,818	10,475,913	12,121,729
Swedish-English	1,678,333	41,031,740	45,628,613

Release v5

On 20 January 2010 we released a further expanded and improved version of the corpus. The corpus is released as a source release with the document files and a sentence aligner, and parallel corpora of language pairs that include English.

Changes since v3 (v4 was only released partially for WMT 2009)

added 11/2007 - 10/2009 data, now up to 55 million words per language
further refined preprocessing, cleaning

Download

source release (text files with preprocessing tools and sentence aligner), 1616 MB
tools (preprocessing tools and sentence aligner only), 8.1 KB
parallel corpus Danish-English, 163 MB, 04/1996-10/2009
parallel corpus German-English, 164 MB, 04/1996-10/2009
parallel corpus Greek-English, 120 MB, 04/1996-10/2009
parallel corpus Spanish-English, 169 MB, 04/1996-10/2009
parallel corpus Finnish-English, 162 MB, 01/1997-10/2009
parallel corpus French-English, 176 MB, 04/1996-10/2009
parallel corpus Italian-English, 170 MB, 04/1996-10/2009
parallel corpus Dutch-English, 172 MB, 04/1996-10/2009
parallel corpus Portuguese-English, 172 MB, 04/1996-10/2009
parallel corpus Swedish-English, 153 MB, 01/1997-10/2009

Size of the Corpus

Sizes for single-language data after tokenizing and removing XML.

Language	Sentences	Words
Danish	2,009,958	47,305,502
German	1,822,735	44,688,020
Greek	1,257,518	-
English	1,891,918	50,978,295
Spanish	1,871,700	52,503,808
Finnish	1,834,727	34,106,317
French	1,904,613	55,088,177
Italian	1,827,091	50,161,729
Dutch	2,054,417	50,926,645
Portuguese	1,849,973	51,294,994
Swedish	1,936,391	43,291,692

Sizes for parallel corpora after sentence aligning, tokenizing, and removing XML.

Parallel Corpus (L1-L2)	Sentences	L1 Words	English Words
Danish-English	1,684,664	43,692,760	46,282,519
German-English	1,581,107	41,587,670	43,848,958
Greek-English	960,356	-	27,468,389
Spanish-English	1,689,850	48,860,242	46,843,295
Finnish-English	1,646,143	32,355,142	45,136,552
French-English	1,723,705	51,708,806	47,915,991
Italian-English	1,635,140	46,380,851	47,236,441
Dutch-English	1,715,710	47,477,378	47,166,762
Portuguese-English	1,681,991	47,621,552	47,000,805
Swedish-English	1,570,411	38,537,243	42,810,628

Release v3

On 28 September 2007 we released a further expanded and improved version of the corpus. The corpus is released as a source release with the document files and a sentence aligner, and parallel corpora of language pairs that include English.

Changes since v2

added 10/2003 - 10/2006 data, now up to 44 million words per language
all data is released in UTF-8 encoding
some data now includes mark-up information on the text's original language
data previously in the wrong language has been detected and removed
aligned data is not tokenized, but tokenizer is provided
further refined preprocessing

Download

source release (text files with preprocessing tools and sentence aligner), 783 MB
tools (preprocessing tools and sentence aligner only), 8.0 KB
parallel corpus Danish-English, 126 MB, 04/1996-10/2006
parallel corpus German-English, 136 MB, 04/1996-10/2006
parallel corpus Greek-English, 82 MB, 04/1996-10/2006
parallel corpus Spanish-English, 130 MB, 04/1996-10/2006
parallel corpus Finnish-English, 124 MB, 01/1997-10/2006
parallel corpus French-English, 136 MB, 04/1996-10/2006
parallel corpus Italian-English, 130 MB, 04/1996-10/2006
parallel corpus Dutch-English, 133 MB, 04/1996-10/2006
parallel corpus Portuguese-English, 132 MB, 04/1996-10/2006
parallel corpus Swedish-English, 114 MB, 01/1997-10/2006

Size of the Corpus

Sizes for single-language data after tokenizing and removing XML.

Language	Sentences	Words
Danish	1,563,012	37,467,445
German	1,517,987	37,614,344
Greek	962,820	26,306,875
English	1,461,429	39,618,240
Spanish	1,476,106	41,408,300
Finnish	1,407,544	26,413,278
French	1,487,459	44,688,872
Italian	1,405,282	39,504,158
Dutch	1,616,104	39,778,617
Portuguese	1,441,203	40,862,310
Swedish	1,475,195	33,407,005

Sizes for parallel corpora after sentence aligning, tokenizing, and removing XML.

Parallel Corpus (L1-L2)	Sentences	L1 Words	L2 Words
Danish-English	1,304,947	34,169,707	36,225,880
German-English	1,313,096	34,700,362	36,663,083
Greek-English	662,090	18,834,758	18,827,241
Spanish-English	1,304,116	37,870,751	36,429,274
Finnish-English	1,257,720	24,895,790	34,802,617
French-English	1,334,080	41,573,117	37,436,222
Italian-English	1,251,315	36,411,166	36,510,033
Dutch-English	1,326,412	36,784,168	36,690,392
Portuguese-English	1,287,757	37,342,426	36,355,907
Swedish-English	1,164,536	28,882,142	32,053,628

Release v2

On 4 December 2003 we released an extended and improved version of the corpus. Most of what is written below for Version 1 still applies.

Changes

added 1/2002 - 9/2003 data, now up to 28 million words per language
cleaned up preprocessing
ships with a sentence aligner that can create a parallel corpus for any pair of languages and lets you plug in your own tokenizer and sentence splitter

Download

source release (text files with sentence aligner), 559 MB
parallel corpus Danish-English, 99 MB, 04/1996-09/2003
parallel corpus German-English, 105 MB, 04/1996-09/2003
parallel corpus Greek-English, 75 MB, 04/1996-02/2002
parallel corpus Spanish-English, 101 MB, 04/1996-09/2003
parallel corpus Finnish-English, 91 MB, 01/1997-09/2003
parallel corpus French-English, 103 MB, 04/1996-09/2003
parallel corpus Italian-English, 101 MB, 04/1996-09/2003
parallel corpus Dutch-English, 102 MB, 04/1996-09/2003
parallel corpus Portuguese-English, 102 MB, 04/1996-09/2003
parallel corpus Swedish-English, 90 MB, 01/1997-09/2003

Release v1

The goal of the processing was to generate sentence-aligned text for statistical machine translation systems. For this purpose we extracted matching items and labeled them with corresponding document IDs. Using a preprocessor we separated out punctuation and identified sentence boundaries. We sentence-aligned the data using a tool based on the Church and Gale algorithm.
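
To illustrate the idea of length-based sentence alignment, here is a simplified sketch. It is not the tool shipped with the corpus: the original algorithm scores alignments with a probabilistic model of sentence lengths, whereas this version uses the absolute difference in character length plus a fixed penalty (an assumed value of 10) for anything other than a one-to-one match. The function name and parameters are hypothetical.

```python
def align(src, tgt, penalty=10):
    """Align two lists of sentences by character length using dynamic programming.

    Simplification: the cost of a bead is the absolute length difference,
    plus a fixed penalty when the bead is not 1-1.
    """
    # Allowed beads: (number of source sentences, number of target sentences).
    beads = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0

    # Forward pass: extend every reachable state by each allowed bead.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj in beads:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                len_src = sum(len(s) for s in src[i:ni])
                len_tgt = sum(len(t) for t in tgt[j:nj])
                extra = 0 if (di, dj) == (1, 1) else penalty
                candidate = cost[i][j] + abs(len_src - len_tgt) + extra
                if candidate < cost[ni][nj]:
                    cost[ni][nj] = candidate
                    back[ni][nj] = (i, j)

    # Trace back from the end to recover the chosen beads.
    result = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        result.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    result.reverse()
    return result
```

Each element of the returned list pairs a group of source sentences with a group of target sentences; either group may be empty when a sentence has no counterpart.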

Size of the corpus

Version 1.1 covers April 1996 to December 2001. It contains roughly 20 million words in 740,000 sentences per language.

Download

Currently available for download:

Danish-English: document aligned (80MB), sentence aligned (74MB).
German-English: document aligned (77MB), sentence aligned (70MB).
Greek-English: document aligned (80MB), sentence aligned (67MB).
Spanish-English: document aligned (83MB), sentence aligned (75MB).
Finnish-English: document aligned (65MB), sentence aligned (60MB).
French-English: document aligned (76MB), sentence aligned (70MB).
Dutch-English: document aligned (82MB), sentence aligned (74MB).
Italian-English: document aligned (81MB), sentence aligned (73MB).
Portuguese-English: document aligned (76MB), sentence aligned (69MB).
Swedish-English: document aligned (71MB), sentence aligned (61MB).

Test Sets

This common test set was used in the Koehn/Och/Marcu ACL 2003 paper. It is taken from the Q4/2000 portion of the data (2000-10 to 2000-12), with the other parts used for training.

common test set (7MB).

This is a superset of that test set, with true-casing:

common test set 2 (14MB).

Known Bugs

Some special HTML entities and noisy characters are not removed from the data.

Terms of Use

We are not aware of any copyright restrictions on the material. If you use this data in your research, please contact pkoehn@inf.ed.ac.uk. Please let us know if you find problems with the data or if you want the data for other language pairs. We recommend using the last quarter of 2000 for testing (2000-10 until 2000-12) for consistency in reporting research results on this data.
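
The recommended split can be expressed as a simple date filter. The sketch below assumes that each document is associated with a session date; how that date is obtained from the files is not specified here, and the names used are illustrative only.

```python
from datetime import date

# Recommended test period: the last quarter of 2000.
TEST_START = date(2000, 10, 1)
TEST_END = date(2000, 12, 31)


def is_test(session_date: date) -> bool:
    """Return True if the date falls in the recommended test period."""
    return TEST_START <= session_date <= TEST_END


def split(dates):
    """Partition an iterable of dates into (training, test) lists."""
    train, test = [], []
    for d in dates:
        (test if is_test(d) else train).append(d)
    return train, test
```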
