ParaCrawl

Large-Scale Parallel Web Crawl

This corpus was created by crawling a large number of sites across the web.

This is ongoing work. The corpus published here (official release 2016) is fairly noisy and covers only a few language pairs. For more recent versions of the corpus, please see the website of the ParaCrawl project.

A small part of this corpus was used for the WMT 2016 Bilingual Document Alignment Shared Task. The overview paper describes some of the processing:
Findings of the WMT 2016 Bilingual Document Alignment Shared Task, Christian Buck and Philipp Koehn, Proceedings of the First Conference on Machine Translation (WMT), 2016.

First Official Release 2016

Language Pair      Release        Size    Segments         English Tokens
French-English     Raw            21GB    2.773 billion    11.843 billion
                   Deduplicated   7.9GB   243 million      2.870 billion
                   Clean          1.2GB   34 million       443 million
German-English     Raw            24GB    3.161 billion    14.006 billion
                   Deduplicated   7.7GB   264 million      2.731 billion
                   Clean          1.2GB   36 million       425 million
Italian-English    Raw            8.4GB   1.190 billion    5.283 billion
                   Deduplicated   3.1GB   91 million       1.088 billion
                   Clean          539MB   14 million       188 million
Russian-English    Raw            3.9GB   418 million      1.7 billion
                   Deduplicated   1.5GB   42 million       460 million
                   Clean          149MB   2.8 million      40 million
Spanish-English    Raw            8.6GB   118 million      5.157 billion
                   Deduplicated   3.7GB   106 million      1.300 billion
                   Clean          1.2GB   18 million       232 million

"Raw" and "Dedup" release of data includes URLs and a quality score (based on sentence alignment). Clean data is subset of Dedup with positive quality score.

Token counts were computed with wc on the raw, untokenized text.
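For reference, a rough Python equivalent of that count (whitespace-separated tokens, presumably as reported by wc -w; the filename below is a placeholder):

    # Count whitespace-separated tokens on the untokenized English side,
    # mirroring what `wc -w` reports.
    def count_tokens(path):
        total = 0
        with open(path, encoding="utf-8", errors="replace") as f:
            for line in f:
                total += len(line.split())
        return total

    print(count_tokens("corpus.en"))  # placeholder filename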

Preliminary Release

Language Pair      File Size (xz)   Segment Pairs   Tokens (English)
French-English     367M             11,808,682      137,821,373
German-English     374M             12,169,115      136,793,414
Italian-English    158M             4,312,241       55,915,710
Russian-English    34M              772,571         12,344,705
Spanish-English    188M             5,508,752       68,419,109

Token counts were computed with wc on the raw, untokenized text.

Contact

Philipp Koehn
University of Edinburgh / Johns Hopkins University
(phi@jhu.edu)

Acknowledgment

This corpus was created with partial support from a Google Faculty Research Award.