Parallel Corpora Available On-Line
This page is your 'shopping list' for parallel texts. Let us know if we're missing something.
- We make no claims about copyright, so make sure you don't violate any usage restrictions.
- We make no claims about the alignment quality of the collections. Some sources might need more work from you, some might need less.
And remember, we're interested in any tools you create to extract clean data from not-so-clean collections.
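As a starting point for that kind of cleanup tool, here is a minimal sketch of a sentence-pair filter. It assumes the corpus is already sentence aligned, one sentence per line, in two parallel files. The function names `clean_pairs` and `clean_files` and the thresholds `max_tokens` and `max_ratio` are our own illustrative choices, not part of any corpus or toolkit listed here, and whitespace splitting is only a rough stand-in for real tokenization.

```python
def clean_pairs(pairs, max_tokens=80, max_ratio=3.0):
    """Yield (source, target) pairs that pass simple sanity checks.

    max_tokens and max_ratio are illustrative defaults (assumptions),
    not recommended values for any particular corpus.
    """
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        n_src, n_tgt = len(src.split()), len(tgt.split())
        # Drop pairs where either side is empty.
        if n_src == 0 or n_tgt == 0:
            continue
        # Drop pairs where either side is too long.
        if n_src > max_tokens or n_tgt > max_tokens:
            continue
        # Drop pairs whose lengths differ too much; such pairs are
        # often the result of a misalignment.
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue
        yield src, tgt


def clean_files(src_path, tgt_path, out_src_path, out_tgt_path, **kwargs):
    """Filter two line-aligned files and write the surviving pairs."""
    with open(src_path, encoding="utf-8") as src_in, \
         open(tgt_path, encoding="utf-8") as tgt_in, \
         open(out_src_path, "w", encoding="utf-8") as src_out, \
         open(out_tgt_path, "w", encoding="utf-8") as tgt_out:
        # zip stops at the shorter file; files of unequal length
        # usually mean the alignment is already broken.
        for src, tgt in clean_pairs(zip(src_in, tgt_in), **kwargs):
            src_out.write(src + "\n")
            tgt_out.write(tgt + "\n")
```

A filter like this only removes obviously bad pairs. It does not repair alignment, so collections that are not sentence aligned still need a proper aligner first.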
- Europarl, release 7 data available (most European languages)
- News Commentary corpus, part of the WMT 2013 shared task training data
- OPUS (various languages, various sub-corpora)
- Subtitles (various languages, various sites, e.g. OpenSubtitles, TED)
- JRC-Acquis: multilingual legal text in 22 European languages
- EU Official Journal: multilingual legal text in 22 European languages
- The United Nations Parallel Corpus v1.0: an official parallel corpus released by the United Nations. Contains sentence-aligned data for all pairs of the 6 official languages and a fully aligned subcorpus across all 6 languages.
- Multi-UN: a multilingual corpus from United Nations documents in 7 languages (an older, unofficial release that predates the corpus above).
- Microtopia: a multilingual corpus extracted from Twitter and Sina Weibo in 11 languages.
- Asian Scientific Paper Excerpt Corpus: Japanese-English and Japanese-Chinese scientific paper abstracts (3 million sentence pairs JE, 600,000 sentence pairs JC)
Besides the collections mentioned above, the LDC (Linguistic Data Consortium) has heaps of data available.