This is all of the raw text, taken right after splitting by language with Google's Compact Language Detector 2. The data is very noisy and contains a lot of boilerplate, but there is also a great deal of it.
The files use a very simple plain-text format:
df6fa1abb58549287111ba8d776733e9 1.000000 url content ...
where df6fa1abb58549287111ba8d776733e9 is a magic number that marks the start of a new block. The second element is a number giving the index of the extracted block, which can be useful if blocks in different languages were found on a single page.
Since no tokenization or sentence splitting has been done at this point, we report the size in bytes:
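A reader for this format can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: it assumes each block header sits on its own line starting with the magic number, and that all following lines up to the next header belong to that block. The function name `iter_blocks` and the sample URLs are hypothetical and are not part of the released data or tools.

```python
import io
from typing import Iterable, Iterator, List, Tuple

# Magic number that marks the start of a new block (from the format description).
MAGIC = "df6fa1abb58549287111ba8d776733e9"


def iter_blocks(lines: Iterable[str]) -> Iterator[Tuple[str, str, List[str]]]:
    """Yield (index, rest_of_header, body_lines) for each block.

    Assumption: a header is a line that starts with MAGIC; everything
    until the next header is treated as the body of that block.
    """
    header = None
    body: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(MAGIC):
            if header is not None:
                yield header[0], header[1], body
            # Split off the second element (the block index); keep the rest as-is.
            parts = line[len(MAGIC):].strip().split(None, 1)
            index = parts[0] if parts else ""
            rest = parts[1] if len(parts) > 1 else ""
            header = (index, rest)
            body = []
        elif header is not None:
            body.append(line)
    if header is not None:
        yield header[0], header[1], body


# Made-up sample input for illustration only.
sample = (
    MAGIC + " 1.000000 http://example.com/a\n"
    "Hello\n"
    "World\n"
    + MAGIC + " 2.000000 http://example.com/b\n"
    "Hallo\n"
)
blocks = list(iter_blocks(io.StringIO(sample)))
```

Because the function accepts any iterable of lines, it can be fed an open file handle directly, which avoids loading a whole file into memory.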
| LANG | Bytes (2012+2013) |
|---|---|
| en | 23,618,329,602,163 |
| un | 4,367,264,610,469 |
| de | 1,019,661,500,844 |
| es | 986,863,932,133 |
| fr | 912,158,947,446 |
| ja | 577,139,792,775 |
| ru | 537,364,268,817 |
| pl | 334,309,399,324 |
| it | 325,579,516,606 |
| pt | 316,872,603,278 |
| zh | 264,912,799,026 |
| nl | 207,899,143,023 |
| cs | 139,682,978,056 |
| tr | 138,255,666,018 |
| sv | 130,421,180,734 |
| ar | 109,914,506,835 |
| ro | 100,825,300,518 |
| fa | 90,374,032,287 |
| id | 90,186,279,228 |
| hu | 86,761,753,366 |
| vi | 77,684,501,671 |
| th | 71,970,958,807 |
| el | 67,123,604,246 |
| da | 63,059,532,993 |
| fi | 47,727,938,111 |
| zh-Hant | 46,967,117,486 |
| no | 44,683,775,875 |
| ko | 42,212,933,823 |
| uk | 32,959,660,768 |
| ms | 31,988,273,379 |
| bg | 29,465,357,109 |
| sk | 29,016,615,633 |
| sr | 27,536,190,203 |
| iw | 25,070,386,589 |
| ca | 24,274,065,896 |
| hr | 24,131,691,489 |
| lt | 22,030,944,312 |
| sl | 14,059,233,548 |
| lv | 11,390,955,330 |
| hi | 10,997,195,948 |
| ta | 10,539,311,295 |
| et | 10,270,827,759 |
| la | 8,408,419,782 |
| war | 8,102,391,410 |
| is | 7,730,642,988 |
| ka | 5,428,243,707 |
| bn | 3,908,650,702 |
| nn | 3,901,018,216 |
| hy | 3,798,173,037 |
| tl | 3,481,010,578 |
| sq | 3,474,406,259 |
| my | 3,297,902,869 |
| eu | 3,285,427,626 |
| gl | 3,107,623,692 |
| ml | 2,863,670,088 |
| tt | 2,240,062,962 |
| te | 2,223,186,973 |
| be | 2,207,971,182 |
| af | 2,162,668,040 |
| ne | 2,055,216,862 |
| mk | 1,918,809,982 |
| mr | 1,637,482,872 |
| cy | 1,509,091,610 |
| az | 1,421,450,552 |
| ur | 1,406,684,538 |
| si | 1,403,876,502 |
| kn | 1,330,260,001 |
| fy | 1,317,046,051 |
| so | 1,315,073,394 |
| mn | 1,140,723,230 |
| vo | 1,116,516,465 |
| gu | 1,110,807,207 |
| eo | 1,096,331,719 |
| sa | 979,279,954 |
| kk | 891,043,735 |
| mt | 761,846,156 |
| km | 723,100,866 |
| sco | 705,685,695 |
| ga | 700,428,417 |
| co | 697,347,610 |
| sw | 650,679,225 |
| mg | 630,391,941 |
| uz | 564,116,186 |
| ku | 534,084,333 |
| pa | 456,340,297 |
| ie | 435,674,188 |
| lb | 416,320,874 |
| am | 295,441,288 |
| haw | 287,017,875 |
| rw | 285,620,476 |
| jw | 280,734,879 |
| ps | 264,819,752 |
| mi | 262,351,363 |
| bo | 257,492,214 |
| ia | 254,949,178 |
| dv | 249,066,656 |
| ceb | 247,746,154 |
| ht | 225,690,983 |
| zzp | 221,637,820 |
| yi | 221,374,976 |
| ug | 219,219,100 |
| gn | 218,753,912 |
| blu | 216,771,780 |
| su | 205,803,323 |
| br | 198,602,623 |
| rm | 196,640,675 |
| ha | 192,939,072 |
| fo | 184,472,723 |
| ky | 177,978,912 |
| ln | 171,589,365 |
| gd | 171,367,162 |
| lo | 155,873,885 |
| oc | 149,596,564 |
| tg | 121,964,958 |
| zu | 117,166,801 |
| wo | 113,587,953 |
| qu | 105,027,938 |
| kl | 87,559,836 |
| syr | 86,670,052 |
| tk | 85,597,368 |
| bh | 82,584,050 |
| kha | 73,385,571 |
| aa | 69,270,273 |
| crs | 66,286,045 |
| rn | 60,295,470 |
| ba | 57,775,821 |
| gv | 57,107,664 |
| sm | 52,769,223 |
| ny | 52,209,325 |
| sn | 49,592,637 |
| to | 48,187,233 |
| xh | 47,329,488 |
| mfe | 46,555,077 |
| yo | 42,375,087 |
| st | 41,788,647 |
| sd | 40,701,508 |
| dz | 38,504,390 |
| ti | 37,127,554 |
| or | 34,877,485 |
| bi | 33,532,208 |
| iu | 32,357,666 |
| om | 29,092,598 |
| fj | 28,270,544 |
| lg | 25,703,715 |
| ts | 25,281,385 |
| ig | 20,390,649 |
| chr | 20,241,181 |
| tn | 17,192,310 |
| ik | 16,941,479 |
| na | 16,516,781 |
| ss | 16,105,170 |
| tlh | 13,364,626 |
| as | 12,367,138 |
| ab | 8,669,857 |
| ay | 8,649,728 |
| ak | 8,261,920 |
| za | 4,545,052 |
| nso | 2,075,133 |
| sg | 2,070,307 |
| ve | 1,787,627 |
| ks | 1,445,926 |
| bs | 1,262,415 |
| lif | 428,511 |
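To put the raw byte counts into perspective, they can be converted to decimal units. The helpers below are a small illustrative sketch; the names `parse_count` and `human_bytes` are hypothetical, and decimal (powers of 1000) units are an assumption made here for readability.

```python
def parse_count(text: str) -> int:
    """Turn a table entry such as '23,618,329,602,163' into an integer."""
    return int(text.replace(",", "").strip())


def human_bytes(n: int) -> str:
    """Format a byte count with decimal units (1 kB = 1000 B)."""
    units = ["B", "kB", "MB", "GB", "TB", "PB"]
    value = float(n)
    for unit in units:
        # Stop at the first unit where the value fits, or at the largest unit.
        if value < 1000 or unit == units[-1]:
            return f"{value:.2f} {unit}"
        value /= 1000
    return f"{value:.2f} {units[-1]}"  # not reached; keeps the return type total


en_size = human_bytes(parse_count("23,618,329,602,163"))   # largest entry (en)
lif_size = human_bytes(parse_count("428,511"))             # smallest entry (lif)
```

With the values from the table, the English portion comes to about 23.62 TB, while the smallest entry (lif) is about 428.51 kB.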
When using parts of this work, please cite:
@inproceedings{Buck-commoncrawl,
author = {Christian Buck and Kenneth Heafield and Bas van Ooyen},
title = {N-gram Counts and Language Models from the Common Crawl},
year = {2014},
month = {May},
booktitle = {Proceedings of the Language Resources and Evaluation Conference},
address = {Reykjav\'{i}k, Iceland}
}