Raw data

This is all the plain text obtained right after splitting the crawl by language using Google's Compact Language Detector 2. The data is very noisy and contains a lot of boilerplate, but there is also a great deal of it.

The file format is a very simple plain-text format:

df6fa1abb58549287111ba8d776733e9 1.000000 url
content
...

where df6fa1abb58549287111ba8d776733e9 is a magic number that marks the start of a new block. The second element is a number giving the index of the extracted block, which can be useful if blocks in different languages were found on a single page. The third element is the URL of the page, and the lines that follow, up to the next magic number, are the block's content.
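As a rough illustration of the format described above, the following Python sketch splits such a file into blocks. It is not part of the original tooling. The names `Block` and `read_blocks` are hypothetical, and the sketch assumes that the header fields are separated by single spaces, that the URL is the last field on the header line, and that no content line begins with the magic number followed by a space.

```python
import io
from typing import Iterator, List, NamedTuple, Optional, TextIO, Tuple

# Magic number that marks the start of a new block (taken from the format above).
MAGIC = "df6fa1abb58549287111ba8d776733e9"


class Block(NamedTuple):
    """One extracted block (hypothetical container, for illustration only)."""
    index: str    # second header field, kept as text exactly as it appears
    url: str      # third header field
    content: str  # all lines up to the next header, joined with newlines


def read_blocks(stream: TextIO) -> Iterator[Block]:
    """Yield blocks from a text stream in the header/content format."""
    header: Optional[Tuple[str, str]] = None
    lines: List[str] = []
    for raw in stream:
        line = raw.rstrip("\n")
        if line.startswith(MAGIC + " "):
            # A new header closes the previous block, if there was one.
            if header is not None:
                yield Block(header[0], header[1], "\n".join(lines))
            # Assumption: exactly three space-separated header fields.
            parts = line.split(" ", 2)
            if len(parts) != 3:
                raise ValueError("malformed header line: %r" % line)
            header = (parts[1], parts[2])
            lines = []
        elif header is not None:
            lines.append(line)
        # Lines before the first header are ignored.
    if header is not None:
        yield Block(header[0], header[1], "\n".join(lines))
```

For a real file one might call `read_blocks(open(path, encoding="utf-8", errors="replace"))`, where the UTF-8 encoding is itself an assumption about the data; for a quick check an in-memory `io.StringIO` works just as well.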

Since no tokenization or sentence splitting has been done at this point, we report the size in bytes:

LANG Bytes (2012+2013)
en 23,618,329,602,163
un 4,367,264,610,469
de 1,019,661,500,844
es 986,863,932,133
fr 912,158,947,446
ja 577,139,792,775
ru 537,364,268,817
pl 334,309,399,324
it 325,579,516,606
pt 316,872,603,278
zh 264,912,799,026
nl 207,899,143,023
cs 139,682,978,056
tr 138,255,666,018
sv 130,421,180,734
ar 109,914,506,835
ro 100,825,300,518
fa 90,374,032,287
id 90,186,279,228
hu 86,761,753,366
vi 77,684,501,671
th 71,970,958,807
el 67,123,604,246
da 63,059,532,993
fi 47,727,938,111
zh-Hant 46,967,117,486
no 44,683,775,875
ko 42,212,933,823
uk 32,959,660,768
ms 31,988,273,379
bg 29,465,357,109
sk 29,016,615,633
sr 27,536,190,203
iw 25,070,386,589
ca 24,274,065,896
hr 24,131,691,489
lt 22,030,944,312
sl 14,059,233,548
lv 11,390,955,330
hi 10,997,195,948
ta 10,539,311,295
et 10,270,827,759
la 8,408,419,782
war 8,102,391,410
is 7,730,642,988
ka 5,428,243,707
bn 3,908,650,702
nn 3,901,018,216
hy 3,798,173,037
tl 3,481,010,578
sq 3,474,406,259
my 3,297,902,869
eu 3,285,427,626
gl 3,107,623,692
ml 2,863,670,088
tt 2,240,062,962
te 2,223,186,973
be 2,207,971,182
af 2,162,668,040
ne 2,055,216,862
mk 1,918,809,982
mr 1,637,482,872
cy 1,509,091,610
az 1,421,450,552
ur 1,406,684,538
si 1,403,876,502
kn 1,330,260,001
fy 1,317,046,051
so 1,315,073,394
mn 1,140,723,230
vo 1,116,516,465
gu 1,110,807,207
eo 1,096,331,719
sa 979,279,954
kk 891,043,735
mt 761,846,156
km 723,100,866
sco 705,685,695
ga 700,428,417
co 697,347,610
sw 650,679,225
mg 630,391,941
uz 564,116,186
ku 534,084,333
pa 456,340,297
ie 435,674,188
lb 416,320,874
am 295,441,288
haw 287,017,875
rw 285,620,476
jw 280,734,879
ps 264,819,752
mi 262,351,363
bo 257,492,214
ia 254,949,178
dv 249,066,656
ceb 247,746,154
ht 225,690,983
zzp 221,637,820
yi 221,374,976
ug 219,219,100
gn 218,753,912
blu 216,771,780
su 205,803,323
br 198,602,623
rm 196,640,675
ha 192,939,072
fo 184,472,723
ky 177,978,912
ln 171,589,365
gd 171,367,162
lo 155,873,885
oc 149,596,564
tg 121,964,958
zu 117,166,801
wo 113,587,953
qu 105,027,938
kl 87,559,836
syr 86,670,052
tk 85,597,368
bh 82,584,050
kha 73,385,571
aa 69,270,273
crs 66,286,045
rn 60,295,470
ba 57,775,821
gv 57,107,664
sm 52,769,223
ny 52,209,325
sn 49,592,637
to 48,187,233
xh 47,329,488
mfe 46,555,077
yo 42,375,087
st 41,788,647
sd 40,701,508
dz 38,504,390
ti 37,127,554
or 34,877,485
bi 33,532,208
iu 32,357,666
om 29,092,598
fj 28,270,544
lg 25,703,715
ts 25,281,385
ig 20,390,649
chr 20,241,181
tn 17,192,310
ik 16,941,479
na 16,516,781
ss 16,105,170
tlh 13,364,626
as 12,367,138
ab 8,669,857
ay 8,649,728
ak 8,261,920
za 4,545,052
nso 2,075,133
sg 2,070,307
ve 1,787,627
ks 1,445,926
bs 1,262,415
lif 428,511

Paper

When using parts of this work, please cite:

@inproceedings{Buck-commoncrawl,
 author = {Christian Buck and Kenneth Heafield and Bas van Ooyen},
 title = {N-gram Counts and Language Models from the Common Crawl},
 year = {2014},
 month = {May},
 booktitle = {Proceedings of the Language Resources and Evaluation Conference},
 address = {Reykjav\'{i}k, Iceland}
}