N-gram counts and language models from the CommonCrawl

Raw data

This is all the plain text obtained right after splitting the crawl by language using Google's Compact Language Detector 2. The data is very noisy and contains a lot of boilerplate, but there is also a very large amount of it.

The file format is a very simple plain-text format:

df6fa1abb58549287111ba8d776733e9 1.000000 url
content
...

where df6fa1abb58549287111ba8d776733e9 is a magic number that marks the start of a new block. The second field is a number giving the index of the extracted block, which is useful when blocks in different languages were found on a single page.
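The block format described above can be read with a few lines of Python. This is only a minimal sketch, not the tooling used to produce the data: the names Block and read_blocks and the example.com URLs are made up for illustration, and it assumes that header fields are separated by single spaces and that no content line starts with the magic number.

```python
import io
from typing import Iterator, List, NamedTuple, Optional, TextIO, Tuple

# Magic number that marks the start of every block header.
MAGIC = "df6fa1abb58549287111ba8d776733e9"


class Block(NamedTuple):
    """One extracted block (hypothetical container, not part of the data release)."""
    index: str    # second header field, kept as text, e.g. "1.000000"
    url: str      # third header field
    content: str  # all lines up to the next header, joined with newlines


def read_blocks(stream: TextIO) -> Iterator[Block]:
    """Yield one Block per header line found in the stream."""
    header: Optional[Tuple[str, str]] = None
    lines: List[str] = []
    for raw in stream:
        line = raw.rstrip("\n")
        if line.startswith(MAGIC + " "):
            # A new header closes the previous block, if there was one.
            if header is not None:
                yield Block(header[0], header[1], "\n".join(lines))
            # Split into at most three fields so the URL stays in one piece.
            parts = line.split(" ", 2)
            header = (parts[1], parts[2] if len(parts) > 2 else "")
            lines = []
        elif header is not None:
            lines.append(line)
        # Lines before the first header are ignored (assumption).
    if header is not None:
        yield Block(header[0], header[1], "\n".join(lines))


# Small in-memory example with invented URLs and text.
sample = (
    MAGIC + " 1.000000 http://example.com/a\n"
    "first line\n"
    "second line\n"
    + MAGIC + " 2.000000 http://example.com/b\n"
    "other text\n"
)
blocks = list(read_blocks(io.StringIO(sample)))
for block in blocks:
    print(block.index, block.url, len(block.content))
```

Because the files are very large, the function is written as a generator that holds only one block in memory at a time. For real data, the stream would be an open file handle instead of the in-memory sample.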

Since no tokenization or sentence splitting has been done at this point, we report the size in bytes:

LANG	Bytes (2012+2013)
en	23,618,329,602,163
un	4,367,264,610,469
de	1,019,661,500,844
es	986,863,932,133
fr	912,158,947,446
ja	577,139,792,775
ru	537,364,268,817
pl	334,309,399,324
it	325,579,516,606
pt	316,872,603,278
zh	264,912,799,026
nl	207,899,143,023
cs	139,682,978,056
tr	138,255,666,018
sv	130,421,180,734
ar	109,914,506,835
ro	100,825,300,518
fa	90,374,032,287
id	90,186,279,228
hu	86,761,753,366
vi	77,684,501,671
th	71,970,958,807
el	67,123,604,246
da	63,059,532,993
fi	47,727,938,111
zh-Hant	46,967,117,486
no	44,683,775,875
ko	42,212,933,823
uk	32,959,660,768
ms	31,988,273,379
bg	29,465,357,109
sk	29,016,615,633
sr	27,536,190,203
iw	25,070,386,589
ca	24,274,065,896
hr	24,131,691,489
lt	22,030,944,312
sl	14,059,233,548
lv	11,390,955,330
hi	10,997,195,948
ta	10,539,311,295
et	10,270,827,759
la	8,408,419,782
war	8,102,391,410
is	7,730,642,988
ka	5,428,243,707
bn	3,908,650,702
nn	3,901,018,216
hy	3,798,173,037
tl	3,481,010,578
sq	3,474,406,259
my	3,297,902,869
eu	3,285,427,626
gl	3,107,623,692
ml	2,863,670,088
tt	2,240,062,962
te	2,223,186,973
be	2,207,971,182
af	2,162,668,040
ne	2,055,216,862
mk	1,918,809,982
mr	1,637,482,872
cy	1,509,091,610
az	1,421,450,552
ur	1,406,684,538
si	1,403,876,502
kn	1,330,260,001
fy	1,317,046,051
so	1,315,073,394
mn	1,140,723,230
vo	1,116,516,465
gu	1,110,807,207
eo	1,096,331,719
sa	979,279,954
kk	891,043,735
mt	761,846,156
km	723,100,866
sco	705,685,695
ga	700,428,417
co	697,347,610
sw	650,679,225
mg	630,391,941
uz	564,116,186
ku	534,084,333
pa	456,340,297
ie	435,674,188
lb	416,320,874
am	295,441,288
haw	287,017,875
rw	285,620,476
jw	280,734,879
ps	264,819,752
mi	262,351,363
bo	257,492,214
ia	254,949,178
dv	249,066,656
ceb	247,746,154
ht	225,690,983
zzp	221,637,820
yi	221,374,976
ug	219,219,100
gn	218,753,912
blu	216,771,780
su	205,803,323
br	198,602,623
rm	196,640,675
ha	192,939,072
fo	184,472,723
ky	177,978,912
ln	171,589,365
gd	171,367,162
lo	155,873,885
oc	149,596,564
tg	121,964,958
zu	117,166,801
wo	113,587,953
qu	105,027,938
kl	87,559,836
syr	86,670,052
tk	85,597,368
bh	82,584,050
kha	73,385,571
aa	69,270,273
crs	66,286,045
rn	60,295,470
ba	57,775,821
gv	57,107,664
sm	52,769,223
ny	52,209,325
sn	49,592,637
to	48,187,233
xh	47,329,488
mfe	46,555,077
yo	42,375,087
st	41,788,647
sd	40,701,508
dz	38,504,390
ti	37,127,554
or	34,877,485
bi	33,532,208
iu	32,357,666
om	29,092,598
fj	28,270,544
lg	25,703,715
ts	25,281,385
ig	20,390,649
chr	20,241,181
tn	17,192,310
ik	16,941,479
na	16,516,781
ss	16,105,170
tlh	13,364,626
as	12,367,138
ab	8,669,857
ay	8,649,728
ak	8,261,920
za	4,545,052
nso	2,075,133
sg	2,070,307
ve	1,787,627
ks	1,445,926
bs	1,262,415
lif	428,511

Paper

When using parts of this work, please cite:

@inproceedings{Buck-commoncrawl,
 author = {Christian Buck and Kenneth Heafield and Bas van Ooyen},
 title = {N-gram Counts and Language Models from the Common Crawl},
 year = {2014},
 month = {May},
 booktitle = {Proceedings of the Language Resources and Evaluation Conference},
 address = {Reykjav\'{i}k, Iceland}
}