From the corpora list
::::::::::::::::::::::::::::::
We are pleased to announce the release of the first very large ngram databases derived from the giga-token COW14 web corpora. They are completely free (CC-BY) and can be downloaded without registration. We have applied no frequency thresholds whatsoever. In addition to the counted ngram lists, we offer raw (uncounted) versions so that anyone can build their own counted lists. The raw ngrams also carry additional metadata (crawl year, top-level domain, country geolocation).
There are also English dependency bigrams (based on Malt parses) containing words, their heads, and the dependency relation between them.
For end-users, there are also word and lemma frequency lists with some convenient frequency measures, optionally with a frequency threshold of 10 (smaller files, easier handling).
----------------------------------------------------------------------
LICENSE AND REFERENCES
License Creative Commons Attribution 4.0 International
References http://corporafromtheweb.org/category/cow-citation/
Please tell us whenever you publish work based on COW:
https://webcorpora.org/publication/
DOWNLOAD
http://hpsg.fu-berlin.de/cow/ngrams/
http://hpsg.fu-berlin.de/cow/frequencies/
ORIGIN AND ORIGINAL CORPUS SIZES
The ngrams are derived from the COW14AX sentence-shuffled corpora.
Information http://corporafromtheweb.org/category/corpora/
Interface https://webcorpora.org/
English 9,578,828,861 tokens (International)
German 11,660,894,000 tokens (AT, CH, DE)
Spanish 3,680,794,644 tokens (International)
Swedish 4,842,753,707 tokens (FI, SE)
FREQUENCY LISTS
Languages English, German, Spanish, Swedish
Versions Lemma, Lemma + POS, Word, Word + POS
Thresholds no threshold; raw frequency > 9
Measures raw frequency, absolute rank, frequency per million, log-frequency per million, frequency band
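As a rough illustration of how some of the listed measures relate to each other, here is a minimal Python sketch that derives absolute rank, frequency per million, and log-frequency per million from raw counts. Several things in it are assumptions and not taken from the released files: the function names, the use of base-10 logarithms, the tie-breaking rule for ranks, and the use of the summed counts as the corpus size. The frequency band is left out because its definition is not given in this announcement.

```python
import math


def frequency_measures(counts):
    """Derive rank and per-million measures from a {item: raw_frequency} dict.

    Assumptions (not from the release): ranks are sequential with ties
    broken alphabetically, and the log is base 10.
    """
    total = sum(counts.values())  # stand-in for the corpus token count
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    rows = []
    for rank, (item, freq) in enumerate(ranked, start=1):
        per_million = freq * 1_000_000 / total
        rows.append({
            "item": item,
            "raw": freq,
            "rank": rank,
            "per_million": per_million,
            "log_per_million": math.log10(per_million),
        })
    return rows


def apply_threshold(rows, minimum=10):
    """Keep only rows whose raw frequency is > 9, i.e. at least `minimum`."""
    return [row for row in rows if row["raw"] >= minimum]


if __name__ == "__main__":
    toy_counts = {"the": 600, "cat": 300, "sat": 95, "zyx": 5}
    for row in apply_threshold(frequency_measures(toy_counts)):
        print(row)
```

With real data, the denominator would be the token count of the corpus in question, not the sum of a toy dictionary.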
NGRAMS
N 1 .. 5
Languages English, German, Spanish, Swedish
Versions Raw, Word, Word + POS, Lemma (except Swedish)
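For readers who want to build their own counted lists from the raw version, the following Python sketch shows one possible approach. The file layout used here is purely hypothetical (one ngram occurrence per line, tab-separated as ngram, crawl year, top-level domain, country); please check the actual files for their real format. The function name and its parameters are likewise illustrative.

```python
from collections import Counter


def count_ngrams(lines, min_freq=1, tld=None):
    """Count ngram occurrences from raw lines.

    Hypothetical line format (an assumption, not the documented one):
        ngram <TAB> crawl_year <TAB> top_level_domain <TAB> country
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue  # skip lines that do not match the assumed layout
        ngram, _year, line_tld, _country = fields
        if tld is not None and line_tld != tld:
            continue  # optional restriction to one top-level domain
        counts[ngram] += 1
    # Apply the user's own frequency threshold, if any.
    return {ngram: n for ngram, n in counts.items() if n >= min_freq}


if __name__ == "__main__":
    sample = [
        "of the\t2014\tde\tDE\n",
        "of the\t2013\tat\tAT\n",
        "in the\t2014\tde\tDE\n",
    ]
    print(count_ngrams(sample))
    print(count_ngrams(sample, tld="de"))
```

For the full data, the lines would be streamed from the (probably compressed) files and the counts kept on disk or aggregated after sorting, since an in-memory counter is unlikely to scale to corpora of this size.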
DEPENDENCY BIGRAMS
Languages English (German to follow soon, possibly Swedish)
Versions Raw, Word, Word + POS, Lemma, Lemma + POS
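To make the structure of a dependency bigram (word, head, relation) more concrete, here is a small Python sketch that counts such triples and looks up the most frequent dependents of a given head. The tab-separated three-column layout, the relation labels in the example, and all names in the code are assumptions for illustration only and may not match the released files or the Malt label set.

```python
from collections import Counter
from typing import NamedTuple


class DepBigram(NamedTuple):
    word: str      # the dependent
    head: str      # its syntactic head
    relation: str  # label of the dependency between them


def count_dependency_bigrams(lines):
    """Count (word, head, relation) triples from tab-separated lines.

    The three-column layout is hypothetical.
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip lines that do not match the assumed layout
        counts[DepBigram(*fields)] += 1
    return counts


def top_dependents(counts, head, relation, n=3):
    """Return the n most frequent dependents of `head` under `relation`."""
    matching = Counter({
        bigram.word: freq
        for bigram, freq in counts.items()
        if bigram.head == head and bigram.relation == relation
    })
    return matching.most_common(n)


if __name__ == "__main__":
    sample = [
        "dog\tbarks\tSBJ\n",
        "dog\tbarks\tSBJ\n",
        "fox\tbarks\tSBJ\n",
        "loudly\tbarks\tADV\n",
    ]
    print(top_dependents(count_dependency_bigrams(sample), "barks", "SBJ"))
```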