Extracting n-word phrases from large texts

This is a summary of resources posted on [Corpora-List] in early 2014.

CMU-Cambridge Statistical Language Modeling toolkit

http://mi.eng.cam.ac.uk/~prc14/toolkit.html

Sketch Engine

http://www.sketchengine.co.uk/documentation/wiki/SkE/NGrams

Laurence Anthony’s AntConc

http://www.antlab.sci.waseda.ac.jp/software.html

kfNgram

http://www.kwicfinder.com/kfNgram/kfNgramHelp.html

Colibri

Software for extracting n-grams as well as patterns that are not consecutive (skipgrams). It is written in C++ for speed and memory efficiency, and comes with a Python binding for use from Python scripts. It also includes a standalone command-line tool that can extract n-grams directly.

https://github.com/proycon/colibri-core

http://proycon.github.io/colibri-core/doc/
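
To make the difference between n-grams and skipgrams concrete, here is a minimal sketch in plain Python. It uses only the standard library and is not Colibri's API; the function names ngrams and skipgrams and the parameter k are illustrative assumptions. Skipgrams are taken here in the common "k-skip-n-gram" sense (n tokens in their original order, with at most k tokens skipped in total), which may differ from the gap-pattern representation a given tool uses. For large corpora a dedicated tool from this list will be far more memory-efficient.

```python
from collections import Counter
from itertools import combinations


def ngrams(tokens, n):
    """Yield every run of n consecutive tokens as a tuple."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


def skipgrams(tokens, n, k):
    """Yield n-token tuples that keep the original order and skip
    at most k tokens in total (k=0 gives ordinary n-grams)."""
    for i in range(len(tokens)):
        # Tokens that may follow tokens[i] within the allowed span.
        window = tokens[i + 1:i + n + k]
        # Pick the remaining n-1 tokens from the window, in order.
        for rest in combinations(range(len(window)), n - 1):
            yield (tokens[i],) + tuple(window[j] for j in rest)


tokens = "to be or not to be".split()

bigram_counts = Counter(ngrams(tokens, 2))
skip_bigram_counts = Counter(skipgrams(tokens, 2, 1))

# Most frequent bigram in this toy text.
print(bigram_counts.most_common(1))
# A pair that only appears once one token may be skipped.
print(skip_bigram_counts[("to", "or")])
```

Counting with Counter keeps every distinct pattern in memory, which is fine for small texts but is the main cost that the toolkits listed here are designed to reduce.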

Maarten van Gompel
