Exploring Part of Speech (POS)-tag sequences in a large-scale learner corpus of L2 English: A developmental perspective

Abstract This research explores the POS-tag sequences that shape the transition from upper-intermediate (B2 CEFR) to near-native proficiency (C2 CEFR) in a corpus of essays (n=32,410) from the Cambridge Learner Corpus. Gilquin (2018) and others have shown that POS-tag sequences offer a holistic approach to extracting the most commonly used patterns without a …
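As a rough illustration of what extracting POS-tag sequences can involve, the sketch below counts tag n-grams over sentences that have already been tagged. It is not the method used in the study: the function name `pos_ngram_counts`, the hand-assigned coarse tags, and the toy sentences are all assumptions made for the example, and no corpus data is used.

```python
from collections import Counter


def pos_ngram_counts(tagged_sentences, n=3):
    """Count POS-tag n-grams without crossing sentence boundaries.

    tagged_sentences: iterable of sentences, each a list of (word, tag) pairs.
    Returns a Counter mapping tag tuples of length n to their frequency.
    """
    counts = Counter()
    for sentence in tagged_sentences:
        tags = [tag for _word, tag in sentence]
        # Slide a window of size n over the tag sequence of this sentence.
        for i in range(len(tags) - n + 1):
            counts[tuple(tags[i:i + n])] += 1
    return counts


# Hand-tagged toy sentences (hypothetical, for illustration only).
sample = [
    [("the", "DET"), ("student", "NOUN"), ("writes", "VERB"),
     ("an", "DET"), ("essay", "NOUN")],
    [("a", "DET"), ("teacher", "NOUN"), ("reads", "VERB"),
     ("the", "DET"), ("essay", "NOUN")],
]

counts = pos_ngram_counts(sample, n=3)
for ngram, freq in counts.most_common():
    print(" ".join(ngram), freq)
```

In a developmental comparison, counts like these would typically be computed separately for each proficiency level and normalised by the number of tokens before the levels are compared.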

John Sinclair and language theory

The following is an extract from Hunston (2022, p. 256). Hunston, S. (2022). Corpora in applied linguistics. Cambridge University Press. Sinclair made a number of generalisations in the 1980s (Sinclair 1991, 2004; see also Francis 1993; Hoey 2005; Hunston 2002; Stubbs 2001) which might be summarised as follows: • In describing the meanings of a word, …

Corpus of North American Spoken English (CoNASE)

The Corpus of North American Spoken English (CoNASE), a 1.25-billion-word corpus of geolocated automatic speech-to-text transcripts, is now available in a beta version. See http://cc.oulu.fi/~scoats/CoNASE.html for more information. The corpus was created from 301,847 ASR transcripts from 2,572 YouTube channels, corresponding to 154,041 hours of video. The size of the corpus is 1,252,066,371 word tokens. …

5 recent papers on language complexity and learner language

Bulté, B., & Roothooft, H. (2020). Investigating the interrelationship between rated L2 proficiency and linguistic complexity in L2 speech. System, 102246. Abstract This study investigates the relationship between nine quantitative measures of L2 speech complexity and subjectively rated L2 proficiency by comparing the oral productions of English L2 learners at five IELTS proficiency levels. We carry …