phonetic annotation — e.g. adding information about how a word in a spoken corpus was pronounced.
prosodic annotation — again in a spoken corpus — adding information about prosodic features such as stress, intonation and pauses.
syntactic annotation — e.g. adding information about how a given sentence is parsed, in terms of syntactic analysis into such units as phrases and clauses.
semantic annotation — e.g. adding information about the semantic category of words — the noun cricket as a term for a sport and as a term for an insect belong to different semantic categories, although there is no difference in spelling or pronunciation.
pragmatic annotation — e.g. adding information about the kinds of speech act (or dialogue act) that occur in a spoken dialogue — thus the utterance okay on different occasions may be an acknowledgement, a request for feedback, an acceptance, or a pragmatic marker initiating a new phase of discussion.
discourse annotation — e.g. adding information about anaphoric links in a text, for example connecting the pronoun them and its antecedent the horses in: I’ll saddle the horses and bring them round. [an example from the Brown corpus]
stylistic annotation — e.g. adding information about speech and thought presentation (direct speech, indirect speech, free indirect thought, etc.).
lexical annotation — adding the identity of the lemma of each word form in a text, i.e. the base form of the word, such as would occur as its headword in a dictionary (e.g. lying has the lemma LIE).
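To make the lexical level concrete, here is a minimal Python sketch of lemma annotation using NLTK's WordNetLemmatizer; it assumes the nltk package and its wordnet data are installed, and the sample sentence and the hard-coded verb tag are simplifications for illustration only:

    from nltk.stem import WordNetLemmatizer  # run nltk.download("wordnet") once first

    lemmatizer = WordNetLemmatizer()

    # Annotate each word form with its lemma; pos="v" treats every token
    # as a verb, a deliberate simplification for this example.
    for token in "he was lying about the horses".split():
        print(token, "->", lemmatizer.lemmatize(token, pos="v"))
    # e.g. 'lying' -> 'lie', matching the LIE example above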
The Multidimensional Analysis Tagger is a program for Windows that replicates Biber’s (1988) Variation across Speech and Writing tagger for the multidimensional functional analysis of English texts, generally applied in studies of text-type or genre variation. The program can generate a grammatically annotated version of the selected corpus as well as the statistics needed to perform a text-type or genre analysis. The program plots the input text or corpus on Biber’s (1988) Dimensions and determines its closest text type, as proposed in Biber’s (1989) A Typology of English Texts. Finally, the program offers a tool for visualising the Dimension features of an input text.
i Case-insensitive
m Multiline: allow the grep engine to match ^ and $ after and before \r or \n (i.e. at internal line boundaries).
s Magic Dot: allows . to match \r and \n
x Free-spacing: ignore unescaped white space; allow inline comments in grep patterns.
(?imsx) On
(?-imsx) Off
(?i-msx) Mixed (i on; m, s, x off)
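Python's re module accepts the same inline modifiers, so here is a minimal sketch of each one in action (the sample strings are invented for illustration):

    import re

    # (?i) at the start of the pattern: case-insensitive matching
    re.findall(r"(?i)corpus", "Corpus, CORPUS, corpus")  # all three forms

    # (?m): ^ and $ also match at internal line boundaries
    re.findall(r"(?m)^\w+", "first line\nsecond line")   # ['first', 'second']

    # (?s): . matches newlines too
    re.search(r"(?s)first.*second", "first\nsecond")     # matches

    # (?x) free-spacing: whitespace and # comments are ignored in the pattern
    date = re.compile(r"""(?x)
        \d{4}   # year
        -
        \d{2}   # month
    """)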
———————————————————————————————————————————————————————————————————
Regex Meta-Characters:
———————————————————————————————————————————————————————————————————
. Any character except newline or carriage return
[ ] Any single character of set
[^ ] Any single character NOT of set
* 0 or more previous regular expression
*? 0 or more previous regular expression (non-greedy)
+ 1 or more previous regular expression
+? 1 or more previous regular expression (non-greedy)
? 0 or 1 previous regular expression
| Alternation
( ) Grouping regular expressions
^ Beginning of a line or string
$ End of a line or string
{m,n} At least m but at most n of previous regular expression
{m,n}? At least m but at most n of previous regular expression (non-greedy)
\1-9 Nth previous captured group
\& Whole match # BBEdit: ‘&’ only – no escape needed
\` Pre-match # PCRE? NOT BBEdit
\' Post-match # PCRE? NOT BBEdit
\+ Highest group matched # PCRE? NOT BBEdit
\A Beginning of a string
\b Backspace (0x08) (inside [ ] only) # PCRE?
\b Word boundary (outside [ ] only)
\B Non-word boundary
\d Digit, same as [0-9]
\D Non-digit
\G Assert position at end of previous match or start of string for first match
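A few of these metacharacters in action, as a short Python sketch (the sample text is invented for illustration):

    import re

    text = "aaa <b>bold</b> 2019-09-29"

    print(re.search(r"<.+>", text).group())    # greedy: '<b>bold</b>'
    print(re.search(r"<.+?>", text).group())   # non-greedy: '<b>'
    print(re.findall(r"\b\d{4}\b", text))      # \b and {m,n}: ['2019']
    print(re.search(r"(\w)\1", text).group())  # backreference \1: 'aa'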
—————————————————————————————————————————————————————————————————
Case-Change Operators
—————————————————————————————————————————————————————————————————
\E End of case conversion – acts as an end delimiter to terminate runs of \L & \U.
\l Change case of only the first character to the right to lowercase. (Note: lowercase ‘L’)
\L Change case of all text to the right to lowercase.
\u Change case of only the first character to the right to uppercase.
\U Change case of all text to the right to uppercase.
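These operators belong in replacement patterns (BBEdit, Perl). Python's re does not support \U/\L in replacement strings, but a callable replacement emulates the effect; a minimal sketch with an invented sample string:

    import re

    text = "see chapter one and chapter two"

    # Emulate \U...\E: uppercase the whole match via a callable replacement
    print(re.sub(r"chapter \w+", lambda m: m.group().upper(), text))
    # -> see CHAPTER ONE and CHAPTER TWO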
——————————————————————————————————————————————————————————————
White-Space or Non-White-Space
——————————————————————————————————————————————————————————————
\t Tab
\n Linefeed
\r Return
\R Return or Linefeed or Windows CRLF (matches any Unicode newline sequence).
\f Formfeed
\s Whitespace character equivalent to [ \t\n\r\f]
\S Non-whitespace character
——————————————————————————————————————————————————————
\W Non-word character
\w Word character, same as [0-9A-Za-z_]
\z End of a string
\Z End of a string, or before newline at the end
(?#) Comment
(?:) Grouping without backreferences
(?=) Zero-width positive look-ahead assertion
(?!) Zero-width negative look-ahead assertion
(?>) Nested anchored sub-regexp; stops backtracking once it has matched
(?imx-imx) Turns on/off imx options for rest of regexp
(?imx-imx:…) Turns on/off imx options, localized in group # ‘…’ indicates added regex pattern
———————————————————————————————————————————————————————————————
PERL-STYLE PATTERN EXTENSIONS : BBEdit Documentation : ‘…’ indicates added regex pattern
————————————————————————————————————————————————————————————————
Extension Meaning
————————————————————————————————————————————————————————————————
(?:…) Cluster-only parentheses, no capturing
(?#…) Comment, discard all text between the parentheses
(?imsx-imsx) Enable/disable pattern modifiers
(?imsx-imsx:…) Cluster-only parens with modifiers
(?=…) Positive lookahead assertion
(?!…) Negative lookahead assertion
(?<=…) Positive lookbehind assertion
(?<!…) Negative lookbehind assertion
(?>…) Match non-backtracking subpattern (“once-only”)
(?R) Recursive pattern
—————————————————————————————————————————————————————————————————
POSITIONAL ASSERTIONS (duplication of above)
—————————————————————————————————————————————————————————————————
POSITIVE LOOKBEHIND ASSERTION: (?<='pattern') # Lookbehind assertions must be of fixed length
NEGATIVE LOOKBEHIND ASSERTION: (?<!'pattern')
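A quick Python illustration of lookahead and lookbehind (the sample string is invented; note that Python's re likewise requires lookbehinds to be fixed-length):

    import re

    s = "price: $15, discount: 20%, total: $12"

    print(re.findall(r"(?<=\$)\d+", s))    # positive lookbehind: ['15', '12']
    print(re.findall(r"\d+(?=%)", s))      # positive lookahead:  ['20']
    print(re.findall(r"\b\d+\b(?!%)", s))  # negative lookahead:  ['15', '12']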
————————————————————————————————————————————————————————————————
SPECIAL CHARACTER CLASSES (POSIX standard except where ‘Perl Extension’ is indicated):
———————————————————————————————————————————————————————————————
CLASS MEANING
———————————————————————————————————————————————————————————————
[[:alnum:]] Alpha-numeric characters
[[:alpha:]] Alphabetic characters
[[:ascii:]] Character codes 0-127 # Perl Extension
[[:blank:]] Horizontal whitespace
[[:cntrl:]] Control characters
[[:digit:]] Decimal digits (same as \d)
[[:graph:]] Printing characters, excluding spaces
[[:lower:]] Lower case letters
[[:print:]] Printing characters, including spaces
[[:punct:]] Punctuation characters
[[:space:]] White space (same as \s)
[[:upper:]] Upper case letters
[[:word:]] Word characters (same as \w) # Perl Extension
[[:xdigit:]] Hexadecimal digits
Usage example of multiple character classes:
[[:alpha:][:digit:]]
«Negated» character class example:
[[:^digit:]]+
** POSIX-style character class names are case-sensitive
** The outermost brackets above delimit the character class (set); the POSIX class name itself looks like this: [:alnum:]
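Note that Python's standard re module does not support POSIX class names; a hedged sketch of common stdlib stand-ins (ASCII approximations, sample string invented):

    import re

    # Common stdlib-re stand-ins for POSIX classes:
    # [[:digit:]] -> \d    [[:space:]] -> \s    [[:word:]] -> \w
    # [[:alpha:]] -> [A-Za-z]    [[:alnum:]] -> [0-9A-Za-z]
    print(re.findall(r"[0-9A-Za-z]+", "état-2019!"))  # ['tat', '2019'] (ASCII only)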
—————————————————————————————————————————————————————————————
CONDITIONAL SUBPATTERNS
—————————————————————————————————————————————————————————————
Conditional subpatterns allow you to apply “if-then” or “if-then-else” logic to pattern matching.
The “if” portion can be either an integer between 1 and 99 (a group number) or an assertion.
The forms of syntax for an ordinary conditional subpattern are:
(?(condition)yes-pattern)
(?(condition)yes-pattern|no-pattern)
If the condition evaluates as true, the “yes-pattern” portion attempts to match. Otherwise, the
“no-pattern” portion does (if there is a “no-pattern”).
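Python's re supports the same construct; a minimal sketch in which the closing parenthesis is required only if the opening one (group 1) actually matched:

    import re

    pat = re.compile(r"(\()?\bword\b(?(1)\))")

    print(bool(pat.fullmatch("(word)")))  # True: '(' matched, so ')' is required and present
    print(bool(pat.fullmatch("word")))    # True: no '(', so no ')' required
    print(bool(pat.fullmatch("(word")))   # False: '(' matched but ')' is missing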
A) All course images and functionality have been updated for the ‘new’ Sketch Engine interface.
B) New functions specific to the ‘new’ Sketch Engine interface are now included in the course (e.g. Good Dictionary EXamples (GDEX)).
C) The course is now completely self-contained – no need for external assessments. Certificates of completion are generated automatically once the online activities are finished.
D) Improved reflective component and opportunities for peer discussion.
The course is primarily pitched at L2 graduate writing students, but enrolment is open to anyone – students, lecturers, or others with an interest in language and technology.
To enrol, follow the instructions at the link provided. Please contact the course creator Dr. Peter Crosthwaite at p.cros@uq.edu.au with any questions or technical problems.
Following three successful conventions, the 4th Learner Corpus Studies in Asia and the World (LCSAW4) will be held on Sunday, 29 September 2019, at Kobe University Centennial Hall in Japan. URL
LCSAW4 is organized in cooperation with the ESRC-AHRC project led by Dr. Tony McEnery at Lancaster University, UK.
Invited Speakers
Tony McEnery
Patrick Rebuschat
Padraic Monaghan
Kazuya Saito
John Williams
Aaron Batty
Pascual Pérez-Paredes
Yukio Tono
Shin Ishikawa
Mariko Abe
Yasutake Ishii
Emi Izumi
Masatoshi Sugiura
LCSAW4 Poster Session CFP
Date: Sunday, September 29, 2019
Venue: Kobe University Centennial Hall
Presentation Type: Poster
Language: English
Topic: Studies related to L2 learner corpus
Publication: Online proceedings with an ISSN will be published.
Submission: Please send your abstract and short bio by 20 May 2019 via http://bit.ly/lcsaw4. If you cannot access the site, please contact the organizer (iskwshin@gmail.com).
In a way, corpus linguistics could be seen as a type of content analysis that places great emphasis on the fact that language variation is highly systematic.
We’ll look at ways in which frequency and word combination can reveal different patterns of use and meaning at the lexical, syntactic and semantic levels. We will examine how corpus linguistic methods can be used to look at a corpus of texts (from the same or different individuals) as well as single texts, and how these compare with what is frequent in similar or identical registers or communicative situations. This way, we can find out not only what is frequent but also what is truly distinctive or central in a given text or group of texts.
Students are encouraged to download and install AntConc on their laptops:
There are different well-established CL methods for researching language usage through the examination of naturally occurring data. These methods stress the importance of frequency and repetition across texts and corpora in establishing salience. They can be grouped into four categories:
–Analysis of keywords. These are words that are unusually frequent in corpus A when compared with corpus B. This is a quantitative method that examines the probability of finding (or not finding) a set of words in a given corpus as against a reference corpus (see the first sketch after this list). This method is said to reduce both researchers’ bias in content analysis and cherry-picking in grounded theory.
–Analysis of collocations. Collocations are words found within a given span (+/- n words to the left and right) of a node word. This analysis is based on statistical tests that examine the probability of finding a word within a specific lexical context in a given corpus. There are different collocation strength measures and a variety of approaches to collocation analysis (Gries, 2013). A collocational profile of a word, or of a string of words, provides a deeper understanding of the word’s meaning and its contexts of use.
–Colligation analysis. This involves the analysis of the syntagmatic patterns in which words, and strings of words, tend to co-occur with other words (Hoey, 2005). Patterning stresses the relationship between a lexical item and a grammatical context, a syntactic function (e.g. postmodifiers in noun phrases) and its position in the phrase or in the clause. Potentially, every word presents a distinctive local colligation profile. Word Sketches have become a widely used way to examine patterns in corpora.
–N-grams. N-gram analysis relies on a bottom-up computational approach in which strings of words (although other items such as part-of-speech tags are perfectly possible) are grouped into clusters of 2, 3, 4, 5 or 6 words and their frequency is examined (see the second sketch after this list). Previous research on n-grams shows that different domains (topics, themes) and registers (genres) show different preferences in terms of the n-grams most frequently used by expert users.
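First sketch: a minimal Python illustration of keyword analysis using the log-likelihood (G2) statistic commonly used for keyness; the two toy corpora are invented purely for illustration:

    import math
    from collections import Counter

    def log_likelihood(freq_a, total_a, freq_b, total_b):
        """Log-likelihood (G2) keyness score for one word, comparing a
        study corpus A against a reference corpus B."""
        expected_a = total_a * (freq_a + freq_b) / (total_a + total_b)
        expected_b = total_b * (freq_a + freq_b) / (total_a + total_b)
        g2 = 0.0
        if freq_a:
            g2 += 2 * freq_a * math.log(freq_a / expected_a)
        if freq_b:
            g2 += 2 * freq_b * math.log(freq_b / expected_b)
        return g2

    corpus_a = "strong economy strong borders strong leadership".split()
    corpus_b = "fair economy fair taxes and public services".split()
    freq_a, freq_b = Counter(corpus_a), Counter(corpus_b)
    for word in freq_a:
        score = log_likelihood(freq_a[word], len(corpus_a), freq_b[word], len(corpus_b))
        print(word, round(score, 2))  # higher score = stronger frequency difference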
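Second sketch: n-gram extraction is just a sliding window over the token stream; a minimal Python version (the sample sentence is invented):

    from collections import Counter

    def ngrams(tokens, n):
        """Return all contiguous n-word sequences as tuples."""
        return zip(*(tokens[i:] for i in range(n)))

    tokens = "we will make sure that we will make progress".split()
    print(Counter(ngrams(tokens, 3)).most_common(2))
    # [(('we', 'will', 'make'), 2), ...]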
Quote 1: what is a corpus?
The word corpus is Latin for body (plural corpora). In linguistics a corpus is a collection of texts (a ‘body’ of language) stored in an electronic database. Corpora are usually large bodies of machine-readable text containing thousands or millions of words. A corpus is different from an archive in that often (but not always) the texts have been selected so that they can be said to be representative of a particular language variety or genre, therefore acting as a standard reference. Corpora are often annotated with additional information such as part-of-speech tags or to denote prosodic features associated with speech. Individual texts within a corpus usually receive some form of meta-encoding in a header, giving information about their genre, the author, date and place of publication etc. Types of corpora include specialised, reference, multilingual, parallel, learner, diachronic and monitor. Corpora can be used for both quantitative and qualitative analyses. Although a corpus does not contain new information about language, by using software packages which process data we can obtain a new perspective on the familiar (Hunston 2002: 2–3).
Baker et al. (2006). A glossary of corpus linguistics. Edinburgh: EUP.
Quote 2: introspection
Armchair linguistics does not have a good name in some linguistics circles. A caricature of the armchair linguist is something like this. He sits in a deep soft comfortable armchair, with his eyes closed and his hands clasped behind his head. Once in a while he opens his eyes, sits up abruptly shouting, “Wow, what a neat fact!”, grabs his pencil, and writes something down. Then he paces around for a few hours in the excitement of having come still closer to knowing what language is really like. (There isn’t anybody exactly like this, but there are some approximations.)
Charles Fillmore. Directions in Corpus Linguistics (Proceedings of Nobel Symposium 82, 1991).
Quote 3: evidence in a corpus
We as linguists should train ourselves specifically to be open to the evidence of long text. This is quite different from using the computer to be our servant in trying out our ideas; it is making good use of some essential differences between computers and people.
[…] I believe that we have to cultivate a new relationship between the ideas we have and the evidence that is in front of us. We are so used to interpreting very scant evidence that we are not in a good mental state to appreciate the opposite situation. With the new evidence the main difficulty is controlling and organizing it rather than getting it.
Sinclair. Trust the Text. (2004:17)
Quote 4: why analyse registers?
Register, genre, and style differences are fundamentally important for any student with a primary interest in language. For example, any student majoring in English, or in the study of another language like Japanese or Spanish, must understand the text varieties in that language. If you are training to become a teacher (e.g. for secondary education or for TESL), you will shortly be faced with the task of teaching your own students how to use the words and structures that are appropriate to different spoken and written tasks – different registers and genres. Other students of language are more interested in the study of literature or the creative writing of new literature, issues relating to the style perspective, since the literary effects that distinguish one novel (or poem) from the next are realized as linguistic differences.
Biber & Conrad (2009:4)
Quote 8: sleeping furiously
Tony McEnery has outlined the reasons why corpus linguistics was largely ignored in the past, possibly because of the influence of Noam Chomsky. Prof. McEnery has placed this debate in a wider context in which different stakeholders fight a paradigm war: rationalist introspection versus evidence-driven analysis.
Quote 9: epistemological adherence?
“Science is a subject that relies on measurement rather than opinion”, Brian Cox wrote in the book version of Human Universe, the BBC show. And I think he is right. Complementary research methodologies can only bring about better insights and better-informed debates.
Hands-on workshop. Corpus analysis: the basics.
Tasks
3a Run a word list
3b Run a keyword list
3c Use Concordance Plot: explore its usefulness
3d Choose a lexical item: explore clusters
3e Choose a lexical item: explore n-grams
3f Run a collocation analysis
Download the Conservative manifesto 2017 here and the Labour 2017 manifesto here
OR
Policy paper: DFID Education Policy 2018: Get Children Learning (PDF)