Category: corpus linguistics

Vyatkina 2020. Corpora as open educational resources for language teaching.

Post author By perezparedes
Post date 26th June 2020

Vyatkina, N. 2020. Corpora as open educational resources for language teaching. Foreign Language Annals; 1– 12

Abstract

Corpora, large electronic collections of texts, have been used in language teaching for several decades. Also known as Data‐Driven Learning (DDL), this method has been gaining popularity because empirical research has consistently shown its effectiveness for learning. However, corpora are still underutilized, especially with learners of languages other than English, at lower proficiency levels, and in non‐university contexts. This is regrettable because DDL has a great potential for developing modular flipped content, especially for hybrid, remote, and online courses. This article first provides an overview of DDL applications and findings of empirical research. Next, it outlines obstacles to wider DDL implementation as well as available and possible solutions. Corpus user guides and exercise collections tied to specific corpora are discussed as one promising direction, and an example of such new open educational resources for teaching German is presented. The article concludes with a discussion of implications and future directions.

Conclusions

The above overview shows that corpora have been successfully used in language teaching for decades. Corpus‐based word frequency lists have informed teaching syllabi and materials, and both teachers and learners have used corpora directly to search for language use examples and explore patterns. Such inductive DDL applications have led to significant learning gains, especially in vocabulary and grammar knowledge, and have been frequently more efficient than non‐DDL teaching methods while at the same time enhancing learner autonomy and providing individualized learning experiences. These aspects of DDL make it especially suitable for applications in hybrid, remote, and online courses. Corpus‐based modules can be developed to supplement and enhance existing syllabi with digital and flipped content. Students can conduct corpus searches individually on their own computers and at their own pace, and report the results to the teachers via worksheets or other conventional media. This type of work would also contribute to larger educational goals such as developing students’ critical thinking, analytical ability, and digital literacy.

Although DDL has predominantly targeted EFL and ESL university students, there is no inherent reason for why it should be restricted to these contexts. The field has recently been expanding, and the readers of this journal can take heart in the fact that DDL can also work with LOTEs, primary and secondary schools, as well as beginning learners. There is still a great potential for growth in these areas. Publication of teacher guides and DDL exercise collections integrated with specific corpora would be especially helpful to teachers. Incorporating Corpora , introduced above, is an exemplar of such a project that presents an alternative “third way” to hands‐on and hands‐off DDL, a middle ground “between the polished, albeit limited, linguistic information neatly systematized in dictionaries and the countless other linguistic facts that can be gleaned from corpora, but which only experienced corpus users are able to access” (Frankenberg‐Garcia, 2014, p. 141). The pilot study conducted with the project’s materials showed that even lower‐competency learners were capable of autonomous DDL when it was scaffolded through an online guiding interface. It is planned to continuously maintain, update, and expand the project’s modules as well as to test them with other teachers and students, including those in secondary schools. Although Incorporating Corpora is focused on a specific German corpus, it can serve as a model for creating similar materials for other languages and corpora, and it is hoped that other DDL researchers and practitioners will follow suit.

5 recent papers on language complexity and learner language

Post author By perezparedes
Post date 4th June 2020

Bulté, B., & Roothooft, H. (2020). Investigating the interrelationship between rated L2 proficiency and linguistic complexity in L2 speech. System, 102246.

Abstract

This study investigates the relationship between nine quantitative measures of L2 speech complexity and subjectively rated L2 proficiency by comparing the oral productions of English L2 learners at five IELTS proficiency levels. We carry out ANOVAs with pairwise comparisons to identify differences between proficiency levels, as well as ordinal logistic regression modelling, allowing us to combine multiple complexity dimensions in a single analysis. The results show that for eight out of nine measures, targeting syntactic, lexical and morphological complexity, a significant overall effect of proficiency level was found, with measures of lexical diversity (i.e. Guiraud’s index and HD-D), overall syntactic complexity (mean length of AS-unit), phrasal elaboration (mean length of noun phrase) and morphological richness (morphological complexity index) showing the strongest association with proficiency level. Three complexity measures emerged as significant predictors in our logistic regression model, each targeting different linguistic dimensions: Guiraud’s index, the subordination ratio and the morphological complexity index.

Conclusion

The present study on the relationship between nine complexity measures and five different levels of oral proficiency, as measured by the IELTS speaking test, confirms previous studies which have found that learners at higher levels of proficiency tend to produce more complex language. Even though we found higher complexity scores in higher proficiency levels for measures of lexical, syntactic and morphological complexity, the observed patterns differ substantially across measures. If we only consider differences between adjacent proficiency levels, we observed a significant increase in morphological richness (as measured by the morphological complexity index) between levels 4 and 5, in lexical diversity (Guiraud’s index) between levels 5 and 6, and in overall syntactic (mean length of AS-unit), clausal (mean length of clause) and phrasal complexity (mean length of noun phrase) as well as lexical diversity (Guiraud’s index and HD-D) between levels 6 and 7. We did not observe significant differences in complexity between the highest two proficiency levels in our dataset (i.e. 7 and 8). In addition, we found that the Guiraud index, the subclause ratio and the morphological complexity index applied to verbs were significant predictors for proficiency level in our ordinal logistic regression model, explaining around two thirds of the variance in proficiency level.

Crossley, S. (2020). Linguistic features in writing quality and development: An overview. Journal of Writing Research, 11(3).

Abstract

This paper provides an overview of how analyses of linguistic features in writing samples provide a greater understanding of predictions of both text quality and writer development and links between language features within texts. Specifically, this paper provides an overview of how
language features found in text can predict human judgements of writing proficiency and changes in writing levels in both cross-sectional and longitudinal studies. The goal is to provide a better understanding of how language features in text produced by writers may influence writing quality
and growth. The overview will focus on three main linguistic construct (lexical sophistication, syntactic complexity, and text cohesion) and their interactions with quality and growth in general. The paper will also problematize previous research in terms of context, individual differences, and reproducibility.

Conclusion

While there are a number of potential limitations to linguistic analyses of writing, advanced NLP tools and programs have begun to address linguistic complications while better data collection methods and more robust statistical and machine learning approaches can help to control for confounding variables such as first language
differences, prompt effects, and variation at the individual level. This means that we are slowly gaining a better understanding of interactions between linguistic production and text quality and writing development across multiple types of writers, tasks, prompts, and disciplines. Newer studies are beginning to also look at interaction between linguistic features in text (product measures) and writing process characteristics such as
fluency (bursts), revisions (deletions and insertions) or source use (Leijten & Van Waes, 2013; Ranalli, Feng, Sinharry, & Chukharev-Hudilainen, 2018; Sinharay, Zhang, & Deane, 2019). Future work on the computational side may address concerns related to the accuracy of NLP tools, the classification of important discourse structures such as claims and arguments, and eventually even predictions of argumentation strength, flow,
and style.
Importantly, we need not wait for the future because linguistic text analyses have immediate applications in automatic essay scoring (AES) and automatic writing evaluation (AWE), both of which are becoming more common and can have profound effects on the teaching and learning of writing skills. Current issues for both AES and AWE involve both model reliability (Attali & Burstein, 2006; Deane, Williams, Weng, &
Trapani, 2013; Perelman, 2014) and construct validity (Condon, 2013; Crusan, 2010; Deane et al., 2013; Elliot et al., 2013, Haswell, 2006; Perelman, 2012), but more principled analyses of linguistic feature, especially those that go beyond words and structures, are helping to alleviate those concern and should only improve over time. That being said, the analysis of linguistic features in writing can help us not only better understand writing quality and development but also improve the teaching and learning of writing skills and strategies.

Díez-Bedmar, M. B., & Pérez-Paredes, P. (2020). Noun phrase complexity in young Spanish EFL learners’ writing: Complementing syntactic complexity indices with corpus-driven analyses. International Journal of Corpus Linguistics, 25(1), 4-35.

Abstract

he research reported in this article examines Noun Phrase (NP) syntactic complexity in the writing of Spanish EFL secondary school learners in Grades 7, 8, 11 and 12 in the International Corpus of Crosslinguistic Interlanguage. Two methods were combined: a manual parsing of NPs and an automatic analysis of NP indices using the Tool for the Automatic Analysis of Syntactic Sophistication and Complexity (TAASSC). Our results revealed that it is in premodifying slots that syntactic complexity in NPs develops. We argue that two measures, (i) nouns and modifiers (a syntactic complexity index) and (ii) determiner + multiple premodification + head (a NP type obtained as a result of a corpus-driven analysis), can be used as indices of syntactic complexity in young Spanish EFL learner language development. Besides offering a learner-language-driven taxonomy of NP syntactic complexity, the paper underscores the strength of using combined methods in SLA research.

Conclusion

Our research highlights the need for using combined methods of analysis that examine the same data from different perspectives. The use of statistical complexity analysis software (Kyle, 2016) has allowed us to account for every single noun and nominal group in the corpus. The range of indices in Kyle (2016) has allowed us to approach syntactic phenomena from a purely quantitative perspective. As a result, we have found that the use of the “Nouns as modifiers” index yields significant differences between Grades 8 and 12, which confirms our finding that premodification slots are of interest for the study of learner language development. The corpus-driven manual analysis of NPs, in turn, has allowed us to gain an in-depth understanding of the types of complexity patterns used by learners in the different grades. As a result of this approach, our research has produced a learner-generated taxonomy of NP syntactic complexity that can be used in studies that examine learner language in other contexts. By combining these two research methods, we hope to make a case for their integration and to enrich methodological pluralism (McEnery & Hardie, 2012; Römer, 2016). Moreover, the findings obtained with the two methods are consistent and thus show promising avenues for collaboration and complementarity.

Two methodological features of this study are worth considering. The fine-grained classification of NP types, which includes every NP type found in the corpus, may have determined the results of the statistical analysis: the more detailed the classification of NP, the more likely it is to obtain a low number of instances in some of the NP types. Another feature to be considered is that the manual parsing conducted did not include every single noun in the corpus. This may be seen as a limitation of this study. Another limitation lies in the use of automatic analysis software and POS tagging that was not written primarily to navigate learner language. The impact of these systems on learner-language analysis has rarely been explored in corpus linguistics, and we believe that these software solutions should be sensitive to the range of disfluencies of learner language. If the small number of errors found in the use of automatic tools in learner language are considered tolerable, the automatic analysis of complexity and frequency indices in learner language can be beneficial. Finally, this study has not offered a Contrastive Interlanguage Analysis (CIA) (Granger, 1996, 2015) as it is beyond the scope of this paper to look at other L1 learners or English as an L1.

Khushik, G. A., & Huhta, A. (2019). Investigating Syntactic Complexity in EFL Learners’ Writing across Common European Framework of Reference Levels A1, A2, and B1. Applied Linguistics.

Abstract

The study investigates the linguistic basis of Common European Framework of Reference (CEFR) levels in English as a foreign language (EFL) learners’ writing. Specifically, it examines whether CEFR levels can be distinguished with reference to syntactic complexity (SC) and whether the results differ between two groups of EFL learners with different first languages (Sindhi and Finnish). This sheds light on the linguistic comparability of the CEFR levels across L1 groups. Informants were teenagers from Pakistan (N = 868) and Finland (N = 287) who wrote the same argumentative essay that was rated on a CEFR-based scale. The essays were analysed for 28 SC indices with the L2 Syntactic Complexity Analyzer and Coh-Metrix. Most indices were found to distinguish CEFR levels A1, A2, and B1 in both language groups: the clearest separators were the length of production units, subordination, and phrasal density indices. The learner groups differed most in the length measures and phrasal density when their CEFR level was controlled for. However, some indices remained the same, and the A1 level was more similar than A2 and B2 in terms of SC across the two groups.

Vercellotti, M. L. (2019). Finding variation: assessing the development of syntactic complexity in ESL Speech. International Journal of Applied Linguistics, 29(2), 233-247.

Abstract

This paper examines the development and variation of syntactic complexity in the speech of 66 L2 learners over three academic semesters in an intensive English program. This investigation tracked development using hierarchical linear modeling with three commonly‐used, recommended measures of productive complexity (i.e., length of AS‐unit, clause length, subordination) and three exploratory measures of structural complexity (i.e., syntactic variety, weighted complexity scores, frequency of nonfinite clauses) to capture different aspects of syntactic complexity. All measures showed growth over time, suggesting that learners are not forced to prioritize certain aspects of the construct at the expense of others (i.e., no trade‐off effects) across development. The unexplained significant variation found in these data differed among the measures reinforcing notions of multidimensionality of linguistic complexity.

Conclusion

The results can inform the measurement choices and methodology for future English L2 research. As would be expected with language learning performance, there was substantial variation. L2 researchers likely want to use practical measures that capture the variation between individuals and across development. The variation in different parts of the measure’s models suggest that the measures capture separate aspects of complexity, and some suggestions can be offered. Subordination may serve as a practical, broad measure of complexity in instructed contexts. The easily calculated phrasal complexity revealed variation early in development, as did the weighted structural complexity measure. Moreover, researchers may want to consider using the weighted complexity measure for research investigating individual differences in language performance. One possibility is to create a measure based on standard deviation (e.g., De Clercq & Housen, 2017) of the weighted complexity measure, if the study’s purpose is to measure the variety of structural complexity in the language sample, rather than the growth of the developmentally‐aligned structural complexity. When investigating differences in language learning outcomes, general complexity and the weighted structural complexity may be useful, given the additional variation found in the models. The unexplained significant remaining variation between individuals is fodder for future longitudinal research. For instance, future research might consider how production may be influenced by the frequency and function of constructions in learners’ L1s, motivation (Verspoor & Behrens, 2011), or individual speaking style (Pallotti, 2009). Overall, this paper offers a unique comparison of syntactic complexity, both productive and structural complexity measures, advancing our understanding of this most complex construct of language performance.

Tags Complexity

corpus linguistics

Tyler (2010): Usage-Based Approaches to Language and Their Applications to Second Language Learning

Post author By perezparedes
Post date 25th May 2020

A selection of extracts form : Tyler, A. (2010). Usage-Based Approaches to Language and Their Applications to Second Language Learning. Annual Review of Applied Linguistics, 30, 270-291. doi:10.1017/S0267190510000140

The central idea in usage-based models is that a user’s language emerges as a result of exposure to numerous usage events (Kemmer & Barlow, 2000), that is, situated instances of the language user understanding or producing language to convey particular meaning in a specific communicative situation. When a speaker engages in communication, it is assumed that she is attempting to achieve specific interactional goals using intentionally chosen linguistic strategies aimed at members of her speech community. Tying usage events to particular speech communities reflects the understanding that actual language use is culturally and contextually embedded. (p.271)

In the United States, Byrnes and her colleagues (e.g., Byrnes, 2009; Byrnes,
Maxim, & Norris, in press) have successfully developed a genre-based curriculum for advanced learners of German as a foreign language. The focus is on close reading of genre-defined texts and analysis of the situated choices writers make in order to create meaning. Thus, the reading-writing-speaking connections are of central concern. The close reading involves intensive analysis of genre based usage of lexicon, grammatical choices, and rhetorical structure in reading and writing. Texts are presented as “models for the meaning-making resources available in the language system; texts are instances of linguistically construed cultural situations that themselves are reflective of the entire cultural system”. (p.274)

Part of the ability to form schema is importantly connected to humans’ large memory capacity and sensitivity to linguistic input and input frequency. In fact, studies of frequency effects (e.g., Bybee, 2003, 2006; Bybee & Moder, 1983; Ellis, 2002a, 2002b, 2008a,b) have shown that humans are quite sensitive to the frequency with which they have encountered words and the constructions in which words are encountered, down to evidencing reaction time sensitivity to how likely they are to hear a verb in the present tense versus the past tense. The growing body of evidence amassed by Tomasello (e.g., 2003) and his colleagues (e.g., Lieven & Tomasello, 2008) indicates that first language learning is itembased and frequency driven. Their work shows that young children are highly
attuned to the specific exemplars of language in the input and the context in which they are used. (p.281)

Recall that CL argues that syntax itself is meaningful and that syntactic patterns are templates abstracted over many instances of utterances. Moreover, there are no derivations or transformations of a more basic syntactic pattern to a second, derived pattern. Thus, Goldberg represented the double object construction (Subj-X Verb Obj-Y Obj-Z) as meaning X caused Y to receive Z, as in Hiro gave Yun the book . This construction contrasts with the caused-motion construction (Subj-X Verb-Obj Y Oblique object Z) which means X caused Y to move to Z, as in Lissa threw the football to Mike. A few studies demonstrating the usefulness of Goldberg’s construction grammar (Goldberg, 1995, 2006) in L2 learning are beginning to emerge: Mellow (2006) for teaching relative clause constructions; Manzanares and Rojo Lopez (2008) for double object constructions;
Strauss, Lee, and Ahn (2006) for Korean completive constructions. (p.285)

corpus linguistics Usage-based perspectives

Understanding language in its natural habitat

Post author By perezparedes
Post date 17th May 2020

Linguists and psychologists often study sentences in isolation, which may be akin to studying animals in separate cages in a zoo. I have been guilty of this in much of my own work. Our understanding of language will undoubtedly benefit from more focus on language in its natural habitat: conversation (e.g., Du Bois et al., 2003; Hilpert, 2017; Thompson and Hopper, 2001). In fact, the human tendency to cooperate plays a key role in how our complex system of language emerges via cultural evolution (Botha and Knight, 2009; De Boer et al., 2012; Ellis and Larsen-Freeman, 2009; Richerson and Christiansen, 2013; Steels, 2005; Tomasello, 2009). Cooperation is most evident when people use language to communicate with one another in conversations.

Language provides a constrained and discrete system that offers us a window into our even more general, creative, but constrained system of general-purpose knowledge. It allows us to teach and learn, dream and imagine, and reflect and reason in ways that are uniquely human. Moreover, a focus on the functions and distributions of constructions offers important insights about how individual constructions emerge and evolve over time (e.g., Barðdal et al., 2015; Mauri and Sansò, 2011; Traugott, 2015; Traugott and Trousdale, 2013).

There is a growing synergy among linguists, psychologists, anthropologists, and computer scientists, and so it is a very exciting time for research on language. This volume only scratches the surface of what we have already learned, as is evident in the copious list of references.

Goldberg, A. E. (2019: 163). Explain me this: Creativity, competition, and the partial productivity of constructions. Princeton University Press.

Tags Goldberg, Usage-based

Academic discourse analysis of language corpus linguistics video videos

Recordings of #corpuslinguistics LAEL webinars

Post author By perezparedes
Post date 13th May 2020

This message distributed on the corpora list by Prof Tony Berber Sardinha, Director, LAEL. Graduate Program in Applied Linguistics

The recordings of the following LAEL webinars are available at:
https://www.youtube.com/channel/UC9-sGRAJFwaPj0I4U3aR0jA

Speaker: Tove Larsson, Uppsala University, Sweden; Un. Catholique Louvain, BelgiumTitle: On Learner Corpus Research and things I wish someone had told me when I was a grad student

Speaker: Tyle True, Northern Arizona University, USATitle: Do you know how fast you were going?: Learning about the language of U.S. traffic stops to help L2 English

Speaker: Viviana Cortes, Georgia State University, USATitle: Lexical bundles as building blocks in discourse: the case of 3-word bundles

Speaker: Joe Collentine, Northern Arizona University, USATitle: Corpus research in the acquisition of Spanish as an L2

Speaker: Larissa Goulart, Northern Arizona University, USATitle: Second language writing for University purposes: From learner, to teacher, to researcher

Speaker: Ana Bocorny, Federal University of Rio Grande do Sul, BrazilTitle: Writing scientific articles with the support of key lexical bundles: a corpus-driven study in the area of health sciences

Speaker: Deise P. Dutra, Federal University of Minas Gerais, BrazilTitle: Language use: Corpus linguistics informing the language classroom

Speaker: Carolina Zuppardi, Laureate Digital and Pontifical Catholic University of Sao Paulo, BrazilTitle: Collocation dimensions in academic English

More webinar recordings will be made available on that channel.
For information on future webinars, go to:
http://corpuslg.org/lael_english/webinars/

Tags LAEL webinars

corpus linguistics NLP syntax text analysis text tools

SyntagNet

Post author By perezparedes
Post date 9th April 2020

This information from corpora-bounces@uib.no

SyntagNet, a resource with 88,000 lexical-semantic combinations, is now out!

We are proud to announce that SyntagNet 1.0 (http://syntagnet.org) is now available for download at http://syntagnet.org/download. Developed at the Sapienza NLP group (http://nlp.uniroma1.it), the multilingual Natural Language Processing group at the Sapienza University of Rome, SyntagNet is a manually-curated large-scale lexical-semantic combination database which associates pairs of concepts with pairs of co-occurring words. The goal of SyntagNet is to capture sense distinctions evoked by syntagmatic relations (e.g. mouse.n.1 and squeak.v.1 vs mouse.n.2 and click.n.4), hence providing information which complements the essentially paradigmatic knowledge shared by currently available Lexical Knowledge Bases such as WordNet. Its main features are:

Wide coverage, with 78,000 noun-verb and noun-noun lexical combinations extracted from the English Wikipedia and the British National Corpus.
High-quality, fully manual disambiguation for all of the lexical combinations, according to the WordNet 3.0 sense inventory.
A resulting Lexical Knowledge Base made up of 88,019 semantic combinations linking 20,626 WordNet 3.0 unique synsets with a relation edge.
A user-friendly web interface for looking up terms and their lexical-semantic combinations, with complete linkage to BabelNet 4.0.

And much more! Please check out our EMNLP 2019 paper:

M. Maru, F. Scozzafava, F. Martelli, R. Navigli. SyntagNet: Challenging Supervised Word Sense Disambiguation with Lexical-Semantic Combinations, Proc. of EMNLP-IJCNLP 2019

or http://syntagnet.org for more details!

SyntagRank, a state-of-the-art knowledge-based Word Sense Disambiguation system which uses SyntagNet to perform disambiguation in five languages (English, French, German, Italian and Spanish) is also available from the same website (will be demoed at ACL 2020!).

SyntagNet is an output of the MOUSSE ERC Consolidator Grant No. 726487 and of the ELEXIS project No. 731015 under the European Union’s Horizon 2020 research and innovation programme. Babelscape proudly developed the online interface and API, and provides the infrastructure for maintaining the service.

The Sapienza NLP group