New Directions in Corpus-based Translation Studies

Through the Corpora List

:::::::::::::::::::::::::::::::::::::

The “Language Science Press” has just published the following open access book in their series “Translation and Multilingual NLP”:

“NEW DIRECTIONS IN CORPUS-BASED TRANSLATION STUDIES” by Claudio Fantinuoli & Federico Zanettin (eds.)

Please download your free copy from http://langsci-press.org/catalog/book/76

ABSTRACT

Corpus-based translation studies has become a major paradigm and research methodology and has investigated a wide variety of topics in the last two decades. The contributions to this volume add to the range of corpus-based studies by providing examples of some less explored applications of corpus analysis methods to translation research. They show that the area keeps evolving as it constantly opens up to different frameworks and approaches, from appraisal theory to process-oriented analysis, and encompasses multiple translation settings, including (indirect) literary translation, machine(-assisted) translation and the practical work of professional legal translators. The studies included in the volume also expand the range of application of corpus applications in terms of the tools used to accomplish the research tasks outlined.

Free ngram databases from COW14 web corpora

From the corpora list

::::::::::::::::::::::::::::::

We are pleased to announce the release of the first very large ngram databases derived from the giga-token COW14 web corpora. They are completely free (CC-BY) and can be downloaded without registration. We have applied no frequency thresholds whatsoever. In addition to the counted ngram lists, we offer raw versions so that anyone can build their own counted version. The raw ngrams also contain additional information (crawl year, top-level domain, country geolocation).

There are also English dependency bigrams (based on Malt parses) containing words, their heads, and the dependency relation between them.

For end-users, there are also word and lemma frequency lists with some convenient frequency measures, optionally with a frequency threshold of 10 (smaller files, easier handling).

----------------------------------------

LICENSE AND REFERENCES

License: Creative Commons Attribution 4.0 International
References: http://corporafromtheweb.org/category/cow-citation/

Please tell us whenever you publish work based on COW:
https://webcorpora.org/publication/

DOWNLOAD

http://hpsg.fu-berlin.de/cow/ngrams/
http://hpsg.fu-berlin.de/cow/frequencies/

ORIGIN AND ORIGINAL CORPUS SIZES

The ngrams are derived from the COW14AX sentence-shuffled corpora.

Information: http://corporafromtheweb.org/category/corpora/
Interface: https://webcorpora.org/

English: 9,578,828,861 tokens (International)
German: 11,660,894,000 tokens (AT, CH, DE)
Spanish: 3,680,794,644 tokens (International)
Swedish: 4,842,753,707 tokens (FI, SV)

FREQUENCY LISTS

Languages: English, German, Spanish, Swedish
Versions: Lemma, Lemma + POS, Word, Word + POS
Thresholds: no threshold; raw frequency > 9
Measures: raw frequency, absolute rank, frequency per million, log-frequency per million, frequency band
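
As an illustration only, the measures listed above can be approximated from raw counts along the following lines. This is a minimal sketch and not the COW team's own procedure: the function names, the tie-breaking rule for ranks, and the choice of base 10 for the logarithm are all assumptions, and the frequency band is left out because the announcement does not define it. The English corpus size is the one quoted in the announcement.

```python
import math

# Token count of the English COW14 corpus, as quoted in the announcement.
ENGLISH_TOKENS = 9_578_828_861


def per_million(raw_freq, corpus_tokens):
    """Normalise a raw frequency to occurrences per million tokens."""
    return raw_freq * 1_000_000 / corpus_tokens


def frequency_measures(counts, corpus_tokens, min_freq=0):
    """Return one row per item with raw frequency, rank, and normalised values.

    Assumptions (not stated in the announcement): ranks are assigned by
    descending frequency with ties broken alphabetically, and the
    log-frequency uses base 10.
    """
    kept = {item: freq for item, freq in counts.items() if freq >= min_freq}
    ranked = sorted(kept.items(), key=lambda kv: (-kv[1], kv[0]))
    rows = []
    for rank, (item, freq) in enumerate(ranked, start=1):
        fpm = per_million(freq, corpus_tokens)
        rows.append({
            "item": item,
            "raw": freq,
            "rank": rank,
            "fpm": fpm,
            "log_fpm": math.log10(fpm),
        })
    return rows
```

Calling `frequency_measures(counts, ENGLISH_TOKENS, min_freq=10)` would mimic the "raw frequency > 9" threshold on a dictionary of raw counts.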

NGRAMS

N: 1 to 5
Languages: English, German, Spanish, Swedish
Versions: Raw, Word, Word + POS, Lemma (except Swedish)
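
Since the raw ngrams come without frequency thresholds and carry crawl metadata, a counted list can in principle be rebuilt with custom filters. The sketch below assumes a hypothetical tab-separated layout (ngram, crawl year, top-level domain, country); the actual file format is not described in the announcement and should be checked against the downloaded files. The function names are illustrative.

```python
from collections import Counter


def count_ngrams(raw_lines, tld=None):
    """Count ngrams from raw lines, optionally restricted to one top-level domain.

    Assumed (hypothetical) line layout: ngram<TAB>year<TAB>tld<TAB>country.
    Lines that do not have exactly four fields are skipped.
    """
    counts = Counter()
    for line in raw_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue
        ngram, _year, line_tld, _country = fields
        if tld is not None and line_tld != tld:
            continue
        counts[ngram] += 1
    return counts


def above_threshold(counts, min_count):
    """Keep only ngrams that occur at least min_count times."""
    return {ngram: n for ngram, n in counts.items() if n >= min_count}
```

Because `count_ngrams` accepts any iterable of lines, an open file handle can be passed directly, which avoids loading a multi-gigabyte file into memory.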

DEPENDENCY BIGRAMS

Languages: English (German soon, maybe Swedish)
Versions: Raw, Word, Word + POS, Lemma, Lemma + POS

CFP: Posters on late-breaking results (June 15 deadline)

Through the corpora list

:::::::::::::::::::::::::::::::::
CORPUS LINGUISTICS 2015

The CL2015 organising committee is pleased to issue a call for posters on late-breaking results on any of the topics in the conference’s scope. By “late-breaking” we mean research which was not at a sufficiently advanced stage for an abstract submission to be made in the main submission cycle, but which has now reached that point.

We anticipate that the research in question will still be in its earliest phases. “Late-breaking results” include – but are not necessarily limited to – pilot study results, corpus creation activities currently in hand, newly-developed software, and so on.

· Abstracts should be 400-750 words in length. They must be formatted using the conference stylesheet (available to download from http://ucrel.lancs.ac.uk/cl2015/call.php).

· We especially encourage submission of abstracts from early-career researchers, including postgraduate research students and postdoctoral researchers.

· Abstracts which were previously submitted for the January deadline, and not accepted, are NOT eligible to be resubmitted.

· Abstracts should be submitted by email to cl2015@lancaster.ac.uk by 15th June 2015.

· As with all presentations, at least one author of any late-submission poster must attend the conference.

For more details, see http://ucrel.lancs.ac.uk/cl2015

An archive copy of the previously-circulated CL2015 Call for Participation may be found here: http://ucrel.lancs.ac.uk/cl2015/doc/CL2015-CallParticipation.pdf

Andrew Hardie, Tony McEnery, Amanda Potts, Vaclav Brezina, and Paul Rayson
The CL2015 Organising Committee

Adam Kilgarriff: a selection of papers and talks

Some readings to remember one of the most influential corpus linguists of the 20th and 21st centuries.

Using corpora for language research

https://www.sketchengine.co.uk/documentation/attachment/wiki/AK/Papers/SkE_for_lingResearch2013.ppt?format=raw

Googleology is bad science

http://www.kilgarriff.co.uk/Publications/2007-K-CL-Googleology.pdf

Grammar is to meaning as the law is to good behaviour. Corpus Linguistics and Linguistic Theory 3 (2): 195-198.

http://www.kilgarriff.co.uk/Publications/2007-K-CLLT-grammarlaw.doc

Native & learner language in interviews

This talk discusses some of our findings in:

Pérez-Paredes, P., & Sánchez Tornel, M. (2015). A multidimensional analysis of learner language during story reconstruction in interviews. In M. Callies & S. Götz (Eds.), Learner Corpora in Language Testing and Assessment. Amsterdam: John Benjamins.