Open corpora on Sketch Engine: EcoLexicon

If you work with or read about corpora, you are probably familiar with Sketch Engine. If you aren’t familiar with it, it is described on its own website as “the ultimate corpus tool”, and that’s maybe not an exaggeration. You can do a ton of cool stuff with it. Sketch Engine also provides access to hundreds of ready-to-use corpora in close to a hundred languages.

However, it requires a subscription (although a 30-day trial is available for starters). This puts some people off; after all, there are a lot of free resources out there.

What some people may not realize, though, is that there are some “open corpora” on Sketch Engine that can be explored (with all of Sketch Engine’s features) without registration.


What inspired me to write this post is that the list of open corpora now includes the EcoLexicon English Corpus, which was made available earlier this year.

The EcoLexicon is an environmental knowledge base and tool developed at the University of Granada. It is described as a knowledge base for “language specialists, domain experts, and the general public. Its representations are designed to help translators, technical writers, and environmental experts who need to access and better understand specialized environmental concepts with a view to writing or translating specialized and semi-specialized texts” (San Martín et al., 2017, p. 97).

The EcoLexicon English Corpus is a collection of English texts used in the EcoLexicon project. Searches can be limited according to domain of environmental studies, type of intended user of the text, geographical variety of English, country of publication, year of publication (1973-2016), publication genre, and who edited the text.

From an ELT perspective, perhaps certain ESP settings are the most obvious place for such a corpus to be useful. But really I’m just happy such a corpus has been made available to anyone who wants to explore language concerning environmental studies, policies, and communication.

The other open corpora are valuable too, and they let you try out Sketch Engine without committing to anything or even registering.

Antonio San Martín, Melania Cabezas-García, Miriam Buendía, Beatriz Sánchez-Cárdenas, Pilar León-Araúz, Pamela Faber. 2017. Recent Advances in EcoLexicon. Dictionaries: Journal of the Dictionary Society of North America, 38(1), pp. 96-115.


New to corpus linguistics? Here are the basics

This is a great little explainer from Warren M. Tang. It covers the basics of the basics, and provides ready-to-use definitions and descriptions. He has also got a straightforward glossary of corpus types.

I get questions from colleagues about the basics of corpora sometimes, and I’m always happy to find simple reference materials that can help them out just in case my explanations aren’t clear or head off on a tangent 🙂

Presentation at JALT 2017: Using the SCoRE corpus

UPDATE: If the slides are not appearing properly below, they can be seen here.

UPDATE 2: I noticed that the bottom of slide #20 has been cut off. The ending of that sentence should be “… are marked as being incorrect.”

These are my slides from a presentation I gave at the JALT International Conference in Tsukuba, Japan. I talked about using SCoRE with my students and their reactions to it (and toward guided induction activities).

The main points were that my students felt SCoRE was a simple-to-use tool and that they liked it, although it was hard to differentiate their feelings toward SCoRE itself from their feelings toward the guided induction approach. As an aside, in my perception they liked the activities more and became much quicker at doing them as they became more familiar with the nature of the activities. I think that if a tool like SCoRE or an approach like guided induction is going to be used, it should be used with some consistency so that students can get accustomed to it (i.e. use it habitually, not as a one-off).

Calls for Papers

In addition to the burst of recent research articles, there are a couple of journals putting out CfPs for corpora-related topics.

JCADS is the Journal of Corpora & Discourse Studies. It’s new and will be publishing its first issue in the summer of 2018. The prior link is to the CfP posted on Facebook; submission info is here.

Études en Didactique des Langues (EDL) will publish a special issue on didactic (instructional) uses of corpora. The link is to a French-language page, but if you don’t read French, don’t worry, you can find a PDF of the CfP in both French and English in the list of documents. If you have trouble figuring out which document it is, the date on the document is 17/09/2017.

And finally, though it is not corpus-related, there is an interesting (to me) special issue nonetheless: The Language Learning Journal put out a CfP for an issue focusing on the use of video and other audio-visual material.



A smorgasbord of DDL journal activity

Last month, in addition to the release of new corpora, two journals released special issues dedicated to DDL/CL in language learning.

One is the open-access Language Learning & Technology. I haven’t read it yet, but the table of contents looks very interesting. The other one is Language Testing. It’s interesting to see how CL and questions of assessment interact.

Finally, though not a whole dedicated issue, ReCALL has an online first article titled ‘Unlearning overgenerated be through data-driven learning in the secondary EFL classroom’. This will be the first article I get to, as overgenerated be is a recurring issue for many of my students and I’m curious to see what the authors found.

Spoken BNC2014 & EFCAMDAT

Two large, open-access* corpora have recently been released.

Go here to learn about and get access to the Spoken BNC2014, and go here to learn about and get access to the EF-Cambridge Open Language Database (EFCAMDAT). EFCAMDAT is a learner corpus featuring essays from adult learners of English around the world, and this is its second, bigger release. Spoken BNC2014 is brand new (to the public), and, as you may very well be aware, is being described as “the largest ever public collection of transcribed British conversations”.

I don’t have anything too special to say. These both look like terrific resources. I think all who are interested should check them out.

Sounds odd (unintentionally literal)

Yesterday I wrote that the expression “achieve a breakthrough” sounded odd to me, and I much prefer the more frequent “make a breakthrough”.

“Achieve a breakthrough” is, however, a robust collocation (after MAKE, ACHIEVE is the most frequent lemma with “a breakthrough” in COCA; and it has a high MI score). So why did it sound so odd to me? Have I somehow just not encountered it much? Well, maybe sorta.

In the COCA data, any form of ACHIEVE occurs immediately to the left of “a breakthrough” 24 times, while any form of MAKE in that position occurs 65 times, a ratio of about 1:3. COCA provides register/context information in its KWIC displays, and this info showed me that of ACHIEVE’s 24 occurrences, 2 were in spoken contexts (a little over 8%), while 18 of 65 MAKE hits were spoken (a little under 28%). So, the ratio in spoken contexts is 1:9. Alternatively, in non-spoken contexts the ratio is about 1:2.
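For anyone who wants to check the arithmetic or rerun it with counts from another corpus, here is a minimal sketch in Python. The four counts are the ones reported above; the variable and function names are just my own labels for illustration, and nothing here queries COCA itself.

```python
# Counts reported in the post: hits for each lemma immediately to the
# left of "a breakthrough" in COCA, with the number tagged as spoken.
counts = {
    "ACHIEVE": {"total": 24, "spoken": 2},
    "MAKE": {"total": 65, "spoken": 18},
}


def spoken_share(verb):
    """Proportion of a verb's hits that come from spoken contexts."""
    c = counts[verb]
    return c["spoken"] / c["total"]


def make_per_achieve(key):
    """How many MAKE hits there are for every ACHIEVE hit, for one count."""
    return counts["MAKE"][key] / counts["ACHIEVE"][key]


# Derive the non-spoken counts by subtraction (22 for ACHIEVE, 47 for MAKE).
for c in counts.values():
    c["non_spoken"] = c["total"] - c["spoken"]

overall = make_per_achieve("total")
spoken = make_per_achieve("spoken")
non_spoken = make_per_achieve("non_spoken")

print(f"Spoken share of ACHIEVE hits: {spoken_share('ACHIEVE'):.1%}")
print(f"Spoken share of MAKE hits:    {spoken_share('MAKE'):.1%}")
print(f"ACHIEVE:MAKE overall     = 1:{overall:.1f}")
print(f"ACHIEVE:MAKE spoken      = 1:{spoken:.1f}")
print(f"ACHIEVE:MAKE non-spoken  = 1:{non_spoken:.1f}")
```

With only 2 spoken hits for ACHIEVE, the spoken ratio is very sensitive to a single extra occurrence, so I would treat it as a hint rather than a firm finding.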


I was in a speaking/listening skills oriented class, and we were talking about “achieve a breakthrough”. Perhaps the expression sounded odd to me because it’s not an expression I hear much, even if it would be unremarkable to me in a written context. That is, perhaps it is primarily an expression of written English, but not spoken, and thus it struck me as out of place or something.

I don’t actually think COCA is the best source of data about spoken English, but this sort of thing could be explored in other, maybe more suitable corpora.

Title note: when I said it sounded odd to me, I meant in a broad sense it seemed odd to me. Now I realize I may have unintentionally described how it really, literally, sounded odd to me.