If you work with or read about corpora, you are probably familiar with Sketch Engine. If you aren’t familiar with it, it is described on its own website as “the ultimate corpus tool”, and that’s maybe not an exaggeration. You can do a ton of cool stuff with it. Sketch Engine also provides access to hundreds of ready-to-use corpora in close to a hundred languages.
However, it requires a subscription (although a 30-day trial is available to get started). This puts some people off; after all, there are a lot of free resources out there.
What some people may not realize, though, is that there are some “open corpora” on Sketch Engine that can be explored (with all of Sketch Engine’s features) without registration.
Some interesting corpora there!
What inspired me to write this post is the presence in the list of open corpora of the EcoLexicon English Corpus, which was made available earlier this year.
The EcoLexicon is an environmental knowledge base and tool developed at the University of Granada. It is described as a knowledge base for “language specialists, domain experts, and the general public. Its representations are designed to help translators, technical writers, and environmental experts who need to access and better understand specialized environmental concepts with a view to writing or translating specialized and semi-specialized texts” (San Martín et al., 2017, p. 97).
The EcoLexicon English Corpus is a collection of English texts used in the EcoLexicon project. Searches can be limited according to domain of environmental studies, type of intended user of the text, geographical variety of English, country of publication, year of publication (1973-2016), publication genre, and who edited the text.
From an ELT perspective, certain ESP settings are perhaps the most obvious place for such a corpus to be useful. But really, I’m just happy such a corpus has been made available to anyone who wants to explore language concerning environmental studies, policies, and communication.
The other open corpora are valuable too! And they let you try out Sketch Engine without needing to commit to anything or even register.
San Martín, A., Cabezas-García, M., Buendía, M., Sánchez-Cárdenas, B., León-Araúz, P., & Faber, P. (2017). Recent advances in EcoLexicon. Dictionaries: Journal of the Dictionary Society of North America, 38(1), 96-115. https://doi.org/10.1353/dic.2017.0004
UPDATE 2: I noticed that the bottom of slide #20 has been cut off. The ending of that sentence should be “… are marked as being incorrect.”
These are my slides from a presentation I gave at the JALT International Conference in Tsukuba, Japan. I talked about using SCoRE with my students and their reactions to it (and toward guided induction activities).
The main points were that my students found SCoRE simple to use, and that they liked it, although it was hard to separate their feelings toward SCoRE itself from their feelings toward the guided induction approach. As an aside, in my perception they enjoyed the activities more, and became much quicker at doing them, as they grew familiar with the nature of the activities. I think that if a tool like SCoRE or an approach like guided induction is going to be used, it should be used with some consistency so that students can get accustomed to it (i.e. use it habitually, not as a one-off).
Note: The guided induction worksheets I mentioned in the presentation/slides are saved on a different device than the one I’m using right now, but I will post an example of one later.
In the Word Sketch function, there is now a button that provides more context for the words co-occurring with the target word/lemma.
Automatic PoS tagging in the corpus sometimes results in errors, and this feature is meant to help with this problem. It doesn’t really prevent the errors, but it should help users make correct identifications despite tagging errors.
I use SkELL quite often, so I was glad to read that there has been an update to the example sentences that get shown. Spelling mistakes and the like were an occasional, irritating issue, because they hamper the kinds of SkELL-assisted activities I usually do with my students. For the learners, the cleaner data should be beneficial.
“If you are a user of SKELL, you might have noticed a recent improvement in the quality of the example sentences. This is thanks to the deletion of sentences that contained spelling mistakes and hapax legomena. While both of these things can be of interest, it is better that the 40 example sentences of a word or phrase are as accurate as possible.
There are 10,370 instances of the word ‘dolphin’, for example, in the full corpus. The algorithm that chooses the best 40 for learners now works with cleaner data.”
This might be tl;dr … If you are just looking for a list or some links to parallel corpora, please go to the end of this post.
In response to my presentation at this year’s ETJ Tokyo conference, where I talked about the parallel corpus and DDL tool called SCoRE, I was asked whether parallel corpora are available in other languages. Short answer: yes! Caveat: they are not always straightforward to use.
First of all, a quick explanation of what a parallel corpus is. It is a kind of bi- or multilingual corpus: one that contains text in one language aligned with translations in one or more other languages. So, for example, if I query talk * in the Japanese interface of SCoRE, I will get concordance lines in English that contain talk + any word, and concordance lines in Japanese that are translations of those English lines. These are parallel concordances.
Here is another illustration showing a sample of a concordance from the Portuguese-English parallel corpus COMPARA. The query terms were “talk” “.*” (this is the syntax for the talk + any word search in COMPARA, quote marks included).
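For readers who like to see the mechanics, the alignment idea behind these tools can be sketched in a few lines of Python. This is just a toy illustration of how a parallel concordancer works in principle; the sentence pairs are invented, not taken from SCoRE or COMPARA, and the real tools do far more.

```python
import re

# Toy aligned corpus: each English sentence is paired with its translation.
# (Invented examples for illustration only.)
aligned = [
    ("Let's talk about the plan.", "計画について話しましょう。"),
    ("They talk quietly in the library.", "彼らは図書館で静かに話します。"),
    ("She sings in the choir.", "彼女は合唱団で歌います。"),
]

def parallel_concordance(pairs, pattern):
    """Return the (English, translation) pairs whose English side matches pattern."""
    regex = re.compile(pattern)
    return [(en, trans) for en, trans in pairs if regex.search(en)]

# 'talk' followed by any word -- roughly the talk * query from above.
hits = parallel_concordance(aligned, r"\btalk \w+")
for en, trans in hits:
    print(en, "|", trans)
```

Because the sentences are stored as pairs, matching on the English side automatically brings along the translation, which is essentially what the parallel concordances shown by SCoRE and COMPARA do at scale.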
Parallel concordancing can be used for activities like translation tasks, of course, but it is also useful for DDL, at least in certain situations. In my experience, having L1 translations of English concordance lines available is very helpful for both lower-proficiency students and novice DDL students. Both the content and the format of concordance lines can be difficult for such students, and in both cases the L1 support offered by parallel corpora lets them quickly grasp the meaning of the English lines, freeing them to focus on the context or patterns in the lines. Even when they don’t strictly need the L1 support to understand the English lines, they often feel more comfortable and are more receptive to doing activities and work they are generally unaccustomed to. Perhaps as they become more familiar with concordance lines they can switch to monolingual ones.
Another benefit is that students can get a sense of how differently (or similarly) concepts, ideas, or notions may be expressed in the L2 as compared to their L1, picking up on shades of meaning, nuance, and usage. I’ve seen this lead to lexical development: students have commented that they found a phrase or a new (and natural-sounding) way to express something they had previously expressed inaccurately due to L1 interference, or had been completely unaware of because it wasn’t covered in any traditional way (i.e. it really is something they discovered for themselves). It’s only anecdotal, but I have spoken with my students about these mini ‘light-bulb’ moments and they react very positively to them.
There can be issues, though. Users need some understanding of, say, the directionality and relationship of the source material to the translations, or where the translations have come from and their quality, and of course the translation seen in a concordance line is almost certainly not the only possible or accurate way to translate the source text. Another thing to keep in mind is that students need to share a single L1, unless the corpus is multilingual with translations available for all of the students’ L1s (which would overcome one issue but possibly raise others).
But still, parallel concordances can be quite useful and make it easier for students to get involved in doing DDL work. For more info about uses and issues with parallel corpora/concordances I recommend reading ‘Frankenberg-Garcia, A. (2005). Pedagogical uses of monolingual and parallel concordances. ELT Journal, 59(3), 189-198.’
Finally, where are these parallel corpora? A simple Google search will turn up numerous parallel corpora available for download, such as the Open Parallel Corpus (OPUS), but using those means running your own parallel concordancing software. Something like AntPConc might be a relatively easy-to-use option. However, even if you are comfortable running an application like AntPConc, the parallel corpora you find might not be appropriate for your students unless you are in an ESP environment with students learning language for, say, international legal or technical contexts (like the EuroParl corpus).
Alternatively, I’ve compiled a very brief list of some parallel corpora and projects that have web-based interfaces. One caution, though: I am familiar only with the English-Japanese corpora on this list. Although some of the others have been used for language learning, or designed with language learning as a goal, I cannot vouch for the pedagogic applicability or accuracy of the other language combinations here (I’ll leave that to folks who understand the languages in these corpora).
Two main things were going on in week 4: A) Lecture and activities for annotation and tagging, and B) Reviewing essays that used corpus analyses.
Annotation comes up a lot in CL readings and discussions, and I would imagine it seems extremely technical when you only read or hear about it. The practical activities/videos using CLAWS, TagAnt, and USAS do a lot to demystify the tagging process. I was particularly interested in USAS as I have not used a semantic tagger before.
Reading and reviewing undergraduate essays was also a nice activity because it helped demonstrate, in a few ways, what to look for when doing analyses, and it stimulated thinking about how information and data should be presented. It was useful to me as a reflective assessment.
In week 5 the main topics were, broadly, using CL to look at social issues and an introduction to CQPweb which, if you’ve ever tried to use it before, has an interface that can be difficult to come to grips with (i.e. it’s a bit intimidating, imo). Having a walkthrough like this really helps, because CQPweb is a really powerful tool and it provides access to a lot of corpora (not only in English).
The exploration of CL with forensic linguistics is also quite interesting, and it is a nice example of CL being used to examine the context and setting of language use: not just a focus on the grammatical or internal characteristics of language, but on how those characteristics can inform our understanding of what is going on in a case, and in turn how that understanding can affect and influence decision-making in the world.