If you work with or read about corpora, you are probably familiar with Sketch Engine. If you aren’t familiar with it, it is described on its own website as “the ultimate corpus tool”, and that’s maybe not an exaggeration. You can do a ton of cool stuff with it. Sketch Engine also provides access to hundreds of ready-to-use corpora in close to a hundred languages.
However, it requires a subscription (although a 30-day trial is available for starters). This puts some people off; after all, there are a lot of free resources out there.
What some people may not realize, though, is that there are some “open corpora” on Sketch Engine that can be explored (with all of Sketch Engine’s features) without registration.
Some interesting corpora there!
What inspired me to write this post is the presence in the list of open corpora of the EcoLexicon English Corpus, which was made available earlier this year.
The EcoLexicon is an environmental knowledge base and tool developed at the University of Granada. It is described as a knowledge base for “language specialists, domain experts, and the general public. Its representations are designed to help translators, technical writers, and environmental experts who need to access and better understand specialized environmental concepts with a view to writing or translating specialized and semi-specialized texts” (San Martín et al., 2017, p. 97).
The EcoLexicon English Corpus is a collection of English texts used in the EcoLexicon project. Searches can be limited according to domain of environmental studies, type of intended user of the text, geographical variety of English, country of publication, year of publication (1973-2016), publication genre, and who edited the text.
From an ELT-perspective, perhaps certain ESP-settings are the most obvious place for such a corpus to be useful. But really I’m just happy such a corpus has been made available to anyone who wants to explore language concerning environmental studies, policies, and communication.
The other open corpora are valuable too! They also let you try out Sketch Engine without committing to anything or even registering.
San Martín, A., Cabezas-García, M., Buendía, M., Sánchez-Cárdenas, B., León-Araúz, P., & Faber, P. (2017). Recent advances in EcoLexicon. Dictionaries: Journal of the Dictionary Society of North America, 38(1), 96-115. https://doi.org/10.1353/dic.2017.0004
There is no mention that who can be used for animals. And it isn’t only MW. I checked six learner’s dictionaries and none of them said this was acceptable, or even an option. This isn’t a huge deal, necessarily, but it can lead to confusion if, say, this has been taught as a ‘rule’ and then students read graded readers where the ‘rule’ is broken without any consequence (search for horse who, fish who, or monkey who in the Lextutor graded reader corpus, for example). Of course, there are tons of other sources where students might encounter who being used for animals.
This came up because a student of mine had written “I don’t want a dog who is so big” and a peer suggested it should be which. And that’s FINE. It can be which :-). Or that :-). But it can also be who :-D.
For teachers who like to do consciousness raising or language awareness activities, this kind of situation provides opportunities for discussions of things like what if you read/hear some language in real life that doesn’t seem to match what the dictionary/grammar guide says, analyzing lines of ‘controversial’ lexicogrammar patterns, or formulating ideas about why people choose to use who, or which, or maybe that in different circumstances (does it seem ok for certain animals but not others? is it special for pets? does it change the meaning/tone/nuance?).
Of course, the underlying ideas could be applied to other language questions, too. So, in general, the ‘corpus lesson’ here is that corpora can be used to explore alternatives to more conventional patterns and aid in developing greater language awareness. Corpus-use can be applied to not just learning frequent or common patterns of expression, but to expanding the ways in which learners are able to express themselves.
While talking about this with another teacher, it was suggested that maybe the learner’s dictionaries (and perhaps some other learner-oriented materials) don’t acknowledge who for animals as acceptable because it’s new (recent) and thus ‘non-standard’. But I have trouble seeing which of these would be considered ‘non-standard’ (in fact, I doubt that in many cases fluent English users would even notice this usage unless it were pointed out or they were looking for it). And it’s not really a recent thing, is it?:
This might be tl;dr … If you are just looking for a list or some links to parallel corpora, please go to the end of this post.
In response to my presentation at this year’s ETJ Tokyo conference, where I talked about the parallel corpus and DDL tool called SCoRE, I was asked whether there were parallel corpora available in other languages. Short answer: Yes! Caveat: They are not always straightforward to use.
First of all, a quick explanation of what a parallel corpus is. It is a kind of bi- or multilingual corpus that contains text from one language aligned with translations in one or more other languages. So, for example, if I query talk * in the Japanese interface of SCoRE, I will get concordance lines in English that contain talk + any word and concordance lines in Japanese that are translations of those English lines. These are parallel concordances.
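To make the idea concrete, here is a minimal sketch in Python of what a parallel concordancer does under the hood. Everything in it is an assumption for illustration: the sentence pairs are invented (they are not from SCoRE), the function name parallel_concordance is hypothetical, and a real tool would work on a properly aligned and tokenized corpus rather than a short list.

```python
import re

# Invented aligned sentence pairs (English, Japanese), for illustration only.
# They are NOT taken from SCoRE or any real corpus.
ALIGNED_PAIRS = [
    ("We need to talk about the plan.", "その計画について話す必要があります。"),
    ("She likes to walk in the park.", "彼女は公園を散歩するのが好きです。"),
    ("They talk to each other every day.", "彼らは毎日お互いに話します。"),
]


def parallel_concordance(pairs, first_word):
    """Return the (source, translation) pairs whose source sentence
    contains `first_word` followed by any one word (like 'talk *')."""
    pattern = re.compile(rf"\b{re.escape(first_word)}\s+\w+", re.IGNORECASE)
    return [(src, tgt) for src, tgt in pairs if pattern.search(src)]


hits = parallel_concordance(ALIGNED_PAIRS, "talk")
for english, japanese in hits:
    print(english)
    print("   ", japanese)
```

The key point is that the search only runs on the English side; the Japanese line simply comes along because it is aligned with the matching English sentence.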
Here is another illustration showing a sample of a concordance from the Portuguese-English parallel corpus COMPARA. The query terms were “talk” “.*” (this is the syntax for the talk + any word search in COMPARA, quote marks included).
Parallel concordances can be used for activities like translation tasks, of course, but they are also useful for DDL, at least in certain situations. In my experience, having translations of English concordance lines available in students’ L1 is very helpful for both lower-proficiency students and novice DDL students. Both the content and format of concordance lines can be difficult for such students, but in both cases the L1 support offered by parallel corpora allows students to quickly grasp the meaning of the English lines, letting them focus on the context or patterns in the lines. Even if they don’t always need the L1 support to really understand the English lines, they often feel more comfortable and are more receptive to doing activities and work that they are generally unaccustomed to doing. Perhaps as they become more familiar with concordance lines they can switch to monolingual lines.
Another benefit is that they can get a sense of how differently (or similarly) concepts, ideas, or notions may be expressed in the L2 as compared to their L1. Students can pick up on shades of meaning, nuance, and usage. I’ve seen this lead to lexical development where students have commented that they found a phrase or new (and natural-sounding) way to express something they had previously expressed inaccurately due to L1 interference, or had been completely unaware of because it wasn’t covered in any traditional way (i.e. it really is something they discovered for themselves). It’s only anecdotal, but I have spoken with my students about these mini ‘light-bulb’ moments and they react very positively to them.
There can be issues, though. There needs to be some understanding of, say, the directionality and relationship of the source material to the translations, or where the translations have come from and their quality, and of course that the translation seen in a concordance line is almost certainly not the only potential/accurate way to translate the source text. Another thing to keep in mind is that students need to share a single L1 unless the corpus is multilingual with translations available for all of the students’ L1s (which would overcome one issue but possibly raise others).
But still, parallel concordances can be quite useful and make it easier for students to get involved in doing DDL work. For more info about uses and issues with parallel corpora/concordances I recommend reading ‘Frankenberg-Garcia, A. (2005). Pedagogical uses of monolingual and parallel concordances. ELT Journal, 59(3), 189-198.’
Finally, where are these parallel corpora? A simple Google search will turn up numerous parallel corpora available for download, such as the Open Parallel Corpus (OPUS), but that means you need to run your own parallel concordancing software. Something like AntPConc might be a relatively easy-to-use piece of software for this. However, even if you are comfortable running an application like AntPConc, the parallel corpora you find might not be appropriate for your students unless you are in an ESP environment with students learning language for, say, international legal or technical contexts (like the EuroParl corpus).
Alternatively, I’ve compiled a very brief list of some parallel corpora and projects that have web-based interfaces. A caution, though: I am familiar only with the English-Japanese corpora on this list. Although some of the others have been used for language learning, or designed with language learning as a goal, I cannot vouch for the pedagogic applicability or accuracy of the other language combinations here (I’ll leave that to folks who understand the languages in these corpora).
A couple weeks ago a tweet from Matthew Anderson of the BBC blew up a little bit. The tweet had an excerpt from Mark Forsyth’s book The Elements of Eloquence: Secrets of the Perfect Turn of Phrase. The excerpt asserted that “adjectives in English absolutely have to be in this order: opinion-size-age-shape-colour-origin-material-purpose Noun. So you can have a lovely little old rectangular green French silver whittling knife. But if you mess with that word order in the slightest you’ll sound like a maniac.”
The tweet has more than 51,000 retweets and nearly 75,000 likes. It has also spawned articles like the following:
This is too credulous. Fortunately, Language Log was asked about the claim and explained how adjective order is not so simple. And many folks have put forth “big bad wolf” (size-opinion) as a clear violation of this so-called rule.
Forsyth responded by saying “big bad wolf” is actually a case of ablaut reduplication (think of words like zig-zag or tick-tock, they are never zag-zig or tock-tick) and, I think, he’s saying that this means it’s not subject to the rule that “adjectives in English absolutely have to be in this order”. If that’s the case, then absolute ≠ absolute.
But even if we let there be an exception for ablaut reduplication and decide that the ‘absolute’ rule for adjective order doesn’t apply in such cases, does the rule hold otherwise?
We can look for evidence.
The Language Log article mentioned above noted several hits in COCA for patterns that don’t fit the rule, including “big ugly”. I got on COCA myself to see what some of the concordance lines looked like, and ran a few other searches as well (other than “big ugly”, in the searches I purposely used words found in the example from the excerpt). Here are a few of the many, many hits that violate the ‘rule’:
Adjective pattern: size-opinion; Search terms: big ugly
This is a cute little town with a big ugly fire problem right now
On the platform was a big ugly man who talked loud and waved his arms around until his face was red
The construction of the big ugly office building was going very well
seemed infinitely more beautiful for the presence of this new lovely white figure in their midst
At the doorway to the kitchen stood a tall lovely woman with long black hair and wearing a pink linen dress
None of these constructions sound maniacal.
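If you want to test a phrase against the claimed order mechanically, a rough sketch in Python might look like the following. The category labels have to be assigned by hand (that is my assumption here; nothing tags them automatically), and the function name follows_claimed_order is just something I made up for illustration.

```python
# The order claimed in the excerpt, from first to last.
CLAIMED_ORDER = [
    "opinion", "size", "age", "shape",
    "colour", "origin", "material", "purpose",
]
RANK = {category: i for i, category in enumerate(CLAIMED_ORDER)}


def follows_claimed_order(categories):
    """Return True if the hand-labelled adjective categories appear
    in the claimed order, False if any adjacent pair is out of order."""
    ranks = [RANK[c] for c in categories]
    return all(a <= b for a, b in zip(ranks, ranks[1:]))


# "lovely little old rectangular green French silver whittling" knife
print(follows_claimed_order(CLAIMED_ORDER))

# "big ugly": size before opinion
print(follows_claimed_order(["size", "opinion"]))
```

The first call prints True and the second prints False, which is the whole point: a size-opinion phrase like “big ugly” breaks the claimed order and still sounds perfectly normal.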
I think the pattern Forsyth describes is a general preference for order that is usually going to result in good patterning, but it’s definitely not an absolute rule. The rules governing adjective order are messier than that (see the Language Log post). I only felt the need to comment and remark on it because I noticed several ELT-related accounts passing it around and I thought “Really?? Did anyone check the data??” This post doesn’t have anything really in the way of using corpora in the classroom or anything like that, but hopefully it can be a reminder that corpora are excellent reference tools (especially to check sketchy ‘rules’).
I realized someone might object to the “silver Ethiopian cross” example since an Ethiopian cross is a particular style of cross. So here’s a different example from COCA that demonstrates the same pattern of material-origin (search terms: wooden [j*] boat):
a traditional wooden Bahamian boat hand-built in the early 1960’s
The Sentence Corpus of Remedial English was updated and moved to a new website earlier this month: http://www.score-corpus.org. After following the link, you will find a button in the top right corner of the page to switch to English.
The tools (Pattern Browser, Concordancer, Quiz Generator) seem to be the same, but it appears the corpus itself has expanded. The new website is much more comprehensive in the information it provides, including pages for things like SCoRE-related publications and user guides in both Japanese and English (there used to only be a Japanese language user guide). There is even a page where users can submit example sentences that may be included in future updates of the corpus.
Very good to see this project developing and continuing.
Last week, a passage in the textbook for one of my classes had the expressions ‘he tunes it by ear’ and ‘they learn to play by ear’. For many of the students (not all, tho) this was a new expression, ‘[verb] by ear’, but in the context of the passage most of them figured out, on their own, that it meant the people tuning or playing instruments were able to do so without any sheet music or other aids. The textbook had a question using the expression ‘play by ear’ in the chapter review, so I felt it would pay to explore the pattern a little.
So we had this pattern, ‘[verb] by ear’. What does ‘by ear’ really mean? And what verbs could fit into that slot? Several students agreed that it meant using one’s ears to do something. That was a fine start. But would it make any sense to say ‘sleep by ear’ or ‘imagine by ear’ with senses of using one’s ears to sleep or using one’s ears to imagine? No, not really. I asked my students if they could replace ‘by ear’ with a different expression in the phrase ‘they learn to play by ear’. Pretty quickly, they came up with ‘they learn to play by listening’. That’s more like it. So ‘[verb] by ear’ means that someone is using listening skills to [verb].
At this point, I directed the students to open up the newly-mobile-friendly COCA (every one of my students has an iPad) and enter the following query under the List function: VERB by ear.
Looking at the frequency data, it was immediately clear to them that ‘play(s)/playing/played by ear’ is, by far, the most frequent formulation of this pattern. ‘Learn(s)/learning/learned by ear’ is also quite a bit higher in frequency than other formulations of the pattern, most of which have only 1 or 2 occurrences in the corpus (note: ‘birding by ear’ is the title of a book for birdwatchers about learning to recognize birdsongs, and references to this book in the corpus make this formulation appear higher in the frequency list than one might otherwise expect).
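For anyone curious about what a frequency list like this boils down to, here is a small sketch in Python. It is only an approximation under some loud assumptions: the sample sentences are invented (not COCA data), the lemma table is hand-made, and there is no part-of-speech tagging, so it simply counts whatever word comes right before ‘by ear’, whereas COCA’s VERB tag restricts the slot to verbs.

```python
import re
from collections import Counter

# Invented sample sentences standing in for corpus text (NOT real COCA data).
SAMPLE = """
She plays by ear and never reads music.
He learned by ear from his grandfather.
They played by ear at the party.
I am playing by ear these days.
Most fiddlers learn by ear.
"""

# Tiny hand-made lemma table (an assumption; a real corpus is already lemmatized).
LEMMAS = {
    "plays": "play", "played": "play", "playing": "play",
    "learns": "learn", "learned": "learn", "learning": "learn",
}


def count_by_ear(text):
    """Count the word immediately before 'by ear', grouped by lemma."""
    counts = Counter()
    for match in re.finditer(r"\b(\w+)\s+by\s+ear\b", text, re.IGNORECASE):
        word = match.group(1).lower()
        counts[LEMMAS.get(word, word)] += 1
    return counts


freq = count_by_ear(SAMPLE)
for lemma, n in freq.most_common():
    print(lemma, n)
```

Grouping the forms under one lemma is what lets ‘plays’, ‘played’, and ‘playing’ add up to a single total, in the same way we read the COCA list as ‘play(s)/playing/played by ear’.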
Using this data, my students and I discussed how the pattern ‘[verb] by ear’ can be used with a lot of different verbs, but in general they are safe knowing that it’s most often used with forms of play and learn.
For homework, students were assigned one phrase from a set of phrases: ‘[verb] by mouth’ or ‘[verb] by hand’ or ‘[verb] by foot’ or ‘[verb] by eye’; and they were told to do a similar analysis to what we did in class. That is, explain how the expression is used and find out what verbs commonly fit into the empty slot. Also, they were to find 3 example sentences (using concordance lines from COCA, an online learner’s dictionary, SkELL, etc.).
In our next class, students who were assigned the same phrase sat together and checked/discussed their results, and then they taught members of other groups about their assigned phrase.
Overall, this activity went over very well. A big part of that is due to the BYU Corpora interface redesign. It’s so much cleaner visually, easier to navigate, much easier to search, less intimidating, and so much more mobile-friendly.
Importantly, considering other class and content needs, this did not take up a lot of (in-class) time. The activity on the first day took around 15 mins (in a 90 min class), and the activity on the second day took around 25 mins. Within the activity there were elements of individual, group, and whole class work, which helped keep it from seeming tedious.
Of course, there were other options for how to exploit the corpus, too. For example, instead of assigning expressions for the students to research, we could have had some fun by having them find expressions of the ‘[verb] by [body part]’ variety themselves, maybe discovering some surprising combinations. I suspect you can think of a few ways to positively tweak this kind of activity, too.