Parallel corpora and concordances

This might be tl;dr … If you are just looking for a list or some links to parallel corpora, please go to the end of this post.


In response to my presentation at this year's ETJ Tokyo conference, where I talked about the parallel corpus and DDL tool called SCoRE, I was asked whether there are parallel corpora available for other languages. Short answer: yes! Caveat: they are not always straightforward to use.

First of all, a quick explanation of what a parallel corpus is. It is a kind of bi- or multilingual corpus: one that contains text in one language aligned with its translations in one or more other languages. So, for example, if I query talk * in the Japanese interface of SCoRE, I get concordance lines in English that contain talk followed by any word, along with concordance lines in Japanese that are translations of those English lines. These are parallel concordances.

English-Japanese parallel concordancing in SCoRE

Here is another illustration showing a sample concordance from the Portuguese-English parallel corpus COMPARA. The query terms were “talk” “.*” (this is COMPARA's syntax for the talk + any word search, quotation marks included).

Parallel concordancing in COMPARA

Parallel corpora are often used in translation and contrastive studies [see McEnery, A., & Xiao, R. (2007). Parallel and comparable corpora: What is happening? In Incorporating Corpora: The Linguist and the Translator (pp. 18-31)]. Although they are not used as much in language learning, there has been promising work recently, particularly (as far as I'm aware) here in Japan [see Anthony, L., Chujo, K., & Oghigian, K. (2011). A novel, web-based, parallel concordancer for use in the ESL/EFL classroom. In Corpus-based studies in language use, language learning, and language documentation (pp. 123-138). Brill; see also Chujo, K., Kobayashi, Y., Mizumoto, A., & Oghigian, K. (2016). Exploring the effectiveness of combined web-based corpus tools for beginner EFL DDL. Linguistics and Literature Studies, 4(4)].

Parallel concordancing can be used for activities like translation tasks, of course, but it is also useful for DDL, at least in certain situations. In my experience, having translations of English concordance lines available in students' L1 is very helpful for both lower-proficiency students and novice DDL students. Both the content and the format of concordance lines can be difficult for such students, but in both cases the L1 support offered by parallel corpora allows students to quickly grasp the meaning of the English lines, letting them focus on the context or patterns in those lines. Even if they don't always need the L1 support to understand the English lines, they often feel more comfortable and are more receptive to doing activities and work they are generally unaccustomed to. Perhaps as they become more familiar with concordance lines they can switch to monolingual ones.

Another benefit is that students can get a sense of how differently (or similarly) concepts and ideas may be expressed in the L2 compared to their L1. They can pick up on shades of meaning, nuance, and usage. I've seen this lead to lexical development: students have commented that they found a phrase or a new (and natural-sounding) way to express something they had previously expressed inaccurately due to L1 interference, or had been completely unaware of because it wasn't covered in any traditional way (i.e. it really is something they discovered for themselves). It's only anecdotal, but I have spoken with my students about these mini 'light-bulb' moments and they react very positively to them.

There can be issues, though. There needs to be some understanding of, say, the directionality and the relationship of the source material to the translations, of where the translations come from and their quality, and of course of the fact that the translation shown in a concordance line is almost certainly not the only possible or accurate way to translate the source text. Another thing to keep in mind is that students need to share a single L1, unless the corpus is multilingual with translations available for all of the students' L1s (which would overcome one issue but possibly raise others).

But still, parallel concordances can be quite useful and can make it easier for students to get started with DDL work. For more about the uses of and issues with parallel corpora and concordances, I recommend reading Frankenberg-Garcia, A. (2005). Pedagogical uses of monolingual and parallel concordances. ELT Journal, 59(3), 189-198.


Finally, where are these parallel corpora? A simple Google search will turn up numerous parallel corpora available for download, such as the Open Parallel Corpus (OPUS), but that means you need to run your own parallel concordancing software. Something like AntPConc is a relatively easy-to-use option for this. However, even if you are comfortable running an application like AntPConc, the parallel corpora you find might not be appropriate for your students unless you are in an ESP environment with students learning language for, say, international legal or technical contexts (like the EuroParl corpus).
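If you are comfortable with a little scripting, you don't even strictly need dedicated software for a first look. Here is a minimal sketch of a parallel concordance search in Python, assuming you have downloaded a sentence-aligned corpus from OPUS in plain "Moses" format (two files, one sentence per line, where line N of each file is a translation pair); the file names and the query here are placeholders of my own, not part of any particular tool.

import re

def parallel_concordance(src_path, tgt_path, pattern, max_hits=20):
    # Print aligned sentence pairs whose source-language line matches a regex.
    regex = re.compile(pattern, re.IGNORECASE)
    hits = 0
    with open(src_path, encoding="utf-8") as src, open(tgt_path, encoding="utf-8") as tgt:
        for src_line, tgt_line in zip(src, tgt):
            if regex.search(src_line):
                print(src_line.strip())
                print(tgt_line.strip())
                print("-" * 40)
                hits += 1
                if hits >= max_hits:
                    break

# Roughly reproduce the "talk *" query from earlier: "talk" followed by any word.
parallel_concordance("corpus.en", "corpus.ja", r"\btalk \w+")

Of course, this gives you none of the sorting, filtering, or display options of a real concordancer; it is only meant to show that the underlying idea, matching a pattern in one file and printing the aligned line from the other, is quite simple.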

Alternatively, I've compiled a very brief list of some parallel corpora and projects that have web-based interfaces. A caution, though: I am familiar only with the English-Japanese corpora on this list. Although some of the others have been used for language learning, or were designed with language learning as a goal, I cannot vouch for the pedagogic applicability or accuracy of the other language combinations here (I'll leave that to folks who understand the languages in these corpora).

Note: all of these are paired with English

Japanese

SCoRE (and WebSCoRE); WebParaNews

Chinese

E-C Concord (more info can be found here); BFSU CQPweb has several parallel corpora (for guests the user ID and password are both test)

Korean

MOA

Thai

ETPC

Polish

PACO for EPPC

Portuguese

COMPARA (further information about COMPARA is available here)

Multilingual

Tatoeba; Linguee; Reverso Context

Learner Language

ENEJE (this parallel corpus aligns essays by Japanese EFL students with edits made by native English speakers)


I’m sure there are many more. Feel free to list others in the comments 🙂


CorpusMOOC week 8

And so ends CorpusMOOC 2016. Week 8's lecture content focused on 'bad language': swear words, insults, and other sorts of uncouth language. The build-up to week 8 didn't quite match the actual event, imo, but in a good way. It seemed to me that 'bad language' week was marketed, in a way, as a wild, fun, here-we-go kind of thing.

And it was fun! But not in a wild way. It was actually quite a sober analysis of the use of 'bad language', examined from a variety of angles and across variables such as age, class, and sex. That's what made it fun: it was really dissecting the language, trying to understand it, contemplating it, not just reveling in getting to use no-no words.

The practical activities wrapped up the CQPweb tutorials. I haven't written very much about the practical activities, but that's because I think they speak for themselves. They are PRACTICAL. If you're new to CL, they will undoubtedly be helpful. Even if you're not, there's probably something new: some way of searching a corpus that you didn't know about before, or that you perhaps underutilize, and these activities are a good refresher.

I believe this course is a confidence builder, more than anything else. You come away thinking that even if you can't do very sophisticated corpus work yet, you know your way around, you know how to begin doing meaningful things, and you know something of the range of work that can be done with corpora.


CorpusMOOC Weeks 6-7

These weeks are probably the most pertinent for this blog; both focus on corpus linguistics' influence on, and use in, language learning.

In week 6, the main focus was the influence of CL on materials development, such as dictionaries and textbooks. The notion of a lexical syllabus was also introduced: an approach that focuses on teaching the most frequent and widely used words, not only teaching them first but teaching them deeply. In the lecture videos, this is described as a way of introducing learners "to a large amount of language, but not a large vocabulary".

Week 7 discussed learner corpora and data-driven learning. As for learner corpora, both building and using them were discussed. Learner corpora can be used, for example, to identify common errors particular to a group of learners; in the lecture, the example of Chinese learners of English and their difficulties with the English article system was cited, but it is clear that this type of work could be applied to groups defined not only by L1 or nationality, but also by age, educational system, geography, teaching approach, and so on.

DDL was only briefly discussed, I think because it is expected that reading the literature about it will be clearer than trying to explain it in a lecture. I think that what DDL actually entails is best understood through examples or cases that can be analyzed.

The practical activities in both weeks continued the CQPweb tutorials.

Not just for reference: dictionaries and corpora

These are the slides for a presentation/workshop I am giving at the ETJ Expo (details) in Tokyo this weekend. I am talking about some of the uses of learner dictionaries and English/Japanese parallel corpora like SCoRE and WebParaNews.

Handout: etjhandout-2

Note: Slide 4 erroneously links to ‘oven’, not ‘stove’, from the monolingual dictionary link.

CorpusMOOC Weeks 4-5

Week 4

Two main things were going on in week 4: A) Lecture and activities for annotation and tagging, and B) Reviewing essays that used corpus analyses.

Annotation comes up a lot in CL readings and discussions, and I imagine it seems extremely technical when you only read or hear about it. The practical activities and videos using CLAWS, TagAnt, and USAS do a lot to demystify the tagging process. I was particularly interested in USAS, as I have not used a semantic tagger before.
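For anyone who has never seen tagger output, here is a minimal sketch of what part-of-speech tagging produces. It uses NLTK's off-the-shelf tagger purely as a stand-in (CLAWS and TagAnt are separate tools with their own tagsets), and it assumes NLTK and its tokenizer/tagger data packages are installed.

import nltk

tokens = nltk.word_tokenize("The students discussed the corpus findings.")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('students', 'NNS'), ('discussed', 'VBD'), ...]

The point is simply that a tagged corpus is still just text: each word carries an extra label, which is what lets you search by category (all adverbs, all past-tense verbs, and so on) rather than by word form alone.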

Reading and reviewing undergraduate essays was also a nice activity because it helped demonstrate what one should look for when doing analyses, as well as stimulating thinking about how information and data should be presented. It was useful to me as a reflective assessment.

Week 5

In week 5 the main topics were, broadly, using CL to look at social issues, and an introduction to CQPweb, which, if you've ever tried to use it before, has an interface that can be difficult to come to grips with (i.e. it's a bit intimidating, imo). Having a walk-through like this really helps, because CQPweb is a really powerful tool and it gives you access to a lot of corpora (not only English).

The exploration of CL in forensic linguistics is also quite interesting, and it is a nice example of CL being used to examine the context and setting of language use: not just a focus on grammatical or internal characteristics of language, but on how those characteristics can inform our understanding of what is going on in a case, and, in turn, how that understanding can affect and influence decision-making in the world.

“I agree —.” Referencing with COCA

In the past couple of weeks I have had reason to pull up frequencies or concordances in response to student questions on several occasions. For a teacher, this is a pretty basic but useful application of corpora.

For instance, this past week in a business-focused ESP course we were reading through a role play that used the expression 'I agree wholeheartedly'. A few lines of questioning arose: Is 'wholeheartedly' very common? Can I say 'I wholeheartedly agree'? Can I say 'I agree completely'? Can I say 'I agree perfectly'? What if I agree, but not that strongly? These didn't all arise at once, but came up through discussion (some after I pulled up results in COCA).

Notably, to me, meaning was not an issue at all. All of the questions were regarding usage.

I ran a quick and relatively simple search on COCA using: I agree [r*]

([r*] is the PoS code for adverb)

Here are the first 20 results:

COCA frequency results for the query I agree [r*]

Across the entire corpus we can see that I agree wholeheartedly is the second most common formulation of the I + agree + adverb pattern. I agree completely is by far the most common. The top 4, 6 of the top 10, and 8 of the top 20 seem to be synonymous expressions of '100% agreement'.

We did not have time to go through each expression, but the students said they appreciated seeing several alternative phrasings. As for I agree perfectly, I actually thought it sounded a little strange when my student asked if it was okay, but there it was, attested. Only 2 hits, but after opening the concordance lines I saw how it was used, and my hesitation about it evaporated. For me, this was a good reminder of how corpus reference can help me (and, by extension, my students) avoid over-generalizing from my own intuition.

The 9th most frequent hit, I agree in, is also notable, and my students asked specifically about it, so we took a look at its 6 concordance lines. 4 of the lines revealed the expression I agree in part and 1 revealed I agree in general, so the students found some useful ways to express various degrees of agreement, to go along with hits such as (13) I agree basically, (18) I agree strongly, and (19) I agree somewhat.

We also briefly looked at the query: I [r*] agree

This resulted in a very different frequency list:

COCA frequency results for the query I [r*] agree

There are more overall hits for this pattern. I totally agree is the top hit, with I completely agree second. Interestingly, I certainly agree is the 4th most frequent hit, even though I agree certainly occurs zero times in the corpus. In fact, the I + adverb + agree pattern revealed several expressions that had no lexical counterparts in COCA in the I + agree + adverb pattern. This wasn't the point of the referencing in class, so we didn't dwell on it and focused on the variety of expressions that can be used, but, for me, it was definitely food for thought. Agree?
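As a postscript for anyone without COCA access who wants to poke at these patterns offline, here is a rough sketch using the POS-tagged Brown corpus that ships with NLTK. To be clear, this is my own approximation, not the COCA query: Brown is far smaller than COCA, so the counts will be tiny, and I'm assuming the Brown tagset's RB-prefixed tags are a reasonable stand-in for COCA's [r*] adverb code.

from collections import Counter
from nltk.corpus import brown   # requires nltk.download('brown')

agree_adv = Counter()   # "I agree <adverb>"
adv_agree = Counter()   # "I <adverb> agree"

for sent in brown.tagged_sents():
    for (w1, t1), (w2, t2), (w3, t3) in zip(sent, sent[1:], sent[2:]):
        if w1.lower() != "i":
            continue
        if w2.lower() == "agree" and t3.startswith("RB"):
            agree_adv[w3.lower()] += 1
        elif t2.startswith("RB") and w3.lower() == "agree":
            adv_agree[w2.lower()] += 1

print("I agree <adverb>:", agree_adv.most_common(10))
print("I <adverb> agree:", adv_agree.most_common(10))

Nothing here replaces the ease of the COCA web interface, but it does show that the two patterns are simply three-token windows over a tagged corpus.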

CorpusMOOC Week 3

This week featured an in-depth look at using corpora to inform discourse analysis, in this case analyses of British newspaper discourse regarding Muslims, immigration, and other related terms/ideas. The lecture and readings were full of interesting points and information. It’s a good demonstration of how keywords, collocations, word sketches and such can illuminate otherwise hidden and hard-to-see features in texts.

Some basic info about corpus building was also introduced, and I assume that corpus building will become a bigger topic in the coming weeks.

As for more technical matters, the AntConc tutorial continued with how to search for clusters of words and n-grams, which are, IMO, extremely useful as a concept (at least) for helping learners produce natural-sounding expressions. Using tagged corpora in AntConc was introduced as well. I'm assuming, also, that tagging a corpus will be a topic in the next week or two.
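In case the idea of clusters/n-grams is new: an n-gram is just every run of n consecutive words in a text, counted up. Here is a tiny sketch of the counting itself, the kind of calculation AntConc's Clusters/N-Grams tool performs behind its interface; the file name is a placeholder and the whitespace tokenization is a deliberate oversimplification.

from collections import Counter

def top_ngrams(path, n=3, top=20):
    # Count the most frequent sequences of n consecutive words in a plain-text file.
    with open(path, encoding="utf-8") as f:
        tokens = f.read().lower().split()
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams).most_common(top)

print(top_ngrams("my_corpus.txt", n=3))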

I think the best things about this week were the provision of annotated LOB and Brown corpora and the free (limited duration) account for the Sketch Engine. I am particularly excited to use the Sketch Engine as I have only used it under a trial account before. Particularly for those already familiar or semi-familiar with corpus linguistics, the provision of these kinds of resources really increases the practical value of the course.