Another SkELL Update

SkELL has another new feature.

In the WordSketch function, there is now a button that provides more context for the words co-occurring with the target word/lemma.

Automatic PoS tagging in the corpus sometimes results in errors, and this feature is meant to help with that problem. It doesn’t actually prevent the errors, but it should help users make correct identifications despite them.

For more info, see this post from the FB CorpusCall group and play around with this example page from SkELL.

Parallel corpora and concordances

This might be tl;dr … If you are just looking for a list or some links to parallel corpora, please go to the end of this post.


In response to my presentation at this year’s ETJ Tokyo conference, where I talked about the parallel corpus and DDL tool called SCoRE, I was asked whether parallel corpora were available for other languages. Short answer: yes! Caveat: they are not always straightforward to use.

First of all, a quick explanation of what a parallel corpus is: it is a kind of bi- or multilingual corpus containing text in one language aligned with translations in one or more other languages. For example, if I query talk * in the Japanese interface of SCoRE, I get concordance lines in English that contain talk + any word, alongside concordance lines in Japanese that are translations of those English lines. These are parallel concordances.
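
To make the idea concrete, here is a minimal, illustrative sketch (in Python) of a parallel corpus as aligned sentence pairs, with a talk * style query over the English side. The sentences and translations are invented for illustration and are not taken from SCoRE.

```python
import re

# A toy parallel corpus: each entry pairs an English sentence
# with an (invented, illustrative) Japanese translation.
parallel_corpus = [
    ("Let's talk about the plan.", "その計画について話しましょう。"),
    ("They talk quietly in the library.", "彼らは図書館で静かに話します。"),
    ("I never talk to strangers.", "私は知らない人とは話しません。"),
]

def query(pattern, corpus):
    """Return the aligned pairs whose English side matches the pattern,
    where '*' stands for 'any single word'."""
    regex = re.compile(r"\b" + pattern.replace("*", r"\w+") + r"\b")
    return [(en, ja) for en, ja in corpus if regex.search(en)]

for en, ja in query("talk *", parallel_corpus):
    print(en, "|", ja)
```

Real tools like SCoRE do far more (lemmatization, PoS-aware queries, sentence alignment at scale), but the core idea is the same: search one language, display both.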

English-Japanese parallel concordancing in SCoRE

Here is another illustration showing a sample concordance from the Portuguese-English parallel corpus COMPARA. The query terms were “talk” “.*” (this is the syntax for the talk + any word search in COMPARA, quote marks included).

Parallel concordancing in COMPARA

Parallel corpora are often used in translation and contrastive studies [see McEnery, A., & Xiao, R. (2007). Parallel and comparable corpora: What is happening? In Incorporating Corpora: The Linguist and the Translator (pp. 18-31)]. Although they are not used as much in language learning, there has been promising work recently, particularly (as far as I’m aware) here in Japan [see Anthony, L., Chujo, K., & Oghigian, K. (2011). A novel, web-based, parallel concordancer for use in the ESL/EFL classroom. In Corpus-based studies in language use, language learning, and language documentation (pp. 123-138). Brill.; see also Chujo, K., Kobayashi, Y., Mizumoto, A., & Oghigian, K. (2016). Exploring the Effectiveness of Combined Web-based Corpus Tools for Beginner EFL DDL. Linguistics and Literature Studies, 4(4).]

Parallel concordances can be used for activities like translation tasks, of course, but they are also useful for DDL, at least in certain situations. In my experience, having translations of English concordance lines available in students’ L1 is very helpful for both lower-proficiency students and novice DDL students. Both the content and format of concordance lines can be difficult for such students, but in both cases the L1 support offered by parallel corpora allows students to quickly grasp the meaning of the English lines, letting them focus on the context or patterns in the lines. Even when they don’t strictly need the L1 support to understand the English lines, they often feel more comfortable and are more receptive to doing activities and work they are generally unaccustomed to. Perhaps as they become more familiar with concordance lines they can switch to monolingual lines.

Another benefit is that they can get a sense of how differently (or similarly) concepts, ideas, or notions may be expressed in the L2 as compared to their L1. Students can pick up on shades of meaning, nuance, and usage. I’ve seen this lead to lexical development where students have commented that they found a phrase or new (and natural-sounding) way to express something they had previously expressed inaccurately due to L1 interference, or had been completely unaware of because it wasn’t covered in any traditional way (i.e. it really is something they discovered for themselves). It’s only anecdotal, but I have spoken with my students about these mini ‘light-bulb’ moments and they react very positively to them.

There can be issues, though. There needs to be some understanding of, say, the directionality and relationship of the source material to the translations, of where the translations come from and how good they are, and of course of the fact that the translation seen in a concordance line is almost certainly not the only accurate way to translate the source text. Another thing to keep in mind is that students need to share a single L1, unless the corpus is multilingual with translations available for all of the students’ L1s (which would overcome one issue but possibly raise others).

But still, parallel concordances can be quite useful and can make it easier for students to get involved in DDL work. For more on the uses of and issues with parallel corpora/concordances, I recommend reading ‘Frankenberg-Garcia, A. (2005). Pedagogical uses of monolingual and parallel concordances. ELT Journal, 59(3), 189-198.’


Finally, where are these parallel corpora? A simple Google search will turn up numerous parallel corpora available for download, such as the Open Parallel Corpus (OPUS), but that means you need to run your own parallel concordancing software. Something like AntPConc might be a relatively easy-to-use tool for this. However, even if you are comfortable running an application like AntPConc, the parallel corpora you find might not be appropriate for your students unless you are in an ESP environment with students learning language for, say, international legal or technical contexts (like the EuroParl corpus).
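
If you do go the download-and-run-it-yourself route, the core lookup behind a parallel concordancer is simple. Here is a rough sketch assuming the two languages come as plain-text files aligned line by line (as in OPUS’s Moses-format downloads; check the format of your particular download, and note the file names here are placeholders):

```python
def search_parallel(src_path, tgt_path, keyword):
    """Scan two line-aligned plain-text files and yield the aligned
    pairs whose source-language line contains the keyword."""
    with open(src_path, encoding="utf-8") as src, \
         open(tgt_path, encoding="utf-8") as tgt:
        for s, t in zip(src, tgt):  # line n of src aligns with line n of tgt
            if keyword in s:
                yield s.strip(), t.strip()

# Usage (placeholder file names):
# for en, ja in search_parallel("corpus.en", "corpus.ja", "talk"):
#     print(en, "|", ja)
```

Dedicated tools like AntPConc add proper tokenization, KWIC display, and regex queries on top of this basic alignment idea.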

Alternatively, I’ve compiled a very brief list of some parallel corpora and projects that have web-based interfaces. A caution, though: I am familiar only with the English-Japanese corpora on this list; although some of the others have been used for language learning, or were designed with language learning in mind, I cannot vouch for the pedagogic applicability or accuracy of the other language combinations (I’ll leave that to folks who understand the languages in these corpora).

Note: All of these are combined with English

Japanese

SCoRE (and WebSCoRE); WebParaNews

Chinese

E-C Concord (more info can be found here); BFSU CQPweb has several parallel corpora (for guests the user ID and password are both test)

Korean

MOA

Thai

ETPC

Polish

PACO for EPPC

Portuguese

COMPARA (further information about COMPARA is available here)

Multilingual

Tatoeba; Linguee; Reverso Context

Learner Language

ENEJE (this parallel corpus aligns essays by Japanese EFL students with edits made by native English speakers)


I’m sure there are many more. Feel free to list others in the comments 🙂


CorpusMOOC Weeks 4-5

Week 4

Two main things were going on in week 4: A) lectures and activities on annotation and tagging, and B) reviewing essays that used corpus analyses.

Annotation comes up a lot in CL readings and discussions, and I would imagine it seems extremely technical when you only read or hear about it. The practical activities/videos using CLAWS, TagAnt, and USAS do a lot to demystify the tagging process. I was particularly interested in USAS as I have not used a semantic tagger before.

Reading and reviewing undergraduate essays was also a nice activity because it helped demonstrate what one should look for when doing analyses, as well as stimulating thinking about how info and data should be presented. It was useful to me as a reflective assessment.

Week 5

In week 5 the main topics were, broadly, using CL to look at social issues and an introduction to CQPweb, which, if you’ve ever tried to use it, has an interface that can be difficult to come to grips with (i.e. it’s a bit intimidating, imo). Having a walk-thru like this really helps, because CQPweb is a really powerful tool and gives you access to a lot of corpora (not only English).

The exploration of CL with forensic linguistics is also quite interesting, and it is a nice example of CL being used to examine the context and setting of language use: not just a focus on the grammatical or internal characteristics of language, but on how those characteristics can inform our understanding of what is going on in a case, and in turn how that understanding can affect and influence decision-making in the world.

CorpusMOOC Week 2

A few thoughts for week 2:

  • I still think there is a lot of stuff, coming pretty fast, that might be overwhelming for true beginners. It’s good that folks can go back and review things (or skip/hold something for later).
  • After last week’s vocabulary and concept building, I appreciated the more practical bent of this week’s material. In particular, the discussions of language change in the lecture and main reading were very clear and stimulating. Also, a lot of the focus this week was on collocation, which, of course, is so fundamental to a lot of corpus work.
  • Part of what I think was good about the lecture and reading was the introduction of statistical measures such as Mutual Information and log-likelihood scores. Statistical work can be daunting, but I was happy to see/read clear, if brief, descriptions of how such scores are used. This is just a personal theory (at this point), but I think a lot of folks’ reluctance to bother with corpora stems from unfamiliarity with, or a lack of understanding of, statistics.
  • In addition to further tutorials for AntConc, GraphColl was introduced. I had toyed around with GraphColl a little before, but this intro clarified quite a few things. The #Lancsbox package, of which GraphColl is a part, can run a number of customized statistical measurements on a corpus. The GraphColl feature lets the user visualize collocations rather than just reading them in a list or table. On top of that, the visualization lets you see collocational relationships that may be removed by a number of degrees: for example, A collocates with B, and B collocates with C, but what about A and C? GraphColl lets the user see how A and C may be connected through B. This is termed a ‘collocation network’. The #Lancsbox website has a citation paper that explains collocation networks in more depth.
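
As an aside, the MI score mentioned above is, in the form most corpus tools use, pointwise mutual information: the log (base 2) of how often a node and collocate actually co-occur versus how often they would co-occur by chance. A toy calculation (all the frequencies here are invented for illustration):

```python
import math

def mutual_information(f_node, f_collocate, f_pair, corpus_size):
    """MI score for a node-collocate pair: log2 of the observed
    co-occurrence frequency over the frequency expected if the
    two words were distributed independently."""
    expected = f_node * f_collocate / corpus_size
    return math.log2(f_pair / expected)

# Toy numbers: in a 1,000,000-token corpus, 'strong' occurs 800 times,
# 'tea' 600 times, and they co-occur 30 times.
print(round(mutual_information(800, 600, 30, 1_000_000), 2))  # → 5.97
```

A high MI score like this flags a pair that co-occurs far more than chance would predict; MI famously favors rare, exclusive pairings, which is why tools also offer measures like log-likelihood.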

Looking forward to week 3 🙂

**Update**

I highly recommend not skipping the video with Michael Hoey where he discusses priming and collocation in the mind. There are some very interesting anecdotes and concepts discussed that would be great, imo, to look into further.


Frequency lists in pre-reading activities

This Saturday (6/4/16) I am showing a poster at the JALTCALL Conference in Tokyo. The poster describes a process for using AntConc, or any software that can generate frequency lists, to create wordlists for students to use during pre-reading activities. The basic idea is that students mark the words they don’t know or don’t feel confident about, and by doing so they create differentiated/personalized study lists. As a pre-reading activity, this can help students reach (or confirm) the word-knowledge percentage threshold needed for comprehensible reading of a particular text. The underlying idea is actually quite flexible and doesn’t need to follow the exact steps outlined in the poster; rather, the process can be tailored to different contexts and settings. For example, I don’t always use the process to determine exact percentages of word knowledge for each student; instead, I might let students work in a group and research any unknown words together just before reading (especially if it’s a relatively short reading and I just want to ensure they will recognize most, if not all, of the words in the text).
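
The frequency-list-plus-stoplist step can of course be done in AntConc, but the underlying computation is simple enough to sketch. Here is a minimal version (the sample text and stoplist are invented placeholders, not the ones from the poster):

```python
import re
from collections import Counter

def prereading_wordlist(text, stoplist):
    """Count word frequencies in a reading text, drop stoplist words
    (assumed already known), and return a frequency-ordered list
    for students to mark up."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    freqs = Counter(t for t in tokens if t not in stoplist)
    return freqs.most_common()

text = "The heron waded slowly. The heron speared a fish."
stoplist = {"the", "a"}
print(prereading_wordlist(text, stoplist))
# → [('heron', 2), ('waded', 1), ('slowly', 1), ('speared', 1), ('fish', 1)]
```

In practice the stoplist would be a high-frequency wordlist loaded from a .txt file, and the output would be formatted as a handout for students to tick off.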

If anyone is interested, below are a .pdf of the poster and a .docx of the stoplist I used in the example in the poster (I made the stoplist based mostly on frequency data from COCA). If you want to use the stoplist with a program such as AntConc, you should save it as a .txt file first.

Poster

Stoplist

For some background on word knowledge percentage thresholds for reading, check out The Extensive Reading Foundation’s Guide to Extensive Reading.


SCoRE User Guide (in English)

SCoRE is the Sentence Corpus of Remedial English. It is a sentence corpus and website specially designed for Japanese EFL learners (though the project intends to create versions for learners of other backgrounds as well). On its homepage there is a Japanese language guide for using SCoRE, but no English guide. So, I have written up an unofficial English guide for SCoRE.

Although the current iteration of SCoRE is intended for Japanese EFL learners, sticking to and using the English versions of the tools available may be beneficial for non-Japanese learners, too. In this guide I tried to explain the main tools and their functions.

Anyway, I hope this is somewhat useful for someone 🙂

Here is the guide as a pdf.

**Update**

I forgot to mention that if you access SCoRE on a mobile device you may run into issues. For example, on the iPad SCoRE does not seem to work in Safari; however, it works as it should in Chrome.

SkELL: Homonymy and Polysemy

One drawback of SkELL is that it won’t differentiate between, say, lead/lead or the various senses of ‘rat’. The word sketch function will differentiate between parts of speech, but the easy-to-read concordance lines initially generated will have the various words, meanings, and senses jumbled together. However, this drawback can be exploited for teaching various kinds of homonyms and polysemous words.

There are several ways to do this, but I’ll only discuss one basic approach here. Take the word ‘sweet’. Maybe you have students familiar with the taste sense of the word, as in “The berries are rather sweet and juicy”. You could show them (or have them look up) the SkELL concordance lines for ‘sweet’. Have them mark off the sentences that they recognize as referring to sweet taste. This would leave several sentences that use ‘sweet’ in different senses, and your students could discuss what ‘sweet’ might mean in those other sentences.

Screenshot: Partial SkELL output for ‘sweet’

In the screenshot above, for instance, lines 9, 10, 19, and 20 appear to be describing something about people’s personalities. Discuss with your students what it could mean to describe a person as ‘sweet’.

Alternatively, students could use a dictionary to look up all/several of the senses of ‘sweet’, and then try to categorize the SkELL sentences according to each sense.

Regardless of how exactly you approach it, there are a lot of ways to exploit this drawback for teaching and learning purposes.

Any other SkELL tips?