CorpusMOOC week 8

And so ends CorpusMOOC 2016. Week 8’s lecture content focused on ‘bad language’: swears, insults, and other sorts of uncouth language. The build-up to week 8 didn’t quite match the actual event imo, but in a good way. It seemed to me that ‘bad language’ week was, in a way, marketed as a wild, fun, here-we-go kind of thing.

And it was fun! But not in a wild way. It was actually quite a sober analysis of the use of ‘bad language’, looked at from a variety of angles and variables such as age, class, and sex. That’s what made it fun: it was really dissecting the language, trying to understand it, contemplating it, and not just reveling in getting to use no-no words.

The practical activities wrapped up the CQPweb tutorials. I haven’t written very much about the practical activities, but that’s because I think they speak for themselves. They are PRACTICAL. If you’re new to CL, they will undoubtedly be helpful. Even if you’re not new, there’s probably something new, some way of searching a corpus that you didn’t know about before, or that you perhaps underutilize, and these activities are a good refresher.

I believe this course is a confidence builder, more than anything else. You come away thinking that even if you can’t do very sophisticated corpus work yet, you know your way around, you know how to begin doing meaningful things, and you know some of the range of work that can be done with corpora.



CorpusMOOC Weeks 6-7

These weeks are probably the most pertinent for this blog; they both focus on corpus linguistics’ influence on and use in language learning.

In week 6, the main focus was the influence of CL on materials development, such as dictionaries and textbooks. The notion of a Lexical Syllabus was also introduced. LS is an approach that focuses on teaching the most frequent and widely used words; not only teaching them first, but teaching them deeply. In the lecture videos, this is described as a way of introducing learners “to a large amount of language, but not a large vocabulary”.

Week 7 discussed learner corpora and data-driven learning. As for learner corpora, both building and using them were discussed. Learner corpora can be used, for example, to identify common errors particular to a group of learners; in the lecture the example of Chinese learners of English and their difficulties with the article system in English was cited, but it is clear that this type of work could be applied to groups based not only on L1 or nationality, but age, educational system, geography, teaching approaches, etc., etc., etc.

DDL was only briefly discussed, I think, because it is expected that reading the literature about it would be clearer than trying to explain it in a lecture. I think that what DDL actually entails is best understood when there are examples or cases that can be analyzed.

The practical activities of both weeks continued the CQPweb activities.

Not just for reference: dictionaries and corpora

These are the slides for a presentation/workshop I am giving at the ETJ Expo (details) in Tokyo this weekend. I am talking about some of the uses of learner dictionaries and English/Japanese parallel corpora like SCoRE and WebParaNews.

Handout: etjhandout-2

Note: Slide 4 erroneously links to ‘oven’, not ‘stove’, from the monolingual dictionary link.

CorpusMOOC Weeks 4-5

Week 4

Two main things were going on in week 4: A) Lecture and activities for annotation and tagging, and B) Reviewing essays that used corpus analyses.

Annotation comes up a lot in CL readings and discussions, and I would imagine it seems extremely technical when you only read or hear about it. The practical activities/videos using CLAWS, TagAnt, and USAS do a lot to demystify the tagging process. I was particularly interested in USAS as I have not used a semantic tagger before.

Reading and reviewing undergraduate essays was also a nice activity because in a few ways it helped to demonstrate what one should look for when doing analyses, as well as stimulating thinking about how info and data should be presented. It was useful to me as a reflective assessment.

Week 5

In week 5 the main topics were, broadly, using CL to look at social issues and introducing CQPweb, which, if you’ve ever tried to use it before, has an interface that can be difficult to come to grips with (i.e. it’s a bit intimidating imo). Having a walk-thru like this really helps because CQPweb is a really powerful tool, and it allows you to access a lot of corpora (not only English).

The exploration of CL with forensic linguistics is also quite interesting. It is a nice example of CL being used to examine the context and setting of language use: not just focusing on grammatical or internal characteristics of language, but showing how those characteristics can inform our understanding of what is going on in the case, and in turn how that understanding can affect and influence decision-making in the world.

“I agree —.” Referencing with COCA

In the past couple weeks I have had reason to pull up frequencies or concordances in response to student questions on several occasions. For a teacher this is a pretty basic but useful application of corpora.

For instance, this past week in a business-focused ESP course we were reading through a role play which used the expression ‘I agree wholeheartedly’. A few types, or maybe lines, of questions arose. Is ‘wholeheartedly’ very common? Can I say ‘I wholeheartedly agree’? Can I say ‘I agree completely’? Can I say ‘I agree perfectly’? What if I agree but not that strongly? These didn’t arise all at once, but came about through discussion (some after I pulled up results in COCA).

Notably, to me, meaning was not an issue at all. All of the questions were regarding usage.

I ran a quick and relatively simple search on COCA using: I agree [r*]

([r*] is the PoS code for adverb)
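For anyone curious what a query like this is doing, here is a minimal sketch in Python of the same idea: find "I agree" followed by any word whose tag starts with "r". The tiny tagged sample, the tag labels (only loosely modelled on CLAWS-style tags), and the function name are all invented for illustration. This is not COCA data and not how COCA itself runs the search.

```python
from collections import Counter

# Hypothetical hand-tagged sample (NOT COCA data).
# Assumption: adverb tags start with "r"; the other tags are placeholders.
TAGGED = [
    ("I", "pron"), ("agree", "verb"), ("completely", "rr"), (".", "punc"),
    ("I", "pron"), ("agree", "verb"), ("wholeheartedly", "rr"), (".", "punc"),
    ("I", "pron"), ("agree", "verb"), ("completely", "rr"), (".", "punc"),
    ("I", "pron"), ("agree", "verb"), ("with", "prep"), ("you", "pron"), (".", "punc"),
    ("I", "pron"), ("totally", "rr"), ("agree", "verb"), (".", "punc"),
]


def agree_adverbs(tagged):
    """Count adverbs that directly follow the word sequence 'I agree'."""
    counts = Counter()
    # Slide a three-token window over the tagged text.
    for (w1, _), (w2, _), (w3, t3) in zip(tagged, tagged[1:], tagged[2:]):
        if w1 == "I" and w2.lower() == "agree" and t3.startswith("r"):
            counts[w3.lower()] += 1
    return counts


result = agree_adverbs(TAGGED)
print(result.most_common())  # [('completely', 2), ('wholeheartedly', 1)]
```

Checking the tag on the second token and the word on the third instead would give the I [r*] agree pattern.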

Here are the first 20 results:


Across the entire corpus we can see that I agree wholeheartedly is the second most common formulation of the I+agree+adverb pattern. I agree completely is by far the most common. The top 4, 6 of the top 10, and 8 of the top 20 seem to be synonymous expressions of the sense of ‘100% agreement’.

We did not have time to go through each expression but the students said they appreciated seeing several examples of alternative phrasings. For the expression I agree perfectly, I actually thought it sounded a little strange when my student asked if it was ok, but here it was, attested. Only 2 hits, but after opening the concordance lines I saw how it was used and my hesitation about it evaporated. For me, this was a good reminder of how corpus reference can help me (and by extension my students) avoid over-generalizing from my own intuition.

The 9th most frequent hit, I agree in, is also notable, and my students asked specifically about it. So we took a look at its 6 concordance lines. 4 of the lines revealed the expression I agree in part and 1 revealed I agree in general; so they found some useful expressions for expressing various degrees of agreement to go along with hits such as (13) I agree basically, (18) I agree strongly, or (19) I agree somewhat.

We also briefly looked at the query: I [r*] agree

This resulted in a very different frequency list:


There are more overall hits for this pattern. I totally agree is the top hit, with I completely agree the second. Interestingly, I certainly agree is the 4th most frequent hit despite I agree certainly occurring zero times in the corpus. In fact, the I+adverb+agree pattern revealed several expressions that had no lexical correlates in COCA to the I+agree+adverb pattern. This wasn’t the point of the referencing in class, so we didn’t dwell on it and focused on the variety of expressions that can be used, but, for me, it was definitely food for thought. Agree?

CorpusMOOC Week 3

This week featured an in-depth look at using corpora to inform discourse analysis, in this case analyses of British newspaper discourse regarding Muslims, immigration, and other related terms/ideas. The lecture and readings were full of interesting points and information. It’s a good demonstration of how keywords, collocations, word sketches and such can illuminate otherwise hidden and hard-to-see features in texts.

Some basic info about corpus building was also introduced, and I assume that corpus building will become a bigger topic in the coming weeks.

As for more technical matters, the AntConc tutorial continued with how to search for clusters of words and n-grams, which are, IMO, extremely useful as a concept (at least) for helping learners to produce natural-sounding expressions. Using tagged corpora in AntConc was introduced as well. I’m assuming, also, that tagging a corpus will be a topic in the next week or two.
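To make the n-gram idea concrete, here is a minimal sketch in Python that counts three-word sequences in a made-up sentence. The sample text and the function name are invented for illustration, and the whitespace tokenization is a simplification. It is not a description of how AntConc works internally.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n adjacent tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Made-up sample text, already lowercased; split on whitespace for simplicity.
text = "i agree with you and i agree with her but i do not agree with him"
tokens = text.split()

trigram_counts = Counter(ngrams(tokens, 3))
print(trigram_counts.most_common(1))  # [(('i', 'agree', 'with'), 2)]
```

Even in a toy text like this, the repeated chunk "i agree with" rises to the top, which is the kind of ready-made expression that can be pointed out to learners.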

I think the best things about this week were the provision of annotated LOB and Brown corpora and the free (limited duration) account for the Sketch Engine. I am particularly excited to use the Sketch Engine as I have only used it under a trial account before. Particularly for those already familiar or semi-familiar with corpus linguistics, the provision of these kinds of resources really increases the practical value of the course.

CorpusMOOC Week 2

A few thoughts for week 2:

  • I still think there is a lot of stuff, coming pretty fast, that might be overwhelming for true beginners. It’s good that folks can go back and review things (or skip/hold something for later).
  • After last week’s vocabulary and concept building, I appreciated the more practical bent of this week’s material. Especially, the discussions of language change in the lecture and main reading were very clear and stimulating. Also, a lot of the focus this week was on collocation, which, of course, is so fundamental to a lot of corpus work.
  • Part of what I think was good about the lecture and reading was the introduction of statistical measurements such as Mutual Information and log-likelihood scores. Statistical work can be daunting, but I was happy to see/read clear, if brief, descriptions of how such scores are used. This is just a personal theory (at this point), but I think the reluctance of a lot of folks to bother with corpora is based on unfamiliarity with or lack of understanding of statistics.
  • In addition to further tutorials for AntConc, GraphColl was introduced. I had toyed around with GraphColl a little bit before, but this intro clarified quite a few things. The #Lancsbox package, of which GraphColl is a part, is able to do a number of customized statistical measurements on a corpus. The GraphColl feature allows the user to visualize collocations rather than just reading them in a list or table. On top of that, the visualization feature lets you see collocational relationships that may be removed by a number of degrees. For example, A collocates with B, and B collocates with C, but what about A and C? GraphColl lets the user see how A and C may be connected through B. This is termed a ‘collocation network’. The #Lancsbox website has a citation paper that explains collocation networks in more depth.
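Two of the ideas in the list above can be sketched in a few lines of Python. The first function uses one common, simplified form of Mutual Information, log2(observed / expected), with expected = f(node) × f(collocate) / N; it ignores window size, which real tools may factor in. The second finds the words that link A and C in a toy collocation network. All counts, names, and the network itself are invented for illustration, and none of this is GraphColl's actual implementation.

```python
import math


def mutual_information(pair_freq, node_freq, collocate_freq, corpus_size):
    """Simplified MI: log2(observed / expected co-occurrence frequency)."""
    expected = node_freq * collocate_freq / corpus_size
    return math.log2(pair_freq / expected)


# Invented counts: the pair occurs 40 times in a 1,000,000-word corpus.
mi = mutual_information(pair_freq=40, node_freq=1000,
                        collocate_freq=2000, corpus_size=1_000_000)
print(round(mi, 2))  # 4.32

# Toy collocation network: each word maps to the set of its collocates.
collocates = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B"},
}


def bridges(network, first, second):
    """Return the words that collocate with both `first` and `second`."""
    return sorted(network[first] & network[second])


print(bridges(collocates, "A", "C"))  # ['B']
```

Here A and C never collocate directly, but B shows up as the link between them, which is the second-order relationship the network view makes visible.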

Looking forward to week 3 🙂


I highly recommend not skipping the video with Michael Hoey where he discusses priming and collocation in the mind. There are some very interesting anecdotes and concepts discussed that would be great, imo, to look into further.