N-grams, Skipgrams, and Concgrams

This post is a little more in the weeds than what I usually try to write about on corpling4efl, but for teachers with more than a casual interest in using corpora I think it can be a beneficial topic to think about. For a fuller (and better and clearer) treatment of the subject I’ve included reference links at the end of the post.

A couple weeks ago I was in a discussion with some other teachers and we were talking about finding collocations/lexical bundles and such. The topic of n-grams, or contiguous word sequences in a text/corpus, came up. Somebody asked about non-contiguous sequences, and the terms skipgram and concgram entered the conversation. I want to give a *very* truncated rundown of what these are (or at least what I understand them to be, which is a limited understanding).

The ‘n’ in n-grams stands for a number. So, for example, a 2-gram search will find all 2-item contiguous sequences in a corpus, while a 5-gram search will find all 5-item contiguous sequences in a corpus. Finding these sequences lets one analyze aspects of them such as frequency and strength of word association. A 4-gram search in a politics-oriented corpus, for example, might show that an expression like “dirty money in politics” has strong collocational qualities. However, the search output would miss a variant like “dirty money in our politics”, because the extra word breaks up the contiguous 4-item sequence.
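For anyone who likes to see this in code, here is a minimal Python sketch of an n-gram search. The `ngrams` helper and the one-sentence "corpus" are made up for illustration, and splitting on whitespace is a simplifying assumption (real corpus tools tokenize much more carefully).

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every contiguous n-item sequence in a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# A toy one-sentence "corpus" (invented for this example).
text = "we must end dirty money in politics and dirty money in our politics"
tokens = text.lower().split()  # naive whitespace tokenization

counts = Counter(ngrams(tokens, 4))

# The exact 4-item sequence is found once...
print(counts[("dirty", "money", "in", "politics")])  # 1
# ...while the variant with "our" is split into different 4-grams.
print(counts[("dirty", "money", "in", "our")])       # 1
```

Note that the two expressions never get counted together: as far as the 4-gram search is concerned, they are unrelated sequences.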

Skipgrams take such patterns into account. Whereas an n-gram search can only find patterns such as [A+B], a skipgram search can find both [A+B] and [A+C+B]. That is, it can find collocates even when there is variation in constituency, such as an intervening word. A skipgram search would uncover both “dirty money in politics” and “dirty money in our politics”. However, it would still miss an expression like “money in politics is dirty”, even though the collocational elements are similar.
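A rough sketch of the skipgram idea in Python might look like the following. It assumes one common way of defining skipgrams (n items in their original order, with at most `k` tokens skipped in total); the `skipgrams` function name and its parameters are my own for illustration, not taken from any particular corpus tool.

```python
from itertools import combinations

def skipgrams(tokens, n, k):
    """Return n-item sequences in original order, allowing up to k skipped tokens in total."""
    results = []
    for i in range(len(tokens)):
        # Tokens that may follow the first item without exceeding k skips.
        window = tokens[i + 1:i + n + k]
        for rest in combinations(window, n - 1):
            results.append((tokens[i],) + rest)
    return results

target = ("dirty", "money", "in", "politics")

print(target in skipgrams("dirty money in politics".split(), 4, 1))      # True
print(target in skipgrams("dirty money in our politics".split(), 4, 1))  # True
print(target in skipgrams("money in politics is dirty".split(), 4, 1))   # False
```

Setting `k` to 0 gives ordinary n-grams back, which is one way to see skipgrams as a looser version of the same search.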

Concgrams go a step further than skipgrams by also allowing for variation in position/order. Concgrams not only find patterns like [A+B] and [A+C+B], but also patterns such as [B+A]. In a concgram search, the association strength in a corpus of words like ‘dirty’, ‘money’, and ‘politics’ may be clearer because the search finds patterns that vary in both position and constituency.
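To make the contrast concrete, here is a very simplified Python sketch of a concgram-style search: it looks for a set of words co-occurring within a window, in any order and with anything in between. The `concgram_hits` function and the `span` parameter are hypothetical names for this example, and this is only my approximation of the idea, not how dedicated concgram software actually works.

```python
def concgram_hits(tokens, words, span):
    """Find places where all target words co-occur within `span` tokens, in any order."""
    targets = set(words)
    hits = []
    for i, tok in enumerate(tokens):
        if tok not in targets:
            continue  # only start a window at one of the target words
        window = tokens[i:i + span]
        if targets.issubset(window):
            # Record the order in which the target words actually appeared.
            hits.append(tuple(sorted(targets, key=window.index)))
    return hits

words = ["dirty", "money", "politics"]

print(concgram_hits("dirty money in politics".split(), words, 6))
# [('dirty', 'money', 'politics')]
print(concgram_hits("dirty money in our politics".split(), words, 6))
# [('dirty', 'money', 'politics')]
print(concgram_hits("money in politics is dirty".split(), words, 6))
# [('money', 'politics', 'dirty')]
```

All three sentences now count as hits for the same word association, including the one that the n-gram and skipgram searches missed.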

So this sounds like something mainly for researchers to deal with, but what about teachers? Well, I’m not sure this has a ton of direct use for students (although a creative teacher might find a nice way to integrate it), but in lesson preparation or materials development the ideas underlying n-grams, skipgrams, and concgrams are good for teachers to think about. “Clear sky” vs “the sky is clear” vs “clear blue sky” are different surface forms of one collocational relationship, and that might not be intuitive to learners, so teachers can use these ideas to help students develop richer, fuller understandings of how words work together. And for those with a deeper interest in corpus linguistics, this is an important aspect to consider when analyzing phraseology and word associations in a corpus.

