Corpus and the Anatomy of Words, Tags and Clusters.
(Warning: this is a highly unstructured, random-thoughts-externalized type of post)
First a quick definition:
"A corpus is a collection of texts of written (or spoken) language presented in electronic form. It provides the evidence of how language is used in real situations, from which lexicographers can write accurate and meaningful dictionary entries."
The Oxford English Corpus uses English from novels, journals, newspapers, magazines as well as online chatrooms, emails, and blogs. Apparently, it is the largest English language corpus of its type.
Why am I telling you about a corpus? Well, fellow word enthusiast John Montgomery pointed out earlier this week that it now contains over 1 billion sentences and other examples of usage and spelling of the English language (not 1 billion words). I followed the link and ended up writing this.
But it is the software the OEC uses, how they use it, and the data itself that I want to mull over with you. There's a web interface to the database. It doesn't seem to be open to the public, so you can't play with the data directly (although you can create a test account and play with sample data from the British Library - go to the Sketch Engine and sign up).
The OEC has provided a few screencasts demoing the application in action (no sound though), showing all sorts of queries against their data. In this screencast we see the interface at work, using the 'Corpus Query Language' (CQL).
This CQL query looks for derivatives of the word 'blog' - a concordance query [lemma="blog.*" & tag="NNS?|JJ|VB.*"] (in other words: look for any noun, adjective or verb form whose lemma starts with 'blog'). The result of this query looks like this after applying the frequency filter over the results:
Interesting to note here is the frequency (ordered by the most 'important') of 'blogosphere' compared to 'blogworld' or 'blogland'. It appears at number 3 overall. As you can see, 'blogosphere' appears 6,000+ times. However, this count (from data collected 2000-2006) is nothing compared to MSN's count (2,400,000+) or Google's results (47,000,000 <wtf!>) searching only English pages. So my point here is that the corpus is representative, not absolute.
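For the regex-inclined, the query's logic can be sketched in ordinary Python - a toy stand-in for the real corpus engine, with made-up (lemma, tag) pairs rather than actual OEC data:

```python
import re

# Hypothetical (lemma, part-of-speech tag) pairs standing in for corpus tokens.
tokens = [
    ("blog", "NN"), ("blogs", "NNS"), ("blogger", "NN"),
    ("bloggy", "JJ"), ("blogging", "VBG"), ("table", "NN"),
]

# Rough equivalent of the CQL query: lemma starts with 'blog',
# tag is a singular/plural noun, an adjective, or any verb form.
lemma_pat = re.compile(r"blog.*")
tag_pat = re.compile(r"NNS?|JJ|VB.*")

hits = [(l, t) for l, t in tokens
        if lemma_pat.fullmatch(l) and tag_pat.fullmatch(t)]
print(hits)
```

A real corpus engine would of course run this against indexed data, not a Python list, but the matching idea is the same.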
<The> Long Tail <of> Words
"Just ten different lemmas (the, be, to, of, and, a, in, that, have, and I) account for a remarkable 25% of all the one billion words used in the Oxford English Corpus. If you were to read through the corpus, one word in four would be an example of one of these ten lemmas. Similarly, the 100 most common lemmas account for 50% of the corpus, and the 1,000 most common lemmas account for 75%. But to account for 90% of the corpus you would need a vocabulary of 7,000 lemmas, and to get to 95% the figure would be around 50,000 lemmas."
(In case you're wondering...'Lemma')
Vocabulary size (no. of lemmas) | % of content in OEC | Example lemmas
10                              | 25%                 | the, of, and, to, that, have
100                             | 50%                 | from, because, go, me, our, well, way
1,000                           | 75%                 | girl, win, decide, huge, difficult, series
7,000                           | 90%                 | tackle, peak, crude, purely, dude, modest
50,000                          | 95%                 | saboteur, autocracy, calyx, conformist
>1,000,000                      | 99%                 | laggardly, endobenthic, pomological
"The long tail means that accounting for 99% of the Oxford English Corpus requires over a million lemmas. This would include some words which may occur only once or twice in the whole corpus: highly technical terms like chrondrogenesis or dicarboxylate, and one-off coinages like bootlickingly or unsurfworthy."
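Those coverage figures are just cumulative frequency counts over a ranked lemma list. A minimal sketch of the calculation, using a made-up frequency list (the real OEC counts aren't public):

```python
from collections import Counter

# Made-up lemma frequencies; the real OEC counts aren't public.
freqs = Counter({"the": 400, "be": 300, "of": 250, "and": 200,
                 "girl": 40, "tackle": 8, "calyx": 2, "pomological": 1})

def coverage(freqs, k):
    """Fraction of all tokens accounted for by the k most common lemmas."""
    total = sum(freqs.values())
    top = sum(count for _, count in freqs.most_common(k))
    return top / total

for k in (1, 4, 8):
    print(f"top {k} lemmas cover {coverage(freqs, k):.0%} of the corpus")
```

Run this over a billion-word corpus and you get exactly the kind of long-tail table quoted above.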
This is what one long-tailed tagcloud might look like:
With the mass of data collected, and the software and analysis behind it, the OEC team can track changes over time - changes in the meaning, use, bastardization, popularity of, and correlations between words. It is their primary tool for maintaining the OED. Amazing stuff.
Random thought: I'm quite sure that the advent of the internet has accelerated the pace of language evolution (through memetic connectivity - see my thoughts on this here).
Now for a jump - related somewhat, but still a jump....forgive me....now, where was I? oh yeah...
Correlations between words...Which brings me to correlation between tags...
Raw Sugar is doing some very interesting work in the latter space. Below is a recommendation list based on your search (similar to Google Suggest). The list in this case is based on correlated tags, which solves a few of the tag-hell issues I've highlighted previously:
The results can be clustered (see results for 'Attention'). Another example, 'cricket' has a 'sports' cluster to choose from in the results. (However, I do think the whole UI needs some work to really let the algorithms shine through).
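Under the hood, tag correlation of this kind usually starts from co-occurrence counts: tags that keep appearing on the same items are probably related. A toy sketch (hypothetical bookmarks and tags, not Raw Sugar's actual algorithm):

```python
from collections import Counter
from itertools import combinations

# Hypothetical bookmarks, each tagged with a set of tags.
bookmarks = [
    {"cricket", "sports", "india"},
    {"cricket", "sports", "ashes"},
    {"cricket", "insect", "biology"},
    {"football", "sports"},
]

# Count how often each pair of tags appears on the same bookmark.
pair_counts = Counter()
for tags in bookmarks:
    for a, b in combinations(sorted(tags), 2):
        pair_counts[(a, b)] += 1

def related(tag, n=3):
    """Tags most often co-occurring with `tag` - a crude stand-in
    for the correlation scores a real system would compute."""
    scores = Counter()
    for (a, b), count in pair_counts.items():
        if a == tag:
            scores[b] += count
        elif b == tag:
            scores[a] += count
    return [t for t, _ in scores.most_common(n)]

print(related("cricket"))
```

Notice that 'cricket' pulls in both 'sports' and 'insect' - which is exactly why the clustered results (a 'sports' cluster vs. an 'insect' cluster) are so useful for disambiguation.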
Stuff You Should Read
Next month, Raw Sugar's Frank Smadja is presenting a short paper at the WWW2006 Collaborative Tagging Workshop. Frank invited me to review the paper earlier this year, and I highly recommend it - (PDF) "Automated Tag Clustering: Improving search and exploration in the tag space."
Thanks to Micheal Braly and Geoff Froh, I can recommend another paper worth a read on the topic of tagging - The Structure of Collaborative Tagging Systems by HP's Scott Golder and Bernardo Huberman, who studied del.icio.us users' tagging behavior and concluded that:
a) tagging's primary use is personal gain (bookmarking), with network benefits as a secondary (but very powerful) effect (the del.icio.us lesson), and
b) stable patterns emerge in the proportions of tags used - highly popular tags and minority tags can coexist, with natural ratios appearing between them. (I wonder what analysis would reveal if one looked for long tails among those ratios.)
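One intuition for why those proportions stabilize: if each new tagging event tends to reuse existing tags in proportion to their current popularity, the ratios settle down as the numbers grow. A toy urn-style simulation of that idea (my own sketch, not Golder and Huberman's analysis):

```python
import random

random.seed(7)

# Toy model: each tagging event picks a tag with probability proportional
# to how often it has been used so far (preferential attachment).
counts = {"python": 1, "webdev": 1, "tagging": 1}
tags = list(counts)
history = []
for step in range(5000):
    tag = random.choices(tags, weights=[counts[t] for t in tags])[0]
    counts[tag] += 1
    history.append(counts["python"] / sum(counts.values()))

# Early proportions swing widely; late ones barely move.
print("after 100 events:", round(history[99], 3))
print("after 5000 events:", round(history[-1], 3))
```

The interesting part is that the proportion converges even though which value it converges to depends on early chance - echoing the paper's point that stable ratios emerge without anyone coordinating them.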
I'm at the end of this post here, wondering how I got from the Oxford English Corpus to tagging. It must be the Seattle Mind Camp effect. I had a call with Micheal and Geoff last night, who are also going, to plan one of the sessions we want to kick off tomorrow - 'Del.icio.us Inside' - it's along the lines of the tagging-behind-the-firewall thread.
Dennis also wants to hook up for the Attention / Info overload and MyData sessions. Should be fun.