Classifiers and Classification

For the last while, I’ve been working on a project that involves scanning large numbers of RSS/Atom feeds and then using Bayesian[1] classifiers to sort each item into one of a number of categories for summarization and display. (The system that I’m using to do this is available as a sample website, but it really needs more data in the training sets before it’s ready to entertain all of you.) The categories are pretty straightforward, and they fit into a somewhat neat controlled vocabulary (ontology/thesaurus/whatever).

There’s a relation, though, between the terms in this sort of classification scheme and the training data used to build the Bayesian classifier. If the terms are arranged in a hierarchy (and certain assumptions are made about that hierarchy, like subterms encompassing part of the range of meaning of their parent term and nothing else)[2], then the training data used for classifying terms can be shared.

For example, all positive training data that belongs to a child term can also be used for its parent. So, for a (constructed) example, positive training data for Tamiflu also belongs in the positive data for bird flu vaccines. The reverse is true of negative training data: the negative data for a parent can also be used for its child terms, since a text that falls outside the parent’s range of meaning also falls outside the child’s.
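Here is a minimal sketch of that sharing rule in Python. The three-term hierarchy, the `PARENT` map, and the function names are all made up for illustration (the project itself runs on Classifier4J, which this does not use), and it assumes the hierarchy really does satisfy the "part of the parent's meaning and nothing else" condition.

```python
from collections import defaultdict

# Hypothetical hierarchy: child term -> parent term (None marks the root).
PARENT = {
    "bird flu": None,
    "bird flu vaccines": "bird flu",
    "tamiflu": "bird flu vaccines",
}

def ancestors(term, parent=PARENT):
    """Yield every term above `term`, nearest first."""
    node = parent[term]
    while node is not None:
        yield node
        node = parent[node]

def share_training_data(positive, negative, parent=PARENT):
    """Return (positive, negative) example sets per term after sharing.

    Positive examples flow up from a term to all of its ancestors;
    negative examples flow down from each ancestor to the term.
    """
    pos = defaultdict(set)
    neg = defaultdict(set)
    for term in parent:
        pos[term] |= set(positive.get(term, ()))
        neg[term] |= set(negative.get(term, ()))
        for anc in ancestors(term, parent):
            pos[anc] |= set(positive.get(term, ()))
            neg[term] |= set(negative.get(anc, ()))
    return dict(pos), dict(neg)

positive = {"tamiflu": ["article on oseltamivir dosing"]}
negative = {"bird flu": ["article on football scores"]}
pos, neg = share_training_data(positive, negative)
```

After the call, the Tamiflu article shows up in the positive sets for both "bird flu vaccines" and "bird flu", and the football article shows up in the negative sets for all three terms. Nothing flows the other way: a negative example for a child says nothing about its parent.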

This is highly useful information when you’re making a large-scale text classifier and having it classify texts as belonging to categories or not, as opposed to just clustering texts into the categories that actually appear. It’s easier to use things like Bayesian classifiers to do this if you’re looking for somewhat fine-grained detail.

Currently, I’ve been using Classifier4J for doing the classification and text summarization[3]. The text summarization is sort of annoying, though, because it’s based on a simple statistical choice of sentences, which occasionally picks up date-lines and partial phrases because of what’s ‘important.’ I’m resisting the urge to go completely POS-tagging nuts on the whole thing and select only sentences of certain types or completeness, because this is, after all, a side project. (The number of times I see things like ‘this sentence no verb.’ is astounding, though, and slowly driving me nuts.)
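To make the complaint concrete, here is a rough sketch of the general idea behind frequency-based sentence selection. It is not Classifier4J’s code or its exact algorithm; the `summarize` function, the tiny stop-word list, and the scoring rule are assumptions made up for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not from any library).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are", "it"}

def tokenize(text):
    """Lowercase alphabetic tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower())
            if w not in STOP_WORDS]

def summarize(text, num_sentences=1, top_words=5):
    """Pick the sentences containing the most of the commonest words."""
    # Naive split: anything ending in ., ! or ? counts as a "sentence".
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text)
                 if s.strip()]
    important = {w for w, _ in Counter(tokenize(text)).most_common(top_words)}

    def score(sentence):
        return len(important & set(tokenize(sentence)))

    # Highest score first; ties go to the earlier sentence.
    ranked = sorted(range(len(sentences)),
                    key=lambda i: (-score(sentences[i]), i))
    chosen = sorted(ranked[:num_sentences])  # restore document order
    return " ".join(sentences[i] for i in chosen)

text = ("LONDON, March 3. Flu vaccine stocks are low. "
        "Officials say flu vaccine orders doubled. Cats sleep.")
summary = summarize(text, num_sentences=1, top_words=2)
```

Nothing in a scorer like this checks whether a chosen "sentence" has a verb or is even complete, so a date-line or fragment that happens to contain the popular words ranks as well as a real sentence.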

So, another day in the life.

[1] although i’m also using a vector space classifier for a related, larger project and it’s driving me less nuts training it.
[2] this is called a meronymous (‘part-of’) relationship, and given that half the people who regularly read this blog were in LIS530 or its equivalent at some point, you should remember this.
[3] and will probably eventually switch to jBNC (http://jbnc.sourceforge.net/) before i go nuts.