Principles of Division
One of the recent changes I had to make was converting whokno.ws, the concept thesaurus, from (roughly) pre-coordinate to (roughly) post-coordinate. The problem was that the algorithm used to determine group membership wasn’t blending well with the way the principles of division work. I think this is a problem with statistical techniques generally.
But here’s a simple way to explain it: Say I have two sets of articles, and an article belongs to a set if it contains that set’s concept. These sets aren’t disjoint, which is to say that they overlap. For whokno.ws, because the modelling of set membership is statistical, doing a computation *combining* the parameters appeared to be the way to go: the ranker would give high scores to things that were about *both* topics. However, it also gives a high score to things that are only about one of the two topics, but *really* about that topic.
What this means is that, say, articles about “bird flu in vietnam” are best found by looking for articles about “bird flu,” and then looking in that set of articles for articles about “vietnam.” This is very interesting to me, because it means that the proper way to do this is actually to use the “POM” or “BOTD” engines that I’ve already written. Strange.
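Here’s a rough sketch of the difference in Python. Everything in it is made up for illustration: the article titles, the scores, the 0.5 threshold, the function names, and the use of a plain sum as the combining step. It is not the actual whokno.ws ranker, just the general shape of the problem.

```python
# Hypothetical per-article concept scores in [0, 1]; not real whokno.ws data.
ARTICLES = {
    "bird flu outbreak in vietnam": {"bird flu": 0.8, "vietnam": 0.7},
    "bird flu vaccine research": {"bird flu": 0.95, "vietnam": 0.0},
    "vietnam tourism guide": {"bird flu": 0.0, "vietnam": 0.9},
    "poultry farming in vietnam": {"bird flu": 0.3, "vietnam": 0.5},
}


def combined_rank(articles, concepts):
    """Rank every article by the sum of its scores for the given concepts.

    An article that is strongly about only one concept can still land
    near the top, which is the behaviour described above.
    """
    def total(title):
        return sum(articles[title].get(c, 0.0) for c in concepts)

    return sorted(articles, key=total, reverse=True)


def successive_rank(articles, concepts, threshold=0.5):
    """Narrow the candidates one concept at a time, then rank the survivors.

    The threshold is an assumed cut-off for "is about this concept".
    """
    candidates = list(articles)
    for concept in concepts:
        candidates = [
            title for title in candidates
            if articles[title].get(concept, 0.0) >= threshold
        ]
    last = concepts[-1]
    return sorted(
        candidates,
        key=lambda title: articles[title].get(last, 0.0),
        reverse=True,
    )


if __name__ == "__main__":
    query = ["bird flu", "vietnam"]
    print(combined_rank(ARTICLES, query))
    print(successive_rank(ARTICLES, query))
```

With these invented numbers, the combined ranking puts the vaccine article (only about bird flu, but *really* about it) right behind the one article that covers both topics, while the successive version returns just the article about both.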
Today’s whokno.ws task: improving the way that stopwords are calculated. It’s made a decent difference in the article scores, very much for the better.
Tomorrow’s whokno.ws task: adding a (blocking) queue system to the interface between the feed retriever and the parser.
Monday’s whokno.ws task: changing the way that the classifier matches concepts to phrases. (also known as word -> lexical element transition step 1/3.)