SIGIR06: formal models: LDA-based Document Models for Ad-hoc Retrieval
Document modeling is important to any IR approach. The bag-of-words approach assumes word independence, which is simple but a poor fit for natural language. There have been a number of attempts to get past that assumption, but here’s a relatively new one that does well on various TREC collections.
Here’s a link to the paper: LDA-based Document Models for Ad-hoc Retrieval.
The presentation was largely a crawl of the paper section by section, and I’m going to emulate that approach by just referring to the paper so you can have that experience.
The short version, however: it beats previous models because it models the document-to-topic relationship for all topics and all documents. Cluster-based approaches, by contrast, largely assume that each document belongs to exactly one cluster or, in many practical implementations, to whichever cluster it matches best. Here, each document d[i] belongs to every topic n with some probability p(d[i], n), and the argument is that matching against this mixture of topics works better than searching against plain bag-of-words models.
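To make the "documents are mixtures of topics" idea concrete, here is a minimal sketch in Python. It is not the paper's implementation: the two topics, the four-word vocabulary, all of the probabilities, and the function names are made up for illustration, and a real system would learn the distributions with LDA and handle smoothing, which this toy skips. It only shows the scoring idea, where a word's probability under a document is summed over every topic, weighted by how much the document belongs to that topic.

```python
import math

# Hypothetical toy model: 2 topics over a 4-word vocabulary.
# p_word_given_topic[z][w] is p(w | topic z); each topic's values sum to 1.
p_word_given_topic = [
    {"retrieval": 0.5, "query": 0.4, "gene": 0.05, "protein": 0.05},
    {"retrieval": 0.05, "query": 0.05, "gene": 0.5, "protein": 0.4},
]

# p_topic_given_doc[d][z] is p(topic z | document d).
# Every document is a mixture over all topics, not a member of one cluster.
p_topic_given_doc = {
    "doc_ir": [0.9, 0.1],
    "doc_bio": [0.1, 0.9],
    "doc_mixed": [0.5, 0.5],
}


def p_word_given_doc(word, doc):
    """p(w | d) = sum over topics z of p(w | z) * p(z | d)."""
    return sum(
        p_z * p_word_given_topic[z][word]
        for z, p_z in enumerate(p_topic_given_doc[doc])
    )


def query_log_likelihood(query, doc):
    """Log-probability of the query words under the document's topic mixture."""
    return sum(math.log(p_word_given_doc(w, doc)) for w in query)


def rank(query):
    """Return document ids ordered from best to worst match for the query."""
    return sorted(
        p_topic_given_doc,
        key=lambda d: query_log_likelihood(query, d),
        reverse=True,
    )


print(rank(["query", "gene"]))  # ['doc_mixed', 'doc_bio', 'doc_ir']
```

The query mixes one word from each topic, so the document that is half of each topic wins. Forcing doc_mixed into a single best-matching cluster would throw that information away, which is the contrast with the cluster approaches.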
All of the papers in this section lean towards the idea that searching over automatically generated topics is better than word-based matching. See the papers in question for the differentiators, as a lot of it is math that I’m not going to break out the LaTeX for on the fly. I will also note that most presentations in this area are pretty high on the UMLS fetishism.